Search Alternatives to Verity
Google search (thanks Mike for link and for sample code) now has a beta API, but you don’t have complete control over the branding so it didn’t work for my use case.
Zoom was only $99 per server with no limit on the number of sites spidered and searched. It includes sample code in the FAQs for “integrating” into CF using CFHTTP. Seemed to have a nice range of features – especially given the price.
Searchblox runs on JRun and the paid edition (starting at $600 for up to 10,000 documents) will return XML which you could call and format in CF. There is also a free version for smaller sites (no XML) and this seems like a pretty sophisticated engine.
DT Search is the search engine that I seem to remember seeing ads for in the back of computing magazines for years now. It is an extremely thorough solution and seems to be optimized for fast searching of large sites, but with the (fairly reasonable) price of $1,000 per server it was outside of my budget for the problem.
If you just need a quick hosted search solution, you might also want to consider Picosearch and FreeFind. I used FreeFind recently on a project and it worked out great. You have to upload a template to set the look of the results page, and because the results page is hosted its URL is different from your site's, but it is an extremely quick way of adding search to a site.
Please feel free to post any other recommendations you might have. You might also want to check out the previous recommendations when I asked about this a little while back (scroll down to the comments).


I saw Doug Hughes had played with this. Any chance of a blog entry or report for a CF'er on how to get this installed, working and talking to CF? I couldn't find any such resources but would love to link to them as I think it'd be great for the CF world.
Main reason I avoided Lucene is that I don't have a background in Java. I can read and write short code samples but wouldn't have a clue how to compile and deploy a project - last time I compiled code explicitly was in 1991 compiling my ComSci C projects at Uni!
I have some code implemented. I spent weeks trying to find samples.
CF_Lucene was written for DRK3... I don't have a copy, and couldn't find it on the net even though I believe the license said it should be available freely by now.
Doug lost all his Lucene code. No longer has it last he told me. I asked for his code, no go there.
The only person to do extensive work that I could find was Joseph Lamoree, who built CFLucene. Last I spoke to him, he was going to re-factor, but it was extensive, had unit tests, built with UML, etc... Check out http://www.cflucene.org
For me, a flat cfscript implementation was fast, however I didn't delve too much as I finally got my Verity to a stable point.
I can share a simple example if you'd like...
If you can get CFLucene to a good point, people would pay for that!
I'd love anything you could share - either as a comment or just give me a URL to point to or whatever is easier. I'll try to get with Joseph to see if he's going to work on this and if anyone else has any interest, drop Joseph a line at jlamoree@gmail.com and we'll see if we can get enough interest in this to get something moving!
My search page Lucene1.cfm:
<form action="lucene2.cfm" method="get" name="myForm">
<input name="q" type="text" size="35" maxlength="35"> <input name="submit" type="submit" value="Submit">
</form>
My Result page Lucene2.cfm
<cfparam name="myIndexDir" type="string" default="d:\wwwroot\temp\index\">
<cfparam name="url.q" type="string" default="fire">
<!---<cf_luceneindex action="optimize" indexpath="#myIndexDir#">--->
<cf_lucenesearch index="#myIndexDir#" keyword="#url.q#" r_query="qobject">
<a href="lucene1.cfm">Click here</a> to enter new search terms.<br>
<div id="searchBody">
<cfdump var="#qobject#">
</div>
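The cfdump above is only there for debugging. A minimal sketch of a friendlier results list could replace it, assuming qobject came back from cf_lucenesearch with the url, title and summary columns that the custom tag builds. The 200 character cut-off is an arbitrary choice of mine, not something from the original code:

```cfm
<!--- Sketch only: render the query returned by cf_lucenesearch as a list.
      Assumes the url, title and summary columns created in lucenesearch.cfm. --->
<cfif qobject.recordcount EQ 0>
<cfoutput><p>No results found for "#HTMLEditFormat(url.q)#".</p></cfoutput>
<cfelse>
<ol>
<cfoutput query="qobject">
<li>
<a href="#qobject.url#">#HTMLEditFormat(qobject.title)#</a><br>
<!--- the indexer stores the whole file in summary, so only show the start of it;
      200 characters is an arbitrary limit --->
#HTMLEditFormat(Left(qobject.summary, 200))#
</li>
</cfoutput>
</ol>
</cfif>
```

Because the indexer stores the raw file content in the summary field, the escaped excerpt will still show any HTML tags from the source page; stripping those before display is left out of this sketch.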
The custom tag called LuceneSearch.cfm (I believe I worked on this a bit...):
<cfsetting enablecfoutputonly="1">
<!---
Template: lucenesearch.cfm
Author: Aaron Johnson
Source Control: \\server\wwwroot\lucene\lucenesearch.cfm
Change History:
creation: 06/08/2003 -- ASJ
Description:
Custom tag that searches a given lucene index for a given 'keyword'
using the standard lucene analyzer and returns a ColdFusion query
object that the caller can then iterate over.
Usage:
<cf_lucenesearch
index="c:\cfusionmx\wwwroot\lucene\cfdocsindex\"
keyword="verity"
r_query="qobject">
Attributes:
index: String, the path to the lucene index
keyword: String, the keyword(s) you want to search
r_query: String, the name of the variable you want the results fed back into
--->
<!--- default attributes --->
<cfparam name="attributes.index" default="">
<cfparam name="attributes.keyword" default="">
<cfparam name="attributes.r_query" default="">
<!--- make sure we have an index, a keyword and a r_query --->
<cfif len(attributes.index) EQ 0>
<cfabort showerror="You must provide the index for this search.">
</cfif>
<cfif len(attributes.keyword) EQ 0>
<cfabort showerror="You must provide the keyword to search against.">
</cfif>
<cfif len(attributes.r_query) EQ 0>
<cfabort showerror="You must provide the name of a variable to return the results into.">
</cfif>
<cfscript>
index = attributes.index; // lucene index to search against
keyword = attributes.keyword; // the keyword we're looking for
r_query = attributes.r_query; // the name of the variable to return the query to
localQuery = ""; // query to return to the caller
// get an IndexReader object to use in the constructor to the searcher var
indexReader = CreateObject("java", "org.apache.lucene.index.IndexReader");
// get an IndexSearcher object, call the constructor
searcher = CreateObject("java", "org.apache.lucene.search.IndexSearcher");
searcher = searcher.init(indexReader.open(index));
// get an Analyzer object, in this case we're using the StopAnalyzer object
analyzer = CreateObject("java", "org.apache.lucene.analysis.StopAnalyzer");
analyzer.init();
luceneQuery = CreateObject("java", "org.apache.lucene.search.Query");
queryParser = CreateObject("java", "org.apache.lucene.queryParser.QueryParser");
luceneQuery = queryParser.parse(keyword, "body", analyzer);
// run the search
hits = CreateObject("java", "org.apache.lucene.search.Hits");
hits = searcher.search(luceneQuery);
// Create a query which contains these columns
localQuery = QueryNew("URL, TITLE, SUMMARY");
// create a Document object so that we can retrieve the url, title & summary fields
doc = CreateObject("java", "org.apache.lucene.document.Document");
// loop over all the results, add each to the query
for (i=0; i LT hits.length(); i=i+1) { // for each element
doc = hits.doc(javacast("int", i)); //get the next document
QueryAddRow(localQuery); // add a row to the query
QuerySetCell(localQuery, "url", doc.get("url"), i+1); // add the url property
QuerySetCell(localQuery, "title", doc.get("title"), i+1); // add the title property
QuerySetCell(localQuery, "summary", doc.get("summary"), i+1); // add the summary property
}
// close the searcher so the index files it holds open are released
searcher.close();
// return the query to the caller
SetVariable("caller.#attributes.r_query#" , localQuery);
</cfscript>
<cfsetting enablecfoutputonly="0">
The indexer custom tag LuceneIndex.cfm:
<cfsetting enablecfoutputonly="1">
<!---
Template: luceneindex.cfm
Author: Aaron Johnson
Source Control: \\server\wwwroot\lucene\luceneindex.cfm
Change History:
creation: 06/08/2003 ASJ
Description:
Custom tag that will populate a given Lucene index using
a given directory (possibly recursively) or a document. Also
allows the developer to optimize a Lucene index
Usage:
1) Index a directory:
<cf_luceneindex
action="index"
indexpath="c:\cfusionmx\wwwroot\lucene\cfdocsindex\"
bCreateIndex="true"
directory="c:\cfusionmx\wwwroot\cfdocs\"
urlpath="http://localhost:8500/cfdocs/"
recursive="true">
2) Index a document:
<cf_luceneindex
action="index"
indexpath="c:\cfusionmx\wwwroot\lucene\cfdocsindex\"
bCreateIndex="true"
file="c:\cfusionmx\wwwroot\cfdocs\dochome.htm"
urlpath="http://localhost:8500/cfdocs/">
3) Optimize an index:
<cf_luceneindex
action="optimize"
indexpath="c:\cfusionmx\wwwroot\lucene\cfdocsindex\">
Attributes:
--->
<!--- include the necessary functions --->
<cftry><cfinclude template="DirectoryList.cfm"><cfcatch type="any"></cfcatch></cftry>
<!--- default attributes --->
<cfparam name="attributes.action" default="index">
<cfparam name="attributes.indexpath" default="">
<cfparam name="attributes.bCreateIndex" default="false">
<cfparam name="attributes.directory" default="">
<cfparam name="attributes.file" default="">
<cfparam name="attributes.urlpath" default="">
<cfparam name="attributes.recursive" default="true">
<!--- make sure we at least have an indexpath, a file||directory, and a urlpath --->
<cfif len(attributes.indexpath) EQ 0>
<cfabort showerror="You must supply the path to the Lucene index.">
</cfif>
<cfif attributes.action EQ "index" AND len(attributes.directory) EQ 0 AND len(attributes.file) EQ 0>
<cfabort showerror="You must either provide a file or a directory to index.">
</cfif>
<cfif attributes.action EQ "index" AND len(attributes.urlpath) EQ 0>
<cfabort showerror="You must provide the URL path for this document or file.">
</cfif>
<!--- now either do the indexing or the optimization --->
<cfswitch expression="#attributes.action#">
<cfcase value="index">
<cfscript>
// get a StopAnalyzer object
analyzer = CreateObject("java", "org.apache.lucene.analysis.StopAnalyzer");
// call the StopAnalyzer constructor
analyzer.init();
// get an IndexWriter java object
writer = CreateObject("java", "org.apache.lucene.index.IndexWriter");
// call the IndexWriter constructor
writer.init(attributes.indexpath, analyzer, attributes.bCreateIndex);
// if we're indexing a directory, we must loop over all the documents we find...
if (len(attributes.directory) GT 0) {
// make sure that the directory exists on the file system
if (DirectoryExists(attributes.directory)) {
// index this directory
indexDirectory(attributes.urlpath, attributes.directory, attributes.recursive, writer);
} else {
// show an error message
WriteOutput("Directory '" & attributes.directory & "' does not exist. Please select a different directory.");
}
// otherwise, index a single file
} else {
indexFile(attributes.urlpath, attributes.file, writer);
}
/* finally, call the close() method of the writer object, which flushes
all changes to an index, closes all associated files, and closes the
directory that the index is stored in. */
writer.close();
</cfscript>
</cfcase>
<cfcase value="optimize">
<cfscript>
// get a StopAnalyzer object
analyzer = CreateObject("java", "org.apache.lucene.analysis.StopAnalyzer");
// call the StopAnalyzer constructor
analyzer.init();
// get an IndexWriter java object
writer = CreateObject("java", "org.apache.lucene.index.IndexWriter");
// call the IndexWriter constructor
writer.init(attributes.indexpath, analyzer, false);
/* call the writer optimize() method (merges all segments together into a
single segment, optimizing an index for search) */
writer.optimize();
// close the writer so the changes are flushed and the index lock is released
writer.close();
</cfscript>
</cfcase>
</cfswitch>
<cfscript>
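// function to read a file, source from cflib.org
// (on ColdFusion 8 and later FileRead is a built-in function, so this UDF would need a different name there)
function FileRead(filename) {
var fileStr = "";
var tempStr = "";
var fileReaderClass = createObject("java", "java.io.FileReader");
var fileReader = fileReaderClass.init(filename);
var lineNumberReaderClass = createObject("java","java.io.LineNumberReader");
var lineReader = lineNumberReaderClass.init(fileReader);
var run = true;
while (run) {
tempStr = lineReader.readLine();
// readLine() returns null at end of file, which leaves tempStr undefined in CF
if (NOT IsDefined("tempStr")) {
run = false;
} else {
// readLine() strips the line break, so put one back to keep words on adjacent lines apart
fileStr = fileStr & tempStr & chr(10);
}
}
// close the reader so the file handle is released
lineReader.close();
return fileStr;
}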
/* function that loops over all the directories & files in a given directory
calling indexFile() on each file, and indexDirectory() on each directory */
function indexDirectory(url, directory, brecursive, writer) {
var urlpath = arguments[1];
var dirpath = arguments[2];
var recursive = arguments[3];
var theWriter = arguments[4];
var pathSep = "";
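var system = CreateObject("java", "java.lang.System");
// keep the loop state local so that recursive calls do not overwrite it
var qFiles = "";
var i = 0;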
// check the url & dir for trailing slashes
if (right(urlpath, 1) NEQ "/") urlpath = urlpath & "/";
// get a system object so that we can accurately determine what the system sep is
pathSep = system.getProperty("file.separator").charAt(0);
if (right(dirpath, 1) NEQ pathSep) dirpath = dirpath & pathSep;
// get all the elements in this directory
qFiles = directoryList(dirpath, "*.*", "", false);
// loop over all the elements in this directory
for (i=1; i LTE qFiles.recordcount; i=i+1) {
// if we have a directory, only descend into it when recursion was requested
if (qFiles.type[i] EQ "dir") {
if (recursive) {
// index this directory
indexDirectory(urlpath & qFiles.name[i], dirpath & qFiles.name[i], recursive, theWriter);
}
// otherwise we have a file, index it...
} else {
// index this file
indexFile(urlpath, dirpath & qFiles.name[i], theWriter);
}
}
}
// function to index a file
function indexFile(url, file, writer) {
var urlPath = arguments[1];
var filePath = arguments[2];
var fileContent = "";
var theWriter = arguments[3];
var pathSep = "";
// create a document object and add the appropriate fields
var document = CreateObject("java", "org.apache.lucene.document.Document");
// get a Field object so that we can add fields to this document
var field = CreateObject("java", "org.apache.lucene.document.Field");
// get a system object so that we can accurately determine what the system sep is
var system = CreateObject("java", "java.lang.System");
// add the url field and the content of the file to the document
// Keyword(String name, String value) = Constructs a String-valued Field that is not tokenized, but is indexed and stored
// Text(String name, String value) = Constructs a String-valued Field that is tokenized and indexed, and is stored in the index, for return with hits.
// UnIndexed(String name, String value) = Constructs a String-valued Field that is not tokenized nor indexed, but is stored in the index, for return with hits.
// UnStored(String name, String value) = Constructs a String-valued Field that is tokenized and indexed, but that is not stored in the index.
var content = FileRead(filePath);
var title = "";
var startTitle = FindNoCase("<title>", content);
var endTitle = FindNoCase("</title>", content);
pathSep = system.getProperty("file.separator").charAt(0);
if (startTitle GT 0 AND endTitle GT startTitle) {
title = trim(Mid(content, startTitle + 7, endTitle - startTitle - 7));
}
document.add(field.Keyword("url", urlPath & listLast(filePath, pathSep)));
document.add(field.Text("title", title));
document.add(field.UnIndexed("summary", content));
document.add(field.UnStored("body", content));
// index this document
theWriter.addDocument(document);
}
</cfscript>
<cfsetting enablecfoutputonly="0">
And finally, I tried to create a simpler version but stopped with this:
<cfparam name="myIndexDir" type="string" default="d:\wwwroot\temp\index\">
<cfset analyzer = CreateObject("java", "org.apache.lucene.analysis.StopAnalyzer")>
<cfset analyzer.init()>
<cfset writer = CreateObject("java", "org.apache.lucene.index.IndexWriter")>
<cfset writer.init(myIndexDir, analyzer, "true")>
<cfquery name="contentIndex" datasource="sami">
select *
FROM tblGrant
</cfquery>
<cfset field = CreateObject("java", "org.apache.lucene.document.Field")>
<cfloop query="contentIndex">
<cfset document = CreateObject("java", "org.apache.lucene.document.Document")>
<cfset content = contentIndex.summary>
<cfset title = contentIndex.grantTitle>
<cfset urlpath = "/products/detail.cfm?id=" & contentIndex.grantID>
<cfset document.add(field.Keyword("url", urlpath))>
<cfset document.add(field.Text("title", title))>
<cfset document.add(field.UnIndexed("summary", content))>
<cfset document.add(field.UnStored("body", content))>
<cfset writer.addDocument(document)>
</cfloop>
<cfset writer.close()>
Will play with this and (if you don't mind) ask questions!
I'm using it for LightWire and it is working great and I have another project to post there later this week when I can figure out what the heck to call it (a set of base classes that allow you to declaratively specify most of your model methods - initially using programmable config and shortly thereafter sidegrading to XML option). Kind of like an M2 or MG for the model. Obviously mainly for CRUD with transformation and validations but way more than scaffolding would provide . . .
http://coldfusion.sys-con.com/read/42053.htm?CFID=...
HTH
Kola