By Peter Bell

Search Alternatives to Verity

Over the weekend when I was having problems with Verity I spent some time researching mid-priced alternatives. I finally got Verity working, so this isn’t an issue for me now, but I thought I’d post what I found just in case it might be of use to anyone else.

Google search (thanks Mike for link and for sample code) now has a beta API, but you don’t have complete control over the branding so it didn’t work for my use case.

Zoom was only $99 per server with no limit on the number of sites spidered and searched. It includes sample code in the FAQs for “integrating” into CF using CFHTTP. Seemed to have a nice range of features – especially given the price.

Searchblox runs on JRun and the paid edition (starting at $600 for up to 10,000 documents) will return XML which you could call and format in CF. There is also a free version for smaller sites (no XML) and this seems like a pretty sophisticated engine.
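
For any engine that returns XML, the "call and format in CF" step might look roughly like the following. This is only a sketch under stated assumptions: the endpoint URL, the `query` parameter name, and the `result`, `title`, and `link` element names are all made up for illustration and are not Searchblox's actual interface, so check the vendor's documentation for the real ones.

```cfm
<!--- Hypothetical endpoint and XML layout; replace with the real ones from your engine's docs --->
<cfparam name="url.q" type="string" default="">
<cfset searchUrl = "http://localhost:8080/search?format=xml">

<!--- Fetch the raw XML results over HTTP --->
<cfhttp url="#searchUrl#" method="get" timeout="10">
   <cfhttpparam type="url" name="query" value="#url.q#">
</cfhttp>

<cfif IsXml(cfhttp.fileContent)>
   <!--- Assumed layout: one <result> element per hit, each with <title> and <link> children --->
   <cfset results = XmlSearch(XmlParse(cfhttp.fileContent), "//result")>
   <cfoutput>
   <cfloop from="1" to="#ArrayLen(results)#" index="i">
      <p><a href="#HTMLEditFormat(results[i].link.xmlText)#">#HTMLEditFormat(results[i].title.xmlText)#</a></p>
   </cfloop>
   </cfoutput>
<cfelse>
   <p>The search service did not return XML.</p>
</cfif>
```

The point of going this route is that the results are rendered by your own template, so the branding and URL stay under your control.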

DT Search is the search engine that I seem to remember seeing ads for in the back of computing magazines for years now. It is an extremely thorough solution and seems to be optimized for fast searching of large sites, but at a (fairly reasonable) price of $1,000 per server it was outside my budget for this problem.

If you just need a quick hosted search solution, you might also want to consider Picosearch and FreeFind. I used FreeFind recently on a project and it worked out great. You have to upload a template to set the look of the results page, and because the results page is hosted its URL is different from your site's, but it is an extremely quick way of adding search to a site.

Please feel free to post any other recommendations you might have. You might also want to check out the previous recommendations when I asked about this a little while back (scroll down to the comments).

Comments
Lucene is really the easiest, most flexible choice. A few hours of coding and you can create a very scalable search engine. The greatest advantage is that you have complete control. It is also very easy to implement with ColdFusion.
# Posted By David Sparkman | 10/24/06 11:52 AM
Hi David,

I saw Doug Hughes had played with this. Any chance of a blog entry or report for a CF'er on how to get this installed, working and talking to CF? I couldn't find any such resources but would love to link to them as I think it'd be great for the CF world.

Main reason I avoided Lucene is that I don't have a background in Java. I can read and write short code samples but wouldn't have a clue how to compile and deploy a project - last time I compiled code explicitly was in 1991 compiling my ComSci C projects at Uni!
# Posted By Peter Bell | 10/24/06 12:01 PM
Peter,

I have some code implemented. I spent weeks trying to find samples.

CF_Lucene was written for DRK3... I don't have a copy, and couldn't find it on the net even though I believe the license said it should be available freely by now.

Doug lost all his Lucene code. No longer has it last he told me. I asked for his code, no go there.

The only person I could find who had done extensive work was Joseph Lamoree, who built CFLucene. Last I spoke to him, he was going to refactor, but it was extensive, had unit tests, was designed with UML, etc... Check out http://www.cflucene.org

For me, a flat cfscript implementation was fast, however I didn't delve too much as I finally got my Verity to a stable point.

I can share a simple example if you'd like...

If you can get CFLucene to a good point, people would pay for that!
# Posted By Sami Hoda | 10/24/06 3:44 PM
Hi Sami,

I'd love anything you could share - either as a comment or just give me a URL to point to, whichever is easier. I'll try to get with Joseph to see if he's going to work on this. If anyone else has any interest, drop Joseph a line at jlamoree@gmail.com and we'll see if we can get enough interest to get something moving!
# Posted By Peter Bell | 10/24/06 3:48 PM
Ok, this is going to be messy.

My search page Lucene1.cfm:

<form action="lucene2.cfm" method="get" name="myForm">

<input name="q" type="text" size="35" maxlength="35"> <input name="submit" type="submit" value="Submit">
</form>

My Result page Lucene2.cfm

<cfparam name="myIndexDir" type="string" default="d:\wwwroot\temp\index\">
<cfparam name="url.q" type="string" default="fire">
<!---<cf_luceneindex action="optimize" indexpath="#myIndexDir#">--->

<cf_lucenesearch index="#myIndexDir#" keyword="#url.q#" r_query="qobject">

<a href="lucene1.cfm">Click here</a> to enter new search terms.<br>

<div id="searchBody">
<cfdump var="#qobject#">
</div>
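
Instead of the cfdump, the results could be rendered along these lines. This is a minimal sketch that assumes the query comes back with the URL, TITLE and SUMMARY columns the search tag builds, and the 200-character summary cutoff is an arbitrary choice for illustration:

```cfm
<!--- Render each hit as a linked title with a short excerpt --->
<cfoutput query="qobject">
   <p>
      <a href="#HTMLEditFormat(qobject.url)#">#HTMLEditFormat(qobject.title)#</a><br>
      #HTMLEditFormat(Left(qobject.summary, 200))#
   </p>
</cfoutput>
```

HTMLEditFormat is there because the summary field holds the raw file content, which may contain markup.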

The custom tag called LuceneSearch.cfm (I believe I worked on this a bit...)

<cfsetting enablecfoutputonly="1">
<!---
   Template:         lucenesearch.cfm
   Author:            Aaron Johnson
   Source Control:      \\server\wwwroot\lucene\lucenesearch.cfm
   Change History:      
                  creation: 06/08/2003 -- ASJ
   Description:
      Custom tag that searches a given lucene index for a given 'keyword'
      using the lucene StopAnalyzer and returns a ColdFusion query
      object that the caller can then iterate over.
   Usage:
      <cf_lucenesearch
         index="c:\cfusionmx\wwwroot\lucene\cfdocsindex\"
         keyword="verity"
         r_query="qobject">
   Attributes:
      index: String, the path to the lucene index
      keyword: String, the keyword(s) you want to search
      r_query: String, the name of the variable you want the results fed back into
--->

<!--- default attributes --->
<cfparam name="attributes.index" default="">
<cfparam name="attributes.keyword" default="">
<cfparam name="attributes.r_query" default="">

<!--- make sure we have an index, a keyword and a r_query --->
<cfif len(attributes.index) EQ 0>
   <cfabort showerror="You must provide the index for this search.">
</cfif>

<cfif len(attributes.keyword) EQ 0>
   <cfabort showerror="You must provide the keyword to search against.">
</cfif>

<cfif len(attributes.r_query) EQ 0>
   <cfabort showerror="You must provide the name of a variable to return the results into.">
</cfif>


<cfscript>

   index = attributes.index; // lucene index to search against
   keyword = attributes.keyword; // the keyword we're looking for
   r_query = attributes.r_query; // the name of the variable to return the query to
   
   // open an IndexReader on the index; it is passed to the searcher constructor
   indexReader = CreateObject("java", "org.apache.lucene.index.IndexReader");
   reader = indexReader.open(index);
   
   // get an IndexSearcher object, call the constructor
   searcher = CreateObject("java", "org.apache.lucene.search.IndexSearcher");
   searcher = searcher.init(reader);
   
   // get an Analyzer object, in this case we're using the StopAnalyzer object
   analyzer = CreateObject("java", "org.apache.lucene.analysis.StopAnalyzer");
   analyzer.init();
   
   // parse the keyword(s) into a Lucene query against the "body" field
   queryParser = CreateObject("java", "org.apache.lucene.queryParser.QueryParser");
   luceneQuery = queryParser.parse(keyword, "body", analyzer);
   
   // run the search
   hits = searcher.search(luceneQuery);
   
   // Create a query which contains these columns
   localQuery = QueryNew("URL, TITLE, SUMMARY");
   
   // loop over all the results, add each to the query
   for (i=0; i LT hits.length(); i=i+1) { // for each element
      doc = hits.doc(javacast("int", i)); // get the next document
      QueryAddRow(localQuery); // add a row to the query
      QuerySetCell(localQuery, "url", doc.get("url"), i+1); // add the url property
      QuerySetCell(localQuery, "title", doc.get("title"), i+1); // add the title property
      QuerySetCell(localQuery, "summary", doc.get("summary"), i+1); // add the summary property
   }
   
   // close the reader now that every hit has been copied into the query
   reader.close();
   
   // return the query to the caller
   SetVariable("caller.#r_query#", localQuery);

</cfscript>

<cfsetting enablecfoutputonly="0">

The indexer custom tag LuceneIndex.cfm:

<cfsetting enablecfoutputonly="1">
<!---
   Template:         luceneindex.cfm
   Author:            Aaron Johnson
   Source Control:      \\server\wwwroot\lucene\luceneindex.cfm
   Change History:      
                  creation: 06/08/2003 ASJ
   Description:
      Custom tag that will populate a given Lucene index using
      a given directory (possibly recursively) or a document. Also
      allows the developer to optimize a Lucene index
   Usage:
      1) Index a directory:
         <cf_luceneindex
            action="index"
            indexpath="c:\cfusionmx\wwwroot\lucene\cfdocsindex\"
            bCreateIndex="true"
            directory="c:\cfusionmx\wwwroot\cfdocs\"
            urlpath="http://localhost:8500/cfdocs/"
            recursive="true">
            
      2) Index a document:
         <cf_luceneindex
            action="index"
            indexpath="c:\cfusionmx\wwwroot\lucene\cfdocsindex\"
            bCreateIndex="true"
            file="c:\cfusionmx\wwwroot\cfdocs\dochome.htm"
            urlpath="http://localhost:8500/cfdocs/">
         
      3) Optimize an index:
         <cf_luceneindex
            action="optimize"
            indexpath="c:\cfusionmx\wwwroot\lucene\cfdocsindex\">
         
   Attributes:
--->

<!--- include the necessary functions --->
<cftry><cfinclude template="DirectoryList.cfm"><cfcatch type="any"></cfcatch></cftry>

<!--- default attributes --->
<cfparam name="attributes.action" default="index">
<cfparam name="attributes.indexpath" default="">
<cfparam name="attributes.bCreateIndex" default="false">
<cfparam name="attributes.directory" default="">
<cfparam name="attributes.file" default="">
<cfparam name="attributes.urlpath" default="">
<cfparam name="attributes.recursive" default="true">

<!--- make sure we at least have an indexpath, a file||directory, and a urlpath --->
<cfif len(attributes.indexpath) EQ 0>
   <cfabort showerror="You must supply the path to the Lucene index.">
</cfif>

<cfif attributes.action EQ "index" AND len(attributes.directory) EQ 0 AND len(attributes.file) EQ 0>
   <cfabort showerror="You must either provide a file or a directory to index.">
</cfif>

<cfif attributes.action EQ "index" AND len(attributes.urlpath) EQ 0>
   <cfabort showerror="You must provide the URL path for this document or file.">
</cfif>


<!--- now either do the indexing or the optimization --->

<cfswitch expression="#attributes.action#">
   
   <cfcase value="index">
      
      <cfscript>
         
         // get a StopAnalyzer object
         analyzer = CreateObject("java", "org.apache.lucene.analysis.StopAnalyzer");
         
         // call the StopAnalyzer constructor
         analyzer.init();
         
         // get an IndexWriter java object
         writer = CreateObject("java", "org.apache.lucene.index.IndexWriter");

         // call the IndexWriter constructor
         writer.init(attributes.indexpath, analyzer, attributes.bCreateIndex);
         
         
         // if we're indexing a directory, we must loop over all the documents we find...
         if (len(attributes.directory) GT 0) {
            
            // make sure that the directory exists on the file system
            if (DirectoryExists(attributes.directory)) {
            
               // index this directory
               indexDirectory(attributes.urlpath, attributes.directory, attributes.recursive, writer);
               
            } else {
               
               // show an error message
               WriteOutput("Directory '" & attributes.directory & "' does not exist. Please select a different directory.");
            }
               
            
         
         // otherwise, index a single file
         } else {
            
            indexFile(attributes.urlpath, attributes.file, writer);
            
         }
            
         
         /* finally, call the close() method of the writer object, which flushes
         all changes to an index, closes all associated files, and closes the
         directory that the index is stored in. */
         writer.close();
      </cfscript>
      
   </cfcase>
   
   
   <cfcase value="optimize">
      
      <cfscript>
         // get a StopAnalyzer object
         analyzer = CreateObject("java", "org.apache.lucene.analysis.StopAnalyzer");
         
         // call the StopAnalyzer constructor
         analyzer.init();
         
         // get an IndexWriter java object
         writer = CreateObject("java", "org.apache.lucene.index.IndexWriter");

         // call the IndexWriter constructor
         writer.init(attributes.indexpath, analyzer, false);
         
         /* call the writer optimize() method (merges all segments together into a
         single segment, optimizing an index for search) */
         writer.optimize();
         
         // close the writer so the changes are flushed and the index lock is released
         writer.close();
      
      </cfscript>

   </cfcase>

</cfswitch>



<cfscript>
   // function to read a file, source from cflib.org
   function FileRead(filename) {
      var fileStr = "";
      var fileReaderClass = createObject("java", "java.io.FileReader");
      var fileReader = fileReaderClass.init(filename);
      var lineNumberReaderClass = createObject("java","java.io.LineNumberReader");
      var lineReader = lineNumberReaderClass.init(fileReader);
      var tempStr = "";
      var run = true;
      
      while (run) {
         // readLine() returns a Java null at end of file, which leaves tempStr undefined in CF
         tempStr = lineReader.readLine();
         if (NOT IsDefined("tempStr")) {
            run = false;
         } else {
            // keep the line break so words on adjacent lines don't run together
            fileStr = fileStr & tempStr & chr(10);
         }
      }
      
      // release the file handle
      lineReader.close();
      return fileStr;
   }

   
   /* function that loops over all the directories & files in a given directory
      calling indexFile() on each file, and indexDirectory() on each directory */
   function indexDirectory(url, directory, brecursive, writer) {
   
      var urlpath = arguments[1];
      var dirpath = arguments[2];
      var recursive = arguments[3];
      var theWriter = arguments[4];
      var pathSep = "";
      var system = CreateObject("java", "java.lang.System");
      // var-scope the loop state so recursive calls don't overwrite it
      var qFiles = "";
      var i = 0;

      // make sure the url ends with a trailing slash
      if (right(urlpath, 1) NEQ "/") urlpath = urlpath & "/";
      
      // ask the JVM for the system path separator, then make sure the dir ends with it
      pathSep = system.getProperty("file.separator").charAt(0);
      if (right(dirpath, 1) NEQ pathSep) dirpath = dirpath & pathSep;
      
      // get all the elements in this directory
      qFiles = directoryList(dirpath, "*.*", "", false);
      
      // loop over all the elements in this directory
      for (i=1; i LTE qFiles.recordcount; i=i+1) {
         
         if (qFiles.type[i] EQ "dir") {
            // only descend into subdirectories when recursion was requested
            if (recursive) {
               indexDirectory(urlpath & qFiles.name[i], dirpath & qFiles.name[i], recursive, theWriter);
            }
         
         // we do have a file, index it...
         } else {
            indexFile(urlpath, dirpath & qFiles.name[i], theWriter);
         }
      
      }
   }
   
   
   // function to index a file
   function indexFile(url, file, writer) {
      
      var urlPath = arguments[1];
      var filePath = arguments[2];
      var fileContent = "";
      var theWriter = arguments[3];
      var pathSep = "";
      
      // create a document object and add the appropriate fields
      var document = CreateObject("java", "org.apache.lucene.document.Document");
      
      // get a Field object so that we can add fields to this document
      var field = CreateObject("java", "org.apache.lucene.document.Field");
      
      // get a system object so that we can accurately determine what the system sep is
      var system = CreateObject("java", "java.lang.System");
      
      
      // add the url field and the content of the file to the document
      
      // Keyword(String name, String value) = Constructs a String-valued Field that is not tokenized, but is indexed and stored
      // Text(String name, String value) = Constructs a String-valued Field that is tokenized and indexed, and is stored in the index, for return with hits.
      // UnIndexed(String name, String value) = Constructs a String-valued Field that is not tokenized nor indexed, but is stored in the index, for return with hits.
      // UnStored(String name, String value) = Constructs a String-valued Field that is tokenized and indexed, but that is not stored in the index.

      var content = FileRead(filePath);
      var title = "";
      var startTitle = FindNoCase("<title>", content);
      var endTitle = FindNoCase("</title>", content);
      
      pathSep = system.getProperty("file.separator").charAt(0);
      
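      // only extract a title when both tags are present and in the right order
      if (startTitle GT 0 AND endTitle GT startTitle) {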
         title = trim(Mid(content, startTitle + 7, endTitle - startTitle - 7));
      }
      
      document.add(field.Keyword("url", urlPath & listLast(filePath, pathSep)));
      document.add(field.Text("title", title));
      document.add(field.UnIndexed("summary", content));
      document.add(field.UnStored("body", content));
      // index this document
      theWriter.addDocument(document);
   
   }

</cfscript>







<cfsetting enablecfoutputonly="0">

And finally, I tried to create a simpler version but stopped with this:

<cfparam name="myIndexDir" type="string" default="d:\wwwroot\temp\index\">
<cfset analyzer = CreateObject("java", "org.apache.lucene.analysis.StopAnalyzer")>
<cfset analyzer.init()>
<cfset writer = CreateObject("java", "org.apache.lucene.index.IndexWriter")>
<cfset writer.init(myIndexDir, analyzer, "true")>
         
<cfquery name="contentIndex" datasource="sami">
select *
FROM tblGrant
</cfquery>

<cfset field = CreateObject("java", "org.apache.lucene.document.Field")>
<cfloop query="contentIndex">

   <cfset document = CreateObject("java", "org.apache.lucene.document.Document")>
   
   <cfset content = contentIndex.summary>
   <cfset title = contentIndex.grantTitle>
   <cfset urlpath = "/products/detail.cfm?id=" & contentIndex.grantID>
   
   <cfset document.add(field.Keyword("url", urlpath))>
   <cfset document.add(field.Text("title", title))>
   <cfset document.add(field.UnIndexed("summary", content))>
   <cfset document.add(field.UnStored("body", content))>
   <cfset writer.addDocument(document)>
</cfloop>   
   
<cfset writer.close()>
# Posted By Sami Hoda | 10/24/06 3:59 PM
Messy (I need to add better support for code in comments), but very cool!

Will play with this and (if you don't mind) ask questions!
# Posted By Peter Bell | 10/24/06 4:11 PM
Ask away. If we can open-source a nice standard approach, and maybe get it on Google Code or RIAForge (which is a dumb name if you ask me, when almost everything there is not RIA but just Adobe product related)... then I'm all for it.
# Posted By Sami Hoda | 10/24/06 4:50 PM
That would be cool, and while the name isn't quite perfect (I think Ray and Ben made it clear it wasn't their first choice and was still an open question), the site itself is pretty cool.

I'm using it for LightWire and it is working great, and I have another project to post there later this week when I can figure out what the heck to call it (a set of base classes that allow you to declaratively specify most of your model methods - initially using programmable config and shortly thereafter sidegrading to an XML option). Kind of like an M2 or MG for the model. Obviously mainly for CRUD with transformation and validations, but way more than scaffolding would provide . . .
# Posted By Peter Bell | 10/24/06 4:55 PM
There was also a CFDJ article on using Lucene

http://coldfusion.sys-con.com/read/42053.htm?CFID=...

HTH

Kola
# Posted By kola | 10/25/06 3:09 PM
Thanks - great link!!!
# Posted By Peter Bell | 10/25/06 3:11 PM
Yeah, that's the same code I put in my entry, refined... Aaron also has a blog...
# Posted By Sami Hoda | 10/25/06 3:29 PM