By Peter Bell

Is Search Really Easier in Ruby than in CF?

I was having some real problems getting full text site search to work using Verity's vspider, and from the information I could find out there, I wasn't the only one (I eventually got it working with a bunch of help from Jim and Tony!).

Sami Hoda, amongst others, was using Lucene, and while it wasn't difficult, it didn't seem to be trivial to do either (check out the comments, which include Sami's code and a link to a CFDJ article).

To add insult to injury, I just read this post talking about how easy it is to integrate Lucene with Ruby. There are many words I associate with Ruby. Elegant, fun, terse and "unproven but promising" are all terms that come to mind (and I'll admit the last one is getting better, but if I needed 24x7 I'd still sleep better with solid Java middleware instead).

But easier? Than ColdFusion? The language that brought us cfquery, cffile and cfsearch?!

To be fair, the article doesn't even MENTION spidering, so I think the simplicity is being overplayed, but it did hit a raw nerve. Why is spidered full text search so (relatively) difficult and (relatively) undocumented in CF? Are there really so few people who have that use case? I just never thought of spidered full text search as an esoteric requirement or an edge case . . .

I'll be putting together some kind of simple tool for automating the vspider stuff, but if we're talking wishlist, adding vspider reliably to CF Admin is a lot higher for me than interfaces and nulls!

Comments
Hi Peter,

I know you didn't ask, but...

The CFSEARCH implementation in BlueDragon is based on Lucene, and includes the ability to spider web sites. See Section 4.2.10 of the BD CFML Enhancements Guide:

http://www.newatlanta.com/products/bluedragon/self...

BD 7.0 adds support for Word and PDF docs, and multiple languages (BD 6.2 only supports text documents in English).

And, yes, all of this is included in all BD editions, including BD.NET and the free BD Server edition.

Cheers,

Vince
# Posted By vinceb | 10/25/06 10:59 PM
Hi Vince,

You are right that I didn't ask, but I'm glad that you pointed it out. BD is an important part of the community (as are you) and it's great to get a BD take on all such issues.
# Posted By Peter Bell | 10/25/06 11:11 PM
Spidering requires you to use a command line tool and is cryptic at best. But CF/Verity integration allows you to index an entire document library (in over 120 document formats) in a single line of code. Not to mention database content with the addition of a single CFQUERY. Like most things, if you use it for what it was designed for, it's great -- vspider is somewhat of an afterthought.
# Posted By Geoff Bowers | 10/26/06 1:03 AM
Hi Geoff,

Agreed 100%. The question is "why"?! The use cases for document based search and spider based search are completely different. Almost every site I build wants to have spidered search, and then some also want content specific document (well, usually database) searches.

One of these is handled beautifully and the other - well, the other isn't yet handled beautifully!!!
# Posted By Peter Bell | 10/26/06 9:03 AM
Document search works great for chugging through database data or specific directories of documents - trying to kludge it into a site-wide search, however, was always frustrating.

I'd always run into problems where I'd see CF code in my search results, and the classic - how do I remove directories, files, etc from the results... It would work but it was never 100%.

That's when I started looking at vspider and it turned out to be an elegant solution - AFTER I struggled to get it working. The vspider is such a powerful tool and, as Peter said, I'm really surprised more people don't use it... Maybe in CF9?
# Posted By Jim | 10/26/06 10:30 AM
Hi Jim,

Agreed - I had exactly the same problems. Especially as people often expect a Google-like page based search where groups of content have a context. The file based approach just doesn't work for that when you have multiple independent content areas dependent on the "page" you are on. The same article may display in different parts of the site with different ancillary content and even different formatting depending on the context of the page.

The pages don't really exist so you can't just search the html files, and the underlying content items don't map directly to a single page so you can't replicate page context using Verity against the db.

There are good use cases for document based (not spidered) Verity search, but I actually come across those less often than clients who just want a spidered site search.

As Jim says, maybe CF9, although I will try to package up a little utility/generator tool for creating and running the appropriate commands - kind of like a vspider.cfc that handles all of the grunt work. When I do that if it seems valuable enough I'll post it on RIAForge.
# Posted By Peter Bell | 10/26/06 10:37 AM
Sounds good! Wonder if you could leverage the CFAdmin APIs to create the collections, etc? I haven't messed with those so I'm not sure how much is exposed.
# Posted By Jim | 10/26/06 10:46 AM
That is what I was depending on, although I can always use the old unsupported servicefactory stuff if the API doesn't give me what I need. Haven't looked into it at all, but compared to getting vspider to work it should be a walk in the park.

Guessing I'll also need cfexecute for the bat files I generate and I'll have to tie into some kind of scheduling mechanism. I'll probably just generate a single scheduled task for a vspider.cfm that then runs all of the bat files in a given directory or something. Should only be an evening's worth of messing around as long as I don't run into any silly stuff. Will try to drop it in over the weekend or next week.
# Posted By Peter Bell | 10/26/06 10:58 AM