By Peter Bell

What is UP with vspider?

All I want to do is create a simple spider based site wide search. I did this for a project in about 10 minutes using a third party hosted solution, but that’s $250/project/year and I’d rather have an in-house solution.

[LATEST UPDATE] Got it working - follow the comments to see the gotchas to look out for and how to use the utilities for testing. I'll try to wrap this craziness with some kind of simple generator, but it may be a while before I get around to it.

I have CFMX 7.0.2 so I don’t need to download the style files. I’m on Windows, so I don’t have to worry about the Linux problems. I don’t want to spend $3,000 for the standard edition of the ColdFusion Search Expansion Pack, but I can map all of my sites to http://localhost/site_name and with my architecture that works (subdirectory is OK and I don’t have any absolute URLs or SSL pages to index so I can do this), so no problems there. I can successfully create a collection using either a batch file or a command line with the extended parameters in a .txt file, using the -cmdfile syntax recommended by Adobe. I can add the collection to the CF Admin successfully and it shows 104 documents and 631KB so that is working.

But I’m getting “There was a problem executing the CFSearch tag with the following collections” with a 1705 error code. I read that Verity can require a restart after indexing, so I’ve tried that without luck.
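For anyone following along, here is a minimal sketch of the kind of cfsearch call that triggers the error, wrapped in cftry so the full message and detail get displayed. The collection name SPIDERTEST1 is borrowed from the batch file in the comments; the url.q parameter and the results query name are placeholders for illustration rather than the code from the project.

```cfml
<!--- Sketch only: "SPIDERTEST1" and url.q are placeholder names. --->
<cfparam name="url.q" default="">

<cftry>
    <!--- Search the vspider-built collection registered in the CF Admin --->
    <cfsearch name="results"
              collection="SPIDERTEST1"
              criteria="#url.q#"
              maxrows="20">

    <cfoutput>#results.recordcount# document(s) found<br></cfoutput>
    <cfoutput query="results">
        <a href="#results.url#">#HTMLEditFormat(results.title)#</a> (#results.score#)<br>
    </cfoutput>

    <cfcatch type="any">
        <!--- Show the full error text instead of a generic error page --->
        <cfoutput>
            Search failed: #HTMLEditFormat(cfcatch.message)#<br>
            #HTMLEditFormat(cfcatch.detail)#
        </cfoutput>
    </cfcatch>
</cftry>
```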

When I have this kind of problem getting a simple search working, I feel there is something wrong. Especially when I see people like Doug Hughes (who is not a stupid person) give up in desperation and decide it would be easier to integrate Lucene (an OS Java search engine) than to run what should be a simple batch file and single command in the cf admin (plus cfsearch which I have no problems with at all for db or file based collections).

I’m also amazed how few people seem to have tried this. Doesn’t anyone need spider based search (which is fundamentally different from content item based search – each has different use cases)? There seem to be very few resources online for making this work. I’ve spent maybe 5-6 hours Googling, and while I came across a fair number of links (including some old Daemon tutorials which refer to CF5, so may be out of date, and which suggest future problems with the collections from vspider not including the URLs correctly), I didn’t find anything useful for the 1705 errors.

Am I the only CF developer who’d like spider based search or did everyone just conclude vspider wasn’t up to the job and use a third party solution?! What is up with vspider?! I know Steve Erat used to be involved with Verity and Matt Woodward's blog says he uses it quite a bit, so maybe someone will be able to enlighten me on this. I’m sure it can’t be as difficult as it seems.

Much as I like OO programming, maybe I should put a simple way to do spider based searches for Verity through the admin on my wishlist instead of all those interfaces and nulls and the like?!

[update] Isn't it sad that I am almost considering following this advice from 2001 to get around using vspider while still using Verity? It is actually quite an intelligent approach and I got most of the way through the process before I decided that in 2006 there should be a better solution (although I'm not convinced yet that I've found one!). Congrats to Michael Barr for a great article - especially given it is 5 years old and still seems to be a less bad approach than anything else I can think of!

[update 2] I was thinking for a very short time of writing my own spider until I realized that a single threaded language with sedate performance was not exactly the development platform for a spider (it would work for small sites, but writing a real spider is pretty non-trivial). Still, for anyone who wants to try, here is a "spider writing 101" article which covers some of the very basic features you need to write your own spider in .net. If you want to see why this is a bad idea, look at the configuration settings available for vspider and you'll get an idea of the kind of capabilities you'd have to write. Still, cool article anyway.

[update 3] Another sufferer. At least she got far enough that CFSearch returned (useless) results. Believe the Daemon article above speaks to her problem. [new update] Appears this was fixed in hotfix 3 in MX 7 according to this page (search for vspider).

[update 4] Someone ran into another issue moving 6 to 7 with the CF search not accepting a full path. Unfortunately I was already just putting the collection name in, so that wasn't my problem.

Comments
Peter,

I have vspider spidering www.sheriff.org and it works perfectly. I will admit that it did take me quite some time to get working.

I think that the problem that you are describing is because you created the collection BEFORE you did the spidering. The biggest gotcha with using vspider is that you must first spider the site that you want, then go into CF Administrator and create the Verity collection and point it to the already created vspider collection. If done correctly you should see the number of documents and size of the collection on the administrator page.

If you want to contact me directly so I can help you over the phone, please email me at my email address and I will send you my cell phone number.

Vspider can be very frustrating to get up and running. I really have to sit down one day and do a video tutorial on how I got it working since I've seen posts like yours before.
# Posted By Tony Petruzzi | 10/21/06 6:20 PM
Amen! I've been looking for a simple solution to index the HTML of CF sites for years. Verity may be great at document search, but from what I've seen, in-site search using it is just too painful. This is definitely something I'd like to see in CF 8. Even if they have to break document search and site spidering into two different engines.

For years I've been using hacks to accomplish this same thing and it never works like I want it to. CF makes everything else simple and I think Adobe could help us out here too. It would also be nice for this to work in a shared hosting environment.
# Posted By Brad | 10/21/06 6:30 PM
Hi Tony,

Actually I followed the directions and used a command line/batch file (did both - both worked equally well) to create a completely new collection. I THEN went into cfadmin and "created" a collection with that name. It found the collection and showed both a number of documents and a size, so that worked well, but then when I tried my cfsearch it gave me the 1705 error. I then restarted the CF Search service (I'm on Windows) as some posts suggested that Verity needs to be restarted after indexing a collection using vspider. Still the same 1705 error.

I also made sure not to use ANY of the cfadmin commands, and just in case there were funky K2 caching issues I wasn't aware of, I created a new collection for every single test I ran just to be sure. Any help would be appreciated, though, so I'll email you, and if you want to speak or IM some time, or if there's anything I'm not thinking of, I'd love any help you could provide!
# Posted By Peter Bell | 10/21/06 6:46 PM
Hi Brad,

Agreed 100%. If this was PHP or Ruby (no disrespect to either language intended) I'd kinda expect everything to be all command line and difficult to use. But this is from the company that brought us cffile and cfquery (and - for that matter - cfsearch!).
# Posted By Peter Bell | 10/21/06 6:49 PM
Are you using a .txt file to pass command line arguments to the spider? Can you post that?

It took awhile but I had great success with VSpider on CFMX 6.1. I basically started the spider with as few options as possible and then added options one at a time til I got everything working... There are some logging options you can turn on I think to get more info out of it... I've switched jobs since then and haven't really messed with VSpider in CFMX 7.

I do agree though that it should be easier, and I'm surprised more people don't use it...

Jim
# Posted By Jim | 10/21/06 7:18 PM
Hi Jim,

Good point.

I started by trying a bat file which I put into c:\cfusionmx7\verity\k2\_nti40\bin. It contained the following single-line command (only pause was on the second line):

C:\CFusionMX7\verity\k2\_nti40\bin\vspider -style C:\CFusionMX7\verity\Data\stylesets\ColdFusionVspider\ -collection C:\CFusionMX7\verity\collections\SPIDERTEST1 -start http://localhost/ -cgiok -abspath -reparse -indmimeinclude text/* -indmimeexclude text/css

pause
# Posted By Peter Bell | 10/21/06 7:33 PM
Then I tried using a command which was:

c:\cfusionmx7\verity\k2\_nti40\bin\vspider.exe -cmdfile c:\verity.txt

c:\verity.txt had the following content:
-style c:\cfusionmx7\verity\data\stylesets\coldfusionvspider\
-collection c:\cfusionmx7\verity\collections\SPIDERTEST2
-start http://localhost/generalfinishes
-indinclude "*/generalfinishes*"
-cgiok
# Posted By Peter Bell | 10/21/06 7:37 PM
I was doing it via a command line with a .txt file. You should be able to add vspider to your path and call that file from anywhere. I had a /verity directory in each of my projects and then would kick that off with a scheduled task once a night.

There is a command line tool (and I'll have to dig around to find the name) but with it you can check the collection directly vs. having to use cfsearch. A bit easier to debug that way - you can at least see if anything is in the collection. Think it's 'mkvdk' but it's been awhile.

Jim
# Posted By Jim | 10/21/06 7:47 PM
Hi Jim,

Looks like that one creates and indexes collections. Here is a good set of links to the documentation in version 7 although the docs are a little sparse in places . . .
http://livedocs.macromedia.com/coldfusion/7/htmldo...
# Posted By Peter Bell | 10/21/06 8:03 PM
One other thing I ran into problems with - English vs. EnglishX in regards to the language. I can't remember which one is the default - I'd specify that in your .txt file and then make sure you select the same one when you create the collection in CFAdmin.

Jim
# Posted By Jim | 10/21/06 8:12 PM
If you create it in CF Admin, it defaults to english. If you create it in vspider (and don't set the language), it defaults to englishx.

That was mentioned as a possible issue here:
http://software.groupbrowser.com/archive/t-206192....

However, I just created in vspider so it was englishx and then cfadmin picked up the language correctly, so that doesn't seem to be the problem.

FYI, here are the latest comments on languages and locales in Verity:
http://livedocs.macromedia.com/coldfusion/7/htmldo...
-language
Syntax: -language name
Specifies the Verity locale to use in indexing. This option is being replaced by the semantically consistent -locale option, and is still supported for backwards compatibility.

-locale
Syntax: -locale name
Specifies the Verity locale to use in indexing, such as German (deutsch) or French (français). The default is English (english). This option is identical to the -language option.
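Putting Jim's suggestion into practice, a command file that sets the language explicitly might look like the sketch below. It reuses the paths from the earlier command file; the collection name SPIDERTEST5 is made up, and englishx as a value for -locale is an assumption based on the language name shown in the CF Admin, so check it against the docs for your version.

```
-style c:\cfusionmx7\verity\data\stylesets\coldfusionvspider\
-collection c:\cfusionmx7\verity\collections\SPIDERTEST5
-start http://localhost/generalfinishes
-indinclude "*/generalfinishes*"
-cgiok
-locale englishx
```

Whichever value you use, pick the same language when registering the collection in the CF Admin.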
# Posted By Peter Bell | 10/21/06 8:25 PM
I had to dig a bit - I don't have CF setup on my Linux box at home (yet) - the command is 'rcvdk' - it'll allow you to connect to the collection and do searches, etc without running cfsearch. It would be interesting to see if that throws any errors.
# Posted By Jim | 10/21/06 9:13 PM
Hi Jim,

Many thanks! I just gave that a shot. For anyone who wants to follow along at home:

Go to Verity bin directory (or add it to your path)
rcvdk (starts util)
a c:\cfusionmx7\verity\collections\spidertest4 (attaches the spidertest4 collection)
s (runs basic search and shows how many documents returned - 104 in my case)
r (returns all of the documents returned in the search)

So bottom line, collection looks good, it just seems to be cfsearch is having a problem speaking to it. Hmmm.
# Posted By Peter Bell | 10/21/06 9:38 PM
WORKING!!!!!

Last time I didn't explicitly set the language in the cf admin, but it picked up the language as englishx so it all seemed to be fine. However when I deleted the collection, created a new one using vspider and then explicitly set the collection language in CF admin as english x (advanced) then it all worked fine, so Jim - you were right a couple of comments back - sorry about that!!!

Thanks everyone for all the help, and please Adobe add this to the list for whatever the next version is after Scorpio. There has got to be an easier way to wrap this in the CF Admin - just drop me a line if you'd like input on the use cases and screens.
# Posted By Peter Bell | 10/21/06 10:00 PM
@Peter,

Awesome to hear you got it working. I totally forgot to mention the englishx thing. Excellent call, Jim. Peter, make sure you throw something up on USENET on how you solved this problem so people in the future will have a reference.
# Posted By Tony Petruzzi | 10/22/06 8:28 AM
Hi Tony,

The purpose of the blog post was so people would have a reference (I get picked up pretty well on Google). I know this is going to sound crazy, but I've not used usenet for years - I'm not sure I'd even know how (other than googling for a web based news reader).
# Posted By Peter Bell | 10/22/06 12:08 PM
So, would you believe, I installed 7.0.2 on another server, followed all of the instructions above and it ISN'T WORKING!!!

The collection creates, I register it fine using CF Admin afterwards with the right language and it sees the number of documents and their size correctly.

I access command line using rcvdk and the collection displayed just perfectly. My collection looks good.

But when I try to access it using cfsearch (which I tested against another collection created in CF which worked fine) I got the following error:

An error occurred while accessing a Verity collection.
Could not find the ColdFusion registered information for [test3].

where test3 was the name of the collection that does exist, and that I did successfully add to the cf admin using the correct language.

When I Google this the only thing I found was a couple of people ages ago with the same problem and no substantive answers.

Any ideas wildly appreciated - any thoughts on where to go next?
# Posted By Peter Bell | 10/25/06 12:27 PM
Yeah - I'd believe it :(

Is this server exactly like the other?? Have you tried copying your code from the working box to the new one? I think you can actually copy over the collection as well - I remember reading that somewhere - you could in theory build collections on one box and move them elsewhere to use.

Jim
# Posted By Jim | 10/25/06 1:24 PM
This help any:

http://groups.google.com/group/macromedia.coldfusi...

I do remember adding something to my Verity scripts to restart the search service - I'll try to dig up my batch files when I get home... I'm not sure if it helped - but was one of the things I tried...

Jim
# Posted By Jim | 10/25/06 1:35 PM
Is the server identical? Probably not. But both have CFMX 7.0.2 and the same directory structure for CF, both have localhost set up and working and both are able to create collections and in both cases I can add collection and see the right number of documents and file size in cf admin. Just can't search the vspider created collection on the second box.

It wouldn't solve my problem to copy collections across because of the localhost limitation, but for fun I did successfully copy a collection across. Again, I was able to add the vspider collection to the cfadmin just fine, and again I got the message (this time for collection "gf3"):

An error occurred while accessing a Verity collection.
Could not find the ColdFusion registered information for [gf3].

Any other thoughts at all?
# Posted By Peter Bell | 10/25/06 1:39 PM
Hi Jim.

Thanks for the link! Link suggested calling cfcollection to persuade CF the new collection was really there. That didn't work, but I called

<cfcollection action="list" name="test">
<cfoutput query="test">#Name#<br></cfoutput>

Interestingly it didn't see the new collection even though the admin sees it. I restarted search service and even restarted CF - still the same problem even though the admin sees them.

However I upgraded 6.1-7 and I'm wondering if 6.1 is handling the page requests or something. Let me see if I can get something on which version of CF is doing what . . . maybe that is the issue . . .
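Since the Name column alone does not say much, a slightly fuller version of that listing can help when comparing what CF sees against what the admin shows. This is only a sketch; it assumes the query returned by cfcollection action="list" includes Name, Language and Path columns, as described in the CFMX 7 documentation, and the query name collections is arbitrary.

```cfml
<!--- List every collection this CF instance knows about --->
<cfcollection action="list" name="collections">

<table border="1">
    <tr><th>Name</th><th>Language</th><th>Path</th></tr>
    <cfoutput query="collections">
        <tr>
            <td>#collections.name#</td>
            <td>#collections.language#</td>
            <td>#collections.path#</td>
        </tr>
    </cfoutput>
</table>
```

If the vspider collection is missing from this list but visible in the admin, the page and the admin are probably not talking to the same CF instance.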
# Posted By Peter Bell | 10/25/06 1:51 PM
OK, so it was a short circuit between the headphones to paraphrase an old saying.

I had 6.1 and 7 running. When I upgraded it didn't automatically upgrade an extension I'd added (obviously) to let CF process .html files. The reason CF couldn't see the collections was that they were registered with 7 but 6.1 was processing those pages.
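A quick way to check for this kind of mix-up is to have the page report which server is actually processing it. Here is a minimal sketch using the built-in server scope; put it in a page with the extension you care about (here, a .html page).

```cfml
<!--- Reports which ColdFusion install is handling this request --->
<cfoutput>
    Product: #server.coldfusion.productname#<br>
    Version: #server.coldfusion.productversion#<br>
    Root dir: #server.coldfusion.rootdir#
</cfoutput>
```

If the version reported is a 6.1 build rather than 7.0.2, the request is going to the older install, and collections registered with 7 will not be visible to it.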

All now working just fine.

Thanks for the help yet again!
# Posted By Peter Bell | 10/25/06 2:25 PM
"so it was a short circuit between the headphones to paraphrase an old saying."

:)

Glad you got it working (again)! I've never run two instances of CF at once before - interesting that it works at all! :)

Jim
# Posted By Jim | 10/25/06 2:47 PM
Yeah. I didn't KNOW I was running 2 instances at once until I looked at my list of running services.

Nice to know
# Posted By Peter Bell | 10/25/06 3:01 PM
Peter,

Good stuff here. I ran into almost this exact issue about a year or so ago and was pulling my hair out just as much as you are. I finally figured it out through trial and error. Ironically the project I need this for is probably never going to see the light of day.

I recently added a couple of new PDFs to the site and ran vspider so they would be added to the index, but now vspider is suddenly choking on indexing PDF files. I've made no code or structural changes to the site. The only difference is that I believe I was running CF 7.0 when I did the initial development, and now I am running 7.02.

Here is the error I get: Warn: [vspider] (ind002006) VDK: Warn E0-1514 (Drvr): TstrIOFilter:flt_kv: KV failed on filtering document: error = 17.

Here is the command that I use:
vspider -style C:\CFusionMX7\verity\Data\stylesets\ColdFusionVspider -collection C:\CFusionMX7\verity\collections\splan4 -start http://127.0.0.1/sp/ -exclude PDF=true -cgiok -indinclude *

I found a couple people running into the same issue searching Google but it doesn't appear anyone has found a solution. I would be more than welcome to hear suggestions. This could forc

I have to agree with you that it would be nice if Adobe made a few enhancements to vspider to make it easier to use. I use Fusebox as my framework and the only effective solution using Verity is to use vspider so that pdf files get included in the index.

It will be interesting to see what happens with CF and Verity in the future since Verity was bought by Autonomy about a year ago. It looks like Autonomy is strongly committed to K2 but you don't know if that extends to their relationship with Adobe and ColdFusion.

Ray
# Posted By Ray Buechler | 11/2/06 2:53 PM
I'm not sure you have the syntax right for your PDFs...

-exclude PDF=true

See this on the Livedocs:
http://livedocs.macromedia.com/coldfusion/7/htmldo...

To specify a file, path, or URL that you want followed but not indexed, use the -indexclude option. For document types, use the -mimeexclude option instead; for example, specify
-mimeexclude application/pdf rather than -exclude *.pdf.

But that suggests you want to exclude PDFs?

I would get rid of that statement and rerun the spider and see what happens.
# Posted By Jim | 11/2/06 3:55 PM
Hi Ray,

Good advice from Jim. Let us know what comes of it!
# Posted By Peter Bell | 11/2/06 3:59 PM
Jim,

-exclude PDF=true excludes any URLs with that variable attached from being indexed. There is a link on each page of the site that, when clicked, will render the page as a PDF (using the cfdocument tag). All of those URLs have pdf=true appended to the end of them. If I indexed those URLs, every page on the site would be indexed twice.

It definitely indexed static pdf files in the past using the command with the exclude PDF=true syntax in it.

Ray
# Posted By Ray Buechler | 11/2/06 4:03 PM
Hi Ray,

Did you try quotes around the exclude parameter: -exclude "PDF=true"?

I remember something about windows version being picky about this or something. Could be completely wrong but might be worth a shot (long shot I know . . .)
# Posted By Peter Bell | 11/2/06 4:07 PM
Jim and Peter,

I tried it with quotes and completely removing the exclude parameter and still got the same error.

The only change from when I first figured this out and it was indexing pdf's correctly are the upgrades to 7.01 and then 7.02.
# Posted By Ray Buechler | 11/2/06 7:49 PM
Hi Ray,

Strange. I'm on 7.02 and all works fine for me, but clearly you're not the only person having problems. Let us know if you figure it out. Don't suppose you have a support contract with Adobe?!
# Posted By Peter Bell | 11/3/06 9:40 AM
I have never used VSpider, and this sounds daunting. I have created a simple flat-page Verity Search. I can create a collection, index the collection and use CFSearch to search the collection. My only problem is that I want to be able to "exclude" certain folders, because I'm getting ridiculous search results... for example, I'm getting all the files in my "includes" folders for my navigation menus.

So, eventually, I'll want to search the database, but I have a separate search for that already. All I want is to be able to create an overall site search that doesn't return vti and include files. Also there are password protected directories on the site that people can't access, but I don't want them coming up in the list.

It was suggested that VSpider would allow me to exclude certain folders from the process.

Can you tell me if this is so? Also, can someone tell me if there is a simpler way to exclude folders from an index process? I can't imagine having to go through all this just to have something ignored.

I'm coming up on a deadline of Thursday, and I can see that VSpider is not happening by then.

Help! And Thanks! WCW
# Posted By Willow Wright | 2/13/07 1:20 PM
There used to be a custom tag that would allow you to exclude directories with CFSearch - but I always had intermittent luck with it - sometimes it would work - sometimes not.

That's why I ended up using Vspider...

Vspider simply follows links across your pages - just like any other spider (Google/Yahoo) so if your files are setup correctly you should have no issues.

Even with your tight deadline - I'd give yourself an hour and see if you can't get Vspider working - if it works - then you are set. If not you can always fall back on using CFSearch with a custom tag.

Unfortunately the ColdFusion exchange looks like it is down.
http://www.adobe.com/cfusion/exchange/

I can dig around tonight and see if I can find a copy of that tag in my code at home.
# Posted By Jim Priest | 2/13/07 2:09 PM
As Jim said, give it an hour. If Verity works, it just works and you are sorted. The problem is when it doesn't that it becomes a pain. But at least this page lists a bunch of things you can try/check if it becomes a problem.

Good luck!
# Posted By Peter Bell | 2/13/07 2:22 PM
Willow - tonight go out and pay homage to the 'why should I save this really old code' gods... :)

http://www.thecrumb.com/wiki/code/coldfusion/dirse...

Should be enough to get you started. This tag is better than nothing - there may be better ways to accomplish this - and I'd still give VSpider a shot if you have time.
# Posted By Jim | 2/13/07 8:51 PM
I have a unique situation in which 99% of the pages on a site I inherited are all funneled through one page and use query parameters to build subsequent pages, i.e., http://mysite.com/basicpage.cfm?cat1=100&cat2=...

The querystring parameters tell the main page to load include files based on the parameter values. Long story short, I want to exclude pages with certain parameters, so I tried:

-exclude http://mysite.com/basicpage.cfm?cat1=100&cat2=...*

and so on for each subparameter I wanted to exclude within my vspider cmdfile. Needless to say, this doesn't work, presumably because the exclude doesn't contain a true directory.

Is there any way to filter out individual pages, either through a vspider cmd file or using cfindex?
# Posted By Keith | 3/8/07 7:18 PM