This section is for those wondering whether Harvest will run on their system, and whether it will do what they want it to. If you've already downloaded Harvest and are trying to get it working, then you're probably best skipping this bit.
Harvest should run on any UNIX-based system. It has been run successfully on the following systems:
(Contributions of more systems to this list are gratefully received)
Harvest is *not* currently available for Windows NT, and there are no plans for a port in the near future.
No. Harvest can be installed in any area of the file system, and should not be run as root. However, access to the web server configuration files is necessary for the initial installation.
Harvest can gather HTTP objects across a firewall by using a proxy server. Proxies for other services are not currently supported. To export indexes from the gatherer through the firewall, a hole would have to be made for the TCP/IP port used by the gatherer.
Yes. Harvest will work with systems where the NIS servers have been configured to query DNS servers for unknown hostnames. See also the section on `What does "Can't get my own host info!?" mean?' for details of what to do on Sun OS systems running without DNS.
No, the gatherer and broker can both be run on different machines from the web server, and need not be run on the same machine as each other.
The Harvest gatherer can be used to fetch objects for use with the Catalog Server. Details of how to incorporate these objects are available from http://help.netscape.com/kb/server/961023-1.html
The gather command referred to in the above document is included with the harvest distribution, and is also available as a standalone package.
Yes. Both robots.txt files and META robots tags are supported. The correct format for robots.txt files is documented at http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html. Harvest may have problems gathering from sites which have incorrectly formed robots.txt files.
The format for META robots tags, which give users control over indexing on a page by page basis, is available from http://info.webcrawler.com/mak/projects/robots/meta-user.html
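As an illustration of what a META robots tag carries, here is a minimal Python sketch that pulls the directives out of a page. It is not Harvest's own parser; the class and function names are made up for this example, and it only uses Python's standard html.parser module.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma separated and case insensitive.
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def robots_directives(html_text):
    """Return the set of robots directives declared in html_text (hypothetical helper)."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.directives
```

A page containing `<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">` would yield the directives noindex and nofollow, which tells you why a robot honouring the tag skips that page.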
This section covers commonly asked questions about the process of compiling and installing Harvest.
The main Harvest distribution site for versions 1.5 and later is http://www.tardis.ed.ac.uk/harvest/ with code available for ftp from ftp://ftp.tardis.ed.ac.uk/pub/harvest/
In addition, the following sites mirror the Harvest code. Please use whichever site is closest.
Full installation instructions are available from http://www.tardis.ed.ac.uk/harvest/docs/install.html
By default Harvest will install all of its files in /usr/local/harvest, however you can alter this by editing the top level Makefile and changing the line that reads
prefix=/usr/local/harvest
to the location where you want Harvest installed. If you do this, take care when following the rest of the installation instructions to substitute your chosen location for /usr/local/harvest. In addition, you have to change all of the Harvest CGIs to reference the correct location, in particular BrokerQuery.pl and nph-search.
make reconfigure only works within the core Harvest code (the files in the src/ subdirectory) and not on any of the components. This is a bug, which will be fixed. However, a similar result can be obtained by using 'make realclean'. On the next 'make' the software will be reconfigured.
This section contains questions about, and solutions to, some of the problems that have been encountered when running Harvest. In addition to consulting this section, check your log files for errors and look at section 4 on Error Messages, below.
This is a very broad category that covers most of the problems that are experienced with Harvest. The problem can be broken down into three main areas which are covered in sections below. The problem may be with the gatherer, the broker or with the communication between the two.
Core dumps can occur for a variety of reasons. However, a number of people have reported core dumps occurring in the gatherer before any objects are fetched.
This is most often caused by the site having a malformed robots.txt file. If you're attempting to fetch a page http://www.blob.org/blobby/blah.html then Harvest will first check the robots.txt file at http://www.blob.org/robots.txt
The correct format of this file is documented at http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html
Under some circumstances Harvest may dump core if this file is corrupt.
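To check a site's robots.txt before pointing a gatherer at it, something like the following Python sketch may help. It uses Python's standard urllib.robotparser rather than Harvest's own code, so a file that parses here is not guaranteed to be accepted by Harvest; the function names and the "Harvest" agent string are assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url):
    """Return the robots.txt URL that a robot consults before fetching page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def may_fetch(robots_text, agent, page_url):
    """Apply the rules in robots_text (the file's contents) to page_url for agent."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, page_url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /blobby/\n"
    page = "http://www.blob.org/blobby/blah.html"
    print(robots_url(page))
    print(may_fetch(rules, "Harvest", page))
```

Fetch the file reported by robots_url with any HTTP client, pass its contents to may_fetch, and compare the answer with what you expect the site to allow.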
If the SOIF produced from an object (or the output displayed by the broker) contains raw HTML tags, this indicates that the HTML in the page has not been correctly parsed. Harvest uses a strict, SGML based parser by default. This parser will fail if the page contains invalid HTML. You can turn on errors to see which pages are failing, and why, by altering a line in the file $HARVEST_HOME/lib/gatherer/SGML.sum from
$syntax_check = 0;
to
$syntax_check = 1;
If you don't want to fix these problems then you might like to use the less strict HTML summariser. However, the quality of the SOIF produced by this summariser is poorer than that of the SGML based one. You can enable this lax summariser by following the instructions in $HARVEST_HOME/lib/gatherer/HTML.sum
To ensure that the objects contained in the broker and gatherer are up to date, a time-to-live value is associated with them at the time of gathering. Objects which are older than this value are automatically removed from both the broker and gatherer. This means that the gatherer must be re-run at least every 28 days in order to ensure that objects are not incorrectly removed.
The Time-To-Live value may be altered by including the following in your gatherer's .cf file (but see the section on TTLs for news articles below):
Time-To-Live: <time-to-live in seconds>
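Since the value is given in seconds, a small conversion avoids arithmetic slips. The Python sketch below is only a convenience; the helper name is invented for this example.

```python
SECONDS_PER_DAY = 86400  # 24 hours * 60 minutes * 60 seconds


def ttl_seconds(days):
    """Convert a time-to-live expressed in days into seconds."""
    return days * SECONDS_PER_DAY


if __name__ == "__main__":
    # Print a line suitable for pasting into the gatherer's .cf file.
    print("Time-To-Live: %d" % ttl_seconds(28))
```

For the 28 day period mentioned above this gives 2419200 seconds.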
The enumerator stores its working list in memory. With the default breadth first enumerator this list rapidly becomes large, leading to memory problems. The depth first enumerator has considerably less memory overhead. You can switch to it by including Search=Depth with the rest of the options on your RootNodes lines.
For example:
<RootNodes> http://www.tardis.ed.ac.uk/ Search=Depth </RootNodes>
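To see why the two strategies differ so much in memory use, consider this Python sketch. It does not reproduce Harvest's enumerator; it walks an idealised site in which every page links to the same number of new pages, and records how large the list of pending pages becomes.

```python
from collections import deque


def peak_frontier(branching, depth, breadth_first):
    """Walk a synthetic link tree and return the largest size the work list reaches.

    Each entry in the work list is the depth of a page still to be visited.
    """
    frontier = deque([0])
    peak = 1
    while frontier:
        # Breadth first takes the oldest entry, depth first the newest.
        level = frontier.popleft() if breadth_first else frontier.pop()
        if level < depth:
            frontier.extend([level + 1] * branching)
        peak = max(peak, len(frontier))
    return peak


if __name__ == "__main__":
    print("breadth first:", peak_frontier(3, 4, True))
    print("depth first:", peak_frontier(3, 4, False))
```

With three links per page and four levels of links, the breadth first list peaks at 81 entries while the depth first list peaks at 9. The gap widens quickly as the site gets deeper.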
The TTL value is hardcoded into the news article summarisers. If you want to change this from its default (7 days) then edit the file $HARVEST_HOME/lib/gatherer/NewsArticle.sum and change the line reading
$TTL = 86400 * 7;
The news group summariser has an even shorter TTL of 2 days. This can be changed in the NewsGroup.sum file.
If the first URL of the RootNode specified is on a different server than the root node, then versions of Harvest prior to 1.5 will follow this URL and index the second server, never returning to the first one. Either upgrade to 1.5, or change your rootnode page so that the first link is an internal one.
In short, yes. Because of the way that the broker in all versions of Harvest up to 1.5 parses the responses from glimpse, brokers which are stored in a directory with a name beginning OBJ will not return any results.
[Bit about gdb / dbx and backtraces to go here ]
This is due to an incompatibility between Harvest and glimpse. When matched lines are disabled Harvest uses the faster -N option to glimpse, which tells it to only use its index. However, as is discussed on the glimpse man page, this leads to false matches:
In other words, with -N you will not miss any files but you may get extra files. For example, since the index stores everything in lower case, a case-sensitive query may match a file that has only a case-insensitive match. Boolean queries may match a file that has all the keywords but not in the same line.
Glimpse has a limit of 32 characters on any regular expression matches. Including any of the regular expression meta-characters will cause glimpse to interpret the query as a regular expression and impose its limit.
The most common culprit here is a search with a "." in it, where the user didn't want to perform a regular expression search, just to search for a phrase containing a dot (such as www.tardis.ed.ac.uk). In this case each dot should be escaped with a \, i.e. www\.tardis\.ed\.ac\.uk
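If queries are built by a script before being passed to the broker, the escaping can be automated. The Python sketch below handles only the dot, since that is the one meta-character named here; the function name is hypothetical, and any other meta-characters would need the same treatment.

```python
def escape_dots(query):
    """Backslash-escape every dot so it is matched literally rather than as a wildcard."""
    return query.replace(".", "\\.")


if __name__ == "__main__":
    print(escape_dots("www.tardis.ed.ac.uk"))
```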
This section contains an explanation of some of the commonly found error messages in Harvest log files, and, where appropriate, means of solving them.
This is usually a perl error, implying that the socket.ph file you are using is incorrect. As of Harvest 1.5, a correct socket.ph file should be generated on all architectures. You can test the socket.ph file by using the test-socket-ph.pl utility located in $HARVEST_HOME/lib/.
If this utility does not report "Perl and socket.ph tests okay" then you have a problem. Try copying the socket.ph file from your system perl installation into $HARVEST_HOME/lib/socket.ph
The error message "Invalid HTTP/1.0 response ..." in the gatherer's log file means that you are using a version of Harvest with broken HTTP support to index a web server that is HTTP/1.1 (or higher) compliant. The version number returned by the web server indicates the highest version it is capable of serving. Unfortunately, older Harvest code interprets this as being the version number of the response.
The ideal fix is to upgrade to Harvest 1.5. However, if your web server is Apache you can force it to return messages of the form Harvest wants by including the following in your httpd.conf:
BrowserMatch Harvest/1.4.pl2 force-response-1.0
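For comparison, a client that tolerates newer servers reads the version from the status line instead of insisting on HTTP/1.0. The Python sketch below shows the idea; it is not taken from the Harvest sources, and the function name is invented for this example.

```python
import re

# Status line layout: HTTP/<major>.<minor> <3 digit code> <optional reason phrase>
STATUS_LINE = re.compile(r"^HTTP/(\d+)\.(\d+) (\d{3})(?: (.*))?$")


def parse_status_line(line):
    """Split an HTTP status line into ((major, minor), status code, reason phrase)."""
    match = STATUS_LINE.match(line.strip())
    if match is None:
        raise ValueError("not an HTTP status line: %r" % line)
    major, minor, code, reason = match.groups()
    return (int(major), int(minor)), int(code), reason or ""


if __name__ == "__main__":
    # A server advertising HTTP/1.1 is accepted just like one advertising HTTP/1.0.
    print(parse_status_line("HTTP/1.1 200 OK\r\n"))
    print(parse_status_line("HTTP/1.0 404 Not Found\r\n"))
```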
It means that for one of the RootNodes specified in your configuration file the search engine was unable to successfully fetch and index any pages. This could be because the URL is incorrect or the server containing the URL is down, or refusing to serve the page. This would show in the log.error file.
If you can fetch the page normally then this indicates there is a problem with the gathering process. Versions of Harvest prior to 1.5 were unable to fetch documents correctly from certain http servers (including the NCSA httpd). Try upgrading to 1.5 and see if this solves your problem.
Despite appearing in the broker log file, this is in fact an error with the gatherd. Ensure that gzip is in the PATH of the shell from which gatherd is run, or alter the Gzip directive in the gatherd.cf file to point to the correct location of gzip. This bug should not be present in Harvest 1.5.
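A quick way to find out which gzip, if any, a given PATH would pick up is sketched below in Python. The helper name is made up; shutil.which is part of the Python standard library and performs the same kind of lookup a shell would. Run it from the same environment that starts gatherd, since the PATH there may differ from your login shell.

```python
import shutil


def find_gzip(path=None):
    """Return the full path of the gzip found on path (default: $PATH), or None."""
    return shutil.which("gzip", path=path)


if __name__ == "__main__":
    location = find_gzip()
    if location is None:
        print("gzip is not on the PATH; set the Gzip directive in gatherd.cf")
    else:
        print("gzip found at", location)
```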
First check that your Brokers.cf file (located in $HARVEST_HOME/brokers/) is readable by the user that your CGI is running as, and that there is a correct entry in this file for the broker you are attempting to query.
Versions of Harvest before 1.5 contained a bug in the CGI that would cause this error with broker names containing certain characters (including dots). Either upgrade to Harvest 1.5, or download a fix from http://harvest.transarc.com/harvest/Fixes/BrokerQuery.pl.patch
This means that the script in question is dying before it gets a chance to send the Content-Type line expected by the web server. Look in the error log of your web server for any error output produced when the script is run.
Some servers may not produce meaningful error messages. In these cases try turning on debugging in the script by including &debug=1 at the end of the query URL, or running the script through a wrapper that redirects stderr to stdout.
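Such a wrapper might look like the Python sketch below. It is an illustration rather than something shipped with Harvest: the function name is invented, and the path of the script to run is expected on the command line. The wrapper sends its own Content-Type line first so that the web server always receives a valid header, then shows everything the script wrote, including its error output.

```python
import subprocess
import sys


def run_with_merged_output(command):
    """Run command and return everything it wrote, stderr merged into stdout."""
    result = subprocess.run(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
    )
    return result.stdout


if __name__ == "__main__" and len(sys.argv) > 1:
    # Emit a plain-text header ourselves so the server never sees a missing one.
    sys.stdout.write("Content-Type: text/plain\r\n\r\n")
    sys.stdout.flush()
    sys.stdout.buffer.write(run_with_merged_output(sys.argv[1:]))
```

Remove the wrapper once the problem has been found, since it exposes the script's error output to anyone who queries it.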
This means that the script is failing before getting a chance to output any data. Look at the error log of your web server for any error output produced when the script is run, and see section 4.6 (above) for suggestions on what to do if your web server doesn't log meaningful error output.
This means that for some reason Harvest has been unable to look up the IP address of the hostname of the machine it's running on.
This problem should be unusual, but the following has been reported on systems running Sun OS which use NIS for their name service, and do not have their local hostname in the DNS.
Harvest builds by default with the "resolv" library, which causes it to attempt to resolve hostnames in the DNS. If for some reason your local hostname is not present in the DNS then it will not be found, even if it is present in /etc/hosts or in the NIS hosts map. The solution is to not build with the resolv library, which you can do by removing the -lresolv flag from the list of libraries in the Makefiles.
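The lookup that fails can be reproduced outside Harvest with a few lines of Python. Note the limits of this sketch: Python resolves names through the system's configured name services, so it shows whether the machine as a whole can resolve its own hostname, not what a binary linked with -lresolv will see. The function name is invented for this example.

```python
import socket


def own_host_info():
    """Return (hostname, address), with address None if the lookup fails."""
    hostname = socket.gethostname()
    try:
        return hostname, socket.gethostbyname(hostname)
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        return hostname, None


if __name__ == "__main__":
    name, address = own_host_info()
    if address is None:
        print("cannot resolve own hostname:", name)
    else:
        print(name, "resolves to", address)
```

If this succeeds while Harvest still fails, that points towards the DNS-only lookup described above rather than a missing hosts entry.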
If the broker has been compiled with lex rather than flex then you will find problems when searching for strings such as descripción. To see which lexer you used to compile Harvest, check in src/broker/Makefile.
Direct disk access, rather than use of the HTTP/FTP/NNTP server, can be enabled through the use of Local-Mapping directives. See section 4.7.2 of the user manual for details.
From Harvest 1.5 onwards it is possible to use the * character as a wildcard in these mappings. For instance:
Local-Mapping: http://www.tardis.ed.ac.uk/~*/* /public/*/pages/*
would map the URL http://www.tardis.ed.ac.uk/~sxw/harvest/faq/index.html to /public/sxw/pages/harvest/faq/index.html
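The substitution can be tried out on your own patterns with the Python sketch below. It is a model of the behaviour shown in the example, not Harvest's implementation: it assumes each * matches the shortest possible text and that the wildcards on the two sides correspond in order. The function name is invented.

```python
import re


def apply_local_mapping(url_pattern, path_pattern, url):
    """Map url to a local path, or return None if url does not match url_pattern."""
    # Turn every * in the URL pattern into a capturing group (shortest match).
    regex = "(.*?)".join(re.escape(piece) for piece in url_pattern.split("*"))
    match = re.fullmatch(regex, url)
    if match is None:
        return None
    pieces = path_pattern.split("*")
    if len(pieces) - 1 != len(match.groups()):
        raise ValueError("both patterns must contain the same number of wildcards")
    # Re-insert the captured text between the literal pieces of the path pattern.
    path = pieces[0]
    for captured, literal in zip(match.groups(), pieces[1:]):
        path += captured + literal
    return path


if __name__ == "__main__":
    print(apply_local_mapping(
        "http://www.tardis.ed.ac.uk/~*/*",
        "/public/*/pages/*",
        "http://www.tardis.ed.ac.uk/~sxw/harvest/faq/index.html",
    ))
```

Checking that the printed path exists and is readable by the user running the gatherer is a useful test before starting a long gathering run.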
[Write bit about Host Filters to go here]
[Write bit about URL filters here]
The key here is to have the LANG variable set up correctly. If necessary, set the LANG variable in the startup scripts for the Gatherer/Gatherd and Brokerd. Note that on many machines the character sets for the different European languages are the same, allowing correct searching of mixed German/French/Spanish ... text with the same gatherer/broker.
For example:
LANG=de; export LANG
$HARVEST_HOME/gatherers/my_gatherer/RunGatherer
To gather PDF files you must first install the "acroread" program on your system (Acroread is available from http://www.adobe.com/). Then install the Pdf summariser on your system by carrying out the following actions:
cd harvest/components/gatherer/standard/summarizers
install Pdf.sum /usr/local/harvest/lib/gatherer
Finally, uncomment the Pdf line in /usr/local/harvest/lib/gatherer/byname.cf
Or, you can use the pdftotext program provided as part of the freely available xpdf package. This is reported to be a more reliable solution than using acroread. To do this, download and install pdftotext, then create a file in /usr/local/harvest/lib/gatherer called Pdf.sum which contains the following
#!/bin/sh
pdftotext "$1" - | FullText.sum
Again, you will need to uncomment the Pdf line in the byname.cf file, as above.
(Thanks to Bert Helthuis for part of this answer)
If you include the following in the search submission page within the query form, then the output from the broker will be sorted by the number of lines on which glimpse found matches.
<input type="hidden" name="sort" value="by-NML">
In your gatherer's configuration file include the line
HTTP-Proxy: my.proxy.server
It is not currently possible to use a proxy for any services other than HTTP.
WARNING: It is not recommended to use Harvest with a cache that is also used by users for web browsing. This is because Harvest's pattern of cache accesses is likely to flush the cache of many useful pages.
For each gatherer that you run, run the RunGatherd script in that gatherer's directory. For each broker, run the RunBroker script with the -nocol option to disable automatic collections.
sxw@tardis.ed.ac.uk