Friday, February 4, 2011

Heritrix Evaluation/Review


This is the third installment in a series of evaluations of website harvesting software on the Practical E-records blog.  The first two installments were reviews of the HTTrack open source software and the GNU Wget free utility.  This installment reviews Heritrix, the Internet Archive’s open source web archiving software.



Heritrix is an open source, extensible web crawler designed by the Internet Archive for website capture.  It can be downloaded by individual archives and used for in-house web archiving.  Heritrix is written in Java so that it can run on any platform, but only Linux is supported.  The manual and documentation assume that you have basic Linux knowledge.  When I started this project I had no knowledge of Linux, and while I learned enough to install and run the program, my lack of experience and the lack of easily accessible output files made my foray into Heritrix ultimately unfruitful.


Systems developers and others with more experience with this kind of technology will likely find Heritrix a great tool, but compared to the other two web harvesting tools I evaluated, GNU Wget and HTTrack, it is much less user friendly for those with limited technical knowledge.


Before downloading and installing the program I read over the first couple of sections of the User Manual and skimmed a couple of tutorials on working in Linux.  The later sections of the manual make more sense once you have the user interface set up.  With Heritrix you first have to install the program from a Linux command line, but you then launch a more user friendly web user interface to actually set the parameters of the crawl (unlike GNU Wget, where all the parameters are set at the command line).


To download the program I went to the download page and downloaded the Linux/Mac version (heritrix-1.14.4.tar.gz).  Just as with GNU Wget, you first need to open a command line interface; from there you can install Heritrix on your computer by following the instructions included in the manual.  There are also instructions for launching the web user interface in a browser.
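
For readers who prefer to script the setup, the unpacking step might look roughly like the sketch below.  This is only an illustration, assuming the tarball named above has already been downloaded to the working directory; the manual's actual instructions use the Linux command line (tar) rather than Python.

# Hedged sketch: unpack the Heritrix distribution named in this post.
# Assumes heritrix-1.14.4.tar.gz is already in the current directory;
# the manual's own instructions do this with tar at the shell prompt.
import tarfile

with tarfile.open("heritrix-1.14.4.tar.gz", "r:gz") as tar:
    tar.extractall()  # typically unpacks into a heritrix-1.14.4/ directory

print("Unpacked; see the manual for launching the web user interface.")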


Once you have signed in with the username and password that you chose during installation, you can adjust the parameters necessary to run a crawl in Heritrix.  In Heritrix there are profiles and there are jobs.  A profile is a template for a crawl job, and the parameters you adjust under a profile can be used for multiple jobs.  To actually run a crawl, though, you also need to set up a job (which can be based on the profile).  I first set up a new profile called “Default I” and configured it using the Modules tab and the Settings tab.  Under the Modules tab I adjusted Crawlscope, URI Frontier, PreProcessors, Fetchers, Extractors, Writers, PostProcessors, and StatisticsTracking.  Under the Settings tab I adjusted Crawl Organizer, Max-Toe-Threads, User Agent, Max Retries, and Total Bandwidth Usage.  These terms are all described in detail in the User Manual, and I found the best use of my time was to read the description of each parameter as I was setting up the job (rather than beforehand) and judge whether it needed to be adjusted.  Having the system up in front of you makes the jargon in the manual easier to follow.
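
To give a sense of what those parameters cover, here is a rough illustration of the handful of settings I touched, written as a Python dictionary purely for readability.  The keys loosely mirror the setting names mentioned above, not Heritrix's own configuration syntax, and the values are hypothetical examples rather than the ones I actually used.

# Illustration only: the keys loosely mirror the Settings-tab names above,
# not Heritrix's actual configuration format, and the values are
# hypothetical examples.
crawl_settings = {
    "user-agent": "ExampleArchiveBot/1.0 (+http://www.example.org/contact)",  # identifies your crawler to web servers
    "max-toe-threads": 25,         # number of worker threads fetching URIs
    "max-retries": 3,              # how many times a failed fetch is retried
    "total-bandwidth-usage": 200,  # rough cap on crawl bandwidth, in KB/sec
}

for name, value in crawl_settings.items():
    print(f"{name}: {value}")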


Once I set up and ran the crawl, it took roughly 8.5 hours to complete, running overnight on the computer.  Unfortunately, the output files proved too impractical for our use here.  The output is written as .arc files, and you need to be able to extract the information from these files in order to read and navigate the captured pages.  This was something I did not have the technical expertise to carry out, and we could not find software to render the .arc files.  The file format itself is described here, and the documentation notes that “the best way to retrieve a specific object from an archive file is to maintain an external database of object names, the files they are located in, their offsets within the files, and the sizes of the objects. Then, to retrieve the object, one need only open the file, seek to the offset, and do a single read of <size> bytes.”  However, it was unclear how that database is generated, and even if we did have one, we would need a method to access the files using the offsets.  This is a job, in other words, for a systems developer, and we therefore could not look at the pages to determine whether Heritrix captured the pages of our website with any greater or lesser fidelity than HTTrack or GNU Wget.
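
To make the documentation's suggestion a bit more concrete, here is a rough sketch in Python of what that external database and offset-based retrieval might look like.  It assumes an uncompressed .arc file in which each record begins with a space-separated header line whose last field is the record length (Heritrix often writes compressed .arc.gz files, which would need to be decompressed first); the example file name is made up, and none of this is code that ships with Heritrix.

# Hedged sketch of the approach the ARC documentation describes: build an
# external index of (file, offset, size) for each captured object, then
# retrieve an object by seeking to its offset and reading <size> bytes.
# Assumes an uncompressed .arc file whose record headers are single
# space-separated lines ending in the record length.

def build_arc_index(arc_path):
    """Map each record's URL to (file, offset, size), header line included."""
    index = {}
    with open(arc_path, "rb") as f:
        while True:
            offset = f.tell()
            header = f.readline()
            if not header:
                break                      # end of file
            line = header.strip()
            if not line:
                continue                   # blank separator between records
            fields = line.split(b" ")
            url = fields[0].decode("ascii", "replace")
            length = int(fields[-1])       # last header field is the length
            f.seek(length, 1)              # skip over the record body
            index[url] = (arc_path, offset, len(header) + length)
    return index


def read_arc_record(arc_path, offset, size):
    """Return one record's raw bytes: header line plus the archived response."""
    with open(arc_path, "rb") as f:
        f.seek(offset)
        return f.read(size)


# Hypothetical usage -- the .arc file name here is invented for illustration.
index = build_arc_index("IAH-20110204-00000.arc")
for url, (path, offset, size) in list(index.items())[:5]:
    record = read_arc_record(path, offset, size)
    print(url, len(record), "bytes")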


Evaluation Criteria:



  • Installation/Configuration/Supported Platforms: While available for download for Mac, Windows, and Linux, Heritrix is only supported on the Linux platform.  Downloading the program was easy, but installation requires some basic knowledge of using a command line interface.  Instructions on how to install the program are written out line by line in the manual, which facilitates installation (especially for a nonprogrammer).  15/20



  • Functionality/Reliability:  Once the program was set up to run, it ran without crashing or freezing and completed in a timely manner.  I am unable to judge how reliably it captured the site, however, since I have not been able to view the capture.  10/20



  • Usability:  Not very user friendly.  You need to be familiar with a command line to install the program, and even when working in the browser-based user interface the terminology can be confusing, even with the manual in hand.  It is not very straightforward.  4/10



  • Scalability:  Though I only used it to capture one domain, it was specifically designed to handle large-scale crawls.  10/10



  • Documentation:  There is a user manual, FAQ, and wiki.  The user manual is absolutely necessary for setting up the program, and I suggest spending some time with it.  Although the terminology can be confusing, the manual is still quite good; I was able to set up my first crawl simply by following along in it.  8/10



  • Interoperability/Metadata Support:  As with HTTrack and GNU Wget, I could find no support for attaching metadata to a finished crawl.  The program is designed to capture the site but does not provide anywhere to record metadata about the capture.  While it would be nice to have this built in, it doesn’t seem necessary to the function of the program (even though it is necessary to its use in archives); most likely you would add ‘collection level’ metadata for the harvested site in an archival descriptive system.  However, it is unclear how one can even find the original files, since the documentation says an external database should provide pointers to the locations of objects within the .arc files, but there are no instructions for how to capture that information into such a database.  0/10



  • Flexibility/Customizability:  Heritrix is highly configurable; there are many adjustable parameters when setting up a crawl.  For a novice, this can be quite intimidating and confusing, but for someone with more experience it is a very useful aspect of the program.  10/10



  • License/Support/Sustainability/Community:  Heritrix is open source software that appears to be widely used.  In addition, it was created by the Internet Archive, which has broad support and is committed to issues related to digital archiving.  10/10


Final Score: 67/100

"
