This is the fourth installment in a series of evaluations of website harvesting software on the Practical E-records blog. The first three installments reviewed open source software that you can download and install locally: HTTrack, GNU Wget, and Heritrix. This fourth installment reviews the Web Archiving Service (WAS) developed by the California Digital Library (CDL), a fee-based service for capturing and storing websites.
The Web Archiving Service provides tools and support for harvesting and preserving websites, as well as tools for analyzing the captured content. For example, you can compare two captures to see how many web pages (and which pages) were changed or deleted. Unlike the other web archiving software previously evaluated here, WAS is a fee-based service: University of California organizations pay only for storage, while institutions outside the University of California system pay a yearly fee, with discounted rates for consortia of three or more institutions. Thankfully, WAS also provides a free trial subscription to institutions wishing to try out the software.
WAS is an excellent choice for institutions that can afford a subscription. It is very user friendly and produces a capture with high fidelity to the original site. My largest concern is that the data is stored with WAS rather than on an internal server. This could raise long-term preservation concerns if the subscriber is not able to commit to long-term support for the service or if CDL is not able to maintain the service for whatever reason.
Chris Prom contacted WAS to get our trial subscription (information on how to request a trial subscription is on the right-hand side of the page). We were then given a username and password to access the service through a web browser. I read over the WAS User Guide, skimmed some of the other documentation, and watched the user videos before beginning, but none of this is strictly necessary. Familiarizing yourself with the general system is important, but overall the service is straightforward enough that you could follow the manual as you go through your first capture if you need to.
Once you’ve signed into the WAS site with your username and password, there are two main things you need to do to get the capture started. You first need to create the site information and then capture the site.
In the Create Site section you can adjust capture settings, set a schedule, and add descriptive data. Unlike the other web capturing software profiled previously, WAS has narrowed down the number of parameters that need to be adjusted for a capture. This simplifies the process for archivists who don’t have a lot of experience with more technical computer applications. With WAS, the only settings to adjust are the scope, whether to capture linked pages, the maximum amount of time spent on the capture (1 hour or 36 hours), how frequently to capture the site (daily, weekly, monthly, or custom), whether the capture will be made public, and descriptive data about the site. This is one of the many well-designed aspects of WAS: there are only a few easy-to-understand parameters, you can schedule the service to capture the site automatically on a regular basis, and you can add descriptive metadata to your capture.
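To make the short list of settings concrete, here is a minimal sketch of how they could be written down as a record. WAS is configured through a web form, so this is not a WAS API; the class name, field names, the `seed_url` field, and the example values are all my own assumptions for illustration. Only the two time limits and the four schedule choices come from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class Frequency(Enum):
    """Capture schedules named in the review."""
    DAILY = "daily"
    WEEKLY = "weekly"
    MONTHLY = "monthly"
    CUSTOM = "custom"


# The review mentions two time limits for a capture: 1 hour or 36 hours.
ALLOWED_TIME_LIMITS_HOURS = (1, 36)


@dataclass
class SiteSettings:
    """Hypothetical record of the choices made in the Create Site form."""
    name: str                   # the name you give the site
    seed_url: str               # assumed field: the address the capture starts from
    scope: str                  # free text; the actual scope options are not listed here
    capture_linked_pages: bool
    max_hours: int
    frequency: Frequency
    public: bool
    description: str = ""       # descriptive metadata about the site

    def __post_init__(self) -> None:
        # Reject time limits other than the two the review describes.
        if self.max_hours not in ALLOWED_TIME_LIMITS_HOURS:
            raise ValueError(
                f"max_hours must be one of {ALLOWED_TIME_LIMITS_HOURS}, "
                f"got {self.max_hours}"
            )


settings = SiteSettings(
    name="Example site",
    seed_url="https://example.org/",
    scope="host",  # placeholder value, not a documented WAS option
    capture_linked_pages=False,
    max_hours=36,
    frequency=Frequency.MONTHLY,
    public=False,
    description="Placeholder description for an example capture.",
)
```

Writing the settings down this way also shows how little there is to decide: eight fields, two of which are free text.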
After you have set up your site information, click the Capture Sites option, which takes you to a “Manage Sites” page. This page lists both your site and other sites on the WAS system. To start your capture, click the Capture icon under the name you gave your site, and WAS will begin capturing the site according to your specifications. An email is sent to you once the capture has finished. WAS then provides a variety of options for looking at your captured site. You can review the entire captured site and navigate around it as if it were live, or search the site by keyword and narrow the results by file type (such as PDF, images, audio, or video). Once you have captured your site more than once, additional options let you compare the captures to see which items have changed or been removed from the site.
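The idea behind comparing two captures can be sketched in a few lines. This is only an illustration of the general technique, not how WAS implements its comparison feature: I am assuming each capture can be reduced to a mapping from page URL to a fingerprint of the page content, and the function names and example URLs are hypothetical.

```python
import hashlib


def digest(content: bytes) -> str:
    """Return a SHA-256 fingerprint of a page's content."""
    return hashlib.sha256(content).hexdigest()


def compare_captures(earlier: dict, later: dict) -> dict:
    """Compare two captures given as {url: content fingerprint} mappings.

    Returns the URLs whose content differs between the captures and the
    URLs that were present in the earlier capture but not the later one.
    """
    changed = sorted(
        url for url in earlier
        if url in later and earlier[url] != later[url]
    )
    removed = sorted(url for url in earlier if url not in later)
    return {"changed": changed, "removed": removed}


# Two made-up captures of the same small site.
earlier = {
    "https://example.org/": digest(b"home page, first version"),
    "https://example.org/about": digest(b"about page"),
    "https://example.org/old-news": digest(b"news page"),
}
later = {
    "https://example.org/": digest(b"home page, second version"),
    "https://example.org/about": digest(b"about page"),
}

report = compare_captures(earlier, later)
print(len(report["changed"]), "changed;", len(report["removed"]), "removed")
```

In this made-up example the home page is reported as changed, the old news page as removed, and the unchanged about page does not appear in either list.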
In my case, I chose the 36 hour capture option (recommended for first-time captures). Once the capture was done, I signed back in to navigate the harvested site. As far as I am able to tell, the harvested website has the highest fidelity to the original site of any of the web capturing software I evaluated. I had no problems running the program or setting up the capture, and the documentation was clear and easy to understand. Considering the ease of use, the quality of the result, and the options for navigating and comparing the finished captures, I recommend this program as the best of the four I have tried. My only concern about WAS is whether there are long-term preservation issues, because the captured site resides with WAS rather than on our own internal server.
- Installation/Configuration/Supported Platforms: Because WAS is run through a web browser there is no installation necessary. All you need is access to the internet in order to use the service. Configuration of each capture is minimal. All you need is a subscription name and password in order to get started. 20/20
- Functionality/Reliability: There were no problems in running the capture and on an initial look through the captured site it has a very high fidelity to the original live site. 20/20
- Usability: Incredibly user friendly. There are very few parameters to adjust, so it is not overwhelming to a novice. 10/10
- Scalability: I’m not sure how this would work since I was only capturing one site. However, my guess is that if you were trying to capture a large number of websites it might not scale well since the max capturing time is 36 hours. It seems very well designed to capture specific hosts rather than single broad captures of numerous host sites. 5/10
- Documentation: There are a number of helpful video tutorials as well as pdf manuals available. The manual and videos were easy to understand and follow. 10/10
- Interoperability/Metadata support: There is a description section available to add metadata associated with your captured site. 10/10
- Flexibility/Customizability: Unfortunately, the thing that makes WAS user friendly is also what makes it less flexible in terms of adjustable parameters. Only a few parameters can be adjusted when setting up the crawl, so this program may be frustrating to those with a lot of programming experience who want to make small adjustments. To balance this, there is a lot of flexibility in the output: the captured site can be searched by keyword, filtered by file type, compared to previous crawls, or navigated as if it were live. 7/10
- License/Support/Sustainability/Community: WAS is part of the University of California Library system and seems to be very actively used. There are regular “Introduction to WAS” web conferences held for those interested in learning more about how to use the program. They provide contact information for service support in case there are issues. There is also a mailing list, Facebook page, and RSS feed for WAS. The downside is that it is not an open source program and users outside of the University of California system must pay a yearly fee for use. While it is likely to be sustainable over the long run since it is part of a large public university system, I do have concerns over the long term preservation of individual sites since they are not available for download (as far as I could tell) from the WAS site to our own internal servers. 8/10
Final Score: 90/100
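For anyone who wants to check the arithmetic, the final score is simply the sum of the category scores in the rubric above. The short tally below copies the points from each category; the variable names are my own.

```python
# Category scores as (points earned, points possible), copied from the rubric.
scores = {
    "Installation/Configuration/Supported Platforms": (20, 20),
    "Functionality/Reliability": (20, 20),
    "Usability": (10, 10),
    "Scalability": (5, 10),
    "Documentation": (10, 10),
    "Interoperability/Metadata support": (10, 10),
    "Flexibility/Customizability": (7, 10),
    "License/Support/Sustainability/Community": (8, 10),
}

earned = sum(points for points, _ in scores.values())
possible = sum(maximum for _, maximum in scores.values())

print(f"Final Score: {earned}/{possible}")  # prints "Final Score: 90/100"
```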
Bottom Line: If you can afford to subscribe to this service and are okay with not hosting your captured sites in house, this is the program to use. It is incredibly user friendly, produces high-quality output, and has features that allow you to compare your captures.