Web Archiving: Crawling Tools

2 March 2018


Below I’ve listed tools I’ve come across for archiving live websites. All of these tools act as crawlers and vary in the quality of capture they produce and the amount of user interaction they require.

ArchiveWeb.page

ArchiveWeb.page is the replacement for WebRecorder. It is a Chrome extension (offline web app) and also comes as a standalone Electron app. Like WebRecorder, it records websites with the user guiding the capture. Output from the tool is in the WACZ format. ArchiveWeb.page also has some limited support for automatically interacting with the webpage to improve the capture.
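
Since WACZ is a ZIP-based package, you can inspect a capture with standard tooling. Here’s a minimal Python sketch; the filename is made up and the exact manifest fields depend on the version of ArchiveWeb.page that produced the file.

    import json
    import zipfile

    # Placeholder filename for a capture exported from ArchiveWeb.page
    WACZ_PATH = "my-capture.wacz"

    with zipfile.ZipFile(WACZ_PATH) as wacz:
        # The package bundles WARC records, indexes and a manifest
        for name in wacz.namelist():
            print(name)

        # datapackage.json lists the resources contained in the package
        with wacz.open("datapackage.json") as f:
            manifest = json.load(f)
            for resource in manifest.get("resources", []):
                print(resource.get("path"), resource.get("bytes"))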

This is the tool I’d recommend new users start with, as it’s simple to set up and use.

Brozzler

Brozzler is a distributed web crawler that uses a real browser to fetch pages and embedded URLs and to extract links. It works with warcprox to produce WARCs. It supports the features of most web crawlers, such as a maximum number of hops and seed lists, but has the added bonus of capturing significantly more resources: because it executes JavaScript and performs custom interactions on the page, most of the resources needed for playback and interaction with the page are captured.

The software works by instructing a Chrome browser to browse the requested content through the warcprox proxy. This makes it less suitable for corporate environments where the connection to the internet goes through a proxy, as at the time of writing warcprox is not capable of making requests through an upstream proxy.
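
To get a feel for the capture side, warcprox can be run on its own: any HTTP client pointed at it has its traffic written to WARC files. The sketch below assumes warcprox is already running locally on its default port; certificate verification is disabled because warcprox intercepts HTTPS with its own generated certificates.

    import requests

    # Assumption: warcprox is already running locally on its default port (8000)
    # and writing WARCs to its output directory.
    WARCPROX = "http://localhost:8000"
    proxies = {"http": WARCPROX, "https": WARCPROX}

    # Every request routed through the proxy is recorded into a WARC by warcprox.
    # Brozzler relies on the same mechanism, except it drives a real Chrome
    # browser through the proxy instead of a plain HTTP client.
    resp = requests.get("https://example.com/", proxies=proxies, verify=False)
    print(resp.status_code, len(resp.content))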

WebRecorder (deprecated)

WebRecorder is an integrated platform for creating high-fidelity web archives while browsing, sharing, and disseminating archived content. It works by capturing websites in the background whilst the user browses them. WebRecorder can produce much higher quality crawls than Heritrix, as JavaScript is executed and any resources loaded through interaction are also captured. It gets this higher quality at the cost of requiring a person to manually browse the website.

WebRecorder is deprecated and has been replaced by ArchiveWeb.page, so it is suggested to use that instead.

Heritrix

Heritrix is a website crawler. It seems to be used by most archiving entities, albeit with different code bolted on top. Out of the box, Heritrix has no scheduling capabilities or GUI for submitting jobs, and is configured entirely via XML files. Another issue is that Heritrix doesn’t render webpages when it crawls them, so JavaScript is not executed and many resources loaded by JavaScript will be missed by the crawler. This results in poor quality captures, as almost all sites nowadays load resources via JavaScript.

All of the above issues are worked around by custom software that each archiving entity adds on top of Heritrix. The only open source solution to these issues that the author could find was NetarchiveSuite. However, this does not solve the JavaScript issue; extra modules from the British Library’s version of Heritrix may solve it, or alternatively the Umbra modules can be used to fix it.

Ultimately Heritrix is set up more for large institutions and is not really suited to the home user doing archiving. It may be difficult to get started with, so I’d suggest one of the tools above instead.
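
That said, if you do end up running Heritrix 3, jobs are normally driven over its REST interface once they have been defined in XML. The following is only a rough sketch: the host, port, credentials and job name are placeholders and the details vary between versions.

    import requests
    from requests.auth import HTTPDigestAuth

    # Placeholders: a local Heritrix 3 engine on its default port, with a job
    # named "weekly-crawl" already created from an XML (crawler-beans) config.
    ENGINE = "https://localhost:8443/engine"
    AUTH = HTTPDigestAuth("admin", "admin")

    def job_action(job, action):
        # Heritrix 3 is controlled over a REST API with digest authentication;
        # it ships with a self-signed certificate, hence verify=False.
        resp = requests.post(
            f"{ENGINE}/job/{job}",
            data={"action": action},
            auth=AUTH,
            headers={"Accept": "application/xml"},
            verify=False,
        )
        resp.raise_for_status()
        return resp.text

    # Build the job from its XML configuration, then launch the crawl
    job_action("weekly-crawl", "build")
    job_action("weekly-crawl", "launch")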

HTTrack

HTTrack crawls a website and creates an offline copy. Using it is not recommended, as it is old software and doesn’t use the industry-standard WARC format.

Crawling Suites

NetarchiveSuite

The primary function of the NetarchiveSuite is to plan, schedule and run web harvests of parts of the Internet. It scales to a wide range of tasks, from small, thematic harvests (e.g. related to special events, or special domains) to harvesting and archiving the content of an entire national domain.

Web Curator Tool (WCT)

The Web Curator Tool (WCT) is an open-source workflow management application for selective web archiving. It is designed for use in libraries and other collecting organisations, and supports collection by non-technical users while still allowing complete control of the web harvesting process. It is integrated with the Heritrix web crawler and supports key processes such as permissions, job scheduling, harvesting, quality review, and the collection of descriptive metadata.