Below I’ve listed the tools I’ve come across for archiving live websites. All of these tools act as crawlers, and they vary in the quality of capture they produce and the amount of user interaction they require.
ArchiveWeb.page is the replacement for WebRecorder. It is a Chrome extension (an offline web app) and is also available as a standalone Electron app. Like WebRecorder, it records websites with the user guiding the capture, and its output is in the WACZ format. ArchiveWeb.page also has some limited support for automatically interacting with the webpage to improve the capture.
This is the tool I’d recommend new users get started with, as it’s simple to set up and use.
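One nice property of the WACZ output is that it’s an ordinary ZIP container, so you can peek inside a capture without any special tooling. The sketch below is a minimal example in Python, assuming a capture file named my-capture.wacz (a hypothetical filename) that follows the standard WACZ layout of WARC data under archive/ and a page index at pages/pages.jsonl.

```python
# Minimal sketch: inspecting a WACZ capture from ArchiveWeb.page.
# A WACZ is a ZIP container; the paths used below (archive/,
# pages/pages.jsonl) follow the WACZ spec, but the filename
# "my-capture.wacz" is just a placeholder.
import json
import zipfile

with zipfile.ZipFile("my-capture.wacz") as wacz:
    # The WARC data files holding the raw recorded traffic
    for name in wacz.namelist():
        if name.startswith("archive/"):
            print("WARC data:", name)

    # The page index lists each page the user captured
    with wacz.open("pages/pages.jsonl") as pages:
        for line in pages:
            if not line.strip():
                continue
            entry = json.loads(line)
            if "url" in entry:  # skip the format header line
                print(entry["url"], "-", entry.get("title", ""))
```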
The software works by instructing a Chrome browser to browse the requested content through the warcprox proxy. This makes it less suitable for corporate environments where the connection to the internet goes through a proxy, because at the time of writing warcprox is not capable of chaining its requests through an upstream proxy.
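To make the proxy arrangement concrete, here is a minimal sketch of what “browsing through warcprox” amounts to: any HTTP client pointed at the proxy has its traffic recorded to WARC files. The localhost:8000 address is an assumption (warcprox’s usual default port); the snippet simply demonstrates routing requests through a local recording proxy, which is the same thing the driven Chrome browser is doing.

```python
# Minimal sketch: routing a request through a local recording proxy
# such as warcprox. Assumes warcprox is already running on
# localhost:8000 (an assumption; adjust to your setup).
import requests

proxies = {
    "http": "http://localhost:8000",
    "https": "http://localhost:8000",
}

# warcprox mints its own TLS certificates so it can record HTTPS
# traffic, so for this sketch we disable verification (in practice
# you would point verify= at warcprox's CA certificate instead).
response = requests.get("https://example.com/", proxies=proxies, verify=False)
print(response.status_code)
```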
WebRecorder is deprecated and has been replaced by ArchiveWeb.page, so it is suggested to use that instead.
Ultimately, Heritrix is set up more for large institutions and is not really suited to the home user doing archiving. It can be difficult to get started with, so I’d suggest one of the tools above instead.
HTTrack crawls a website and creates an offline version. I’d recommend against using it, as it is old software and doesn’t use the industry-standard WARC format.
The primary function of the NetarchiveSuite is to plan, schedule and run web harvests of parts of the Internet. It scales to a wide range of tasks, from small, thematic harvests (e.g. related to special events, or special domains) to harvesting and archiving the content of an entire national domain.
Web Curator Tool (WCT)
The Web Curator Tool (WCT) is an open-source workflow management application for selective web archiving. It is designed for use in libraries and other collecting organisations, and supports collection by non-technical users while still allowing complete control of the web harvesting process. It is integrated with the Heritrix web crawler and supports key processes such as permissions, job scheduling, harvesting, quality review, and the collection of descriptive metadata.