belweder
Website archival tool designed as an alternative to httrack and wget's mirroring capabilities.
The project is in a somewhat mature state by now, though be aware that it is not a polished experience. Website-specific fixes might be required; feel free to report problems on the issue tracker.
Features
- Traverses websites to find pages and resources to save. Searches HTML documents for common elements and CSS stylesheets for referenced resources.
- Allows entirely non-recursive downloads if only specific URLs are needed, with or without resources.
- Saves original copies of pages as well as their metadata (capture time, headers, original URL etc.) in a custom repository.
- Can export saved websites as a local HTTP server with rewritten URLs for local viewing.
- Can export to directory structure mimicking HTTP (like wget and httrack).
- Allows handling cookies and loading them from a provided `cookies.txt` file, to save authenticated versions of websites etc.
- Allows configuring a regex-based URL ruleset for ignoring certain URLs to avoid saving unnecessary pages.
- Uses already saved pages for traversal, so it's possible to use improvements to the configuration or the program without starting over.
- Saves checkpoint information so cancelled downloads don't need to waste time traversing the website again.
- Allows whitelisting external domains.
- Can save resources from external domains.
- Can make multiple concurrent requests to speed up downloading in some cases.
- Allows setting the `User-Agent` header.
- Hackable to suit your needs.
Current limitations
- Treats all text as UTF-8 encoded, which might not work for some websites. The download should be handled correctly regardless; however, the HTTP export is likely to give weird results.
- Probably doesn't handle other edge cases. Some edge cases will crash on purpose due to not being handled properly yet.
- Doesn't support JavaScript due to its complexity (directly referenced scripts will still be archived, and simple scripts might work when exported).
- The HTTP server export is rudimentary.
- Cannot create captures at different points in time without creating a separate repository. (This is somewhat possible with the `force_recrawl` option, but not recommended.)
- Cannot rewrite URLs in directory structure exports.
- Most aspects of the functionality cannot be configured.
Setup
Download a build for your platform from Releases. Place the contents of the `bin` subdirectory (at least the `belweder` executable is required) on your PATH.
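For example, on Linux the extracted `bin` directory can be added to the PATH for the current shell session; the extraction path below is just a placeholder:
```sh
# Assuming the release archive was extracted to ~/belweder - adjust the path as needed.
export PATH="$HOME/belweder/bin:$PATH"
command -v belweder   # verify that the executable is now found
```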
Basic usage
Downloading
Create an empty directory that will contain the belweder repository to use. It's currently recommended to use a separate repository for each download project, because there's no way to split downloads.
Change into the new directory, then run `belweder repo init` to initialize the repository.
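Put together, the initialization steps might look like this (the directory name is only an example):
```sh
mkdir my-archive   # empty directory that will contain the belweder repository
cd my-archive
belweder repo init
```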
Information about the download project is read from `config.toml`. A sample `config.toml` file is created by `belweder repo init`:
urls = [ { url = "https://example.com", load_context = "browsing" } ]
allowed_hosts = []
url_rules = []
recurse = true
download_resources = true
Change `urls[0].url` to the entry URL of the website to download - you can add multiple entry URLs in the same way.
`recurse` controls whether to try downloading the entire website; disable this if you want to download just a single page. `download_resources` controls whether to download external resource URLs referenced by pages; disable this if you only want to download resources from the same domain. Disabling both allows downloading only the specific URLs listed in `urls` without any further crawling.
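As an illustration, a `config.toml` for saving just a couple of specific pages, with no crawling and no external resources, might look like the sketch below; it only uses the fields from the generated sample, and the URLs are placeholders:
```toml
urls = [
    { url = "https://example.com/page-one", load_context = "browsing" },
    { url = "https://example.com/page-two", load_context = "browsing" },
]
allowed_hosts = []
url_rules = []
recurse = false              # don't crawl beyond the listed URLs
download_resources = false   # don't fetch externally referenced resources either
```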
For the program to use cookies, set `cookies` to `true`. If a `cookies.txt` file is present in the project directory, it will be loaded. The program will also write new cookies to the file, creating it if needed.
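Enabling cookies is then a single extra line in `config.toml`. The exact `cookies.txt` format isn't described here; the conventional Netscape cookie-jar format (as exported by browser extensions and tools like curl) is only an assumption:
```toml
# Load (and write back) cookies from cookies.txt in the project directory.
cookies = true
```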
You can also add rules to `url_rules`, and additional allowed domains to `allowed_hosts`.
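For instance, allowing an additional assets host alongside the main domain might look like this, assuming `allowed_hosts` takes plain host strings (the value below is a placeholder); for `url_rules` syntax, see the real-world example linked below:
```toml
# Extra domains the crawler is allowed to save pages and resources from (placeholder value).
allowed_hosts = ["cdn.example.com"]
```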
For a real-world `config.toml` example, see one I used for downloading a certain forum: docs/examples/ipboard3-config.toml.
Set the `SSL_CERT_DIR` or `SSL_CERT_FILE` environment variable before running `belweder download`, so it can find TLS root certificates. This is a required step, because we do not currently ship root certificates ourselves. On Linux, you usually have certificates in the appropriate format in `/etc/ssl/certs`, so setting `SSL_CERT_DIR` to `/etc/ssl/certs` should work. If you have no certificates in the appropriate format installed, which is normally the case on Windows, download the curl CA bundle file and set `SSL_CERT_FILE` to the path of the file.
The exact commands might look like this:
Linux
SSL_CERT_DIR=/etc/ssl/certs belweder download
Windows (PowerShell)
$env:ssl_cert_file="C:\Users\user\cacert.pem"
belweder download
Windows (cmd)
set ssl_cert_file=C:\Users\user\cacert.pem
belweder download
Remember to do this every time you run `belweder download`. To confirm that your setup is correct, try downloading https://example.com and check whether there are any warnings like `Certificate chain verification error: IssuerCertNotFound`.
Run `belweder download` from the repository directory to start the download. It will print status information on the terminal while it runs.
After every 1000 downloaded items, a checkpoint is created in the form of `processed` and `queue` files in the repository directory. To traverse the website from the start, for example after a belweder update or after changing the URL ruleset, remove those files.
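For example, to force a full re-traversal on Linux, from the repository directory:
```sh
# Remove the checkpoint files so the next download traverses the website from scratch.
rm processed queue
SSL_CERT_DIR=/etc/ssl/certs belweder download
```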
Exporting
HTTP (recommended)
belweder can host an HTTP server that serves captures for local viewing in a browser. Create a `serve.toml` file in the repository directory:
[[instances]]
[[instances.listeners]]
port = 8000
Run `belweder export http` from the repository directory. It will start an HTTP server at 127.0.0.1:8000. Browse to http://127.0.0.1:8000/archive/https:%2F%2Fexample.com to view the page.
Currently, the URL has to be percent-encoded manually; there will be an easier way to input URLs in the future.
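Going by the example above, only the slashes of the original URL appear to be encoded (an assumption drawn from that example, not a documented rule), so a quick way to build the archive path from a URL is something like:
```sh
# Replace every "/" with "%2F", mirroring the https:%2F%2Fexample.com example above.
echo 'https://example.com/some/page' | sed 's|/|%2F|g'
# -> https:%2F%2Fexample.com%2Fsome%2Fpage
# Then browse to http://127.0.0.1:8000/archive/<that string>.
```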
This configuration will rewrite URLs on the fly to point to the archive. To avoid that, emulation mode can be used, which works just like the original server:
[[instances]]
[[instances.listeners]]
port = 8000
[instances.emulation]
host_source = "static"
host = "example.com"
schema = "any"
This can also be used to properly browse websites with encodings other than UTF-8.
This configuration is limited to serving captures from a single host for what are probably obvious reasons. If that's not desirable, there's also the option to use the `Host` header to serve captures from various hosts:
[[instances]]
[[instances.listeners]]
port = 8000
[instances.emulation]
host_source = "header"
schema = "any"
This requires additional setup beyond the scope of this explanation, such as editing `/etc/hosts` so that multiple domain names point to the host running belweder-serve, as well as proxying port 80 and/or 443 to its port.
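For example, on the machine used for browsing, `/etc/hosts` entries pointing the archived domains at the local server might look like this (the domain names are placeholders, and proxying ports 80/443 is a separate step):
```
# /etc/hosts - make archived domains resolve to the machine running belweder
127.0.0.1   example.com
127.0.0.1   forum.example.com
```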
Directory structure
Alternatively, captures can be exported to a directory structure based on `config.toml`. Run `belweder export directory` from the repository directory, which will create an `export` subdirectory containing exported captures. Currently, no URLs are rewritten, and the export will contain unmodified files - a way to rewrite URLs for local viewing without an HTTP server is planned.
Name
belweder is a pun on Belweder, a famous palace in Warsaw, Poland, that used to be the residence of the Polish president. It stands for likely better website downloader, as it was created to be a (hopefully) better alternative to httrack and wget.
Foundation libraries
- html5ever for browser-grade HTML parsing.
- kuchikiki for a simple html5ever interface.
- rust-cssparser for browser-grade CSS parsing.
- Lightning CSS for a simple rust-cssparser interface.
- Servo's mime_classifier (specifically Charles Samborski's mime_classifier library, which extracts the functionality for use in standalone projects) for browser-grade MIME sniffing.
- SQLite for capture information storage.
- rusqlite for SQLite bindings.
- axum for the export HTTP server.
Copyright
Copyright © 2024-2025 Grzesiek11
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.