belweder
Website archival tool designed as an alternative to httrack and wget's mirroring capabilities.
The project is in a somewhat mature state by now, though be aware that it is not a polished experience. Website-specific fixes might be required; feel free to report problems on the issue tracker.
Features
- Traverses websites to find pages and resources to save. Searches HTML documents for common elements and CSS stylesheets for referenced resources.
- Allows entirely non-recursive downloads if only specific URLs are needed, with or without resources.
- Saves original copies of pages as well as their metadata (capture time, headers, original URL etc.) in a custom repository.
- Can export saved websites as a local HTTP server with rewritten URLs for local viewing.
- Can export to directory structure mimicking HTTP (like wget and httrack).
- Allows handling cookies and loading them from a provided `cookies.txt` file, to save authenticated versions of websites etc.
- Allows configuring a regex-based URL ruleset for ignoring certain URLs to avoid saving unnecessary pages.
- Uses already saved pages for traversal, so it's possible to use improvements to the configuration or the program without starting over.
- Saves checkpoint information so cancelled downloads don't need to waste time traversing the website again.
- Allows whitelisting external domains.
- Can save resources from external domains.
- Can make multiple concurrent requests to speed up downloading in some cases.
- Allows setting the `User-Agent` header.
- Hackable to suit your needs.
Current limitations
- Treats all text as UTF-8 encoded, which might not work for some websites. The download should be handled correctly regardless; however, the HTTP export is likely to give weird results.
- Probably doesn't handle other edge cases. Some edge cases will crash on purpose due to not being handled properly yet.
- Doesn't support JavaScript due to its complexity (directly referenced scripts will still be archived, and simple scripts might work when exported).
- The HTTP server export is rudimentary.
- Cannot create captures at different points in time without creating a separate repository. (This is somewhat possible with the `force_recrawl` option, but not recommended.)
- Cannot rewrite URLs in directory structure exports.
- Most aspects of the functionality cannot be configured.
Setup
Download a build for your platform from Releases. Place the contents of the `bin` subdirectory (at least the `belweder` executable is required) on your PATH.
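For example, on Linux the extracted `bin` directory can be added to the PATH for the current shell session; the extraction path below is just a placeholder:
```sh
# Assuming the release archive was extracted to ~/belweder - adjust the path as needed.
export PATH="$HOME/belweder/bin:$PATH"
command -v belweder   # verify that the executable is now found
```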
Basic usage
Downloading
Create an empty directory that will contain the belweder repository to use. It's currently recommended to use a separate repository for each download project, because there's no way to split downloads.
Change into the new directory, then run `belweder repo init` to initialize the repository.
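Put together, the initialization steps might look like this (the directory name is only an example):
```sh
mkdir my-archive   # empty directory that will contain the belweder repository
cd my-archive
belweder repo init
```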
Information about the download project is read from `config.toml`. A sample `config.toml` file is created by `belweder repo init`:
urls = [ { url = "https://example.com", load_context = "browsing" } ]
allowed_hosts = []
url_rules = []
recurse = true
download_resources = true
Change `urls[0].url` to the entry URL of the website to download - you can add multiple entry URLs in the same way.
`recurse` controls whether to try downloading the entire website; disable this if you want to download just a single page. `download_resources` controls whether to download external resource URLs referenced by pages; disable this if you only want to download resources from the same domain. Disabling both allows downloading only the specific URLs listed in `urls` without any further crawling.
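As an illustration, a `config.toml` for saving just a couple of specific pages, with no crawling and no external resources, might look like the sketch below; it only uses the fields from the generated sample, and the URLs are placeholders:
```toml
urls = [
    { url = "https://example.com/page-one", load_context = "browsing" },
    { url = "https://example.com/page-two", load_context = "browsing" },
]
allowed_hosts = []
url_rules = []
recurse = false              # don't crawl beyond the listed URLs
download_resources = false   # don't fetch externally referenced resources either
```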
For the program to use cookies, set `cookies` to `true`. If a `cookies.txt` file is present in the project directory, it will be loaded. The program will also write new cookies to the file, creating it if needed.
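Enabling cookies is then a single extra line in `config.toml`. The exact `cookies.txt` format isn't described here; the conventional Netscape cookie-jar format (as exported by browser extensions and tools like curl) is only an assumption:
```toml
# Load (and write back) cookies from cookies.txt in the project directory.
cookies = true
```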
You can also add rules to `url_rules`, and additional allowed domains to `allowed_hosts`.
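For instance, allowing an additional assets host alongside the main domain might look like this, assuming `allowed_hosts` takes plain host strings (the value below is a placeholder); for `url_rules` syntax, see the real-world example linked below:
```toml
# Extra domains the crawler is allowed to save pages and resources from (placeholder value).
allowed_hosts = ["cdn.example.com"]
```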
For a real-world `config.toml` example, see one I used for downloading a certain forum: docs/examples/ipboard3-config.toml.
Set the `SSL_CERT_DIR` or `SSL_CERT_FILE` environment variable before running `belweder download`, so it can find TLS root certificates. This is a required step, because we do not currently ship root certificates ourselves. On Linux, you usually have certificates in the appropriate format in `/etc/ssl/certs`, so setting `SSL_CERT_DIR` to `/etc/ssl/certs` should work. If you have no certificates in the appropriate format installed, which is normally the case on Windows, download the curl CA bundle file and set `SSL_CERT_FILE` to the path of the file.
The exact commands might look like this:
Linux
SSL_CERT_DIR=/etc/ssl/certs belweder download
Windows (PowerShell)
$env:ssl_cert_file="C:\Users\user\cacert.pem"
belweder download
Windows (cmd)
set ssl_cert_file=C:\Users\user\cacert.pem
belweder download
Remember to do this every time you run `belweder download`. To confirm that your setup is correct, try downloading https://example.com and check whether there are any warnings like `Certificate chain verification error: IssuerCertNotFound`.
Run `belweder download` from the repository directory to start the download. It will print status information on the terminal while it runs.
After every 1000 downloaded items, a checkpoint is created in the form of `processed` and `queue` files in the repository directory. To traverse the website from the start, for example after a belweder update or after changing the URL ruleset, remove those files.
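For example, to force a full re-traversal on Linux, from the repository directory:
```sh
# Remove the checkpoint files so the next download traverses the website from scratch.
rm processed queue
SSL_CERT_DIR=/etc/ssl/certs belweder download
```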
Exporting
HTTP (recommended)
belweder can host an HTTP server that serves captures for local viewing in a browser. Create a `serve.toml` file in the repository directory:
[[instances]]
[[instances.listeners]]
port = 8000
Run `belweder export http` from the repository directory. It will start an HTTP server at 127.0.0.1:8000. Browse to http://127.0.0.1:8000/archive/https:%2F%2Fexample.com to view the page.
Currently, the URL has to be percent-encoded manually; there will be an easier way to input URLs in the future.
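Going by the example above, only the slashes of the original URL appear to be encoded (an assumption drawn from that example, not a documented rule), so a quick way to build the archive path from a URL is something like:
```sh
# Replace every "/" with "%2F", mirroring the https:%2F%2Fexample.com example above.
echo 'https://example.com/some/page' | sed 's|/|%2F|g'
# -> https:%2F%2Fexample.com%2Fsome%2Fpage
# Then browse to http://127.0.0.1:8000/archive/<that string>.
```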
This configuration will rewrite URLs on the fly to point to the archive. To avoid that, emulation mode can be used, which works just like the original server:
[[instances]]
[[instances.listeners]]
port = 8000
[instances.emulation]
host_source = "static"
host = "example.com"
schema = "any"
This can also be used to properly browse websites with encodings other than UTF-8.
This configuration is limited to serving captures from a single host for what are probably obvious reasons. If that's not desirable, there's also the option to use the `Host` header to serve captures from various hosts:
[[instances]]
[[instances.listeners]]
port = 8000
[instances.emulation]
host_source = "header"
schema = "any"
This requires additional setup beyond the scope of this explanation, such as editing `/etc/hosts` so that multiple domain names point to the host running belweder-serve, as well as proxying port 80 and/or 443 to its port.
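For example, on the machine used for browsing, `/etc/hosts` entries pointing the archived domains at the local server might look like this (the domain names are placeholders, and proxying ports 80/443 is a separate step):
```
# /etc/hosts - make archived domains resolve to the machine running belweder
127.0.0.1   example.com
127.0.0.1   forum.example.com
```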
Directory structure
Alternatively, captures can be exported to a directory structure based on `config.toml`. Run `belweder export directory` from the repository directory, which will create an `export` subdirectory containing exported captures. Currently, no URLs are rewritten, and the export will contain unmodified files - a way to rewrite URLs for local viewing without an HTTP server is planned.
Name
belweder is a pun on Belweder, a famous palace in Warsaw, Poland, that used to be the residence of the Polish president. It stands for likely better website downloader, as it was created to be a (hopefully) better alternative to httrack and wget.
Foundation libraries
- html5ever for browser-grade HTML parsing.
- kuchikiki for a simple html5ever interface.
- rust-cssparser for browser-grade CSS parsing.
- Lightning CSS for a simple rust-cssparser interface.
- Servo's mime_classifier (specifically Charles Samborski's mime_classifier library, which extracts the functionality for use in standalone projects) for browser-grade MIME sniffing.
- SQLite for capture information storage.
- rusqlite for SQLite bindings.
- axum for the export HTTP server.
Copyright
Copyright © 2024-2025 Grzesiek11
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.