Update documentation and project configuration

- Add architecture documentation for versioned snapshots
- Update Makefile with improved build commands
- Update dependency versions in go.mod
- Add project notes and development guidelines
- Improve README with new features and instructions
2025-05-22 13:26:11 +03:00
parent bfaa857fae
commit 51f94c90b2
7 changed files with 193 additions and 38 deletions


@@ -8,6 +8,7 @@ Easily extendable as a "wayback machine" of Gemini.
- [x] Concurrent downloading with configurable number of workers
- [x] Connection limit per host
- [x] URL Blacklist
- [x] URL Whitelist (overrides blacklist and robots.txt)
- [x] Follow robots.txt, see gemini://geminiprotocol.net/docs/companion/robots.gmi
- [x] Configuration via environment variables
- [x] Storing capsule snapshots in PostgreSQL
@@ -16,6 +17,9 @@ Easily extendable as a "wayback machine" of Gemini.
- [x] Handle redirects (3X status codes)
- [x] Crawl Gopher holes
## Security Note

This crawler uses `InsecureSkipVerify: true` in its TLS configuration and therefore accepts any certificate. This is a common approach for crawlers, but it leaves the application vulnerable to MITM attacks. The trade-off is made so that capsules with self-signed certificates, which are widespread in the Gemini ecosystem, can still be crawled.
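For illustration, a minimal sketch of such a TLS setup; the `dialGemini` helper and its timeout are assumptions for this example, not the project's actual code:

```go
package sketch

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

// dialGemini opens a TLS connection to a Gemini host:port while accepting
// self-signed certificates. InsecureSkipVerify disables certificate chain
// and hostname verification, which is exactly what exposes the client to
// the MITM attacks mentioned above.
func dialGemini(hostPort string) (*tls.Conn, error) {
	dialer := &net.Dialer{Timeout: 10 * time.Second}
	conn, err := tls.DialWithDialer(dialer, "tcp", hostPort, &tls.Config{
		InsecureSkipVerify: true,             // accept self-signed certificates
		MinVersion:         tls.VersionTLS12, // but still require a modern TLS version
	})
	if err != nil {
		return nil, fmt.Errorf("dial %s: %w", hostPort, err)
	}
	return conn, nil
}
```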
## How to run
Spin up a PostgreSQL instance, run `db/sql/initdb.sql` to create the tables, and start the crawler.
@@ -30,11 +34,12 @@ Bool can be `true`,`false` or `0`,`1`.
MaxResponseSize int // Maximum size of response in bytes
NumOfWorkers int // Number of concurrent workers
ResponseTimeout int // Timeout for responses in seconds
WorkerBatchSize int // Batch size for worker processing
PanicOnUnexpectedError bool // Panic on unexpected errors when visiting a URL
BlacklistPath string // File that has blacklisted strings of "host:port"
WhitelistPath string // File with URLs that should always be crawled regardless of blacklist or robots.txt
DryRun bool // If true, don't write to disk
PrintWorkerStatus bool // If true, print a worker status table instead of logs
SkipIdenticalContent bool // When true, skip storing snapshots with identical content
SkipIfUpdatedDays int // Skip re-crawling URLs updated within this many days (0 to disable)
```
Example:
@@ -42,8 +47,8 @@ Example:
```shell
# blacklist.txt: one host:port entry per line, can be empty
# whitelist.txt: URLs that override the blacklist and robots.txt
LOG_LEVEL=info \
NUM_OF_WORKERS=10 \
WORKER_BATCH_SIZE=10 \
BLACKLIST_PATH="./blacklist.txt" \
WHITELIST_PATH="./whitelist.txt" \
MAX_RESPONSE_SIZE=10485760 \
RESPONSE_TIMEOUT=10 \
PANIC_ON_UNEXPECTED_ERROR=true \
@@ -54,6 +59,8 @@ PG_PORT=5434 \
PG_USER=test \
PG_PASSWORD=test \
DRY_RUN=false \
SKIP_IDENTICAL_CONTENT=false \
SKIP_IF_UPDATED_DAYS=7 \
./gemini-grc
```
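Internally, each of these environment variables maps onto one of the configuration fields listed above. A rough sketch of that mapping, using helper names and defaults that are assumptions rather than the project's actual code:

```go
package sketch

import (
	"os"
	"strconv"
)

// Config mirrors a few of the settings documented above.
type Config struct {
	NumOfWorkers    int
	WorkerBatchSize int
	MaxResponseSize int
	ResponseTimeout int
	DryRun          bool
	BlacklistPath   string
	WhitelistPath   string
}

// envInt reads an integer environment variable, falling back to a default.
func envInt(key string, def int) int {
	if v, err := strconv.Atoi(os.Getenv(key)); err == nil {
		return v
	}
	return def
}

// envBool accepts true/false as well as 1/0, as described above.
func envBool(key string, def bool) bool {
	switch os.Getenv(key) {
	case "true", "1":
		return true
	case "false", "0":
		return false
	}
	return def
}

func loadConfig() Config {
	return Config{
		NumOfWorkers:    envInt("NUM_OF_WORKERS", 1),
		WorkerBatchSize: envInt("WORKER_BATCH_SIZE", 10),
		MaxResponseSize: envInt("MAX_RESPONSE_SIZE", 10485760),
		ResponseTimeout: envInt("RESPONSE_TIMEOUT", 10),
		DryRun:          envBool("DRY_RUN", false),
		BlacklistPath:   os.Getenv("BLACKLIST_PATH"),
		WhitelistPath:   os.Getenv("WHITELIST_PATH"),
	}
}
```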
@@ -65,8 +72,30 @@ go install mvdan.cc/gofumpt@v0.7.0
go install github.com/golangci/golangci-lint/cmd/golangci-lint@v1.63.4
```
## Snapshot History
The crawler now supports versioned snapshots: multiple snapshots of the same URL are stored over time, so you can see how a capsule's content has changed, similar to the Internet Archive's Wayback Machine.
### Accessing Snapshot History
You can access the snapshot history using the included `snapshot_history.sh` script:
```bash
# Get the latest snapshot
./snapshot_history.sh -u gemini://example.com/

# Get a snapshot from a specific point in time
./snapshot_history.sh -u gemini://example.com/ -t 2023-05-01T12:00:00Z

# Get all snapshots for a URL
./snapshot_history.sh -u gemini://example.com/ -a

# Get snapshots in a date range
./snapshot_history.sh -u gemini://example.com/ -r 2023-01-01T00:00:00Z 2023-12-31T23:59:59Z
```
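The script is a thin wrapper around SQL queries against the snapshots table. A hedged Go sketch of the same point-in-time lookup; the table and column names here are assumptions, not the actual schema from `db/sql/initdb.sql`:

```go
package sketch

import (
	"database/sql"
	"time"

	_ "github.com/lib/pq" // PostgreSQL driver (assumed)
)

// latestSnapshotBefore returns the newest stored content for a URL at or
// before the given timestamp, mirroring `snapshot_history.sh -u ... -t ...`.
// The "snapshots" table and its columns are illustrative names only.
func latestSnapshotBefore(db *sql.DB, url string, at time.Time) (string, error) {
	var content string
	err := db.QueryRow(`
		SELECT content
		FROM snapshots
		WHERE url = $1 AND crawled_at <= $2
		ORDER BY crawled_at DESC
		LIMIT 1`, url, at).Scan(&content)
	return content, err
}
```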
## TODO
- [x] Add snapshot history
- [ ] Add a web interface
- [ ] Provide a client TLS certificate to servers that require one, like Astrobotany
- [ ] Use pledge/unveil in OpenBSD hosts