
gemini-grc

A crawler for the Gemini network. Easily extendable into a "wayback machine" for Gemini.

Features

  • Save image/* and text/* files
  • Concurrent downloading with configurable number of workers
  • Connection limit per host
  • URL Blacklist
  • URL Whitelist (overrides blacklist and robots.txt)
  • Follow robots.txt, see gemini://geminiprotocol.net/docs/companion/robots.gmi
  • Configuration via environment variables
  • Storing capsule snapshots in PostgreSQL
  • Proper response header & body UTF-8 and format validation
  • Proper URL normalization (see the sketch after this list)
  • Handle redirects (3X status codes)
  • Crawl Gopher holes
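
To make the normalization point concrete, here is a minimal Go sketch of the kind of URL canonicalization a Gemini crawler needs; it is an illustration using the standard library, not the project's actual normalization code.

// Hypothetical sketch of crawler-style URL normalization: resolve relative
// links against the page URL and canonicalize the result.
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalize resolves ref against base, lowercases the host and drops the
// default Gemini port so that equivalent URLs compare equal.
func normalize(base, ref string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	r, err := url.Parse(ref)
	if err != nil {
		return "", err
	}
	u := b.ResolveReference(r)
	u.Host = strings.ToLower(u.Host)
	u.Host = strings.TrimSuffix(u.Host, ":1965") // 1965 is the default Gemini port
	u.Fragment = ""
	return u.String(), nil
}

func main() {
	out, _ := normalize("gemini://Example.COM:1965/dir/page.gmi", "../other.gmi")
	fmt.Println(out) // gemini://example.com/other.gmi
}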

Security Note

This crawler uses InsecureSkipVerify: true in its TLS configuration to accept all certificates. This is a common approach for crawlers, but it makes the application vulnerable to MITM attacks. The trade-off is made to enable crawling capsules with self-signed certificates, which are widely used in the Gemini ecosystem.
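
For reference, this is roughly what such a client-side TLS configuration looks like in Go; a minimal sketch of the trade-off described above, not the crawler's actual dial code.

// Sketch: dial a Gemini host while accepting any certificate.
package main

import (
	"bufio"
	"crypto/tls"
	"fmt"
	"log"
)

func main() {
	conf := &tls.Config{
		InsecureSkipVerify: true, // accept self-signed certs; vulnerable to MITM
		MinVersion:         tls.VersionTLS12,
	}
	conn, err := tls.Dial("tcp", "geminiprotocol.net:1965", conf)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// A Gemini request is the absolute URL followed by CRLF.
	fmt.Fprint(conn, "gemini://geminiprotocol.net/\r\n")
	header, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(header) // e.g. "20 text/gemini"
}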

How to run

Spin up a PostgreSQL instance, run db/sql/initdb.sql to create the tables, and start the crawler. All configuration is done via environment variables.

Configuration

Boolean values can be given as true/false or 1/0.

	LogLevel               string // Logging level (debug, info, warn, error)
	MaxResponseSize        int // Maximum size of response in bytes
	NumOfWorkers           int // Number of concurrent workers
	ResponseTimeout        int // Timeout for responses in seconds
	PanicOnUnexpectedError bool // Panic on unexpected errors when visiting a URL
	BlacklistPath          string // File with blacklisted "host:port" entries, one per line
	WhitelistPath          string // File with URLs that should always be crawled regardless of blacklist or robots.txt
	DryRun                 bool // If true, don't write results (dry run)
	SkipIdenticalContent   bool // When true, skip storing snapshots with identical content
	SkipIfUpdatedDays      int  // Skip re-crawling URLs updated within this many days (0 to disable)
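
For illustration, a minimal sketch of how environment variables could map onto fields like the ones above; the helper names are hypothetical and the project's actual loader may differ.

// Hypothetical env-to-config mapping; not the project's actual loader.
package main

import (
	"fmt"
	"os"
	"strconv"
)

type Config struct {
	LogLevel        string
	MaxResponseSize int
	NumOfWorkers    int
	DryRun          bool
}

// envInt reads an integer env var, falling back to def if unset or invalid.
func envInt(key string, def int) int {
	if v, err := strconv.Atoi(os.Getenv(key)); err == nil {
		return v
	}
	return def
}

// envBool accepts true/false as well as 1/0, as noted above.
func envBool(key string) bool {
	v := os.Getenv(key)
	return v == "true" || v == "1"
}

func loadConfig() Config {
	return Config{
		LogLevel:        os.Getenv("LOG_LEVEL"),
		MaxResponseSize: envInt("MAX_RESPONSE_SIZE", 10485760),
		NumOfWorkers:    envInt("NUM_OF_WORKERS", 10),
		DryRun:          envBool("DRY_RUN"),
	}
}

func main() {
	fmt.Printf("%+v\n", loadConfig())
}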

Example:

# blacklist.txt: one URL per line, can be empty
# whitelist.txt: URLs that override the blacklist and robots.txt
LOG_LEVEL=info \
NUM_OF_WORKERS=10 \
BLACKLIST_PATH="./blacklist.txt" \
WHITELIST_PATH="./whitelist.txt" \
MAX_RESPONSE_SIZE=10485760 \
RESPONSE_TIMEOUT=10 \
PANIC_ON_UNEXPECTED_ERROR=true \
PG_DATABASE=test \
PG_HOST=127.0.0.1 \
PG_MAX_OPEN_CONNECTIONS=100 \
PG_PORT=5434 \
PG_USER=test \
PG_PASSWORD=test \
DRY_RUN=false \
SKIP_IDENTICAL_CONTENT=false \
SKIP_IF_UPDATED_DAYS=7 \
./gemini-grc

Development

Install the linters; check the versions first.

go install mvdan.cc/gofumpt@v0.7.0
go install github.com/golangci/golangci-lint/cmd/golangci-lint@v1.63.4

Snapshot History

The crawler now supports versioned snapshots, storing multiple snapshots of the same URL over time. This lets you see how content changes, similar to the Internet Archive's Wayback Machine.

Accessing Snapshot History

You can access the snapshot history using the included snapshot_history.sh script:

# Get the latest snapshot
./snapshot_history.sh -u gemini://example.com/

# Get a snapshot from a specific point in time
./snapshot_history.sh -u gemini://example.com/ -t 2023-05-01T12:00:00Z

# Get all snapshots for a URL
./snapshot_history.sh -u gemini://example.com/ -a

# Get snapshots in a date range
./snapshot_history.sh -u gemini://example.com/ -r 2023-01-01T00:00:00Z 2023-12-31T23:59:59Z
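
If you prefer to query the database directly, the idea behind the script looks roughly like this in Go; the table and column names here are assumptions for illustration, so check db/sql/initdb.sql for the real schema.

// Rough sketch: fetch the latest snapshot of a URL at or before a given time.
// Table and column names are assumptions, not the project's actual schema.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func latestSnapshotBefore(db *sql.DB, url string, t time.Time) (string, error) {
	var body string
	err := db.QueryRow(
		`SELECT body FROM snapshots
		 WHERE url = $1 AND timestamp <= $2
		 ORDER BY timestamp DESC
		 LIMIT 1`,
		url, t,
	).Scan(&body)
	return body, err
}

func main() {
	db, err := sql.Open("postgres",
		"host=127.0.0.1 port=5434 user=test password=test dbname=test sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	body, err := latestSnapshotBefore(db, "gemini://example.com/", time.Now())
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(body)
}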

TODO

  • Add snapshot history
  • Add a web interface
  • Provide a client TLS certificate to servers that require one, like Astrobotany
  • Use pledge/unveil on OpenBSD hosts

TODO (lower priority)

Notes

Good starting points:

  • gemini://warmedal.se/~antenna/
  • gemini://tlgs.one/
  • gopher://i-logout.cz:70/1/bongusta/
  • gopher://gopher.quux.org:70/
