gemini-grc

A crawler for the Gemini network. It can easily be extended into a "wayback machine" for Gemini.

Features

  • Save image/* and text/* files
  • Concurrent downloading with configurable number of workers
  • Connection limit per host
  • URL Blacklist
  • Follow robots.txt, see gemini://geminiprotocol.net/docs/companion/robots.gmi
  • Configuration via environment variables
  • Storing capsule snapshots in PostgreSQL
  • Proper response header & body UTF-8 and format validation
  • Proper URL normalization
  • Handle redirects (3X status codes)
  • Crawl Gopher holes

How to run

Spin up a PostgreSQL instance, create the tables using db/sql/initdb.sql, and start the crawler. All configuration is done via environment variables.

Configuration

Boolean values can be given as true/false or 1/0.

	LogLevel               string // Logging level (debug, info, warn, error)
	MaxResponseSize        int // Maximum size of response in bytes
	NumOfWorkers           int // Number of concurrent workers
	ResponseTimeout        int // Timeout for responses in seconds
	WorkerBatchSize        int // Batch size for worker processing
	PanicOnUnexpectedError bool // Panic on unexpected errors when visiting a URL
	BlacklistPath          string // File that has blacklisted strings of "host:port"
	DryRun                 bool // If true, don't write to disk
	PrintWorkerStatus      bool // If false, print logs instead of the worker status table
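
For readers who want to see how such fields can be populated from the environment, here is a minimal sketch covering three of them. It is an assumption about the general approach, not the project's actual loader: loadConfig, its lookup parameter and the default values are hypothetical. strconv.ParseBool from the standard library accepts true, false, 1 and 0 (among a few other spellings).

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Config holds a subset of the fields listed above.
type Config struct {
	LogLevel     string
	NumOfWorkers int
	DryRun       bool
}

// loadConfig reads settings through lookup, which has the same signature
// as os.LookupEnv so it can be replaced in tests.
func loadConfig(lookup func(string) (string, bool)) (Config, error) {
	// Defaults are assumptions made for this sketch only.
	cfg := Config{LogLevel: "info", NumOfWorkers: 1}

	if v, ok := lookup("LOG_LEVEL"); ok {
		cfg.LogLevel = v
	}
	if v, ok := lookup("NUM_OF_WORKERS"); ok {
		n, err := strconv.Atoi(v)
		if err != nil || n < 1 {
			return cfg, fmt.Errorf("NUM_OF_WORKERS: invalid value %q", v)
		}
		cfg.NumOfWorkers = n
	}
	if v, ok := lookup("DRY_RUN"); ok {
		b, err := strconv.ParseBool(v)
		if err != nil {
			return cfg, fmt.Errorf("DRY_RUN: invalid value %q", v)
		}
		cfg.DryRun = b
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.LookupEnv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "config error:", err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", cfg)
}
```

Passing the lookup function as a parameter keeps the parsing logic testable without touching the real process environment.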

Example:

# blacklist.txt: one url per line, can be empty
LOG_LEVEL=info \
NUM_OF_WORKERS=10 \
WORKER_BATCH_SIZE=10 \
BLACKLIST_PATH="./blacklist.txt" \
MAX_RESPONSE_SIZE=10485760 \
RESPONSE_TIMEOUT=10 \
PANIC_ON_UNEXPECTED_ERROR=true \
PG_DATABASE=test \
PG_HOST=127.0.0.1 \
PG_MAX_OPEN_CONNECTIONS=100 \
PG_PORT=5434 \
PG_USER=test \
PG_PASSWORD=test \
DRY_RUN=false \
./gemini-grc

Development

Install linters. Check the versions first.

go install mvdan.cc/gofumpt@v0.7.0
go install github.com/golangci/golangci-lint/cmd/golangci-lint@v1.63.4

TODO

  • Add snapshot history
  • Add a web interface
  • Present a TLS client certificate to servers that require one, like Astrobotany
  • Use pledge/unveil on OpenBSD hosts

TODO (lower priority)

Notes

Good starting points:

gemini://warmedal.se/~antenna/
gemini://tlgs.one/
gopher://i-logout.cz:70/1/bongusta/
gopher://gopher.quux.org:70/

License: ISC