# gemini-grc

A crawler for the [Gemini](https://en.wikipedia.org/wiki/Gemini_(protocol)) network.
Easily extendable as a "wayback machine" of Gemini.

## Features
- [x] Follow robots.txt (see gemini://geminiprotocol.net/docs/companion/robots.gmi)
- [x] Save image/* and text/* files
- [x] Concurrent downloading with a configurable number of workers
- [x] Connection limit per host
- [x] URL blacklist
- [x] Configuration via environment variables
- [x] Storing capsule snapshots in PostgreSQL
- [x] Proper response header & body UTF-8 and format validation
- [x] Proper URL normalization
- [x] Handle redirects (3X status codes)
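
To illustrate what URL normalization can involve for a crawler like this, here is a minimal Go sketch. The helper `normalizeGeminiURL` is hypothetical and is not taken from the gemini-grc source. It applies three rules that follow from Gemini's defaults (lowercase host, drop the default port 1965, treat an empty path as `/`); whether gemini-grc applies exactly these rules is an assumption.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

// defaultGeminiPort is the port Gemini uses when a URL names none.
const defaultGeminiPort = "1965"

// normalizeGeminiURL is a hypothetical helper (not gemini-grc's code).
// It lowercases the host, removes the default port and turns an
// empty path into "/" so that equivalent URLs compare equal.
func normalizeGeminiURL(raw string) (string, error) {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return "", fmt.Errorf("parse %q: %w", raw, err)
	}
	// url.Parse already lowercases the scheme.
	if u.Scheme != "gemini" {
		return "", fmt.Errorf("not a gemini URL: %q", raw)
	}
	host := strings.ToLower(u.Hostname())
	if host == "" {
		return "", fmt.Errorf("missing host: %q", raw)
	}
	if port := u.Port(); port != "" && port != defaultGeminiPort {
		u.Host = net.JoinHostPort(host, port)
	} else if strings.Contains(host, ":") {
		u.Host = "[" + host + "]" // bare IPv6 literal needs brackets
	} else {
		u.Host = host
	}
	if u.Path == "" {
		u.Path = "/"
	}
	return u.String(), nil
}

func main() {
	for _, raw := range []string{"GEMINI://Example.ORG:1965", "gemini://tlgs.one/"} {
		norm, err := normalizeGeminiURL(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(raw, "->", norm)
	}
}
```

Normalizing before storing or deduplicating URLs keeps the crawler from visiting the same page under several spellings.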

## How to run

Start a PostgreSQL instance, create the tables with `db/sql/initdb.sql`, then start the crawler.
All configuration is done via environment variables.

## Configuration

Boolean values can be `true`/`false` or `1`/`0`.

```text
LogLevel               string // Logging level (debug, info, warn, error)
MaxResponseSize        int    // Maximum size of response in bytes
NumOfWorkers           int    // Number of concurrent workers
ResponseTimeout        int    // Timeout for responses in seconds
WorkerBatchSize        int    // Batch size for worker processing
PanicOnUnexpectedError bool   // Panic on unexpected errors when visiting a URL
BlacklistPath          string // File that has blacklisted strings of "host:port"
DryRun                 bool   // If true, don't write to disk
PrintWorkerStatus      bool   // If false, print logs instead of the worker status table
```
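
The boolean rule above could be enforced with a small parser. The following is a minimal sketch, not the project's actual code: `parseBool` and `boolFromEnv` are hypothetical helpers, and treating an unset variable as an error is an assumption. Go's standard `strconv.ParseBool` would also work, but it accepts additional spellings such as `T` and `FALSE`.

```go
package main

import (
	"fmt"
	"os"
)

// parseBool is a hypothetical helper that accepts only the four
// spellings documented above: true, false, 1 and 0.
func parseBool(s string) (bool, error) {
	switch s {
	case "true", "1":
		return true, nil
	case "false", "0":
		return false, nil
	default:
		return false, fmt.Errorf("invalid bool %q: want true, false, 1 or 0", s)
	}
}

// boolFromEnv reads a boolean setting through getenv (normally os.Getenv).
// Treating an unset or empty variable as an error is an assumption of this sketch.
func boolFromEnv(getenv func(string) string, name string) (bool, error) {
	v, err := parseBool(getenv(name))
	if err != nil {
		return false, fmt.Errorf("%s: %w", name, err)
	}
	return v, nil
}

func main() {
	dryRun, err := boolFromEnv(os.Getenv, "DRY_RUN")
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("dry run:", dryRun)
}
```

Passing the lookup function as a parameter keeps the helper easy to test without touching the real environment.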

Example:

```shell
# BLACKLIST_PATH: one URL per line; the file can be empty.
LOG_LEVEL=info \
NUM_OF_WORKERS=10 \
WORKER_BATCH_SIZE=10 \
BLACKLIST_PATH="./blacklist.txt" \
MAX_RESPONSE_SIZE=10485760 \
RESPONSE_TIMEOUT=10 \
PANIC_ON_UNEXPECTED_ERROR=true \
PG_DATABASE=test \
PG_HOST=127.0.0.1 \
PG_MAX_OPEN_CONNECTIONS=100 \
PG_PORT=5434 \
PG_USER=test \
PG_PASSWORD=test \
DRY_RUN=false \
./gemini-grc
```

## Development

Install linters. Check the versions first.

```shell
go install mvdan.cc/gofumpt@v0.7.0
go install github.com/golangci/golangci-lint/cmd/golangci-lint@v1.63.4
```

## TODO
- [ ] Add snapshot history
- [ ] Add a web interface
- [ ] Provide a TLS client certificate to servers that require one, such as Astrobotany
- [ ] Use pledge/unveil on OpenBSD hosts

## TODO (lower priority)
- [ ] Gopher
- [ ] More? http://dbohdan.sdf.org/smolnet/

## Notes
Good starting points:

- gemini://warmedal.se/~antenna/
- gemini://tlgs.one/
- gopher://i-logout.cz:70/1/bongusta/
- gopher://gopher.quux.org:70/