Update README to reflect command-line flag configuration

- Replace environment variables with command-line flags
- Update run example with proper flag syntax
- Fix database schema path to misc/sql/initdb.sql
- Add missing configuration options (gopher, seed-url-path, max-db-connections)
- Remove outdated configuration options

This commit is contained in:
antanst 2025-06-29 22:25:38 +03:00
parent 1ba432c127
commit c386d5eb14


@@ -4,13 +4,13 @@ A crawler for the [Gemini](https://en.wikipedia.org/wiki/Gemini_(protocol)) netw
 Easily extendable as a "wayback machine" of Gemini.
 ## Features
-- [x] Save image/* and text/* files
 - [x] Concurrent downloading with configurable number of workers
+- [x] Save image/* and text/* files
 - [x] Connection limit per host
 - [x] URL Blacklist
 - [x] URL Whitelist (overrides blacklist and robots.txt)
 - [x] Follow robots.txt, see gemini://geminiprotocol.net/docs/companion/robots.gmi
-- [x] Configuration via environment variables
+- [x] Configuration via command-line flags
 - [x] Storing capsule snapshots in PostgreSQL
 - [x] Proper response header & body UTF-8 and format validation
 - [x] Proper URL normalization
@@ -22,46 +22,55 @@ This crawler uses `InsecureSkipVerify: true` in TLS configuration to accept all
 ## How to run
-Spin up a PostgreSQL, check `db/sql/initdb.sql` to create the tables and start the crawler.
-All configuration is done via environment variables.
+Spin up a PostgreSQL, check `misc/sql/initdb.sql` to create the tables and start the crawler.
+All configuration is done via command-line flags.
 ## Configuration
-Bool can be `true`,`false` or `0`,`1`.
+Available command-line flags:
 ```text
-LogLevel string // Logging level (debug, info, warn, error)
-MaxResponseSize int // Maximum size of response in bytes
-NumOfWorkers int // Number of concurrent workers
-ResponseTimeout int // Timeout for responses in seconds
-PanicOnUnexpectedError bool // Panic on unexpected errors when visiting a URL
-BlacklistPath string // File that has blacklisted strings of "host:port"
-WhitelistPath string // File with URLs that should always be crawled regardless of blacklist or robots.txt
-DryRun bool // If false, don't write to disk
-SkipIdenticalContent bool // When true, skip storing snapshots with identical content
-SkipIfUpdatedDays int // Skip re-crawling URLs updated within this many days (0 to disable)
+-blacklist-path string
+      File that has blacklist regexes
+-dry-run
+      Dry run mode
+-gopher
+      Enable crawling of Gopher holes
+-log-level string
+      Logging level (debug, info, warn, error) (default "info")
+-max-db-connections int
+      Maximum number of database connections (default 100)
+-max-response-size int
+      Maximum size of response in bytes (default 1048576)
+-pgurl string
+      Postgres URL
+-response-timeout int
+      Timeout for network responses in seconds (default 10)
+-seed-url-path string
+      File with seed URLs that should be added to the queue immediately
+-skip-if-updated-days int
+      Skip re-crawling URLs updated within this many days (0 to disable) (default 60)
+-whitelist-path string
+      File with URLs that should always be crawled regardless of blacklist
+-workers int
+      Number of concurrent workers (default 1)
 ```
 Example:
 ```shell
-LOG_LEVEL=info \
-NUM_OF_WORKERS=10 \
-BLACKLIST_PATH="./blacklist.txt" \ # one url per line, can be empty
-WHITELIST_PATH="./whitelist.txt" \ # URLs that override blacklist and robots.txt
-MAX_RESPONSE_SIZE=10485760 \
-RESPONSE_TIMEOUT=10 \
-PANIC_ON_UNEXPECTED_ERROR=true \
-PG_DATABASE=test \
-PG_HOST=127.0.0.1 \
-PG_MAX_OPEN_CONNECTIONS=100 \
-PG_PORT=5434 \
-PG_USER=test \
-PG_PASSWORD=test \
-DRY_RUN=false \
-SKIP_IDENTICAL_CONTENT=false \
-SKIP_IF_UPDATED_DAYS=7 \
-./gemini-grc
+./dist/crawler \
+    -pgurl="postgres://test:test@127.0.0.1:5434/test?sslmode=disable" \
+    -log-level=info \
+    -workers=10 \
+    -blacklist-path="./blacklist.txt" \
+    -whitelist-path="./whitelist.txt" \
+    -max-response-size=10485760 \
+    -response-timeout=10 \
+    -max-db-connections=100 \
+    -skip-if-updated-days=7 \
+    -gopher \
+    -seed-url-path="./seed_urls.txt"
 ```
 ## Development