Update README to reflect command-line flag configuration

- Replace environment variables with command-line flags
- Update run example with proper flag syntax
- Fix database schema path to misc/sql/initdb.sql
- Add missing configuration options (gopher, seed-url-path, max-db-connections)
- Remove outdated configuration options
73
README.md
@@ -4,13 +4,13 @@ A crawler for the [Gemini](https://en.wikipedia.org/wiki/Gemini_(protocol)) netw
 Easily extendable as a "wayback machine" of Gemini.
 
 ## Features
-- [x] Save image/* and text/* files
 - [x] Concurrent downloading with configurable number of workers
+- [x] Save image/* and text/* files
 - [x] Connection limit per host
 - [x] URL Blacklist
 - [x] URL Whitelist (overrides blacklist and robots.txt)
 - [x] Follow robots.txt, see gemini://geminiprotocol.net/docs/companion/robots.gmi
-- [x] Configuration via environment variables
+- [x] Configuration via command-line flags
 - [x] Storing capsule snapshots in PostgreSQL
 - [x] Proper response header & body UTF-8 and format validation
 - [x] Proper URL normalization
@@ -22,46 +22,55 @@ This crawler uses `InsecureSkipVerify: true` in TLS configuration to accept all
 
 ## How to run
 
-Spin up a PostgreSQL, check `db/sql/initdb.sql` to create the tables and start the crawler.
-All configuration is done via environment variables.
+Spin up a PostgreSQL, check `misc/sql/initdb.sql` to create the tables and start the crawler.
+All configuration is done via command-line flags.
 
 ## Configuration
 
-Bool can be `true`,`false` or `0`,`1`.
+Available command-line flags:
 
 ```text
-LogLevel string // Logging level (debug, info, warn, error)
-MaxResponseSize int // Maximum size of response in bytes
-NumOfWorkers int // Number of concurrent workers
-ResponseTimeout int // Timeout for responses in seconds
-PanicOnUnexpectedError bool // Panic on unexpected errors when visiting a URL
-BlacklistPath string // File that has blacklisted strings of "host:port"
-WhitelistPath string // File with URLs that should always be crawled regardless of blacklist or robots.txt
-DryRun bool // If false, don't write to disk
-SkipIdenticalContent bool // When true, skip storing snapshots with identical content
-SkipIfUpdatedDays int // Skip re-crawling URLs updated within this many days (0 to disable)
+-blacklist-path string
+File that has blacklist regexes
+-dry-run
+Dry run mode
+-gopher
+Enable crawling of Gopher holes
+-log-level string
+Logging level (debug, info, warn, error) (default "info")
+-max-db-connections int
+Maximum number of database connections (default 100)
+-max-response-size int
+Maximum size of response in bytes (default 1048576)
+-pgurl string
+Postgres URL
+-response-timeout int
+Timeout for network responses in seconds (default 10)
+-seed-url-path string
+File with seed URLs that should be added to the queue immediately
+-skip-if-updated-days int
+Skip re-crawling URLs updated within this many days (0 to disable) (default 60)
+-whitelist-path string
+File with URLs that should always be crawled regardless of blacklist
+-workers int
+Number of concurrent workers (default 1)
 ```
 
 Example:
 
 ```shell
-LOG_LEVEL=info \
-NUM_OF_WORKERS=10 \
-BLACKLIST_PATH="./blacklist.txt" \ # one url per line, can be empty
-WHITELIST_PATH="./whitelist.txt" \ # URLs that override blacklist and robots.txt
-MAX_RESPONSE_SIZE=10485760 \
-RESPONSE_TIMEOUT=10 \
-PANIC_ON_UNEXPECTED_ERROR=true \
-PG_DATABASE=test \
-PG_HOST=127.0.0.1 \
-PG_MAX_OPEN_CONNECTIONS=100 \
-PG_PORT=5434 \
-PG_USER=test \
-PG_PASSWORD=test \
-DRY_RUN=false \
-SKIP_IDENTICAL_CONTENT=false \
-SKIP_IF_UPDATED_DAYS=7 \
-./gemini-grc
+./dist/crawler \
+-pgurl="postgres://test:test@127.0.0.1:5434/test?sslmode=disable" \
+-log-level=info \
+-workers=10 \
+-blacklist-path="./blacklist.txt" \
+-whitelist-path="./whitelist.txt" \
+-max-response-size=10485760 \
+-response-timeout=10 \
+-max-db-connections=100 \
+-skip-if-updated-days=7 \
+-gopher \
+-seed-url-path="./seed_urls.txt"
 ```
 
 ## Development
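
Note on the flag list above: its layout (`-name type`, then a description with `(default …)`) resembles the usage output of Go's standard `flag` package. The following is a minimal sketch of how such flags could be declared, assuming that package is used. The `Config` struct, its field names, and `parseFlags` are hypothetical; only the flag names, defaults, and descriptions are taken from the README text in this diff, and the project's actual code may differ.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Config is a hypothetical container for the flags listed in the README.
type Config struct {
	BlacklistPath     string
	DryRun            bool
	Gopher            bool
	LogLevel          string
	MaxDBConnections  int
	MaxResponseSize   int
	PgURL             string
	ResponseTimeout   int
	SeedURLPath       string
	SkipIfUpdatedDays int
	WhitelistPath     string
	Workers           int
}

// parseFlags parses args (without the program name) into a Config.
// It uses its own FlagSet so it can be called more than once, e.g. in tests.
func parseFlags(args []string) (*Config, error) {
	cfg := &Config{}
	fs := flag.NewFlagSet("crawler", flag.ContinueOnError)

	fs.StringVar(&cfg.BlacklistPath, "blacklist-path", "", "File that has blacklist regexes")
	fs.BoolVar(&cfg.DryRun, "dry-run", false, "Dry run mode")
	fs.BoolVar(&cfg.Gopher, "gopher", false, "Enable crawling of Gopher holes")
	fs.StringVar(&cfg.LogLevel, "log-level", "info", "Logging level (debug, info, warn, error)")
	fs.IntVar(&cfg.MaxDBConnections, "max-db-connections", 100, "Maximum number of database connections")
	fs.IntVar(&cfg.MaxResponseSize, "max-response-size", 1048576, "Maximum size of response in bytes")
	fs.StringVar(&cfg.PgURL, "pgurl", "", "Postgres URL")
	fs.IntVar(&cfg.ResponseTimeout, "response-timeout", 10, "Timeout for network responses in seconds")
	fs.StringVar(&cfg.SeedURLPath, "seed-url-path", "", "File with seed URLs that should be added to the queue immediately")
	fs.IntVar(&cfg.SkipIfUpdatedDays, "skip-if-updated-days", 60, "Skip re-crawling URLs updated within this many days (0 to disable)")
	fs.StringVar(&cfg.WhitelistPath, "whitelist-path", "", "File with URLs that should always be crawled regardless of blacklist")
	fs.IntVar(&cfg.Workers, "workers", 1, "Number of concurrent workers")

	// Parse reports unknown flags and malformed values as an error.
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return cfg, nil
}

func main() {
	cfg, err := parseFlags(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("%+v\n", *cfg)
}
```

With the `flag` package, boolean flags such as `-gopher` and `-dry-run` are enabled by their presence alone, which is consistent with the run example in the diff.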