Update README to reflect command-line flag configuration
- Replace environment variables with command-line flags
- Update run example with proper flag syntax
- Fix database schema path to misc/sql/initdb.sql
- Add missing configuration options (gopher, seed-url-path, max-db-connections)
- Remove outdated configuration options
README.md (73 changed lines: 41 additions, 32 deletions)
@@ -4,13 +4,13 @@ A crawler for the [Gemini](https://en.wikipedia.org/wiki/Gemini_(protocol)) netw
 Easily extendable as a "wayback machine" of Gemini.

 ## Features
-- [x] Save image/* and text/* files
 - [x] Concurrent downloading with configurable number of workers
+- [x] Save image/* and text/* files
 - [x] Connection limit per host
 - [x] URL Blacklist
 - [x] URL Whitelist (overrides blacklist and robots.txt)
 - [x] Follow robots.txt, see gemini://geminiprotocol.net/docs/companion/robots.gmi
-- [x] Configuration via environment variables
+- [x] Configuration via command-line flags
 - [x] Storing capsule snapshots in PostgreSQL
 - [x] Proper response header & body UTF-8 and format validation
 - [x] Proper URL normalization
@@ -22,46 +22,55 @@ This crawler uses `InsecureSkipVerify: true` in TLS configuration to accept all

 ## How to run

-Spin up a PostgreSQL, check `db/sql/initdb.sql` to create the tables and start the crawler.
-All configuration is done via environment variables.
+Spin up a PostgreSQL, check `misc/sql/initdb.sql` to create the tables and start the crawler.
+All configuration is done via command-line flags.

 ## Configuration

-Bool can be `true`,`false` or `0`,`1`.
+Available command-line flags:

 ```text
-LogLevel string // Logging level (debug, info, warn, error)
-MaxResponseSize int // Maximum size of response in bytes
-NumOfWorkers int // Number of concurrent workers
-ResponseTimeout int // Timeout for responses in seconds
-PanicOnUnexpectedError bool // Panic on unexpected errors when visiting a URL
-BlacklistPath string // File that has blacklisted strings of "host:port"
-WhitelistPath string // File with URLs that should always be crawled regardless of blacklist or robots.txt
-DryRun bool // If false, don't write to disk
-SkipIdenticalContent bool // When true, skip storing snapshots with identical content
-SkipIfUpdatedDays int // Skip re-crawling URLs updated within this many days (0 to disable)
+-blacklist-path string
+    File that has blacklist regexes
+-dry-run
+    Dry run mode
+-gopher
+    Enable crawling of Gopher holes
+-log-level string
+    Logging level (debug, info, warn, error) (default "info")
+-max-db-connections int
+    Maximum number of database connections (default 100)
+-max-response-size int
+    Maximum size of response in bytes (default 1048576)
+-pgurl string
+    Postgres URL
+-response-timeout int
+    Timeout for network responses in seconds (default 10)
+-seed-url-path string
+    File with seed URLs that should be added to the queue immediately
+-skip-if-updated-days int
+    Skip re-crawling URLs updated within this many days (0 to disable) (default 60)
+-whitelist-path string
+    File with URLs that should always be crawled regardless of blacklist
+-workers int
+    Number of concurrent workers (default 1)
 ```

 Example:

 ```shell
-LOG_LEVEL=info \
-NUM_OF_WORKERS=10 \
-BLACKLIST_PATH="./blacklist.txt" \ # one url per line, can be empty
-WHITELIST_PATH="./whitelist.txt" \ # URLs that override blacklist and robots.txt
-MAX_RESPONSE_SIZE=10485760 \
-RESPONSE_TIMEOUT=10 \
-PANIC_ON_UNEXPECTED_ERROR=true \
-PG_DATABASE=test \
-PG_HOST=127.0.0.1 \
-PG_MAX_OPEN_CONNECTIONS=100 \
-PG_PORT=5434 \
-PG_USER=test \
-PG_PASSWORD=test \
-DRY_RUN=false \
-SKIP_IDENTICAL_CONTENT=false \
-SKIP_IF_UPDATED_DAYS=7 \
-./gemini-grc
+./dist/crawler \
+-pgurl="postgres://test:test@127.0.0.1:5434/test?sslmode=disable" \
+-log-level=info \
+-workers=10 \
+-blacklist-path="./blacklist.txt" \
+-whitelist-path="./whitelist.txt" \
+-max-response-size=10485760 \
+-response-timeout=10 \
+-max-db-connections=100 \
+-skip-if-updated-days=7 \
+-gopher \
+-seed-url-path="./seed_urls.txt"
 ```

 ## Development
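The updated "How to run" section expects a PostgreSQL instance plus the schema from `misc/sql/initdb.sql`. A minimal sketch of that setup, reusing the connection details from the README example (port 5434, user/password/database `test`) and assuming Docker and `psql` are available; the container name is illustrative:

```shell
# Throwaway PostgreSQL matching the example -pgurl (any other instance works too).
docker run --rm -d --name gemini-grc-db \
  -e POSTGRES_USER=test -e POSTGRES_PASSWORD=test -e POSTGRES_DB=test \
  -p 5434:5432 postgres:16

# Load the crawler schema referenced in the README.
psql "postgres://test:test@127.0.0.1:5434/test?sslmode=disable" \
  -f misc/sql/initdb.sql
```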
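Before a full crawl, a smaller first run with the documented `-dry-run` flag can verify connectivity; this invocation is illustrative only and not part of the README:

```shell
# Minimal run: one worker (the default), dry-run mode, seeded from the same
# seed file used in the README example.
./dist/crawler \
  -pgurl="postgres://test:test@127.0.0.1:5434/test?sslmode=disable" \
  -seed-url-path="./seed_urls.txt" \
  -dry-run
```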