diff --git a/COPYING b/COPYING
deleted file mode 100644
index 6beff53..0000000
--- a/COPYING
+++ /dev/null
@@ -1,14 +0,0 @@
-
- Copyright (c) Antanst
-
- Permission to use, copy, modify, and distribute this software for any
- purpose with or without fee is hereby granted, provided that the above
- copyright notice and this permission notice appear in all copies.
-
- THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
- WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
- MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
- ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
- WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
- ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
- OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..3145076
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,15 @@
+ISC License
+
+Copyright (c) Antanst 2014-2015
+
+Permission to use, copy, modify, and distribute this software for any
+purpose with or without fee is hereby granted, provided that the above
+copyright notice and this permission notice appear in all copies.
+
+THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
diff --git a/README.md b/README.md
index 800cab5..a2aab69 100644
--- a/README.md
+++ b/README.md
@@ -1,27 +1,83 @@
 # gemini-grc

-A crawler for the [Gemini](https://en.wikipedia.org/wiki/Gemini_(protocol)) network. 
Easily extendable as a "wayback machine" of Gemini.
+A crawler for the [Gemini](https://en.wikipedia.org/wiki/Gemini_(protocol)) network.
+Easily extendable as a "wayback machine" of Gemini.

-## Features done
-- [x] URL normalization
-- [x] Handle redirects (3X status codes)
+## Features
 - [x] Follow robots.txt, see gemini://geminiprotocol.net/docs/companion/robots.gmi
 - [x] Save image/* and text/* files
-- [x] Concurrent downloading with workers
+- [x] Concurrent downloading with configurable number of workers
 - [x] Connection limit per host
 - [x] URL Blacklist
 - [x] Configuration via environment variables
-- [x] Storing snapshots in PostgreSQL
+- [x] Storing capsule snapshots in PostgreSQL
 - [x] Proper response header & body UTF-8 and format validation
+- [x] Proper URL normalization
+- [x] Handle redirects (3X status codes)
+
+## How to run
+
+Spin up a PostgreSQL, check `db/sql/initdb.sql` to create the tables and start the crawler.
+All configuration is done via environment variables.
+
+## Configuration
+
+Bool can be `true`,`false` or `0`,`1`.
+
+```text
+ LogLevel string // Logging level (debug, info, warn, error)
+ MaxResponseSize int // Maximum size of response in bytes
+ NumOfWorkers int // Number of concurrent workers
+ ResponseTimeout int // Timeout for responses in seconds
+ WorkerBatchSize int // Batch size for worker processing
+ PanicOnUnexpectedError bool // Panic on unexpected errors when visiting a URL
+ BlacklistPath string // File that has blacklisted strings of "host:port"
+ DryRun bool // If false, don't write to disk
+ PrintWorkerStatus bool // If false, print logs and not worker status table
+```
+
+Example:
+
+```shell
+LOG_LEVEL=info \
+NUM_OF_WORKERS=10 \
+WORKER_BATCH_SIZE=10 \
+BLACKLIST_PATH="./blacklist.txt" \ # one url per line, can be empty
+MAX_RESPONSE_SIZE=10485760 \
+RESPONSE_TIMEOUT=10 \
+PANIC_ON_UNEXPECTED_ERROR=true \
+PG_DATABASE=test \
+PG_HOST=127.0.0.1 \
+PG_MAX_OPEN_CONNECTIONS=100 \
+PG_PORT=5434 \
+PG_USER=test \
+PG_PASSWORD=test \
+DRY_RUN=false \
+./gemini-grc
+```
+
+## Development
+
+Install linters. Check the versions first.
+```shell
+go install mvdan.cc/gofumpt@v0.7.0
+go install github.com/golangci/golangci-lint/cmd/golangci-lint@v1.63.4
+```

 ## TODO
 - [ ] Add snapshot history
 - [ ] Add a web interface
 - [ ] Provide to servers a TLS cert for sites that require it, like Astrobotany
+- [ ] Use pledge/unveil in OpenBSD hosts

 ## TODO (lower priority)
 - [ ] Gopher
-- [ ] Scroll gemini://auragem.letz.dev/devlog/20240316.gmi
-- [ ] Spartan
-- [ ] Nex
-- [ ] SuperTXT https://supertxt.net/00-intro.html
+- [ ] More? http://dbohdan.sdf.org/smolnet/
+
+## Notes
+Good starting points:
+
+gemini://warmedal.se/~antenna/
+gemini://tlgs.one/
+gopher://i-logout.cz:70/1/bongusta/
+gopher://gopher.quux.org:70/
\ No newline at end of file