Update license and readme.
14
COPYING
@@ -1,14 +0,0 @@
Copyright (c) Antanst

Permission to use, copy, modify, and distribute this software for any
purpose with or without fee is hereby granted, provided that the above
copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
15
LICENSE
Normal file
@@ -0,0 +1,15 @@
ISC License

Copyright (c) Antanst 2014-2015

Permission to use, copy, modify, and distribute this software for any
purpose with or without fee is hereby granted, provided that the above
copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
76
README.md
@@ -1,27 +1,83 @@
# gemini-grc

A crawler for the [Gemini](https://en.wikipedia.org/wiki/Gemini_(protocol)) network. Easily extendable as a "wayback machine" of Gemini.
A crawler for the [Gemini](https://en.wikipedia.org/wiki/Gemini_(protocol)) network.
Easily extendable as a "wayback machine" of Gemini.

## Features done
- [x] URL normalization
- [x] Handle redirects (3X status codes)
## Features
- [x] Follow robots.txt, see gemini://geminiprotocol.net/docs/companion/robots.gmi
- [x] Save image/* and text/* files
- [x] Concurrent downloading with workers
- [x] Concurrent downloading with configurable number of workers
- [x] Connection limit per host
- [x] URL Blacklist
- [x] Configuration via environment variables
- [x] Storing snapshots in PostgreSQL
- [x] Storing capsule snapshots in PostgreSQL
- [x] Proper response header & body UTF-8 and format validation
- [x] Proper URL normalization
- [x] Handle redirects (3X status codes)

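As an illustration of the header validation and redirect handling listed above, here is a minimal Go sketch. It assumes the Gemini response header form `<STATUS><SPACE><META><CR><LF>` with a two-digit status. The names `Header`, `parseHeader`, and `isRedirect` are hypothetical and are not taken from the gemini-grc source; the crawler's real checks may be stricter.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

// Header is a hypothetical type for this sketch, not a gemini-grc type.
type Header struct {
	Status int
	Meta   string
}

func isDigit(b byte) bool { return b >= '0' && b <= '9' }

// parseHeader validates and parses one Gemini response header line,
// assumed to look like "<STATUS><SPACE><META>\r\n".
func parseHeader(line string) (Header, error) {
	if !utf8.ValidString(line) {
		return Header{}, errors.New("header is not valid UTF-8")
	}
	if !strings.HasSuffix(line, "\r\n") {
		return Header{}, errors.New("header does not end with CRLF")
	}
	line = strings.TrimSuffix(line, "\r\n")
	if len(line) < 2 || !isDigit(line[0]) || !isDigit(line[1]) {
		return Header{}, errors.New("status is not two digits")
	}
	status := int(line[0]-'0')*10 + int(line[1]-'0')
	meta := ""
	if len(line) > 2 {
		if line[2] != ' ' {
			return Header{}, errors.New("missing space after status")
		}
		meta = line[3:]
	}
	return Header{Status: status, Meta: meta}, nil
}

// isRedirect reports whether the status is in the 3X range.
func isRedirect(status int) bool { return status >= 30 && status <= 39 }

func main() {
	h, err := parseHeader("31 gemini://example.org/new\r\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if isRedirect(h.Status) {
		// For 3X responses, META carries the redirect target.
		fmt.Println("redirect to", h.Meta)
	}
}
```

For a 3X status the META field holds the new URL, which a crawler would normalize and queue rather than follow blindly, so that redirect loops can be bounded.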
## How to run

Spin up a PostgreSQL instance, create the tables using `db/sql/initdb.sql`, and start the crawler.
All configuration is done via environment variables.

## Configuration

Boolean values can be `true`/`false` or `1`/`0`.

```text
LogLevel               string // Logging level (debug, info, warn, error)
MaxResponseSize        int    // Maximum size of response in bytes
NumOfWorkers           int    // Number of concurrent workers
ResponseTimeout        int    // Timeout for responses in seconds
WorkerBatchSize        int    // Batch size for worker processing
PanicOnUnexpectedError bool   // Panic on unexpected errors when visiting a URL
BlacklistPath          string // File that has blacklisted strings of "host:port"
DryRun                 bool   // If true, don't write to disk
PrintWorkerStatus      bool   // If false, print logs and not worker status table
```

Example:

```shell
# The file at BLACKLIST_PATH has one url per line and can be empty.
LOG_LEVEL=info \
NUM_OF_WORKERS=10 \
WORKER_BATCH_SIZE=10 \
BLACKLIST_PATH="./blacklist.txt" \
MAX_RESPONSE_SIZE=10485760 \
RESPONSE_TIMEOUT=10 \
PANIC_ON_UNEXPECTED_ERROR=true \
PG_DATABASE=test \
PG_HOST=127.0.0.1 \
PG_MAX_OPEN_CONNECTIONS=100 \
PG_PORT=5434 \
PG_USER=test \
PG_PASSWORD=test \
DRY_RUN=false \
./gemini-grc
```

## Development

Install linters. Check the versions first.
```shell
go install mvdan.cc/gofumpt@v0.7.0
go install github.com/golangci/golangci-lint/cmd/golangci-lint@v1.63.4
```

## TODO
- [ ] Add snapshot history
- [ ] Add a web interface
- [ ] Provide a TLS client certificate to servers that require one, such as Astrobotany
- [ ] Use pledge/unveil on OpenBSD hosts

## TODO (lower priority)
- [ ] Gopher
- [ ] Scroll gemini://auragem.letz.dev/devlog/20240316.gmi
- [ ] Spartan
- [ ] Nex
- [ ] SuperTXT https://supertxt.net/00-intro.html
- [ ] More? http://dbohdan.sdf.org/smolnet/

## Notes
Good starting points:

gemini://warmedal.se/~antenna/
gemini://tlgs.one/
gopher://i-logout.cz:70/1/bongusta/
gopher://gopher.quux.org:70/