diff --git a/README.md b/README.md
index 9b17907..2464823 100644
--- a/README.md
+++ b/README.md
@@ -2,6 +2,8 @@
 
 A Gemini crawler.
 
+Both URLs to visit and data from visited URLs are stored as "snapshots" in the database.
+
 ## Done
 - [x] Concurrent downloading with workers
 - [x] Concurrent connection limit per host
@@ -10,22 +12,16 @@ A Gemini crawler.
 - [x] Configuration via environment variables
 - [x] Storing snapshots in PostgreSQL
 - [x] Proper response header & body UTF-8 and format validation
-- [x] Follow robots.txt
+- [x] Follow robots.txt, see gemini://geminiprotocol.net/docs/companion/robots.gmi
+- [x] Handle redirects (3X status codes)
 
 ## TODO
-- [ ] Take into account gemini://geminiprotocol.net/docs/companion/robots.gmi
-- [ ] Proper handling of all response codes
-  - [ ] Handle 3X redirects properly
-- [ ] Handle URLs that need presentation of a TLS cert, like astrobotany
-  + [ ] Probably have a common "grc" cert for all?
-- [ ] Proper input and response validations:
-  + [ ] When making a request, the URI MUST NOT exceed 1024 bytes
-- [ ] Subscriptions to gemini pages? gemini://geminiprotocol.net/docs/companion/
+- [ ] Better URL normalization
+- [ ] Provide a TLS cert for sites that require it, like Astrobotany
 
 ## TODO for later
-- [ ] Add other protocols
-  + [ ] Gopher
-  + [ ] Scroll gemini://auragem.letz.dev/devlog/20240316.gmi
-  + [ ] Spartan
-  + [ ] Nex
-  + [ ] SuperTXT https://supertxt.net/00-intro.html
+- [ ] Gopher
+- [ ] Scroll gemini://auragem.letz.dev/devlog/20240316.gmi
+- [ ] Spartan
+- [ ] Nex
+- [ ] SuperTXT https://supertxt.net/00-intro.html
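
The diff's one prose addition — snapshots holding both pending URLs and fetched data — is the crawler's central data model, so a concrete illustration may help. The sketches below assume the crawler is written in Go (the README never says), and every identifier in them is invented for illustration, not taken from the codebase. First, one plausible shape for the snapshots table in the PostgreSQL store the Done list mentions:

```go
package main

import "fmt"

// One possible shape for the snapshots table: rows with fetched_at IS NULL
// are URLs still waiting to be visited; filled-in rows hold the captured
// response. Every column name here is hypothetical.
const createSnapshots = `
CREATE TABLE IF NOT EXISTS snapshots (
    id          BIGSERIAL PRIMARY KEY,
    url         TEXT NOT NULL,
    status_code INT,         -- Gemini status, NULL until fetched
    mime_type   TEXT,        -- from the response header, e.g. text/gemini
    body        BYTEA,       -- response body (UTF-8 validated before storage)
    fetched_at  TIMESTAMPTZ  -- NULL marks a pending URL
);
CREATE INDEX IF NOT EXISTS snapshots_pending ON snapshots (url)
    WHERE fetched_at IS NULL;`

func main() {
	fmt.Println(createSnapshots) // wiring this into database/sql is omitted
}
```

A single table keeps the frontier and the archive together — rows awaiting a fetch *are* the crawl queue — which is the reading the new README sentence suggests.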
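For the robots.txt item: the companion spec linked above serves robots.txt at `gemini://<host>/robots.txt` and layers virtual user-agents (such as `indexer`) on top of the familiar `User-agent:`/`Disallow:` format. A minimal parser sketch, with the same caveats as above:

```go
package main

import (
	"fmt"
	"strings"
)

// disallowedPrefixes pulls the Disallow path prefixes that apply to a given
// (virtual) user-agent out of a robots.txt body. Hypothetical helper; a "*"
// group applies to everyone, and consecutive User-agent lines share a group.
func disallowedPrefixes(robotsTxt, agent string) []string {
	var prefixes []string
	applies, prevAgentLine := false, false
	for _, raw := range strings.Split(robotsTxt, "\n") {
		line := strings.TrimSpace(raw)
		if name, ok := strings.CutPrefix(line, "User-agent:"); ok {
			if !prevAgentLine {
				applies = false // a new group starts; forget the previous one
			}
			name = strings.TrimSpace(name)
			if name == "*" || name == agent {
				applies = true
			}
			prevAgentLine = true
			continue
		}
		prevAgentLine = false
		if p, ok := strings.CutPrefix(line, "Disallow:"); ok && applies {
			if p = strings.TrimSpace(p); p != "" {
				prefixes = append(prefixes, p)
			}
		}
	}
	return prefixes
}

func main() {
	robots := "User-agent: indexer\nDisallow: /private/\nDisallow: /drafts/"
	fmt.Println(disallowedPrefixes(robots, "indexer")) // [/private/ /drafts/]
}
```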
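For the "Handle redirects (3X status codes)" item: a Gemini request is just the absolute URL plus CRLF on port 1965, and the response header is `<STATUS><SPACE><META>`; for a 3X status, META is the redirect target, possibly relative. One way a redirect-following fetch could look — again a hypothetical sketch, with body reading and the TOFU certificate checks a real crawler needs left out:

```go
package main

import (
	"bufio"
	"crypto/tls"
	"errors"
	"fmt"
	"net"
	"net/url"
	"strings"
)

const maxRedirects = 5 // assumed cap; some limit is needed to avoid loops

// fetchHeader requests a Gemini URL and follows 3X redirects until it gets
// a non-redirect response header. Not the crawler's actual code.
func fetchHeader(rawURL string) (status, meta string, err error) {
	for hop := 0; hop < maxRedirects; hop++ {
		u, err := url.Parse(rawURL)
		if err != nil {
			return "", "", err
		}
		host := u.Host
		if u.Port() == "" {
			host = net.JoinHostPort(u.Hostname(), "1965") // Gemini default port
		}
		// Gemini servers commonly use self-signed certs (trust-on-first-use),
		// so CA verification is skipped here; real code should pin fingerprints.
		conn, err := tls.Dial("tcp", host, &tls.Config{InsecureSkipVerify: true})
		if err != nil {
			return "", "", err
		}
		fmt.Fprintf(conn, "%s\r\n", rawURL) // a request is just the URL + CRLF
		header, err := bufio.NewReader(conn).ReadString('\n')
		conn.Close()
		if err != nil {
			return "", "", err
		}
		status, meta, _ = strings.Cut(strings.TrimRight(header, "\r\n"), " ")
		if !strings.HasPrefix(status, "3") {
			return status, meta, nil // 2X success, 4X/5X failures, etc.
		}
		next, err := u.Parse(meta) // 3X: META is the target, possibly relative
		if err != nil {
			return "", "", err
		}
		rawURL = next.String()
	}
	return "", "", errors.New("too many redirects: " + rawURL)
}

func main() {
	status, meta, err := fetchHeader("gemini://geminiprotocol.net/")
	fmt.Println(status, meta, err)
}
```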
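For the remaining TLS-cert TODO: in Go this would amount to attaching one shared client certificate — the common "grc" cert the old TODO floated — to the `tls.Config`. File paths and the host below are placeholders:

```go
package main

import (
	"crypto/tls"
	"fmt"
)

func main() {
	// Load one shared client identity; paths are placeholders.
	cert, err := tls.LoadX509KeyPair("grc.crt", "grc.key")
	if err != nil {
		panic(err)
	}
	cfg := &tls.Config{
		Certificates:       []tls.Certificate{cert}, // presented when the server asks
		InsecureSkipVerify: true,                    // TOFU again, not CA validation
	}
	// Astrobotany-style sites answer status 60 ("client certificate required")
	// until a cert is presented. Host is illustrative.
	conn, err := tls.Dial("tcp", "astrobotany.mozz.us:1965", cfg)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	fmt.Fprintf(conn, "gemini://astrobotany.mozz.us/\r\n")
}
```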