Add robots.txt checking

Still needs periodic cache refresh
2024-10-23 14:24:10 +03:00
parent 1ac250ca6e
commit 094394afc2
5 changed files with 108 additions and 37 deletions


@@ -10,10 +10,10 @@ A Gemini crawler.
- [x] Configuration via environment variables
- [x] Storing snapshots in PostgreSQL
- [x] Proper response header & body UTF-8 and format validation
- [x] Follow robots.txt
## TODO
- [ ] Follow robots.txt gemini://geminiprotocol.net/docs/companion/
- [ ] Test with gemini://alexey.shpakovsky.ru/maze
- [ ] Take into account gemini://geminiprotocol.net/docs/companion/robots.gmi
- [ ] Proper handling of all response codes
- [ ] Handle 3X redirects properly
- [ ] Handle URLs that need presentation of a TLS cert, like astrobotany
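
The robots.txt item in the checklist above, together with the cache refresh that the commit message leaves open, could look roughly like the sketch below. It is an illustration, not the code in this commit. The names `parse_disallowed`, `RobotsCache`, `fetch`, and `ttl_seconds` are hypothetical, and so is the assumption that `fetch` returns the robots.txt body as text (or an empty string when the file is missing). The sketch handles only `User-agent` and `Disallow` lines with plain prefix matching, and it refreshes an entry lazily once it is older than the TTL instead of running a background job.

```python
import time
from typing import Callable, Dict, List, Tuple
from urllib.parse import urlsplit


def parse_disallowed(robots_body: str, agents: Tuple[str, ...] = ("*",)) -> List[str]:
    """Collect Disallow prefixes from groups whose User-agent is in `agents`."""
    disallowed: List[str] = []
    group_agents: List[str] = []  # agents named by the current group
    in_rules = False              # True once the current group has seen a rule
    for raw in robots_body.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                group_agents, in_rules = [], False
            group_agents.append(value.lower())
        elif key == "disallow":
            in_rules = True
            # An empty Disallow value means "nothing is disallowed".
            if value and any(agent in agents for agent in group_agents):
                disallowed.append(value)
    return disallowed


class RobotsCache:
    """Per-host cache of Disallow prefixes, refetched after `ttl_seconds`."""

    def __init__(
        self,
        fetch: Callable[[str], str],         # hypothetical: URL -> robots.txt text
        ttl_seconds: float = 3600.0,         # hypothetical default lifetime
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def allowed(self, url: str) -> bool:
        parts = urlsplit(url)
        host = parts.netloc.lower()
        now = self._clock()
        entry = self._entries.get(host)
        # Fetch on first use, and again once the cached entry has expired.
        if entry is None or now - entry[0] >= self._ttl:
            body = self._fetch(f"gemini://{host}/robots.txt")
            entry = (now, parse_disallowed(body))
            self._entries[host] = entry
        path = parts.path or "/"
        return not any(path.startswith(prefix) for prefix in entry[1])
```

Passing the clock and the fetch function in as parameters keeps the sketch testable without a network connection. A crawler would call `cache.allowed(url)` before each request and skip the URL when it returns `False`.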