gemini-grc
A Gemini crawler.
URLs to visit, as well as data from visited URLs, are stored as "snapshots" in the database.
Done
- Concurrent downloading with workers
- Concurrent connection limit per host
- URL Blacklist
- Save image/* and text/* files
- Configuration via environment variables
- Storing snapshots in PostgreSQL
- Validation of response headers and bodies (UTF-8 encoding and format)
- Follow robots.txt, see gemini://geminiprotocol.net/docs/companion/robots.gmi
- Handle redirects (3X status codes)
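
To illustrate the header validation and redirect handling listed above, here is a minimal sketch in Go. It is not taken from the gemini-grc source: the `Header` type and the `ParseHeader` and `IsRedirect` names are hypothetical, and the only thing assumed from the Gemini protocol is the `<STATUS><SPACE><META><CR><LF>` header layout with a two-digit status whose first digit is 1 to 6.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

// Header is a hypothetical representation of a parsed Gemini response header.
type Header struct {
	Status int    // two-digit status code, e.g. 20 (success) or 31 (redirect)
	Meta   string // MIME type, redirect target, or error text, depending on Status
}

// ParseHeader validates a raw header line of the form "<STATUS> <META>\r\n".
// It rejects invalid UTF-8, a missing CRLF, and malformed status codes.
func ParseHeader(line string) (Header, error) {
	if !utf8.ValidString(line) {
		return Header{}, errors.New("header is not valid UTF-8")
	}
	if !strings.HasSuffix(line, "\r\n") {
		return Header{}, errors.New("header must end with CRLF")
	}
	line = strings.TrimSuffix(line, "\r\n")
	if len(line) < 2 || line[0] < '1' || line[0] > '6' || line[1] < '0' || line[1] > '9' {
		return Header{}, errors.New("status must be two digits in the range 10-69")
	}
	rest := line[2:]
	if rest != "" && rest[0] != ' ' {
		return Header{}, errors.New("status must be followed by a space")
	}
	status := int(line[0]-'0')*10 + int(line[1]-'0')
	return Header{Status: status, Meta: strings.TrimPrefix(rest, " ")}, nil
}

// IsRedirect reports whether the status is in the 3x range.
// For redirects, Meta holds the target URL.
func (h Header) IsRedirect() bool {
	return h.Status/10 == 3
}

func main() {
	h, err := ParseHeader("31 gemini://example.org/new\r\n")
	if err != nil {
		fmt.Println("invalid header:", err)
		return
	}
	fmt.Println(h.Status, h.Meta, h.IsRedirect())
}
```

A real crawler would also need to cap the number of redirects it follows and validate the body separately; this sketch covers only the header line.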
TODO
- Better URL normalization
- Provide a TLS cert for sites that require it, like Astrobotany
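
As a starting point for the URL normalization item above, the following Go sketch shows one possible approach. It is an assumption about what "better" normalization might include, not the project's implementation: the `Normalize` function is hypothetical, and the rules chosen here (lowercase host, drop the default port 1965, resolve `.` and `..` segments, strip fragments, treat an empty path as `/`) are illustrative only.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
	"path"
	"strings"
)

// Normalize returns a canonical form of a gemini:// URL so that equivalent
// URLs map to the same string. Hypothetical helper; the rules are assumptions.
func Normalize(raw string) (string, error) {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return "", err
	}
	// url.Parse already lowercases the scheme.
	if u.Scheme != "gemini" || u.Host == "" {
		return "", errors.New("not an absolute gemini URL")
	}

	// Lowercase the host and drop the default Gemini port (1965).
	host := strings.ToLower(u.Hostname())
	if port := u.Port(); port != "" && port != "1965" {
		host = net.JoinHostPort(host, port)
	} else if strings.Contains(host, ":") {
		host = "[" + host + "]" // restore brackets around an IPv6 literal
	}
	u.Host = host

	// Fragments are never sent to the server, so they do not identify a resource.
	u.Fragment = ""

	// Resolve "." and ".." segments, keeping a trailing slash if one was present.
	if u.Path == "" {
		u.Path = "/"
	} else {
		cleaned := path.Clean(u.Path)
		if strings.HasSuffix(u.Path, "/") && cleaned != "/" {
			cleaned += "/"
		}
		u.Path = cleaned
	}
	return u.String(), nil
}

func main() {
	n, err := Normalize("GEMINI://Example.ORG:1965/a/../b/#top")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(n)
}
```

This sketch does not touch percent-encoding or query strings, both of which a fuller normalization would need to consider.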
TODO for later
- Gopher
- Scroll gemini://auragem.letz.dev/devlog/20240316.gmi
- Spartan
- Nex
- SuperTXT https://supertxt.net/00-intro.html