gemini-grc
A Gemini crawler.
Both the URLs queued for crawling and the data fetched from visited URLs are stored as "snapshots" in the database, so the crawler can be readily extended into a "wayback machine" for Gemini.
Done
- Concurrent downloading with workers
- Concurrent connection limit per host
- URL Blacklist
- Save image/* and text/* files
- Configuration via environment variables
- Storing snapshots in PostgreSQL
- Proper response header & body UTF-8 and format validation
- Follow robots.txt, see gemini://geminiprotocol.net/docs/companion/robots.gmi
- Handle redirects (3X status codes)
- Better URL normalization
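The header validation, redirect, and URL normalization items above could look roughly like the following sketch. It is an illustration under stated assumptions, not the project's actual code: `parseHeader`, `normalizeURL`, and `resolveRedirect` are hypothetical names, and the normalization rules shown (lowercase host, drop the default port 1965, treat an empty path as `/`, strip the fragment) are one plausible reading of "better URL normalization". The only protocol facts relied on are that a Gemini response header has the form `<STATUS><SPACE><META>` with a two-digit status, and that for 3X statuses the META field carries the redirect target.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

// parseHeader splits a Gemini response header of the form "<STATUS> <META>".
func parseHeader(line string) (int, string, error) {
	line = strings.TrimRight(line, "\r\n")
	if len(line) < 2 {
		return 0, "", errors.New("header too short")
	}
	status, err := strconv.Atoi(line[:2])
	if err != nil || status < 10 || status > 69 {
		return 0, "", fmt.Errorf("invalid status in header %q", line)
	}
	if len(line) > 2 && line[2] != ' ' {
		return 0, "", fmt.Errorf("missing space after status in %q", line)
	}
	return status, strings.TrimPrefix(line[2:], " "), nil
}

// normalizeURL applies the assumed normalization rules to an absolute gemini URL.
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return "", err
	}
	if strings.ToLower(u.Scheme) != "gemini" || u.Hostname() == "" {
		return "", fmt.Errorf("not an absolute gemini URL: %q", raw)
	}
	u.Scheme = "gemini"
	host := strings.ToLower(u.Hostname())
	if strings.Contains(host, ":") {
		host = "[" + host + "]" // restore brackets around IPv6 literals
	}
	if port := u.Port(); port != "" && port != "1965" {
		host += ":" + port // keep only non-default ports
	}
	u.Host = host
	if u.Path == "" {
		u.Path = "/"
	}
	u.Fragment, u.RawFragment = "", ""
	return u.String(), nil
}

// resolveRedirect resolves the META of a 3X response against the URL that
// was requested, then normalizes the result.
func resolveRedirect(base, meta string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(meta)
	if err != nil {
		return "", err
	}
	return normalizeURL(b.ResolveReference(ref).String())
}

func main() {
	status, meta, err := parseHeader("31 /new/location.gmi\r\n")
	if err != nil {
		fmt.Println("bad header:", err)
		return
	}
	if status/10 == 3 { // any 3X status is a redirect
		next, err := resolveRedirect("gemini://Example.org:1965/old/page.gmi", meta)
		fmt.Println(next, err)
	}
}
```

A real crawler would also cap the number of consecutive redirects it follows and check the redirect target against the blacklist and robots.txt rules before queuing it.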
TODO
- Add snapshot hash and support snapshot history
- Add web interface
- Provide a TLS cert for sites that require it, like Astrobotany
TODO with lower priority
- Gopher
- Scroll, see gemini://auragem.letz.dev/devlog/20240316.gmi
- Spartan
- Nex
- SuperTXT, see https://supertxt.net/00-intro.html