gemini-grc/README.md
2024-12-27 12:11:35 +02:00

gemini-grc

A Gemini crawler.

Both the URLs still to be visited and the data fetched from visited URLs are stored as "snapshots" in the database. This makes it easy to extend the crawler into a "wayback machine" for Gemini.
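
As a rough illustration of the snapshot idea, the sketch below models a snapshot as a single record that is either pending (URL only) or visited (URL plus response data). The `Snapshot` class, its field names, and the `pending` helper are assumptions made for this example and do not reflect the project's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    """Hypothetical in-memory stand-in for one stored snapshot."""
    url: str
    mime_type: Optional[str] = None  # set once the URL has been visited
    body: Optional[bytes] = None     # response data; None while still pending

    @property
    def visited(self) -> bool:
        # A snapshot counts as visited as soon as it carries response data.
        return self.body is not None


def pending(snapshots: list[Snapshot]) -> list[str]:
    """Return the URLs that still need to be crawled."""
    return [s.url for s in snapshots if not s.visited]


queue = [
    Snapshot("gemini://example.org/"),
    Snapshot("gemini://example.org/about.gmi", "text/gemini", b"# About\n"),
]
```

Keeping pending and visited URLs in one record type means the crawl queue and the archive can share a single table, which is what would make a history view a natural extension.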

Done

  • Concurrent downloading with workers
  • Concurrent connection limit per host
  • URL Blacklist
  • Save image/* and text/* files
  • Configuration via environment variables
  • Storing snapshots in PostgreSQL
  • Validation of response headers and bodies (UTF-8 encoding and format)
  • Follow robots.txt, see gemini://geminiprotocol.net/docs/companion/robots.gmi
  • Handle redirects (3X status codes)
  • Better URL normalization
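
Several of the items above (header validation, redirect handling, saving only image/* and text/* files) revolve around the Gemini response header. The following is a minimal sketch of how such checks might look, assuming the classic header form of a two-digit status, a space, a META string of at most 1024 bytes, and a terminating CRLF. The function names `parse_header`, `is_redirect`, and `should_save` are hypothetical and not taken from this project.

```python
import re

# Assumed header form: two-digit status, one space, META (no CR/LF inside).
_HEADER_RE = re.compile(r"([1-6][0-9]) ([^\r\n]*)")
MAX_META_BYTES = 1024


def parse_header(raw: bytes) -> tuple[int, str]:
    """Validate a raw response header line and return (status, meta)."""
    if not raw.endswith(b"\r\n"):
        raise ValueError("header must end with CRLF")
    try:
        text = raw[:-2].decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError("header is not valid UTF-8") from exc
    match = _HEADER_RE.fullmatch(text)
    if match is None:
        raise ValueError("malformed header")
    status, meta = int(match.group(1)), match.group(2)
    if len(meta.encode("utf-8")) > MAX_META_BYTES:
        raise ValueError("META longer than 1024 bytes")
    return status, meta


def is_redirect(status: int) -> bool:
    """3X status codes are redirects; META then holds the target URL."""
    return 30 <= status <= 39


def should_save(meta: str) -> bool:
    """Keep only text/* and image/* responses, ignoring MIME parameters."""
    mime = meta.split(";", 1)[0].strip().lower()
    return mime.startswith("text/") or mime.startswith("image/")
```

For example, `parse_header(b"31 gemini://example.org/new\r\n")` yields a status for which `is_redirect` is true, and the META value would be the URL to follow.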

TODO

  • Add snapshot hash and support snapshot history
  • Add web interface
  • Provide a TLS cert for sites that require it, like Astrobotany
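
The snapshot hash item is not implemented yet, so the following is only one possible approach, not the project's design: hash each response body and append a new history entry only when the hash differs from the latest one for that URL. The names `snapshot_hash` and `record`, the choice of SHA-256, and the in-memory `history` dictionary are all assumptions made for this sketch.

```python
import hashlib


def snapshot_hash(body: bytes) -> str:
    """Return a hex SHA-256 digest of a response body."""
    return hashlib.sha256(body).hexdigest()


def record(history: dict[str, list[str]], url: str, body: bytes) -> bool:
    """Append a new version for url if the body changed; return True if added."""
    digest = snapshot_hash(body)
    versions = history.setdefault(url, [])
    if versions and versions[-1] == digest:
        return False  # unchanged since the last crawl, nothing to store
    versions.append(digest)
    return True
```

Comparing only against the most recent digest keeps unchanged pages from piling up as duplicate versions while still recording a page that later reverts to earlier content.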

TODO with lower priority