gemini-grs/README.md
2024-12-09 19:54:15 +02:00

gemini-grc

A Gemini crawler.

Both the URLs queued for crawling and the data fetched from visited URLs are stored as "snapshots" in the database.

Done

  • Concurrent downloading with workers
  • Concurrent connection limit per host
  • URL Blacklist
  • Save image/* and text/* files
  • Configuration via environment variables
  • Storing snapshots in PostgreSQL
  • Validation of response header and body (UTF-8 encoding and format)
  • Follow robots.txt, see gemini://geminiprotocol.net/docs/companion/robots.gmi
  • Handle redirects (3X status codes)
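
As an illustration of the header validation, redirect handling, and `text/*` / `image/*` filtering listed above, here is a minimal Python sketch. It is not taken from this project's code: the names `GeminiHeader`, `parse_header`, `should_save`, and `MAX_META_BYTES` are hypothetical, and the 1024-byte META limit is an assumption based on earlier revisions of the Gemini specification. It relies only on the protocol's response header format, `<STATUS><SPACE><META><CR><LF>`, where the status is two digits and the first digit is between 1 and 6.

```python
from dataclasses import dataclass

# Assumption: META length cap taken from earlier Gemini spec revisions.
MAX_META_BYTES = 1024


@dataclass
class GeminiHeader:
    """Hypothetical container for a parsed Gemini response header."""
    status: int
    meta: str

    @property
    def is_success(self) -> bool:
        return 20 <= self.status <= 29

    @property
    def is_redirect(self) -> bool:
        # 3x statuses carry the redirect target URL in META.
        return 30 <= self.status <= 39


def parse_header(raw: bytes) -> GeminiHeader:
    """Validate and parse the header line at the start of a raw response."""
    line, sep, _body = raw.partition(b"\r\n")
    if not sep:
        raise ValueError("header is not terminated by CRLF")
    try:
        text = line.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError("header is not valid UTF-8") from exc

    status, _, meta = text.partition(" ")
    # Status must be exactly two ASCII digits, first digit in 1..6.
    if len(status) != 2 or not (status.isascii() and status.isdigit()):
        raise ValueError(f"malformed status code: {status!r}")
    if status[0] not in "123456":
        raise ValueError(f"unknown status category: {status!r}")
    if len(meta.encode("utf-8")) > MAX_META_BYTES:
        raise ValueError("META field is too long")
    return GeminiHeader(status=int(status), meta=meta)


def should_save(header: GeminiHeader) -> bool:
    """Return True for successful responses with a text/* or image/* MIME type."""
    if not header.is_success:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = header.meta.split(";", 1)[0].strip().lower()
    return mime.startswith("text/") or mime.startswith("image/")
```

A crawler built along these lines could call `parse_header` on the first bytes of each response, enqueue `header.meta` as a new URL when `is_redirect` is true, and store the body only when `should_save` returns true.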

TODO

  • Better URL normalization
  • Provide a TLS client certificate for sites that require one, such as Astrobotany

TODO for later