gemini-grc

A Gemini crawler.

Both the URLs queued for crawling and the data fetched from visited URLs are stored as "snapshots" in the database. This makes the crawler easy to extend into a "wayback machine" for Gemini.

Done

  • Concurrent downloading with workers
  • Concurrent connection limit per host
  • URL Blacklist
  • Save image/* and text/* files
  • Configuration via environment variables
  • Storing snapshots in PostgreSQL
  • Validation of response headers and bodies (UTF-8 encoding and format)
  • Follow robots.txt, see gemini://geminiprotocol.net/docs/companion/robots.gmi
  • Handle redirects (3X status codes)
  • Better URL normalization
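
As a rough illustration of the header validation and redirect handling listed above, the following sketch parses a single Gemini response header line of the form `<STATUS><SPACE><META><CR><LF>`. The `Header` type and the `ParseHeader` and `IsRedirect` names are hypothetical and chosen for this example; they are not taken from the gemini-grc code.

```go
package main

import (
	"errors"
	"strconv"
	"strings"
	"unicode/utf8"
)

// Header is a hypothetical type for this sketch.
type Header struct {
	Status int    // two-digit Gemini status code, e.g. 20 or 31
	Meta   string // MIME type for 2x responses, target URL for 3x responses
}

// ParseHeader validates one raw header line (including the trailing CRLF)
// and splits it into status and meta.
func ParseHeader(line string) (Header, error) {
	if !utf8.ValidString(line) {
		return Header{}, errors.New("header is not valid UTF-8")
	}
	if !strings.HasSuffix(line, "\r\n") {
		return Header{}, errors.New("header does not end with CRLF")
	}
	line = strings.TrimSuffix(line, "\r\n")
	if len(line) < 2 {
		return Header{}, errors.New("header too short")
	}
	// The status is exactly two digits; valid ranges are 1x through 6x.
	status, err := strconv.Atoi(line[:2])
	if err != nil || status < 10 || status > 69 {
		return Header{}, errors.New("invalid status code")
	}
	meta := ""
	if len(line) > 2 {
		if line[2] != ' ' {
			return Header{}, errors.New("missing space after status code")
		}
		meta = line[3:]
	}
	return Header{Status: status, Meta: meta}, nil
}

// IsRedirect reports whether the status is in the 3x range, in which case
// Meta holds the URL to follow.
func (h Header) IsRedirect() bool {
	return h.Status >= 30 && h.Status <= 39
}
```

A crawler built this way would call `ParseHeader` on the first line of each response and, when `IsRedirect` returns true, queue the URL in `Meta` (ideally with a cap on the number of consecutive redirects).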

TODO

  • Add snapshot hash and support snapshot history
  • Add web interface
  • Provide a TLS cert for sites that require it, like Astrobotany

TODO with lower priority
