gemini-grc/NOTES.md
antanst dfb050588c Update documentation and project configuration
- Add architecture documentation for versioned snapshots
- Update Makefile with improved build commands
- Update dependency versions in go.mod
- Add project notes and development guidelines
- Improve README with new features and instructions
2025-05-22 13:26:11 +03:00

Notes

Avoiding endless loops while crawling

  • Honor each host's robots.txt.
  • Announce our own agent name so that site operators can block us in their robots.txt.
  • Limit the number of pages crawled per host, and send a notification when the limit is reached.
  • Limit the number of redirects followed per request (possibly not needed?).
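
The page and redirect limits above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `HostLimiter`, `NewHostLimiter`, `Allow`, `FollowRedirects`, the `next` callback, and the limit values are all hypothetical names and numbers chosen for the example.

```go
package main

import (
	"fmt"
	"net/url"
	"sync"
)

// HostLimiter counts pages fetched per host and refuses further pages once
// a host reaches maxPages. Hypothetical type, not taken from gemini-grc.
type HostLimiter struct {
	mu       sync.Mutex
	maxPages int
	counts   map[string]int
	notify   func(host string) // called once, when a host first hits the limit
}

func NewHostLimiter(maxPages int, notify func(host string)) *HostLimiter {
	return &HostLimiter{maxPages: maxPages, counts: make(map[string]int), notify: notify}
}

// Allow reports whether another page from rawURL's host may be fetched and,
// if so, records the visit.
func (l *HostLimiter) Allow(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil || u.Hostname() == "" {
		return false
	}
	host := u.Hostname()

	l.mu.Lock()
	if l.counts[host] >= l.maxPages {
		l.mu.Unlock()
		return false
	}
	l.counts[host]++
	reached := l.counts[host] == l.maxPages
	l.mu.Unlock()

	// Notify outside the lock so a slow callback does not block other callers.
	if reached && l.notify != nil {
		l.notify(host)
	}
	return true
}

// FollowRedirects walks a redirect chain and gives up after maxRedirects hops.
// next is an assumed callback that returns the redirect target for a URL,
// or "" when the URL does not redirect.
func FollowRedirects(start string, maxRedirects int, next func(string) string) (string, error) {
	current := start
	for hops := 0; ; hops++ {
		target := next(current)
		if target == "" {
			return current, nil
		}
		if hops >= maxRedirects {
			return "", fmt.Errorf("too many redirects (limit %d) starting at %s", maxRedirects, start)
		}
		current = target
	}
}

func main() {
	limiter := NewHostLimiter(2, func(host string) {
		fmt.Println("page limit reached for", host)
	})
	for _, u := range []string{
		"gemini://example.org/a",
		"gemini://example.org/b",
		"gemini://example.org/c",
	} {
		fmt.Println(u, limiter.Allow(u))
	}
}
```

A redirect loop such as A → B → A never returns "" from `next`, so the hop counter is what ends it. Even with a redirect limit, the per-host page limit is still useful, because it also bounds hosts that generate endless distinct URLs.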

Heuristics:

  • Do not parse links from pages whose URLs contain '/git/', '/cgi/', or '/cgi-bin/'.
  • Keep a list of "whitelisted" hosts/URLs that we visit at regular intervals.
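
A rough sketch of both heuristics, under stated assumptions: the names `ShouldParseLinks`, `skipSegments`, `Whitelisted`, and `Due` are hypothetical, the match is applied to the URL path only (the notes say "in their URLs", so matching the whole URL is an equally valid reading), and the revisit interval is an arbitrary example value.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
	"time"
)

// skipSegments holds the path fragments named in the notes.
var skipSegments = []string{"/git/", "/cgi/", "/cgi-bin/"}

// ShouldParseLinks reports whether links should be extracted from the page
// at rawURL. Unparseable URLs are skipped as well.
func ShouldParseLinks(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	for _, seg := range skipSegments {
		if strings.Contains(u.Path, seg) {
			return false
		}
	}
	return true
}

// Whitelisted is a URL that is revisited on a fixed schedule.
type Whitelisted struct {
	URL       string
	Interval  time.Duration
	LastVisit time.Time
}

// Due reports whether at least Interval has elapsed since LastVisit.
func (w Whitelisted) Due(now time.Time) bool {
	return now.Sub(w.LastVisit) >= w.Interval
}

func main() {
	for _, u := range []string{
		"gemini://example.org/git/repo/tree",
		"gemini://example.org/cgi-bin/search?q=x",
		"gemini://example.org/gemlog/index.gmi",
	} {
		fmt.Println(u, ShouldParseLinks(u))
	}

	w := Whitelisted{
		URL:       "gemini://example.org/",
		Interval:  24 * time.Hour,
		LastVisit: time.Now().Add(-25 * time.Hour),
	}
	fmt.Println(w.URL, "due:", w.Due(time.Now()))
}
```

Note that the page itself can still be fetched and stored. Only link extraction is skipped, which keeps the crawler from descending into generated trees such as repository browsers or CGI output.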