Update documentation and project configuration

- Add architecture documentation for versioned snapshots
- Update Makefile with improved build commands
- Update dependency versions in go.mod
- Add project notes and development guidelines
- Improve README with new features and instructions
2025-05-22 13:26:11 +03:00
committed by antanst
parent a8173544e7
commit 5e6dabf1e7
7 changed files with 193 additions and 38 deletions

NOTES.md Normal file

@@ -0,0 +1,13 @@
# Notes
Avoiding endless loops while crawling
- Make sure we follow robots.txt
- Announce our own agent so people can block us in their robots.txt
- Put a limit on the number of pages per host, and notify when the limit is reached.
- Put a limit on the number of redirects (not needed?)
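
The per-host page limit above could be sketched roughly as follows. This is only an illustration, not the crawler's actual code: `HostLimiter`, `NewHostLimiter`, and `Allow` are hypothetical names, and the sketch assumes an in-memory counter keyed by hostname is enough.

```go
package main

import (
	"net/url"
	"sync"
)

// HostLimiter is a hypothetical helper that counts pages fetched per host
// and refuses further fetches once a host reaches its cap.
type HostLimiter struct {
	mu       sync.Mutex
	maxPages int
	seen     map[string]int
}

// NewHostLimiter returns a limiter that allows at most maxPages per host.
func NewHostLimiter(maxPages int) *HostLimiter {
	return &HostLimiter{maxPages: maxPages, seen: make(map[string]int)}
}

// Allow reports whether another page on rawURL's host may be fetched.
// limitReached is true only on the call that uses up the last allowed page,
// so the caller can send a single notification per host.
func (l *HostLimiter) Allow(rawURL string) (ok bool, limitReached bool, err error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, false, err
	}
	host := u.Hostname()

	l.mu.Lock()
	defer l.mu.Unlock()

	if l.seen[host] >= l.maxPages {
		return false, false, nil
	}
	l.seen[host]++
	return true, l.seen[host] == l.maxPages, nil
}
```

A crawler loop would call `Allow` before each fetch, skip the URL when `ok` is false, and log or notify once when `limitReached` is true.
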
Heuristics:
- Do _not_ parse links from pages that have '/git/' or '/cgi/' or '/cgi-bin/' in their URLs.
- Have a list of "whitelisted" hosts/URLs that we visit at regular intervals.
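
The link-parsing heuristic above might look something like this. It is a minimal sketch, not the project's implementation: `ShouldParseLinks` and `skipMarkers` are hypothetical names, and it assumes a plain substring match against the whole URL is what the note intends.

```go
package main

import "strings"

// skipMarkers holds the URL fragments named in the notes. Pages whose URL
// contains any of them are still fetched, but their links are not parsed.
var skipMarkers = []string{"/git/", "/cgi/", "/cgi-bin/"}

// ShouldParseLinks reports whether links should be extracted from the page
// at rawURL. It is a hypothetical helper using a simple substring check.
func ShouldParseLinks(rawURL string) bool {
	for _, marker := range skipMarkers {
		if strings.Contains(rawURL, marker) {
			return false
		}
	}
	return true
}
```

Because the markers include both slashes, a URL ending in `/git` with no trailing slash would still have its links parsed; whether that is desirable is a judgment call left open here.
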