Remove old file

This commit is contained in:
antanst
2025-06-29 23:03:44 +03:00
parent 26311a6d2b
commit 8e30a6a365

@@ -1,13 +0,0 @@
# Notes
Avoiding endless loops while crawling
- Make sure we follow robots.txt
- Announce our own agent so people can block us in their robots.txt
- Put a limit on the number of pages per host, and send a notification when the limit is reached.
- Put a limit on the number of redirects (not needed?)
Heuristics:
- Do _not_ parse links from pages that have '/git/' or '/cgi/' or '/cgi-bin/' in their URLs.
- Have a list of "whitelisted" hosts/URLs that we visit at regular intervals.
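
The robots.txt check, the URL-pattern heuristic, and the per-host page cap above can be sketched together as a single fetch policy. This is a minimal illustration using Python's stdlib `urllib.robotparser`; the agent name, the limit value, and the `CrawlPolicy` class are hypothetical, not part of the original notes:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "notes-crawler"      # hypothetical announced agent name
MAX_PAGES_PER_HOST = 1000         # hypothetical per-host cap
SKIP_PATTERNS = ("/git/", "/cgi/", "/cgi-bin/")

class CrawlPolicy:
    """Decides whether a URL should be fetched (sketch, not the real crawler)."""

    def __init__(self):
        self.pages_per_host = {}  # host -> pages fetched so far
        self.robots = {}          # host -> parsed robots.txt

    def add_robots(self, host, robots_txt):
        # Parse robots.txt text that was fetched elsewhere (kept offline here).
        rp = RobotFileParser()
        rp.parse(robots_txt.splitlines())
        self.robots[host] = rp

    def should_fetch(self, url):
        host = urlparse(url).netloc
        # Heuristic: do not follow links into git/CGI paths.
        if any(pattern in url for pattern in SKIP_PATTERNS):
            return False
        # Respect robots.txt under our own announced agent.
        rp = self.robots.get(host)
        if rp is not None and not rp.can_fetch(USER_AGENT, url):
            return False
        # Per-host page limit to avoid endless loops; a real crawler
        # would also notify when this limit is reached.
        count = self.pages_per_host.get(host, 0)
        if count >= MAX_PAGES_PER_HOST:
            return False
        self.pages_per_host[host] = count + 1
        return True
```

A whitelisted-hosts revisit schedule would sit on top of this policy rather than inside it, since it decides *when* to crawl rather than *whether* a URL is allowed.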