Update documentation and project configuration
- Add architecture documentation for versioned snapshots
- Update Makefile with improved build commands
- Update dependency versions in go.mod
- Add project notes and development guidelines
- Improve README with new features and instructions
13
NOTES.md
Normal file
@@ -0,0 +1,13 @@
# Notes

Avoiding endless loops while crawling:

- Make sure we follow robots.txt
- Announce our own agent name so that site owners can block us in their robots.txt
- Put a limit on the number of pages per host, and notify when the limit is reached.
- Put a limit on the number of redirects (not needed?)
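
The per-host page limit could be sketched roughly as follows. This is only an illustration: `HostBudget`, `NewHostBudget`, `Allow`, and the `onLimit` callback are hypothetical names, not code from this repository, and the limit of 2 in `main` is arbitrary.

```go
package main

import (
	"fmt"
	"net/url"
	"sync"
)

// HostBudget caps how many pages are fetched from any single host.
// Hypothetical type for illustration; not part of the project.
type HostBudget struct {
	mu       sync.Mutex
	maxPages int
	seen     map[string]int  // pages counted so far, per host
	notified map[string]bool // hosts we have already reported
	onLimit  func(host string)
}

// NewHostBudget builds a budget; onLimit may be nil.
func NewHostBudget(maxPages int, onLimit func(host string)) *HostBudget {
	return &HostBudget{
		maxPages: maxPages,
		seen:     make(map[string]int),
		notified: make(map[string]bool),
		onLimit:  onLimit,
	}
}

// Allow reports whether rawURL may still be fetched and, if so, counts it
// against the host's budget. The first refusal for a host triggers onLimit.
func (b *HostBudget) Allow(rawURL string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	host := u.Hostname()
	if host == "" {
		return false, fmt.Errorf("no host in URL %q", rawURL)
	}

	b.mu.Lock()
	allowed := b.seen[host] < b.maxPages
	notify := false
	if allowed {
		b.seen[host]++
	} else if !b.notified[host] {
		b.notified[host] = true
		notify = true
	}
	b.mu.Unlock()

	// Call the callback outside the lock so it can safely use the budget.
	if notify && b.onLimit != nil {
		b.onLimit(host)
	}
	return allowed, nil
}

func main() {
	budget := NewHostBudget(2, func(host string) {
		fmt.Println("page limit reached for", host)
	})
	for _, u := range []string{
		"https://example.org/a",
		"https://example.org/b",
		"https://example.org/c",
		"https://example.net/a",
	} {
		ok, err := budget.Allow(u)
		fmt.Println(u, ok, err)
	}
}
```

A redirect limit, if it turns out to be needed, could reuse the same idea with a counter per request chain instead of per host.
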

Heuristics:

- Do _not_ parse links from pages whose URLs contain '/git/', '/cgi/', or '/cgi-bin/'.
- Have a list of "whitelisted" hosts/URLs that we visit at regular intervals.
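
A minimal sketch of the first heuristic, assuming the markers are matched against the URL path only (the note just says "in their URLs"). `ShouldParseLinks` and `skipMarkers` are hypothetical names, and treating unparseable URLs as "do not parse" is also an assumption.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// skipMarkers lists the path fragments named in the notes.
var skipMarkers = []string{"/git/", "/cgi/", "/cgi-bin/"}

// ShouldParseLinks reports whether links should be extracted from the page
// at rawURL. Pages under a skip marker are still fetched; only their links
// are ignored.
func ShouldParseLinks(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false // assumption: skip anything we cannot parse
	}
	for _, marker := range skipMarkers {
		if strings.Contains(u.Path, marker) {
			return false
		}
	}
	return true
}

func main() {
	for _, u := range []string{
		"https://example.org/docs/index",
		"https://example.org/git/repo/tree",
		"https://example.org/cgi-bin/search?q=x",
	} {
		fmt.Println(u, ShouldParseLinks(u))
	}
}
```

Matching on the path rather than the whole URL avoids false positives from host names, at the cost of being slightly narrower than the note's wording.
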