Update README.md

Remove old file
3 changed files with 6 additions and 53 deletions
--- a/NOTES.md
+++ b/NOTES.md
@@ -1,13 +0,0 @@
 # Notes
 Avoiding endless loops while crawling
 - Make sure we follow robots.txt
 - Announce our own agent so people can block us in their robots.txt
 - Put a limit on the number of pages per host, and notify when the limit is reached.
 - Put a limit on the number of redirects (not needed?)
 Heuristics:
 - Do _not_ parse links from pages that have '/git/' or '/cgi/' or '/cgi-bin/' in their URLs.
 - Have a list of "whitelisted" hosts/URLs that we visit at regular intervals.
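The loop-avoidance notes in the removed NOTES.md could be sketched roughly as follows. This is a minimal illustration, not the crawler's actual code: `shouldParseLinks`, `hostLimiter`, and `newHostLimiter` are hypothetical names, the per-host counter lives in memory and is not goroutine-safe, and the `example.org` URLs are placeholders.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// skipSegments lists the URL fragments named in the notes. Pages whose URL
// contains one of them are fetched, but their links are not extracted.
var skipSegments = []string{"/git/", "/cgi/", "/cgi-bin/"}

// shouldParseLinks reports whether links found on the page at rawURL
// should be parsed and followed.
func shouldParseLinks(rawURL string) bool {
	for _, seg := range skipSegments {
		if strings.Contains(rawURL, seg) {
			return false
		}
	}
	return true
}

// hostLimiter is a hypothetical in-memory page counter per host.
type hostLimiter struct {
	max   int
	pages map[string]int
}

func newHostLimiter(max int) *hostLimiter {
	return &hostLimiter{max: max, pages: make(map[string]int)}
}

// allow counts one more page for the host of rawURL and reports whether
// the host is still within its limit.
func (l *hostLimiter) allow(rawURL string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	host := u.Hostname()
	if l.pages[host] >= l.max {
		return false, nil // limit reached; the caller can log or notify here
	}
	l.pages[host]++
	return true, nil
}

func main() {
	fmt.Println(shouldParseLinks("https://example.org/git/repo/"))
	fmt.Println(shouldParseLinks("https://example.org/docs/"))

	limiter := newHostLimiter(2)
	for i := 0; i < 3; i++ {
		ok, err := limiter.allow("https://example.org/page")
		fmt.Println(ok, err)
	}
}
```

Matching on the raw URL string mirrors the wording of the note ("in their URLs"); a stricter variant could match on the parsed path only.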
--- a/README.md
+++ b/README.md
@@ -22,8 +22,12 @@ This crawler uses `InsecureSkipVerify: true` in TLS configuration to accept all
 ## How to run
-Spin up a PostgreSQL, check `misc/sql/initdb.sql` to create the tables and start the crawler.
+```shell
-All configuration is done via command-line flags.
+make build
 ./dist/crawler --help
 ```
 Check `misc/sql/initdb.sql` to create the PostgreSQL tables.
 ## Configuration
--- a/TODO.md
+++ b/TODO.md
@@ -1,38 +0,0 @@
 # TODO
 ## Outstanding Issues
 ### 1. Ctrl+C Signal Handling Issue
 **Problem**: The crawler sometimes doesn't exit properly when Ctrl+C is pressed.
 **Root Cause**: The main thread gets stuck in blocking operations before it can check for signals:
 - Database operations in the polling loop (`cmd/crawler/crawler.go:239-250`)
 - Job queueing when channel is full (`jobs <- url` can block if workers are slow)
 - Long-running database transactions
 **Location**: `cmd/crawler/crawler.go` - main polling loop starting at line 233
 **Solution**: Add signal/context checking to blocking operations:
 - Use cancellable context instead of `context.Background()` for database operations
 - Make job queueing non-blocking or context-aware
 - Add timeout/cancellation to database operations
 ### 2. fetchSnapshotsFromHistory() Doesn't Work with --skip-identical-content=true
 **Problem**: When `--skip-identical-content=true` (default), URLs with unchanged content get continuously re-queued.
 **Root Cause**: The function tracks when content last changed, not when URLs were last crawled:
 - Identical content → no new snapshot created
 - Query finds old snapshot timestamp → re-queues URL
 - Creates infinite loop of re-crawling unchanged content
 **Location**: `cmd/crawler/crawler.go:388-470` - `fetchSnapshotsFromHistory()` function
 **Solution Options**:
 1. Add `last_crawled` timestamp to URLs table
 2. Create separate `crawl_attempts` table
 3. Always create snapshot entries (even for duplicates) but mark them as such
 4. Modify logic to work with existing schema constraints
 **Current Status**: Function assumes `SkipIdenticalContent=false` per original comment at line 391.
Author	SHA1	Message	Date
antanst	f362a1d2da	Update README.md	2025-06-29 23:10:14 +03:00
antanst	7b3ad38f03	Remove old file	2025-06-29 23:04:03 +03:00
antanst	8e30a6a365	Remove old file	2025-06-29 23:03:44 +03:00