TODO
Outstanding Issues
1. Ctrl+C Signal Handling Issue
Problem: The crawler sometimes doesn't exit properly when Ctrl+C is pressed.
Root Cause: The main thread gets stuck in blocking operations before it can check for signals:
- Database operations in the polling loop (cmd/crawler/crawler.go:239-250)
- Job queueing when the channel is full (`jobs <- url` can block if workers are slow)
- Long-running database transactions
Location: cmd/crawler/crawler.go - main polling loop starting at line 233
Solution: Add signal/context checking to blocking operations (a sketch follows this list):
- Use a cancellable context instead of `context.Background()` for database operations
- Make job queueing non-blocking or context-aware
- Add timeouts/cancellation to database operations
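A minimal sketch of that approach, assuming a buffered jobs channel and a fixed poll interval; the identifiers here (`jobs`, `pollDatabase`, the 5s ticker) are illustrative, not the crawler's actual names:

```go
package main

import (
	"context"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Cancel this context on SIGINT/SIGTERM so blocking operations can
	// observe shutdown, instead of the loop checking a signal flag only
	// between blocking calls.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	jobs := make(chan string, 64) // illustrative job queue

	ticker := time.NewTicker(5 * time.Second) // illustrative poll interval
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return // Ctrl+C: exit promptly even while workers are busy
		case <-ticker.C:
			for _, url := range pollDatabase(ctx) {
				select {
				case jobs <- url: // enqueue, but never block past shutdown
				case <-ctx.Done():
					return
				}
			}
		}
	}
}

// pollDatabase stands in for the real polling query; passing ctx through to
// QueryContext/ExecContext makes the database calls cancellable as well.
func pollDatabase(ctx context.Context) []string { return nil }
```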
2. fetchSnapshotsFromHistory() Doesn't Work with --skip-identical-content=true
Problem: When --skip-identical-content=true (default), URLs with unchanged content get continuously re-queued.
Root Cause: The function tracks when content last changed, not when URLs were last crawled (a sketch of the flawed query pattern follows this list):
- Identical content → no new snapshot created
- Query finds old snapshot timestamp → re-queues URL
- Creates infinite loop of re-crawling unchanged content
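To make the loop concrete, here is a hypothetical version of the query pattern; the real query lives in fetchSnapshotsFromHistory(), and the table/column names and Postgres-style placeholders below are assumptions, not the project's actual schema:

```go
package crawler

import (
	"context"
	"database/sql"
	"time"
)

// staleURLs sketches the flawed pattern: it treats the newest snapshot's
// timestamp as "last crawled". Because identical content never produces a
// new snapshot, an unchanged URL's newest snapshot never advances, so the
// URL looks stale on every polling pass and gets re-queued forever.
func staleURLs(ctx context.Context, db *sql.DB, maxAge time.Duration) ([]string, error) {
	rows, err := db.QueryContext(ctx, `
		SELECT u.url
		FROM urls u
		JOIN snapshots s ON s.url_id = u.id
		GROUP BY u.url
		HAVING MAX(s.created_at) < $1`, time.Now().Add(-maxAge))
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var urls []string
	for rows.Next() {
		var u string
		if err := rows.Scan(&u); err != nil {
			return nil, err
		}
		urls = append(urls, u)
	}
	return urls, rows.Err()
}
```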
Location: cmd/crawler/crawler.go:388-470 - fetchSnapshotsFromHistory() function
Solution Options (a sketch of the first option follows this list):
- Add a `last_crawled` timestamp to the URLs table
- Create a separate `crawl_attempts` table
- Always create snapshot entries (even for duplicates) but mark them as such
- Modify the logic to work with existing schema constraints
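A hedged sketch of the first option: record every crawl attempt on the URL row itself, whether or not a new snapshot is written. The DDL and helper below assume a Postgres-style urls table; adapt the names to the actual schema:

```go
package crawler

import (
	"context"
	"database/sql"
)

// Assumed migration: track crawl attempts independently of snapshots.
const addLastCrawled = `
ALTER TABLE urls ADD COLUMN IF NOT EXISTS last_crawled TIMESTAMPTZ;
CREATE INDEX IF NOT EXISTS idx_urls_last_crawled ON urls (last_crawled);
`

// markCrawled records an attempt even when identical content means no new
// snapshot row is written. fetchSnapshotsFromHistory() could then select on
// last_crawled instead of the newest snapshot's timestamp, breaking the
// re-queue loop.
func markCrawled(ctx context.Context, db *sql.DB, url string) error {
	_, err := db.ExecContext(ctx,
		`UPDATE urls SET last_crawled = NOW() WHERE url = $1`, url)
	return err
}
```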
Current Status: The function assumes SkipIdenticalContent=false, per the original comment at line 391.