Compare commits: main...c386d5eb14 (24 commits)

Commits in this range:

- c386d5eb14
- 1ba432c127
- 21b8769bc5
- 389b3615e2
- af42383513
- 59893efc3d
- 621868f3b3
- 967f371777
- ada6cda4ac
- e9d7fa85ff
- 9938dc542b
- 37d5e7cd78
- dfb050588c
- ecaa7f338d
- 6a5284e91a
- 6b22953046
- 0821f78f2d
- fe40874844
- a7aa5cd410
- ef628eeb3c
- 376e1ced64
- 94429b2224
- a6dfc25e25
- a2d5b04d58

NOTES.md — new file (13 lines added)
@@ -0,0 +1,13 @@
# Notes

Avoiding endless loops while crawling

- Make sure we follow robots.txt
- Announce our own agent so people can block us in their robots.txt
- Put a limit on the number of pages per host, and notify when the limit is reached.
- Put a limit on the number of redirects (not needed?)

Heuristics:

- Do _not_ parse links from pages that have '/git/', '/cgi/' or '/cgi-bin/' in their URLs.
- Have a list of "whitelisted" hosts/URLs that we visit at regular intervals.

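The loop-avoidance rules above lend themselves to a small gate in front of the job queue. Below is a minimal Go sketch, not the crawler's actual code: the names (`crawlGate`, `AllowFetch`, `ParseLinksFrom`, `maxPagesPerHost`) and the User-Agent string are invented for this example, robots.txt parsing is left out, and the '/git/' rule is applied as a plain path check.

```go
// Minimal sketch of the loop-avoidance rules from NOTES.md; all names here
// are hypothetical and robots.txt handling is reduced to announcing an agent.
package main

import (
	"fmt"
	"net/url"
	"strings"
	"sync"
)

// Announced so site owners can target us in their robots.txt (placeholder value).
const userAgent = "ExampleCrawler/0.1 (+https://example.org/crawler)"

// Assumed per-host page budget; the real value would come from configuration.
const maxPagesPerHost = 1000

// crawlGate tracks how many pages have been fetched per host.
type crawlGate struct {
	mu     sync.Mutex
	counts map[string]int
}

func newCrawlGate() *crawlGate {
	return &crawlGate{counts: make(map[string]int)}
}

// AllowFetch enforces the per-host page limit.
func (g *crawlGate) AllowFetch(u *url.URL) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.counts[u.Host] >= maxPagesPerHost {
		// In the real crawler this is where a "limit reached" notification would fire.
		return false
	}
	g.counts[u.Host]++
	return true
}

// ParseLinksFrom applies the heuristic of not extracting links from pages
// whose URL contains '/git/', '/cgi/' or '/cgi-bin/'.
func ParseLinksFrom(u *url.URL) bool {
	for _, marker := range []string{"/git/", "/cgi/", "/cgi-bin/"} {
		if strings.Contains(u.Path, marker) {
			return false
		}
	}
	return true
}

func main() {
	gate := newCrawlGate()
	for _, raw := range []string{
		"https://example.org/blog/post-1",
		"https://example.org/cgi-bin/viewvc.cgi",
	} {
		u, err := url.Parse(raw)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Printf("%s fetch=%v parseLinks=%v agent=%q\n",
			raw, gate.AllowFetch(u), ParseLinksFrom(u), userAgent)
	}
}
```
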
@@ -22,12 +22,8 @@ This crawler uses `InsecureSkipVerify: true` in TLS configuration to accept all

## How to run

```shell
make build
./dist/crawler --help
```

-Check `misc/sql/initdb.sql` to create the PostgreSQL tables.
+Spin up a PostgreSQL, check `misc/sql/initdb.sql` to create the tables and start the crawler.
All configuration is done via command-line flags.

## Configuration

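Since the hunk above states that all configuration is flag-based, here is a hypothetical sketch of what that could look like with Go's standard `flag` package. Only `--skip-identical-content` is mentioned elsewhere in this diff (in TODO.md); the other flag names and defaults are invented for illustration, so check `./dist/crawler --help` for the real set.

```go
// Hypothetical illustration of flag-only configuration; flag names other than
// --skip-identical-content are assumptions, not the crawler's real interface.
package main

import (
	"flag"
	"fmt"
)

func main() {
	dbURL := flag.String("db-url", "postgres://localhost:5432/crawler", "PostgreSQL connection string (hypothetical flag)")
	workers := flag.Int("workers", 4, "number of crawl workers (hypothetical flag)")
	skipIdentical := flag.Bool("skip-identical-content", true, "skip snapshots whose content did not change")
	flag.Parse()

	fmt.Printf("db=%s workers=%d skip-identical-content=%v\n", *dbURL, *workers, *skipIdentical)
}
```
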
TODO.md — new file (38 lines added)
@@ -0,0 +1,38 @@
# TODO

## Outstanding Issues

### 1. Ctrl+C Signal Handling Issue

**Problem**: The crawler sometimes doesn't exit properly when Ctrl+C is pressed.

**Root Cause**: The main thread gets stuck in blocking operations before it can check for signals:
- Database operations in the polling loop (`cmd/crawler/crawler.go:239-250`)
- Job queueing when the channel is full (`jobs <- url` can block if workers are slow)
- Long-running database transactions

**Location**: `cmd/crawler/crawler.go` - main polling loop starting at line 233

**Solution**: Add signal/context checking to blocking operations (see the sketch after this list):
- Use a cancellable context instead of `context.Background()` for database operations
- Make job queueing non-blocking or context-aware
- Add timeouts/cancellation to database operations

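A minimal sketch of that fix, assuming hypothetical helper names (`pollURLs`, `enqueue`, `jobs`) rather than the crawler's real ones: derive a context that is cancelled on Ctrl+C/SIGTERM, pass it to database operations instead of `context.Background()`, and make the `jobs <- url` send abort when the context is cancelled.

```go
// Sketch only: pollURLs stands in for the polling-loop database query and
// enqueue for the `jobs <- url` send; names and structure are assumptions.
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// pollURLs simulates the database poll; a real version would use
// QueryContext(ctx, ...) so cancellation interrupts the query.
func pollURLs(ctx context.Context) ([]string, error) {
	select {
	case <-time.After(200 * time.Millisecond): // pretend the query took a while
		return []string{"https://example.org/"}, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// enqueue blocks until a worker takes the job or the context is cancelled,
// so a full channel can no longer wedge the main loop.
func enqueue(ctx context.Context, jobs chan<- string, url string) error {
	select {
	case jobs <- url:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// ctx is cancelled on Ctrl+C (SIGINT) or SIGTERM.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	jobs := make(chan string, 8)
	go func() { // stand-in worker
		for url := range jobs {
			fmt.Println("crawling", url)
		}
	}()

	for {
		urls, err := pollURLs(ctx)
		if err != nil {
			fmt.Println("shutting down:", err)
			return
		}
		for _, u := range urls {
			if err := enqueue(ctx, jobs, u); err != nil {
				fmt.Println("shutting down:", err)
				return
			}
		}
	}
}
```
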
### 2. fetchSnapshotsFromHistory() Doesn't Work with --skip-identical-content=true

**Problem**: When `--skip-identical-content=true` (the default), URLs with unchanged content get continuously re-queued.

**Root Cause**: The function tracks when content last changed, not when URLs were last crawled:
- Identical content → no new snapshot created
- Query finds old snapshot timestamp → re-queues URL
- Creates an infinite loop of re-crawling unchanged content

**Location**: `cmd/crawler/crawler.go:388-470` - `fetchSnapshotsFromHistory()` function

**Solution Options** (option 1 is sketched below):
1. Add a `last_crawled` timestamp to the URLs table
2. Create a separate `crawl_attempts` table
3. Always create snapshot entries (even for duplicates) but mark them as such
4. Modify the logic to work with existing schema constraints

**Current Status**: The function assumes `SkipIdenticalContent=false` per the original comment at line 391.

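A sketch of Solution Option 1: key re-queueing on a per-URL `last_crawled` timestamp instead of the latest snapshot's timestamp, and update that timestamp on every crawl attempt even when no new snapshot is written. The table and column names (`urls`, `url`, `last_crawled`), the recrawl interval, and the connection string are assumptions for illustration; the real schema lives in `misc/sql/initdb.sql`.

```go
// Sketch of Solution Option 1 (hypothetical schema: urls(url, last_crawled)).
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq" // PostgreSQL driver
)

// urlsDueForRecrawl returns URLs whose last crawl attempt is older than the
// given interval, regardless of whether that attempt produced a new snapshot.
func urlsDueForRecrawl(ctx context.Context, db *sql.DB, olderThan time.Duration) ([]string, error) {
	rows, err := db.QueryContext(ctx,
		`SELECT url FROM urls
		 WHERE last_crawled IS NULL OR last_crawled < $1`,
		time.Now().Add(-olderThan))
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var urls []string
	for rows.Next() {
		var u string
		if err := rows.Scan(&u); err != nil {
			return nil, err
		}
		urls = append(urls, u)
	}
	return urls, rows.Err()
}

// markCrawled records the attempt even when the content was identical and no
// snapshot row was written, which is what breaks the re-queue loop.
func markCrawled(ctx context.Context, db *sql.DB, url string) error {
	_, err := db.ExecContext(ctx,
		`UPDATE urls SET last_crawled = now() WHERE url = $1`, url)
	return err
}

func main() {
	// Placeholder DSN; the real crawler takes its connection info from flags.
	db, err := sql.Open("postgres", "postgres://localhost:5432/crawler?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	due, err := urlsDueForRecrawl(context.Background(), db, 24*time.Hour)
	if err != nil {
		log.Fatal(err)
	}
	for _, u := range due {
		fmt.Println("re-queue:", u)
	}
}
```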