# Compare commits

48 commits: `0db2557cfc...main`
| SHA1 |
|---|
| f362a1d2da |
| 7b3ad38f03 |
| 8e30a6a365 |
| 26311a6d2b |
| 57eb2555c5 |
| 453cf2294a |
| ee2076f337 |
| acbac15c20 |
| ddbe6b461b |
| 55bb0d96d0 |
| 349968d019 |
| 2357135d5a |
| 98d3ed6707 |
| 8b498a2603 |
| 8588414b14 |
| 5e6dabf1e7 |
| a8173544e7 |
| 3d07b56e8c |
| c54c093a10 |
| 57f5c0e865 |
| dc6eb610a2 |
| 39e9ead982 |
| 5f4da4f806 |
| 4ef3f70f1f |
| b8ea6fab4a |
| 5fe1490f1e |
| a41490f834 |
| 701a5df44f |
| 5b84960c5a |
| be38104f05 |
| d70d6c35a3 |
| 8399225046 |
| e8e26ec76a |
| f6ac5003b0 |
| e626aabecb |
| ebf59c50b8 |
| 2a041fec7c |
| ca008b0796 |
| 8350e106d6 |
| 9c7502b2a8 |
| dda21e833c |
| b0e7052c10 |
| 43b207c9ab |
| 285f2955e7 |
| 998b0e74ec |
| 766ee26f68 |
| 5357ceb04d |
| 03e1849191 |
## NOTES.md (13 changed lines)
````diff
@@ -1,13 +0,0 @@
-# Notes
-
-Avoiding endless loops while crawling
-
-- Make sure we follow robots.txt
-- Announce our own agent so people can block us in their robots.txt
-- Put a limit on number of pages per host, and notify on limit reach.
-- Put a limit on the number of redirects (not needed?)
-
-Heuristics:
-
-- Do _not_ parse links from pages that have '/git/' or '/cgi/' or '/cgi-bin/' in their URLs.
-- Have a list of "whitelisted" hosts/urls that we visit in regular intervals.
@@ -22,8 +22,12 @@ This crawler uses `InsecureSkipVerify: true` in TLS configuration to accept all
 
 ## How to run
 
-Spin up a PostgreSQL, check `misc/sql/initdb.sql` to create the tables and start the crawler.
-All configuration is done via command-line flags.
+```shell
+make build
+./dist/crawler --help
+```
+
+Check `misc/sql/initdb.sql` to create the PostgreSQL tables.
 
 ## Configuration
 
````
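The removed notes list loop-avoidance heuristics, including a per-host page limit with a notification when the limit is reached. A minimal stdlib-only Go sketch of such a limiter; the `hostLimiter` type and all names are illustrative, not taken from the crawler's code:

```go
package main

import (
	"fmt"
	"net/url"
	"sync"
)

// hostLimiter caps the number of pages fetched per host, one of the
// loop-avoidance heuristics in the removed NOTES.md lines above.
type hostLimiter struct {
	mu    sync.Mutex
	seen  map[string]int
	limit int
}

func newHostLimiter(limit int) *hostLimiter {
	return &hostLimiter{seen: make(map[string]int), limit: limit}
}

// allow reports whether rawURL's host is still under the page limit,
// printing a one-time notice when the limit is first exceeded.
func (h *hostLimiter) allow(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	h.mu.Lock()
	defer h.mu.Unlock()
	h.seen[u.Host]++
	if h.seen[u.Host] == h.limit+1 {
		fmt.Printf("host %s reached page limit of %d, skipping further pages\n", u.Host, h.limit)
	}
	return h.seen[u.Host] <= h.limit
}

func main() {
	l := newHostLimiter(2)
	for _, page := range []string{
		"https://example.com/a",
		"https://example.com/b",
		"https://example.com/c", // third page on the same host: notice fires
	} {
		fmt.Println(page, "allowed:", l.allow(page))
	}
}
```

In a real crawler this check would sit next to the worker pool, and the notification would feed whatever alerting the crawler already uses.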
## TODO.md (38 changed lines)
```diff
@@ -1,38 +0,0 @@
-# TODO
-
-## Outstanding Issues
-
-### 1. Ctrl+C Signal Handling Issue
-
-**Problem**: The crawler sometimes doesn't exit properly when Ctrl+C is pressed.
-
-**Root Cause**: The main thread gets stuck in blocking operations before it can check for signals:
-- Database operations in the polling loop (`cmd/crawler/crawler.go:239-250`)
-- Job queueing when channel is full (`jobs <- url` can block if workers are slow)
-- Long-running database transactions
-
-**Location**: `cmd/crawler/crawler.go` - main polling loop starting at line 233
-
-**Solution**: Add signal/context checking to blocking operations:
-- Use cancellable context instead of `context.Background()` for database operations
-- Make job queueing non-blocking or context-aware
-- Add timeout/cancellation to database operations
-
-### 2. fetchSnapshotsFromHistory() Doesn't Work with --skip-identical-content=true
-
-**Problem**: When `--skip-identical-content=true` (default), URLs with unchanged content get continuously re-queued.
-
-**Root Cause**: The function tracks when content last changed, not when URLs were last crawled:
-- Identical content → no new snapshot created
-- Query finds old snapshot timestamp → re-queues URL
-- Creates infinite loop of re-crawling unchanged content
-
-**Location**: `cmd/crawler/crawler.go:388-470` - `fetchSnapshotsFromHistory()` function
-
-**Solution Options**:
-1. Add `last_crawled` timestamp to URLs table
-2. Create separate `crawl_attempts` table
-3. Always create snapshot entries (even for duplicates) but mark them as such
-4. Modify logic to work with existing schema constraints
-
-**Current Status**: Function assumes `SkipIdenticalContent=false` per original comment at line 391.
```
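The first removed TODO item proposes replacing `context.Background()` with a cancellable context and making `jobs <- url` context-aware. A minimal sketch of that shape, assuming a polling loop like the one described at `cmd/crawler/crawler.go:233`; the tick interval, channel size, and all names here are illustrative:

```go
package main

import (
	"context"
	"fmt"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Cancellable context tied to SIGINT/SIGTERM, replacing
	// context.Background() so blocking operations can observe Ctrl+C.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	jobs := make(chan string, 8)

	// A stand-in worker; the crawler's real workers would receive from jobs.
	go func() {
		for url := range jobs {
			_ = url                     // fetch and parse would happen here
			time.Sleep(3 * time.Second) // deliberately slower than the poller
		}
	}()

	// Context-aware queueing: jobs <- url no longer blocks forever when
	// workers are slow, because Ctrl+C unblocks it via ctx.Done().
	queue := func(url string) bool {
		select {
		case jobs <- url:
			return true
		case <-ctx.Done():
			return false
		}
	}

	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			fmt.Println("signal received, shutting down")
			return
		case <-ticker.C:
			// Database polling would go here, called with ctx so that
			// queries are also cancelled on shutdown.
			if !queue("https://example.com/") {
				return
			}
		}
	}
}
```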
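For the second item, solution option 1 (a `last_crawled` timestamp on the URLs table) could look roughly like this. The `urls` table layout, the `last_crawled` column, and the `lib/pq` driver are assumptions, since the real schema lives in `misc/sql/initdb.sql`:

```go
package main

import (
	"context"
	"database/sql"
	"time"

	_ "github.com/lib/pq" // assumed PostgreSQL driver; the crawler may use another
)

// markCrawled records a crawl attempt even when the fetched content is
// identical, so re-queueing can key off when a URL was last *crawled*
// rather than when its content last changed.
func markCrawled(ctx context.Context, db *sql.DB, url string) error {
	_, err := db.ExecContext(ctx,
		`UPDATE urls SET last_crawled = NOW() WHERE url = $1`, url)
	return err
}

// dueForRecrawl selects URLs by last_crawled instead of the latest snapshot
// timestamp, so identical content no longer causes endless re-queueing.
func dueForRecrawl(ctx context.Context, db *sql.DB, age time.Duration) ([]string, error) {
	cutoff := time.Now().Add(-age)
	rows, err := db.QueryContext(ctx,
		`SELECT url FROM urls WHERE last_crawled IS NULL OR last_crawled < $1`, cutoff)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var urls []string
	for rows.Next() {
		var u string
		if err := rows.Scan(&u); err != nil {
			return nil, err
		}
		urls = append(urls, u)
	}
	return urls, rows.Err()
}

func main() {} // sketch only; wiring into fetchSnapshotsFromHistory() is omitted
```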