Compare commits

45 Commits

Author SHA1 Message Date
antanst
0db2557cfc Update README to reflect command-line flag configuration
- Replace environment variables with command-line flags
- Update run example with proper flag syntax
- Fix database schema path to misc/sql/initdb.sql
- Add missing configuration options (gopher, seed-url-path, max-db-connections)
- Remove outdated configuration options
2025-06-29 22:28:28 +03:00
antanst
db3448f448 Improve crawler performance and logging
- Optimize job scheduler to use NumOfWorkers for URL limits
- Clean up verbose logging in worker processing
- Update log messages for better clarity
2025-06-29 22:28:05 +03:00
antanst
9a09dd7735 Update log message for clarity
- Change "old content" to "old snapshot" for more accurate terminology

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-06-19 10:19:29 +03:00
antanst
0f62b0c622 Clean up logging in worker processing
- Move response code logging to happen after successful snapshot save
- Remove error message from log output for cleaner display
- Consolidate logging logic in saveSnapshotAndRemoveURL function

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-06-19 10:04:46 +03:00
antanst
3bdff0e22e Improve crawler performance and worker coordination
- Add WaitGroup synchronization for workers to prevent overlapping scheduler runs
- Increase history fetch multiplier and sleep intervals for better resource usage
- Simplify error handling and logging in worker processing
- Update SQL query to exclude error snapshots from history selection
- Fix worker ID variable reference in spawning loop
- Streamline snapshot update logic and error reporting

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-06-19 09:59:50 +03:00
antanst
a74f29d7b0 Update log message to reflect crawl date update behavior
🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-06-18 12:03:37 +03:00
antanst
ffeef334e7 Update last_crawled timestamp when skipping duplicate content and improve error handling
🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-06-18 12:02:55 +03:00
antanst
8bbe6efabc Improve error handling and add duplicate snapshot cleanup
🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-06-18 11:56:26 +03:00
antanst
aa2658e61e Fix snapshot overwrite logic to preserve successful responses
- Prevent overwriting snapshots that have valid response codes
- Ensure URL is removed from queue when snapshot update is skipped
- Add last_crawled timestamp tracking for better crawl scheduling
- Remove SkipIdenticalContent flag, simplify content deduplication logic
- Update database schema with last_crawled column and indexes

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-06-18 11:23:56 +03:00
antanst
4e225ee866 Fix infinite recrawl loop with skip-identical-content
Add last_crawled timestamp tracking to fix fetchSnapshotsFromHistory()
infinite loop when SkipIdenticalContent=true. Now tracks actual crawl
attempts separately from content changes via database DEFAULT timestamps.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-06-17 10:41:17 +03:00
antanst
f9024d15aa Refine content deduplication and improve configuration 2025-06-16 17:09:26 +03:00
antanst
330b596497 Enhance crawler with seed list and SQL utilities
Add seedList module for URL initialization, comprehensive SQL utilities for database analysis, and update project configuration.
2025-06-16 12:29:33 +03:00
51f94c90b2 Update documentation and project configuration
- Add architecture documentation for versioned snapshots
- Update Makefile with improved build commands
- Update dependency versions in go.mod
- Add project notes and development guidelines
- Improve README with new features and instructions
2025-05-22 13:26:11 +03:00
bfaa857fae Update and refactor core functionality
- Update common package utilities
- Refactor network code for better error handling
- Remove deprecated files and functionality
- Enhance blacklist and filtering capabilities
- Improve snapshot handling and processing
2025-05-22 12:47:01 +03:00
5cc82f2c75 Modernize host pool management
- Add context-aware host pool operations
- Implement rate limiting for host connections
- Improve concurrency handling with mutexes
- Add host connection tracking
2025-05-22 12:46:42 +03:00
eca54b2f68 Implement context-aware database operations
- Add context support to database operations
- Implement versioned snapshots for URL history
- Update database queries to support URL timestamps
- Improve transaction handling with context
- Add utility functions for snapshot history
2025-05-22 12:46:36 +03:00
7d27e5a123 Add whitelist functionality
- Implement whitelist package for filtering URLs
- Support pattern matching for allowed URLs
- Add URL validation against whitelist patterns
- Include test cases for whitelist functionality
2025-05-22 12:46:28 +03:00
8a9ca0b2e7 Add robots.txt parsing and matching functionality
- Create separate robotsMatch package for robots.txt handling
- Implement robots.txt parsing with support for different directives
- Add support for both Allow and Disallow patterns
- Include robots.txt matching with efficient pattern matching
- Add test cases for robots matching
2025-05-22 12:46:21 +03:00
5940a117fd Add context-aware network operations
- Implement context-aware versions of network operations
- Add request cancellation support throughout network code
- Use structured logging with context metadata
- Support timeout management with contexts
- Improve error handling with detailed logging

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-05-22 12:45:58 +03:00
d1c326f868 Improve error handling with xerrors package
- Replace custom error handling with xerrors package
- Enhance error descriptions for better debugging
- Add text utilities for string processing
- Update error tests to use standard errors package
- Add String() method to GeminiError

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-05-22 12:45:46 +03:00
a55f820f62 Implement structured logging with slog
- Replace zerolog with Go's standard slog package
- Add ColorHandler for terminal color output
- Add context-aware logging system
- Format attributes on the same line as log messages
- Use green color for INFO level logs
- Set up context value extraction helpers

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-05-22 12:44:08 +03:00
ad224a328e Change errors to use xerrors package. 2025-05-12 20:37:58 +03:00
a823f5abc3 Fix Makefile. 2025-03-10 16:54:06 +02:00
658c5f5471 Fix linter warnings in gemini/network.go
Remove redundant nil checks before len() operations as len() for nil slices is defined as zero in Go.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-03-10 11:34:29 +02:00
efaedcc6b2 Improvements in error handling & descriptions 2025-02-27 09:20:22 +02:00
9dc008cb0f Use go_errors library everywhere. 2025-02-26 13:31:46 +02:00
c82b436d32 Update license and readme. 2025-02-26 10:39:51 +02:00
4f47521401 update gitignore 2025-02-26 10:37:20 +02:00
96a39ec3b6 Improve main error handling 2025-02-26 10:37:09 +02:00
54474d45cd Use Go race detector 2025-02-26 10:36:51 +02:00
d306c44f3d Tidy go mod 2025-02-26 10:36:41 +02:00
79e3175467 Add gemget script that downloads Gemini pages 2025-02-26 10:35:54 +02:00
d89dd72fe9 Add Gopherspace crawling! 2025-02-26 10:35:28 +02:00
29877cb2da Simplify host pool 2025-02-26 10:35:11 +02:00
4bceb75695 Reorganize code for more granular imports 2025-02-26 10:34:46 +02:00
a9983f3531 Reorganize errors 2025-02-26 10:32:38 +02:00
5cf720103f Improve blacklist to use regex matching 2025-02-26 10:32:01 +02:00
b6dd77e57e Add regex matching function to util 2025-01-16 22:37:39 +02:00
973a4f3a2d Add tidy & update Makefile targets 2025-01-16 22:37:39 +02:00
b30b7274ec Simplify duplicate code 2025-01-16 22:37:39 +02:00
63adf73ef9 Proper package in tests 2025-01-16 10:04:02 +02:00
b3387ce7ad Add DB scan error 2025-01-16 10:04:02 +02:00
9ade26b6e8 Simplify IP pool and convert it to host pool 2025-01-16 10:04:02 +02:00
4a345a1763 Break up Gemtext link parsing code and improve tests. 2025-01-16 10:04:02 +02:00
64f98bb37c Add mode that prints multiple worker status in console 2025-01-16 10:04:02 +02:00
3 changed files with 53 additions and 6 deletions

NOTES.md (Normal file, 13 lines changed)

@@ -0,0 +1,13 @@
# Notes
Avoiding endless loops while crawling
- Make sure we follow robots.txt
- Announce our own agent so people can block us in their robots.txt
- Put a limit on number of pages per host, and notify on limit reach.
- Put a limit on the number of redirects (not needed?)
Heuristics:
- Do _not_ parse links from pages that have '/git/' or '/cgi/' or '/cgi-bin/' in their URLs.
- Have a list of "whitelisted" hosts/urls that we visit in regular intervals.
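The path heuristic and the per-host page limit from these notes could look roughly like the sketch below. It is an illustration only and not code from this repository: `ShouldParseLinks`, `HostLimiter`, and the example URLs are hypothetical names, and the limiter is deliberately not safe for concurrent use.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// skipSegments holds the path fragments named in the notes.
var skipSegments = []string{"/git/", "/cgi/", "/cgi-bin/"}

// ShouldParseLinks reports whether links should be extracted from the
// page at rawURL. Unparseable URLs are treated as "do not parse".
func ShouldParseLinks(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	for _, seg := range skipSegments {
		if strings.Contains(u.Path, seg) {
			return false
		}
	}
	return true
}

// HostLimiter counts visited pages per host (hypothetical helper).
// It is not safe for concurrent use; a real crawler would add a mutex.
type HostLimiter struct {
	max   int
	pages map[string]int
}

// NewHostLimiter returns a limiter that allows at most max pages per host.
func NewHostLimiter(max int) *HostLimiter {
	return &HostLimiter{max: max, pages: make(map[string]int)}
}

// Allow records one more page for host and reports whether the host is
// still within its limit. A false result is the point to notify.
func (h *HostLimiter) Allow(host string) bool {
	if h.pages[host] >= h.max {
		return false
	}
	h.pages[host]++
	return true
}

func main() {
	fmt.Println(ShouldParseLinks("gemini://example.org/git/repo/tree"))
	fmt.Println(ShouldParseLinks("gemini://example.org/docs/index.gmi"))

	limiter := NewHostLimiter(1)
	fmt.Println(limiter.Allow("example.org"), limiter.Allow("example.org"))
}
```

Matching on the parsed path rather than the raw string keeps a query string such as `?next=/git/` from triggering the rule.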


@@ -22,12 +22,8 @@ This crawler uses `InsecureSkipVerify: true` in TLS configuration to accept all
 ## How to run
-```shell
-make build
-./dist/crawler --help
-```
-Check `misc/sql/initdb.sql` to create the PostgreSQL tables.
+Spin up a PostgreSQL, check `misc/sql/initdb.sql` to create the tables and start the crawler.
+All configuration is done via command-line flags.
 ## Configuration

TODO.md (Normal file, 38 lines changed)

@@ -0,0 +1,38 @@
# TODO
## Outstanding Issues
### 1. Ctrl+C Signal Handling Issue
**Problem**: The crawler sometimes doesn't exit properly when Ctrl+C is pressed.
**Root Cause**: The main thread gets stuck in blocking operations before it can check for signals:
- Database operations in the polling loop (`cmd/crawler/crawler.go:239-250`)
- Job queueing when channel is full (`jobs <- url` can block if workers are slow)
- Long-running database transactions
**Location**: `cmd/crawler/crawler.go` - main polling loop starting at line 233
**Solution**: Add signal/context checking to blocking operations:
- Use cancellable context instead of `context.Background()` for database operations
- Make job queueing non-blocking or context-aware
- Add timeout/cancellation to database operations
### 2. fetchSnapshotsFromHistory() Doesn't Work with --skip-identical-content=true
**Problem**: When `--skip-identical-content=true` (default), URLs with unchanged content get continuously re-queued.
**Root Cause**: The function tracks when content last changed, not when URLs were last crawled:
- Identical content → no new snapshot created
- Query finds old snapshot timestamp → re-queues URL
- Creates infinite loop of re-crawling unchanged content
**Location**: `cmd/crawler/crawler.go:388-470` - `fetchSnapshotsFromHistory()` function
**Solution Options**:
1. Add `last_crawled` timestamp to URLs table
2. Create separate `crawl_attempts` table
3. Always create snapshot entries (even for duplicates) but mark them as such
4. Modify logic to work with existing schema constraints
**Current Status**: Function assumes `SkipIdenticalContent=false` per original comment at line 391.