Commit Graph

7 Commits

Author SHA1 Message Date
antanst
2357135d5a Fix snapshot overwrite logic to preserve successful responses
- Prevent overwriting snapshots that have valid response codes
- Ensure URL is removed from queue when snapshot update is skipped
- Add last_crawled timestamp tracking for better crawl scheduling
- Remove SkipIdenticalContent flag, simplify content deduplication logic
- Update database schema with last_crawled column and indexes
2025-06-29 22:38:38 +03:00
antanst
98d3ed6707 Fix infinite recrawl loop with skip-identical-content
Add last_crawled timestamp tracking to fix fetchSnapshotsFromHistory()
infinite loop when SkipIdenticalContent=true. Now tracks actual crawl
attempts separately from content changes via database DEFAULT timestamps.
2025-06-29 22:38:38 +03:00
antanst
8b498a2603 Refine content deduplication and improve configuration 2025-06-29 22:38:38 +03:00
antanst
8588414b14 Enhance crawler with seed list and SQL utilities
Add seedList module for URL initialization, comprehensive SQL utilities for database analysis, and update project configuration.
2025-06-29 22:38:38 +03:00
a8173544e7 Update and refactor core functionality
- Update common package utilities
- Refactor network code for better error handling
- Remove deprecated files and functionality
- Enhance blacklist and filtering capabilities
- Improve snapshot handling and processing
2025-06-29 22:38:38 +03:00
5b84960c5a Use go_errors library everywhere. 2025-02-26 13:31:46 +02:00
ca008b0796 Reorganize code for more granular imports 2025-02-26 10:34:46 +02:00