Commit Graph

12 Commits

Author SHA1 Message Date
antanst
389b3615e2 Clean up logging in worker processing
- Move response code logging to happen after successful snapshot save
- Remove error message from log output for cleaner display
- Consolidate logging logic in saveSnapshotAndRemoveURL function
2025-06-29 22:34:15 +03:00
antanst
af42383513 Improve crawler performance and worker coordination
- Add WaitGroup synchronization for workers to prevent overlapping scheduler runs
- Increase history fetch multiplier and sleep intervals for better resource usage
- Simplify error handling and logging in worker processing
- Update SQL query to exclude error snapshots from history selection
- Fix worker ID variable reference in spawning loop
- Streamline snapshot update logic and error reporting
2025-06-29 22:34:15 +03:00
antanst
59893efc3d Update log message to reflect crawl date update behavior 2025-06-29 22:34:15 +03:00
antanst
621868f3b3 Update last_crawled timestamp when skipping duplicate content and improve error handling 2025-06-29 22:34:15 +03:00
antanst
967f371777 Improve error handling and add duplicate snapshot cleanup
🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-06-18 11:56:26 +03:00
antanst
ada6cda4ac Fix snapshot overwrite logic to preserve successful responses
- Prevent overwriting snapshots that have valid response codes
- Ensure URL is removed from queue when snapshot update is skipped
- Add last_crawled timestamp tracking for better crawl scheduling
- Remove SkipIdenticalContent flag, simplify content deduplication logic
- Update database schema with last_crawled column and indexes

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-06-18 11:23:56 +03:00
antanst
e9d7fa85ff Fix infinite recrawl loop with skip-identical-content
Add last_crawled timestamp tracking to fix fetchSnapshotsFromHistory()
infinite loop when SkipIdenticalContent=true. Now tracks actual crawl
attempts separately from content changes via database DEFAULT timestamps.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-06-17 10:41:17 +03:00
antanst
9938dc542b Refine content deduplication and improve configuration 2025-06-16 17:09:26 +03:00
antanst
37d5e7cd78 Enhance crawler with seed list and SQL utilities
Add seedList module for URL initialization, comprehensive SQL utilities for database analysis, and update project configuration.
2025-06-16 12:29:33 +03:00
ecaa7f338d Update and refactor core functionality
- Update common package utilities
- Refactor network code for better error handling
- Remove deprecated files and functionality
- Enhance blacklist and filtering capabilities
- Improve snapshot handling and processing
2025-05-22 12:47:01 +03:00
5b84960c5a Use go_errors library everywhere. 2025-02-26 13:31:46 +02:00
ca008b0796 Reorganize code for more granular imports 2025-02-26 10:34:46 +02:00