Compare commits

..

24 Commits

Author SHA1 Message Date
antanst
26311a6d2b Update README to reflect command-line flag configuration
- Replace environment variables with command-line flags
- Update run example with proper flag syntax
- Fix database schema path to misc/sql/initdb.sql
- Add missing configuration options (gopher, seed-url-path, max-db-connections)
- Remove outdated configuration options
2025-06-29 22:38:38 +03:00
antanst
57eb2555c5 Improve crawler performance and logging
- Optimize job scheduler to use NumOfWorkers for URL limits
- Clean up verbose logging in worker processing
- Update log messages for better clarity
2025-06-29 22:38:38 +03:00
antanst
453cf2294a Update log message for clarity
- Change "old content" to "old snapshot" for more accurate terminology
2025-06-29 22:38:38 +03:00
antanst
ee2076f337 Clean up logging in worker processing
- Move response code logging to happen after successful snapshot save
- Remove error message from log output for cleaner display
- Consolidate logging logic in saveSnapshotAndRemoveURL function
2025-06-29 22:38:38 +03:00
antanst
acbac15c20 Improve crawler performance and worker coordination
- Add WaitGroup synchronization for workers to prevent overlapping scheduler runs
- Increase history fetch multiplier and sleep intervals for better resource usage
- Simplify error handling and logging in worker processing
- Update SQL query to exclude error snapshots from history selection
- Fix worker ID variable reference in spawning loop
- Streamline snapshot update logic and error reporting
2025-06-29 22:38:38 +03:00
antanst
ddbe6b461b Update log message to reflect crawl date update behavior 2025-06-29 22:38:38 +03:00
antanst
55bb0d96d0 Update last_crawled timestamp when skipping duplicate content and improve error handling 2025-06-29 22:38:38 +03:00
antanst
349968d019 Improve error handling and add duplicate snapshot cleanup 2025-06-29 22:38:38 +03:00
antanst
2357135d5a Fix snapshot overwrite logic to preserve successful responses
- Prevent overwriting snapshots that have valid response codes
- Ensure URL is removed from queue when snapshot update is skipped
- Add last_crawled timestamp tracking for better crawl scheduling
- Remove SkipIdenticalContent flag, simplify content deduplication logic
- Update database schema with last_crawled column and indexes
2025-06-29 22:38:38 +03:00
antanst
98d3ed6707 Fix infinite recrawl loop with skip-identical-content
Add last_crawled timestamp tracking to fix fetchSnapshotsFromHistory()
infinite loop when SkipIdenticalContent=true. Now tracks actual crawl
attempts separately from content changes via database DEFAULT timestamps.
2025-06-29 22:38:38 +03:00
antanst
8b498a2603 Refine content deduplication and improve configuration 2025-06-29 22:38:38 +03:00
antanst
8588414b14 Enhance crawler with seed list and SQL utilities
Add seedList module for URL initialization, comprehensive SQL utilities for database analysis, and update project configuration.
2025-06-29 22:38:38 +03:00
5e6dabf1e7 Update documentation and project configuration
- Add architecture documentation for versioned snapshots
- Update Makefile with improved build commands
- Update dependency versions in go.mod
- Add project notes and development guidelines
- Improve README with new features and instructions
2025-06-29 22:38:38 +03:00
a8173544e7 Update and refactor core functionality
- Update common package utilities
- Refactor network code for better error handling
- Remove deprecated files and functionality
- Enhance blacklist and filtering capabilities
- Improve snapshot handling and processing
2025-06-29 22:38:38 +03:00
3d07b56e8c Modernize host pool management
- Add context-aware host pool operations
- Implement rate limiting for host connections
- Improve concurrency handling with mutexes
- Add host connection tracking
2025-06-29 22:38:38 +03:00
c54c093a10 Implement context-aware database operations
- Add context support to database operations
- Implement versioned snapshots for URL history
- Update database queries to support URL timestamps
- Improve transaction handling with context
- Add utility functions for snapshot history
2025-06-29 22:38:38 +03:00
57f5c0e865 Add whitelist functionality
- Implement whitelist package for filtering URLs
- Support pattern matching for allowed URLs
- Add URL validation against whitelist patterns
- Include test cases for whitelist functionality
2025-06-29 22:38:38 +03:00
dc6eb610a2 Add robots.txt parsing and matching functionality
- Create separate robotsMatch package for robots.txt handling
- Implement robots.txt parsing with support for different directives
- Add support for both Allow and Disallow patterns
- Include robots.txt matching with efficient pattern matching
- Add test cases for robots matching
2025-06-29 22:38:38 +03:00
39e9ead982 Add context-aware network operations
- Implement context-aware versions of network operations
- Add request cancellation support throughout network code
- Use structured logging with context metadata
- Support timeout management with contexts
- Improve error handling with detailed logging
2025-06-29 22:38:38 +03:00
5f4da4f806 Improve error handling with xerrors package
- Replace custom error handling with xerrors package
- Enhance error descriptions for better debugging
- Add text utilities for string processing
- Update error tests to use standard errors package
- Add String() method to GeminiError
2025-06-29 22:38:38 +03:00
4ef3f70f1f Implement structured logging with slog
- Replace zerolog with Go's standard slog package
- Add ColorHandler for terminal color output
- Add context-aware logging system
- Format attributes on the same line as log messages
- Use green color for INFO level logs
- Set up context value extraction helpers
2025-06-29 22:38:38 +03:00
b8ea6fab4a Change errors to use xerrors package. 2025-06-29 22:38:38 +03:00
5fe1490f1e Fix Makefile. 2025-06-29 22:38:38 +03:00
a41490f834 Fix linter warnings in gemini/network.go
Remove redundant nil checks before len() operations as len() for nil slices is defined as zero in Go.
2025-06-29 22:38:38 +03:00

Diff Content Not Available