Commit Graph

42 Commits

Author SHA1 Message Date
antanst
e9d7fa85ff Fix infinite recrawl loop with skip-identical-content
Add last_crawled timestamp tracking to fix fetchSnapshotsFromHistory()
infinite loop when SkipIdenticalContent=true. Now tracks actual crawl
attempts separately from content changes via database DEFAULT timestamps.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-06-17 10:41:17 +03:00
antanst
9938dc542b Refine content deduplication and improve configuration 2025-06-16 17:09:26 +03:00
antanst
37d5e7cd78 Enhance crawler with seed list and SQL utilities
Add seedList module for URL initialization, comprehensive SQL utilities for database analysis, and update project configuration.
2025-06-16 12:29:33 +03:00
dfb050588c Update documentation and project configuration
- Add architecture documentation for versioned snapshots
- Update Makefile with improved build commands
- Update dependency versions in go.mod
- Add project notes and development guidelines
- Improve README with new features and instructions
2025-05-22 13:26:11 +03:00
ecaa7f338d Update and refactor core functionality
- Update common package utilities
- Refactor network code for better error handling
- Remove deprecated files and functionality
- Enhance blacklist and filtering capabilities
- Improve snapshot handling and processing
2025-05-22 12:47:01 +03:00
6a5284e91a Modernize host pool management
- Add context-aware host pool operations
- Implement rate limiting for host connections
- Improve concurrency handling with mutexes
- Add host connection tracking
2025-05-22 12:46:42 +03:00
6b22953046 Implement context-aware database operations
- Add context support to database operations
- Implement versioned snapshots for URL history
- Update database queries to support URL timestamps
- Improve transaction handling with context
- Add utility functions for snapshot history
2025-05-22 12:46:36 +03:00
0821f78f2d Add whitelist functionality
- Implement whitelist package for filtering URLs
- Support pattern matching for allowed URLs
- Add URL validation against whitelist patterns
- Include test cases for whitelist functionality
2025-05-22 12:46:28 +03:00
fe40874844 Add robots.txt parsing and matching functionality
- Create separate robotsMatch package for robots.txt handling
- Implement robots.txt parsing with support for different directives
- Add support for both Allow and Disallow patterns
- Include robots.txt matching with efficient pattern matching
- Add test cases for robots matching
2025-05-22 12:46:21 +03:00
a7aa5cd410 Add context-aware network operations
- Implement context-aware versions of network operations
- Add request cancellation support throughout network code
- Use structured logging with context metadata
- Support timeout management with contexts
- Improve error handling with detailed logging

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-05-22 12:45:58 +03:00
ef628eeb3c Improve error handling with xerrors package
- Replace custom error handling with xerrors package
- Enhance error descriptions for better debugging
- Add text utilities for string processing
- Update error tests to use standard errors package
- Add String() method to GeminiError

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-05-22 12:45:46 +03:00
376e1ced64 Implement structured logging with slog
- Replace zerolog with Go's standard slog package
- Add ColorHandler for terminal color output
- Add context-aware logging system
- Format attributes on the same line as log messages
- Use green color for INFO level logs
- Set up context value extraction helpers

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-05-22 12:44:08 +03:00
94429b2224 Change errors to use xerrors package. 2025-05-12 20:37:58 +03:00
a6dfc25e25 Fix Makefile. 2025-03-10 16:54:06 +02:00
a2d5b04d58 Fix linter warnings in gemini/network.go
Remove redundant nil checks before len() operations as len() for nil slices is defined as zero in Go.

🤖 Generated with [Claude Code](https://claude.ai/code)
2025-03-10 11:34:29 +02:00
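
The language rule cited in commit a2d5b04d58 can be checked with a few lines of Go. The `hasData` helper is a hypothetical example and is not taken from gemini/network.go.

```go
package main

import "fmt"

// hasData reports whether buf holds any bytes. No separate nil check is
// needed, because len of a nil slice is defined to be 0.
func hasData(buf []byte) bool {
	return len(buf) > 0
}

func main() {
	var empty []byte // nil slice, never allocated

	fmt.Println(len(empty), hasData(empty))        // 0 false
	fmt.Println(hasData([]byte("20 text/gemini"))) // true
}
```

A guard written as `buf != nil && len(buf) > 0` therefore behaves the same as `len(buf) > 0` alone, which is why linters flag the first form as redundant.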
701a5df44f Improvements in error handling & descriptions 2025-02-27 09:20:22 +02:00
5b84960c5a Use go_errors library everywhere. 2025-02-26 13:31:46 +02:00
be38104f05 Update license and readme. 2025-02-26 10:39:51 +02:00
d70d6c35a3 update gitignore 2025-02-26 10:37:20 +02:00
8399225046 Improve main error handling 2025-02-26 10:37:09 +02:00
e8e26ec76a Use Go race detector 2025-02-26 10:36:51 +02:00
f6ac5003b0 Tidy go mod 2025-02-26 10:36:41 +02:00
e626aabecb Add gemget script that downloads Gemini pages 2025-02-26 10:35:54 +02:00
ebf59c50b8 Add Gopherspace crawling! 2025-02-26 10:35:28 +02:00
2a041fec7c Simplify host pool 2025-02-26 10:35:11 +02:00
ca008b0796 Reorganize code for more granular imports 2025-02-26 10:34:46 +02:00
8350e106d6 Reorganize errors 2025-02-26 10:32:38 +02:00
9c7502b2a8 Improve blacklist to use regex matching 2025-02-26 10:32:01 +02:00
dda21e833c Add regex matching function to util 2025-01-16 22:37:39 +02:00
b0e7052c10 Add tidy & update Makefile targets 2025-01-16 22:37:39 +02:00
43b207c9ab Simplify duplicate code 2025-01-16 22:37:39 +02:00
285f2955e7 Proper package in tests 2025-01-16 10:04:02 +02:00
998b0e74ec Add DB scan error 2025-01-16 10:04:02 +02:00
766ee26f68 Simplify IP pool and convert it to host pool 2025-01-16 10:04:02 +02:00
5357ceb04d Break up Gemtext link parsing code and improve tests. 2025-01-16 10:04:02 +02:00
03e1849191 Add mode that prints multiple worker status in console 2025-01-16 10:04:02 +02:00
ccb8f6838e Update DB init instructions & README 2025-01-04 15:39:21 +02:00
4e6fad873b Break up common functions and small refactor. 2025-01-04 15:31:26 +02:00
b78fe00221 Add license. 2024-12-27 12:13:05 +02:00
90f6ecd024 Add README.md and Makefile. 2024-12-27 12:11:35 +02:00
b52df073e9 Add first version of gemini-grc. 2024-12-27 12:09:55 +02:00
93822b239e Initial commit. 2024-12-26 21:34:54 +02:00