Compare commits

38 Commits

Author SHA1 Message Date
antanst
8bbe6efabc Improve error handling and add duplicate snapshot cleanup
🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-06-18 11:56:26 +03:00
antanst
aa2658e61e Fix snapshot overwrite logic to preserve successful responses
- Prevent overwriting snapshots that have valid response codes
- Ensure URL is removed from queue when snapshot update is skipped
- Add last_crawled timestamp tracking for better crawl scheduling
- Remove SkipIdenticalContent flag, simplify content deduplication logic
- Update database schema with last_crawled column and indexes

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-06-18 11:23:56 +03:00
antanst
4e225ee866 Fix infinite recrawl loop with skip-identical-content
Add last_crawled timestamp tracking to fix fetchSnapshotsFromHistory()
infinite loop when SkipIdenticalContent=true. Now tracks actual crawl
attempts separately from content changes via database DEFAULT timestamps.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-06-17 10:41:17 +03:00
antanst
f9024d15aa Refine content deduplication and improve configuration 2025-06-16 17:09:26 +03:00
antanst
330b596497 Enhance crawler with seed list and SQL utilities
Add seedList module for URL initialization, comprehensive SQL utilities for database analysis, and update project configuration.
2025-06-16 12:29:33 +03:00
51f94c90b2 Update documentation and project configuration
- Add architecture documentation for versioned snapshots
- Update Makefile with improved build commands
- Update dependency versions in go.mod
- Add project notes and development guidelines
- Improve README with new features and instructions
2025-05-22 13:26:11 +03:00
bfaa857fae Update and refactor core functionality
- Update common package utilities
- Refactor network code for better error handling
- Remove deprecated files and functionality
- Enhance blacklist and filtering capabilities
- Improve snapshot handling and processing
2025-05-22 12:47:01 +03:00
5cc82f2c75 Modernize host pool management
- Add context-aware host pool operations
- Implement rate limiting for host connections
- Improve concurrency handling with mutexes
- Add host connection tracking
2025-05-22 12:46:42 +03:00
eca54b2f68 Implement context-aware database operations
- Add context support to database operations
- Implement versioned snapshots for URL history
- Update database queries to support URL timestamps
- Improve transaction handling with context
- Add utility functions for snapshot history
2025-05-22 12:46:36 +03:00
7d27e5a123 Add whitelist functionality
- Implement whitelist package for filtering URLs
- Support pattern matching for allowed URLs
- Add URL validation against whitelist patterns
- Include test cases for whitelist functionality
2025-05-22 12:46:28 +03:00
8a9ca0b2e7 Add robots.txt parsing and matching functionality
- Create separate robotsMatch package for robots.txt handling
- Implement robots.txt parsing with support for different directives
- Add support for both Allow and Disallow patterns
- Include robots.txt matching with efficient pattern matching
- Add test cases for robots matching
2025-05-22 12:46:21 +03:00
5940a117fd Add context-aware network operations
- Implement context-aware versions of network operations
- Add request cancellation support throughout network code
- Use structured logging with context metadata
- Support timeout management with contexts
- Improve error handling with detailed logging

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-05-22 12:45:58 +03:00
d1c326f868 Improve error handling with xerrors package
- Replace custom error handling with xerrors package
- Enhance error descriptions for better debugging
- Add text utilities for string processing
- Update error tests to use standard errors package
- Add String() method to GeminiError

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-05-22 12:45:46 +03:00
a55f820f62 Implement structured logging with slog
- Replace zerolog with Go's standard slog package
- Add ColorHandler for terminal color output
- Add context-aware logging system
- Format attributes on the same line as log messages
- Use green color for INFO level logs
- Set up context value extraction helpers

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-05-22 12:44:08 +03:00
ad224a328e Change errors to use xerrors package. 2025-05-12 20:37:58 +03:00
a823f5abc3 Fix Makefile. 2025-03-10 16:54:06 +02:00
658c5f5471 Fix linter warnings in gemini/network.go
Remove redundant nil checks before len() operations as len() for nil slices is defined as zero in Go.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-03-10 11:34:29 +02:00
efaedcc6b2 Improvements in error handling & descriptions 2025-02-27 09:20:22 +02:00
9dc008cb0f Use go_errors library everywhere. 2025-02-26 13:31:46 +02:00
c82b436d32 Update license and readme. 2025-02-26 10:39:51 +02:00
4f47521401 update gitignore 2025-02-26 10:37:20 +02:00
96a39ec3b6 Improve main error handling 2025-02-26 10:37:09 +02:00
54474d45cd Use Go race detector 2025-02-26 10:36:51 +02:00
d306c44f3d Tidy go mod 2025-02-26 10:36:41 +02:00
79e3175467 Add gemget script that downloads Gemini pages 2025-02-26 10:35:54 +02:00
d89dd72fe9 Add Gopherspace crawling! 2025-02-26 10:35:28 +02:00
29877cb2da Simplify host pool 2025-02-26 10:35:11 +02:00
4bceb75695 Reorganize code for more granular imports 2025-02-26 10:34:46 +02:00
a9983f3531 Reorganize errors 2025-02-26 10:32:38 +02:00
5cf720103f Improve blacklist to use regex matching 2025-02-26 10:32:01 +02:00
b6dd77e57e Add regex matching function to util 2025-01-16 22:37:39 +02:00
973a4f3a2d Add tidy & update Makefile targets 2025-01-16 22:37:39 +02:00
b30b7274ec Simplify duplicate code 2025-01-16 22:37:39 +02:00
63adf73ef9 Proper package in tests 2025-01-16 10:04:02 +02:00
b3387ce7ad Add DB scan error 2025-01-16 10:04:02 +02:00
9ade26b6e8 Simplify IP pool and convert it to host pool 2025-01-16 10:04:02 +02:00
4a345a1763 Break up Gemtext link parsing code and improve tests. 2025-01-16 10:04:02 +02:00
64f98bb37c Add mode that prints multiple worker status in console 2025-01-16 10:04:02 +02:00
9 changed files with 145 additions and 121 deletions

NOTES.md (new file)

@@ -0,0 +1,13 @@
# Notes
Avoiding endless loops while crawling
- Make sure we follow robots.txt
- Announce our own agent so people can block us in their robots.txt
- Put a limit on the number of pages per host, and notify when the limit is reached.
- Put a limit on the number of redirects (not needed?)
Heuristics:
- Do _not_ parse links from pages that have '/git/' or '/cgi/' or '/cgi-bin/' in their URLs.
- Have a list of "whitelisted" hosts/urls that we visit in regular intervals.

README.md

@@ -4,13 +4,13 @@ A crawler for the [Gemini](https://en.wikipedia.org/wiki/Gemini_(protocol)) netw
 Easily extendable as a "wayback machine" of Gemini.
 ## Features
-- [x] Concurrent downloading with configurable number of workers
 - [x] Save image/* and text/* files
+- [x] Concurrent downloading with configurable number of workers
 - [x] Connection limit per host
 - [x] URL Blacklist
 - [x] URL Whitelist (overrides blacklist and robots.txt)
 - [x] Follow robots.txt, see gemini://geminiprotocol.net/docs/companion/robots.gmi
-- [x] Configuration via command-line flags
+- [x] Configuration via environment variables
 - [x] Storing capsule snapshots in PostgreSQL
 - [x] Proper response header & body UTF-8 and format validation
 - [x] Proper URL normalization
@@ -22,59 +22,46 @@ This crawler uses `InsecureSkipVerify: true` in TLS configuration to accept all
 ## How to run
-```shell
-make build
-./dist/crawler --help
-```
-Check `misc/sql/initdb.sql` to create the PostgreSQL tables.
+Spin up a PostgreSQL, check `db/sql/initdb.sql` to create the tables and start the crawler.
+All configuration is done via environment variables.
 ## Configuration
-Available command-line flags:
+Bool can be `true`,`false` or `0`,`1`.
 ```text
--blacklist-path string
-    File that has blacklist regexes
--dry-run
-    Dry run mode
--gopher
-    Enable crawling of Gopher holes
--log-level string
-    Logging level (debug, info, warn, error) (default "info")
--max-db-connections int
-    Maximum number of database connections (default 100)
--max-response-size int
-    Maximum size of response in bytes (default 1048576)
--pgurl string
-    Postgres URL
--response-timeout int
-    Timeout for network responses in seconds (default 10)
--seed-url-path string
-    File with seed URLs that should be added to the queue immediately
--skip-if-updated-days int
-    Skip re-crawling URLs updated within this many days (0 to disable) (default 60)
--whitelist-path string
-    File with URLs that should always be crawled regardless of blacklist
--workers int
-    Number of concurrent workers (default 1)
+LogLevel string // Logging level (debug, info, warn, error)
+MaxResponseSize int // Maximum size of response in bytes
+NumOfWorkers int // Number of concurrent workers
+ResponseTimeout int // Timeout for responses in seconds
+PanicOnUnexpectedError bool // Panic on unexpected errors when visiting a URL
+BlacklistPath string // File that has blacklisted strings of "host:port"
+WhitelistPath string // File with URLs that should always be crawled regardless of blacklist or robots.txt
+DryRun bool // If false, don't write to disk
+SkipIdenticalContent bool // When true, skip storing snapshots with identical content
+SkipIfUpdatedDays int // Skip re-crawling URLs updated within this many days (0 to disable)
 ```
 Example:
 ```shell
-./dist/crawler \
--pgurl="postgres://test:test@127.0.0.1:5434/test?sslmode=disable" \
--log-level=info \
--workers=10 \
--blacklist-path="./blacklist.txt" \
--whitelist-path="./whitelist.txt" \
--max-response-size=10485760 \
--response-timeout=10 \
--max-db-connections=100 \
--skip-if-updated-days=7 \
--gopher \
--seed-url-path="./seed_urls.txt"
+LOG_LEVEL=info \
+NUM_OF_WORKERS=10 \
+BLACKLIST_PATH="./blacklist.txt" \ # one url per line, can be empty
+WHITELIST_PATH="./whitelist.txt" \ # URLs that override blacklist and robots.txt
+MAX_RESPONSE_SIZE=10485760 \
+RESPONSE_TIMEOUT=10 \
+PANIC_ON_UNEXPECTED_ERROR=true \
+PG_DATABASE=test \
+PG_HOST=127.0.0.1 \
+PG_MAX_OPEN_CONNECTIONS=100 \
+PG_PORT=5434 \
+PG_USER=test \
+PG_PASSWORD=test \
+DRY_RUN=false \
+SKIP_IDENTICAL_CONTENT=false \
+SKIP_IF_UPDATED_DAYS=7 \
+./gemini-grc
 ```
 ## Development
@@ -120,9 +107,6 @@ You can access the snapshot history using the included `snapshot_history.sh` scr
 Good starting points:
 gemini://warmedal.se/~antenna/
 gemini://tlgs.one/
 gopher://i-logout.cz:70/1/bongusta/
 gopher://gopher.quux.org:70/
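The new README documents configuration purely through environment variables, including the `true`/`false`/`0`/`1` bool convention. A minimal Go sketch of how such a loader could look; `LoadConfig`, `getBool`, and `getInt` are illustrative assumptions, since the crawler's actual config package is not shown in this compare:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Config mirrors a small subset of the documented environment variables.
type Config struct {
	LogLevel     string
	NumOfWorkers int
	DryRun       bool
}

// getBool accepts "true"/"false" as well as "1"/"0", as the README describes.
func getBool(key string, def bool) bool {
	v := os.Getenv(key)
	if v == "" {
		return def
	}
	b, err := strconv.ParseBool(v) // ParseBool already handles 0/1/true/false
	if err != nil {
		return def
	}
	return b
}

func getInt(key string, def int) int {
	v := os.Getenv(key)
	if v == "" {
		return def
	}
	n, err := strconv.Atoi(v)
	if err != nil {
		return def
	}
	return n
}

// LoadConfig reads the environment once at startup.
func LoadConfig() Config {
	return Config{
		LogLevel:     os.Getenv("LOG_LEVEL"),
		NumOfWorkers: getInt("NUM_OF_WORKERS", 1),
		DryRun:       getBool("DRY_RUN", false),
	}
}

func main() {
	fmt.Printf("%+v\n", LoadConfig())
}
```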

TODO.md (new file)

@@ -0,0 +1,38 @@
# TODO
## Outstanding Issues
### 1. Ctrl+C Signal Handling Issue
**Problem**: The crawler sometimes doesn't exit properly when Ctrl+C is pressed.
**Root Cause**: The main thread gets stuck in blocking operations before it can check for signals:
- Database operations in the polling loop (`cmd/crawler/crawler.go:239-250`)
- Job queueing when channel is full (`jobs <- url` can block if workers are slow)
- Long-running database transactions
**Location**: `cmd/crawler/crawler.go` - main polling loop starting at line 233
**Solution**: Add signal/context checking to blocking operations:
- Use cancellable context instead of `context.Background()` for database operations
- Make job queueing non-blocking or context-aware
- Add timeout/cancellation to database operations
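To illustrate the context-aware queueing suggested above, here is a minimal Go sketch; the `enqueue` helper and the unbuffered channel are assumptions for illustration, only the `jobs <- url` pattern comes from the diff itself:

```go
package main

import (
	"context"
	"fmt"
)

// enqueue blocks on jobs <- url, but returns as soon as ctx is cancelled,
// so a signal handler that cancels ctx can no longer be wedged by a full channel.
func enqueue(ctx context.Context, jobs chan<- string, url string) error {
	select {
	case jobs <- url:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	jobs := make(chan string) // unbuffered: would block forever without a reader
	cancel()                  // simulate Ctrl+C handling cancelling the context
	fmt.Println(enqueue(ctx, jobs, "gemini://example.org/")) // prints: context canceled
}
```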
### 2. fetchSnapshotsFromHistory() Doesn't Work with --skip-identical-content=true
**Problem**: When `--skip-identical-content=true` (default), URLs with unchanged content get continuously re-queued.
**Root Cause**: The function tracks when content last changed, not when URLs were last crawled:
- Identical content → no new snapshot created
- Query finds old snapshot timestamp → re-queues URL
- Creates infinite loop of re-crawling unchanged content
**Location**: `cmd/crawler/crawler.go:388-470` - `fetchSnapshotsFromHistory()` function
**Solution Options**:
1. Add `last_crawled` timestamp to URLs table
2. Create separate `crawl_attempts` table
3. Always create snapshot entries (even for duplicates) but mark them as such
4. Modify logic to work with existing schema constraints
**Current Status**: Function assumes `SkipIdenticalContent=false` per original comment at line 391.
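Option 1 above is the direction the later commits appear to take (a `last_crawled` timestamp bumped on every crawl attempt). A minimal Go sketch of recording an attempt even when the snapshot itself is skipped; `markCrawlAttempt` is a hypothetical helper, though the embedded query mirrors the SQL_UPDATE_LAST_CRAWLED statement shown later in this diff:

```go
package recrawl

import (
	"context"

	"github.com/jmoiron/sqlx"
)

// markCrawlAttempt bumps last_crawled on the most recent snapshot of a URL,
// so the recrawl query keys off crawl attempts rather than content changes.
func markCrawlAttempt(ctx context.Context, tx *sqlx.Tx, url string) error {
	const q = `
UPDATE snapshots
SET last_crawled = CURRENT_TIMESTAMP
WHERE id = (
    SELECT id FROM snapshots
    WHERE url = $1
    ORDER BY timestamp DESC
    LIMIT 1
)`
	_, err := tx.ExecContext(ctx, q, url)
	return err
}
```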

cmd/crawler/crawler.go

@@ -148,7 +148,7 @@ func spawnWorkers(total int) {
     go func(a int) {
         for {
             job := <-jobs
-            common.RunWorkerWithTx(a, job)
+            common.RunWorkerWithTx(id, job)
         }
     }(id)
 }
@@ -215,7 +215,7 @@ func runJobScheduler() {
         common.FatalErrorsChan <- err
         return
     }
-    // Commit this tx here so the loop below sees the changes.
+    // Commit this tx here so the loop sees the changes.
     err := tx.Commit()
     if err != nil {
         common.FatalErrorsChan <- err
@@ -251,14 +251,14 @@ func runJobScheduler() {
     // When out of pending URLs, add some random ones.
     if len(distinctHosts) == 0 {
         // Queue random old URLs from history.
-        count, err := fetchSnapshotsFromHistory(dbCtx, tx, config.CONFIG.NumOfWorkers, config.CONFIG.SkipIfUpdatedDays)
+        count, err := fetchSnapshotsFromHistory(dbCtx, tx, config.CONFIG.NumOfWorkers*3, config.CONFIG.SkipIfUpdatedDays)
         if err != nil {
             common.FatalErrorsChan <- err
             return
         }
         if count == 0 {
-            contextlog.LogInfoWithContext(ctx, logging.GetSlogger(), "No work, waiting to poll DB...")
-            time.Sleep(120 * time.Second)
+            contextlog.LogDebugWithContext(ctx, logging.GetSlogger(), "No work, waiting to poll DB...")
+            time.Sleep(30 * time.Second)
             continue
         }
         distinctHosts, err = gemdb.Database.GetUrlHosts(dbCtx, tx)
@@ -269,7 +269,7 @@ func runJobScheduler() {
     }
     // Get some URLs from each host, up to a limit
-    urls, err := gemdb.Database.GetRandomUrlsFromHosts(dbCtx, distinctHosts, config.CONFIG.NumOfWorkers, tx)
+    urls, err := gemdb.Database.GetRandomUrlsFromHosts(dbCtx, distinctHosts, 10, tx)
     if err != nil {
         common.FatalErrorsChan <- err
         return
@@ -282,39 +282,28 @@ func runJobScheduler() {
     }
     if len(urls) == 0 {
-        contextlog.LogInfoWithContext(ctx, logging.GetSlogger(), "No work, waiting to poll DB...")
-        time.Sleep(120 * time.Second)
+        contextlog.LogDebugWithContext(ctx, logging.GetSlogger(), "No work, waiting to poll DB...")
+        time.Sleep(30 * time.Second)
         continue
     }
-    contextlog.LogInfoWithContext(ctx, logging.GetSlogger(), "%d urls to crawl", len(urls))
-    // Add jobs to WaitGroup before queuing
-    common.WorkerWG.Add(len(urls))
+    contextlog.LogInfoWithContext(ctx, logging.GetSlogger(), "Queueing %d distinct hosts -> %d urls to crawl", len(distinctHosts), len(urls))
     for _, url := range urls {
         jobs <- url
     }
-    // Wait for all workers to complete their jobs
-    common.WorkerWG.Wait()
-    contextlog.LogInfoWithContext(ctx, logging.GetSlogger(), "All workers done. New scheduler run starts")
-    logging.LogInfo("")
-    logging.LogInfo("")
     }
 }
 func enqueueSeedURLs(ctx context.Context, tx *sqlx.Tx) error {
     // Get seed URLs from seedList module
-    //urls := seedList.GetSeedURLs()
-    //
-    //for _, url := range urls {
-    //    err := gemdb.Database.InsertURL(ctx, tx, url)
-    //    if err != nil {
-    //        return err
-    //    }
-    //}
+    urls := seedList.GetSeedURLs()
+    for _, url := range urls {
+        err := gemdb.Database.InsertURL(ctx, tx, url)
+        if err != nil {
+            return err
+        }
+    }
     return nil
 }
@@ -343,6 +332,7 @@ func fetchSnapshotsFromHistory(ctx context.Context, tx *sqlx.Tx, num int, age in
     }
     if len(snapshotURLs) == 0 {
+        contextlog.LogInfoWithContext(historyCtx, logging.GetSlogger(), "No URLs with old latest crawl attempts found to recrawl")
         return 0, nil
     }

@@ -1,9 +1,6 @@
 package common
-import (
-    "os"
-    "sync"
-)
+import "os"
 // FatalErrorsChan accepts errors from workers.
 // In case of fatal error, gracefully
@@ -11,7 +8,6 @@ import (
 var (
     FatalErrorsChan chan error
     SignalsChan chan os.Signal
-    WorkerWG sync.WaitGroup
 )
 const VERSION string = "0.0.1"

@@ -27,6 +27,7 @@ import (
 )
 func RunWorkerWithTx(workerID int, job string) {
+    // Extract host from URL for the context.
     parsedURL, err := url2.ParseURL(job, "", true)
     if err != nil {
         logging.LogInfo("Failed to parse URL: %s Error: %s", job, err)
@@ -39,6 +40,7 @@ func RunWorkerWithTx(workerID int, job string) {
     ctx, cancel := contextutil.NewRequestContext(baseCtx, job, host, workerID)
     ctx = contextutil.ContextWithComponent(ctx, "worker")
     defer cancel() // Ensure the context is cancelled when we're done
+    // contextlog.LogInfoWithContext(ctx, logging.GetSlogger(), "======================================\n\n")
     contextlog.LogDebugWithContext(ctx, logging.GetSlogger(), "Starting worker for URL %s", job)
     // Create a new db transaction
@@ -49,7 +51,6 @@ func RunWorkerWithTx(workerID int, job string) {
     }
     err = runWorker(ctx, tx, []string{job})
-    WorkerWG.Done()
     if err != nil {
         // Two cases to handle:
         // - context cancellation/timeout errors (log and ignore)
@@ -113,11 +114,17 @@ func WorkOnUrl(ctx context.Context, tx *sqlx.Tx, url string) (err error) {
     s, err := snapshot.SnapshotFromURL(url, true)
     if err != nil {
+        contextlog.LogErrorWithContext(ctx, logging.GetSlogger(), "Failed to parse URL: %v", err)
         return err
     }
     // We always use the normalized URL
     if url != s.URL.Full {
+        //err = gemdb.Database.CheckAndUpdateNormalizedURL(ctx, tx, url, s.URL.Full)
+        //if err != nil {
+        //    return err
+        //}
+        //contextlog.LogDebugWithContext(ctx, logging.GetSlogger(), "Normalized URL: %s → %s", url, s.URL.Full)
         url = s.URL.Full
     }
@@ -140,6 +147,7 @@ func WorkOnUrl(ctx context.Context, tx *sqlx.Tx, url string) (err error) {
     // Only check blacklist if URL is not whitelisted
     if !isUrlWhitelisted && blackList.IsBlacklisted(s.URL.String()) {
+        contextlog.LogDebugWithContext(ctx, logging.GetSlogger(), "URL matches blacklist, ignoring %s", url)
         s.Error = null.StringFrom(commonErrors.ErrBlacklistMatch.Error())
         return saveSnapshotAndRemoveURL(ctx, tx, s)
     }
@@ -151,6 +159,7 @@ func WorkOnUrl(ctx context.Context, tx *sqlx.Tx, url string) (err error) {
     // add it as an error and remove url
     robotMatch = robotsMatch.RobotMatch(ctx, s.URL.String())
     if robotMatch {
+        contextlog.LogDebugWithContext(ctx, logging.GetSlogger(), "URL matches robots.txt, skipping")
         s.Error = null.StringFrom(commonErrors.ErrRobotsMatch.Error())
         return saveSnapshotAndRemoveURL(ctx, tx, s)
     }
@@ -175,6 +184,7 @@ func WorkOnUrl(ctx context.Context, tx *sqlx.Tx, url string) (err error) {
     }
     if err != nil {
+        contextlog.LogInfoWithContext(ctx, logging.GetSlogger(), "Error visiting URL: %v", err)
         return err
     }
@@ -213,32 +223,40 @@ func WorkOnUrl(ctx context.Context, tx *sqlx.Tx, url string) (err error) {
         }
     }
-    // Save the snapshot and remove the URL from the queue
+    if s.Error.ValueOrZero() != "" {
+        // Only save error if we didn't have any valid
+        // snapshot data from a previous crawl!
+        shouldUpdateSnapshot, err := shouldUpdateSnapshotData(ctx, tx, s)
+        if err != nil {
+            return err
+        }
+        if shouldUpdateSnapshot {
+            contextlog.LogInfoWithContext(ctx, logging.GetSlogger(), "%2d %s", s.ResponseCode.ValueOrZero(), s.Error.ValueOrZero())
             return saveSnapshotAndRemoveURL(ctx, tx, s)
+        } else {
+            contextlog.LogInfoWithContext(ctx, logging.GetSlogger(), "%2d %s (but old content exists, not updating)", s.ResponseCode.ValueOrZero(), s.Error.ValueOrZero())
+            return removeURL(ctx, tx, s.URL.String())
+        }
+    } else {
+        contextlog.LogInfoWithContext(ctx, logging.GetSlogger(), "%2d", s.ResponseCode.ValueOrZero())
+        return saveSnapshotAndRemoveURL(ctx, tx, s)
+    }
 }
 func shouldUpdateSnapshotData(ctx context.Context, tx *sqlx.Tx, s *snapshot.Snapshot) (bool, error) {
+    // If we don't have an error, save the new snapshot.
+    if !s.Error.Valid {
+        return true, nil
+    }
     prevSnapshot, err := gemdb.Database.GetLatestSnapshot(ctx, tx, s.URL.String())
     if err != nil {
         return false, err
     }
+    // If we don't have a previous snapshot, save it anyway.
     if prevSnapshot == nil {
         return true, nil
     }
-    // If we have a previous snapshot,
-    // and it didn't have an error, save.
-    // This means that we can have a max
-    // of one consecutive snapshot with
-    // an error.
-    if prevSnapshot.Error.ValueOrZero() == "" {
-        return true, nil
-    }
+    if prevSnapshot.ResponseCode.Valid {
         return false, nil
     }
+    return true, nil
+}
 func isContentIdentical(ctx context.Context, tx *sqlx.Tx, s *snapshot.Snapshot) (bool, error) {
     // Always check if content is identical to previous snapshot
@@ -277,25 +295,11 @@ func removeURL(ctx context.Context, tx *sqlx.Tx, url string) error {
 }
 func saveSnapshotAndRemoveURL(ctx context.Context, tx *sqlx.Tx, s *snapshot.Snapshot) error {
-    shouldUpdateSnapshot, err := shouldUpdateSnapshotData(ctx, tx, s)
-    if err != nil {
-        return err
-    }
-    if shouldUpdateSnapshot {
     err := gemdb.Database.SaveSnapshot(ctx, tx, s)
     if err != nil {
         return err
     }
-    contextlog.LogInfoWithContext(ctx, logging.GetSlogger(), "%2d", s.ResponseCode.ValueOrZero())
-    return removeURL(ctx, tx, s.URL.String())
-    } else {
-    contextlog.LogInfoWithContext(ctx, logging.GetSlogger(), "%2d %s (updating crawl date)", s.ResponseCode.ValueOrZero(), s.Error.ValueOrZero())
-    err = gemdb.Database.UpdateLastCrawled(ctx, tx, s.URL.String())
-    if err != nil {
-        return err
-    }
-    return removeURL(ctx, tx, s.URL.String())
-    }
+    return gemdb.Database.DeleteURL(ctx, tx, s.URL.String())
 }
 // shouldPersistURL returns true given URL is a

@@ -448,7 +448,7 @@ func (d *DbServiceImpl) GetLatestSnapshot(ctx context.Context, tx *sqlx.Tx, url
     if errors.Is(err, sql.ErrNoRows) {
         return nil, nil
     }
-    return nil, xerrors.NewError(fmt.Errorf("cannot get latest snapshot for URL %s: %w", url, err), 0, "", true)
+    return nil, xerrors.NewError(fmt.Errorf("cannot get latest snapshot for URL %s: %w", url, err), 0, "", false)
 }
 return s, nil
 }

@@ -115,7 +115,12 @@ LIMIT $1
 SQL_UPDATE_LAST_CRAWLED = `
 UPDATE snapshots
 SET last_crawled = CURRENT_TIMESTAMP
+WHERE id = (
+SELECT id FROM snapshots
 WHERE url = $1
+ORDER BY timestamp DESC
+LIMIT 1
+)
 `
 // SQL_FETCH_SNAPSHOTS_FROM_HISTORY Fetches URLs from snapshots for re-crawling based on last_crawled timestamp
 // This query finds root domain URLs that haven't been crawled recently and selects
@@ -132,7 +137,7 @@ LIMIT $1
 host,
 COALESCE(MAX(last_crawled), '1970-01-01'::timestamp) as latest_attempt
 FROM snapshots
-WHERE url ~ '^gemini://[^/]+/?$' AND mimetype = 'text/gemini' AND error IS NULL
+WHERE url ~ '^gemini://[^/]+/?$' AND mimetype = 'text/gemini'
 GROUP BY url, host
 ),
 root_urls_with_content AS (

@@ -1,6 +0,0 @@
-select count(*) from snapshots
-where last_crawled < now() - interval '30 days'
-and error IS NULL
-and gemtext IS NOT NULL
-and mimetype='text/gemini'
-and url ~ '^gemini://[^/]+/?$';