Fix infinite recrawl loop with skip-identical-content
Add last_crawled timestamp tracking to fix fetchSnapshotsFromHistory() infinite loop when SkipIdenticalContent=true. Now tracks actual crawl attempts separately from content changes via database DEFAULT timestamps.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@@ -94,7 +94,8 @@ links = :links,
 lang = :lang,
 response_code = :response_code,
 error = :error,
-header = :header
+header = :header,
+last_crawled = CURRENT_TIMESTAMP
 WHERE id = :id
 RETURNING id
 `
@@ -139,4 +140,9 @@ RETURNING id
 AND timestamp BETWEEN $2 AND $3
 ORDER BY timestamp DESC
 `
+// New query to record crawl attempt when content is identical (no new snapshot needed)
+SQL_RECORD_CRAWL_ATTEMPT = `
+INSERT INTO snapshots (url, host, mimetype, response_code, error)
+VALUES ($1, $2, $3, $4, $5)
+`
 )
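The diff adds the query but not its call site. As a rough sketch only, the new statement could be executed through the standard `database/sql` `Exec` method as shown below. The `execer` interface, `crawlAttempt` struct, `recordCrawlAttempt` function, and `fakeDB` type are hypothetical names introduced for illustration. The sketch also assumes the `error` column accepts a plain string and, following the commit message, that `last_crawled` is filled by a column DEFAULT rather than passed explicitly.

```go
package main

import (
	"database/sql"
	"fmt"
)

// Query text copied from the diff above.
const sqlRecordCrawlAttempt = `
INSERT INTO snapshots (url, host, mimetype, response_code, error)
VALUES ($1, $2, $3, $4, $5)
`

// execer is a hypothetical interface; *sql.DB and *sql.Tx both satisfy it.
type execer interface {
	Exec(query string, args ...any) (sql.Result, error)
}

// crawlAttempt is a hypothetical struct holding the five inserted columns.
type crawlAttempt struct {
	URL          string
	Host         string
	MimeType     string
	ResponseCode int
	Err          string // assumed to map to a text column
}

// recordCrawlAttempt inserts a row for a crawl whose content was identical.
// last_crawled is not passed; it is assumed to come from a column DEFAULT.
func recordCrawlAttempt(db execer, a crawlAttempt) error {
	_, err := db.Exec(sqlRecordCrawlAttempt,
		a.URL, a.Host, a.MimeType, a.ResponseCode, a.Err)
	if err != nil {
		return fmt.Errorf("record crawl attempt for %s: %w", a.URL, err)
	}
	return nil
}

// fakeDB captures the arguments so the sketch runs without a real database.
type fakeDB struct{ args []any }

func (f *fakeDB) Exec(query string, args ...any) (sql.Result, error) {
	f.args = args
	return nil, nil
}

func main() {
	db := &fakeDB{}
	err := recordCrawlAttempt(db, crawlAttempt{
		URL:          "https://example.org/",
		Host:         "example.org",
		MimeType:     "text/html",
		ResponseCode: 200,
	})
	fmt.Println(err, len(db.args))
}
```

Accepting an interface instead of a concrete `*sql.DB` keeps the helper usable inside a transaction and easy to exercise with a fake, which is the only reason `fakeDB` appears here.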