Compare commits: main...c386d5eb14 (24 commits)

Commits in this range:

- c386d5eb14
- 1ba432c127
- 21b8769bc5
- 389b3615e2
- af42383513
- 59893efc3d
- 621868f3b3
- 967f371777
- ada6cda4ac
- e9d7fa85ff
- 9938dc542b
- 37d5e7cd78
- dfb050588c
- ecaa7f338d
- 6a5284e91a
- 6b22953046
- 0821f78f2d
- fe40874844
- a7aa5cd410
- ef628eeb3c
- 376e1ced64
- 94429b2224
- a6dfc25e25
- a2d5b04d58

NOTES.md — new file (13 lines added)
@@ -0,0 +1,13 @@
# Notes

Avoiding endless loops while crawling

- Make sure we follow robots.txt
- Announce our own agent so people can block us in their robots.txt
- Put a limit on the number of pages per host, and notify when the limit is reached.
- Put a limit on the number of redirects (not needed?)

Heuristics:

- Do _not_ parse links from pages that have '/git/', '/cgi/' or '/cgi-bin/' in their URLs.
- Have a list of "whitelisted" hosts/URLs that we visit at regular intervals.

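The loop-avoidance rules above lend themselves to a small gate in front of the job queue. Below is a minimal Go sketch, not the crawler's actual code: the names (`crawlGate`, `AllowFetch`, `ParseLinksFrom`, `maxPagesPerHost`) and the User-Agent string are invented for this example, robots.txt parsing is left out, and the '/git/' rule is applied as a plain path check.

```go
// Minimal sketch of the loop-avoidance rules from NOTES.md; all names here
// are hypothetical and robots.txt handling is reduced to announcing an agent.
package main

import (
	"fmt"
	"net/url"
	"strings"
	"sync"
)

// Announced so site owners can target us in their robots.txt (placeholder value).
const userAgent = "ExampleCrawler/0.1 (+https://example.org/crawler)"

// Assumed per-host page budget; the real value would come from configuration.
const maxPagesPerHost = 1000

// crawlGate tracks how many pages have been fetched per host.
type crawlGate struct {
	mu     sync.Mutex
	counts map[string]int
}

func newCrawlGate() *crawlGate {
	return &crawlGate{counts: make(map[string]int)}
}

// AllowFetch enforces the per-host page limit.
func (g *crawlGate) AllowFetch(u *url.URL) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.counts[u.Host] >= maxPagesPerHost {
		// In the real crawler this is where a "limit reached" notification would fire.
		return false
	}
	g.counts[u.Host]++
	return true
}

// ParseLinksFrom applies the heuristic of not extracting links from pages
// whose URL contains '/git/', '/cgi/' or '/cgi-bin/'.
func ParseLinksFrom(u *url.URL) bool {
	for _, marker := range []string{"/git/", "/cgi/", "/cgi-bin/"} {
		if strings.Contains(u.Path, marker) {
			return false
		}
	}
	return true
}

func main() {
	gate := newCrawlGate()
	for _, raw := range []string{
		"https://example.org/blog/post-1",
		"https://example.org/cgi-bin/viewvc.cgi",
	} {
		u, err := url.Parse(raw)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Printf("%s fetch=%v parseLinks=%v agent=%q\n",
			raw, gate.AllowFetch(u), ParseLinksFrom(u), userAgent)
	}
}
```
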
@@ -22,12 +22,8 @@ This crawler uses `InsecureSkipVerify: true` in TLS configuration to accept all

## How to run

```shell
make build
./dist/crawler --help
```

-Check `misc/sql/initdb.sql` to create the PostgreSQL tables.
+Spin up a PostgreSQL, check `misc/sql/initdb.sql` to create the tables and start the crawler.
All configuration is done via command-line flags.

## Configuration

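Since the hunk above states that all configuration is flag-based, here is a hypothetical sketch of what that could look like with Go's standard `flag` package. Only `--skip-identical-content` is mentioned elsewhere in this diff (in TODO.md); the other flag names and defaults are invented for illustration, so check `./dist/crawler --help` for the real set.

```go
// Hypothetical illustration of flag-only configuration; flag names other than
// --skip-identical-content are assumptions, not the crawler's real interface.
package main

import (
	"flag"
	"fmt"
)

func main() {
	dbURL := flag.String("db-url", "postgres://localhost:5432/crawler", "PostgreSQL connection string (hypothetical flag)")
	workers := flag.Int("workers", 4, "number of crawl workers (hypothetical flag)")
	skipIdentical := flag.Bool("skip-identical-content", true, "skip snapshots whose content did not change")
	flag.Parse()

	fmt.Printf("db=%s workers=%d skip-identical-content=%v\n", *dbURL, *workers, *skipIdentical)
}
```
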
TODO.md — new file (38 lines added)
@@ -0,0 +1,38 @@
# TODO

## Outstanding Issues

### 1. Ctrl+C Signal Handling Issue

**Problem**: The crawler sometimes doesn't exit properly when Ctrl+C is pressed.

**Root Cause**: The main thread gets stuck in blocking operations before it can check for signals:
- Database operations in the polling loop (`cmd/crawler/crawler.go:239-250`)
- Job queueing when the channel is full (`jobs <- url` can block if workers are slow)
- Long-running database transactions

**Location**: `cmd/crawler/crawler.go` - main polling loop starting at line 233

**Solution**: Add signal/context checking to blocking operations (see the sketch after this list):
- Use a cancellable context instead of `context.Background()` for database operations
- Make job queueing non-blocking or context-aware
- Add timeouts/cancellation to database operations

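A minimal sketch of that fix, assuming hypothetical helper names (`pollURLs`, `enqueue`, `jobs`) rather than the crawler's real ones: derive a context that is cancelled on Ctrl+C/SIGTERM, pass it to database operations instead of `context.Background()`, and make the `jobs <- url` send abort when the context is cancelled.

```go
// Sketch only: pollURLs stands in for the polling-loop database query and
// enqueue for the `jobs <- url` send; names and structure are assumptions.
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// pollURLs simulates the database poll; a real version would use
// QueryContext(ctx, ...) so cancellation interrupts the query.
func pollURLs(ctx context.Context) ([]string, error) {
	select {
	case <-time.After(200 * time.Millisecond): // pretend the query took a while
		return []string{"https://example.org/"}, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// enqueue blocks until a worker takes the job or the context is cancelled,
// so a full channel can no longer wedge the main loop.
func enqueue(ctx context.Context, jobs chan<- string, url string) error {
	select {
	case jobs <- url:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// ctx is cancelled on Ctrl+C (SIGINT) or SIGTERM.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	jobs := make(chan string, 8)
	go func() { // stand-in worker
		for url := range jobs {
			fmt.Println("crawling", url)
		}
	}()

	for {
		urls, err := pollURLs(ctx)
		if err != nil {
			fmt.Println("shutting down:", err)
			return
		}
		for _, u := range urls {
			if err := enqueue(ctx, jobs, u); err != nil {
				fmt.Println("shutting down:", err)
				return
			}
		}
	}
}
```
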
### 2. fetchSnapshotsFromHistory() Doesn't Work with --skip-identical-content=true

**Problem**: When `--skip-identical-content=true` (the default), URLs with unchanged content get continuously re-queued.

**Root Cause**: The function tracks when content last changed, not when URLs were last crawled:
- Identical content → no new snapshot created
- Query finds old snapshot timestamp → re-queues URL
- Creates an infinite loop of re-crawling unchanged content

**Location**: `cmd/crawler/crawler.go:388-470` - `fetchSnapshotsFromHistory()` function

**Solution Options** (option 1 is sketched below):
1. Add a `last_crawled` timestamp to the URLs table
2. Create a separate `crawl_attempts` table
3. Always create snapshot entries (even for duplicates) but mark them as such
4. Modify the logic to work with existing schema constraints

**Current Status**: The function assumes `SkipIdenticalContent=false` per the original comment at line 391.

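A sketch of Solution Option 1: key re-queueing on a per-URL `last_crawled` timestamp instead of the latest snapshot's timestamp, and update that timestamp on every crawl attempt even when no new snapshot is written. The table and column names (`urls`, `url`, `last_crawled`), the recrawl interval, and the connection string are assumptions for illustration; the real schema lives in `misc/sql/initdb.sql`.

```go
// Sketch of Solution Option 1 (hypothetical schema: urls(url, last_crawled)).
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq" // PostgreSQL driver
)

// urlsDueForRecrawl returns URLs whose last crawl attempt is older than the
// given interval, regardless of whether that attempt produced a new snapshot.
func urlsDueForRecrawl(ctx context.Context, db *sql.DB, olderThan time.Duration) ([]string, error) {
	rows, err := db.QueryContext(ctx,
		`SELECT url FROM urls
		 WHERE last_crawled IS NULL OR last_crawled < $1`,
		time.Now().Add(-olderThan))
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var urls []string
	for rows.Next() {
		var u string
		if err := rows.Scan(&u); err != nil {
			return nil, err
		}
		urls = append(urls, u)
	}
	return urls, rows.Err()
}

// markCrawled records the attempt even when the content was identical and no
// snapshot row was written, which is what breaks the re-queue loop.
func markCrawled(ctx context.Context, db *sql.DB, url string) error {
	_, err := db.ExecContext(ctx,
		`UPDATE urls SET last_crawled = now() WHERE url = $1`, url)
	return err
}

func main() {
	// Placeholder DSN; the real crawler takes its connection info from flags.
	db, err := sql.Open("postgres", "postgres://localhost:5432/crawler?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	due, err := urlsDueForRecrawl(context.Background(), db, 24*time.Hour)
	if err != nil {
		log.Fatal(err)
	}
	for _, u := range due {
		fmt.Println("re-queue:", u)
	}
}
```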