Compare commits

45 Commits: main...0db2557cfc

0db2557cfc, db3448f448, 9a09dd7735, 0f62b0c622, 3bdff0e22e, a74f29d7b0,
ffeef334e7, 8bbe6efabc, aa2658e61e, 4e225ee866, f9024d15aa, 330b596497,
51f94c90b2, bfaa857fae, 5cc82f2c75, eca54b2f68, 7d27e5a123, 8a9ca0b2e7,
5940a117fd, d1c326f868, a55f820f62, ad224a328e, a823f5abc3, 658c5f5471,
efaedcc6b2, 9dc008cb0f, c82b436d32, 4f47521401, 96a39ec3b6, 54474d45cd,
d306c44f3d, 79e3175467, d89dd72fe9, 29877cb2da, 4bceb75695, a9983f3531,
5cf720103f, b6dd77e57e, 973a4f3a2d, b30b7274ec, 63adf73ef9, b3387ce7ad,
9ade26b6e8, 4a345a1763, 64f98bb37c
NOTES.md (new file, +13 lines)
@@ -0,0 +1,13 @@
# Notes

Avoiding endless loops while crawling:

- Make sure we follow robots.txt
- Announce our own agent so people can block us in their robots.txt
- Put a limit on the number of pages per host, and notify when the limit is reached (see the sketch after this list).
- Put a limit on the number of redirects (not needed?)
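The per-host page limit above is straightforward to enforce with a shared counter. A minimal Go sketch (the type and names are illustrative, not from the repo) that refuses further fetches and notifies exactly once when a host hits its cap:

```go
package crawler

import (
	"log"
	"net/url"
	"sync"
)

// hostLimiter caps the number of pages fetched per host.
type hostLimiter struct {
	mu     sync.Mutex
	counts map[string]int
	limit  int
}

func newHostLimiter(limit int) *hostLimiter {
	return &hostLimiter{counts: make(map[string]int), limit: limit}
}

// allow reports whether another page may be fetched from rawURL's host,
// logging a notification the first time the limit is reached.
func (h *hostLimiter) allow(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.counts[u.Host] >= h.limit {
		if h.counts[u.Host] == h.limit {
			log.Printf("host %s reached page limit %d", u.Host, h.limit)
			h.counts[u.Host]++ // bump past the limit so we notify only once
		}
		return false
	}
	h.counts[u.Host]++
	return true
}
```

Every worker would call `allow` before fetching, so the count check and the one-time notification stay race-free behind the mutex.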
Heuristics:

- Do _not_ parse links from pages that have '/git/', '/cgi/', or '/cgi-bin/' in their URLs (sketched below).
- Have a list of "whitelisted" hosts/URLs that we visit at regular intervals.
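The link-parsing heuristic reduces to a substring filter on the page URL; a minimal sketch (function name assumed):

```go
package crawler

import "strings"

// Path fragments that typically lead into endless, auto-generated page
// spaces (repository browsers, CGI endpoints).
var skipFragments = []string{"/git/", "/cgi/", "/cgi-bin/"}

// shouldParseLinks reports whether we extract links from a fetched page.
func shouldParseLinks(pageURL string) bool {
	for _, frag := range skipFragments {
		if strings.Contains(pageURL, frag) {
			return false
		}
	}
	return true
}
```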
@@ -22,12 +22,8 @@ This crawler uses `InsecureSkipVerify: true` in TLS configuration to accept all
 ## How to run
 
-Spin up a PostgreSQL, check `misc/sql/initdb.sql` to create the tables and start the crawler.
-
-All configuration is done via command-line flags.
+```shell
+make build
+./dist/crawler --help
+```
+
+Check `misc/sql/initdb.sql` to create the PostgreSQL tables.
 
 ## Configuration
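The hunk header above notes that this crawler uses `InsecureSkipVerify: true` in its TLS configuration. A minimal sketch of an HTTP client configured that way (not the repo's actual code): it accepts any certificate, which suits crawling hosts with self-signed or expired certs but forgoes TLS authentication.

```go
package crawler

import (
	"crypto/tls"
	"net/http"
)

// newClient returns an HTTP client that skips certificate verification,
// deliberately trading security for crawl coverage.
func newClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
}
```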
TODO.md (new file, +38 lines)
@@ -0,0 +1,38 @@
# TODO

## Outstanding Issues

### 1. Ctrl+C Signal Handling Issue

**Problem**: The crawler sometimes doesn't exit properly when Ctrl+C is pressed.

**Root Cause**: The main thread gets stuck in blocking operations before it can check for signals:

- Database operations in the polling loop (`cmd/crawler/crawler.go:239-250`)
- Job queueing when the channel is full (`jobs <- url` can block if workers are slow)
- Long-running database transactions

**Location**: `cmd/crawler/crawler.go` - main polling loop starting at line 233

**Solution**: Add signal/context checking to blocking operations (sketched below):

- Use a cancellable context instead of `context.Background()` for database operations
- Make job queueing non-blocking or context-aware
- Add timeouts/cancellation to database operations
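A minimal sketch of that solution, with illustrative names (`jobs`, `pollDatabase`) rather than the crawler's actual ones: `signal.NotifyContext` supplies the cancellable context, and `select` keeps both job queueing and the poll sleep from blocking shutdown.

```go
package main

import (
	"context"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// ctx is cancelled automatically on SIGINT (Ctrl+C) or SIGTERM.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	jobs := make(chan string, 100)

	for {
		// Pass ctx into database calls so they abort on shutdown,
		// e.g. db.QueryContext(ctx, ...) instead of db.Query(...).
		urls := pollDatabase(ctx)

		for _, u := range urls {
			select {
			case jobs <- u: // normal path
			case <-ctx.Done(): // shutdown requested while the queue is full
				return
			}
		}

		// Sleep between polls, but wake immediately on shutdown.
		select {
		case <-time.After(5 * time.Second):
		case <-ctx.Done():
			return
		}
	}
}

// pollDatabase stands in for the crawler's real polling query.
func pollDatabase(ctx context.Context) []string {
	return nil
}
```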
### 2. fetchSnapshotsFromHistory() Doesn't Work with --skip-identical-content=true

**Problem**: When `--skip-identical-content=true` (the default), URLs with unchanged content get continuously re-queued.

**Root Cause**: The function tracks when content last changed, not when URLs were last crawled:

- Identical content → no new snapshot created
- Query finds the old snapshot timestamp → re-queues the URL
- Creates an infinite loop of re-crawling unchanged content

**Location**: `cmd/crawler/crawler.go:388-470` - `fetchSnapshotsFromHistory()` function

**Solution Options** (option 1 is sketched after this list):

1. Add a `last_crawled` timestamp to the URLs table
2. Create a separate `crawl_attempts` table
3. Always create snapshot entries (even for duplicates) but mark them as such
4. Modify the logic to work with existing schema constraints
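A sketch of option 1 under assumed names (a `last_crawled` column on a `urls` table, PostgreSQL-style placeholders, a registered `database/sql` driver): record every crawl attempt even when identical content means no new snapshot row, and re-queue based on `last_crawled` rather than the snapshot timestamp.

```go
package crawler

import (
	"context"
	"database/sql"
	"time"
)

// markCrawled records a crawl attempt even when the fetched content was
// identical and no new snapshot row was created.
func markCrawled(ctx context.Context, db *sql.DB, url string) error {
	_, err := db.ExecContext(ctx,
		`UPDATE urls SET last_crawled = now() WHERE url = $1`, url)
	return err
}

// staleURLs returns URLs whose last crawl is older than maxAge, replacing
// the snapshot-timestamp query that re-queues unchanged content forever.
func staleURLs(ctx context.Context, db *sql.DB, maxAge time.Duration) ([]string, error) {
	rows, err := db.QueryContext(ctx,
		`SELECT url FROM urls WHERE last_crawled IS NULL OR last_crawled < $1`,
		time.Now().Add(-maxAge))
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var urls []string
	for rows.Next() {
		var u string
		if err := rows.Scan(&u); err != nil {
			return nil, err
		}
		urls = append(urls, u)
	}
	return urls, rows.Err()
}
```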
**Current Status**: The function assumes `SkipIdenticalContent=false`, per the original comment at line 391.