# Compare commits

48 commits: `0db2557cfc...main`
| SHA1 |
|---|
| f362a1d2da |
| 7b3ad38f03 |
| 8e30a6a365 |
| 26311a6d2b |
| 57eb2555c5 |
| 453cf2294a |
| ee2076f337 |
| acbac15c20 |
| ddbe6b461b |
| 55bb0d96d0 |
| 349968d019 |
| 2357135d5a |
| 98d3ed6707 |
| 8b498a2603 |
| 8588414b14 |
| 5e6dabf1e7 |
| a8173544e7 |
| 3d07b56e8c |
| c54c093a10 |
| 57f5c0e865 |
| dc6eb610a2 |
| 39e9ead982 |
| 5f4da4f806 |
| 4ef3f70f1f |
| b8ea6fab4a |
| 5fe1490f1e |
| a41490f834 |
| 701a5df44f |
| 5b84960c5a |
| be38104f05 |
| d70d6c35a3 |
| 8399225046 |
| e8e26ec76a |
| f6ac5003b0 |
| e626aabecb |
| ebf59c50b8 |
| 2a041fec7c |
| ca008b0796 |
| 8350e106d6 |
| 9c7502b2a8 |
| dda21e833c |
| b0e7052c10 |
| 43b207c9ab |
| 285f2955e7 |
| 998b0e74ec |
| 766ee26f68 |
| 5357ceb04d |
| 03e1849191 |
## NOTES.md (13 changed lines)
````diff
@@ -1,13 +0,0 @@
-# Notes
-
-Avoiding endless loops while crawling
-
-- Make sure we follow robots.txt
-- Announce our own agent so people can block us in their robots.txt
-- Put a limit on number of pages per host, and notify on limit reach.
-- Put a limit on the number of redirects (not needed?)
-
-Heuristics:
-
-- Do _not_ parse links from pages that have '/git/' or '/cgi/' or '/cgi-bin/' in their URLs.
-- Have a list of "whitelisted" hosts/urls that we visit in regular intervals.
@@ -22,8 +22,12 @@ This crawler uses `InsecureSkipVerify: true` in TLS configuration to accept all
 
 ## How to run
 
-Spin up a PostgreSQL, check `misc/sql/initdb.sql` to create the tables and start the crawler.
-All configuration is done via command-line flags.
+```shell
+make build
+./dist/crawler --help
+```
+
+Check `misc/sql/initdb.sql` to create the PostgreSQL tables.
 
 ## Configuration
 
````
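The removed notes list loop-avoidance heuristics, including a per-host page limit with a notification when the limit is reached. A minimal stdlib-only Go sketch of such a limiter; the `hostLimiter` type and all names are illustrative, not taken from the crawler's code:

```go
package main

import (
	"fmt"
	"net/url"
	"sync"
)

// hostLimiter caps the number of pages fetched per host, one of the
// loop-avoidance heuristics in the removed NOTES.md lines above.
type hostLimiter struct {
	mu    sync.Mutex
	seen  map[string]int
	limit int
}

func newHostLimiter(limit int) *hostLimiter {
	return &hostLimiter{seen: make(map[string]int), limit: limit}
}

// allow reports whether rawURL's host is still under the page limit,
// printing a one-time notice when the limit is first exceeded.
func (h *hostLimiter) allow(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	h.mu.Lock()
	defer h.mu.Unlock()
	h.seen[u.Host]++
	if h.seen[u.Host] == h.limit+1 {
		fmt.Printf("host %s reached page limit of %d, skipping further pages\n", u.Host, h.limit)
	}
	return h.seen[u.Host] <= h.limit
}

func main() {
	l := newHostLimiter(2)
	for _, page := range []string{
		"https://example.com/a",
		"https://example.com/b",
		"https://example.com/c", // third page on the same host: notice fires
	} {
		fmt.Println(page, "allowed:", l.allow(page))
	}
}
```

In a real crawler this check would sit next to the worker pool, and the notification would feed whatever alerting the crawler already uses.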
## TODO.md (38 changed lines)
```diff
@@ -1,38 +0,0 @@
-# TODO
-
-## Outstanding Issues
-
-### 1. Ctrl+C Signal Handling Issue
-
-**Problem**: The crawler sometimes doesn't exit properly when Ctrl+C is pressed.
-
-**Root Cause**: The main thread gets stuck in blocking operations before it can check for signals:
-- Database operations in the polling loop (`cmd/crawler/crawler.go:239-250`)
-- Job queueing when channel is full (`jobs <- url` can block if workers are slow)
-- Long-running database transactions
-
-**Location**: `cmd/crawler/crawler.go` - main polling loop starting at line 233
-
-**Solution**: Add signal/context checking to blocking operations:
-- Use cancellable context instead of `context.Background()` for database operations
-- Make job queueing non-blocking or context-aware
-- Add timeout/cancellation to database operations
-
-### 2. fetchSnapshotsFromHistory() Doesn't Work with --skip-identical-content=true
-
-**Problem**: When `--skip-identical-content=true` (default), URLs with unchanged content get continuously re-queued.
-
-**Root Cause**: The function tracks when content last changed, not when URLs were last crawled:
-- Identical content → no new snapshot created
-- Query finds old snapshot timestamp → re-queues URL
-- Creates infinite loop of re-crawling unchanged content
-
-**Location**: `cmd/crawler/crawler.go:388-470` - `fetchSnapshotsFromHistory()` function
-
-**Solution Options**:
-1. Add `last_crawled` timestamp to URLs table
-2. Create separate `crawl_attempts` table
-3. Always create snapshot entries (even for duplicates) but mark them as such
-4. Modify logic to work with existing schema constraints
-
-**Current Status**: Function assumes `SkipIdenticalContent=false` per original comment at line 391.
```
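The first removed TODO item proposes replacing `context.Background()` with a cancellable context and making `jobs <- url` context-aware. A minimal sketch of that shape, assuming a polling loop like the one described at `cmd/crawler/crawler.go:233`; the tick interval, channel size, and all names here are illustrative:

```go
package main

import (
	"context"
	"fmt"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Cancellable context tied to SIGINT/SIGTERM, replacing
	// context.Background() so blocking operations can observe Ctrl+C.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	jobs := make(chan string, 8)

	// A stand-in worker; the crawler's real workers would receive from jobs.
	go func() {
		for url := range jobs {
			_ = url                     // fetch and parse would happen here
			time.Sleep(3 * time.Second) // deliberately slower than the poller
		}
	}()

	// Context-aware queueing: jobs <- url no longer blocks forever when
	// workers are slow, because Ctrl+C unblocks it via ctx.Done().
	queue := func(url string) bool {
		select {
		case jobs <- url:
			return true
		case <-ctx.Done():
			return false
		}
	}

	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			fmt.Println("signal received, shutting down")
			return
		case <-ticker.C:
			// Database polling would go here, called with ctx so that
			// queries are also cancelled on shutdown.
			if !queue("https://example.com/") {
				return
			}
		}
	}
}
```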
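For the second item, solution option 1 (a `last_crawled` timestamp on the URLs table) could look roughly like this. The `urls` table layout, the `last_crawled` column, and the `lib/pq` driver are assumptions, since the real schema lives in `misc/sql/initdb.sql`:

```go
package main

import (
	"context"
	"database/sql"
	"time"

	_ "github.com/lib/pq" // assumed PostgreSQL driver; the crawler may use another
)

// markCrawled records a crawl attempt even when the fetched content is
// identical, so re-queueing can key off when a URL was last *crawled*
// rather than when its content last changed.
func markCrawled(ctx context.Context, db *sql.DB, url string) error {
	_, err := db.ExecContext(ctx,
		`UPDATE urls SET last_crawled = NOW() WHERE url = $1`, url)
	return err
}

// dueForRecrawl selects URLs by last_crawled instead of the latest snapshot
// timestamp, so identical content no longer causes endless re-queueing.
func dueForRecrawl(ctx context.Context, db *sql.DB, age time.Duration) ([]string, error) {
	cutoff := time.Now().Add(-age)
	rows, err := db.QueryContext(ctx,
		`SELECT url FROM urls WHERE last_crawled IS NULL OR last_crawled < $1`, cutoff)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var urls []string
	for rows.Next() {
		var u string
		if err := rows.Scan(&u); err != nil {
			return nil, err
		}
		urls = append(urls, u)
	}
	return urls, rows.Err()
}

func main() {} // sketch only; wiring into fetchSnapshotsFromHistory() is omitted
```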