Lots of features, first version that reliably crawls Geminispace.
- [x] Concurrent downloading with workers
- [x] Concurrent connection limit per host
- [x] URL Blacklist
- [x] Save image/* and text/* files
- [x] Configuration via environment variables
- [x] Storing snapshots in PostgreSQL
- [x] Proper response header & body UTF-8 and format validation
. . .
9
db/show-dups.sql
Normal file
@@ -0,0 +1,9 @@
WITH DuplicateSnapshots AS (
    SELECT id,
           url,
           ROW_NUMBER() OVER (PARTITION BY url ORDER BY id) AS row_num
    FROM snapshots
)
SELECT *
FROM DuplicateSnapshots
WHERE row_num > 1;
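To illustrate what the query in db/show-dups.sql returns, here is a minimal sketch that runs the same statement against a small in-memory table. It rests on several assumptions: SQLite (3.25 or newer, for window functions) stands in for PostgreSQL, the `snapshots` table is reduced to the two columns the query reads (`id` and `url`), and the sample URLs and the helper names `make_demo_db` and `find_duplicates` are hypothetical. The real schema in this commit may well differ.

```python
import sqlite3

# Same statement as db/show-dups.sql: number the rows within each url by id,
# then keep everything after the first row of each group.
SHOW_DUPS_SQL = """
WITH DuplicateSnapshots AS (
    SELECT id,
           url,
           ROW_NUMBER() OVER (PARTITION BY url ORDER BY id) AS row_num
    FROM snapshots
)
SELECT *
FROM DuplicateSnapshots
WHERE row_num > 1;
"""


def make_demo_db():
    """Build an in-memory stand-in for the snapshots table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE snapshots (id INTEGER PRIMARY KEY, url TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO snapshots (id, url) VALUES (?, ?)",
        [
            (1, "gemini://example.org/"),
            (2, "gemini://example.org/about"),
            (3, "gemini://example.org/"),   # second copy of id 1's url
            (4, "gemini://example.org/"),   # third copy of id 1's url
            (5, "gemini://example.net/"),
        ],
    )
    return conn


def find_duplicates(conn):
    """Return (id, url, row_num) for every snapshot that repeats an earlier url."""
    # The query has no final ORDER BY, so sort here for a stable result.
    return sorted(conn.execute(SHOW_DUPS_SQL).fetchall())


if __name__ == "__main__":
    for row in find_duplicates(make_demo_db()):
        print(row)
```

The snapshot with the lowest `id` for each URL gets `row_num = 1` and is left out of the result, so only the later copies are listed. That makes the output a natural candidate list for cleanup, although this commit only shows the duplicates and does not delete them.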