Enhance crawler with seed list and SQL utilities
Add seedList module for URL initialization, comprehensive SQL utilities for database analysis, and update project configuration.
28
misc/sql/README.md
Normal file
@@ -0,0 +1,28 @@
# SQL Queries for Snapshot Analysis

This directory contains SQL queries to analyze snapshot data in the gemini-grc database.

## Usage

You can run these queries directly from psql using the `\i` meta-command. The path below assumes psql was started from the repository root:

```
\i misc/sql/snapshots_per_url.sql
```

## Available Queries

- **snapshots_per_url.sql** - Basic count of snapshots per URL
- **snapshots_date_range.sql** - Shows the snapshot count and date range for each URL
- **host_snapshot_stats.sql** - Groups snapshots by host and shows URLs with multiple snapshots
- **content_changes.sql** - Finds the URLs with the most content changes between consecutive snapshots
- **snapshot_distribution.sql** - Shows the distribution of snapshots per URL (how many URLs have 1, 2, 3, etc. snapshots)
- **recent_snapshot_activity.sql** - Shows the URLs with the most snapshots in the last 7 days
- **storage_efficiency.sql** - Shows potential storage savings from deduplication
- **snapshots_by_timeframe.sql** - Shows snapshot counts by timeframe (day, week, month)
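
As a rough illustration of what the counting queries in this list compute, the sketch below runs a per-URL count and a distribution query from Python against an in-memory SQLite database, so it can be tried without a PostgreSQL instance. The `snapshots` table and its `url` and `captured_at` columns are assumptions made for this example, not the actual gemini-grc schema, and the SQL is not copied from the files in this directory.

```python
import sqlite3

# Hypothetical, simplified schema: the real gemini-grc tables may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snapshots (url TEXT, captured_at TEXT)")
conn.executemany(
    "INSERT INTO snapshots (url, captured_at) VALUES (?, ?)",
    [
        ("gemini://a.example/", "2025-01-01"),
        ("gemini://a.example/", "2025-01-02"),
        ("gemini://a.example/", "2025-01-03"),
        ("gemini://b.example/", "2025-01-01"),
        ("gemini://c.example/", "2025-01-02"),
    ],
)

# Count of snapshots per URL, most-snapshotted first.
per_url = conn.execute(
    """
    SELECT url, COUNT(*) AS snapshot_count
    FROM snapshots
    GROUP BY url
    ORDER BY snapshot_count DESC, url
    """
).fetchall()

# Distribution: how many URLs have 1, 2, 3, ... snapshots.
distribution = conn.execute(
    """
    SELECT snapshot_count, COUNT(*) AS url_count
    FROM (
        SELECT url, COUNT(*) AS snapshot_count
        FROM snapshots
        GROUP BY url
    ) AS per_url
    GROUP BY snapshot_count
    ORDER BY snapshot_count
    """
).fetchall()

print(per_url)
print(distribution)
```

The distribution query simply groups the per-URL counts a second time. Both statements use only standard `GROUP BY` and `COUNT`, so the same shape should carry over to PostgreSQL once the table and column names are adjusted to the real schema.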

## Notes

- These queries are designed to work with PostgreSQL and the gemini-grc database schema
- Some queries may be resource-intensive on large databases
- The results can help optimize storage and understand the effectiveness of the versioned snapshot feature
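
To make the storage point concrete, here is a minimal sketch of one way a deduplication saving could be estimated: compare the total size of all stored snapshot bodies with the size left when identical content for the same URL is kept only once. It again uses an in-memory SQLite database from Python. The `snapshots` table, its `content` column, and the per-URL deduplication rule are all assumptions for illustration; `storage_efficiency.sql` may measure this differently.

```python
import sqlite3

# Hypothetical schema and data: not the actual gemini-grc layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snapshots (url TEXT, content TEXT)")
conn.executemany(
    "INSERT INTO snapshots (url, content) VALUES (?, ?)",
    [
        ("gemini://a.example/", "hello"),
        ("gemini://a.example/", "hello"),        # unchanged re-crawl
        ("gemini://a.example/", "hello world"),  # content changed
        ("gemini://b.example/", "hello"),
    ],
)

# LENGTH() counts characters for text values; the sample data is ASCII,
# so characters and bytes coincide here.
total_size, deduplicated_size = conn.execute(
    """
    SELECT
        (SELECT SUM(LENGTH(content)) FROM snapshots),
        (SELECT SUM(LENGTH(content))
         FROM (SELECT DISTINCT url, content FROM snapshots) AS d)
    """
).fetchone()

# Size that would be saved by storing each (url, content) pair once.
potential_savings = total_size - deduplicated_size

print(total_size, deduplicated_size, potential_savings)
```

On a real database a comparison like this scans every stored body, which is one reason such queries can be expensive on large tables.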