WebSum

Production-minded web summarization pipeline: submit URLs via CLI/API, extract & clean page text, generate summaries, and store results with metadata. Includes SQLite-backed queue, worker processing, caching, structured retries, JSON logs/metrics hooks, and testable components—built to "run it, operate it."

Features

  • Fast summarization via heuristic-based key insight extraction

  • SQLite-backed queue for reliable, persistent job processing

  • Caching layer to avoid re-fetching and re-processing identical URLs

  • Structured retries with exponential backoff for transient failures

  • Multi-worker support for horizontal scaling

  • JSON logging for metrics, debugging, and observability

  • Stdlib-only dependencies (no heavy ML frameworks)
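The retry feature above can be illustrated with a small delay calculator. This is a minimal sketch of exponential backoff in general, not WebSum's actual retry code: the function names, the 1-second base, the 60-second cap, and the optional jitter are all assumptions made for illustration.

```python
import random


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 60.0,
                  jitter: float = 0.0) -> float:
    """Seconds to wait before retrying after failed attempt number `attempt` (1-based).

    Hypothetical helper: base_s, cap_s and jitter are illustrative defaults.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Double the delay on every failed attempt, but never exceed the cap.
    delay = min(cap_s, base_s * 2 ** (attempt - 1))
    # Optional jitter spreads retries out so workers do not retry in lockstep.
    return delay + random.uniform(0, jitter * delay)


def retry_schedule(max_attempts: int, **kwargs) -> list:
    """Delays between consecutive attempts for a job allowed `max_attempts` tries."""
    return [backoff_delay(n, **kwargs) for n in range(1, max_attempts)]


if __name__ == "__main__":
    print(retry_schedule(5))
```

With five attempts there are four waits between them, each twice as long as the previous one until the cap is reached.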

Requirements

  • Python 3.7+ with sqlite3 support

  • Network access to fetch URLs

  • Writable filesystem for SQLite database storage

Setup & Installation

1. Clone or download the repository

git clone https://github.com/satsonmusic/WebSum.git

cd WebSum

2. (Optional) Create a virtual environment

# On Windows (PowerShell)

python -m venv venv

.\venv\Scripts\Activate.ps1

# On macOS/Linux

python3 -m venv venv

source venv/bin/activate

3. Install dependencies

No external dependencies are required; WebSum uses only the Python standard library.

Quick Start

One-shot URL Summarization

The simplest way to use WebSum:

# Windows (PowerShell)

python websum.py summarize "https://www.cnn.com"

# macOS/Linux

python3 websum.py summarize "https://www.cnn.com"

Output: A numbered list of key insights extracted from the page.

Example:

1. Breaking news on international developments.

2. Market analysis shows economic trends.

3. Technology sector updates reported today.
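A numbered list like the one above could come from a sentence-scoring heuristic. The sketch below is an assumption about how such scoring might look, not WebSum's actual summarizer: it treats each sentence as a small document, weights words by how rare they are across sentences (a TF-IDF-style score), and keeps the top sentences in their original order. The stopword list and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not WebSum's list).
STOPWORDS = frozenset(
    "a an and are as at be by for from in is it of on or that the this to was with".split()
)


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercase words with stopwords removed."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]


def top_insights(text, items=3):
    """Return up to `items` highest-scoring sentences, in document order."""
    sentences = split_sentences(text)
    tokenized = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Sentence frequency: the number of sentences each word appears in.
    df = Counter(w for toks in tokenized for w in set(toks))
    scored = []
    for idx, toks in enumerate(tokenized):
        if not toks:
            scored.append((0.0, idx))
            continue
        tf = Counter(toks)
        # Term frequency times a smoothed inverse sentence frequency,
        # averaged over the sentence length.
        score = sum(c * math.log(1 + n / df[w]) for w, c in tf.items()) / len(toks)
        scored.append((score, idx))
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:items]
    return [sentences[i] for i in sorted(idx for _, idx in best)]


def format_numbered(insights):
    """Render insights as '1. ...' lines, matching the CLI output style."""
    return "\n".join(f"{i}. {s}" for i, s in enumerate(insights, 1))
```

Calling `format_numbered(top_insights(page_text, items=5))` would mirror what `--items 5` asks for, under these assumptions.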

Customize Number of Insights

python websum.py summarize "https://www.example.com" --items 5

Ignore Cache and Re-fetch

python websum.py summarize "https://www.example.com" --no-cache

Set Custom Timeout

python websum.py summarize "https://www.example.com" --timeout-s 30

Advanced Usage

Enqueue a URL for Background Processing

Add a URL to the processing queue:

python websum.py enqueue "https://www.example.com" --priority 1 --max-attempts 5

Returns a job_id for tracking.

Run a Worker

Start a long-running worker that processes queued jobs:

python websum.py worker --worker-id "worker-1" --poll-s 1.0 --timeout-s 20

The worker will continuously:

  1. Claim the next queued job

  2. Fetch and summarize the URL

  3. Cache the results

  4. Mark the job as done or retry on failure
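The four steps above can be sketched as a single loop iteration against SQLite. This is an illustrative sketch, not the code in websum.py: the table layout, the column names, the `claim_next` and `process_one` helpers, and the ordering by job id (priority is omitted) are all assumptions. The claim runs inside a `BEGIN IMMEDIATE` transaction so that two workers cannot take the same job.

```python
import sqlite3
import time

# Hypothetical schema for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'queued',
    attempts INTEGER NOT NULL DEFAULT 0,
    max_attempts INTEGER NOT NULL DEFAULT 3,
    run_after REAL NOT NULL DEFAULT 0,
    locked_by TEXT
);
CREATE TABLE IF NOT EXISTS cache (url TEXT PRIMARY KEY, summary TEXT NOT NULL);
"""


def open_db(path=":memory:"):
    # isolation_level=None gives manual transaction control (autocommit mode).
    conn = sqlite3.connect(path, isolation_level=None)
    conn.executescript(SCHEMA)
    return conn


def claim_next(conn, worker_id, now):
    """Step 1: atomically claim the next due job, or return None."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id, url FROM jobs WHERE status = 'queued' AND run_after <= ? "
            "ORDER BY id LIMIT 1", (now,)).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE jobs SET status = 'running', locked_by = ?, "
                "attempts = attempts + 1 WHERE id = ?", (worker_id, row[0]))
        conn.execute("COMMIT")
        return row
    except Exception:
        conn.execute("ROLLBACK")
        raise


def process_one(conn, worker_id, summarize, now=None):
    """Run one iteration; returns 'done', 'retry', 'failed', or None if idle."""
    now = time.time() if now is None else now
    job = claim_next(conn, worker_id, now)
    if job is None:
        return None
    job_id, url = job
    try:
        summary = summarize(url)  # step 2: fetch and summarize
    except Exception:
        attempts, max_attempts = conn.execute(
            "SELECT attempts, max_attempts FROM jobs WHERE id = ?", (job_id,)).fetchone()
        if attempts >= max_attempts:
            conn.execute("UPDATE jobs SET status = 'failed', locked_by = NULL "
                         "WHERE id = ?", (job_id,))
            return "failed"
        # Step 4 (failure path): requeue with an exponentially growing delay.
        conn.execute("UPDATE jobs SET status = 'queued', locked_by = NULL, "
                     "run_after = ? WHERE id = ?", (now + 2 ** (attempts - 1), job_id))
        return "retry"
    # Step 3: cache the result. Step 4 (success path): mark the job done.
    conn.execute("INSERT OR REPLACE INTO cache (url, summary) VALUES (?, ?)", (url, summary))
    conn.execute("UPDATE jobs SET status = 'done', locked_by = NULL WHERE id = ?", (job_id,))
    return "done"
```

A long-running worker would call `process_one` repeatedly and sleep for the poll interval whenever it returns `None`.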

Retrieve a Cached Summary

python websum.py get "https://www.example.com"

Returns JSON with metadata:

{
  "url": "https://www.example.com",
  "content_hash": "abc123...",
  "fetched_at": "2026-03-24T10:30:00+00:00",
  "extractor_version": "extract_v1",
  "model_version": "stub_v1",
  "summary": "1. First insight...\n2. Second insight..."
}
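For scripts that consume this output, the record can be parsed with the standard library. The sketch below assumes the field names shown above and that the summary is a newline-separated numbered list; `parse_record` is a hypothetical helper, not part of WebSum.

```python
import json
import re
from datetime import datetime


def parse_record(raw):
    """Parse the JSON printed by `get` into Python values.

    Assumes the fields shown in the example record.
    """
    record = json.loads(raw)
    # fetched_at is an ISO 8601 timestamp with a UTC offset.
    fetched_at = datetime.fromisoformat(record["fetched_at"])
    # Strip the leading "1. ", "2. ", ... from each summary line.
    insights = [
        re.sub(r"^\d+\.\s*", "", line)
        for line in record["summary"].splitlines()
        if line.strip()
    ]
    return {"url": record["url"], "fetched_at": fetched_at, "insights": insights}
```

One possible use is to capture the command's stdout with `subprocess.run(..., capture_output=True, text=True)` and pass it to `parse_record`.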

Database

By default, WebSum stores the SQLite database at:

  • Windows: C:\Users\<username>\WebSum\pipeline.sqlite

  • macOS/Linux: Modify labroot() in websum.py as needed

The database includes two tables:

  • jobs: Queue status, retry tracking, locks

  • cache: Cached summaries with content hashes and metadata
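To make the cache table more concrete, here is a rough sketch of how such a table could be created and used. The column names are taken from the JSON record that `get` returns; the column types, the choice of SHA-256 for the content hash, and the helper names are assumptions rather than WebSum's actual schema.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone

COLUMNS = ("url", "content_hash", "fetched_at",
           "extractor_version", "model_version", "summary")


def open_cache(path=":memory:"):
    """Open (or create) a database with a hypothetical cache table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache ("
        "url TEXT PRIMARY KEY, content_hash TEXT NOT NULL, fetched_at TEXT NOT NULL, "
        "extractor_version TEXT NOT NULL, model_version TEXT NOT NULL, "
        "summary TEXT NOT NULL)"
    )
    return conn


def put_summary(conn, url, text, summary,
                extractor_version="extract_v1", model_version="stub_v1"):
    """Store a summary keyed by URL; returns the hash of the extracted text."""
    # Hashing the extracted text lets a later fetch detect unchanged content.
    content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    fetched_at = datetime.now(timezone.utc).isoformat()
    with conn:  # commit on success, roll back on error
        conn.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?, ?, ?)",
            (url, content_hash, fetched_at, extractor_version, model_version, summary),
        )
    return content_hash


def get_summary(conn, url):
    """Return the cached record as a dict, or None on a cache miss."""
    row = conn.execute(
        "SELECT " + ", ".join(COLUMNS) + " FROM cache WHERE url = ?", (url,)
    ).fetchone()
    return dict(zip(COLUMNS, row)) if row else None
```

Storing the extractor and model versions alongside each summary makes it possible to tell which entries predate a change to either component.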

Deployment

Single Worker Mode

python websum.py worker --worker-id "primary" --poll-s 2.0

Multiple Workers

Start multiple workers pointing to the same database. SQLite's WAL journal mode does not work over network filesystems, so run the workers on the same machine as the database file:

python websum.py worker --worker-id "worker-1" &

python websum.py worker --worker-id "worker-2" &

Monitoring

All operations log structured JSON to stdout:

{"ts": "2026-03-24T10:30:45.123456+00:00", "event": "job_processed", "job_id": 42, "url": "https://...", "ms": 1234}

Capture and parse logs with your monitoring infrastructure.
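As one way to consume these logs, the sketch below aggregates JSON log lines by event name and reports an average duration. It assumes only the `event` and `ms` fields shown in the example line; `summarize_logs` is a hypothetical helper, and lines that are not valid JSON objects are skipped.

```python
import json
from collections import defaultdict


def summarize_logs(lines):
    """Count events and average their `ms` field, keyed by event name."""
    totals = defaultdict(lambda: {"count": 0, "total_ms": 0})
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore anything on stdout that is not a JSON log line
        if not isinstance(record, dict) or "event" not in record:
            continue
        entry = totals[record["event"]]
        entry["count"] += 1
        entry["total_ms"] += record.get("ms", 0)
    return {
        event: {"count": t["count"], "avg_ms": t["total_ms"] / t["count"]}
        for event, t in totals.items()
    }


if __name__ == "__main__":
    import sys
    # Reads log lines piped in on stdin and prints the aggregate as JSON.
    print(json.dumps(summarize_logs(sys.stdin), indent=2))
```

Saved as a script, this could sit at the end of a pipe that reads a worker's stdout.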

Batch Processing via Queue

# Enqueue multiple URLs

python websum.py enqueue "https://example1.com" --priority 5

python websum.py enqueue "https://example2.com" --priority 3

python websum.py enqueue "https://example3.com" --priority 1

# Start a worker to process them

python websum.py worker --worker-id "main"

Rebuild Cache

# Force re-fetch and re-summarize

python websum.py summarize "https://example.com" --no-cache

Architecture

  • Fetch Module: HTTP client with charset detection and configurable timeout

  • Extract Module: Lightweight HTML-to-text converter (stdlib only)

  • Summarize Module: Heuristic TF-IDF-style scoring for key sentences

  • Storage: SQLite with WAL journal mode for concurrent access

  • Queue: Atomic operations with job locking and exponential backoff retries

  • Worker: Long-running process that claims and processes jobs
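To illustrate the Extract Module entry above, here is a minimal stdlib-only HTML-to-text converter built on `html.parser`. It is a sketch of the general approach, not WebSum's extractor: the class name, the set of skipped tags, and the set of block-level tags are assumptions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, dropping script/style content (illustrative only)."""

    SKIP = {"script", "style", "noscript"}
    BLOCK = {"p", "div", "br", "li", "tr", "section", "article",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")  # block elements start a new line

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        raw = "".join(self._chunks)
        # Collapse runs of whitespace and drop empty lines.
        lines = (" ".join(line.split()) for line in raw.splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    """Convert an HTML string to plain text, one block element per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The resulting text is what a summarizer would then split into sentences and score.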

EXAMPLE