Dartmouth News Scraper

Distributed Python scraper archiving Dartmouth News with PDF outputs

Python · Web Scraping · PDF · OOP

I built a resilient Python scraper to archive Dartmouth News articles and export clean PDFs for offline reading and long‑term storage.

Backstory

Campus articles often disappear behind redesigns or fragile CMS links. This project ensures institutional stories remain accessible in a portable format with consistent metadata and layout.

System Overview

  • Crawler: discovers article URLs from listings, sitemaps, and pagination, with dedupe and normalized canonical links.
  • Fetcher: robust retry policy, polite rate limiting, and content‑type validation before parsing (a minimal sketch follows this list).
  • Parser: extracts title, author, publish date, body, and images; sanitizes HTML and inlines critical assets for stability.
  • PDF pipeline: converts parsed content into reader‑friendly PDFs with consistent typography and headings.
  • Storage: deterministic file paths based on slug/date; index written for quick lookup.
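
To make the Fetcher's behavior concrete, the sketch below shows retry with jittered backoff, a polite delay between requests, and a content‑type check before anything reaches the parser. It assumes the requests library; the function name, constants, and user‑agent string are illustrative rather than the project's exact code.

```python
import random
import time

import requests

MAX_RETRIES = 3      # assumed retry budget per URL
BASE_DELAY = 1.0     # seconds; doubles each attempt, plus jitter
POLITE_DELAY = 2.0   # pause after each successful fetch

def fetch_html(url: str) -> str | None:
    """Fetch one page, retrying transient failures and validating the content type."""
    for attempt in range(MAX_RETRIES):
        try:
            resp = requests.get(
                url,
                timeout=15,
                headers={"User-Agent": "dartmouth-news-archiver (personal project)"},
            )
            resp.raise_for_status()
            # Only hand HTML to the parser; skip PDFs, images, and redirect stubs.
            if "text/html" not in resp.headers.get("Content-Type", ""):
                return None
            time.sleep(POLITE_DELAY)  # polite rate limiting between requests
            return resp.text
        except requests.RequestException:
            # Exponential backoff with jitter before the next attempt.
            time.sleep(BASE_DELAY * (2 ** attempt) + random.uniform(0, 1))
    return None
```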

Design Details

  • Object‑oriented modules (Fetcher, Parser, Formatter, Exporter) with explicit interfaces make the pipeline easy to extend (the interfaces are sketched after this list).
  • Bounded memory approach for large pages and images; streams data to avoid buffering entire payloads.
  • Checkpointing so partial runs can resume; idempotent writes prevent duplicates.
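
As a rough illustration of those explicit interfaces, the sketch below defines the four stages as abstract base classes around a shared Article record. The stage names come from the project; the method signatures and Article fields are assumptions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Article:
    slug: str
    title: str
    author: str
    published: str   # ISO date, e.g. "2024-06-09"
    body_html: str

class Fetcher(ABC):
    @abstractmethod
    def fetch(self, url: str) -> str: ...

class Parser(ABC):
    @abstractmethod
    def parse(self, html: str, url: str) -> Article: ...

class Formatter(ABC):
    @abstractmethod
    def format(self, article: Article) -> str: ...   # print-ready HTML

class Exporter(ABC):
    @abstractmethod
    def export(self, article: Article, rendered: str) -> str: ...   # returns output path
```

Keeping each stage behind a small contract like this means a different PDF backend or an alternate parser can be dropped in without touching the crawler.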

Challenges → Solutions

  • Inconsistent article markup → tolerant parsing with multiple selectors and fallbacks (see the selector sketch after this list).
  • Transient network errors → backoff with jitter and graceful degradation of non‑critical assets.
  • Image hotlinking risks → local caching and reference rewriting inside the PDF output.
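
The selector-fallback idea can be sketched as follows, assuming BeautifulSoup; the selectors listed are illustrative guesses at common markup patterns, not the site's actual structure.

```python
from bs4 import BeautifulSoup

# Ordered from most specific to most generic; purely illustrative selectors.
TITLE_SELECTORS = [
    "h1.article-title",
    "article h1",
    "meta[property='og:title']",
    "title",
]

def extract_title(html: str) -> str | None:
    """Try each selector in turn and return the first non-empty match."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in TITLE_SELECTORS:
        node = soup.select_one(selector)
        if node is None:
            continue
        # <meta> tags carry the value in an attribute rather than in text content.
        text = node.get("content") if node.name == "meta" else node.get_text(strip=True)
        if text:
            return text
    return None
```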

Outputs

  • Clean, portable PDFs of each article.
  • A structured index for quick navigation and future search (an example entry is sketched below).
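
One way the structured index might look, as a sketch: a JSON-lines file with one record per archived article. The field names and paths here are assumptions, not the project's actual schema.

```python
import json
from pathlib import Path

def append_index_entry(index_path: Path, slug: str, title: str,
                       published: str, pdf_path: str) -> None:
    """Append one article record to a JSON-lines index."""
    entry = {"slug": slug, "title": title, "published": published, "pdf": pdf_path}
    with index_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

# Example record (illustrative values):
# append_index_entry(Path("archive/index.jsonl"), "example-slug",
#                    "Example Title", "2024-06-09", "archive/2024/06/example-slug.pdf")
```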

Project Log

No log entries yet. I’ll share stories, insights, and progress notes here.