Dartmouth News Scraper

Distributed Python scraper archiving Dartmouth News with PDF outputs

Python · Web Scraping · PDF · OOP

I built a resilient Python scraper to archive Dartmouth News articles and export clean PDFs for offline reading and long‑term storage.

Backstory

Campus articles often disappear behind redesigns or fragile CMS links. This project ensures institutional stories remain accessible in a portable format with consistent metadata and layout.

System Overview

  • Crawler: discovers article URLs from listings, sitemaps, and pagination, with dedupe and normalized canonical links.
  • Fetcher: robust retry policy, polite rate limiting, and content‑type validation before parsing (a minimal sketch follows this list).
  • Parser: extracts title, author, publish date, body, and images; sanitizes HTML and inlines critical assets for stability.
  • PDF pipeline: converts parsed content into reader‑friendly PDFs with consistent typography and headings.
  • Storage: deterministic file paths based on slug/date; index written for quick lookup.
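
To make the Fetcher's behavior concrete, the sketch below shows retry with jittered backoff, a polite delay between requests, and a content‑type check before anything reaches the parser. It assumes the requests library; the function name, constants, and user‑agent string are illustrative rather than the project's exact code.

```python
import random
import time

import requests

MAX_RETRIES = 3      # assumed retry budget per URL
BASE_DELAY = 1.0     # seconds; doubles each attempt, plus jitter
POLITE_DELAY = 2.0   # pause after each successful fetch

def fetch_html(url: str) -> str | None:
    """Fetch one page, retrying transient failures and validating the content type."""
    for attempt in range(MAX_RETRIES):
        try:
            resp = requests.get(
                url,
                timeout=15,
                headers={"User-Agent": "dartmouth-news-archiver (personal project)"},
            )
            resp.raise_for_status()
            # Only hand HTML to the parser; skip PDFs, images, and redirect stubs.
            if "text/html" not in resp.headers.get("Content-Type", ""):
                return None
            time.sleep(POLITE_DELAY)  # polite rate limiting between requests
            return resp.text
        except requests.RequestException:
            # Exponential backoff with jitter before the next attempt.
            time.sleep(BASE_DELAY * (2 ** attempt) + random.uniform(0, 1))
    return None
```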

Design Details

  • Object‑oriented modules (Fetcher, Parser, Formatter, Exporter) with explicit interfaces make the pipeline easy to extend (the interfaces are sketched after this list).
  • Bounded memory approach for large pages and images; streams data to avoid buffering entire payloads.
  • Checkpointing so partial runs can resume; idempotent writes prevent duplicates.
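
As a rough illustration of those explicit interfaces, the sketch below defines the four stages as abstract base classes around a shared Article record. The stage names come from the project; the method signatures and Article fields are assumptions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Article:
    slug: str
    title: str
    author: str
    published: str   # ISO date, e.g. "2024-06-09"
    body_html: str

class Fetcher(ABC):
    @abstractmethod
    def fetch(self, url: str) -> str: ...

class Parser(ABC):
    @abstractmethod
    def parse(self, html: str, url: str) -> Article: ...

class Formatter(ABC):
    @abstractmethod
    def format(self, article: Article) -> str: ...   # print-ready HTML

class Exporter(ABC):
    @abstractmethod
    def export(self, article: Article, rendered: str) -> str: ...   # returns output path
```

Keeping each stage behind a small contract like this means a different PDF backend or an alternate parser can be dropped in without touching the crawler.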

Challenges → Solutions

  • Inconsistent article markup → tolerant parsing with multiple selectors and fallbacks (see the selector sketch after this list).
  • Transient network errors → backoff with jitter and graceful degradation of non‑critical assets.
  • Image hotlinking risks → local caching and reference rewriting inside the PDF output.
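
The selector-fallback idea can be sketched as follows, assuming BeautifulSoup; the selectors listed are illustrative guesses at common markup patterns, not the site's actual structure.

```python
from bs4 import BeautifulSoup

# Ordered from most specific to most generic; purely illustrative selectors.
TITLE_SELECTORS = [
    "h1.article-title",
    "article h1",
    "meta[property='og:title']",
    "title",
]

def extract_title(html: str) -> str | None:
    """Try each selector in turn and return the first non-empty match."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in TITLE_SELECTORS:
        node = soup.select_one(selector)
        if node is None:
            continue
        # <meta> tags carry the value in an attribute rather than in text content.
        text = node.get("content") if node.name == "meta" else node.get_text(strip=True)
        if text:
            return text
    return None
```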

Outputs

  • Clean, portable PDFs of each article.
  • A structured index for quick navigation and future search (an example entry is sketched below).
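
One way the structured index might look, as a sketch: a JSON-lines file with one record per archived article. The field names and paths here are assumptions, not the project's actual schema.

```python
import json
from pathlib import Path

def append_index_entry(index_path: Path, slug: str, title: str,
                       published: str, pdf_path: str) -> None:
    """Append one article record to a JSON-lines index."""
    entry = {"slug": slug, "title": title, "published": published, "pdf": pdf_path}
    with index_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

# Example record (illustrative values):
# append_index_entry(Path("archive/index.jsonl"), "example-slug",
#                    "Example Title", "2024-06-09", "archive/2024/06/example-slug.pdf")
```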

Project Log

No log entries yet. I’ll share stories, insights, and progress notes here.