Project

Tiny Search Engine

Crawler, indexer, and query engine for 15K+ pages with efficient memory & I/O management

C · Data Structures · Memory · Algorithms

I implemented a miniature search engine in C with three components (Crawler, Indexer, and Querier), focusing on correctness, memory safety, and performance.

Backstory

This project was a deep dive into information retrieval (IR) fundamentals: fetching pages, building an index, and answering free‑text queries efficiently without the crutch of high‑level libraries.

Architecture

  • Crawler: respectful fetcher with URL normalization, deduplication, and domain scoping.
  • Indexer: inverted index over tokens with document frequencies; compact in‑memory representation plus on‑disk persistence.
  • Querier: tokenizes and normalizes queries, ranks results using index statistics, and prints annotated matches.
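
To make the Indexer's data structure concrete, here is a minimal sketch of an inverted index that maps each normalized token to a list of (document, count) postings and tracks document frequency. It is an illustration under assumptions, not the project's actual code: the names (index_t, index_add, TSE_BUCKETS), the chained hash table, and the djb2 hash are all hypothetical choices.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define TSE_BUCKETS 1024          /* hypothetical table size */

/* One posting: how often a token occurs in one document. */
typedef struct posting {
    int doc_id;
    int count;
    struct posting *next;
} posting_t;

/* One table entry: a token, its document frequency, and its postings. */
typedef struct entry {
    char *token;
    int doc_freq;                 /* number of distinct documents */
    posting_t *postings;
    struct entry *next;           /* next entry in the same bucket */
} entry_t;

typedef struct {
    entry_t *buckets[TSE_BUCKETS];
} index_t;

/* Lowercase a token in place so "Search" and "search" share an entry. */
void normalize_token(char *s) {
    for (; *s != '\0'; s++) *s = (char)tolower((unsigned char)*s);
}

/* djb2 string hash, reduced to a bucket number. */
static unsigned long hash_token(const char *s) {
    unsigned long h = 5381;
    for (; *s != '\0'; s++) h = h * 33 + (unsigned char)*s;
    return h % TSE_BUCKETS;
}

index_t *index_new(void) { return calloc(1, sizeof(index_t)); }

static entry_t *index_find(const index_t *idx, const char *token) {
    for (entry_t *e = idx->buckets[hash_token(token)]; e != NULL; e = e->next)
        if (strcmp(e->token, token) == 0) return e;
    return NULL;
}

/* Record one occurrence of token in doc_id; 0 on success, -1 if out of memory. */
int index_add(index_t *idx, const char *token, int doc_id) {
    entry_t *e = index_find(idx, token);
    if (e == NULL) {
        e = calloc(1, sizeof *e);
        if (e == NULL) return -1;
        e->token = malloc(strlen(token) + 1);
        if (e->token == NULL) { free(e); return -1; }
        strcpy(e->token, token);
        unsigned long b = hash_token(token);
        e->next = idx->buckets[b];
        idx->buckets[b] = e;
    }
    for (posting_t *p = e->postings; p != NULL; p = p->next)
        if (p->doc_id == doc_id) { p->count++; return 0; }
    posting_t *p = malloc(sizeof *p);
    if (p == NULL) return -1;
    p->doc_id = doc_id;
    p->count = 1;
    p->next = e->postings;
    e->postings = p;
    e->doc_freq++;
    return 0;
}

/* Occurrences of token in doc_id, or 0 if there are none. */
int index_count(const index_t *idx, const char *token, int doc_id) {
    entry_t *e = index_find(idx, token);
    for (posting_t *p = e ? e->postings : NULL; p != NULL; p = p->next)
        if (p->doc_id == doc_id) return p->count;
    return 0;
}

/* Number of distinct documents containing token. */
int index_doc_freq(const index_t *idx, const char *token) {
    entry_t *e = index_find(idx, token);
    return e ? e->doc_freq : 0;
}

void index_free(index_t *idx) {
    if (idx == NULL) return;
    for (size_t b = 0; b < TSE_BUCKETS; b++) {
        entry_t *e = idx->buckets[b];
        while (e != NULL) {
            entry_t *next_entry = e->next;
            for (posting_t *p = e->postings; p != NULL; ) {
                posting_t *next_posting = p->next;
                free(p);
                p = next_posting;
            }
            free(e->token);
            free(e);
            e = next_entry;
        }
    }
    free(idx);
}
```

In this sketch a caller would run normalize_token on each word before index_add, and a querier could combine index_count and index_doc_freq when scoring a document for a query term.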

Engineering Practices

  • Defensive programming throughout; all allocations checked, and errors surfaced with clear codes.
  • Valgrind‑driven iteration to eliminate leaks and undefined behavior.
  • Tight inner loops and careful data layout to reduce cache misses on lookups.
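
The "all allocations checked, errors surfaced with clear codes" practice can be sketched as follows. The status enum, its values, and the helper names (tse_status_t, tse_strdup, tse_strerror) are hypothetical and only illustrate the pattern; the project's real error codes may differ.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical status codes returned by every fallible function. */
typedef enum {
    TSE_OK = 0,
    TSE_ERR_ARGS = 1,   /* a required argument was NULL or invalid */
    TSE_ERR_NOMEM = 2,  /* an allocation failed */
    TSE_ERR_IO = 3      /* a read or write failed */
} tse_status_t;

/* Map a status code to a short message suitable for stderr. */
const char *tse_strerror(tse_status_t status) {
    switch (status) {
        case TSE_OK:        return "ok";
        case TSE_ERR_ARGS:  return "invalid argument";
        case TSE_ERR_NOMEM: return "out of memory";
        case TSE_ERR_IO:    return "I/O error";
    }
    return "unknown error";
}

/* Copy src into a fresh heap buffer stored in *out.
 * The allocation is checked, and the caller gets a code instead of a crash. */
tse_status_t tse_strdup(const char *src, char **out) {
    if (src == NULL || out == NULL) return TSE_ERR_ARGS;
    size_t n = strlen(src) + 1;
    char *copy = malloc(n);
    if (copy == NULL) {
        *out = NULL;
        return TSE_ERR_NOMEM;
    }
    memcpy(copy, src, n);
    *out = copy;
    return TSE_OK;
}
```

Returning a status and passing results through an out-parameter keeps every failure path explicit, which also makes leaks easier to spot under Valgrind because each caller knows exactly when it owns the buffer.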

Performance

  • Optimized tokenization and thread pooling, cutting average query latency from 30 s to 0.8 s.
  • Implemented efficient memory management and I/O optimizations for handling 15K+ pages.
  • Bounded memory and streaming writes during indexing to handle larger inputs gracefully.
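
One way to read "streaming writes" is that each token's postings go to disk as soon as they are complete, rather than building the whole output in memory first. The sketch below assumes a hypothetical one-line-per-token text format ("token docID count docID count ...") and a hypothetical function name, write_postings_line; the project's real file format is not specified here.

```c
#include <stdio.h>

/* Write one index line: the token followed by (doc_id, count) pairs.
 * Returns 0 on success, -1 on bad arguments or a failed write.
 * Only this one line is formatted at a time, so memory use stays bounded
 * no matter how many tokens the index holds. */
int write_postings_line(FILE *fp, const char *token,
                        const int *doc_ids, const int *counts, size_t n) {
    if (fp == NULL || token == NULL) return -1;
    if (n > 0 && (doc_ids == NULL || counts == NULL)) return -1;

    if (fputs(token, fp) == EOF) return -1;
    for (size_t i = 0; i < n; i++) {
        if (fprintf(fp, " %d %d", doc_ids[i], counts[i]) < 0) return -1;
    }
    if (fputc('\n', fp) == EOF) return -1;
    return 0;
}
```

Calling this once per token while walking the index lets stdio's buffer handle batching, and checking every write means a full disk is reported as an error instead of leaving a silently truncated index file.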

Project Log

No log entries yet. I’ll share stories, insights, and progress notes here.