Tiny Search Engine

I implemented a miniature search engine in C with three components (Crawler, Indexer, and Querier), focusing on correctness, memory safety, and performance.

Backstory

This project was a deep dive into IR fundamentals: fetching pages, building an index, and answering free‑text queries efficiently without the crutch of high‑level libraries.

Architecture

Crawler: respectful fetcher with URL normalization, deduplication, and domain scoping.
Indexer: inverted index over tokens with document frequencies; compact in‑memory representation plus on‑disk persistence.
Querier: tokenizes and normalizes queries, ranks results using index statistics, and prints annotated matches.
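
To make the Indexer's data structure concrete, here is a minimal sketch of an inverted index that maps each term to a postings list and tracks document frequency. It is an illustration under assumptions, not the project's actual code: the names (index_t, term_t, posting_t, index_add, index_find), the chained hash table, and the bucket count are all hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024            /* hypothetical table size */

typedef struct posting {
    int doc_id;
    int count;                   /* occurrences of the term in this document */
    struct posting *next;
} posting_t;

typedef struct term {
    char *word;
    int doc_freq;                /* number of distinct documents containing word */
    posting_t *postings;
    struct term *next;           /* chain for hash collisions */
} term_t;

typedef struct {
    term_t *buckets[NBUCKETS];
} index_t;

/* djb2 string hash, reduced to a bucket number. */
static unsigned long hash_word(const char *s) {
    unsigned long h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Allocate an empty index; returns NULL on allocation failure. */
index_t *index_new(void) { return calloc(1, sizeof(index_t)); }

/* Look up a term; returns NULL if it is not in the index. */
term_t *index_find(const index_t *idx, const char *word) {
    for (term_t *t = idx->buckets[hash_word(word)]; t; t = t->next)
        if (strcmp(t->word, word) == 0) return t;
    return NULL;
}

/* Record one occurrence of word in doc_id.
 * Returns 0 on success, -1 on allocation failure. */
int index_add(index_t *idx, const char *word, int doc_id) {
    term_t *t = index_find(idx, word);
    if (!t) {
        t = calloc(1, sizeof *t);
        if (!t) return -1;
        t->word = malloc(strlen(word) + 1);
        if (!t->word) { free(t); return -1; }
        strcpy(t->word, word);
        unsigned long b = hash_word(word);
        t->next = idx->buckets[b];
        idx->buckets[b] = t;
    }
    for (posting_t *p = t->postings; p; p = p->next)
        if (p->doc_id == doc_id) { p->count++; return 0; }
    posting_t *p = malloc(sizeof *p);
    if (!p) return -1;
    p->doc_id = doc_id;
    p->count = 1;
    p->next = t->postings;       /* newest document goes at the head */
    t->postings = p;
    t->doc_freq++;               /* first time this document has the term */
    return 0;
}

/* Release every term, postings list, and the index itself. */
void index_free(index_t *idx) {
    if (!idx) return;
    for (size_t b = 0; b < NBUCKETS; b++) {
        term_t *t = idx->buckets[b];
        while (t) {
            term_t *next_term = t->next;
            posting_t *p = t->postings;
            while (p) { posting_t *next_p = p->next; free(p); p = next_p; }
            free(t->word);
            free(t);
            t = next_term;
        }
    }
    free(idx);
}
```

With this layout, a ranking step could read doc_freq and the per-document counts straight from the term entry without rescanning any pages.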

Engineering Practices

Defensive programming throughout; all allocations checked, and errors surfaced with clear codes.
Valgrind‑driven iteration to eliminate leaks and undefined behavior.
Tight inner loops and careful data layout to reduce cache misses on lookups.
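
As one way to picture "all allocations checked, and errors surfaced with clear codes", the sketch below returns a status code from every fallible helper and hands results back through an out-parameter. The tse_status enum, its values, and the helper names are assumptions made for illustration; the project's real codes are not shown here.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical status codes, invented for this sketch. */
typedef enum {
    TSE_OK = 0,
    TSE_ERR_NOMEM = 1,
    TSE_ERR_ARG = 2
} tse_status;

/* Map a status code to a short human-readable message. */
const char *tse_strerror(tse_status s) {
    switch (s) {
    case TSE_OK:        return "ok";
    case TSE_ERR_NOMEM: return "out of memory";
    case TSE_ERR_ARG:   return "invalid argument";
    }
    return "unknown error";
}

/* Duplicate src into *out. The caller owns the copy and must free() it.
 * *out is written only on success, so a failed call leaves it untouched. */
tse_status tse_strdup(const char *src, char **out) {
    if (!src || !out) return TSE_ERR_ARG;
    size_t n = strlen(src) + 1;
    char *copy = malloc(n);
    if (!copy) return TSE_ERR_NOMEM;   /* allocation checked, never assumed */
    memcpy(copy, src, n);
    *out = copy;
    return TSE_OK;
}
```

Returning the status and passing the result by pointer keeps the failure path explicit at every call site, which also makes leaks on error paths easier to spot under Valgrind.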

Performance

Optimized tokenization and thread pooling, reducing average query latency from 30s to 0.8s.
Implemented efficient memory management and I/O optimizations for handling 15K+ pages.
Bounded memory and streaming writes during indexing to handle larger inputs gracefully.
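
The streaming-write idea can be sketched as a function that emits one term's postings to the output file as soon as they are ready, so the serialized index never has to be built up in memory first. The function name index_write_term and the line format ("word docID count docID count ...") are assumptions for illustration, not the project's documented file format.

```c
#include <stdio.h>

/* Write one index line of the form "word docID count docID count ...\n".
 * doc_ids and counts are parallel arrays of length n.
 * Returns 0 on success, -1 on an I/O error. */
int index_write_term(FILE *out, const char *word,
                     const int *doc_ids, const int *counts, size_t n) {
    if (fprintf(out, "%s", word) < 0) return -1;
    for (size_t i = 0; i < n; i++) {
        /* Each pair is written immediately; nothing accumulates in a buffer
         * of our own beyond stdio's fixed-size one. */
        if (fprintf(out, " %d %d", doc_ids[i], counts[i]) < 0) return -1;
    }
    if (fputc('\n', out) == EOF) return -1;
    return 0;
}
```

A caller would invoke this once per term while walking the index, so peak memory during persistence stays tied to stdio's buffer rather than to the size of the output.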
