TL;DR: I pull in everything I read across different mediums - Goodreads, Audible, Spotify, and my physical bookshelf. This builds a live, unified view of it all without me having to track anything manually.
I like to call these "airplane projects" because they're fun and easy to build without LLMs, and they bring me back to why I love to code.
More like a Dyson Sphere than a Dewey Decimal System
At its core, this is a distributed system that aggregates and normalizes book data from multiple sources in real-time. The architecture consists of four independent data collectors (Goodreads, Audible, Spotify, and a physical bookshelf scanner) that feed into a central processing pipeline. Each collector operates on its own schedule, with the physical scanner running continuously and the API-based collectors executing frequent self-regulated syncs.
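To make the scheduling model concrete, here's a minimal sketch of how independent collectors with their own cadences might be represented. All names and intervals here are illustrative, not the actual configuration:

```python
from dataclasses import dataclass


@dataclass
class Collector:
    """One data source with its own sync cadence (names/intervals are made up)."""
    name: str
    interval_s: float       # seconds between syncs; 0 = effectively continuous
    last_run: float = 0.0


def tick(collectors, now):
    """Return the collectors due for a sync on this pass, and mark them run."""
    ready = [c for c in collectors if (now - c.last_run) >= c.interval_s]
    for c in ready:
        c.last_run = now
    return ready


COLLECTORS = [
    Collector("goodreads", interval_s=15 * 60),
    Collector("audible", interval_s=30 * 60),
    Collector("spotify", interval_s=15 * 60),
    Collector("bookshelf_scanner", interval_s=0),  # continuous
]
```

Each pass of the main loop calls `tick()` and fires only the sources that are due, which is what lets the physical scanner run continuously while the API collectors stay on their own self-regulated schedules.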
The integration layer processes this data through a series of transformations: first building relationships between entries using fuzzy matching, then applying normalization rules, and finally resolving any conflicts in metadata. The pipeline doesn't just power the site; it also feeds some of the other projects I'm working on.
All of this feeds into my books page, which displays the unified view of my reading habits across platforms. The page itself is statically generated, but the underlying data pipeline ensures it's always up to date with minimal manual intervention.
Actual Pseudocode Links (updated 2/3/25)
What follows is a simplified(ish) version of the code in my GitHub repo. The actual, more detailed pseudocode is linked below:
- Overview script: GitHub Link #1 (/personal-website-features/vision-books.py)
- Goodreads API-scraper script: GitHub Link #2 (personal-website-features/cleaned-goodreads.ts)
The Technical Architecture
1. Goodreads Integration
When Goodreads shut down their API, I had to build a custom scraper. There were two hurdles here: avoiding rate limits and parsing their notoriously inconsistent data.
The Rate Limit Solve: Modern Web Application Firewalls (WAFs) are aggressive about blocking bots. I solved this by making my scraper act "polite." It uses a jittered exponential backoff strategy: if it gets blocked, it waits, doubling the delay on each retry and adding a random "jitter" duration to appear more human and desynchronize from pattern-detection algorithms.
Since Goodreads has no public API, I reverse-engineered their internal backend routes by intercepting and replaying authenticated requests. The scraper polls and persists rotating CSRF tokens and session headers, refreshing them server-side on a scheduled cadence to maintain valid sessions without manual intervention. I use a pool of concurrent agents to parallelize fetches across different shelf endpoints, each maintaining its own token lifecycle, so the system can ingest an entire library in seconds rather than minutes.
The Data Solve: Handling the inconsistent formatting was the fun part. Some books use "Last, First" author names, others "First Last," and cover URLs vary wildly (18 different formats!). I ended up building a fuzzy matching system (which I talk about later) to normalize these entities.
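As a sketch of what one of those normalization rules looks like, here's a minimal author-name pass. The function name and the single-comma heuristic are illustrative assumptions, not the actual pipeline code:

```python
import re


def normalize_author(raw: str) -> str:
    """Fold 'Last, First' and 'First Last' into one canonical form.

    One normalization rule as an example; the real pipeline layers many
    of these before fuzzy matching ever runs.
    """
    name = re.sub(r"\s+", " ", raw).strip()
    # A single comma almost always signals an inverted library-style name,
    # e.g. "Vonnegut, Kurt" -> "Kurt Vonnegut"
    if name.count(",") == 1:
        last, first = (part.strip() for part in name.split(","))
        if first:
            name = f"{first} {last}"
    return name
```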
```python
async def polite_fetch(url, agent_id, retries=X):
    for i in range(retries):
        try:
            # 1. Pull from pre-warmed browser pool if available — these
            #    sessions have navigation history, reCAPTCHA v3 scores,
            #    and cookie jars already built up from background browsing
            #    across my scrapers. Fall back to a cold token-pool session.
            session = await warm_pool.checkout_browser(agent_id) \
                or await token_pool.checkout(agent_id)

            # 2. Build fingerprint-consistent headers (order matters —
            #    Chrome and Firefox send headers in different sequences)
            headers = OrderedDict([
                ("Host", "www.goodreads.com"),
                ("Sec-Ch-Ua", build_chromium_brand_header()),
                ("Sec-Ch-Ua-Mobile", "?0"),
                ("User-Agent", session.ua),  # pinned per session, not per request
                ("Accept", "text/html,application/xhtml+xml"),
                ("Accept-Language", session.locale),
                ("Referer", "https://www.goodreads.com/review/list"),
                ("Cookie", session.cookie_jar.serialize()),
                ("X-CSRF-Token", session.csrf),
            ])

            # 3. Match TLS + HTTP/2 fingerprint to the spoofed browser
            #    (JA3 hash, H2 SETTINGS frame order, WINDOW_UPDATE values)
            content = await browser.get(
                url,
                headers=headers,
                proxy=session.proxy,
                tls_profile=session.tls_profile,
                h2_fingerprint=session.h2_fingerprint,
            )

            # 4. Detect soft blocks (200 OK but CAPTCHA or empty body)
            if is_soft_blocked(content):
                raise SoftBlockError()

            return parse_reading_shelf(content)

        except (RateLimitError, SoftBlockError):
            wait_time = (2 ** i) + random.uniform(0.5, 3.0)
            await asyncio.sleep(wait_time)
        except TokenExpiredError:
            await token_pool.rotate_and_refresh(agent_id)

    return None  # fallback to cached data
```
2. Audible is a lot like Nike
Audible's data is locked behind some pretty sophisticated anti-bot techniques, almost identical to the Kasada system used by Nike.com (and Footlocker) that I saw back when I built sneaker bots. After poking at their firewall, I found they use client-side virtualization obfuscation: instead of running standard JavaScript, the site runs a custom Virtual Machine (VM) inside your browser, hiding the real logic inside an unreadable blob of bytecode.
To solve this, I applied devirtualization techniques I originally picked up from sneaker drops. I built a disassembler to track the "instruction pointer" through the VM's opcodes, eventually finding the hidden pipes that handle library syncs. I actually ended up reusing code I wrote in 2020 for my sneaker bot to handle the bytecode interpretation.
```python
def run_virtual_stepper(bytecode, state):
    # A simplified view of how I traced the obfuscated logic
    while state.running:
        # 1. Fetch the next command from the unreadable bytecode
        instruction = bytecode[state.ptr]
        state.ptr += 1

        # 2. Execute the custom "Opcode" (like ADD, JUMP, or FETCH_TOKEN)
        if instruction == OP_AUTH_CALL:
            # We found the hidden pipe!
            state.registers[2] = call_proprietary_endpoint(state.token)
        elif instruction == OP_JUMP_IF_BOT:
            # Bypassing the security check
            state.ptr = state.bypass_ptr

    return state.result
```
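If you've never stepped through a VM like this, here's a self-contained toy version with invented opcodes (nothing like the real obfuscated instruction set) showing how a tracer walks the instruction pointer and logs what it sees:

```python
# Invented opcodes for illustration only
OP_PUSH, OP_ADD, OP_JUMP_IF_BOT, OP_HALT = range(4)


def trace_vm(bytecode, is_bot=False):
    """Step the instruction pointer through the bytecode, logging each opcode."""
    stack, trace, ptr = [], [], 0
    while True:
        op = bytecode[ptr]
        trace.append(op)
        ptr += 1
        if op == OP_PUSH:
            stack.append(bytecode[ptr])  # next cell is the operand
            ptr += 1
        elif op == OP_ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == OP_JUMP_IF_BOT:
            target = bytecode[ptr]
            ptr += 1
            if is_bot:
                ptr = target  # the branch you're hunting for while tracing
        elif op == OP_HALT:
            return (stack[-1] if stack else None), trace


# Tiny program: push 2, push 3, add, bot-check, halt
PROGRAM = [OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_JUMP_IF_BOT, 0, OP_HALT]
```

The `trace` list is the interesting output: on the real thing, the sequence of executed opcodes is what eventually reveals where the hidden auth calls live.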
(Former sneaker bot dev Umasi did a much better, more visual job of explaining this process; definitely read his writeup on Kasada.)
3. Bookshelf Scanner
This is by far the most “over-engineered” part of the project: a Raspberry Pi 4 I picked up on eBay, paired with a camera module mounted directly across from my bookshelf. Every hour, it wakes up, snaps a high-resolution photo of my library, and uses computer vision to track which physical books are actually on the shelf.
Because the camera is mounted at a slight angle, and because I'm constantly moving the setup between Minnesota, North Carolina, and San Francisco, the raw images end up full of hard-to-estimate distortions. The book spines look skewed and tilted in ways that make them difficult to interpret. To fix this, I apply perspective correction, using markers I placed on either side of the shelf, to "flatten" the image so the bookshelf appears perfectly front-facing.
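The correction itself is just a homography: four marker points as the camera sees them, four points where they should land. My code uses OpenCV for this, but the underlying math fits in plain NumPy. The marker coordinates below are hypothetical:

```python
import numpy as np


def estimate_homography(src, dst):
    """Solve for the 3x3 perspective transform mapping src -> dst
    (4 point pairs, direct linear transform with h33 fixed to 1)."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)


def warp_point(H, pt):
    """Apply the homography to one point (divide out the projective scale)."""
    u, v, w = H @ np.array([pt[0], pt[1], 1.0])
    return u / w, v / w


# Made-up marker corners as seen by the tilted camera, and where they
# should land in the flattened 3000x1200 image
seen = [(102, 88), (2850, 140), (2905, 1120), (60, 1180)]
flat = [(0, 0), (3000, 0), (3000, 1200), (0, 1200)]
H = estimate_homography(seen, flat)
```

In practice this `H` is exactly what gets handed to `cv2.warpPerspective` to resample the whole frame.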
Once the image is corrected, I run it through a fine-tuned version of Meta’s Segment Anything Model 3 (SAM 3) that I trained specifically to isolate individual book spines. SAM 3’s zero-shot segmentation is impressive out of the box, but it doesn’t natively distinguish “book spine” from “shelf divider” or “bookend,” so I fine-tuned it on a small labeled dataset of my own shelves to produce tight per-spine masks. Each mask gets post-processed — I crop, deskew, and normalize the contrast of each isolated spine independently before passing it to OCR. This pipeline (perspective correction → SAM 3 segmentation → per-spine post-processing → OCR) makes text extraction a LOT (like 5x) more reliable than running OCR on the raw image, because the model is reading one clean, axis-aligned spine at a time instead of trying to parse an entire cluttered shelf.
```python
def process_bookshelf(raw_image):
    # 1. Perspective correction: map the detected shelf outline onto a
    #    flat 3000x1200 canvas
    corners = detect_shelf_outline(raw_image)
    matrix = cv2.getPerspectiveTransform(corners, FLAT_CORNERS)
    flat_view = cv2.warpPerspective(raw_image, matrix, (3000, 1200))
    corrected = apply_lighting_correction(flat_view)

    # 2. Segment individual spines with fine-tuned SAM 3
    sam_predictor.set_image(corrected)
    spine_masks = sam_predictor.generate(
        min_area=500,          # ignore tiny fragments
        class_filter="spine",  # fine-tuned class head
    )

    # 3. Post-process each spine and run OCR
    books = []
    for mask in spine_masks:
        spine_crop = extract_and_deskew(corrected, mask)
        spine_crop = normalize_contrast(spine_crop)
        text = ocr_engine.read(spine_crop)
        if text:
            books.append(parse_title_author(text))
    return books
```
I initially looked into quantizing the model to cut inference cost on the Pi, but the overhead difference between quantized and full SAM 3 turned out to be negligible for a single hourly snapshot — so I kept the full model. Sometimes overkill just isn't worth optimizing away.
4. Solving Data Collisions (Integration Layer)
When you pull data from four different places, you're going to get "collisions": the same book appearing slightly differently in each source. I fuzzy match entries to bridge the gap. Instead of checking whether two titles are exactly the same, the system calculates a similarity score. If "Slaughterhouse Five" and "Slaughterhouse-Five (Anniversary Edition)" come out 95% similar, the system knows they're the same book and merges them. (Notice the score isn't a strict character-by-character comparison; that's intentional.)
```python
def resolve_book_clash(new_book, library):
    for existing in library:
        # Check for similarity in titles
        # 1.0 = exact match, 0.0 = no match
        similarity = calculate_similarity(new_book.title, existing.title)
        if similarity > 0.92:
            # Probably the same book
            return merge_book_details(existing, new_book)

    # No match found, treat it as a new book
    return add_to_library(new_book)
```
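One plausible way to implement `calculate_similarity` is Python's stdlib `difflib` with a normalization pass in front, so edition parentheticals and hyphens don't drag the score down. This is a sketch under that assumption; the real system's scoring may differ:

```python
import re
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Strip edition parentheticals and punctuation before comparing."""
    t = re.sub(r"\(.*?\)", "", title.lower())  # drop "(Anniversary Edition)"
    t = re.sub(r"[^a-z0-9 ]", " ", t)          # hyphens, colons -> spaces
    return re.sub(r"\s+", " ", t).strip()


def calculate_similarity(a: str, b: str) -> float:
    """0.0 = no overlap, 1.0 = identical after normalization."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()
```

Without the normalization step, the raw ratio between those two Slaughterhouse titles lands nowhere near the merge threshold, which is exactly why the score can't be a direct character comparison.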
The system runs automatically, quietly keeping track of my reading habits across all platforms. While it's definitely overkill, it's been running reliably for months with minimal intervention - exactly how I like my side projects.