TL;DR: I pull in everything I read across different mediums - Goodreads, Audible, Spotify, and my physical bookshelf. This builds a live, unified view of it all without me having to track anything manually.
I like to call these "airplane projects" because they're fun and easy to build without LLMs, and they bring me back to why I love to code.
More like a Dyson Sphere than a Dewey Decimal System
At its core, this is a distributed system that aggregates and normalizes book data from multiple sources in real-time. The architecture consists of four independent data collectors (Goodreads, Audible, Spotify, and a physical bookshelf scanner) that feed into a central processing pipeline. Each collector operates on its own schedule, with the physical scanner running continuously and the API-based collectors executing frequent self-regulated syncs.
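To make the scheduling model concrete, here's a minimal sketch of how independent collectors with their own cadences might be represented. All names and intervals here are illustrative, not the actual configuration:

```python
from dataclasses import dataclass


@dataclass
class Collector:
    """One data source with its own sync cadence (names/intervals are made up)."""
    name: str
    interval_s: float       # seconds between syncs; 0 = effectively continuous
    last_run: float = 0.0


def tick(collectors, now):
    """Return the collectors due for a sync on this pass, and mark them run."""
    ready = [c for c in collectors if (now - c.last_run) >= c.interval_s]
    for c in ready:
        c.last_run = now
    return ready


COLLECTORS = [
    Collector("goodreads", interval_s=15 * 60),
    Collector("audible", interval_s=30 * 60),
    Collector("spotify", interval_s=15 * 60),
    Collector("bookshelf_scanner", interval_s=0),  # continuous
]
```

Each pass of the main loop calls `tick()` and fires only the sources that are due, which is what lets the physical scanner run continuously while the API collectors stay on their own self-regulated schedules.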
The integration layer processes this data through a series of transformations: first building relationships between entries using fuzzy matching, then applying normalization rules, and finally resolving any conflicts in metadata. The pipeline doesn't just power the site; it also feeds some of the other projects I'm working on.
All of this feeds into my books page, which displays the unified view of my reading habits across platforms. The page itself is statically generated, but the underlying data pipeline ensures it's always up to date with minimal manual intervention.
Actual Pseudocode Links (updated 2/3/25)
What follows is a simplified(ish) version of the code in my GitHub repo. The actual, more detailed pseudocode is linked below:
- Overview script: GitHub Link #1 (/personal-website-features/vision-books.py)
- Goodreads API-scraper script: GitHub Link #2 (personal-website-features/cleaned-goodreads.ts)
The Technical Architecture
1. Goodreads Integration
When Goodreads shut down their API, I had to build a custom scraper. There were two hurdles here: avoiding rate limits and parsing their notoriously inconsistent data.
The Rate Limit Solve: Modern Web Application Firewalls (WAFs) are aggressive about blocking bots. I solved this by making my scraper act "polite." It uses a jittered exponential backoff strategy: if it gets blocked, it waits, doubling the delay on each retry and adding a random "jitter" duration to appear more human and desynchronize from pattern-detection algorithms.
Since Goodreads has no public API, I reverse-engineered their internal backend routes by intercepting and replaying authenticated requests. The scraper polls and persists rotating CSRF tokens and session headers, refreshing them server-side on a scheduled cadence to maintain valid sessions without manual intervention. I use a pool of concurrent agents to parallelize fetches across different shelf endpoints, each maintaining its own token lifecycle, so the system can ingest an entire library in seconds rather than minutes.
The Data Solve: Handling the inconsistent formatting was the fun part. Some books use "Last, First" author names, others "First Last," and cover URLs vary wildly (18 different formats!). I ended up building a fuzzy matching system (which I talk about later) to normalize these entities.
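As a sketch of what one of those normalization rules looks like, here's a minimal author-name pass. The function name and the single-comma heuristic are illustrative assumptions, not the actual pipeline code:

```python
import re


def normalize_author(raw: str) -> str:
    """Fold 'Last, First' and 'First Last' into one canonical form.

    One normalization rule as an example; the real pipeline layers many
    of these before fuzzy matching ever runs.
    """
    name = re.sub(r"\s+", " ", raw).strip()
    # A single comma almost always signals an inverted library-style name,
    # e.g. "Vonnegut, Kurt" -> "Kurt Vonnegut"
    if name.count(",") == 1:
        last, first = (part.strip() for part in name.split(","))
        if first:
            name = f"{first} {last}"
    return name
```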
```python
async def polite_fetch(url, agent_id, retries=X):
    for i in range(retries):
        try:
            # 1. Pull from pre-warmed browser pool if available — these
            #    sessions have navigation history, reCAPTCHA v3 scores,
            #    and cookie jars already built up from background browsing
            #    across my scrapers. Fall back to a cold token-pool session.
            session = await warm_pool.checkout_browser(agent_id) \
                or await token_pool.checkout(agent_id)

            # 2. Build fingerprint-consistent headers (order matters —
            #    Chrome and Firefox send headers in different sequences)
            headers = OrderedDict([
                ("Host", "www.goodreads.com"),
                ("Sec-Ch-Ua", build_chromium_brand_header()),
                ("Sec-Ch-Ua-Mobile", "?0"),
                ("User-Agent", session.ua),  # pinned per session, not per request
                ("Accept", "text/html,application/xhtml+xml"),
                ("Accept-Language", session.locale),
                ("Referer", "https://www.goodreads.com/review/list"),
                ("Cookie", session.cookie_jar.serialize()),
                ("X-CSRF-Token", session.csrf),
            ])

            # 3. Match TLS + HTTP/2 fingerprint to the spoofed browser
            #    (JA3 hash, H2 SETTINGS frame order, WINDOW_UPDATE values)
            content = await browser.get(
                url,
                headers=headers,
                proxy=session.proxy,
                tls_profile=session.tls_profile,
                h2_fingerprint=session.h2_fingerprint,
            )

            # 4. Detect soft blocks (200 OK but CAPTCHA or empty body)
            if is_soft_blocked(content):
                raise SoftBlockError()

            return parse_reading_shelf(content)

        except (RateLimitError, SoftBlockError):
            wait_time = (2 ** i) + random.uniform(0.5, 3.0)
            await asyncio.sleep(wait_time)
        except TokenExpiredError:
            await token_pool.rotate_and_refresh(agent_id)

    return None  # fallback to cached data
```
2. Audible is a lot like Nike
Audible's data is locked behind some pretty sophisticated anti-bot techniques, almost identical to the Kasada system used by Nike.com (and Footlocker) that I saw back when I built sneaker bots. After poking at their firewall, I found they use client-side virtualization obfuscation: instead of running standard JavaScript, the site runs a custom Virtual Machine (VM) inside your browser, hiding the real logic inside an unreadable blob of bytecode.
To solve this, I applied devirtualization techniques I originally picked up from sneaker drops. I built a disassembler to track the "instruction pointer" through the VM's opcodes, eventually finding the hidden pipes that handle library syncs. I actually ended up reusing code I wrote in 2020 for my sneaker bot to handle the bytecode interpretation.
```python
def run_virtual_stepper(bytecode, state):
    # A simplified view of how I traced the obfuscated logic
    while state.running:
        # 1. Fetch the next command from the unreadable bytecode
        instruction = bytecode[state.ptr]
        state.ptr += 1

        # 2. Execute the custom "Opcode" (like ADD, JUMP, or FETCH_TOKEN)
        if instruction == OP_AUTH_CALL:
            # We found the hidden pipe!
            state.registers[2] = call_proprietary_endpoint(state.token)
        elif instruction == OP_JUMP_IF_BOT:
            # Bypassing the security check
            state.ptr = state.bypass_ptr

    return state.result
```
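If you've never stepped through a VM like this, here's a self-contained toy version with invented opcodes (nothing like the real obfuscated instruction set) showing how a tracer walks the instruction pointer and logs what it sees:

```python
# Invented opcodes for illustration only
OP_PUSH, OP_ADD, OP_JUMP_IF_BOT, OP_HALT = range(4)


def trace_vm(bytecode, is_bot=False):
    """Step the instruction pointer through the bytecode, logging each opcode."""
    stack, trace, ptr = [], [], 0
    while True:
        op = bytecode[ptr]
        trace.append(op)
        ptr += 1
        if op == OP_PUSH:
            stack.append(bytecode[ptr])  # next cell is the operand
            ptr += 1
        elif op == OP_ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == OP_JUMP_IF_BOT:
            target = bytecode[ptr]
            ptr += 1
            if is_bot:
                ptr = target  # the branch you're hunting for while tracing
        elif op == OP_HALT:
            return (stack[-1] if stack else None), trace


# Tiny program: push 2, push 3, add, bot-check, halt
PROGRAM = [OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_JUMP_IF_BOT, 0, OP_HALT]
```

The `trace` list is the interesting output: on the real thing, the sequence of executed opcodes is what eventually reveals where the hidden auth calls live.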
(Former sneaker bot dev Umasi did a much better, more visual job of explaining this process; definitely read his writeup on Kasada.)
3. Bookshelf Scanner
This is by far the most “over-engineered” part of the project: a Raspberry Pi 4 I picked up on eBay, paired with a camera module mounted directly across from my bookshelf. Every hour, it wakes up, snaps a high-resolution photo of my library, and uses computer vision to track which physical books are actually on the shelf.
Because the camera is mounted at a slight angle, and because I'm constantly moving the setup between Minnesota, North Carolina, and San Francisco, the raw images end up full of hard-to-estimate distortions. The book spines look skewed and tilted in ways that make them difficult to interpret. To fix this, I apply perspective correction, using markers I placed on either side of the shelf, to "flatten" the image so the bookshelf appears perfectly front-facing.
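The correction itself is just a homography: four marker points as the camera sees them, four points where they should land. My code uses OpenCV for this, but the underlying math fits in plain NumPy. The marker coordinates below are hypothetical:

```python
import numpy as np


def estimate_homography(src, dst):
    """Solve for the 3x3 perspective transform mapping src -> dst
    (4 point pairs, direct linear transform with h33 fixed to 1)."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)


def warp_point(H, pt):
    """Apply the homography to one point (divide out the projective scale)."""
    u, v, w = H @ np.array([pt[0], pt[1], 1.0])
    return u / w, v / w


# Made-up marker corners as seen by the tilted camera, and where they
# should land in the flattened 3000x1200 image
seen = [(102, 88), (2850, 140), (2905, 1120), (60, 1180)]
flat = [(0, 0), (3000, 0), (3000, 1200), (0, 1200)]
H = estimate_homography(seen, flat)
```

In practice this `H` is exactly what gets handed to `cv2.warpPerspective` to resample the whole frame.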
Once the image is corrected, I run it through a fine-tuned version of Meta’s Segment Anything Model 3 (SAM 3) that I trained specifically to isolate individual book spines. SAM 3’s zero-shot segmentation is impressive out of the box, but it doesn’t natively distinguish “book spine” from “shelf divider” or “bookend,” so I fine-tuned it on a small labeled dataset of my own shelves to produce tight per-spine masks. Each mask gets post-processed — I crop, deskew, and normalize the contrast of each isolated spine independently before passing it to OCR. This pipeline (perspective correction → SAM 3 segmentation → per-spine post-processing → OCR) makes text extraction a LOT (like 5x) more reliable than running OCR on the raw image, because the model is reading one clean, axis-aligned spine at a time instead of trying to parse an entire cluttered shelf.
```python
def process_bookshelf(raw_image):
    # 1. Perspective correction: map the detected shelf outline onto a
    #    flat 3000x1200 canvas
    corners = detect_shelf_outline(raw_image)
    matrix = cv2.getPerspectiveTransform(corners, FLAT_CORNERS)
    flat_view = cv2.warpPerspective(raw_image, matrix, (3000, 1200))
    corrected = apply_lighting_correction(flat_view)

    # 2. Segment individual spines with fine-tuned SAM 3
    sam_predictor.set_image(corrected)
    spine_masks = sam_predictor.generate(
        min_area=500,          # ignore tiny fragments
        class_filter="spine",  # fine-tuned class head
    )

    # 3. Post-process each spine and run OCR
    books = []
    for mask in spine_masks:
        spine_crop = extract_and_deskew(corrected, mask)
        spine_crop = normalize_contrast(spine_crop)
        text = ocr_engine.read(spine_crop)
        if text:
            books.append(parse_title_author(text))
    return books
```
I initially looked into quantizing the model to cut inference cost on the Pi, but the overhead difference between quantized and full SAM 3 turned out to be negligible for a single hourly snapshot — so I kept the full model. Sometimes overkill just isn't worth optimizing away.
4. Solving Data Collisions (Integration Layer)
When you pull data from four different places, you're going to get "collisions": the same book appearing slightly differently in each source. I fuzzy match entries to bridge the gap. Instead of checking whether two titles are exactly the same, the system calculates a similarity score. If "Slaughterhouse Five" and "Slaughterhouse-Five (Anniversary Edition)" come out 95% similar, the system knows they're the same book and merges them. (Notice the score isn't a strict character-by-character comparison; that's intentional.)
```python
def resolve_book_clash(new_book, library):
    for existing in library:
        # Check for similarity in titles
        # 1.0 = exact match, 0.0 = no match
        similarity = calculate_similarity(new_book.title, existing.title)
        if similarity > 0.92:
            # Probably the same book
            return merge_book_details(existing, new_book)

    # No match found, treat it as a new book
    return add_to_library(new_book)
```
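One plausible way to implement `calculate_similarity` is Python's stdlib `difflib` with a normalization pass in front, so edition parentheticals and hyphens don't drag the score down. This is a sketch under that assumption; the real system's scoring may differ:

```python
import re
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Strip edition parentheticals and punctuation before comparing."""
    t = re.sub(r"\(.*?\)", "", title.lower())  # drop "(Anniversary Edition)"
    t = re.sub(r"[^a-z0-9 ]", " ", t)          # hyphens, colons -> spaces
    return re.sub(r"\s+", " ", t).strip()


def calculate_similarity(a: str, b: str) -> float:
    """0.0 = no overlap, 1.0 = identical after normalization."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()
```

Without the normalization step, the raw ratio between those two Slaughterhouse titles lands nowhere near the merge threshold, which is exactly why the score can't be a direct character comparison.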
The system runs automatically, quietly keeping track of my reading habits across all platforms. While it's definitely overkill, it's been running reliably for months with minimal intervention - exactly how I like my side projects.