Yoink
Yoink is a polite-by-default public data crawler we built for jobs that take longer than a single process should ever be trusted to live: long enough that the network will glitch, the host will reboot, and a naive crawler will lose everything it had. Async Python, ~3,200 lines, 134 tests, and a resumable checkpoint model that means you can kill it and restart it without rebuilding state.
Lines of Python: ~3,200
Tests: 134
Default behavior: Polite
Why build another crawler
Most general-purpose crawlers fall into two camps. One is the academic toolkit — flexible, but you write your own scheduling, retries, robots.txt parser, and checkpoint logic. The other is the SaaS scraper — quick to start, but you lose the ability to run it locally, version-control the spec, or audit what it actually fetched. We needed something in the middle: a single Python package you install, point at a target, and trust to behave correctly on someone else's site for hours at a time without supervision.
The specific job that triggered it was harvesting a few million pages of structured public data for analysis. Cloud scrapers wanted six figures a year. Writing our own from scratch would have been a week of yak-shaving before the first row landed. Yoink was the third option.
Polite by default
The defining design choice was making polite behavior the default path, not an opt-in flag. Yoink reads and honors robots.txt before it touches anything else; it respects crawl-delay; it rate-limits per host (not per process) so a job spread across workers can't accidentally hammer the same origin; and it advertises a real User-Agent that points to a contact address. None of these are novel — they're standard — but most crawlers leave them off until the developer remembers, and most developers don't.
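A minimal sketch of the polite path described above, not Yoink's actual code. It uses the standard library's `urllib.robotparser` for the robots.txt rules and crawl-delay. The `USER_AGENT` string, `load_robots`, and `HostRateLimiter` are hypothetical names chosen for illustration. This limiter only coordinates asyncio tasks inside one process; spreading a job across worker processes would need shared state, which is left out here.

```python
import asyncio
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical User-Agent; the contact address is a placeholder.
USER_AGENT = "ExampleCrawler/0.1 (+mailto:crawler-contact@example.org)"


def load_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that was fetched before any other request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class HostRateLimiter:
    """Spaces out requests per host for every task sharing this instance."""

    def __init__(self, default_delay: float = 1.0) -> None:
        self.default_delay = default_delay
        self._next_allowed: dict[str, float] = {}

    async def wait(self, url: str, robots: RobotFileParser) -> float:
        """Sleep until this host may be contacted again; return seconds slept."""
        host = urlsplit(url).netloc
        # Prefer the site's Crawl-delay; fall back to our own default.
        delay = robots.crawl_delay(USER_AGENT) or self.default_delay
        # There is no await between reading and writing the slot, so
        # concurrent asyncio tasks cannot claim the same slot.
        now = time.monotonic()
        start = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = start + delay
        if start > now:
            await asyncio.sleep(start - now)
        return start - now
```

A caller would check `robots.can_fetch(USER_AGENT, url)` first, skip the URL if it is disallowed, and otherwise `await limiter.wait(url, robots)` before sending the request.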
The trade-off is that Yoink is slower than an aggressive scraper. On a target with a 10-second crawl-delay, Yoink makes at most one request to that host every 10 seconds, which is about 10× slower than a tool that ignores the directive and fetches once per second. We decided that was correct: the people running the sites we crawl are the ones publishing the data we want indexed, and being a good citizen is the price of admission.
Resumable checkpoints
The other defining design choice was the checkpoint system. Yoink writes a structured checkpoint to S3 at every URL transition — request, response, parse result, errors — keyed by a deterministic crawl ID. If the process dies, the next run reads the checkpoint, skips everything already fetched, and resumes at the exact frontier where it left off. There is no separate database, no in-memory queue to lose, no "are we sure it finished" question at the end of an eight-hour run.
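The text does not say how the crawl ID is derived. One common way to get a deterministic ID, shown here purely as an assumption, is to hash a canonical serialization of the job spec, so the same spec always maps to the same checkpoint key. The function name `crawl_id` and the spec fields are hypothetical.

```python
import hashlib
import json


def crawl_id(spec: dict) -> str:
    """Derive a stable ID from a job spec (an assumed scheme, not Yoink's)."""
    # Sorting keys and fixing separators makes the serialization canonical,
    # so key order in the spec does not change the resulting ID.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

Restarting with an unchanged spec then finds the same checkpoint, while any edit to the spec starts a fresh crawl.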
The implementation is unremarkable on purpose. Checkpoints are line-delimited JSON in an S3 prefix; resuming a crawl is a streamed read of that prefix. The boring choice survives operational failures that the clever choice doesn't, and crawling at scale is mostly operational failures.
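A minimal sketch of the resume logic, under stated assumptions: the record fields (`event`, `url`) and the event names are hypothetical, and the storage layer is left out. The functions take any iterable of lines, which could be an open file or lines streamed from object storage.

```python
import json
from typing import Iterable, Iterator


def encode_record(event: str, url: str, **fields) -> str:
    """Serialize one checkpoint entry as a single JSONL line."""
    return json.dumps({"event": event, "url": url, **fields}) + "\n"


def completed_urls(lines: Iterable[str]) -> set[str]:
    """Stream checkpoint lines and collect URLs that were fully parsed."""
    done: set[str] = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # A crash can leave a torn final line; treat it as not done.
            continue
        if record.get("event") == "parsed":
            done.add(record["url"])
    return done


def remaining(frontier: Iterable[str], lines: Iterable[str]) -> Iterator[str]:
    """Yield frontier URLs that the checkpoint does not mark as finished."""
    done = completed_urls(lines)
    return (url for url in frontier if url not in done)
```

A URL with a request record but no parse record is fetched again on resume, which errs toward repeating work rather than losing it.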
JS rendering, but only when asked
About a third of the targets we needed required JavaScript to render the data — single-page apps, lazy-loaded sections, hydration-time content. Yoink can drive a headless browser when configured to, but never by default. The cost difference between fetch-and-parse and headless-render is roughly two orders of magnitude in both time and money; making it opt-in per target keeps the easy cases easy. Configuration lives in a typed YAML spec that's version-controlled alongside the job.
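A rough sketch of what an opt-in rendering flag in a typed spec could look like. The field names (`name`, `start_url`, `render_js`) are assumptions, not Yoink's real schema, and the YAML loading step is omitted: `from_mapping` takes the mapping a YAML parser would produce.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class TargetSpec:
    """Hypothetical per-target settings; field names are illustrative."""

    name: str
    start_url: str
    render_js: bool = False  # headless rendering stays off unless asked

    @classmethod
    def from_mapping(cls, raw: dict[str, Any]) -> "TargetSpec":
        """Validate a parsed spec entry and reject anything untyped."""
        unknown = set(raw) - {"name", "start_url", "render_js"}
        if unknown:
            raise ValueError(f"unknown keys: {sorted(unknown)}")
        for key in ("name", "start_url"):
            if not isinstance(raw.get(key), str):
                raise TypeError(f"{key} must be a string")
        render_js = raw.get("render_js", False)
        if not isinstance(render_js, bool):
            raise TypeError("render_js must be true or false")
        return cls(raw["name"], raw["start_url"], render_js)
```

Rejecting unknown keys means a typo such as `render-js` fails loudly instead of silently leaving rendering off.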
What we'd change
Two things, if we started over. First, the checkpoint format would be Parquet or a length-prefixed binary format instead of JSONL, because JSON parsing dominated wall-clock time on the resume path once the checkpoint grew past a few hundred thousand entries. Second, we'd build the contact-address / sitemap-aware discovery loop earlier: treating sitemap.xml as the primary URL source (rather than HTML-link discovery) noticeably cuts both time-to-first-row and host load on sites that publish one.
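To illustrate the sitemap-first idea, here is a small sketch that pulls URLs out of a sitemap document with the standard library. It assumes the XML has already been fetched and uses the standard sitemaps.org namespace; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def _locs(xml_text: str, parent: str) -> list[str]:
    """Collect <loc> values found under the given parent element name."""
    root = ET.fromstring(xml_text)
    found = root.findall(f"sm:{parent}/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in found if loc.text]


def sitemap_urls(xml_text: str) -> list[str]:
    """Page URLs listed in a <urlset> sitemap."""
    return _locs(xml_text, "url")


def child_sitemaps(xml_text: str) -> list[str]:
    """Nested sitemap URLs listed in a <sitemapindex> document."""
    return _locs(xml_text, "sitemap")
```

One request for the sitemap can seed the whole frontier, where link discovery would need to fetch and parse many HTML pages to find the same URLs.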