← WORK
FORGE — hero
Open Source Data Pipeline2025

FORGE

The Gap
Cold outreach only works if the data is clean. Most small-business datasets are stale, duplicated, missing websites, or scraped from a single source that was wrong to begin with. Developers building outreach tools waste weeks on enrichment before they can touch the actual product.
Challenges
Enriching 16 million small businesses across every state means pulling from dozens of sources, resolving conflicts, and keeping it current. Doing it in a way a solo developer can run on their own machine, without burning cloud credits, meant the pipeline had to work at two different scales.
What We Built
A data enrichment pipeline connected to government registries, Google Maps, Google Places, and a dozen other sources. 16.5 million small businesses, every state in the country, enriched down to the website. Two flavors. One runs against a local LLM on your own hardware so you're not paying per token. The other runs against Claude and the OpenAI API for developers who'd rather use frontier models than manage prompting themselves. Released open source on the Nuclear Marmalade repo. Same pipeline Nuclear Directories runs on.
Inside the build
01

Email Extraction in 6 Layers

Mailto links, regex passes, Cloudflare decoding, JSON-LD parsing, obfuscation handling, and contact-page crawl. Every layer fails gracefully into the next. Whatever's reachable from the surface, Forge finds it.

02

SMTP Verification, No Bounces

Generates candidate addresses (info@, contact@, admin@) and verifies them via SMTP RCPT TO without ever sending mail. Clean validation, no spam flags, no inbox damage.

03

30+ Tech Stack Detection

Identifies WordPress, Shopify, React, Next.js, Stripe, Intercom, Google Analytics, Tailwind, and more on every target site. Lead scoring without manual tagging — the platform tells you what the prospect runs.

04

Local AI Enrichment

Ollama-backed inference generates business summaries, industry classification, health scores, and pain points. No per-token cost. No API key. No data leaving the box.

FORGE — detail 1FORGE — detail 2
Beyond the API

Replaces a five-figure subscription line item.

  • Government Data Importers

    Built-in connectors to FCC ULS (3M telecom licenses), NPI Registry (7M healthcare providers), and SAM.gov (900K federal contractors). Nightly batch ingest, dedup, and match — included, not bolted on.

  • Quality and Safety Layer

    A Haiku audit agent spot-checks enrichments. Field validators catch garbage. COALESCE writes never overwrite existing data. Every batch is rollback-able. Build mistakes don't become data debt.

  • Operational Reliability

    Checkpoint-based resume after crashes, hourly watchdog process, and graceful degradation when individual extraction layers fail. The pipeline keeps running while you sleep.

  • Native MCP Integration

    First-class Model Context Protocol server. Claude (or any MCP client) can call discover, enrich, search, export, and stats tools directly. The dataset becomes an agent's working memory.

MORE
PROJECTS