Generative AI Solutions — Case Study

AI Lead Generation Platform

The client was a fast-growing B2B sales platform serving mid-market revenue teams across North America. Their existing prospecting workflow leaned on a patchwork of standalone data vendors, each priced per seat and exported as a CSV that an SDR cleaned manually before importing into HubSpot. As the user base grew past four hundred seats, the per-seat data spend was outpacing platform ARPU, and customer support tickets about stale or duplicated contacts were becoming the top driver of churn.

+40% lead accuracy
-70% manual prospecting time
10+ sources unified
60s time to CRM-ready set
Category: Generative AI Solutions
Industry: SalesTech / B2B SaaS
Timeline: 11 weeks from kickoff to first production cohort
Team size: 5 specialists

Project Overview

The full story

The specific user pain was operational. SDRs were spending the first ninety minutes of every shift reconciling exports — copying titles between Apollo and Lusha, hand-checking email validity through a third tool, then chasing duplicates that crept in from LinkedIn scrape jobs. The data was three to six weeks old by the time a sequence went out. Sales leaders had no way to attribute conversion lift to a specific data source, so they could not negotiate pricing with their vendors.

We designed a single ingestion plane that pulled from ten verified sources behind one normalized contact schema, with a streaming deduplication engine keyed on company domain plus a fuzzy match on name and title. An enrichment step ran a confidence score per field and only wrote back to the CRM when the score cleared a tenant-configurable threshold. We layered an attribution model on top so the platform could show which source produced the highest reply rate per segment.
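
The deduplication key described above can be sketched roughly as follows. This is a minimal illustration, not the production engine: the `Contact` fields, the 0.85 similarity threshold, deriving the company domain from the email address, and the use of Python's standard `difflib.SequenceMatcher` as the fuzzy matcher are all assumptions made for the example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Contact:
    name: str
    title: str
    email: str
    source: str  # vendor label, illustrative only


def domain_of(contact: Contact) -> str:
    """Company domain, taken here from the email address (an assumption)."""
    return contact.email.rsplit("@", 1)[-1].strip().lower()


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def is_duplicate(a: Contact, b: Contact, threshold: float = 0.85) -> bool:
    """Exact match on domain, fuzzy match on both name and title."""
    if domain_of(a) != domain_of(b):
        return False
    return (similarity(a.name, b.name) >= threshold
            and similarity(a.title, b.title) >= threshold)


def deduplicate(contacts: list[Contact]) -> list[list[Contact]]:
    """Group contacts into clusters of likely duplicates, bucketed by domain."""
    buckets: dict[str, list[list[Contact]]] = {}
    for contact in contacts:
        clusters = buckets.setdefault(domain_of(contact), [])
        for cluster in clusters:
            # Compare against the first member of each existing cluster.
            if is_duplicate(cluster[0], contact):
                cluster.append(contact)
                break
        else:
            clusters.append([contact])
    return [cluster for clusters in buckets.values() for cluster in clusters]
```

Bucketing by domain first keeps the fuzzy comparison limited to contacts at the same company, which is the cheap exact-match step the key relies on before any string similarity is computed.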

What shipped was a unified prospecting workspace where an SDR pastes a target account list and gets a CRM-ready contact set in under sixty seconds, with provenance and confidence visible per row. HubSpot and Salesforce sync runs continuously rather than nightly. The vendor selection tool recommends which sources to retain based on actual outbound performance, giving sales ops a defensible position in vendor renewal conversations.

The Problem

Sales teams were paying for ten data tools and still importing stale, duplicated contacts into the CRM by hand each morning.

01 Friction point

Ten separate data vendors with overlapping coverage, paid per seat, and no shared schema across the exports SDRs received daily.

02 Friction point

Manual deduplication and email validation consumed roughly ninety minutes per rep per day before any outbound work began.

03 Friction point

Contact records aged three to six weeks before a sequence reached the prospect, which suppressed reply rates and burned domains.

04 Friction point

No attribution back to source meant sales ops could not negotiate vendor pricing or justify cutting the lowest-performing feeds.

05 Friction point

CRM ingestion ran as a nightly cron, so updated titles and role changes did not surface until the following day at the earliest.

Our Approach

How we structured the engagement

We treated this as a data engineering problem first and a model problem second — clean inputs before clever inference.

  1. Phase 01, Weeks 1-2

    Discovery

    Sat with four SDR teams for a full week, instrumented the existing CSV workflow, and measured time-to-clean per record. Audited the ten vendor APIs for rate limits, field coverage, and licensing terms. Output: a normalized contact schema and a ranked source-priority table per field.

  2. Phase 02, Weeks 3-4

    Architecture

    Designed a streaming ingestion plane with one connector per vendor, a Kafka topic per source, and a single deduplication consumer keyed on domain plus fuzzy name match. Picked PostgreSQL with the citext extension as the system of record and Redis for the in-flight match cache.

  3. Phase 03, Weeks 5-9

    Build

    Shipped the ten connectors in parallel pairs, then the deduplication engine, then the confidence scorer. Used a fine-tuned DistilBERT for title normalization and a heuristic ensemble for email validity. Wrote a tenant-scoped sync layer for HubSpot and Salesforce with backoff and dead-letter handling.

  4. Phase 04, Weeks 10-11

    Launch

    Rolled out to three design-partner accounts behind a feature flag, ran a four-week soak with daily accuracy audits, then promoted to general availability. Built the attribution dashboard during soak based on what design partners actually asked about during weekly review calls.

System Architecture

What we built, component by component

  1. 01

    Source connectors

    Ten vendor-specific clients with per-source rate limiting, retry, and schema translation into the normalized contact format.

  2. 02

    Ingestion stream

    Kafka topic per source plus a fan-in topic that the deduplication consumer reads, providing replay and audit history.

  3. 03

    Deduplication engine

    Keyed on domain plus a fuzzy name-plus-title match; holds an in-flight cache in Redis with a five-minute window.

  4. 04

    Confidence scorer

    Per-field score combining source priority, recency, and cross-source agreement, written alongside every contact record.

  5. 05

    Contact store

    PostgreSQL with citext columns, partitioned by tenant, with row-level security and per-tenant retention policies.

  6. 06

    CRM sync layer

    Tenant-scoped HubSpot and Salesforce workers with exponential backoff, idempotency keys, and a dead-letter queue.

  7. 07

    Attribution service

    Joins outbound activity from the CRM back to the source that supplied each contact, exposed via a sales-ops dashboard.
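
As a rough sketch of what the attribution join computes, the example below maps outbound events back to the source that supplied each contact and derives a reply rate per segment and source. The `OutboundEvent` shape, the `contact_source` lookup, and both function names are hypothetical; the real service reads this data from the CRM and the contact store.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class OutboundEvent:
    contact_id: str
    segment: str   # e.g. an industry or persona label (illustrative)
    replied: bool


def reply_rate_by_source(events, contact_source):
    """Return {(segment, source): reply_rate} for events with known provenance."""
    sent = defaultdict(int)
    replies = defaultdict(int)
    for event in events:
        source = contact_source.get(event.contact_id)
        if source is None:
            continue  # no provenance recorded for this contact; skip it
        key = (event.segment, source)
        sent[key] += 1
        replies[key] += int(event.replied)
    return {key: replies[key] / sent[key] for key in sent}


def best_source_per_segment(rates):
    """Pick the source with the highest reply rate in each segment."""
    best = {}
    for (segment, source), rate in rates.items():
        if segment not in best or rate > best[segment][1]:
            best[segment] = (source, rate)
    return best
```

A table like the one `best_source_per_segment` returns is the kind of evidence the dashboard gives sales ops when deciding which feeds to keep.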

Data Flow

A user request triggers parallel pulls across the ten source connectors, each writing to its own Kafka topic. The deduplication consumer merges matching records into a single contact, the confidence scorer annotates each field, and the contact store accepts the row only when the per-field threshold is met. The CRM sync layer then pushes the record to the destination CRM, and the attribution service waits for downstream reply or meeting events to close the loop.

Source connectors → Ingestion stream → Deduplication engine → Confidence scorer → Contact store

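
The scoring and acceptance steps in this flow could look something like the sketch below. It assumes a simple weighted sum of the three signals named for the confidence scorer (source priority, recency, and cross-source agreement). The weights, the 90-day recency horizon, and the `FieldObservation` and `gate` names are illustrative assumptions, not the production values.

```python
from dataclasses import dataclass


@dataclass
class FieldObservation:
    value: str
    source_priority: float  # 0..1, from a ranked source-priority table
    age_days: float         # days since the value was last verified
    agreeing_sources: int   # sources reporting this same value
    total_sources: int      # sources reporting any value for the field


# Hypothetical weights and recency horizon, chosen only for the example.
WEIGHTS = {"priority": 0.4, "recency": 0.3, "agreement": 0.3}
RECENCY_HORIZON_DAYS = 90.0


def field_confidence(obs: FieldObservation) -> float:
    """Weighted sum of source priority, recency, and cross-source agreement."""
    recency = max(0.0, 1.0 - obs.age_days / RECENCY_HORIZON_DAYS)
    agreement = obs.agreeing_sources / obs.total_sources if obs.total_sources else 0.0
    return (WEIGHTS["priority"] * obs.source_priority
            + WEIGHTS["recency"] * recency
            + WEIGHTS["agreement"] * agreement)


def gate(record: dict, thresholds: dict) -> tuple:
    """Return (accepted, reasons); every thresholded field must clear its minimum."""
    reasons = []
    for field_name, minimum in thresholds.items():
        obs = record.get(field_name)
        if obs is None:
            reasons.append(f"{field_name}: missing")
            continue
        score = field_confidence(obs)
        if score < minimum:
            reasons.append(f"{field_name}: score {score:.2f} below threshold {minimum:.2f}")
    return (not reasons, reasons)
```

Returning the reasons alongside the decision keeps a rejection explainable, so a tenant can see which field fell short of its configured threshold.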
Key Decisions

The trade-offs we made and why

Decision 01 (Lead trade-off)

Chose Kafka over a database trigger pipeline

Triggers would have coupled vendor latency to the write path and made replay during a vendor outage impossible. Kafka let us decouple ingestion from deduplication, replay a bad day cleanly, and add the eleventh vendor without touching existing consumers.

Decision 02

Used PostgreSQL with citext over Elasticsearch

Most queries were exact-match on domain or email, not full-text. Postgres gave us row-level security per tenant, transactional updates from the sync workers, and lower operational cost than a search cluster. We pushed fuzzy matching into a Redis-backed in-flight cache instead.

Decision 03

Fine-tuned DistilBERT for title normalization rather than calling GPT

Title strings are short and repetitive, and the latency budget per record was under fifty milliseconds. A six-megabyte fine-tune ran on CPU at the ingest node and removed a per-call cost that would have broken the per-record economics at scale.

Decision 04

Wrote a custom CRM sync layer instead of using a third-party iPaaS

iPaaS pricing per record would have eroded gross margin past five hundred customers, and we needed tenant-scoped retry semantics for compliance. The custom layer is roughly twelve hundred lines of Python and is the most boring, most stable part of the system.
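
The retry semantics of such a sync worker can be sketched as below, combining an idempotency key, exponential backoff, and a dead-letter queue. This is an illustration under assumptions rather than the actual twelve hundred lines: `push` stands in for a HubSpot or Salesforce client call, `TransientSyncError` is a hypothetical exception type, and the in-memory `seen_keys` set and `dead_letters` list stand in for durable storage.

```python
import hashlib
import json
import time


class TransientSyncError(Exception):
    """Hypothetical error a push function raises for retryable failures."""


def idempotency_key(tenant_id: str, record: dict) -> str:
    """Stable key per tenant and payload, so a repeated push can be detected."""
    payload = json.dumps(record, sort_keys=True)
    return hashlib.sha256(f"{tenant_id}:{payload}".encode("utf-8")).hexdigest()


def sync_record(tenant_id, record, push, dead_letters, seen_keys,
                max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Push one record with exponential backoff; dead-letter it after the last attempt."""
    key = idempotency_key(tenant_id, record)
    if key in seen_keys:
        return "skipped"  # this exact payload was already synced for the tenant
    last_error = ""
    for attempt in range(max_attempts):
        try:
            push(tenant_id, record, key)
        except TransientSyncError as exc:
            last_error = str(exc)
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
        else:
            seen_keys.add(key)
            return "synced"
    dead_letters.append(
        {"tenant_id": tenant_id, "record": record, "key": key, "error": last_error}
    )
    return "dead_lettered"
```

Passing `sleep` as a parameter keeps the backoff schedule testable without real waiting, and scoping the key by tenant keeps one tenant's retries from masking another tenant's identical payload.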

Outcomes

What changed for the client

+40% lead accuracy

Measured as the percentage of contacts that resulted in a verified email open or reply within the first sequence step.

-70% manual prospecting time

Average daily minutes spent on CSV reconciliation and validation across a sample of sixty SDRs before and after rollout.

10+ sources unified

Apollo, Lusha, ZoomInfo, Cognism, LinkedIn Sales Navigator, Clearbit, Hunter, RocketReach, UpLead, and Snov, with a connector contract for the eleventh.

60s time to CRM-ready set

Median wall-clock time from pasting a target account list to a fully synced, scored contact set in the destination CRM.

In their words
We expected an integration project and got a sales-ops weapon. The attribution view paid for the build inside one renewal cycle.
Head of Revenue Operations, Series B SalesTech platform
Tech Stack

The tools behind the system

Built with a deliberate stack chosen for production reliability and operational velocity.

8 components, production-grade
Python, Node.js, React.js, FastAPI, AI/ML, Docker, PostgreSQL, AWS
What we’d carry forward

Lessons learned from the build

01 Lesson

Investing two full weeks in observing SDR workflows before writing any code paid back five times over during the build. The schema we shipped was visibly closer to how reps actually think about a contact than the one we would have written from vendor docs alone.

02 Lesson

We underestimated how much value the attribution dashboard would carry. It was a stretch goal in the original scope and became the single feature that closed the renewal. Next time we would build it first and let the pipeline serve it, rather than the other way around.

03 Lesson

Per-field confidence scoring was the right design but the wrong default. Tenants kept asking for visibility into why a record was rejected. We would expose the score and rejection reason in the UI from day one rather than treating it as an internal signal.

