AI Lead Generation Platform
The client was a fast-growing B2B sales platform serving mid-market revenue teams across North America. Their existing prospecting workflow leaned on a patchwork of standalone data vendors, each priced per seat and each delivering CSV exports that an SDR cleaned manually before importing into HubSpot. As the user base grew past four hundred seats, per-seat data spend was outpacing platform ARPU, and customer support tickets about stale or duplicated contacts were becoming the top driver of churn.
Generative AI Solutions
SalesTech / B2B SaaS
11 weeks from kickoff to first production cohort
5 specialists
The full story
The specific user pain was operational. SDRs were spending the first ninety minutes of every shift reconciling exports — copying titles between Apollo and Lusha, hand-checking email validity through a third tool, then chasing duplicates that crept in from LinkedIn scrape jobs. The data was three to six weeks old by the time a sequence went out. Sales leaders had no way to attribute conversion lift to a specific data source, so they could not negotiate pricing with their vendors.
We designed a single ingestion plane that pulled from ten verified sources behind one normalized contact schema, with a streaming deduplication engine keyed on company domain plus a fuzzy match on name and title. An enrichment step ran a confidence score per field and only wrote back to the CRM when the score cleared a tenant-configurable threshold. We layered an attribution model on top so the platform could show which source produced the highest reply rate per segment.
What shipped was a unified prospecting workspace where an SDR pastes a target account list and gets a CRM-ready contact set in under sixty seconds, with provenance and confidence visible per row. HubSpot and Salesforce sync runs continuously rather than nightly. The vendor selection tool recommends which sources to retain based on actual outbound performance, giving sales ops a defensible position in vendor renewal conversations.
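As a rough illustration of the attribution idea described above, the sketch below computes reply rate per source within each segment and picks the best-performing source. The `OutboundEvent` shape, the segment labels, and the function names are assumptions made for the example, not the platform's actual data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class OutboundEvent:
    """One sent sequence step, joined back to the source that supplied the contact."""
    source: str    # vendor that supplied the contact, e.g. "Apollo"
    segment: str   # hypothetical segment label, e.g. "fintech"
    replied: bool  # whether the prospect replied


def reply_rate_by_source(events):
    """Return {segment: {source: reply_rate}} from raw outbound events."""
    sent = defaultdict(lambda: defaultdict(int))
    replies = defaultdict(lambda: defaultdict(int))
    for event in events:
        sent[event.segment][event.source] += 1
        if event.replied:
            replies[event.segment][event.source] += 1
    return {
        segment: {src: replies[segment][src] / count for src, count in by_source.items()}
        for segment, by_source in sent.items()
    }


def best_source_per_segment(events):
    """Pick the source with the highest reply rate in each segment."""
    rates = reply_rate_by_source(events)
    return {segment: max(by_source, key=by_source.get) for segment, by_source in rates.items()}
```

A production version would also need a minimum sample size per source before recommending that a feed be retained or cut. That guard is left out here to keep the sketch short.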
Sales teams were paying for ten data tools and still importing stale, duplicated contacts into the CRM by hand each morning.
Ten separate data vendors with overlapping coverage, paid per seat, and no shared schema across the exports SDRs received daily.
Manual deduplication and email validation consumed roughly ninety minutes per rep per day before any outbound work began.
Contact records aged three to six weeks before a sequence reached the prospect, which suppressed reply rates and burned domains.
No attribution back to source meant sales ops could not negotiate vendor pricing or justify cutting the lowest-performing feeds.
CRM ingestion ran as a nightly cron, so updated titles and role changes did not surface until the following day at the earliest.
How we structured the engagement
We treated this as a data engineering problem first and a model problem second — clean inputs before clever inference.
- Phase 01, Weeks 1-2
Discovery
Sat with four SDR teams for a full week, instrumented the existing CSV workflow, and measured time-to-clean per record. Audited the ten vendor APIs for rate limits, field coverage, and licensing terms. Output: a normalized contact schema and a ranked source-priority table per field.
- Phase 02, Weeks 3-4
Architecture
Designed a streaming ingestion plane with one connector per vendor, a Kafka topic per source, and a single deduplication consumer keyed on domain plus a fuzzy name-and-title match. Picked PostgreSQL with the citext extension as the system of record and Redis for the in-flight match cache.
- Phase 03, Weeks 5-9
Build
Shipped the ten connectors in parallel pairs, then the deduplication engine, then the confidence scorer. Used a fine-tuned DistilBERT for title normalization and a heuristic ensemble for email validity. Wrote a tenant-scoped sync layer for HubSpot and Salesforce with backoff and dead-letter handling.
- Phase 04, Weeks 10-11
Launch
Rolled out to three design-partner accounts behind a feature flag, ran a four-week soak with daily accuracy audits, then promoted to general availability. Built the attribution dashboard during soak based on what design partners actually asked about during weekly review calls.
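The discovery output (a normalized contact schema plus a ranked source-priority table per field) can be pictured roughly as below. The `Contact` fields, the rankings in `SOURCE_PRIORITY`, and the helper names are illustrative assumptions. The real schema and rankings are not given in this write-up.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-field ranking, most trusted source first.
# The actual ranking produced during discovery is not published here.
SOURCE_PRIORITY = {
    "email": ["Hunter", "Apollo", "Lusha"],
    "title": ["LinkedIn Sales Navigator", "ZoomInfo", "Apollo"],
}


@dataclass
class Contact:
    """Assumed minimal normalized contact; the shipped schema has more fields."""
    domain: str
    full_name: str
    email: Optional[str] = None
    title: Optional[str] = None


def pick_field(field, candidates):
    """Return (value, source) from the highest-priority source that supplied a value.

    candidates maps source name to value (None or "" when the source has nothing).
    Sources missing from the ranking are tried last, in sorted name order.
    """
    ranked = SOURCE_PRIORITY.get(field, [])
    ordered = ranked + sorted(s for s in candidates if s not in ranked)
    for source in ordered:
        value = candidates.get(source)
        if value:
            return value, source
    return None, None


def merge(domain, full_name, per_field):
    """Build one Contact from per-field candidate values keyed by source."""
    email, _ = pick_field("email", per_field.get("email", {}))
    title, _ = pick_field("title", per_field.get("title", {}))
    return Contact(domain=domain, full_name=full_name, email=email, title=title)
```

Returning the winning source alongside the value is what makes per-row provenance possible later in the pipeline.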
What we built, component by component
- 01
Source connectors
Ten vendor-specific clients with per-source rate limiting, retry, and schema translation into the normalized contact format.
- 02
Ingestion stream
Kafka topic per source plus a fan-in topic that the deduplication consumer reads, providing replay and audit history.
- 03
Deduplication engine
Keyed on domain and a fuzzy name-plus-title match, holds an in-flight cache in Redis with a five-minute window.
- 04
Confidence scorer
Per-field score combining source priority, recency, and cross-source agreement, written alongside every contact record.
- 05
Contact store
PostgreSQL with citext columns, partitioned by tenant, with row-level security and per-tenant retention policies.
- 06
CRM sync layer
Tenant-scoped HubSpot and Salesforce workers with exponential backoff, idempotency keys, and a dead-letter queue.
- 07
Attribution service
Joins outbound activity from the CRM back to the source that supplied each contact, exposed via a sales-ops dashboard.
A user request triggers parallel pulls across the ten source connectors, each writing to its own Kafka topic. The deduplication consumer merges into a single contact, the confidence scorer annotates each field, and the contact store accepts the row only when the per-field threshold is met. The CRM sync layer then pushes the record outbound and the attribution service waits for downstream reply or meeting events to close the loop.
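The deduplication step in that flow can be sketched as follows: bucket by domain, then fuzzy-match name plus title against entries seen within the five-minute window. This is a minimal in-memory stand-in, not the Redis-backed engine itself. The 0.85 similarity cut-off, the use of `difflib.SequenceMatcher`, and the class name are assumptions.

```python
import time
from difflib import SequenceMatcher

WINDOW_SECONDS = 5 * 60  # five-minute in-flight window, as described above
MATCH_THRESHOLD = 0.85   # assumed similarity cut-off, not taken from the project


def _norm(text):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def similarity(a, b):
    return SequenceMatcher(None, _norm(a), _norm(b)).ratio()


class InFlightDeduper:
    """In-memory stand-in for the Redis-backed match cache."""

    def __init__(self, now=time.monotonic):
        self._now = now
        self._by_domain = {}  # domain -> list of (seen_at, "name title")

    def is_duplicate(self, domain, name, title):
        now = self._now()
        key = domain.lower()
        candidate = f"{name} {title}"
        # Keep only entries that are still inside the window.
        live = [(t, s) for t, s in self._by_domain.get(key, [])
                if now - t <= WINDOW_SECONDS]
        duplicate = any(similarity(candidate, s) >= MATCH_THRESHOLD for _, s in live)
        if not duplicate:
            live.append((now, candidate))
        self._by_domain[key] = live
        return duplicate
```

The clock is injectable so the window behaviour can be tested without sleeping. In Redis the same effect would typically come from key expiry rather than manual pruning.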
The trade-offs we made and why
Chose Kafka over a database trigger pipeline
Triggers would have coupled vendor latency to the write path and made replay during a vendor outage impossible. Kafka let us decouple ingestion from deduplication, replay a bad day cleanly, and add the eleventh vendor without touching existing consumers.
Used PostgreSQL with citext over Elasticsearch
Most queries were exact-match on domain or email, not full-text. Postgres gave us row-level security per tenant, transactional updates from the sync workers, and lower operational cost than a search cluster. We pushed fuzzy matching into a Redis-backed in-flight cache instead.
Fine-tuned DistilBERT for title normalization rather than calling GPT
Title strings are short and repetitive, and the latency budget per record was under fifty milliseconds. A six-megabyte fine-tune ran on CPU at the ingest node and removed a per-call cost that would have broken the per-record economics at scale.
Wrote a custom CRM sync layer instead of using a third-party iPaaS
iPaaS pricing per record would have eroded gross margin past five hundred customers, and we needed tenant-scoped retry semantics for compliance. The custom layer is roughly twelve hundred lines of Python and is the most boring, most stable part of the system.
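To make the retry semantics concrete, here is a minimal sketch of a push with exponential backoff, an idempotency key, and a dead-letter list. It is not the project's sync layer: `TransientCRMError`, the `push` callable, and the default attempt and delay values are hypothetical, and no real HubSpot or Salesforce client is called.

```python
import hashlib
import json
import random
import time


class TransientCRMError(Exception):
    """Hypothetical error type for retryable CRM failures such as rate limits."""


def idempotency_key(tenant_id, record):
    """Stable key so a retried push of the same record can be recognized downstream."""
    payload = json.dumps(record, sort_keys=True)
    return hashlib.sha256(f"{tenant_id}:{payload}".encode()).hexdigest()


def push_with_backoff(push, tenant_id, record, dead_letters,
                      max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call push(record, key), retrying transient failures with backoff and jitter.

    Returns True on success. After max_attempts failures the record is
    appended to dead_letters and False is returned.
    """
    key = idempotency_key(tenant_id, record)
    for attempt in range(max_attempts):
        try:
            push(record, key)
            return True
        except TransientCRMError:
            if attempt < max_attempts - 1:
                # Full jitter: wait between 0 and base_delay * 2^attempt seconds.
                sleep(random.uniform(0, base_delay * 2 ** attempt))
    dead_letters.append({"tenant": tenant_id, "key": key, "record": record})
    return False
```

Because the key is derived from the tenant and the record content, every retry of the same record carries the same key, which lets the receiving side discard repeats.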
What changed for the client
lead accuracy
Measured as the percentage of contacts that resulted in a verified email open or reply within the first sequence step.
manual prospecting time
Average daily minutes spent on CSV reconciliation and validation across a sample of sixty SDRs before and after rollout.
sources unified
Apollo, Lusha, ZoomInfo, Cognism, LinkedIn Sales Navigator, Clearbit, Hunter, RocketReach, UpLead, and Snov, with a connector contract for the eleventh.
time to CRM-ready set
Median wall-clock time from pasting a target account list to a fully synced, scored contact set in the destination CRM.
“We expected an integration project and got a sales-ops weapon. The attribution view paid for the build inside one renewal cycle.”
The tools behind the system
Built with a deliberate stack chosen for production reliability and operational velocity.
Lessons learned from the build
Investing two full weeks in observed SDR workflow before writing any code paid back five times during build. The schema we shipped was visibly closer to how reps actually think about a contact than the one we would have written from vendor docs alone.
We underestimated how much value the attribution dashboard would carry. It was a stretch goal in the original scope and became the single feature that closed the renewal. Next time we would build it first and let the pipeline serve it, rather than the other way around.
Per-field confidence scoring was the right design but the wrong default. Tenants kept asking for visibility into why a record was rejected. We would expose the score and rejection reason in the UI from day one rather than treating it as an internal signal.
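One way to surface the score and rejection reason together is sketched below. The three signals (source priority, recency, cross-source agreement) come from the system described earlier. The weights, the 90-day recency horizon, the `FieldEvidence` shape, and the reason wording are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed weights: the three signals are named in this write-up,
# but how they are combined is not.
WEIGHTS = {"priority": 0.4, "recency": 0.3, "agreement": 0.3}


@dataclass
class FieldEvidence:
    priority_rank: int      # 0 = most trusted source for this field
    total_sources: int      # number of sources ranked for this field
    age_days: float         # days since the value was last verified
    agreeing_sources: int   # sources reporting this same value
    reporting_sources: int  # sources reporting any value for the field


def field_score(ev, max_age_days=90.0):
    """Combine the three signals into a score between 0 and 1."""
    priority = max(0.0, 1.0 - ev.priority_rank / max(ev.total_sources - 1, 1))
    recency = max(0.0, 1.0 - ev.age_days / max_age_days)
    agreement = ev.agreeing_sources / max(ev.reporting_sources, 1)
    return (WEIGHTS["priority"] * priority
            + WEIGHTS["recency"] * recency
            + WEIGHTS["agreement"] * agreement)


def gate(evidence_by_field, threshold):
    """Return (accepted, scores, reasons); reasons name each field below threshold."""
    scores = {f: round(field_score(ev), 3) for f, ev in evidence_by_field.items()}
    reasons = [f"{f}: score {s} below threshold {threshold}"
               for f, s in scores.items() if s < threshold]
    return (not reasons, scores, reasons)
```

Returning the reasons alongside the accept or reject decision means the UI can show why a record was held back without a second lookup.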
