NLP & Knowledge Systems — Case Study

Legal AI Tax Assistant

The client was a tax preparation software vendor serving roughly four hundred thousand individual filers and a long tail of small accounting practices that white-labeled the engine. The competitive landscape had shifted as larger players began bundling AI tax help, and the client’s filers were complaining about long research cycles for non-routine questions — multi-state filers, gig-economy income, and post-divorce custody scenarios were the three highest-volume support categories.

-90% research time
<3s time to cited answer
60k+ pages curated
<1% unsourced answers
Category

NLP & Knowledge Systems

Industry

LegalTech / FinTech

Timeline

12 weeks from kickoff to full filer-base rollout

Team size

5 specialists

Project Overview

The full story

The practical problem was that authoritative tax answers lived across roughly sixty thousand pages of IRS publications, revenue rulings, and notices, plus a moving target of state-level guidance. Generic LLMs gave plausible but unsourced answers that filers could not defend if questioned. Tax professionals could give grounded answers, but an hour of their time cost more than the product itself, so routing every question to a human was economically unworkable.

We built a conversational assistant grounded in a curated corpus of authoritative tax sources, with every answer carrying citations back to the source paragraph and a confidence score. The system distinguished between general guidance and personalized scenarios — for personalized questions, it opened a private mode that could read the filer’s uploaded documents under a strict isolation boundary, with documents purged after the session unless explicitly saved.

What shipped was an in-product tax assistant that handled the routine multi-state, gig-income, and family-status questions with cited authoritative answers in under three seconds. For complex scenarios, the assistant prepared a structured summary for a human tax pro to review, which dropped per-question pro time by more than half. Filer-reported accuracy on filed returns improved, support ticket volume on tax-research questions dropped sharply, and the white-label accounting practices saw their per-return realization improve because their pros spent time on judgment rather than research.

The Problem

Authoritative tax answers lived across sixty thousand pages, and generic LLMs gave plausible answers filers could not defend.

01 Friction point

Multi-state, gig-income, and family-status questions were the three highest-volume support categories and grew quarterly.

02 Friction point

Generic LLM output lacked citations, so filers and pros would not act on it without re-verifying against the source documents.

03 Friction point

Tax professional time cost more per hour than the product price, breaking the unit economics of routing every question.

04 Friction point

IRS guidance updates rolled in continuously, so any answer cached more than a few weeks ago risked being stale at filing time.

05 Friction point

White-label accounting practices needed the same engine but with their own brand and pro-routing rules per practice.

Our Approach

How we structured the engagement

Curated the corpus carefully and made citation the contract — every answer ships with sources or the system declines to answer.

  1. Phase 01, Weeks 1-2

    Discovery

    Mapped the existing support volume by question type, audited the IRS publication catalog and state guidance sources, and met with the in-house tax pros on what made an answer defensible. Output: a curated source list of roughly sixty thousand pages, a citation requirement spec, and a private-mode isolation requirement.

  2. Phase 02, Weeks 3-4

    Architecture

    Designed a retrieval system using Pinecone for the public corpus and a separate per-session ephemeral index for private-mode documents. Built a chain that required at least one citation per claim or the system would refuse to answer. Picked LangChain for orchestration because the chain steps were stable and needed explicit logging.

  3. Phase 03, Weeks 5-10

    Build

    Shipped the public-corpus retrieval and citation-gated answering first. Built the private mode with session-scoped Pinecone namespaces and aggressive purge on session end. Implemented the white-label routing rules per practice and a structured-summary export for human pro hand-off on complex cases.

  4. Phase 04, Weeks 11-12

    Launch

    Rolled out to a fifty-thousand-filer cohort during filing season for four weeks, monitored citation precision and pro-handoff quality daily, and tuned the refusal threshold until plausible-but-unsourced answers fell below one percent. Promoted to the full filer base after the season’s quality bar held.
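
The refusal-threshold tuning described in the launch phase can be pictured as a sweep over a labeled evaluation set. The sketch below is illustrative only: `EvalSample`, `support_score`, the reviewer label, and the 20-point grid are assumptions introduced here, not the project's actual evaluation harness.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EvalSample:
    support_score: float  # hypothetical retrieval-support score in [0, 1]
    unsourced: bool       # reviewer judged the answer plausible but unsourced


def unsourced_share(samples: List[EvalSample], threshold: float) -> float:
    """Share of answered samples (score >= threshold) that were judged unsourced."""
    answered = [s for s in samples if s.support_score >= threshold]
    if not answered:
        # Nothing is answered at this threshold, so nothing unsourced ships.
        return 0.0
    return sum(s.unsourced for s in answered) / len(answered)


def tune_refusal_threshold(
    samples: List[EvalSample], target: float = 0.01, grid_points: int = 20
) -> Optional[float]:
    """Return the lowest grid threshold whose unsourced share is below target."""
    for i in range(grid_points + 1):
        threshold = i / grid_points
        if unsourced_share(samples, threshold) < target:
            return threshold
    return None  # no threshold on the grid meets the target
```

Picking the lowest passing threshold keeps the answer rate as high as the quality bar allows; a stricter threshold would refuse more questions without improving the gated metric.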

System Architecture

What we built, component by component

  1. 01

    Source curator

    Maintains the authoritative source list with version tracking, ingests updates weekly, and flags potentially stale answers.

  2. 02

    Public corpus index

    Pinecone vector store over the sixty-thousand-page IRS and state corpus with per-paragraph embeddings and citation metadata.

  3. 03

    Citation-gated chain

    LangChain orchestration that requires at least one citation per claim or refuses to answer, with claim-level logging.

  4. 04

    Private mode

    Per-session ephemeral Pinecone namespace for uploaded documents, purged on session end unless explicitly saved by the user.

  5. 05

    Pro hand-off

    Structured summary export for human tax pros covering question scope, retrieved sources, and the assistant’s reasoning.

  6. 06

    White-label router

    Per-practice branding and routing rules, including which questions go to the practice’s own pros versus the platform pool.
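
As one illustration of the source curator's stale-answer flagging, the sketch below compares the source versions an answer cited against the current versions in the curated list. The `CachedAnswer` shape, integer version numbers, and the rule that a withdrawn source also counts as stale are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CachedAnswer:
    answer_id: str
    # Maps source ID -> source version the answer was generated against.
    cited_versions: Dict[str, int]


def flag_stale(
    answers: List[CachedAnswer], current_versions: Dict[str, int]
) -> List[str]:
    """Return IDs of answers citing a source that was revised or withdrawn."""
    stale = []
    for answer in answers:
        for source_id, cited in answer.cited_versions.items():
            current = current_versions.get(source_id)
            # A missing source (withdrawn) or a newer version marks the answer stale.
            if current is None or current > cited:
                stale.append(answer.answer_id)
                break
    return stale
```

Running a check like this after each weekly ingest would surface answers to regenerate before a filer sees them.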

Data Flow

A filer submits a question and the public-corpus index produces candidate passages. The citation-gated chain composes an answer that must include at least one citation, or refuses. If the user enables private mode, their documents land in a session-scoped namespace that the retriever can also see. Complex scenarios export a structured summary to the pro hand-off surface.
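
The flow above can be sketched with a toy in-memory retriever. This is a simplified stand-in, not the production retrieval code: keyword overlap replaces vector similarity, and the `public` and `session:<id>` namespace names, `retrieve`, and `answer` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Passage = Tuple[str, str]  # (citation ID, paragraph text)


def score(question: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(question.lower().split()) & set(text.lower().split()))


def retrieve(
    question: str,
    namespaces: Dict[str, List[Passage]],
    session_id: Optional[str] = None,
    top_k: int = 3,
) -> List[Passage]:
    """Search the public namespace, plus one session namespace in private mode."""
    visible = list(namespaces.get("public", []))
    if session_id is not None:
        visible += namespaces.get(f"session:{session_id}", [])
    ranked = sorted(visible, key=lambda p: score(question, p[1]), reverse=True)
    return [p for p in ranked[:top_k] if score(question, p[1]) > 0]


def answer(
    question: str,
    namespaces: Dict[str, List[Passage]],
    session_id: Optional[str] = None,
) -> dict:
    """Answer with citations when passages are found, otherwise refuse."""
    passages = retrieve(question, namespaces, session_id)
    if not passages:
        return {"status": "refused", "citations": []}
    return {"status": "answered", "citations": [cid for cid, _ in passages]}
```

Because the session namespace is looked up by ID, a retriever call for one session never sees another session's uploads.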

Diagram components: Source curator, Public corpus index, Citation-gated chain, Private mode, Pro hand-off

Key Decisions

The trade-offs we made and why

Decision 01 (lead trade-off)

Required citation per claim or refused to answer

Plausible-but-unsourced answers were the failure mode generic LLMs already produced. Hard-gating the chain on citation per claim meant the system either gave defensible answers or said it did not know, which was the trust posture filers and pros actually needed.
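
A minimal sketch of a per-claim citation gate, assuming the chain has already split a draft answer into claims with attached citation IDs. The `Claim` type, the `gate` function, and the refusal wording are hypothetical; the case study does not show how the LangChain step is actually written.

```python
from dataclasses import dataclass, field
from typing import List

REFUSAL = "I can't answer that with a citable source."


@dataclass
class Claim:
    text: str
    citations: List[str] = field(default_factory=list)


def gate(claims: List[Claim]) -> dict:
    """Release the answer only if every claim carries at least one citation."""
    uncited = [c.text for c in claims if not c.citations]
    if not claims or uncited:
        # Refuse the whole answer; uncited claims are kept for claim-level logging.
        return {"answered": False, "text": REFUSAL, "uncited_claims": uncited}
    lines = [f"{c.text} [{', '.join(c.citations)}]" for c in claims]
    return {"answered": True, "text": "\n".join(lines), "uncited_claims": []}
```

Refusing the whole answer, rather than dropping only the uncited claims, avoids shipping a partial answer whose remaining claims read as complete.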

Decision 02

Built private mode with session-scoped ephemeral indexes

Filers needed personalized answers but did not trust uploads to a persistent store. Session-scoped Pinecone namespaces with hard purge gave the personalization without the storage commitment, which made the privacy story defensible in marketing and legal review.
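
The purge-on-session-end behavior can be sketched with a context manager. `InMemoryIndex` below is a stand-in for a namespaced vector store and deliberately does not use the Pinecone client; `PrivateSession`, `private_mode`, and the namespace naming scheme are assumptions for illustration.

```python
import uuid
from contextlib import contextmanager


class InMemoryIndex:
    """Stand-in for a vector store that partitions records by namespace."""

    def __init__(self):
        self._namespaces = {}

    def upsert(self, namespace, doc_id, text):
        self._namespaces.setdefault(namespace, {})[doc_id] = text

    def delete_namespace(self, namespace):
        self._namespaces.pop(namespace, None)

    def has_namespace(self, namespace):
        return namespace in self._namespaces


class PrivateSession:
    def __init__(self, index):
        self.index = index
        self.namespace = f"session-{uuid.uuid4().hex}"  # unique per session
        self.saved = False

    def upload(self, doc_id, text):
        self.index.upsert(self.namespace, doc_id, text)

    def save(self):
        """Record the filer's explicit opt-in to keep documents."""
        self.saved = True


@contextmanager
def private_mode(index):
    """Yield a session and purge its namespace on exit unless explicitly saved."""
    session = PrivateSession(index)
    try:
        yield session
    finally:
        # Runs on normal exit and on errors, so a failed session still purges.
        if not session.saved:
            index.delete_namespace(session.namespace)
```

Putting the purge in `finally` makes deletion the default path, with retention as the explicit exception.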

Decision 03

Used Pinecone over self-hosted pgvector

The end-to-end latency budget of under three seconds, at this corpus size, pushed us toward a managed vector store with proven sub-second retrieval at scale. Giving up operational ownership was worth the latency guarantee on a customer-facing interactive product.

Decision 04

Shipped pro hand-off as a structured summary, not a transcript

Raw transcripts buried the question for pros. A structured summary with scope, retrieved sources, and the assistant’s chain of reasoning let a pro pick up the case in under a minute, which is what made the hybrid model economically work for white-label practices.
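
One way to picture the structured summary is as a small typed record with the three parts named above: scope, retrieved sources, and reasoning. The field names and both export formats are assumptions for this sketch, not the shipped schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class HandoffSummary:
    question_scope: str
    retrieved_sources: List[str]
    reasoning_steps: List[str]

    def to_json(self) -> str:
        """Machine-readable export for the pro hand-off surface."""
        return json.dumps(asdict(self), indent=2)

    def to_brief(self) -> str:
        """Plain-text brief a pro can scan quickly."""
        lines = [f"Scope: {self.question_scope}", "Sources:"]
        lines += [f"  - {s}" for s in self.retrieved_sources]
        lines.append("Reasoning:")
        lines += [f"  {i}. {step}" for i, step in enumerate(self.reasoning_steps, 1)]
        return "\n".join(lines)
```

Keeping scope first mirrors the stated goal: the pro sees what is being asked before any sources or reasoning.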

Outcomes

What changed for the client

-90% research time

Median time from question to defensible answer with sources across a sample of three hundred filer interactions.

<3s time to cited answer

P95 latency from question submission to fully cited answer in the in-product assistant during filing-season load.

60k+ pages curated

Authoritative source pages indexed with paragraph-level granularity, plus weekly updates as IRS guidance changes.

<1% unsourced answers

Share of generated answers that lacked at least one citation after refusal-threshold tuning, used as a quality cutover gate.
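
For readers who want to reproduce the latency figures on their own logs, the median and P95 can be computed with a nearest-rank percentile. This is a generic sketch; the function name and the choice of the nearest-rank method are assumptions, since the case study does not say how its percentiles were calculated.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_latency_budget(latencies_ms: Sequence[float], budget_ms: float = 3000) -> bool:
    """True when P95 latency is under the budget (three seconds by default)."""
    return percentile_nearest_rank(latencies_ms, 95) < budget_ms
```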

In their words
The assistant either cites the source or refuses, which is exactly the bar our pros set for their own answers. The structured hand-off changed how our practices price returns.
Head of Product, tax preparation software vendor

Tech Stack

The tools behind the system

Built with a deliberate stack chosen for production reliability and operational velocity.

5 components, production-grade
Python, FastAPI, LangChain, Pinecone, React.js

What we’d carry forward

Lessons learned from the build

01 Lesson

The refusal posture was the trust posture. Forcing the chain to refuse rather than guess produced fewer answers but every answer was defensible, and that asymmetry built filer confidence faster than a higher answer rate with occasional misses would have.

02 Lesson

Session-scoped private indexes solved the privacy story cleanly. We almost shipped a persistent document store with delete-on-request semantics, and the session-scoped approach turned out to be both simpler operationally and more reassuring to filers in user testing.

03 Lesson

Pro hand-off as a structured summary was the white-label unlock. Without it, the human-in-the-loop economics did not work for the smaller accounting practices. We would design the hand-off surface alongside the retrieval system next time, not as a follow-on phase.
