Legal AI Tax Assistant
The client was a tax preparation software vendor serving roughly four hundred thousand individual filers and a long tail of small accounting practices that white-labeled the engine. The competitive landscape had shifted as larger players began bundling AI tax help, and the client’s filers were complaining about long research cycles for non-routine questions. Multi-state filing, gig-economy income, and post-divorce custody scenarios were the three highest-volume support categories.
NLP & Knowledge Systems
LegalTech / FinTech
12 weeks from kickoff to full filer-base rollout
5 specialists
The full story
The practical problem was that authoritative tax answers lived across roughly sixty thousand pages of IRS publications, revenue rulings, and notices, plus a moving target of state-level guidance. Generic LLMs gave plausible but unsourced answers that filers could not defend if questioned. Tax professionals could give grounded answers, but an hour of their time cost more than the product itself, which made routing every question to a human economically unworkable.
We built a conversational assistant grounded in a curated corpus of authoritative tax sources, with every answer carrying citations back to the source paragraph and a confidence score. The system distinguished between general guidance and personalized scenarios — for personalized questions, it opened a private mode that could read the filer’s uploaded documents under a strict isolation boundary, with documents purged after the session unless explicitly saved.
What shipped was an in-product tax assistant that handled the routine multi-state, gig-income, and family-status questions with cited authoritative answers in under three seconds. For complex scenarios, the assistant prepared a structured summary for a human tax pro to review, which dropped per-question pro time by more than half. Filer-reported accuracy on filed returns improved, support ticket volume on tax-research questions dropped sharply, and the white-label accounting practices saw their per-return realization improve because their pros spent time on judgment rather than research.
Authoritative tax answers lived across sixty thousand pages, and generic LLMs gave plausible answers filers could not defend.
Multi-state, gig-income, and family-status questions were the three highest-volume support categories and grew quarterly.
Generic LLM output lacked citations, so filers and pros would not act on it without re-verifying against the source documents.
Tax professional time cost more per hour than the product price, breaking the unit economics of routing every question.
IRS guidance updates rolled in continuously, so any answer cached more than a few weeks ago risked being stale at filing time.
White-label accounting practices needed the same engine but with their own brand and pro-routing rules per practice.
How we structured the engagement
We curated the corpus carefully and made citation the contract: every answer ships with sources, or the system declines to answer.
- Phase 01 (Weeks 1-2)
Discovery
Mapped the existing support volume by question type, audited the IRS publication catalog and state guidance sources, and met with the in-house tax pros on what made an answer defensible. Output: a curated source list of roughly sixty thousand pages, a citation requirement spec, and a private-mode isolation requirement.
- Phase 02 (Weeks 3-4)
Architecture
Designed a retrieval system using Pinecone for the public corpus and a separate per-session ephemeral index for private-mode documents. Built a chain that required at least one citation per claim or the system would refuse to answer. Picked LangChain for orchestration because the chain steps were stable and needed explicit logging.
- Phase 03 (Weeks 5-10)
Build
Shipped the public-corpus retrieval and citation-gated answering first. Built the private mode with session-scoped Pinecone namespaces and aggressive purge on session end. Implemented the white-label routing rules per practice and a structured-summary export for human pro hand-off on complex cases.
- Phase 04 (Weeks 11-12)
Launch
Rolled out to a fifty-thousand-filer cohort during filing season for four weeks, monitored citation precision and pro-handoff quality daily, and tuned the refusal threshold until plausible-but-unsourced answers fell below one percent. Promoted to the full filer base after the season’s quality bar held.
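The refusal-threshold tuning from the Launch phase can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's tuning code: it assumes each logged answer carries a numeric support score and a reviewer label marking it as unsourced, and the names `LoggedAnswer`, `support_score`, `unsourced_share`, and `pick_refusal_threshold` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


@dataclass(frozen=True)
class LoggedAnswer:
    support_score: float  # hypothetical: how strongly retrieved passages back the answer
    unsourced: bool       # hypothetical reviewer label: the answer lacked a valid citation


def unsourced_share(answers: Sequence[LoggedAnswer], threshold: float) -> Optional[float]:
    """Share of answers that would ship at this threshold and are unsourced."""
    shipped = [a for a in answers if a.support_score >= threshold]
    if not shipped:
        return None  # everything would be refused, so there is no rate to report
    return sum(a.unsourced for a in shipped) / len(shipped)


def pick_refusal_threshold(
    answers: Sequence[LoggedAnswer],
    candidates: Iterable[float],
    target: float = 0.01,  # the one percent bar described in the text
) -> Optional[float]:
    """Return the lowest candidate threshold whose unsourced share is below target."""
    for threshold in sorted(candidates):
        share = unsourced_share(answers, threshold)
        if share is not None and share < target:
            return threshold
    return None  # no candidate met the bar
```

Picking the lowest passing threshold keeps the answer rate as high as the quality bar allows, which is one reasonable way to balance refusals against coverage.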
What we built, component by component
- 01
Source curator
Maintains the authoritative source list with version tracking, ingests updates weekly, and flags potentially stale answers.
- 02
Public corpus index
Pinecone vector store over the sixty-thousand-page IRS and state corpus with per-paragraph embeddings and citation metadata.
- 03
Citation-gated chain
LangChain orchestration that requires at least one citation per claim or refuses to answer, with claim-level logging.
- 04
Private mode
Per-session ephemeral Pinecone namespace for uploaded documents, purged on session end unless explicitly saved by the user.
- 05
Pro hand-off
Structured summary export for human tax pros covering question scope, retrieved sources, and the assistant’s reasoning.
- 06
White-label router
Per-practice branding and routing rules, including which questions go to the practice’s own pros versus the platform pool.
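One way the source curator's stale-answer flagging could work is to record which source version each answer cited and compare it against the catalog after every ingest. The sketch below assumes integer version numbers and uses hypothetical names (`SourceCatalog`, `CachedAnswer`, `stale_answers`); the source does not describe the actual mechanism.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class SourceCatalog:
    """Tracks the newest known version of each curated source (IDs are illustrative)."""
    versions: Dict[str, int] = field(default_factory=dict)

    def ingest(self, source_id: str, version: int) -> None:
        # Keep only the newest version seen for each source.
        current = self.versions.get(source_id, 0)
        self.versions[source_id] = max(current, version)


@dataclass
class CachedAnswer:
    answer_id: str
    cited_versions: Dict[str, int]  # source_id -> version that the answer cited


def stale_answers(catalog: SourceCatalog, answers: Iterable[CachedAnswer]) -> List[str]:
    """Return IDs of answers that cite a source version older than the catalog's."""
    stale = []
    for answer in answers:
        for source_id, cited in answer.cited_versions.items():
            # An unknown source is treated as unchanged rather than stale.
            if catalog.versions.get(source_id, cited) > cited:
                stale.append(answer.answer_id)
                break
    return stale
```

Flagged answers would then be regenerated or reviewed rather than served from cache.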
A filer submits a question and the public-corpus index produces candidate passages. The citation-gated chain composes an answer that must include at least one citation, or refuses. If the user enables private mode, their documents land in a session-scoped namespace that the retriever can also see. Complex scenarios export a structured summary to the pro hand-off surface.
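The citation gate in this flow can be illustrated with a minimal plain-Python sketch. It is not the LangChain chain from the project: the types `Passage`, `Claim`, and `GatedAnswer`, the `gate` function, and the refusal wording are all hypothetical, and it assumes the answer has already been split into claims that each name the passages they rely on.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

REFUSAL = "I can't answer that from the sources I have."  # hypothetical wording


@dataclass(frozen=True)
class Passage:
    passage_id: str
    text: str


@dataclass(frozen=True)
class Claim:
    text: str
    citations: Tuple[str, ...]  # passage IDs the claim relies on


@dataclass(frozen=True)
class GatedAnswer:
    refused: bool
    text: str
    citations: Tuple[str, ...]


def gate(claims: Sequence[Claim], retrieved: Sequence[Passage]) -> GatedAnswer:
    """Ship the answer only if every claim cites at least one retrieved passage."""
    known = {p.passage_id for p in retrieved}
    if not claims:
        return GatedAnswer(True, REFUSAL, ())
    for claim in claims:
        # A citation counts only if it points at a passage that was actually retrieved.
        if not any(c in known for c in claim.citations):
            return GatedAnswer(True, REFUSAL, ())
    cited = sorted({c for claim in claims for c in claim.citations if c in known})
    return GatedAnswer(False, " ".join(c.text for c in claims), tuple(cited))
```

Checking citations against the retrieved set, rather than trusting whatever IDs the model emits, is what keeps a fabricated citation from passing the gate.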
The trade-offs we made and why
Required citation per claim or refused to answer
Plausible-but-unsourced answers were the failure mode generic LLMs already produced. Hard-gating the chain on citation per claim meant the system either gave defensible answers or said it did not know, which was the trust posture filers and pros actually needed.
Built private mode with session-scoped ephemeral indexes
Filers needed personalized answers but did not trust uploads to a persistent store. Session-scoped Pinecone namespaces with hard purge gave the personalization without the storage commitment, which made the privacy story defensible in marketing and legal review.
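The session-scoped purge can be sketched with a context manager. The store below is an in-memory stand-in, not the Pinecone client, and every name in it (`InMemoryVectorStore`, `PrivateSession`, `private_session`) is hypothetical; the point is only to show the namespace being deleted on session end unless the filer explicitly saved it.

```python
import uuid
from contextlib import contextmanager
from typing import Dict, Iterator, List


class InMemoryVectorStore:
    """Stand-in for a namespaced vector store; this is not the Pinecone API."""

    def __init__(self) -> None:
        self._namespaces: Dict[str, List[str]] = {}

    def upsert(self, namespace: str, document: str) -> None:
        self._namespaces.setdefault(namespace, []).append(document)

    def documents(self, namespace: str) -> List[str]:
        return list(self._namespaces.get(namespace, []))

    def delete_namespace(self, namespace: str) -> None:
        self._namespaces.pop(namespace, None)

    def namespaces(self) -> List[str]:
        return sorted(self._namespaces)


class PrivateSession:
    def __init__(self, store: InMemoryVectorStore, namespace: str) -> None:
        self.store = store
        self.namespace = namespace
        self.keep = False  # flipped only by an explicit save

    def upload(self, document: str) -> None:
        self.store.upsert(self.namespace, document)

    def save(self) -> None:
        self.keep = True


@contextmanager
def private_session(store: InMemoryVectorStore) -> Iterator[PrivateSession]:
    """Give each session its own namespace and purge it on exit, even after errors."""
    session = PrivateSession(store, f"session-{uuid.uuid4().hex}")
    try:
        yield session
    finally:
        if not session.keep:
            store.delete_namespace(session.namespace)
```

Putting the purge in a `finally` block means a crashed or abandoned session is cleaned up the same way as a normal one.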
Used Pinecone over self-hosted pgvector
The under-three-second latency budget at this corpus size pushed us toward a managed vector store with proven sub-second retrieval at scale. Giving up some operational ownership was worth the latency guarantee on a customer-facing interactive product.
Shipped pro hand-off as a structured summary, not a transcript
Raw transcripts buried the question for pros. A structured summary with scope, retrieved sources, and the assistant’s chain of reasoning let a pro pick up the case in under a minute, which is what made the hybrid model work economically for white-label practices.
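A structured hand-off of this kind might look like the following sketch. The three parts (scope, retrieved sources, reasoning) come from the description above; the field names, the `HandoffSummary` and `RetrievedSource` types, and the JSON export format are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class RetrievedSource:
    citation: str  # hypothetical: human-readable pointer to the source paragraph
    excerpt: str   # the passage text shown to the pro


@dataclass
class HandoffSummary:
    question: str
    scope: str  # e.g. which question category the case falls under
    sources: List[RetrievedSource] = field(default_factory=list)
    reasoning: List[str] = field(default_factory=list)  # ordered reasoning steps

    def to_json(self) -> str:
        """Serialize the summary for the pro hand-off surface."""
        return json.dumps(asdict(self), indent=2)
```

Keeping the reasoning as an ordered list of short steps, rather than free text, lets a pro scan it and spot the step they disagree with.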
What changed for the client
research time
Median time from question to defensible answer with sources across a sample of three hundred filer interactions.
time to cited answer
P95 latency from question submission to fully cited answer in the in-product assistant during filing-season load.
pages curated
Authoritative source pages indexed with paragraph-level granularity, plus weekly updates as IRS guidance changes.
unsourced answers
Share of generated answers that lacked at least one citation after refusal-threshold tuning, used as a quality cutover gate.
“The assistant either cites the source or refuses, which is exactly the bar our pros set for their own answers. The structured hand-off changed how our practices price returns.”
The tools behind the system
Built with a deliberate stack chosen for production reliability and operational velocity.
Lessons learned from the build
The refusal posture was the trust posture. Forcing the chain to refuse rather than guess produced fewer answers, but every answer was defensible, and that asymmetry built filer confidence faster than a higher answer rate with occasional misses would have.
Session-scoped private indexes solved the privacy story cleanly. We almost shipped a persistent document store with delete-on-request semantics, and the session-scoped approach turned out to be both simpler operationally and more reassuring to filers in user testing.
Pro hand-off as a structured summary was the white-label unlock. Without it, the human-in-the-loop economics did not work for the smaller accounting practices. We would design the hand-off surface alongside the retrieval system next time, not as a follow-on phase.
