AI-Powered RAG-App for Document Querying
The client was a mid-sized corporate legal team supporting roughly two hundred deals per year, where every deal involved reviewing fifty to two hundred pages of contracts, schedules, and supplements. Junior associates spent the majority of their billable hours locating specific clauses across long documents — change-of-control provisions, indemnification limits, assignment restrictions — and the partners reviewed extracted text rather than the underlying document, which created defensibility risk on close.
NLP & Knowledge Systems
LegalTech, Enterprise Knowledge Systems
11 weeks from kickoff to full team rollout
4 specialists
The full story
The practical problem was that PDF search produced page-level results when the team needed clause-level precision. Existing legal AI tools were either licensed per seat at a price that was unworkable for the team’s size, or returned answers without citations back to the source, which the partners would not sign off on. The team had tried generic ChatGPT for clause extraction and dropped it within weeks because grounding was inconsistent and the privacy posture was unacceptable for client documents.
We built a secure document-querying platform that ran entirely in the client’s tenancy, used a retrieval architecture specifically tuned for legal document structure, and produced answers with clause-level citations linked back to the exact paragraph and page in the source PDF. The retrieval layer indexed at multiple granularities — document, section, clause — so the system could route a query to the right level instead of always returning paragraph chunks.
What shipped was a workspace where an associate uploads a contract bundle, asks a natural-language question, and gets an answer with every cited passage rendered inline with a one-click jump to the source PDF location. Partners review the answer alongside the cited passages in the same view, which restored the defensibility the team needed. Review cycle time on a typical deal dropped substantially and the team took on new matters without expanding headcount.
Junior associates spent most billable hours locating clauses, and partners reviewed extracted text without seeing the source.
Page-level PDF search returned ten pages of hits for a clause-level question, forcing manual reading at scale.
Per-seat legal AI tools were priced for large firms and were not economically defensible at a two-hundred-deal annual volume.
Generic LLM tools returned ungrounded answers, which partners would not approve for inclusion in deal materials.
Privacy posture on shared cloud tools made client documents off-limits, so junior associates could not use them at all.
Without inline citation, partners spent their review time re-finding clauses the system had already located, doubling work.
How we structured the engagement
Indexed at three granularities — document, section, clause — so retrieval matched the question’s actual scope.
- Phase 01, Weeks 1-2
Discovery
Reviewed ten recently closed deal files to build a taxonomy of the questions junior associates actually asked. Audited PDF structures across five document families to plan section detection. Output: a question taxonomy, a section-detection spec, and a privacy requirement to keep all inference inside the client tenancy.
- Phase 02, Weeks 3-4
Architecture
Designed a three-granularity retrieval system using pgvector for embedding storage and similarity search and BM25 for keyword fallback, with a router that picked granularity based on question type. Deployed entirely inside the client’s AWS tenancy with no data leaving the account. Picked an open-weights LLM for inference for the same reason.
- Phase 03, Weeks 5-9
Build
Shipped section detection first because every downstream index depended on clean boundaries. Built the multi-granularity indexer and the routing layer. Implemented inline citation rendering with one-click jump to the PDF location, which required preserving page-and-coordinate metadata through the entire pipeline.
- Phase 04, Weeks 10-11
Launch
Rolled out to two senior partners and four associates for three weeks of structured testing on closed deals, measuring citation precision and partner trust. Tuned the granularity router based on misroutes observed in live use. Promoted to the full team after partner sign-off.
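The case study does not define how citation precision was measured during testing. One plausible reading, sketched below under that assumption, is the share of cited passages that a reviewing partner marked as actually supporting the answer. The `Citation` type and `citation_precision` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Citation:
    """A passage the system cited (hypothetical shape, not the project's schema)."""
    doc_id: str
    page: int
    clause_id: str


def citation_precision(cited: Iterable[Citation], supported: Set[Citation]) -> float:
    """Fraction of cited passages that a reviewer marked as supporting the answer.

    Returns 0.0 when nothing was cited, so an uncited answer never scores well.
    """
    cited = list(cited)
    if not cited:
        return 0.0
    hits = sum(1 for c in cited if c in supported)
    return hits / len(cited)


if __name__ == "__main__":
    a = Citation("msa.pdf", 12, "8.2")
    b = Citation("msa.pdf", 19, "11.4")
    print(citation_precision([a, b], {a}))
```

Tracking this per question type would also show where the router sends queries to the wrong granularity.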
What we built, component by component
- 01
Document ingester
Parses PDFs with structure detection, preserves page-and-coordinate metadata, and emits document, section, and clause units.
- 02
Multi-granularity index
pgvector embeddings at three granularities plus a BM25 keyword index, all scoped per matter for tenant isolation.
- 03
Retrieval router
Classifies each question by scope and picks the right granularity to search, with fallback when confidence is low.
- 04
In-tenancy LLM
Open-weights model deployed on a private endpoint inside the client’s AWS tenancy with no external data egress.
- 05
Citation renderer
Renders cited passages inline with the answer and provides one-click jump to the page and coordinate in the source PDF.
- 06
Audit log
Append-only log of every query, retrieval result, and answer with operator identity for defensibility on close.
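The audit log component above could be approximated with a JSON Lines file that is only ever opened in append mode. This is a minimal sketch, not the project's implementation: the field names and the `append_audit_record` and `read_audit_log` helpers are assumptions, and a production system would also need storage-level controls to prevent edits.

```python
import json
from datetime import datetime, timezone
from typing import Iterable, List


def append_audit_record(log_path: str, operator: str, query: str,
                        retrieved_ids: Iterable[str], answer: str) -> dict:
    """Append one query transaction to the log as a single JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "operator": operator,
        "query": query,
        "retrieved_ids": list(retrieved_ids),
        "answer": answer,
    }
    # Mode "a" writes at the end of the file; existing lines are not rewritten.
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def read_audit_log(log_path: str) -> List[dict]:
    """Read the log back in write order, skipping blank lines."""
    with open(log_path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Recording the retrieved unit identifiers alongside the answer is what lets a reviewer later reconstruct which passages the model saw.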
A document bundle is ingested into the three-granularity index. A user query goes through the retrieval router, which picks a granularity and runs both vector and keyword retrieval. The retrieved passages are passed to the in-tenancy LLM with strict citation requirements. The citation renderer then presents the answer with inline source links, and the audit log records the full transaction.
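The routing step in this flow can be illustrated with a small sketch. The case study does not say how the router classifies questions, so the cue-word scoring below is a stand-in assumption chosen for readability; the cue lists, the `route` function, and the `min_confidence` threshold are all hypothetical. What it does mirror from the description is the fallback: when confidence is low, every granularity is searched.

```python
from enum import Enum
from typing import List


class Granularity(Enum):
    DOCUMENT = "document"
    SECTION = "section"
    CLAUSE = "clause"


# Hypothetical cue words per scope; a real router might use a trained classifier.
CUES = {
    Granularity.CLAUSE: ("clause", "provision", "indemnif", "assign", "change of control"),
    Granularity.SECTION: ("section", "schedule", "article", "exhibit"),
    Granularity.DOCUMENT: ("summarize", "summary", "overall", "parties"),
}


def route(question: str, min_confidence: float = 0.6) -> List[Granularity]:
    """Pick the granularity to search, or all of them when confidence is low."""
    text = question.lower()
    scores = {g: sum(cue in text for cue in cues) for g, cues in CUES.items()}
    total = sum(scores.values())
    if total == 0:
        # No signal at all: search every level.
        return list(Granularity)
    best = max(scores, key=scores.get)
    confidence = scores[best] / total
    if confidence < min_confidence:
        # Mixed signals: fall back to searching every level.
        return list(Granularity)
    return [best]


if __name__ == "__main__":
    print(route("What is the indemnification limit?"))
    print(route("Summarize the assignment clause in this schedule"))
```

The first question routes to the clause index alone, while the second mixes cues from all three scopes and falls back to searching every level.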
The trade-offs we made and why
Indexed at multiple granularities, not just paragraph chunks
A single chunk size either lost section context or buried clause-level detail. Indexing at document, section, and clause levels let the router match retrieval to question scope, which improved precision substantially over a single-granularity baseline.
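To make the three levels concrete, the sketch below turns one parsed contract into document, section, and clause units that keep a pointer to their parent. It assumes section detection has already produced a mapping of section titles to clause texts; the `Unit` type, the identifier scheme, and `build_units` are illustrative assumptions rather than the project's schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Unit:
    unit_id: str
    granularity: str          # "document", "section", or "clause"
    text: str
    parent_id: Optional[str]  # None for the document-level unit


def build_units(doc_id: str, sections: Dict[str, List[str]]) -> List[Unit]:
    """Emit one document unit, one unit per section, and one unit per clause."""
    units: List[Unit] = []
    section_texts: List[str] = []
    for s_idx, (title, clauses) in enumerate(sections.items(), start=1):
        section_id = f"{doc_id}/s{s_idx}"
        section_text = title + "\n" + "\n".join(clauses)
        section_texts.append(section_text)
        units.append(Unit(section_id, "section", section_text, doc_id))
        for c_idx, clause in enumerate(clauses, start=1):
            units.append(Unit(f"{section_id}/c{c_idx}", "clause", clause, section_id))
    # The document unit goes first so callers can find it at index 0.
    units.insert(0, Unit(doc_id, "document", "\n\n".join(section_texts), None))
    return units


if __name__ == "__main__":
    parsed = {
        "Assignment": ["No assignment without consent.", "Affiliates are excepted."],
        "Indemnity": ["Liability is capped at fees paid."],
    }
    for unit in build_units("msa", parsed):
        print(unit.granularity, unit.unit_id, unit.parent_id)
```

Each unit would then be embedded and stored separately, and the parent pointers let a clause-level hit be shown with its surrounding section for context.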
Deployed entirely inside the client tenancy
Privacy posture was a precondition, not a preference. Running the LLM endpoint and the index inside the client’s own AWS account removed the external-data-exposure objection that had killed prior tool evaluations and made adoption viable.
Preserved page-and-coordinate metadata through ingestion
The one-click jump to the source PDF was the feature that earned partner trust. Preserving coordinate metadata through embedding and retrieval required careful ingestion design but was non-negotiable for the citation experience to work.
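One way to keep location metadata attached is to store it on every indexed unit and copy it into the citation at render time. The sketch below assumes a 1-based page number, a bounding box in PDF points, and a viewer that honours the `#page=` URL fragment; the `SourceSpan`, `IndexedUnit`, and `to_citation` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SourceSpan:
    pdf_path: str
    page: int                                # 1-based page number (assumption)
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


@dataclass(frozen=True)
class IndexedUnit:
    unit_id: str
    text: str
    span: SourceSpan  # travels with the text through indexing and retrieval


def to_citation(unit: IndexedUnit, marker: int) -> dict:
    """Build the data a renderer needs for an inline citation and its jump link."""
    return {
        "marker": f"[{marker}]",
        "quote": unit.text,
        # Many PDF viewers open at the given page for a "#page=N" fragment.
        "href": f"{unit.span.pdf_path}#page={unit.span.page}",
        # The bounding box lets a viewer highlight the exact passage.
        "bbox": unit.span.bbox,
    }


if __name__ == "__main__":
    unit = IndexedUnit(
        "msa/s1/c1",
        "No assignment without consent.",
        SourceSpan("contracts/msa.pdf", 12, (72.0, 540.0, 523.0, 566.0)),
    )
    print(to_citation(unit, 1)["href"])
```

Because the span is part of the unit rather than a side table, it cannot be lost when the text is embedded, ranked, or passed to the model.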
Picked open-weights over a frontier model API
API-served models would have required data egress and per-query billing that did not fit the use pattern. An open-weights model on a private endpoint cost less at this volume, kept data in the tenancy, and was sufficient for the retrieval-grounded answer quality.
What changed for the client
review time
Median time-to-clause-extraction per deal across a thirty-deal sample before and after the rollout to the legal team.
index granularities
Document, section, and clause levels indexed in parallel with a router that selects scope per question.
answers cited
Every answer ships with inline citations linking back to the exact page and paragraph in the source PDF.
data isolation
Full pipeline including LLM endpoint runs inside the client AWS account with no external data egress.
Lessons learned from the build
Multi-granularity indexing was the single largest precision win. We tried a single chunk size first because it was simpler, and watched recall and precision both suffer on clause-level questions. The three-granularity index closed the gap and we would design it that way from day one in future projects.
Inline citation is what bought partner trust, not answer quality on its own. The same answer without one-click citation would have failed the same partners who later approved the rollout. We would prioritize the citation experience equally with retrieval quality in scoping.
Open-weights inside the tenancy was a strategic call that paid off twice. The privacy story unlocked adoption and the per-query cost stayed flat as usage grew, which would not have been true on an API-served frontier model at this query volume.
