NLP & Knowledge Systems — Case Study

AI-Powered RAG-App for Document Querying

The client was a mid-sized corporate legal team supporting roughly two hundred deals per year, where every deal involved reviewing fifty to two hundred pages of contracts, schedules, and supplements. Junior associates spent the majority of their billable hours locating specific clauses across long documents — change-of-control provisions, indemnification limits, assignment restrictions — and the partners reviewed extracted text rather than the underlying document, which created defensibility risk on close.

-75% review time
3 index granularities
100% answers cited
In-tenancy data isolation
Category

NLP & Knowledge Systems

Industry

LegalTech, Enterprise Knowledge Systems

Timeline

11 weeks from kickoff to full team rollout

Team size

4 specialists

Project Overview

The full story

The practical problem was that PDF search produced page-level results when the team needed clause-level precision. Existing legal AI tools either licensed per-seat at an unworkable price for the team’s size, or returned answers without citation back to source, which the partners would not sign off on. The team had tried generic ChatGPT for clause extraction and dropped it within weeks because grounding was inconsistent and the privacy posture was unacceptable for client documents.

We built a secure document-querying platform that ran entirely in the client’s tenancy, used a retrieval architecture specifically tuned for legal document structure, and produced answers with clause-level citations linked back to the exact paragraph and page in the source PDF. The retrieval layer indexed at multiple granularities — document, section, clause — so the system could route a query to the right level instead of always returning paragraph chunks.

What shipped was a workspace where an associate uploads a contract bundle, asks a natural-language question, and gets an answer with every cited passage rendered inline with a one-click jump to the source PDF location. Partners review the answer alongside the cited passages in the same view, which restored the defensibility the team needed. Median time-to-clause-extraction per deal fell by 75% across a thirty-deal sample, and the team took on new matters without expanding headcount.

The Problem

Junior associates spent most of their billable hours locating clauses, and partners reviewed extracted text without seeing the source.

01 Friction point

Page-level PDF search returned ten pages of hits for a clause-level question, forcing manual reading at scale.

02 Friction point

Per-seat legal AI tools were priced for large firms and were not economically defensible at a two-hundred-deal annual volume.

03 Friction point

Generic LLM tools returned ungrounded answers, which partners would not approve for inclusion in deal materials.

04 Friction point

Privacy posture on shared cloud tools made client documents off-limits, so junior associates could not use them at all.

05 Friction point

Without inline citation, partners spent their review time re-finding clauses the system had already located, doubling work.

Our Approach

How we structured the engagement

Indexed at three granularities — document, section, clause — so retrieval matched the question’s actual scope.

  1. Phase 01: Weeks 1-2

    Discovery

    Reviewed ten recently closed deal files to build a taxonomy of the questions junior associates actually asked. Audited PDF structures across five document families to plan section detection. Output: a question taxonomy, a section-detection spec, and a privacy requirement to keep all inference inside the client tenancy.

  2. Phase 02: Weeks 3-4

    Architecture

    Designed a three-granularity retrieval system using pgvector for embeddings and BM25 for keyword fallback, with a router that picked granularity based on question type. Deployed entirely inside the client’s AWS tenancy with no data leaving the account. Picked an open-weights LLM for inference for the same reason.

  3. Phase 03: Weeks 5-9

    Build

    Shipped section detection first because every downstream index depended on clean boundaries. Built the multi-granularity indexer and the routing layer. Implemented inline citation rendering with one-click jump to the PDF location, which required preserving page-and-coordinate metadata through the entire pipeline.

  4. Phase 04: Weeks 10-11

    Launch

    Rolled out to two senior partners and four associates for three weeks of structured testing on closed deals, measuring citation precision and partner trust. Tuned the granularity router based on misroutes observed in live use. Promoted to the full team after partner sign-off.
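
The Build phase started with section detection because every index depends on clean boundaries. A minimal sketch of that idea follows, assuming headings of the form "ARTICLE 7" and clause numbers of the form "7.1". Both patterns, the Unit type, and the split_units function are illustrative assumptions, not the project's actual section-detection spec, which covered five document families.

```python
import re
from dataclasses import dataclass

# Hypothetical heading patterns; a real spec would be defined per document family.
SECTION_RE = re.compile(r"^(?:ARTICLE|SECTION)\s+(\d+)\b", re.IGNORECASE)
CLAUSE_RE = re.compile(r"^(\d+\.\d+)\s+")


@dataclass
class Unit:
    level: str  # "section" or "clause"
    label: str  # e.g. "7" or "7.2"
    text: str


def split_units(lines):
    """Group text lines into section and clause units based on heading patterns."""
    units = []
    current = None
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        section = SECTION_RE.match(stripped)
        clause = CLAUSE_RE.match(stripped)
        if section:
            current = Unit("section", section.group(1), stripped)
            units.append(current)
        elif clause:
            current = Unit("clause", clause.group(1), stripped)
            units.append(current)
        elif current is not None:
            # Continuation line: attach it to the unit that is currently open.
            current.text += " " + stripped
        # Lines before the first heading are dropped in this sketch.
    return units


if __name__ == "__main__":
    sample = [
        "ARTICLE 7 ASSIGNMENT",
        "7.1 Neither party may assign this Agreement",
        "without prior written consent.",
        "7.2 A change of control is deemed an assignment.",
    ]
    for unit in split_units(sample):
        print(unit.level, unit.label, "|", unit.text)
```

Scanned or oddly formatted contracts would need layout-aware detection rather than line patterns; the point here is only that boundaries are decided once, at ingestion, and reused by every index.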

System Architecture

What we built, component by component

  1. 01

    Document ingester

    Parses PDFs with structure detection, preserves page-and-coordinate metadata, and emits document, section, and clause units.

  2. 02

    Multi-granularity index

    pgvector embeddings at three granularities plus BM25 keyword index, all scoped per matter for tenant isolation.

  3. 03

    Retrieval router

    Classifies each question by scope and picks the right granularity to search, with fallback when confidence is low.

  4. 04

    In-tenancy LLM

    Open-weights model deployed on a private endpoint inside the client’s AWS tenancy with no external data egress.

  5. 05

    Citation renderer

    Renders cited passages inline with the answer and provides one-click jump to the page and coordinate in the source PDF.

  6. 06

    Audit log

    Append-only log of every query, retrieval result, and answer with operator identity for defensibility on close.
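
One way to make an append-only audit log tamper-evident is to chain each record to the previous one by hash. The sketch below illustrates that approach under stated assumptions: the source only says the log is append-only and records query, retrieval result, answer, and operator identity. The hash chaining, the field names, and the append_entry and verify helpers are all hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first record


def _digest(body):
    """Hash a record body deterministically (sorted keys give stable JSON)."""
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_entry(log, operator, query, retrieved_ids, answer):
    """Append a record whose hash also covers the previous record's hash."""
    body = {
        "operator": operator,
        "query": query,
        "retrieved_ids": retrieved_ids,
        "answer": answer,
        "prev_hash": log[-1]["hash"] if log else GENESIS_HASH,
    }
    entry = dict(body, hash=_digest(body))
    log.append(entry)
    return entry


def verify(log):
    """Return True only if no record was altered, removed, or reordered."""
    prev_hash = GENESIS_HASH
    for entry in log:
        body = {key: value for key, value in entry.items() if key != "hash"}
        if entry["prev_hash"] != prev_hash or _digest(body) != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True


if __name__ == "__main__":
    log = []
    append_entry(log, "associate-1", "Indemnification cap?", ["c-12"], "The cap is ... [c-12].")
    print(verify(log))
```

In a deployment the list would be a database table or object store with write-once permissions; the chain only detects edits, it does not prevent them.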

Data Flow

A document bundle is ingested into the three-granularity index. A user query goes through the retrieval router which picks granularity and runs both vector and keyword retrieval. The retrieved passages are passed to the in-tenancy LLM with strict citation requirements, the citation renderer presents the answer with inline source links, and the audit log records the full transaction.

Document ingester → Multi-granularity index → Retrieval router → In-tenancy LLM → Citation renderer
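
The data flow runs both vector and keyword retrieval for a query, which leaves the question of how two ranked lists become one. The source does not say how the results were combined. The sketch below uses reciprocal rank fusion as one common option; the function name, the passage IDs, and the constant k=60 are assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of passage IDs into a single ranking.

    Each list contributes 1 / (k + rank) for every ID it contains, so passages
    ranked well by both retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] = scores.get(passage_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


if __name__ == "__main__":
    vector_hits = ["c3", "c1", "c2"]   # hypothetical pgvector results, best first
    keyword_hits = ["c1", "c4"]        # hypothetical BM25 results, best first
    print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
    # prints ['c1', 'c3', 'c4', 'c2']
```

Rank-based fusion avoids having to normalize cosine similarities against BM25 scores, which are on different scales.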
Key Decisions

The trade-offs we made and why

Decision 01: Lead trade-off

Indexed at multiple granularities, not just paragraph chunks

A single chunk size either lost section context or buried clause-level detail. Indexing at document, section, and clause levels let the router match retrieval to question scope, which improved precision substantially over a single-granularity baseline.
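
To make the routing idea concrete, here is a minimal sketch of a router that scores a question against cue phrases for each granularity and falls back to searching all three levels when no level clearly wins. The cue lists, the 0.6 threshold, and the search-everything fallback are assumptions; the source describes only a classifier by question scope with a low-confidence fallback.

```python
from dataclasses import dataclass

LEVELS = ["document", "section", "clause"]

# Hypothetical cue phrases; a production router would be trained or tuned on
# the question taxonomy gathered during discovery.
CUES = {
    "document": ("summarize", "overall", "parties", "governing law"),
    "section": ("section", "article", "schedule"),
    "clause": ("clause", "limit", "cap", "change of control", "assign", "indemnif"),
}


@dataclass
class Route:
    levels: list       # granularities to search
    confidence: float  # share of cue hits belonging to the winning level


def route(question, threshold=0.6):
    """Pick the granularity to search, or all levels when confidence is low."""
    q = question.lower()
    hits = {level: sum(cue in q for cue in CUES[level]) for level in LEVELS}
    total = sum(hits.values())
    if total == 0:
        return Route(list(LEVELS), 0.0)
    best = max(LEVELS, key=lambda level: hits[level])
    confidence = hits[best] / total
    if confidence < threshold:
        return Route(list(LEVELS), confidence)
    return Route([best], confidence)


if __name__ == "__main__":
    print(route("What is the indemnification cap?"))
    print(route("Which section covers assignment?"))
```

The second question hits both section and clause cues, so the sketch searches every level instead of guessing, which is the kind of misroute the launch phase tuned against.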

Decision 02

Deployed entirely inside the client tenancy

Privacy posture was a precondition, not a preference. Running the LLM endpoint and the index inside the client’s own AWS account removed the external-data-exposure objection that had killed prior tool evaluations and made adoption viable.

Decision 03

Preserved page-and-coordinate metadata through ingestion

The one-click jump to the source PDF was the feature that earned partner trust. Preserving coordinate metadata through embedding and retrieval required careful ingestion design but was non-negotiable for the citation experience to work.
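
A sketch of what carrying page-and-coordinate metadata through the pipeline can look like: each indexed unit keeps its source span, the span is stored next to the embedding, and the citation link is built from it at render time. The type names, field names, and viewer URL below are hypothetical. The "#page=N" fragment is a standard PDF open parameter; how the bounding box is used to highlight the passage depends on the viewer.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class SourceSpan:
    file_name: str
    page: int     # 1-based page number in the source PDF
    bbox: tuple   # (x0, y0, x1, y1) in PDF points


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    level: str    # "document", "section", or "clause"
    text: str
    span: SourceSpan


def to_index_row(chunk):
    """Flatten a chunk so the span is stored alongside its embedding."""
    return {
        "chunk_id": chunk.chunk_id,
        "level": chunk.level,
        "text": chunk.text,
        "file_name": chunk.span.file_name,
        "page": chunk.span.page,
        "bbox": list(chunk.span.bbox),
    }


def jump_target(base_url, row):
    """Build the one-click target from a retrieved index row."""
    return {
        "url": f"{base_url}/{quote(row['file_name'])}#page={row['page']}",
        "bbox": row["bbox"],  # passed to the viewer for highlighting
    }


if __name__ == "__main__":
    span = SourceSpan("Share Purchase Agreement.pdf", 14, (72.0, 540.5, 523.0, 580.0))
    row = to_index_row(Chunk("c-12", "clause", "The cap is ...", span))
    print(jump_target("https://viewer.example/docs", row)["url"])
```

Because the span travels with the row rather than being looked up later, a retrieved passage can always be traced to its location even if the document is re-chunked.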

Decision 04

Picked open-weights over a frontier model API

API-served models would have required data egress and per-query billing that did not fit the use pattern. An open-weights model on a private endpoint cost less at this volume, kept data in the tenancy, and was sufficient for the retrieval-grounded answer quality.

Outcomes

What changed for the client

-75% review time

Measured as median time-to-clause-extraction per deal, compared across a thirty-deal sample before and after the rollout to the legal team.

3 index granularities

Document, section, and clause levels indexed in parallel with a router that selects scope per question.

100% answers cited

Every answer ships with inline citations linking back to the exact page and paragraph in the source PDF.

In-tenancy data isolation

Full pipeline including LLM endpoint runs inside the client AWS account with no external data egress.
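
A guarantee that every answer is cited implies a check between the model and the renderer. The sketch below shows one way such a gate could work, assuming the model is prompted to end each sentence with bracketed passage IDs such as [c-12]. The marker format, the function name, and the naive sentence splitter are assumptions, not the project's implementation.

```python
import re

# Assumed citation marker format: a passage ID in square brackets, e.g. [c-12].
CITATION_RE = re.compile(r"\[([A-Za-z0-9_.-]+)\]")


def check_citations(answer, retrieved_ids):
    """Flag sentences with no citation and citations to passages not retrieved."""
    # Naive splitter: breaks after ., ! or ? followed by whitespace.
    # Abbreviations such as "e.g. " would need a proper sentence tokenizer.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited = []
    unknown = set()
    for sentence in sentences:
        cited = CITATION_RE.findall(sentence)
        if not cited:
            uncited.append(sentence)
        unknown.update(cid for cid in cited if cid not in retrieved_ids)
    return {
        "ok": not uncited and not unknown,
        "uncited": uncited,
        "unknown": sorted(unknown),
    }


if __name__ == "__main__":
    answer = "The cap is 10% of the purchase price [c-12]. Assignment requires consent [c-31]."
    print(check_citations(answer, {"c-12", "c-31"}))
```

An answer that fails the check can be regenerated or withheld, so nothing reaches a partner without a passage behind every sentence.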

Tech Stack

The tools behind the system

Built with a deliberate stack chosen for production reliability and operational velocity.

5 components, production-grade
Python, Flask, LLMs, React.js, Docker
What we’d carry forward

Lessons learned from the build

01 Lesson

Multi-granularity indexing was the single largest precision win. We tried a single chunk size first because it was simpler, and watched recall and precision both suffer on clause-level questions. The three-granularity index closed the gap and we would design it that way from day one in future projects.

02 Lesson

Inline citation is what bought partner trust, not answer quality on its own. The same answer without one-click citation would have failed the same partners who later approved the rollout. In scoping, we would give the citation experience the same priority as retrieval quality.

03 Lesson

Open-weights inside the tenancy was a strategic call that paid off twice. The privacy story unlocked adoption and the per-query cost stayed flat as usage grew, which would not have been true on an API-served frontier model at this query volume.


Industry Context

Where this project sits in the bigger market picture

AI systems for document review, compliance, clause extraction, and legal operations.
