RAG Support Assistant
Answer product and documentation questions from your own knowledge base — with citations, not hallucinations.
A retrieval-augmented question-answering service that grounds an LLM in your documentation and returns answers with linked sources. It uses hybrid retrieval plus reranking so the model only reasons over the most relevant chunks, and it enforces a per-request cost cap so a single query cannot run away with your budget. This is the system most companies actually want first, which makes it the highest-leverage project in a portfolio.
Teams sit on gigabytes of docs, runbooks, and support tickets, but the knowledge is trapped behind search that only matches keywords. A raw LLM will answer confidently from its training data and invent details about your product. The real problem is grounding: getting a model to answer only from your content, admit when it does not know, and show its sources so a human can verify — all fast and cheap enough to sit in front of real users.
- 01
Ingestion & chunking
A pipeline pulls documents from sources (Markdown, HTML, PDF), normalizes them, and splits on structure — headings and sections — instead of fixed character counts, keeping metadata like title, URL, and section anchor.
- 02
Embedding & indexing
Chunks are embedded and stored in Postgres with pgvector alongside a full-text tsvector column, so one row supports both vector similarity and keyword search.
- 03
Hybrid retrieval
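A query runs dense vector search and BM25-style full-text search in parallel, then fuses the two ranked lists with reciprocal rank fusion to catch both semantic and exact-keyword matches. Note that Postgres's built-in ts_rank is not BM25, so true BM25 scoring needs an extension or an external index; ts_rank over the tsvector column is the simpler starting point.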
- 04
Reranking
A cross-encoder reranker scores the top ~30 fused candidates and keeps only the best 4-6, which sharply cuts the noise the model has to reason over.
- 05
Grounded generation
The prompt is assembled from the surviving chunks with instructions to answer only from context and cite chunk IDs; the model returns an answer plus the sources it used.
- 06
Cost & cache layer
Redis caches embeddings and recent answers, and a per-request token budget short-circuits generation before it can exceed a configured cost ceiling.
- 07
API & citations
A FastAPI endpoint returns the answer, the resolved source links, and a confidence signal so the caller can render citations or fall back to a safe 'I do not know' response.
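Steps 05 and 07 can be sketched in a few lines: assemble a prompt from the surviving chunks, then map the chunk IDs the model cites back to source links. This is a minimal sketch, not the project's actual code. The `Chunk` fields, the square-bracket citation format, and the function names are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str  # hypothetical ID format, e.g. "c1"
    title: str
    url: str
    text: str


def build_prompt(question: str, chunks: list[Chunk]) -> str:
    """Assemble a grounded prompt that labels each chunk with its ID."""
    context = "\n\n".join(f"[{c.chunk_id}] {c.title}\n{c.text}" for c in chunks)
    return (
        "Answer using only the context below. "
        "Cite the chunk IDs you used in square brackets. "
        "If the context does not contain the answer, reply exactly: I do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Assumed citation syntax: a chunk ID wrapped in square brackets.
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def resolve_citations(answer: str, chunks: list[Chunk]) -> list[str]:
    """Return source URLs for cited chunk IDs, in order, skipping unknown IDs."""
    by_id = {c.chunk_id: c for c in chunks}
    urls: list[str] = []
    for cited_id in CITATION.findall(answer):
        chunk = by_id.get(cited_id)
        if chunk is not None and chunk.url not in urls:
            urls.append(chunk.url)
    return urls
```

An answer that resolves to zero known citations is a reasonable trigger for the safe 'I do not know' fallback, since nothing in it can be traced back to a source.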
pgvector instead of a dedicated vector database
If you already run Postgres, one datastore means one backup story, one connection pool, and native SQL filtering on metadata. You trade some raw ANN throughput for radically less operational surface — the right call until scale forces a dedicated store.
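To make the "one row, two search modes" idea concrete, here is a sketch of what the table and the two queries might look like, held as SQL strings in Python. The table name, column names, the 384-dimension embedding size, and the `product` metadata column are assumptions. The HNSW index assumes pgvector 0.5.0 or later, and the generated column assumes Postgres 12 or later. Placeholders use the psycopg-style `%(name)s` form.

```python
# Assumed schema: one row per chunk, carrying both the embedding and a tsvector.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS chunks (
    id        bigserial PRIMARY KEY,
    title     text NOT NULL,
    url       text NOT NULL,
    product   text NOT NULL,
    body      text NOT NULL,
    embedding vector(384) NOT NULL,
    tsv       tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED
);

CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON chunks USING hnsw (embedding vector_cosine_ops);
CREATE INDEX IF NOT EXISTS chunks_tsv_idx
    ON chunks USING gin (tsv);
"""

# Dense search: <=> is pgvector's cosine-distance operator (smaller is closer).
DENSE_SQL = """
SELECT id, title, url
FROM chunks
WHERE product = %(product)s
ORDER BY embedding <=> %(qvec)s::vector
LIMIT %(k)s;
"""

# Keyword search: ts_rank is Postgres's built-in ranking, not BM25.
KEYWORD_SQL = """
SELECT id, title, url
FROM chunks
WHERE product = %(product)s
  AND tsv @@ plainto_tsquery('english', %(q)s)
ORDER BY ts_rank(tsv, plainto_tsquery('english', %(q)s)) DESC
LIMIT %(k)s;
"""


def to_vector_literal(embedding: list[float]) -> str:
    """Format floats as pgvector's text input form, e.g. '[0.1,0.25]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"
```

Both queries filter on the same metadata column with plain SQL, which is the practical payoff of keeping everything in one datastore.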
Hybrid retrieval + reranking instead of vector search alone
Pure vector search misses exact identifiers like error codes and API names, and returns near-duplicates. Adding BM25 and a reranker costs latency and a second model call, but it is the single biggest lever on answer quality.
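Reciprocal rank fusion is small enough to show in full. The sketch below fuses any number of ranked ID lists, then reranks with a pluggable scoring function. The constant `k = 60` is the commonly used default, and `score_fn` stands in for a real cross-encoder; the word-overlap scorer in the usage note is only a toy placeholder, not a reranking model.

```python
from collections import defaultdict
from typing import Callable


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    keep: int = 5,
) -> list[str]:
    """Score every candidate text against the query and keep the best few."""
    ranked = sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)
    return ranked[:keep]
```

For example, fusing a dense list `["a", "b", "c"]` with a keyword list `["c", "a", "d"]` puts `a` and `c` ahead of `b` and `d`, because items found by both retrievers accumulate score from each list.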
Structure-aware chunking over fixed-size windows
Splitting on headings keeps ideas intact and preserves the metadata needed for citations. It is more work than a character splitter, but fixed windows routinely cut sentences in half and wreck retrieval quality.
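A heading-based splitter for Markdown can be quite short. The sketch below is one possible approach, assuming ATX-style `#` headings; the `Section` fields and the slug rule for anchors are illustrative assumptions, and a real site may generate anchors differently. It does not skip fenced code blocks, so a `#` comment inside a fence would be mistaken for a heading.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass(frozen=True)
class Section:
    title: str
    anchor: str  # empty for text that appears before the first heading
    url: str
    text: str


def slugify(title: str) -> str:
    """Assumed anchor rule: lowercase, non-alphanumerics collapsed to hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def _emit(sections: list[Section], title: str, anchor: str, base_url: str, lines: list[str]) -> None:
    """Append a section if its body is non-empty."""
    text = "\n".join(lines).strip()
    if text:
        url = f"{base_url}#{anchor}" if anchor else base_url
        sections.append(Section(title, anchor, url, text))


def chunk_markdown(markdown: str, base_url: str, doc_title: str) -> list[Section]:
    """Split on headings so each chunk is one section with citation metadata."""
    sections: list[Section] = []
    title, anchor, buffer = doc_title, "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            _emit(sections, title, anchor, base_url, buffer)
            title = match.group(2)
            anchor, buffer = slugify(title), []
        else:
            buffer.append(line)
    _emit(sections, title, anchor, base_url, buffer)
    return sections
```

Each chunk keeps its own title and deep link, which is what lets the API resolve a cited chunk to a section anchor rather than just a page.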
A hard per-request cost cap
Because the prompt is built from a small, fixed number of retrieved chunks, token usage per request stays predictable. Enforcing an explicit ceiling turns 'the LLM bill is scary' into a number you can defend, which is a detail interviewers notice.
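One way to enforce the cap is to price the worst case before calling the model and refuse if it exceeds the ceiling. The sketch below assumes a crude four-characters-per-token estimate (swap in your provider's tokenizer for real use), and the `Pricing` rates are hypothetical placeholders, not any provider's actual prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_1k: float   # USD per 1,000 prompt tokens (hypothetical rate)
    output_per_1k: float  # USD per 1,000 completion tokens (hypothetical rate)


class BudgetExceeded(Exception):
    """Raised when a request's worst-case cost is above the configured ceiling."""


def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)


def worst_case_cost(prompt: str, max_output_tokens: int, pricing: Pricing) -> float:
    """Cost if the model uses every allowed output token."""
    prompt_tokens = estimate_tokens(prompt)
    return (
        (prompt_tokens / 1000) * pricing.input_per_1k
        + (max_output_tokens / 1000) * pricing.output_per_1k
    )


def enforce_budget(
    prompt: str, max_output_tokens: int, pricing: Pricing, ceiling_usd: float
) -> float:
    """Return the worst-case cost, or raise before any model call is made."""
    cost = worst_case_cost(prompt, max_output_tokens, pricing)
    if cost > ceiling_usd:
        raise BudgetExceeded(f"worst case ${cost:.4f} exceeds ceiling ${ceiling_usd:.4f}")
    return cost
```

Checking the worst case up front means the request fails fast and cheaply, instead of being cut off partway through generation.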
- A labeled question/answer test set with the ground-truth source for each question, versioned in the repo.
- Retrieval metrics measured separately from generation (recall@k and MRR) so you can tell a bad answer caused by bad retrieval from one caused by the model.
- Faithfulness scoring that flags answers containing claims not supported by the retrieved chunks.
- A regression gate in CI that fails the build if retrieval recall or faithfulness drops below threshold after a prompt, chunking, or model change.
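The retrieval metrics and the regression gate can be sketched as follows. Each test case pairs the IDs the retriever returned (in rank order) with the set of ground-truth source IDs. The function names, the case format, and the metric keys are assumptions for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of ground-truth sources that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(cases: list[tuple[list[str], set[str]]], k: int) -> dict[str, float]:
    """Average recall@k and MRR over a labeled test set."""
    if not cases:
        return {"recall@k": 0.0, "mrr": 0.0}
    recalls = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in cases]
    ranks = [reciprocal_rank(retrieved, relevant) for retrieved, relevant in cases]
    return {"recall@k": sum(recalls) / len(cases), "mrr": sum(ranks) / len(cases)}


def gate(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one message per metric that fell below its threshold."""
    return [
        f"{name}={metrics.get(name, 0.0):.3f} is below threshold {minimum:.3f}"
        for name, minimum in thresholds.items()
        if metrics.get(name, 0.0) < minimum
    ]
```

A CI step would call `gate` and exit non-zero when the returned list is not empty, which is what fails the build after a prompt, chunking, or model change.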
- Dockerized API with a separate ingestion worker; Postgres + pgvector and Redis run as managed services or containers.
- Secrets (provider key, database URL) are injected from the platform secret store — never baked into the image.
- CI runs the eval suite and a smoke test against a seeded index before a deploy is promoted.
- Request tracing plus token and cost metrics are exported so you can watch p95 latency and spend per query in production.
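Keeping secrets out of the image usually means the service reads them from the environment at startup and refuses to boot if any are missing. A minimal sketch, assuming hypothetical variable names (`DATABASE_URL`, `PROVIDER_API_KEY`, `REDIS_URL`, `COST_CEILING_USD`) and illustrative defaults:

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Hypothetical variable names; use whatever your platform's secret store injects.
REQUIRED = ("DATABASE_URL", "PROVIDER_API_KEY")


@dataclass(frozen=True)
class Settings:
    database_url: str
    provider_api_key: str
    redis_url: str
    cost_ceiling_usd: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from the environment and fail fast on missing secrets."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing required settings: " + ", ".join(missing))
    return Settings(
        database_url=env["DATABASE_URL"],
        provider_api_key=env["PROVIDER_API_KEY"],
        redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
        cost_ceiling_usd=float(env.get("COST_CEILING_USD", "0.02")),
    )
```

Passing the environment in as a mapping keeps the loader easy to exercise in the CI smoke test without touching real credentials.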
- 01
Why you separated retrieval evals from answer evals, and how that let you debug wrong answers as retrieval bugs.
- 02
How reciprocal rank fusion and a reranker changed answer quality versus naive top-k vector search.
- 03
How the per-request cost cap and caching keep spend predictable under real traffic.
- 04
When you would graduate from pgvector to a dedicated vector store, and what signal would trigger that.
Video walkthrough
Watch it built, end to end
A full video walkthrough — architecture, trade-offs, evals, and deployment — ships with the AI Engineer Interview & Portfolio Kit at launch (August 2026). There is no fake demo here: join the waitlist and you will get it the day it lands.