Glossary

AI engineering glossary

The core vocabulary of production AI — defined in plain English for developers. 20 terms and counting.

Large Language Model (LLM): A large language model is a transformer-based neural network trained on huge text corpora to predict the next token. In AI engineering you use LLMs as components behind an API rather than training them yourself. How to become an AI engineer→
Token: A token is the chunk of text a model processes — often a word piece. Pricing, context limits, and latency are all measured in tokens, so token budgeting is a core production concern.
Context window: The context window is the total tokens (prompt + response) a model can handle in a single call. It bounds how much retrieved context, history, and instructions you can include, which is why retrieval and summarization matter.
Embedding: An embedding maps text (or images) to a high-dimensional vector so that semantically similar items sit close together. Embeddings power semantic search and are the backbone of retrieval-augmented generation. The data engineer's path to RAG→
Vector database: A vector database indexes embeddings and returns the nearest vectors to a query using approximate nearest-neighbor search. It's the retrieval layer in most RAG systems (e.g. pgvector, Pinecone, Qdrant, Weaviate).
Retrieval-Augmented Generation (RAG): RAG retrieves relevant chunks from your data and adds them to the prompt so the model answers from real sources instead of memory. It reduces hallucination, enables citations, and keeps answers current without retraining. Build a production RAG app→
Chunking: Chunking breaks source documents into passages sized for retrieval and the context window. Chunk size, overlap, and boundaries (semantic vs fixed) strongly affect retrieval quality in a RAG system.
Reranking: A reranker (often a cross-encoder) takes the top candidates from vector search and reorders them by true relevance to the query, improving precision before the context is handed to the LLM.
Prompt engineering: Prompt engineering is the practice of structuring instructions, examples, and context to get reliable outputs. In production it extends to templating, system prompts, and guarding against prompt injection.
Fine-tuning: Fine-tuning further trains a base model on your labeled examples to shape style, format, or task behavior. It's best for consistency, not for injecting fresh facts — that's RAG's job. RAG vs fine-tuning→
Agent: An agent uses an LLM to decide actions, call tools or APIs, observe results, and iterate toward a goal. Production agents add memory, guardrails, retries, and human-in-the-loop control. What is agentic AI?→
Agentic AI: Agentic AI describes systems that plan and act over multiple steps — using tools, memory, and control flow — rather than answering in a single shot. It raises the bar for evaluation, safety, and observability. What is agentic AI?→
Function calling / tool use: Function calling lets an LLM return a structured request to run a defined tool (search, database query, API call). It's the mechanism that turns a chat model into an agent that can act.
Model Context Protocol (MCP): MCP is an open protocol that standardizes how applications expose tools, resources, and context to LLMs, so agents can connect to external systems through a common interface instead of bespoke integrations.
Evaluation (evals): Evals are automated tests for LLM systems — graded on datasets using exact match, model-graded scoring, or task metrics. They catch regressions and are essential for shipping and iterating safely. LLMOps for DevOps engineers→
Hallucination: A hallucination is fluent but incorrect or fabricated output. Grounding with RAG, citations, constrained decoding, and evals are the standard mitigations in production systems.
Guardrails: Guardrails validate and constrain what goes into and comes out of a model — input filtering, output schemas, safety classifiers, and refusal policies — to keep behavior safe and predictable.
LLMOps: LLMOps applies operational discipline to LLM applications: versioning prompts, tracing requests, running evals in CI, monitoring cost and latency, and managing rollouts and incidents. LLMOps for DevOps engineers→
Temperature: Temperature scales how randomly a model samples the next token. Lower values give deterministic, focused output (good for extraction); higher values increase diversity (useful for brainstorming).
Prompt injection: Prompt injection is when untrusted content overrides your instructions — e.g. text in a retrieved document telling the model to ignore its rules. Defenses include input isolation, allow-lists, and least-privilege tools.

Production AI Notes

One practical AI engineering email each week

One concept, one architecture, one project idea, and one interview question — written for developers who want to build and ship real AI systems.