Skip to content

Glossary

AI engineering glossary

The core vocabulary of production AI — defined in plain English for developers. 20 terms and counting.

Large Language Model (LLM)
A large language model is a transformer-based neural network trained on huge text corpora to predict the next token. In AI engineering you use LLMs as components behind an API rather than training them yourself. How to become an AI engineer
Token
A token is the chunk of text a model processes — often a word piece. Pricing, context limits, and latency are all measured in tokens, so token budgeting is a core production concern.
Context window
The context window is the total tokens (prompt + response) a model can handle in a single call. It bounds how much retrieved context, history, and instructions you can include, which is why retrieval and summarization matter.
Embedding
An embedding maps text (or images) to a high-dimensional vector so that semantically similar items sit close together. Embeddings power semantic search and are the backbone of retrieval-augmented generation. The data engineer's path to RAG
Vector database
A vector database indexes embeddings and returns the nearest vectors to a query using approximate nearest-neighbor search. It's the retrieval layer in most RAG systems (e.g. pgvector, Pinecone, Qdrant, Weaviate).
Retrieval-Augmented Generation (RAG)
RAG retrieves relevant chunks from your data and adds them to the prompt so the model answers from real sources instead of memory. It reduces hallucination, enables citations, and keeps answers current without retraining. Build a production RAG app
Chunking
Chunking breaks source documents into passages sized for retrieval and the context window. Chunk size, overlap, and boundaries (semantic vs fixed) strongly affect retrieval quality in a RAG system.
Reranking
A reranker (often a cross-encoder) takes the top candidates from vector search and reorders them by true relevance to the query, improving precision before the context is handed to the LLM.
Prompt engineering
Prompt engineering is the practice of structuring instructions, examples, and context to get reliable outputs. In production it extends to templating, system prompts, and guarding against prompt injection.
Fine-tuning
Fine-tuning further trains a base model on your labeled examples to shape style, format, or task behavior. It's best for consistency, not for injecting fresh facts — that's RAG's job. RAG vs fine-tuning
Agent
An agent uses an LLM to decide actions, call tools or APIs, observe results, and iterate toward a goal. Production agents add memory, guardrails, retries, and human-in-the-loop control. What is agentic AI?
Agentic AI
Agentic AI describes systems that plan and act over multiple steps — using tools, memory, and control flow — rather than answering in a single shot. It raises the bar for evaluation, safety, and observability. What is agentic AI?
Function calling / tool use
Function calling lets an LLM return a structured request to run a defined tool (search, database query, API call). It's the mechanism that turns a chat model into an agent that can act.
Model Context Protocol (MCP)
MCP is an open protocol that standardizes how applications expose tools, resources, and context to LLMs, so agents can connect to external systems through a common interface instead of bespoke integrations.
Evaluation (evals)
Evals are automated tests for LLM systems — graded on datasets using exact match, model-graded scoring, or task metrics. They catch regressions and are essential for shipping and iterating safely. LLMOps for DevOps engineers
Hallucination
A hallucination is fluent but incorrect or fabricated output. Grounding with RAG, citations, constrained decoding, and evals are the standard mitigations in production systems.
Guardrails
Guardrails validate and constrain what goes into and comes out of a model — input filtering, output schemas, safety classifiers, and refusal policies — to keep behavior safe and predictable.
LLMOps
LLMOps applies operational discipline to LLM applications: versioning prompts, tracing requests, running evals in CI, monitoring cost and latency, and managing rollouts and incidents. LLMOps for DevOps engineers
Temperature
Temperature scales how randomly a model samples the next token. Lower values give deterministic, focused output (good for extraction); higher values increase diversity (useful for brainstorming).
Prompt injection
Prompt injection is when untrusted content overrides your instructions — e.g. text in a retrieved document telling the model to ignore its rules. Defenses include input isolation, allow-lists, and least-privilege tools.

Production AI Notes

One practical AI engineering email each week

One concept, one architecture, one project idea, and one interview question — written for developers who want to build and ship real AI systems.

No spam. Unsubscribe anytime.