RAG vs Agents: When to Use Which (With Real Examples from Our Stack)

Mon, 25 May 2026 00:00:00 +0000

TL;DR — RAG answers from documents. Agents take actions. Most real systems use both: RAG provides context, agents act on it. The hard part isn’t picking one — it’s knowing which layer of your problem belongs to which pattern.

Why This Comparison Matters Right Now

Two things happened in the last six months that make this comparison less academic than it used to be.

First: coding agents crossed a quality threshold around November 2025. Simon Willison’s five-minute PyCon talk describes it as the moment agents went from “often-work” to “mostly-work” — usable as daily drivers, not just demos. The “best model” title changed hands five times between Anthropic, OpenAI, and Google in a single month.

Second: the model labs themselves are pivoting. Greg Brockman: “the model alone is no longer the product.” AI21 shuttered its model team to focus on agents. DeepSeek spun up its first “Harness team.” Latent Space called this “all model labs are now agent labs.”

When the people who train the models start saying the model isn’t the product, the question of how you wire models into systems becomes the actual engineering work. RAG and agents are the two dominant answers. They solve different problems, and getting the choice wrong wastes a lot of tokens.

The Mental Model

RAG: Retrieve, then Generate

RAG is a fixed four-step pipeline:

User query
   │
   ▼
Embedding model → vector
   │
   ▼
Vector DB / search index → top-K relevant chunks
   │
   ▼
Chunks injected into the LLM prompt as context
   │
   ▼
LLM writes one answer, grounded in the retrieved text

One retrieval. One generation. Cheap, predictable, easy to debug.
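To make the pipeline concrete, here is a minimal sketch of the same steps. It assumes a toy bag-of-words "embedding" in place of a real embedding model and a plain list in place of a vector DB. The function names and sample chunks are made up for illustration and are not part of our stack.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Inject the retrieved chunks into the prompt as context."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

# Hypothetical corpus, already split into chunks.
CHUNKS = [
    "The API rate limit is 100 requests per minute per key.",
    "Deploys run every Tuesday after the standup.",
    "Rate limit errors return HTTP 429 with a Retry-After header.",
]

question = "What is the API rate limit?"
top = retrieve(question, CHUNKS, k=2)
prompt = build_prompt(question, top)
# `prompt` would now go to the LLM for the single generation step.
```

A real system swaps `embed` for a model call and the list for an index, but the shape stays the same: one lookup, one prompt, one generation.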

Agent: Reason, then Act, then Reason Again

An agent is a reasoning loop:

User goal
   │
   ▼
┌──────────────────────────────────────────┐
│   LLM reads the goal                      │
│   ↓                                       │
│   Picks a tool (Read, Edit, Bash, ...)    │
│   ↓                                       │
│   Runtime executes the tool               │
│   ↓                                       │
│   Result feeds back to the LLM            │
│   ↓                                       │
│   LLM reasons about what to do next       │
│   ↓                                       │
│   Picks the next tool                     │
│   ↓                                       │
│   ...loop until task is done              │
└──────────────────────────────────────────┘

Every iteration burns tokens. Every step can fail. Errors compound across the loop.
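The loop can be sketched in a few lines. This is an illustration under heavy assumptions: the "LLM" is a scripted function, the tools work on an in-memory dict instead of a real filesystem, and every name here is hypothetical. What it shows is the control flow: pick a tool, execute it, feed the result back, stop when done or when the step budget runs out.

```python
from typing import Callable

# Hypothetical in-memory "filesystem" so the sketch has something to act on.
FILES = {"config.yaml": "baseURL: https://old.example/"}

def read_file(path: str) -> str:
    return FILES[path]

def edit_file(path: str, old: str, new: str) -> str:
    if old not in FILES[path]:
        return "error: old text not found"
    FILES[path] = FILES[path].replace(old, new)
    return "edit succeeded"

TOOLS: dict[str, Callable[..., str]] = {"read": read_file, "edit": edit_file}

def scripted_policy(history: list[dict]) -> dict:
    """Stand-in for the LLM: chooses the next step from what happened so far."""
    if not history:
        return {"tool": "read", "args": {"path": "config.yaml"}}
    if history[-1]["tool"] == "read":
        return {"tool": "edit", "args": {"path": "config.yaml",
                                         "old": "https://old.example/",
                                         "new": "https://new.example/"}}
    return {"tool": None, "args": {}}  # nothing left to do

def run_agent(policy: Callable[[list[dict]], dict], max_steps: int = 10) -> list[dict]:
    """Run the reason/act loop until the policy stops or the budget is spent."""
    history: list[dict] = []
    for _ in range(max_steps):
        step = policy(history)
        if step["tool"] is None:
            break
        result = TOOLS[step["tool"]](**step["args"])       # runtime executes the tool
        history.append({"tool": step["tool"], "result": result})  # result feeds back
    return history

trace = run_agent(scripted_policy)
```

With a real model, `scripted_policy` becomes an LLM call that sees the whole history, which is where both the token cost and the compounding errors come from.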

A Concrete Example of Each

RAG in Action: Semantic Wiki Search

We run a personal knowledge base — about 60 markdown files covering project notes, design decisions, and conversation transcripts. Plain grep doesn’t cut it because the question and the answer rarely share keywords.

The solution is an MCP server that wraps a vector search:

MCP server: wiki-search
  Backend: bge-m3 embedding model
  Storage: cosine similarity index over 60+ markdown files
  Input:   natural-language query (English or Chinese)
  Output:  file path + section title + similarity score

When I ask Claude Code “what did we decide about the auto-pipeline last month?”, this happens:

1. Claude Code recognizes it needs to search the wiki and invokes search_wiki("auto-pipeline decision").
2. The query gets embedded into a 1024-dim vector.
3. Cosine similarity returns the top 5 matching sections.
4. The matching chunks get injected into Claude’s context.
5. Claude answers from the actual files instead of guessing, which sharply reduces the room for hallucination.

This is RAG end-to-end. The wiki is passive. It gets queried; it doesn’t act.

Agent in Action: Claude Code Editing Files

Same Claude Code instance. Different problem: “Update the baseURL in mybrew/hugo.yaml to aibrew.ai and add a note in TODO.md.”

Iteration 1:
  LLM reasoning: "I need to see the current config first."
  Tool: Read("mybrew/hugo.yaml")
  Result: file contents shown

Iteration 2:
  LLM reasoning: "baseURL is on line 1. I'll edit it."
  Tool: Edit(old="https://mybrew.cc/", new="https://aibrew.ai/")
  Result: edit succeeded

Iteration 3:
  LLM reasoning: "Now update TODO.md to mark this done."
  Tool: Read("hugo/TODO.md")
  Result: file contents shown

Iteration 4:
  LLM reasoning: "Need to add the entry under 'Domain config'."
  Tool: Edit(...)
  Result: edit succeeded

Task complete.

Four iterations. Four tool calls. Multiple reasoning steps. The agent decided what to do, how to do it, and when it was done — all on its own.

A Higher-Stakes Agent: Game Server Control

We also run an agent that controls a Terraria game server through MCP — the bridge exposes ~40 tools (give items, teleport, ban players, spawn bosses, restart server).

Player in chat: "@ai give me a Zenith"
  → terra_item_lookup("Zenith") → resolves to ID 4956
  → terra_give_item(player="kali", item="Zenith") → SUCCESS
  → Item appears in player's inventory

Compare to a destructive operation:

Player: "@ai end the world"
  → terra_world_hardmode(confirm=true) requires explicit authorization
  → Refuses without confirmation
  → If confirmed: world permanently enters hardmode (irreversible)

This is where the agent pattern gets dangerous. The LLM is now in the driver’s seat of a real system. The blast radius of a wrong tool call is no longer “wrong answer” — it’s “wrecked world.” Permission boundaries become first-class design.
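One way to express such a boundary is a gate that destructive tools must pass before they run. The sketch below is not the bridge's actual code. The decorator, the exception, and the `WORLD` dict are hypothetical stand-ins, and it assumes the `confirm` flag is set by something outside the model, such as a human approving the call.

```python
from functools import wraps

class ConfirmationRequired(Exception):
    """Raised when a destructive tool is called without explicit confirmation."""

def destructive(fn):
    """Require confirm=True before the wrapped tool is allowed to run."""
    @wraps(fn)
    def wrapper(*args, confirm: bool = False, **kwargs):
        if confirm is not True:
            raise ConfirmationRequired(f"{fn.__name__} needs confirm=True")
        return fn(*args, **kwargs)
    return wrapper

WORLD = {"hardmode": False}  # hypothetical stand-in for server state

@destructive
def world_hardmode() -> str:
    WORLD["hardmode"] = True  # irreversible in the real game
    return "world is now in hardmode"

def give_item(player: str, item: str) -> str:
    # Low-stakes tool: no gate needed.
    return f"gave {item} to {player}"

try:
    world_hardmode()          # the model calls the tool without confirmation
    outcome = "ran"
except ConfirmationRequired:
    outcome = "refused"
```

The point of putting the check in the tool layer rather than the prompt is that a confused or manipulated model cannot talk its way past it.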

The Decision Framework

The one-line rule:

Use RAG when the answer lives in your documents. Use an agent when the answer requires action.

Here’s the longer version:

Dimension	RAG	Agent
Goal	Answer a question	Complete a task
Interaction model	One-shot	Multi-turn loop
Token cost	Low (1× retrieval + 1× generation)	High (N× reasoning + N× tool calls)
Latency	~1–3 seconds	Seconds to minutes
Determinism	High — same query → similar answer	Low — same goal → different paths
Debuggability	Inspect retrieval results	Trace each reasoning step
Failure mode	Wrong/missing context → bad answer	Tool error compounds → drift
Blast radius	Limited to wrong answer	Touches real systems
Best for	Q&A, search, summarization	Coding, ops, automation, workflows

When You Definitely Want RAG

“What does our internal API documentation say about rate limits?”
“Summarize last week’s customer feedback.”
“What did the design discussion conclude about authentication?”

When You Definitely Want an Agent

“Run the test suite and fix any failures.”
“Pull yesterday’s unread RSS items, pick the three most interesting, and draft a roundup post.”
“Refactor this directory to use the new logging API.”

When You Need Both (Most Real Systems)

“Find the related design doc, then propose a code change consistent with it.” → RAG to retrieve the doc, agent to make the change.
“Look up how Pinterest handled MCP auth, then design our auth layer.” → RAG to gather references, agent to write code.

Hybrid Patterns: RAG-Powered Agents

Here’s the thing most “RAG vs Agent” comparisons gloss over: inside any real agent, RAG is happening at multiple layers.

A Claude Code session, simplified:

Session start:
  └─ Load CLAUDE.md into context ............... RAG-on-startup
  └─ Load relevant MEMORY.md files ............. RAG-on-startup

User query:
  └─ Agent reasons about the goal
       │
       ├─ Tool call: search_wiki("...") ........ RAG-on-demand
       ├─ Tool call: searxng_web_search("...") . RAG-on-demand
       ├─ Tool call: Read("config.yaml") ....... Deterministic retrieval
       └─ Tool call: Edit(...) ................. Action

The agent loop is the outer shell. RAG calls happen inside the loop, on demand, whenever the agent decides it needs more grounding.

This matches what Pinterest engineers describe in their MCP rollout: the agent surfaces (chat, IDE, CLI) all talk to a common set of MCP servers, some of which are pure retrieval (Presto query, doc search) and some of which are actions (file a ticket, restart a job). The agent decides at runtime which to call.

Production Case Study: Pinterest’s MCP Ecosystem

ByteByteGo’s writeup of Pinterest’s MCP rollout is one of the few public production stories.

The N×M Problem

Pinterest engineers work across many systems daily — Presto for data, Spark for batch jobs, Airflow for workflows, internal docs, ticketing. They wanted AI agents that could reach into these systems directly.

The brute-force math:

5 agent surfaces × 10 internal tools = 50 bespoke integrations

Every new surface or new tool multiplied the work. Plus 50 auth flows, 50 token lifecycles, 50 sets of plumbing.

The MCP Bet

The Model Context Protocol promised to flatten this:

5 clients + 10 servers = 15 standardized integrations

One protocol, used in both directions. Build a client per surface. Wrap each tool in a server. They all speak the same language.
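The difference between the two counts matters most at the margin. A small calculation, using only the numbers above plus one extra surface as an assumed example:

```python
def bespoke_integrations(surfaces: int, tools: int) -> int:
    """Every surface wired directly to every tool."""
    return surfaces * tools

def protocol_integrations(surfaces: int, tools: int) -> int:
    """One client per surface plus one server per tool."""
    return surfaces + tools

before_bespoke = bespoke_integrations(5, 10)    # 50
before_protocol = protocol_integrations(5, 10)  # 15

# Cost of adding a sixth agent surface under each approach.
extra_bespoke = bespoke_integrations(6, 10) - before_bespoke     # 10 new integrations
extra_protocol = protocol_integrations(6, 10) - before_protocol  # 1 new client
```

Under the bespoke approach a new surface costs as many integrations as there are tools. Under a shared protocol it costs one.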

What MCP Doesn’t Solve

Pinterest’s hard-won lesson: the protocol is the easy part. The real engineering went into the surrounding infrastructure:

Concern	Pinterest’s Solution
Discovery	Central registry of MCP servers — name, version, owner, endpoint
Auth (Layer 1)	Service identity — which agent runtime is making this call
Auth (Layer 2)	User identity — whose permissions is the agent acting under
Deployment	Unified CI/CD pipeline for all MCP servers
Observability	Tool-call metrics from day one — usage, latency, error rate

The takeaway: the more capable your agents become, the more your permission and observability layers matter. A protocol that lets any agent call any tool is also a protocol that lets any compromised agent call any tool.

This is also why our smaller setup (3 MCP servers: searxng, wiki-search, terra_llm_bridge) puts hard confirm=true gates on destructive operations like banning players, restarting the world, or enabling hardmode. Three servers don’t need a registry — but they do need authorization.
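The registry and the two auth layers from the table can be sketched together. This is a guess at the shape, not Pinterest's implementation: the record fields beyond name, version, owner, and endpoint, the role names, and the endpoint URL are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ServerEntry:
    """Hypothetical registry record for one MCP server."""
    name: str
    version: str
    owner: str
    endpoint: str
    allowed_services: set[str] = field(default_factory=set)    # layer 1
    allowed_user_roles: set[str] = field(default_factory=set)  # layer 2

REGISTRY = {
    "ticketing": ServerEntry(
        name="ticketing", version="1.0.0", owner="infra-team",
        endpoint="https://mcp.internal.example/ticketing",
        allowed_services={"ide-agent", "chat-agent"},
        allowed_user_roles={"engineer"},
    )
}

def authorize(server: str, service_id: str, user_roles: set[str]) -> bool:
    """Allow a call only if both the calling runtime and the user pass."""
    entry = REGISTRY.get(server)
    if entry is None:
        return False                                    # not in the registry
    if service_id not in entry.allowed_services:
        return False                                    # layer 1: which runtime is calling
    return bool(user_roles & entry.allowed_user_roles)  # layer 2: on whose behalf
```

Checking both layers means a trusted runtime cannot act for a user who lacks the role, and a privileged user cannot be impersonated by an unknown runtime.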

Architecture Comparison: Claude Code vs OpenClaw

Two of the most popular agent harnesses today take very different stances. ByteByteGo’s EP214 breaks them down on five dimensions:

1. System Scope

	Claude Code	OpenClaw
Lifetime	Short-lived process	Long-running daemon
Trigger	User runs CLI	WebSocket from Discord/Slack/WhatsApp
Exit	After task complete	Never

Claude Code is a workhorse you summon. OpenClaw is a butler that’s always listening.

2. Agent Runtime

Claude Code: single async loop — Think → Tool Call → Observe → Repeat. One task at a time per process.
OpenClaw: per-session queues. The Gateway demultiplexes incoming messages and dispatches them to separate runtime queues.

3. Extension Model

Claude Code: Four extension primitives, all hooking into the same agent loop:
- MCP (external tool servers)
- Plugins (bundled tool sets)
- Skills (named procedures the model can invoke)
- Hooks (event-driven shell commands)
OpenClaw: Manifest-first plugins. All plugins go through a central Registry before being made available to the Agent.

4. Memory

Claude Code: CLAUDE.md loaded into context at session start. Subdirectories can have their own CLAUDE.md files, which get pulled in when the agent works in those directories.
OpenClaw: MEMORY.md separated from daily notes. Hybrid vector + keyword search across structured sections.

5. Multi-Agent Topology

Claude Code: Lead → subagent pattern. Main agent delegates work to spawned subagents.
OpenClaw: Route-and-delegate. Inbound channels route to dedicated agents that hand off to shared subagents.

The deeper pattern: Claude Code optimizes for “one session, one task.” OpenClaw optimizes for “many concurrent conversations, ambient presence.” Both are correct for their respective use cases. Don’t pick the wrong one for yours.

Failure Modes and Anti-Patterns

RAG Failure Modes

1. Retrieval misses the relevant chunk. Your embedding model thinks the question and the answer are semantically distant when they aren’t. Mitigation: hybrid search (vector + keyword), reranking, query expansion.

2. Retrieval returns too many irrelevant chunks. Context window fills with noise. Mitigation: stricter top-K, similarity threshold, post-retrieval filtering.

3. The answer isn’t actually in your corpus. RAG can’t fabricate truth — if the knowledge isn’t indexed, the model still doesn’t know. Mitigation: a confidence check, or a fallback to web search.

4. Chunking destroyed the structure. You split a markdown file mid-table, mid-code-block, mid-argument. Mitigation: structure-aware chunking (by heading, by paragraph, by semantic unit).
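As an illustration of the fourth mitigation, here is a minimal heading-based chunker for markdown. It is a sketch, not what our wiki-search server does: it assumes ATX-style `#` headings and triple-backtick fences, and it treats a `#` line inside a fence as code rather than a heading.

```python
FENCE = "`" * 3  # triple backtick, built this way to keep the example readable

def chunk_by_heading(markdown: str) -> list[dict]:
    """Split markdown at heading lines, never inside a fenced code block."""
    chunks: list[dict] = []
    title, lines, in_fence = "(preamble)", [], False

    def flush() -> None:
        # Emit the current section if it has any non-blank content.
        if any(ln.strip() for ln in lines):
            chunks.append({"title": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence           # toggle on every fence line
        if line.startswith("#") and not in_fence:
            flush()
            title, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    flush()
    return chunks

DOC = "\n".join([
    "# Setup",
    "Install the tool.",
    FENCE + "bash",
    "# this comment is not a heading",
    "make install",
    FENCE,
    "## Usage",
    "Run it.",
])
chunks = chunk_by_heading(DOC)
```

Each chunk keeps its section title, so a search hit can report which section it came from.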

Agent Failure Modes

1. Reasoning drift. The agent gets stuck in a loop, repeatedly trying variations of the same failed approach. Mitigation: max-step limits, distinct-tool-call constraints, explicit “what have I tried” memory.

2. Permission overreach. The agent does too much. It was asked to fix one test, it refactored half the file. Mitigation: explicit scope in the prompt, narrow tool permissions, human-in-the-loop for destructive ops.

3. Tool-call cascade failure. A single bad tool call (e.g., a malformed path) gets followed by five reasoning steps trying to “fix” the symptom rather than the root cause. Mitigation: clear error messages from tools, “try once then escalate” tool design.

4. Spending money on the wrong thing. A 20-step agent loop costs at least 20× a single LLM call, and often more, because each step re-reads a context that keeps growing. If RAG would have answered the question, you paid that multiple to get a worse answer. Mitigation: ask “could this be a single retrieval?” before going to agent mode.
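The cost claim in item 4 can be checked with rough arithmetic. The model below is deliberately crude: it assumes every step re-sends the full context, ignores prompt caching and the price difference between input and output tokens, and uses made-up token counts.

```python
def single_call_tokens(prompt_tokens: int, output_tokens: int) -> int:
    """Tokens processed by one retrieval-augmented generation call."""
    return prompt_tokens + output_tokens

def agent_loop_tokens(prompt_tokens: int, output_tokens: int,
                      steps: int, growth_per_step: int) -> int:
    """Total tokens processed when every step re-reads the accumulated context."""
    total = 0
    context = prompt_tokens
    for _ in range(steps):
        total += context + output_tokens
        context += growth_per_step  # reasoning and tool results appended to context
    return total

one = single_call_tokens(2000, 300)                              # 2,300
flat = agent_loop_tokens(2000, 300, steps=20, growth_per_step=0)
growing = agent_loop_tokens(2000, 300, steps=20, growth_per_step=500)

flat_ratio = flat / one        # 20.0 when the context never grows
growing_ratio = growing / one  # about 61 when each step adds 500 tokens
```

Under these assumptions, 20× is the floor and a context that grows by a few hundred tokens per step roughly triples it.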

The Worst Anti-Pattern: Agent-When-RAG-Works

The single most expensive mistake teams make: building an agent for a problem that’s actually a search problem.

If your users are asking “where in the docs does it say…”, you don’t need an agent. You need a search box wired to a vector index. Stop spending tokens on multi-step reasoning to find something a single retrieval call would surface.

What This Means for Builders

A practical checklist if you’re starting a new AI feature:

Frame the problem as a verb. “Answer questions about X” → RAG. “Do X on behalf of the user” → agent.
If you can answer it with one retrieval, do. Cheaper, faster, more predictable.
If you go agent, design permissions on day one. Not day fifty. Pinterest’s two-layer auth wasn’t a feature — it was a survival requirement.
Plan for hybrid. Real agents will need RAG-style retrieval inside their loop. Pick a protocol (MCP is the obvious default) and stick to it.
Instrument everything. Tool call counts, retrieval hit rates, drift indicators. You can’t tune what you can’t see.
Set a budget per task. Both in tokens and in iterations. Agents without budgets find creative ways to spend forever on the wrong thing.
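The last two checklist items can share one small wrapper around tool calls. The sketch below counts calls, errors, and latency per tool and enforces a call budget. It does not track tokens, and the class and tool names are hypothetical.

```python
import time
from collections import defaultdict

class BudgetExceeded(Exception):
    """Raised when a task has used up its tool-call budget."""

class ToolMeter:
    """Counts calls, errors, and latency per tool, and enforces a call budget."""

    def __init__(self, max_calls: int):
        self.max_calls = max_calls
        self.calls = defaultdict(int)
        self.errors = defaultdict(int)
        self.seconds = defaultdict(float)

    def call(self, name: str, fn, *args, **kwargs):
        if sum(self.calls.values()) >= self.max_calls:
            raise BudgetExceeded(f"budget of {self.max_calls} tool calls used up")
        self.calls[name] += 1
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        except Exception:
            self.errors[name] += 1   # record the failure, then let it propagate
            raise
        finally:
            self.seconds[name] += time.perf_counter() - start

meter = ToolMeter(max_calls=3)
meter.call("search", lambda q: [q.upper()], "rate limits")
meter.call("search", lambda q: [q.upper()], "auth")
try:
    meter.call("parse", int, "not a number")  # fails and is counted as an error
except ValueError:
    pass
# A fourth call through the meter would now raise BudgetExceeded.
```

Routing every tool call through one choke point gives you the metrics and the kill switch in the same place.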

Closing Thought

The RAG-versus-agent framing made sense in 2023, when these were two distinct paradigms competing for the same job. In 2026, they’re complementary layers of the same system.

The interesting question isn’t which one to use. It’s which slice of your problem belongs in which layer. Get that division right and you ship something useful. Get it wrong and you’ll spend a quarter rebuilding it.

For most teams shipping today, the answer looks like this:

                ┌───────────────────────────────┐
                │      Agent loop (outer)        │
                │   reasoning + tool selection   │
                └──────────┬────────────────────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
        ▼                  ▼                  ▼
   RAG retrieval     Action tools       Computation
   (knowledge)       (mutate state)     (math, code)

Agent decides. RAG informs. Tools act. That’s the whole stack.
