<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>MyBrew</title>
    <link>https://aibrew.ai/</link>
    <description>Recent content on MyBrew</description>
    <generator>Hugo</generator>
    <language>en-us</language>
    <lastBuildDate>Fri, 29 May 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://aibrew.ai/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>DeepSeek V4 in Claude Code: Anthropic-Compatible API with Native Web Search</title>
      <link>https://aibrew.ai/2026/05/deepseek-v4-in-claude-code-anthropic-compatible-api-with-native-web-search/</link>
      <pubDate>Fri, 29 May 2026 00:00:00 +0000</pubDate>
      <guid>https://aibrew.ai/2026/05/deepseek-v4-in-claude-code-anthropic-compatible-api-with-native-web-search/</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — DeepSeek ships an Anthropic-compatible API endpoint that lets Claude Code treat DeepSeek V4 Pro like a drop-in replacement for Claude Opus. The setup is eight environment variables, and it works — including tool calling, sub-agent spawning, and native web search. At $0.435/M input tokens (permanent price after the initial launch promo), it&amp;rsquo;s roughly 4–17× cheaper than Claude Opus 4.7. This is a practical guide based on a real setup we run daily.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<blockquote>
<p><strong>TL;DR</strong> — DeepSeek ships an Anthropic-compatible API endpoint that lets Claude Code treat DeepSeek V4 Pro like a drop-in replacement for Claude Opus. The setup is eight environment variables, and it works — including tool calling, sub-agent spawning, and native web search. At $0.435/M input tokens (permanent price after the initial launch promo), it&rsquo;s roughly 4–17× cheaper than Claude Opus 4.7. This is a practical guide based on a real setup we run daily.</p>
</blockquote>
<hr>
<h2 id="why-this-matters">Why This Matters</h2>
<p>Claude Code is Anthropic&rsquo;s terminal-based AI coding agent. It reads your codebase, runs bash commands, spawns sub-agents, and writes edits — all through Anthropic&rsquo;s API. The problem: <strong>it only speaks Anthropic&rsquo;s message format</strong>. You can&rsquo;t point it at OpenAI, Gemini, or a local Ollama instance without a translation layer.</p>
<p>DeepSeek solved this the obvious way: they built an Anthropic-compatible API endpoint at <code>https://api.deepseek.com/anthropic</code> and documented the Claude Code integration on day one. No proxy, no wrapper, no SDK fork. Just environment variables.</p>
<p>We&rsquo;ve been running this setup in production — managing a multi-project workspace with six active sub-projects, MCP servers, and daily coding sessions. Here&rsquo;s what works, what doesn&rsquo;t, and the exact configuration.</p>
<hr>
<h2 id="step-1-get-a-deepseek-api-key">Step 1: Get a DeepSeek API Key</h2>
<p>Sign up at <a href="https://platform.deepseek.com/api_keys">platform.deepseek.com</a> and create an API key. DeepSeek uses prepaid balance — top up what you need, no subscription.</p>
<p>Two models matter for Claude Code:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Role</th>
          <th>Context</th>
          <th>Max Output</th>
          <th>Input (cache miss)</th>
          <th>Output</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>deepseek-v4-pro</code></td>
          <td>Heavy reasoning, main agent</td>
          <td>1M tokens</td>
          <td>384K</td>
          <td>$0.435/M</td>
          <td>$0.87/M</td>
      </tr>
      <tr>
          <td><code>deepseek-v4-flash</code></td>
          <td>Sub-agents, fast tasks</td>
          <td>1M tokens</td>
          <td>384K</td>
          <td>$0.14/M</td>
          <td>$0.28/M</td>
      </tr>
  </tbody>
</table>
<p>After the initial launch promotion ends (2026-05-31), DeepSeek permanently adjusts the official price to 1/4 of the original, so $0.435/M input becomes the new normal, not a temporary deal. That&rsquo;s still ~34× cheaper than Claude Opus 4.7 (~$15/M input, ~$75/M output) on input tokens. V4 Flash stays as-is.</p>
<p>Cache hits are absurdly cheap: <strong>$0.003625/M</strong> for V4 Pro and <strong>$0.0028/M</strong> for V4 Flash. Claude Code generates a lot of repetitive context (system prompts, CLAUDE.md files, tool definitions), so cache hits dominate real usage.</p>
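<p>To see how much cache hits pull the effective price down, here is a minimal sketch that blends the two V4 Pro input prices quoted above for a given cache-hit share. The helper name <code>blended_price</code> and the 90% hit share are illustrative assumptions, not measurements from our workspace.</p>
<pre tabindex="0"><code class="language-bash" data-lang="bash"># Blended input price per million tokens for a given cache-hit share.
# Prices are the V4 Pro figures quoted above; the hit share is an assumed input.
blended_price() {
  local hit_share="$1"   # fraction of input tokens served from cache, 0..1
  awk -v h="$hit_share" -v hit=0.003625 -v miss=0.435 \
    'BEGIN { printf "%.4f\n", h * hit + (1 - h) * miss }'
}

blended_price 0.9   # hypothetical 90% cache-hit share, prints 0.0468
</code></pre>
<p>Under that assumption the effective input price is roughly a tenth of the cache-miss price, which is why the hit share matters more than the headline rate.</p>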
<hr>
<h2 id="step-2-configure-environment-variables">Step 2: Configure Environment Variables</h2>
<p>Create a shell script (we call ours <code>claude.sh</code>) and source it before launching Claude Code:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">8
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">export</span> <span style="color:#8be9fd;font-style:italic">ANTHROPIC_BASE_URL</span><span style="color:#ff79c6">=</span>https://api.deepseek.com/anthropic
</span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">export</span> <span style="color:#8be9fd;font-style:italic">ANTHROPIC_AUTH_TOKEN</span><span style="color:#ff79c6">=</span>&lt;your-deepseek-api-key&gt;
</span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">export</span> <span style="color:#8be9fd;font-style:italic">ANTHROPIC_MODEL</span><span style="color:#ff79c6">=</span>deepseek-v4-pro<span style="color:#ff79c6">[</span>1m<span style="color:#ff79c6">]</span>
</span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">export</span> <span style="color:#8be9fd;font-style:italic">ANTHROPIC_DEFAULT_OPUS_MODEL</span><span style="color:#ff79c6">=</span>deepseek-v4-pro<span style="color:#ff79c6">[</span>1m<span style="color:#ff79c6">]</span>
</span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">export</span> <span style="color:#8be9fd;font-style:italic">ANTHROPIC_DEFAULT_SONNET_MODEL</span><span style="color:#ff79c6">=</span>deepseek-v4-pro<span style="color:#ff79c6">[</span>1m<span style="color:#ff79c6">]</span>
</span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">export</span> <span style="color:#8be9fd;font-style:italic">ANTHROPIC_DEFAULT_HAIKU_MODEL</span><span style="color:#ff79c6">=</span>deepseek-v4-flash
</span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">export</span> <span style="color:#8be9fd;font-style:italic">CLAUDE_CODE_SUBAGENT_MODEL</span><span style="color:#ff79c6">=</span>deepseek-v4-flash
</span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">export</span> <span style="color:#8be9fd;font-style:italic">CLAUDE_CODE_EFFORT_LEVEL</span><span style="color:#ff79c6">=</span>max
</span></span></code></pre></td></tr></table>
</div>
</div><p>Then launch:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">source</span> claude.sh
</span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">cd</span> /path/to/your/project
</span></span><span style="display:flex;"><span>claude
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>What each variable does:</strong></p>
<table>
  <thead>
      <tr>
          <th>Variable</th>
          <th>Purpose</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>ANTHROPIC_BASE_URL</code></td>
          <td>Redirects all API calls to DeepSeek&rsquo;s Anthropic-compatible endpoint</td>
      </tr>
      <tr>
          <td><code>ANTHROPIC_AUTH_TOKEN</code></td>
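          <td>Your DeepSeek API key. Claude Code sends this value as a bearer token in the <code>Authorization</code> header, whereas <code>ANTHROPIC_API_KEY</code> is sent as <code>X-Api-Key</code>; this setup uses <code>ANTHROPIC_AUTH_TOKEN</code></td>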
      </tr>
      <tr>
          <td><code>ANTHROPIC_MODEL</code></td>
          <td>Default model for the main agent loop</td>
      </tr>
      <tr>
          <td><code>ANTHROPIC_DEFAULT_OPUS_MODEL</code></td>
          <td>What Claude Code uses when it internally requests Opus</td>
      </tr>
      <tr>
          <td><code>ANTHROPIC_DEFAULT_SONNET_MODEL</code></td>
          <td>What Claude Code uses when it internally requests Sonnet</td>
      </tr>
      <tr>
          <td><code>ANTHROPIC_DEFAULT_HAIKU_MODEL</code></td>
          <td>What Claude Code uses when it internally requests Haiku</td>
      </tr>
      <tr>
          <td><code>CLAUDE_CODE_SUBAGENT_MODEL</code></td>
          <td>Model for spawned sub-agents (Explore, Plan, etc.)</td>
      </tr>
      <tr>
          <td><code>CLAUDE_CODE_EFFORT_LEVEL</code></td>
          <td>Thinking budget — <code>max</code> gives the model the most reasoning tokens</td>
      </tr>
  </tbody>
</table>
<p>The <code>[1m]</code> suffix on model names is a DeepSeek convention for requesting the 1M-token context window. Without it, you get the default context length.</p>
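<p>A typo in any of these names fails silently, because Claude Code simply falls back to its defaults. The sketch below is one way to check the environment before launching; <code>check_claude_env</code> is a hypothetical helper of ours, not part of Claude Code or DeepSeek&rsquo;s tooling.</p>
<pre tabindex="0"><code class="language-bash" data-lang="bash"># Preflight check: report any variable from the script above that is not
# exported. printenv only sees exported variables, which is exactly what
# a child process such as claude will inherit.
check_claude_env() {
  local missing=0 name
  for name in ANTHROPIC_BASE_URL ANTHROPIC_AUTH_TOKEN ANTHROPIC_MODEL \
              ANTHROPIC_DEFAULT_OPUS_MODEL ANTHROPIC_DEFAULT_SONNET_MODEL \
              ANTHROPIC_DEFAULT_HAIKU_MODEL CLAUDE_CODE_SUBAGENT_MODEL \
              CLAUDE_CODE_EFFORT_LEVEL; do
    if [ -z "$(printenv "$name")" ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}
</code></pre>
<p>After sourcing <code>claude.sh</code>, running <code>check_claude_env &amp;&amp; claude</code> starts Claude Code only when all eight variables are present.</p>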
<hr>
<h2 id="step-3-understand-model-mapping">Step 3: Understand Model Mapping</h2>
<p>DeepSeek does automatic model name mapping. When Claude Code internally requests <code>claude-opus-4-7</code>, DeepSeek&rsquo;s API maps it:</p>
<pre tabindex="0"><code>claude-opus-*    →  deepseek-v4-pro
claude-sonnet-*  →  deepseek-v4-flash
claude-haiku-*   →  deepseek-v4-flash
</code></pre><p>This means you don&rsquo;t need to patch Claude Code&rsquo;s source. When the agent decides it needs &ldquo;Opus-level&rdquo; reasoning, DeepSeek routes it to V4 Pro. When it wants Haiku for a fast sub-agent, it gets V4 Flash.</p>
<p>We explicitly set the model variables anyway (rather than relying on mapping) because it gives us control over which model handles sub-agents. V4 Flash is fast enough for search and file-reading sub-agents, and it&rsquo;s 3× cheaper than V4 Pro on input.</p>
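<p>The mapping table above can be read as three glob patterns. The sketch below restates it as a shell <code>case</code> statement purely for illustration: <code>map_model</code> is a hypothetical helper, it is not DeepSeek&rsquo;s server-side code, and the pass-through branch for other names is our assumption.</p>
<pre tabindex="0"><code class="language-bash" data-lang="bash"># Illustration of the name mapping listed above, written as shell glob patterns.
map_model() {
  case "$1" in
    claude-opus-*)   echo "deepseek-v4-pro" ;;
    claude-sonnet-*) echo "deepseek-v4-flash" ;;
    claude-haiku-*)  echo "deepseek-v4-flash" ;;
    *)               echo "$1" ;;   # assumption: other names pass through unchanged
  esac
}

map_model claude-opus-4-7   # prints deepseek-v4-pro
</code></pre>
<p>Note that the default mapping sends Sonnet requests to V4 Flash, while our explicit configuration points <code>ANTHROPIC_DEFAULT_SONNET_MODEL</code> at V4 Pro. That difference is one more reason we set the variables by hand.</p>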
<hr>
<h2 id="what-actually-works">What Actually Works</h2>
<h3 id="tool-calling">Tool Calling</h3>
<p>DeepSeek&rsquo;s Anthropic API supports the <code>tool_use</code> and <code>tool_result</code> message types; only the <code>cache_control</code> and <code>is_error</code> fields are ignored. Claude Code&rsquo;s entire agent loop is built on tool calling (Read, Write, Edit, Bash, Grep, Glob), and all of it works.</p>
<pre tabindex="0"><code>Message: array, type = &#34;tool_use&#34;
  - id:           Fully Supported
  - input:        Fully Supported
  - name:         Fully Supported
  - cache_control: Ignored

Message: array, type = &#34;tool_result&#34;
  - tool_use_id:  Fully Supported
  - content:      Fully Supported
  - is_error:     Ignored
</code></pre><h3 id="sub-agent-spawning">Sub-Agent Spawning</h3>
<p>Claude Code spawns specialized sub-agents (Explore for file search, Plan for architecture, etc.) using the <code>Agent</code> tool. Each sub-agent is itself a tool-calling loop with restricted permissions. This works on DeepSeek — we&rsquo;ve tested multi-agent sessions where the main V4 Pro agent spawns V4 Flash sub-agents for file search, and the results flow back correctly.</p>
<h3 id="web-search-the-surprising-part">Web Search (the surprising part)</h3>
<p>This is the feature that caught us off guard. <strong>DeepSeek&rsquo;s API natively supports Claude Code&rsquo;s built-in Web Search tool.</strong> When the model determines your question needs web results, it invokes the search tool through DeepSeek&rsquo;s own search infrastructure — not Anthropic&rsquo;s.</p>
<p>From DeepSeek&rsquo;s documentation:</p>
<blockquote>
<p>&ldquo;The DeepSeek API natively supports the Web Search feature in Claude Code. When using Claude Code, if the model determines that your question requires a web search, it will invoke the Web Search tool and perform the search through the API provided by DeepSeek.&rdquo;</p>
</blockquote>
<p>In practice: ask Claude Code &ldquo;what&rsquo;s the latest version of LangGraph?&rdquo; and it will trigger a web search, get results, and summarize them — all through DeepSeek. The <code>web_search_tool_result</code> message type is fully supported in the API.</p>
<p><strong>Cost caveat:</strong> Each web search triggers additional LLM API calls to summarize the retrieved content. DeepSeek bills these as normal token usage. A single search-then-summarize cycle might consume 5–20K extra input tokens.</p>
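<p>To put that range in dollars, here is a small sketch that prices the extra input tokens at the V4 Pro cache-miss rate from the pricing table. The helper name <code>search_cost</code> is ours, and it assumes none of the extra tokens hit the cache, so real costs may be lower.</p>
<pre tabindex="0"><code class="language-bash" data-lang="bash"># Cost in dollars of the extra input tokens from one search-then-summarize
# cycle, at the V4 Pro cache-miss price of $0.435 per million input tokens.
search_cost() {
  local tokens="$1"   # extra input tokens consumed by the cycle
  awk -v t="$tokens" -v price=0.435 'BEGIN { printf "%.4f\n", t / 1000000 * price }'
}

search_cost 5000    # lower end of the range, prints 0.0022
search_cost 20000   # upper end of the range, prints 0.0087
</code></pre>
<p>So a single search costs well under a cent on input; it only becomes visible on the bill when a session triggers searches by the hundreds.</p>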
<h3 id="thinking-mode">Thinking Mode</h3>
<p>DeepSeek V4 supports thinking mode (extended reasoning). The <code>thinking</code> field in the API is supported, though <code>budget_tokens</code> is ignored — DeepSeek manages its own reasoning budget internally. Setting <code>CLAUDE_CODE_EFFORT_LEVEL=max</code> gives the model maximum latitude to think.</p>
<h3 id="streaming">Streaming</h3>
<p>Fully supported. Responses stream token-by-token just like native Claude.</p>
<hr>
<h2 id="what-doesnt-work">What Doesn&rsquo;t Work</h2>
<p>Being honest about limitations matters. DeepSeek&rsquo;s Anthropic API is not a perfect clone — it&rsquo;s a pragmatic subset.</p>
<h3 id="no-image-or-document-input">No Image or Document Input</h3>
<pre tabindex="0"><code>array, type = &#34;image&#34;     →  Not Supported
array, type = &#34;document&#34;  →  Not Supported
</code></pre><p>You can&rsquo;t paste screenshots or upload PDFs through Claude Code when using DeepSeek. If your workflow involves vision tasks (analyzing UI mockups, reading diagrams), you need native Claude or a vision-capable model for those sessions.</p>
<h3 id="no-prompt-caching">No Explicit Cache Control</h3>
<pre tabindex="0"><code>cache_control  →  Ignored (on tools, messages, and tool results)
</code></pre><p>DeepSeek has its own context caching (cache hits are priced separately), but the Anthropic-compatible endpoint ignores <code>cache_control</code> markers. Caching happens at DeepSeek&rsquo;s discretion based on content similarity, not explicit breakpoints.</p>
<h3 id="no-mcp-tool-passthrough">No MCP Tool Passthrough</h3>
<pre tabindex="0"><code>array, type = &#34;mcp_tool_use&#34;     →  Not Supported
array, type = &#34;mcp_tool_result&#34;  →  Not Supported
</code></pre><p>MCP (Model Context Protocol) tools work differently — they&rsquo;re handled client-side by Claude Code, not server-side by the API. So MCP tools like SearXNG, filesystem watchers, or database connectors still work because Claude Code intercepts them before they hit the API. The &ldquo;not supported&rdquo; here means DeepSeek&rsquo;s API won&rsquo;t process MCP messages natively, which doesn&rsquo;t affect actual functionality.</p>
<h3 id="minor-field-ignorances">Minor Ignored Fields</h3>
<ul>
<li><code>top_k</code>: ignored</li>
<li><code>anthropic-beta</code> / <code>anthropic-version</code> headers: ignored</li>
<li><code>container</code>, <code>mcp_servers</code>, <code>service_tier</code>: ignored</li>
<li><code>stop_sequences</code>: fully supported (listed here for contrast)</li>
</ul>
<p>None of these affect core Claude Code functionality.</p>
<hr>
<h2 id="cost-comparison">Cost Comparison</h2>
<p>A realistic Claude Code session: ~500K input tokens (system prompt + context + tool definitions) and ~50K output tokens.</p>
<table>
  <thead>
      <tr>
          <th>Provider</th>
          <th>Input Cost</th>
          <th>Output Cost</th>
          <th>Session Total</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Claude Opus 4.7 (direct)</td>
          <td>~$7.50</td>
          <td>~$3.75</td>
          <td><strong>~$11.25</strong></td>
      </tr>
      <tr>
          <td>DeepSeek V4 Pro</td>
          <td>~$0.22</td>
          <td>~$0.04</td>
          <td><strong>~$0.26</strong></td>
      </tr>
      <tr>
          <td>DeepSeek V4 Flash (sub-agents)</td>
          <td>~$0.07</td>
          <td>~$0.01</td>
          <td><strong>~$0.08</strong></td>
      </tr>
  </tbody>
</table>
<p>With the main agent on V4 Pro and sub-agents on V4 Flash, a typical mixed session costs around <strong>$0.15–0.30</strong>. That&rsquo;s roughly <strong>30–70× cheaper</strong> than Claude Opus direct — and it&rsquo;s the permanent price, not a limited-time promo.</p>
<p>The cache hit pricing makes repetitive sessions (same project, same CLAUDE.md, same tool definitions) even cheaper. Our workspace loads ~80K tokens of context on every session start — most of that hits cache at $0.003625/M.</p>
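<p>The session totals in the table can be reproduced with a few lines of shell. This is a sketch using the prices quoted in this post and the assumed 500K-input / 50K-output session; <code>session_cost</code> is a hypothetical helper, and it ignores cache hits entirely.</p>
<pre tabindex="0"><code class="language-bash" data-lang="bash"># Recompute the session totals from the table above.
# Arguments: input tokens, output tokens, input $/M, output $/M.
session_cost() {
  awk -v in_tok="$1" -v out_tok="$2" -v in_price="$3" -v out_price="$4" \
    'BEGIN { printf "%.2f\n", (in_tok * in_price + out_tok * out_price) / 1000000 }'
}

session_cost 500000 50000 15    75     # Claude Opus 4.7 at the prices quoted earlier, prints 11.25
session_cost 500000 50000 0.435 0.87   # DeepSeek V4 Pro, prints 0.26
session_cost 500000 50000 0.14  0.28   # DeepSeek V4 Flash, prints 0.08
</code></pre>
<p>By these numbers an all-V4-Pro session comes out about 43× cheaper than the Opus session, which sits inside the range given above.</p>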
<hr>
<h2 id="real-world-tips">Real-World Tips</h2>
<h3 id="use-a-wrapper-script">Use a wrapper script</h3>
<p>Don&rsquo;t export environment variables in your <code>.bashrc</code> globally — you&rsquo;ll accidentally use DeepSeek for tools that need native Claude (like vision tasks). We use a <code>claude.sh</code> script:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#ff79c6">#!/bin/bash
</span></span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">export</span> <span style="color:#8be9fd;font-style:italic">ANTHROPIC_BASE_URL</span><span style="color:#ff79c6">=</span>https://api.deepseek.com/anthropic
</span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">export</span> <span style="color:#8be9fd;font-style:italic">ANTHROPIC_AUTH_TOKEN</span><span style="color:#ff79c6">=</span>sk-your-key-here
</span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">export</span> <span style="color:#8be9fd;font-style:italic">ANTHROPIC_MODEL</span><span style="color:#ff79c6">=</span>deepseek-v4-pro<span style="color:#ff79c6">[</span>1m<span style="color:#ff79c6">]</span>
</span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">export</span> <span style="color:#8be9fd;font-style:italic">ANTHROPIC_DEFAULT_OPUS_MODEL</span><span style="color:#ff79c6">=</span>deepseek-v4-pro<span style="color:#ff79c6">[</span>1m<span style="color:#ff79c6">]</span>
</span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">export</span> <span style="color:#8be9fd;font-style:italic">ANTHROPIC_DEFAULT_SONNET_MODEL</span><span style="color:#ff79c6">=</span>deepseek-v4-pro<span style="color:#ff79c6">[</span>1m<span style="color:#ff79c6">]</span>
</span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">export</span> <span style="color:#8be9fd;font-style:italic">ANTHROPIC_DEFAULT_HAIKU_MODEL</span><span style="color:#ff79c6">=</span>deepseek-v4-flash
</span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">export</span> <span style="color:#8be9fd;font-style:italic">CLAUDE_CODE_SUBAGENT_MODEL</span><span style="color:#ff79c6">=</span>deepseek-v4-flash
</span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">export</span> <span style="color:#8be9fd;font-style:italic">CLAUDE_CODE_EFFORT_LEVEL</span><span style="color:#ff79c6">=</span>max
</span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">export</span> <span style="color:#8be9fd;font-style:italic">ANTHROPIC_USER_ID</span><span style="color:#ff79c6">=</span><span style="color:#f1fa8c">&#34;your-username&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">exec</span> claude <span style="color:#f1fa8c">&#34;</span><span style="color:#8be9fd;font-style:italic">$@</span><span style="color:#f1fa8c">&#34;</span>
</span></span></code></pre></td></tr></table>
</div>
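</div><p>Run it with <code>bash claude.sh</code>, or make it executable and run <code>./claude.sh</code>. Don&rsquo;t <code>source</code> this version: the final <code>exec</code> would replace your interactive shell with Claude Code, and the shell would be gone when Claude Code exits.</p>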
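<p>If you would rather not keep a separate script file, a shell function can give the same isolation. The sketch below is one possible shape, under stated assumptions: <code>claude_ds</code> is a name we made up, and <code>DEEPSEEK_API_KEY</code> is a hypothetical variable holding your key so it does not have to be written into the function.</p>
<pre tabindex="0"><code class="language-bash" data-lang="bash"># The parentheses make the function body run in a subshell, so the exports
# below disappear when claude exits and never reach your interactive shell.
claude_ds() (
  export ANTHROPIC_BASE_URL=https://api.deepseek.com/anthropic
  # DEEPSEEK_API_KEY is a hypothetical variable; the call aborts if it is unset.
  export ANTHROPIC_AUTH_TOKEN="${DEEPSEEK_API_KEY:?set DEEPSEEK_API_KEY first}"
  # Quoting the value keeps the brackets from being treated as a glob pattern.
  export ANTHROPIC_MODEL='deepseek-v4-pro[1m]'
  export ANTHROPIC_DEFAULT_OPUS_MODEL='deepseek-v4-pro[1m]'
  export ANTHROPIC_DEFAULT_SONNET_MODEL='deepseek-v4-pro[1m]'
  export ANTHROPIC_DEFAULT_HAIKU_MODEL=deepseek-v4-flash
  export CLAUDE_CODE_SUBAGENT_MODEL=deepseek-v4-flash
  export CLAUDE_CODE_EFFORT_LEVEL=max
  claude "$@"
)
</code></pre>
<p>With this in your shell startup file, <code>claude_ds</code> talks to DeepSeek while a plain <code>claude</code> in the same terminal still uses your normal Anthropic configuration.</p>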
<h3 id="the-anthropic_user_id-matters">The <code>ANTHROPIC_USER_ID</code> matters</h3>
<p>DeepSeek supports the <code>user_id</code> metadata field for rate limit isolation. Setting <code>ANTHROPIC_USER_ID</code> ensures your requests are bucketed separately from other users on the same API key — useful if you share a key across projects.</p>
<h3 id="v4-flash-for-routine-work">V4 Flash for routine work</h3>
<p>If you&rsquo;re doing routine file editing, formatting, or batch operations, swap <code>ANTHROPIC_MODEL</code> to <code>deepseek-v4-flash</code>. It&rsquo;s 3× cheaper and fast enough for non-reasoning tasks. Save V4 Pro for architecture decisions, debugging, and complex multi-step problems.</p>
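<p>The wrapper script above always exports <code>ANTHROPIC_MODEL</code>, so switching models means editing the file. One possible variation, sketched here as an assumption about how you might restructure your own script, is to treat V4 Pro as a default that an existing environment value can override.</p>
<pre tabindex="0"><code class="language-bash" data-lang="bash"># Default to V4 Pro, but let a value already present in the environment win.
# ${VAR:-default} expands to the default only when VAR is unset or empty.
export ANTHROPIC_MODEL="${ANTHROPIC_MODEL:-deepseek-v4-pro[1m]}"
</code></pre>
<p>With that line in place of the fixed export, <code>ANTHROPIC_MODEL=deepseek-v4-flash bash claude.sh</code> runs a single session on V4 Flash, and a plain <code>bash claude.sh</code> stays on V4 Pro.</p>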
<hr>
<h2 id="the-bottom-line">The Bottom Line</h2>
<p>DeepSeek&rsquo;s Anthropic-compatible API is the most seamless third-party Claude Code integration available. No proxy server, no SDK patches, no feature gaps on the things that matter (tool calling, sub-agents, web search). The only real limitation is vision — if you need image input, you still need native Claude.</p>
<p>For pure coding work, the cost savings are dramatic enough that there&rsquo;s no reason not to try it. Eight environment variables, one API key, and you&rsquo;re running.</p>
<hr>
<h2 id="the-configuration">The Configuration</h2>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">8
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">9
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#6272a4"># claude.sh — Source this before running `claude`</span>
</span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">export</span> <span style="color:#8be9fd;font-style:italic">ANTHROPIC_BASE_URL</span><span style="color:#ff79c6">=</span>https://api.deepseek.com/anthropic
</span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">export</span> <span style="color:#8be9fd;font-style:italic">ANTHROPIC_AUTH_TOKEN</span><span style="color:#ff79c6">=</span>&lt;your-key&gt;
</span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">export</span> <span style="color:#8be9fd;font-style:italic">ANTHROPIC_MODEL</span><span style="color:#ff79c6">=</span>deepseek-v4-pro<span style="color:#ff79c6">[</span>1m<span style="color:#ff79c6">]</span>
</span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">export</span> <span style="color:#8be9fd;font-style:italic">ANTHROPIC_DEFAULT_OPUS_MODEL</span><span style="color:#ff79c6">=</span>deepseek-v4-pro<span style="color:#ff79c6">[</span>1m<span style="color:#ff79c6">]</span>
</span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">export</span> <span style="color:#8be9fd;font-style:italic">ANTHROPIC_DEFAULT_SONNET_MODEL</span><span style="color:#ff79c6">=</span>deepseek-v4-pro<span style="color:#ff79c6">[</span>1m<span style="color:#ff79c6">]</span>
</span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">export</span> <span style="color:#8be9fd;font-style:italic">ANTHROPIC_DEFAULT_HAIKU_MODEL</span><span style="color:#ff79c6">=</span>deepseek-v4-flash
</span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">export</span> <span style="color:#8be9fd;font-style:italic">CLAUDE_CODE_SUBAGENT_MODEL</span><span style="color:#ff79c6">=</span>deepseek-v4-flash
</span></span><span style="display:flex;"><span><span style="color:#8be9fd;font-style:italic">export</span> <span style="color:#8be9fd;font-style:italic">CLAUDE_CODE_EFFORT_LEVEL</span><span style="color:#ff79c6">=</span>max
</span></span></code></pre></td></tr></table>
</div>
</div><hr>
<p><em>References</em></p>
<ul>
<li><a href="https://api-docs.deepseek.com/quick_start/agent_integrations/claude_code">DeepSeek API: Integrate with Claude Code</a></li>
<li><a href="https://api-docs.deepseek.com/guides/anthropic_api">DeepSeek API: Anthropic API Compatibility</a></li>
<li><a href="https://api-docs.deepseek.com/quick_start/pricing">DeepSeek API: Models &amp; Pricing</a></li>
<li><a href="https://docs.anthropic.com/en/docs/claude-code">Claude Code Official Documentation</a></li>
</ul>
<hr>
<p><em>Built with: Claude Code (latest), DeepSeek V4 Pro + V4 Flash, Node.js 22. Written from a real multi-project workspace running this setup daily.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>How Claude Code&#39;s Agent Architecture Works — and How We Built a Similar System for a Terraria Server</title>
      <link>https://aibrew.ai/2026/05/how-claude-codes-agent-architecture-works-and-how-we-built-a-similar-system-for-a-terraria-server/</link>
      <pubDate>Wed, 27 May 2026 00:00:00 +0000</pubDate>
      <guid>https://aibrew.ai/2026/05/how-claude-codes-agent-architecture-works-and-how-we-built-a-similar-system-for-a-terraria-server/</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — We reverse-engineered Claude Code&amp;rsquo;s agent architecture from its TypeScript source to understand how it handles security, complex tasks, and tool permissions. Then we applied those patterns to an open-source Terraria AI bridge that lets players talk to an LLM inside the game. Here&amp;rsquo;s what we found, what we built, and what we learned about practical agent design.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;h2 id=&#34;why-we-cracked-open-claude-codes-source&#34;&gt;Why We Cracked Open Claude Code&amp;rsquo;s Source&lt;/h2&gt;
&lt;p&gt;Claude Code isn&amp;rsquo;t just a coding assistant. Under the hood it&amp;rsquo;s an agent runtime — it spawns sub-agents, manages file permissions, runs bash commands, and decides when to ask the user vs. just doing the thing. We wanted to understand how it works so we could apply the same ideas to a completely different domain: a Terraria game server.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<blockquote>
<p><strong>TL;DR</strong> — We reverse-engineered Claude Code&rsquo;s agent architecture from its TypeScript source to understand how it handles security, complex tasks, and tool permissions. Then we applied those patterns to an open-source Terraria AI bridge that lets players talk to an LLM inside the game. Here&rsquo;s what we found, what we built, and what we learned about practical agent design.</p>
</blockquote>
<hr>
<h2 id="why-we-cracked-open-claude-codes-source">Why We Cracked Open Claude Code&rsquo;s Source</h2>
<p>Claude Code isn&rsquo;t just a coding assistant. Under the hood it&rsquo;s an agent runtime — it spawns sub-agents, manages file permissions, runs bash commands, and decides when to ask the user vs. just doing the thing. We wanted to understand how it works so we could apply the same ideas to a completely different domain: a Terraria game server.</p>
<p>Our project, <a href="https://github.com/d99sfrmdbz-debug/terra_llm_bridge">terra_llm_bridge</a>, connects a Terraria TShock server to an LLM. Players type <code>@ai</code> in chat and get responses — but the LLM can also <em>act</em>: give items, change weather, teleport players, even toggle hardmode. That last one is where we learned our lesson.</p>
<p>The first time a player asked the AI to set the weather to rain, the LLM autonomously decided to call <code>terra_world_hardmode(confirm=True)</code> — toggling <em>irreversible</em> hardmode for the entire server. No player had asked for it. The model just&hellip; did it.</p>
<p>We needed a real permission system. So we went looking at how Claude Code does it.</p>
<hr>
<h2 id="claude-codes-7-layer-permission-architecture">Claude Code&rsquo;s 7-Layer Permission Architecture</h2>
<p>Reading through ~1,500 lines of <code>src/utils/permissions/permissions.ts</code> plus the Agent tool infrastructure (~3,800 lines), a clear architecture emerged. Claude Code doesn&rsquo;t have one security check — it has <strong>seven</strong>:</p>
<pre tabindex="0"><code>Layer 1a: Deny rules   →  &#34;Never allow Bash(git push --force)&#34;
Layer 1b: Ask rules    →  &#34;Always prompt for Bash(curl *)&#34;
Layer 1c: Tool self-check  →  Each tool&#39;s checkPermissions() method
Layer 1d: Tool self-deny   →  Read tool whitelists specific paths
Layer 1f: Content-specific rules  →  &#34;Even in bypass mode, ask for npm publish&#34;
Layer 1g: Safety checks  →  &#34;.git/, .claude/ are ALWAYS bypass-immune&#34;
Layer 2:  Mode-based bypass  →  bypassPermissions / auto / acceptEdits / dontAsk
Layer 3:  YOLO classifier →  AI reads the transcript, decides if safe
</code></pre><p>The most interesting layer is the <strong>YOLO classifier</strong> — a separate small model that reads the full conversation transcript and classifies each tool call as safe or dangerous. It&rsquo;s a two-stage system: a fast classifier for obvious cases, and a deeper thinking classifier for edge cases.</p>
<p>But the layer that matters most for our use case isn&rsquo;t the AI classifier. It&rsquo;s how Claude Code <strong>structurally prevents certain tools from being called in the wrong context</strong> — through tool allowlists, denylists, and sub-agent specialization.</p>
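<p>The idea of ordered layers can be sketched in a few lines. This is a simplified illustration, not Claude Code&rsquo;s actual implementation: the <code>Rules</code> class, the <code>check_permission</code> function, the glob-style patterns, and the exact ordering of the checks are all assumptions made for the example.</p>

```python
from dataclasses import dataclass
from enum import Enum
from fnmatch import fnmatchcase


class Decision(Enum):
    ALLOW = "allow"
    ASK = "ask"
    DENY = "deny"


@dataclass(frozen=True)
class Rules:
    # All four fields are hypothetical names for this sketch.
    deny: tuple = ()       # never allowed, whatever the mode
    protected: tuple = ()  # bypass-immune: always prompt the user
    ask: tuple = ()        # prompt unless a bypass mode is active
    allow: tuple = ()      # known-safe calls that run without a prompt


def matches(call, patterns):
    """True if the call string matches any glob-style pattern."""
    return any(fnmatchcase(call, p) for p in patterns)


def check_permission(call, rules, bypass=False):
    """Walk the layers in order; the first layer with an opinion decides."""
    if matches(call, rules.deny):
        return Decision.DENY
    if matches(call, rules.protected):
        return Decision.ASK      # safety checks ignore bypass mode
    if bypass:
        return Decision.ALLOW    # mode-based bypass
    if matches(call, rules.ask):
        return Decision.ASK
    if matches(call, rules.allow):
        return Decision.ALLOW
    return Decision.ASK          # conservative default for unknown calls


rules = Rules(
    deny=("Bash(git push --force)",),
    protected=("Edit(.git/*)",),
    ask=("Bash(curl *)",),
    allow=("Read(*)",),
)
print(check_permission("Edit(.git/config)", rules, bypass=True))
```

<p>The point of the ordering is that a later, more permissive layer can never override an earlier, stricter one: bypass mode is only consulted after the deny and bypass-immune checks have had their say.</p>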
<hr>
<h2 id="the-agent-pattern-not-multi-agent-but-specialized-workers">The Agent Pattern: Not Multi-Agent, but Specialized Workers</h2>
<p>Claude Code doesn&rsquo;t use multi-agent &ldquo;collaboration&rdquo; in the negotiation sense. It uses a <strong>single coordinator that spawns specialized workers</strong>:</p>
<pre tabindex="0"><code>Main Agent (Tool Calling, all tools)
  │
  ├─ Simple: &#34;read file X&#34; → Read tool
  │
  └─ Complex: &#34;audit this branch&#34; → Agent(&#34;Explore&#34;)
                                       │
                                       ├─ Tools: [Read, Grep, Glob]  ← whitelist
                                       ├─ Disallowed: [Edit, Write]   ← denylist
                                       ├─ System prompt: &#34;You are a file search specialist&#34;
                                       └─ Returns findings → Main agent acts on them
</code></pre><p>Each sub-agent type is defined by three things:</p>
<ol>
<li><strong>Tool permissions</strong> (allowlist + denylist) — what it can touch</li>
<li><strong>System prompt</strong> — specialized instructions for its role</li>
<li><strong>Model</strong> — Explore agents use Haiku ($) for speed; Plan agents use Sonnet for reasoning</li>
</ol>
<p>The key insight: <strong>the main agent doesn&rsquo;t get more complex</strong>. It stays simple but has ONE tool (<code>Agent</code>) that lets it offload complex work. The sub-agent is just another Tool Calling loop with restricted tools and a different prompt.</p>
<p>This architecture is elegant because it composes: each piece is simple, but the combination handles complexity that would overwhelm a single prompt.</p>
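<p>A minimal sketch of what such a sub-agent definition could look like as data. The <code>AgentSpec</code> class, its field names, and the <code>"haiku"</code> model label are hypothetical; only the tool names and the system prompt come from the diagram above.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical description of one specialized worker."""
    name: str
    system_prompt: str
    model: str
    allowed_tools: frozenset
    disallowed_tools: frozenset = frozenset()

    def effective_tools(self, registry):
        """Allowlist minus denylist, limited to tools that actually exist."""
        visible = (self.allowed_tools - self.disallowed_tools) & set(registry)
        return sorted(visible)


# Illustrative registry; the real tool surface is larger.
ALL_TOOLS = ["Read", "Grep", "Glob", "Edit", "Write", "Bash"]

EXPLORE = AgentSpec(
    name="Explore",
    system_prompt="You are a file search specialist",
    model="haiku",  # placeholder label, not a real model identifier
    allowed_tools=frozenset({"Read", "Grep", "Glob"}),
    disallowed_tools=frozenset({"Edit", "Write"}),
)

print(EXPLORE.effective_tools(ALL_TOOLS))
```

<p>Because the worker only ever receives the result of <code>effective_tools</code>, the restriction holds regardless of what the prompt says.</p>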
<hr>
<h2 id="how-we-applied-this-to-terra_llm_bridge">How We Applied This to terra_llm_bridge</h2>
<p>Our Terraria bridge has a simpler job than Claude Code — 46 tools instead of hundreds, and the &ldquo;security&rdquo; problem is &ldquo;don&rsquo;t let the AI toggle hardmode when the player asked about weather&rdquo; rather than &ldquo;don&rsquo;t let the AI rm -rf /&rdquo;. But the patterns transfer directly.</p>
<h3 id="the-problem">The Problem</h3>
<p>Before: our LLM saw all 46 tools at once. When a player asked &ldquo;give me the strongest armor set,&rdquo; the LLM would fire <code>wiki_search</code> AND <code>give_item</code> in parallel — researching while also pre-committing to Solar Flare Armor before reading the wiki results. Sometimes it guessed right. Sometimes it gave a summoner player melee gear.</p>
<h3 id="our-solution-two-phase-tool-access">Our Solution: Two-Phase Tool Access</h3>
<p>We didn&rsquo;t add sub-agents — that would be overkill for 46 tools. Instead, we applied the <strong>tool restriction pattern</strong> at the graph level:</p>
<pre tabindex="0"><code>route → llm(research)  ⇄  tool      →  escalate  →  llm(action)  ⇄  authorize  ⇄  tool  →  output
         17 read tools                            46 full tools     keyword gate
         wiki, lookup, status                     give, kick, spawn
</code></pre><p>The graph has two phases, joined by an escalation step:</p>
<p><strong>Research phase</strong> — the LLM gets only 17 read-only tools (wiki_search, item_lookup, player_list, world_info, etc.). It <em>cannot</em> call give_item, kick, spawn, or any destructive tool. It researches first.</p>
<p><strong>Escalate</strong> — when the LLM produces text (no more tool calls needed), the graph automatically flips to action mode and injects a hint: &ldquo;You now have access to ALL tools.&rdquo;</p>
<p><strong>Action phase</strong> — the LLM gets the full 46-tool set and can act on what it found.</p>
<p>This is structurally enforced. Not a prompt suggestion. The LLM physically cannot call <code>give_item</code> during research because the tool isn&rsquo;t bound.</p>
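<p>The phase logic can be sketched without any framework. The code below is an illustration under assumptions, not the project&rsquo;s LangGraph code: the <code>Phase</code> enum and the three helper functions are hypothetical, and the tool sets are a small subset standing in for the real 17 and 46.</p>

```python
from enum import Enum


class Phase(Enum):
    RESEARCH = "research"
    ACTION = "action"


# Subsets for illustration only.
READ_TOOLS = {"wiki_search", "item_lookup", "player_list", "world_info"}
ACTION_TOOLS = {"give_item", "kick", "spawn"}


def tools_for(phase):
    """Tools bound to the model in a given phase."""
    if phase is Phase.RESEARCH:
        return set(READ_TOOLS)
    return READ_TOOLS | ACTION_TOOLS


def next_phase(phase, requested_tools):
    """Escalate once the model answers with text only during research."""
    if phase is Phase.RESEARCH and not requested_tools:
        return Phase.ACTION
    return phase


def rejected_calls(phase, requested_tools):
    """Calls that cannot run because the tool is not bound in this phase."""
    bound = tools_for(phase)
    return [name for name in requested_tools if name not in bound]


print(rejected_calls(Phase.RESEARCH, ["wiki_search", "give_item"]))
```

<p>In a real graph the unbound tools are never offered to the model at all; <code>rejected_calls</code> is only there to make the restriction visible in a test.</p>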
<h3 id="the-permission-gate">The Permission Gate</h3>
<p>Before introducing the two-phase split, we had already added <code>authorize_node</code>, a hard gate between the LLM and ToolNode. It checks whether the player&rsquo;s recent chat messages contain keywords for the tool&rsquo;s domain:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>GATED_TOOLS <span style="color:#ff79c6">=</span> {
</span></span><span style="display:flex;"><span>    <span style="color:#f1fa8c">&#34;terra_world_hardmode&#34;</span>: {<span style="color:#f1fa8c">&#34;hardmode&#34;</span>, <span style="color:#f1fa8c">&#34;hard mode&#34;</span>, <span style="color:#f1fa8c">&#34;肉山&#34;</span>, <span style="color:#f1fa8c">&#34;困难模式&#34;</span>},
</span></span><span style="display:flex;"><span>    <span style="color:#f1fa8c">&#34;terra_player_kick&#34;</span>:    {<span style="color:#f1fa8c">&#34;kick&#34;</span>, <span style="color:#f1fa8c">&#34;踢出&#34;</span>, <span style="color:#f1fa8c">&#34;踢了&#34;</span>},
</span></span><span style="display:flex;"><span>    <span style="color:#f1fa8c">&#34;terra_server_stop&#34;</span>:    {<span style="color:#f1fa8c">&#34;stop server&#34;</span>, <span style="color:#f1fa8c">&#34;关服&#34;</span>, <span style="color:#f1fa8c">&#34;停服&#34;</span>},
</span></span><span style="display:flex;"><span>    <span style="color:#6272a4"># ... 8 more</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></td></tr></table>
</div>
</div><p>If the player says &ldquo;set weather to rain&rdquo; and the LLM tries to call <code>terra_world_hardmode</code>, authorize_node checks: do any of the hardmode keywords appear in the player&rsquo;s recent messages? No? <strong>Blocked.</strong> The tool call is replaced with a BLOCKED message before ToolNode ever sees it.</p>
<p>This is a coarse filter — it checks what the player <em>mentioned</em>, not what they <em>requested</em>. &ldquo;上次打肉山的时候&rdquo; (last time when I fought Wall of Flesh) would pass the keyword check even though the player didn&rsquo;t ask for hardmode. But coarse is fine here: the goal is blocking catastrophic mismatches (weather → hardmode), not perfect intent understanding.</p>
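<p>A minimal sketch of the keyword check itself, assuming tool calls arrive as dictionaries with a <code>name</code> key. The <code>authorize</code> and <code>gate_calls</code> functions and the shape of the BLOCKED entry are hypothetical; only the three gate entries shown earlier are reproduced.</p>

```python
GATED_TOOLS = {
    "terra_world_hardmode": {"hardmode", "hard mode", "肉山", "困难模式"},
    "terra_player_kick": {"kick", "踢出", "踢了"},
    "terra_server_stop": {"stop server", "关服", "停服"},
}


def authorize(tool_name, recent_messages):
    """True if the tool is ungated or a gate keyword appears in recent chat."""
    keywords = GATED_TOOLS.get(tool_name)
    if keywords is None:
        return True  # ungated tools pass straight through
    text = " ".join(recent_messages).lower()
    return any(keyword.lower() in text for keyword in keywords)


def gate_calls(tool_calls, recent_messages):
    """Swap unauthorized calls for a BLOCKED marker before execution."""
    gated = []
    for call in tool_calls:
        if authorize(call["name"], recent_messages):
            gated.append(call)
        else:
            reason = f"{call['name']} was not mentioned by the player"
            gated.append({"name": "BLOCKED", "reason": reason})
    return gated


print(gate_calls([{"name": "terra_world_hardmode"}], ["set weather to rain"]))
```

<p>The substring match is what makes the filter coarse: any mention of a keyword opens the gate, whether or not it was a request.</p>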
<hr>
<h2 id="what-we-chose-not-to-build">What We Chose NOT to Build</h2>
<h3 id="no-yolo-classifier">No YOLO Classifier</h3>
<p>Claude Code&rsquo;s AI classifier reads the full transcript and classifies tool calls as safe/dangerous. We didn&rsquo;t build this because:</p>
<ul>
<li>It adds latency — an extra LLM call before every gated tool execution</li>
<li>Most Terraria actions are low-stakes: a mistaken call (giving the wrong armor) is fixable</li>
<li>Keyword matching catches the catastrophic cases</li>
</ul>
<h3 id="no-sub-agent-spawning">No Sub-Agent Spawning</h3>
<p>Claude Code spawns sub-agent processes for complex tasks. We didn&rsquo;t need this because:</p>
<ul>
<li>Terraria tool surface is small (46 tools)</li>
<li>Multi-turn tool calling handles the complexity we actually face</li>
<li>Spawning sub-processes for a game chat bot is over-engineering</li>
</ul>
<h3 id="no-react-pattern">No ReAct Pattern</h3>
<p>The classic Thought → Action → Observation loop would add token overhead without changing our core capability. DeepSeek&rsquo;s thinking tokens already handle the reasoning, and the two-phase tool access enforces &ldquo;research before action&rdquo; more reliably than prompt-based ReAct would.</p>
<hr>
<h2 id="the-architecture-in-one-diagram">The Architecture in One Diagram</h2>
<pre tabindex="0"><code>┌──────────────────────────────────────────────────────────┐
│  Terraria Server (TShock + C# plugin, 24 game hooks)      │
│  Player types &#34;@ai give me the best armor&#34;                │
└──────────────────────┬───────────────────────────────────┘
                       │ JSON webhook
┌──────────────────────▼───────────────────────────────────┐
│  Python aiohttp listener (:9876)                          │
└──────────────────────┬───────────────────────────────────┘
                       │
┌──────────────────────▼───────────────────────────────────┐
│  LangGraph StateGraph                                     │
│                                                           │
│  route  →  llm(research)  ⇄  tool    17 read tools      │
│               │                                           │
│          escalate  →  llm(action)  ⇄  authorize  ⇄  tool │
│                          46 full tools    keyword gate    │
│               │                                           │
│             output  →  broadcast to game chat             │
│                                                           │
│  Memory: AsyncSqliteSaver per player (thread_id)          │
└──────────────────────────────────────────────────────────┘
                       │
         ┌─────────────┴──────────────┐
         ▼                            ▼
   TShock REST API              Terraria Wiki API
   (give / kick / spawn)        (terraria.wiki.gg)
</code></pre><hr>
<h2 id="source-diving-lessons">Source Diving Lessons</h2>
<p>Reading Claude Code&rsquo;s source taught us three things that apply to any agent project:</p>
<p><strong>1. Security is layered, not binary.</strong> A single <code>confirm</code> parameter is a soft suggestion to the LLM. Real security needs structural enforcement: the LLM shouldn&rsquo;t be able to call a tool it isn&rsquo;t authorized to use, the same way a web server shouldn&rsquo;t let you access endpoints without authentication, no matter how nicely you ask.</p>
<p><strong>2. Tool restrictions are the cheapest and most reliable form of safety.</strong> Claude Code&rsquo;s Explore agent is &ldquo;read-only&rdquo; not because of a prompt but because Edit and Write aren&rsquo;t in its tool list. Our research phase is &ldquo;research-first&rdquo; not because of a prompt but because give_item literally isn&rsquo;t bound. You can&rsquo;t prompt-inject your way past a tool that doesn&rsquo;t exist.</p>
<p><strong>3. Specialization beats complexity.</strong> Claude Code&rsquo;s sub-agents aren&rsquo;t smarter than the main agent — they&rsquo;re more constrained. Fewer tools + focused prompt = more reliable behavior. Our two-phase system does the same: constrain first, expand only when ready.</p>
<hr>
<h2 id="the-project">The Project</h2>
<p><code>terra_llm_bridge</code> is an open-source project connecting Terraria game servers to LLMs. It features:</p>
<ul>
<li><strong>24 game hooks</strong> — custom C# TShock plugin captures chat, boss kills, deaths, logins, and 20 more events</li>
<li><strong>46 admin tools</strong> — give items, manage players, control weather, spawn NPCs, manage regions and permissions</li>
<li><strong>Two-phase agent</strong> — research (17 tools) → action (46 tools)</li>
<li><strong>Hard permission gate</strong> — keyword-based authorize_node blocks unauthorized tool calls</li>
<li><strong>MCP server</strong> — same 46 tools exposed to Claude Code for server administration</li>
<li><strong>Persistent memory</strong> — per-player conversation history via LangGraph&rsquo;s AsyncSqliteSaver</li>
</ul>
<p>The project is currently in <strong>active testing</strong> and not yet published on GitHub. We&rsquo;re running it on a private Terraria server, iterating on the agent architecture before open-sourcing. If you&rsquo;re interested in the code or want early access, reach out.</p>
<hr>
<p><em>Built with: Python 3.14, LangGraph 1.x, DeepSeek (Anthropic-compatible API), C# .NET 9, TShock v6.1.0, aiohttp, httpx.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>n8n vs Dify: One We Adopted, One We Skipped</title>
      <link>https://aibrew.ai/2026/05/n8n-vs-dify-one-we-adopted-one-we-skipped/</link>
      <pubDate>Wed, 27 May 2026 00:00:00 +0000</pubDate>
      <guid>https://aibrew.ai/2026/05/n8n-vs-dify-one-we-adopted-one-we-skipped/</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — n8n and Dify often show up together in self-hosted AI evaluations, but they want to own very different layers of your stack. After evaluating both against a custom self-hosted AI setup, we adopted n8n and skipped Dify. The decision came down to one question — &lt;em&gt;&amp;ldquo;what slice does this want to own, and do I already own that slice?&amp;rdquo;&lt;/em&gt; — and the answer was opposite for the two platforms. This post lays out the framework so you can run the same evaluation on your own stack.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<blockquote>
<p><strong>TL;DR</strong> — n8n and Dify often show up together in self-hosted AI evaluations, but they want to own very different layers of your stack. After evaluating both against a custom self-hosted AI setup, we adopted n8n and skipped Dify. The decision came down to one question — <em>&ldquo;what slice does this want to own, and do I already own that slice?&rdquo;</em> — and the answer was opposite for the two platforms. This post lays out the framework so you can run the same evaluation on your own stack.</p>
</blockquote>
<hr>
<h2 id="why-this-comparison-matters-in-2026">Why This Comparison Matters in 2026</h2>
<p>Six months ago the question &ldquo;which OSS AI platform should I run?&rdquo; had maybe three serious answers. Today there are dozens, and they overlap aggressively. Dify and n8n keep showing up together in evaluation lists, partly because they&rsquo;re both written in TypeScript, both self-hostable via Docker, both have visual editors, and both can talk to LLMs.</p>
<p>That surface similarity is misleading. <strong>They want to own entirely different layers of your stack.</strong> Treating them as alternatives is a category error that will cost you a week of deployment work and rework.</p>
<p>What we concluded after evaluating both against an existing self-hosted setup:</p>
<ul>
<li><strong>Dify</strong> wants to be the orchestrator. If you already have one, Dify has nothing to offer.</li>
<li><strong>n8n</strong> wants to be the execution layer. If you don&rsquo;t have one, n8n is one of the best off-the-shelf options available.</li>
</ul>
<hr>
<h2 id="what-dify-is">What Dify Is</h2>
<p>Dify is an open-source LLM application development platform (Apache 2.0, 55k+ stars on GitHub). The pitch:</p>
<ul>
<li><strong>Visual workflow editor</strong> — drag nodes to build AI pipelines</li>
<li><strong>Built-in RAG</strong> — upload docs, get a queryable knowledge base</li>
<li><strong>Agent builder</strong> — pre-packaged prompt templates with tool calling</li>
<li><strong>Model gateway</strong> — abstract over OpenAI / Anthropic / DeepSeek / local</li>
<li><strong>Observability dashboard</strong> — request logs, latency, cost</li>
</ul>
<p>In 2025–2026, Dify replaced its underlying LangChain with a custom &ldquo;Beehive Runtime,&rdquo; which is impressive engineering. The product is genuinely well-built.</p>
<p>The target user: someone who wants to ship an AI app <strong>without writing code</strong> or maintaining infrastructure pieces individually.</p>
<p>Same family: Flowise, Langflow, FastGPT. These are all &ldquo;platform-first&rdquo; AI builders.</p>
<hr>
<h2 id="what-n8n-is">What n8n Is</h2>
<p>n8n is open-source workflow automation (162k+ stars). Think Zapier, but self-hosted and with code escape hatches.</p>
<ul>
<li><strong>400+ SaaS connectors</strong> — Notion, Slack, Stripe, Telegram, GitHub, you name it</li>
<li><strong>Trigger → action → condition</strong> visual workflow editor</li>
<li><strong>Webhooks</strong> — receive external events, route to actions</li>
<li><strong>Polling triggers</strong> — RSS feeds, scheduled jobs, file watches</li>
<li><strong>Native retry/error handling</strong> — every node has retry policies</li>
</ul>
<p>n8n is <strong>not</strong> trying to be an LLM platform. It has nodes that call LLMs, but its core identity is &ldquo;connect arbitrary SaaS systems and react to events.&rdquo;</p>
<p>This distinction is crucial. n8n is <strong>plumbing-first</strong>, with optional AI nodes. Dify is <strong>AI-first</strong>, with everything else folded in.</p>
<hr>
<h2 id="the-triage-question-replace-or-absorb">The Triage Question: Replace or Absorb?</h2>
<p>When you evaluate any platform against an existing stack, the question is <strong>not</strong> &ldquo;is it good?&rdquo; The question is <em>&ldquo;what layer of my stack does this want to own, and do I already own that layer?&rdquo;</em></p>
<p>There are exactly two outcomes:</p>
<ul>
<li><strong>Replace</strong>: the platform wants to own a layer you already have. Adopting it means ripping out working code and replacing it with a less flexible black-box equivalent.</li>
<li><strong>Absorb</strong>: the platform wants to own a layer you don&rsquo;t have yet. Adopting it fills a gap without competing with anything.</li>
</ul>
<p>This frame turns what could have been a fuzzy multi-day debate into clean, fast decisions. The rest of this post applies it to each platform in turn.</p>
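<p>The triage can be written down as a set intersection. This is only a sketch of the framing above: the <code>triage</code> function, the layer labels, and the example <code>owned</code> set are hypothetical, and a real evaluation still needs judgment about partial overlaps.</p>

```python
def triage(platform_layers, owned_layers):
    """Return ("replace", overlap) if the platform claims layers you own, else ("absorb", [])."""
    overlap = sorted(set(platform_layers) & set(owned_layers))
    verdict = "replace" if overlap else "absorb"
    return verdict, overlap


# Hypothetical stack that already has a code-based orchestrator.
owned = {"orchestration", "rag", "model gateway", "observability", "cron"}

dify = {"orchestration", "rag", "agent builder", "model gateway", "observability"}
n8n = {"webhooks", "saas connectors", "retry", "polling"}

print(triage(dify, owned))
print(triage(n8n, owned))
```

<p>The returned overlap list is the useful part in practice: it names exactly which working pieces an adoption would displace.</p>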
<hr>
<h2 id="difys-footprint-across-your-stack">Dify&rsquo;s Footprint Across Your Stack</h2>
<p>Dify wants to own five layers at once. Here&rsquo;s how each maps against a setup that already has a code-based orchestrator (any agent harness — Claude Code, LangGraph, your own):</p>
<table>
  <thead>
      <tr>
          <th>Layer Dify Owns</th>
          <th>If You Don&rsquo;t Have It</th>
          <th>If You Already Have It</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Visual workflow orchestration</td>
          <td>Dify gives you a polished UI in days</td>
          <td>Forces you to migrate working code into drag-and-drop nodes</td>
      </tr>
      <tr>
          <td>RAG pipeline</td>
          <td>Built-in, batteries-included knowledge base</td>
          <td>Typically less flexible than a custom RAG layer; harder to tune chunking, embeddings, hybrid search</td>
      </tr>
      <tr>
          <td>Agent builder</td>
          <td>Pre-packaged templates with tool slots</td>
          <td>A real agent loop with multi-step reasoning is more capable than a prompt template wrapper</td>
      </tr>
      <tr>
          <td>Model gateway</td>
          <td>One layer to swap providers</td>
          <td>One env var in a code-based orchestrator does the same</td>
      </tr>
      <tr>
          <td>Observability dashboard</td>
          <td>First-class request logs and cost tracking</td>
          <td>Existing telemetry stacks (Prometheus, OpenTelemetry, custom logging) tend to be deeper</td>
      </tr>
  </tbody>
</table>
<p>The deeper realization: <strong>Dify is built for people who don&rsquo;t write code but want to ship an AI app.</strong> That&rsquo;s a legitimate market, and Dify serves it well. But if you already have a code-based orchestrator running, adopting Dify means ripping out working pieces and replacing them with less flexible equivalents just to fit inside a visual UI. Net cost: a week of migration, all flexibility lost, zero new capability gained.</p>
<p><strong>Our verdict: skip.</strong> Not because Dify is bad, but because there was no gap left for it to fill.</p>
<hr>
<h2 id="n8ns-footprint-across-your-stack">n8n&rsquo;s Footprint Across Your Stack</h2>
<p>n8n&rsquo;s pitch is structurally different. It doesn&rsquo;t want to be the brain. It wants to be the wiring.</p>
<p>The four core capabilities n8n offers, mapped against a typical custom AI setup:</p>
<table>
  <thead>
      <tr>
          <th>Capability n8n Provides</th>
          <th>If You Don&rsquo;t Have It</th>
          <th>If You Already Have It</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Webhook triggers</td>
          <td>Your system&rsquo;s first event-driven entry points</td>
          <td>Complements time-driven (cron) without conflict</td>
      </tr>
      <tr>
          <td>400+ SaaS connectors</td>
          <td>Save weeks writing API clients for Notion / Slack / Stripe / etc.</td>
          <td>Still useful — gives you connectors you didn&rsquo;t have, doesn&rsquo;t compete with what you do</td>
      </tr>
      <tr>
          <td>Built-in retry + state machine</td>
          <td>Mature retry/error handling out of the box</td>
          <td>Replaces handwritten try/except boilerplate with battle-tested defaults</td>
      </tr>
      <tr>
          <td>RSS / polling triggers</td>
          <td>Channel monitoring without OAuth dances</td>
          <td>Pure addition; nothing in most stacks competes</td>
      </tr>
  </tbody>
</table>
<p>The critical observation: <strong>none of these compete with what an existing orchestrator typically owns.</strong> They sit underneath. They fill gaps that a code-based orchestrator alone would still have:</p>
<ul>
<li>Event-driven entry points (most custom stacks only have cron)</li>
<li>Pre-built SaaS adapters (most custom stacks have no generic adapter layer)</li>
<li>Off-the-shelf retry semantics (most custom stacks have handwritten error handling)</li>
<li>Public RSS polling for protocol-locked services like YouTube (most custom stacks have nothing)</li>
</ul>
<p><strong>Our verdict: absorb.</strong> n8n becomes a dependency — a well-maintained, well-documented, battle-tested execution layer — without competing with anything that already works.</p>
<hr>
<h2 id="the-architecture-pattern-that-emerges">The Architecture Pattern That Emerges</h2>
<p>The mental model after these two decisions:</p>
<pre tabindex="0"><code>                       Orchestrator (decisions, judgment)
                       ─────────────────────────────────
                                   │
            ┌──────────────────────┼──────────────────────┐
            │                      │                      │
            ▼                      ▼                      ▼
       Knowledge layer        Tool interfaces        Time triggers
       (your RAG)              (your APIs /            (cron)
                                MCP servers)
                                   │
                                   ▼
                       ┌────────────────────────┐
                       │  n8n (execution layer) │
                       │  ───────────────────── │
                       │  • webhooks            │
                       │  • SaaS adapters       │
                       │  • RSS / polling       │
                       │  • retry / state       │
                       └────────────────────────┘
</code></pre><p>The hard rule worth setting yourself: <strong>n8n is execution only, never decision.</strong> No AI reasoning inside an n8n workflow. n8n receives signals, dispatches them, retries on failure, and reports back. All judgment stays in the orchestrator&rsquo;s hands.</p>
<p>Why is this rule necessary? Because n8n has LLM nodes. You <em>could</em> put a &ldquo;summarize this email&rdquo; GPT call inside a workflow. The moment you do, you&rsquo;ve split your reasoning across two places — some inside your orchestrator&rsquo;s prompt context, some inside an opaque n8n node — and now you have two systems making decisions with no shared memory. That&rsquo;s the failure mode that turns simple workflows into unmaintainable chains.</p>
<p>Keeping n8n as pure plumbing is the discipline that makes the architecture work.</p>
<hr>
<h2 id="three-gotchas-worth-knowing-before-you-deploy-n8n">Three Gotchas Worth Knowing Before You Deploy n8n</h2>
<p>Three things that tend to surprise people in the first day of running n8n:</p>
<p><strong>1. On n8n 2.21.x, the REST API doesn&rsquo;t support PATCH for archiving workflows.</strong> You can create and read workflows via the API, but you can&rsquo;t delete or archive them programmatically, so cleanup has to go through the web UI. If you&rsquo;re planning to dynamically generate workflows on that release line, factor in manual cleanup or write directly to the SQLite database. (Fixed in n8n 2.22+.)</p>
<p><strong>2. Webhook paths are globally unique, even for inactive workflows.</strong> Delete a workflow and its webhook path stays registered, blocking reuse in any new workflow. Treat webhook paths as one flat global namespace that you have to manage yourself. Prefix paths with the workflow name from day one.</p>
<p><strong>3. The API key scope doesn&rsquo;t include <code>workflow:execute</code>.</strong> You can read workflows over the API but you can&rsquo;t trigger them programmatically — webhooks are the only execution surface. For most architectures this is actually correct (webhooks ARE the integration point), but it can catch you off guard if you&rsquo;re expecting &ldquo;API to start a workflow run on demand.&rdquo;</p>
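<p>For the second gotcha, the prefixing convention is easy to enforce with a small helper on the side that generates workflows. The sketch below is plain Python and does not call n8n at all; the <code>webhook_path</code> function, the <code>WebhookRegistry</code> class, and the double-hyphen separator are assumptions made for illustration.</p>

```python
import re


def slug(text):
    """Lowercase and collapse anything that is not a letter or digit into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def webhook_path(workflow_name, endpoint):
    """Build a path prefixed with the workflow name, e.g. 'my-flow--new-item'."""
    parts = [slug(workflow_name), slug(endpoint)]
    if not all(parts):
        raise ValueError("workflow name and endpoint need letters or digits")
    # slug() never emits a double hyphen, so '--' is an unambiguous separator.
    return "--".join(parts)


class WebhookRegistry:
    """Local record of claimed paths, kept even after a workflow is deleted."""

    def __init__(self):
        self._owners = {}

    def register(self, workflow_name, endpoint):
        path = webhook_path(workflow_name, endpoint)
        owner = self._owners.get(path)
        if owner is not None and owner != workflow_name:
            raise ValueError(f"path {path!r} already claimed by {owner!r}")
        self._owners[path] = workflow_name
        return path


registry = WebhookRegistry()
print(registry.register("YouTube Monitor", "new video"))
```

<p>Keeping the registry outside n8n means a path stays reserved in your own records even when the workflow that claimed it no longer exists.</p>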
<hr>
<h2 id="when-you-should-choose-dify">When You Should Choose Dify</h2>
<p>To be fair: Dify is the right tool when:</p>
<ul>
<li>You <strong>don&rsquo;t want to write code</strong> or maintain individual infrastructure pieces.</li>
<li>You need a <strong>polished UI</strong> for non-technical users to build and tweak workflows.</li>
<li>You want a <strong>one-stop hosted experience</strong> (RAG + model gateway + observability + UI) and don&rsquo;t already have these pieces wired together.</li>
<li>You&rsquo;re building a <strong>customer-facing chatbot</strong> for a small team and need shipping speed over architectural flexibility.</li>
</ul>
<p>If any of those describe you, Dify is a serious choice and we wouldn&rsquo;t argue against it.</p>
<hr>
<h2 id="when-you-should-choose-n8n">When You Should Choose n8n</h2>
<p>n8n is the right tool when:</p>
<ul>
<li>You need to integrate with <strong>specific SaaS products</strong> (Notion, Slack, Stripe, Telegram, etc.) and don&rsquo;t want to write each API client by hand.</li>
<li>You want <strong>event-driven workflows</strong> (webhooks, polling, scheduling) without building your own event bus.</li>
<li>You want a <strong>visual editor</strong> so non-technical teammates can see and modify pipelines.</li>
<li>You&rsquo;re OK with workflows being <strong>execution-only</strong> — no judgment, just plumbing.</li>
</ul>
<p>n8n is <em>not</em> a good choice when:</p>
<ul>
<li>You need <strong>multi-step LLM reasoning</strong> with shared memory across steps. Use an agent harness instead (Claude Code, LangGraph, OpenAI&rsquo;s Agents SDK).</li>
<li>You need <strong>full control over prompt format, token budget, fallback chains</strong>. n8n&rsquo;s LLM nodes are too abstract for serious work.</li>
<li>Your workflow logic <strong>changes weekly</strong>. The visual editor is great for stable workflows; it&rsquo;s a drag for rapidly iterating ones — code is faster to refactor than nodes.</li>
</ul>
<hr>
<h2 id="the-deeper-principle-models-are-commodity-orchestration-is-the-moat">The Deeper Principle: &ldquo;Models Are Commodity, Orchestration Is the Moat&rdquo;</h2>
<p>The Dify-skip / n8n-absorb decision is downstream of a broader principle:</p>
<ul>
<li><strong>Models</strong> (DeepSeek, GPT, Claude, Mistral, Llama) are interchangeable. Swap them with an env var.</li>
<li><strong>Platforms</strong> (Dify, Langflow, Flowise) are also interchangeable. They package similar capabilities differently.</li>
<li><strong>Orchestration</strong> — the system that connects models, knowledge, tools, and outcomes — is where the leverage is.</li>
</ul>
<p>When you already have a strong orchestrator, you do not need a platform that wants to <em>be</em> your orchestrator. You need plumbing that does plumbing well. That&rsquo;s where n8n earns its place.</p>
<p>This principle generalizes. Every time you evaluate an AI platform, ask: <strong>does this want to own my orchestration layer, or fill a gap underneath it?</strong> If the answer is &ldquo;own,&rdquo; and you already have an orchestrator, skip it. If the answer is &ldquo;fill a gap,&rdquo; and the gap is real, absorb it.</p>
<hr>
<h2 id="closing-thought">Closing Thought</h2>
<p>Two platforms. Opposite decisions. Same underlying logic: <em>what slice does this want to own, and do I already own that slice?</em></p>
<ul>
<li><strong>Dify</strong> wanted to own the orchestration layer → already covered → <strong>reject</strong>.</li>
<li><strong>n8n</strong> wanted to own the execution layer (event triggers, SaaS integration, retry, polling) → not covered → <strong>absorb</strong>.</li>
</ul>
<p>If you&rsquo;re evaluating self-hosted AI tools right now, this is the question to ask first. It saves a lot of pointless deployments and even more pointless rework.</p>
<hr>
<p><em>References</em></p>
<ul>
<li><em><a href="https://github.com/langgenius/dify">Dify on GitHub</a></em> — 55k+ stars, Apache 2.0</li>
<li><em><a href="https://github.com/n8n-io/n8n">n8n on GitHub</a></em> — 162k+ stars, Sustainable Use License</li>
<li><em><a href="https://modelcontextprotocol.io/">Model Context Protocol (MCP) specification</a></em></li>
<li><em><a href="https://aibrew.ai/2026/05/rag-vs-agents-when-to-use-which-with-real-examples-from-our-stack/">Our previous post: RAG vs Agents</a></em></li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>About MyBrew</title>
      <link>https://aibrew.ai/about/</link>
      <pubDate>Mon, 25 May 2026 00:00:00 +0000</pubDate>
      <guid>https://aibrew.ai/about/</guid>
      <description>&lt;h2 id=&#34;what-is-mybrew&#34;&gt;What Is MyBrew&lt;/h2&gt;
&lt;p&gt;MyBrew is a curated site for AI tools and open-source projects. Every recommendation is actually tested before it appears here.&lt;/p&gt;
&lt;h2 id=&#34;what-youll-find&#34;&gt;What You&amp;rsquo;ll Find&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tool Reviews&lt;/strong&gt; — hands-on testing, no fluff. Does it work? Is it worth your time?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open-Source Deep Dives&lt;/strong&gt; — GitHub projects that deserve attention, explained in plain English.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tutorials&lt;/strong&gt; — step-by-step guides to build useful things with AI.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Comparisons&lt;/strong&gt; — &amp;ldquo;X vs Y for Z&amp;rdquo; — side-by-side breakdowns for real use cases.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Roundups&lt;/strong&gt; — thematic collections when you need options at a glance.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;philosophy&#34;&gt;Philosophy&lt;/h2&gt;
&lt;p&gt;The internet doesn&amp;rsquo;t need another AI news aggregator. We brew, not spray — each post is tested, filtered, and written for humans.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<h2 id="what-is-mybrew">What Is MyBrew</h2>
<p>MyBrew is a curated site for AI tools and open-source projects. Every recommendation is actually tested before it appears here.</p>
<h2 id="what-youll-find">What You&rsquo;ll Find</h2>
<ul>
<li><strong>Tool Reviews</strong> — hands-on testing, no fluff. Does it work? Is it worth your time?</li>
<li><strong>Open-Source Deep Dives</strong> — GitHub projects that deserve attention, explained in plain English.</li>
<li><strong>Tutorials</strong> — step-by-step guides to build useful things with AI.</li>
<li><strong>Comparisons</strong> — &ldquo;X vs Y for Z&rdquo; — side-by-side breakdowns for real use cases.</li>
<li><strong>Roundups</strong> — thematic collections when you need options at a glance.</li>
</ul>
<h2 id="philosophy">Philosophy</h2>
<p>The internet doesn&rsquo;t need another AI news aggregator. We brew, not spray — each post is tested, filtered, and written for humans.</p>
<h2 id="who-writes-this">Who Writes This</h2>
<p>I hold an MSc in Computer Science from The University of Hong Kong (HKU). Current research: LLM backdoor attacks and agent harness design.</p>
<p>A note on how this site is made: articles here are drafted and polished with AI assistance. Writing about AI without using AI would miss the point — AI is a serious force multiplier on quality and pace. But AI stays a tool, not the editor: every claim, framing, and final call is mine. The site itself is AI-friendly by design — partly because that&rsquo;s how the web is increasingly read, mostly because it&rsquo;s the same principle: use the tools.</p>
<p>More time spent breaking things than writing about them — what makes it here is what survived.</p>
<p>Reach me at <code>contact@aibrew.ai</code> — I read everything, reply when I can.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Hello, World</title>
      <link>https://aibrew.ai/2026/05/hello-world/</link>
      <pubDate>Mon, 25 May 2026 00:00:00 +0000</pubDate>
      <guid>https://aibrew.ai/2026/05/hello-world/</guid>
      <description>&lt;h2 id=&#34;welcome-to-mybrew&#34;&gt;Welcome to MyBrew&lt;/h2&gt;
&lt;p&gt;This is the first post. More coming soon.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tool Reviews&lt;/strong&gt; — hands-on testing&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tutorials&lt;/strong&gt; — step-by-step AI guides&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Comparisons&lt;/strong&gt; — side-by-side breakdowns&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Roundups&lt;/strong&gt; — curated collections&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Everything tested. No hype.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<h2 id="welcome-to-mybrew">Welcome to MyBrew</h2>
<p>This is the first post. More coming soon.</p>
<ul>
<li><strong>Tool Reviews</strong> — hands-on testing</li>
<li><strong>Tutorials</strong> — step-by-step AI guides</li>
<li><strong>Comparisons</strong> — side-by-side breakdowns</li>
<li><strong>Roundups</strong> — curated collections</li>
</ul>
<p>Everything tested. No hype.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Huawei&#39;s τ-Scaling Law: A Real Read of the Paper Behind the Hype</title>
      <link>https://aibrew.ai/2026/05/huaweis-%CF%84-scaling-law-a-real-read-of-the-paper-behind-the-hype/</link>
      <pubDate>Mon, 25 May 2026 00:00:00 +0000</pubDate>
      <guid>https://aibrew.ai/2026/05/huaweis-%CF%84-scaling-law-a-real-read-of-the-paper-behind-the-hype/</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — Huawei&amp;rsquo;s τ (Tao) Scaling Law, announced at IEEE ISCAS 2026, reframes Moore&amp;rsquo;s Law: instead of shrinking transistors, optimize a time constant τ across the entire computing stack. The paper is real, the production data is concrete, but the &amp;ldquo;first scaling law since Dennard&amp;rdquo; claim deserves scrutiny. This is mostly a solid 3D-integration engineering paper wrapped in a strategic narrative about how China builds high-performance chips without leading-edge lithography.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<blockquote>
<p><strong>TL;DR</strong> — Huawei&rsquo;s τ (Tao) Scaling Law, announced at IEEE ISCAS 2026, reframes Moore&rsquo;s Law: instead of shrinking transistors, optimize a time constant τ across the entire computing stack. The paper is real, the production data is concrete, but the &ldquo;first scaling law since Dennard&rdquo; claim deserves scrutiny. This is mostly a solid 3D-integration engineering paper wrapped in a strategic narrative about how China builds high-performance chips without leading-edge lithography.</p>
</blockquote>
<hr>
<h2 id="what-was-announced">What Was Announced</h2>
<p>On May 25, 2026, at the IEEE International Symposium on Circuits and Systems (ISCAS) in Shanghai, He Tingbo — President of Huawei&rsquo;s Semiconductor Business — delivered a keynote titled <em>&ldquo;Exploration and Practice of a New Semiconductor Path.&rdquo;</em> The headline: a new scaling principle Huawei calls <strong>τ (Tao) Scaling</strong>, marketed as China&rsquo;s first systematic semiconductor industry law.</p>
<p>The paper, <em>&ldquo;A Time Scaling Theory for Multi-Layer Electronic Systems,&rdquo;</em> was simultaneously posted to ChinaXiv as a preprint (<a href="https://chinaxiv.org/abs/202605.00224">ChinaXiv:202605.00224</a>). Within hours it had over 30,000 reads and 13,000 downloads — unusual for a preprint server.</p>
<p>This is worth taking seriously precisely because it&rsquo;s published, not a marketing deck.</p>
<hr>
<h2 id="the-core-reframe">The Core Reframe</h2>
<p>For 60 years, Moore&rsquo;s Law has driven semiconductor progress by shrinking transistor dimensions. The paper opens with the industry consensus:</p>
<blockquote>
<p><em>&ldquo;For six decades, Moore&rsquo;s geometric scaling drove progress in semiconductors&hellip; returns from pure dimensional shrinking have flattened, leading-edge design budgets exceed one billion dollars per chip, and cost-per-transistor at the most advanced nodes is no longer falling.&rdquo;</em></p>
</blockquote>
<p>So what&rsquo;s the successor principle? The paper&rsquo;s pivot is the key insight:</p>
<blockquote>
<p><em>&ldquo;Spatial scaling served merely as the instrument for compressing time.&rdquo;</em></p>
</blockquote>
<p>In other words: Moore&rsquo;s Law was never really about transistor area — it was about reducing the time it takes for a system to do something. Users don&rsquo;t care that their chip is 3nm. They care that their app opens in 200ms instead of 300ms.</p>
<p>If time was always the underlying goal, <strong>why not measure progress in time directly?</strong> That&rsquo;s τ scaling: a single characteristic time constant τ as the unifying optimization target across the entire computing stack — from picosecond transistor switching to multi-second AI workload latency, spanning twelve orders of magnitude.</p>
<p>The paper&rsquo;s strongest methodological claim:</p>
<blockquote>
<p><em>&ldquo;τ scaling is the first scaling principle since Dennard to establish a shared optimization target across the entire computing stack.&rdquo;</em></p>
</blockquote>
<p>This is a big claim. We&rsquo;ll revisit it.</p>
<hr>
<h2 id="how-τ-works-four-layers">How τ Works: Four Layers</h2>
<p>The framework decomposes τ into four stack layers, each with its own optimization target:</p>
<table>
  <thead>
      <tr>
          <th>Layer</th>
          <th>What τ measures</th>
          <th>Optimization technique</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Device</strong></td>
          <td>Transistor switching delay</td>
          <td>Lower resistance and parasitic capacitance</td>
      </tr>
      <tr>
          <td><strong>Circuit</strong></td>
          <td>Signal RC delay along wires</td>
          <td><strong>LogicFolding</strong> — vertical 3D stacking</td>
      </tr>
      <tr>
          <td><strong>Chip</strong></td>
          <td>Compute + memory access delay</td>
          <td>Full-stack co-design</td>
      </tr>
      <tr>
          <td><strong>System</strong></td>
          <td>Inter-chip + inter-rack communication</td>
          <td><strong>Unified Bus + Hi-ONE optical I/O</strong></td>
      </tr>
  </tbody>
</table>
<p>The interesting move is that the paper treats <em>frequency, latency, bandwidth, throughput</em> as all being governed by τ at their respective layers. One framework, twelve orders of magnitude.</p>
<hr>
<h2 id="production-demo-1-kirin-2026-soc">Production Demo #1: Kirin 2026 SoC</h2>
<p>This is the most concrete part of the paper. The Kirin 2026 chip — launching this autumn — is the first commercial product using LogicFolding.</p>
<h3 id="what-logicfolding-actually-does">What LogicFolding Actually Does</h3>
<blockquote>
<p><em>&ldquo;LogicFolding is a design methodology that partitions digital, analog, and memory circuits across vertically stacked active tiers.&rdquo;</em></p>
</blockquote>
<p>In plain terms: instead of laying out logic in a single 2D plane, split the design across multiple active silicon layers connected by high-density hybrid bonding. Some signal paths that previously had to traverse long horizontal distances now travel short vertical ones.</p>
<p>The promise:</p>
<blockquote>
<p><em>&ldquo;Signal wires become substantially shorter, parasitic RC decreases sharply, clock skew tightens, and the chip operates at a higher clock frequency at the same device node.&rdquo;</em></p>
</blockquote>
<p>Crucially: <strong>at the same device node</strong>. This isn&rsquo;t a process shrink. It&rsquo;s a structural reorganization that recovers performance from the interconnect, not the transistor.</p>
<h3 id="the-numbers-from-the-paper">The Numbers (from the paper)</h3>
<p>Measured on Kirin 2026:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Transistor density</td>
          <td><strong>155 → 238 MTr/mm² (+55%)</strong></td>
      </tr>
      <tr>
          <td>P-core power efficiency</td>
          <td><strong>+41%</strong></td>
      </tr>
      <tr>
          <td>Peak frequency</td>
          <td><strong>2.75 → 3.1 GHz (+13%)</strong></td>
      </tr>
      <tr>
          <td>SRAM operating frequency</td>
          <td><strong>+40%</strong></td>
      </tr>
      <tr>
          <td>Clock buffer count</td>
          <td><strong>−50%</strong></td>
      </tr>
      <tr>
          <td>Clock skew</td>
          <td><strong>−25%</strong></td>
      </tr>
      <tr>
          <td>Critical wire length</td>
          <td><strong>−30%</strong></td>
      </tr>
  </tbody>
</table>
<p>If these hold up under independent measurement, this is a genuine engineering achievement — not just a process node bump.</p>
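<p>As a quick sanity check, the two rows that list both endpoints can be recomputed directly. This is plain arithmetic on the table values, not additional data from the paper, and <code>pct_change</code> is a helper named here for illustration:</p>
<pre tabindex="0"><code class="language-python">def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Endpoint values copied from the table above.
density = pct_change(155, 238)      # MTr/mm^2
frequency = pct_change(2.75, 3.1)   # GHz

print(f"density:   {density:+.1f}%")
print(f"frequency: {frequency:+.1f}%")
</code></pre><p>The frequency row matches its rounded figure (+12.7% against +13%). The density endpoints give about +53.5%, slightly under the quoted +55%, which may reflect rounding or a baseline the table does not show.</p>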
<hr>
<h2 id="production-demo-2-ai-data-centers">Production Demo #2: AI Data Centers</h2>
<p>The harder test for any scaling principle: does it work at gigawatt scale?</p>
<blockquote>
<p><em>&ldquo;Whether a principle developed in the milliwatt smartphone regime survives translation to the gigawatt regime of AI training and inference.&rdquo;</em></p>
</blockquote>
<p>The paper&rsquo;s answer: yes, but only if you treat τ as a system-level target, not a per-accelerator optimization.</p>
<h3 id="the-bottleneck-reframe">The Bottleneck Reframe</h3>
<p>The paper&rsquo;s most important industry observation:</p>
<blockquote>
<p><em>&ldquo;Modern AI systems are dominated by data, not by compute. Over 80% of energy in large AI clusters is spent on data movement, and over 70% of system cost goes to data storage.&rdquo;</em></p>
</blockquote>
<p>This is the unspoken truth of AI infrastructure: TOPS numbers on chip datasheets are mostly irrelevant when 80% of energy goes to moving bytes between chips, racks, and storage tiers.</p>
<h3 id="three-solutions">Three Solutions</h3>
<p><strong>1. Unified Bus</strong> (灵衢总线, the Lingqu bus): a memory-semantic fabric that eliminates protocol conversions between PCIe / NVLink / RDMA / Ethernet / InfiniBand layers. The claim:</p>
<blockquote>
<p><em>&ldquo;Conversion-free, peer-to-peer transmission.&rdquo;</em></p>
</blockquote>
<p>Measured impact: end-to-end remote access latency from <strong>tens of microseconds to ~100ns</strong> — a roughly <strong>500× reduction</strong> in system τ on the main communication path.</p>
<p><strong>2. Hi-ONE</strong> (High-density Optical-interconnect-Node Engine): near-package optical I/O. The paper&rsquo;s rationale:</p>
<blockquote>
<p><em>&ldquo;At multi-Tb/s per chip, copper becomes physically impractical.&rdquo;</em></p>
</blockquote>
<p>Hi-ONE delivers <strong>8 Tb/s per module</strong>, extends face-to-face distance to <strong>100m</strong>, and matches the chip&rsquo;s Unified Bus (UB) bandwidth over a single optical link.</p>
<p><strong>3. 3D Folding</strong>: this addresses the fan-out dilemma, in which compute scales with chip area (N²) while I/O and power scale only with chip perimeter (N). The solution is to fold I/O and power into the vertical stack instead of crowding the edge.</p>
<p><strong>Projection</strong>: more than <strong>100× growth in hardware integration by 2035</strong>.</p>
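<p>The fan-out dilemma can be illustrated with a toy model. The sketch below assumes a square die of side <code>n</code> in arbitrary units, with compute proportional to area and edge I/O proportional to perimeter. Both the model and the <code>compute_per_io</code> helper are simplifications introduced here, not formulas from the paper:</p>
<pre tabindex="0"><code class="language-python">def compute_per_io(n):
    """Compute demand per unit of edge I/O for a square die of side n.

    Assumes compute capacity tracks area (n * n) and edge I/O tracks
    perimeter (4 * n), so the ratio simplifies to n / 4.
    """
    area = n * n
    perimeter = 4 * n
    return area / perimeter


for side in (1, 2, 4, 8):
    print(side, compute_per_io(side))
</code></pre><p>Under these assumptions, every doubling of the die side doubles the compute that each unit of edge has to feed, which is the pressure that moving I/O and power into the vertical dimension is meant to relieve.</p>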
<hr>
<h2 id="the-honest-caveat">The Honest Caveat</h2>
<p>Buried in the paper is one of the most important sentences for understanding what τ scaling is <em>not</em>:</p>
<blockquote>
<p><em>&ldquo;τ is a time law, not a joule law.&rdquo;</em></p>
</blockquote>
<p>Translation: τ scaling solves <em>time</em>, not <em>energy</em>. If you make an AI cluster 10× faster but it also draws 10× more power, you&rsquo;ve just moved the bottleneck from latency to electricity, cooling, and dollars.</p>
<p>The paper acknowledges this and gestures at the obvious complements: protocol overhead reduction, lower per-bit transmission energy, near-memory computing, backside power delivery, dynamic voltage/frequency scaling. But the framework itself doesn&rsquo;t solve energy. Anyone evaluating τ scaling should remember this.</p>
<p>To the paper&rsquo;s credit, this limit is stated outright; most marketing-driven &ldquo;new law&rdquo; announcements gloss over their boundaries.</p>
<hr>
<h2 id="earned-credit-vs-marketing">Earned Credit vs. Marketing</h2>
<h3 id="what-stands-up">What stands up</h3>
<ul>
<li><strong>Real paper, real data.</strong> ISCAS keynote + ChinaXiv preprint with concrete production numbers. Not a slide deck.</li>
<li><strong>Honest about limits.</strong> The &ldquo;τ is not a joule law&rdquo; caveat shows genuine engineering humility.</li>
<li><strong>Strategically sound.</strong> Without access to leading-edge EUV lithography, China needs a path to high-performance chips that doesn&rsquo;t depend on 2nm or 1nm process nodes. 3D integration plus system-level optimization is that path. The framework gives it a name and a measurable target.</li>
<li><strong>Kirin 2026 ships this autumn.</strong> Verifiable claims have a verification date.</li>
</ul>
<h3 id="what-deserves-scrutiny">What deserves scrutiny</h3>
<p><strong>&ldquo;First scaling principle since Dennard&rdquo;</strong> is a load-bearing claim. But:</p>
<ul>
<li>3D integration has been studied for years. TSMC&rsquo;s CoWoS, Intel&rsquo;s Foveros, AMD&rsquo;s chiplet packaging, and Samsung&rsquo;s X-Cube are all established forms of 2.5D or 3D integration.</li>
<li>HBM is essentially a 3D-folded memory stack.</li>
<li>Imec&rsquo;s CFET research aims at gate-level 3D folding.</li>
</ul>
<p>The paper differentiates LogicFolding from existing 3D IC and chiplets by arguing they operate at the <em>packaging</em> layer, while LogicFolding operates at the <em>circuit topology</em> layer inside the chip. That&rsquo;s a legitimate distinction — but it&rsquo;s an incremental one, not a paradigm break.</p>
<p><strong>&ldquo;1.4nm equivalent density by 2031&rdquo;</strong> is a density target, not a process node. The paper is careful about this — but the surrounding press has not been. Equivalent density via 3D stacking is real; it is not the same as fabricating a true 1.4nm node, and shouldn&rsquo;t be conflated.</p>
<p><strong>&ldquo;381 chips in 6 years using τ scaling&rdquo;</strong> is post-hoc framing. Huawei has been shipping chips for years; retroactively grouping them under a unified principle is good narrative but doesn&rsquo;t validate the principle as predictive.</p>
<p><strong>No public benchmarks against the competition.</strong> TSMC N2, Intel 18A, Samsung 3GAP — where do they sit on this τ chart? The paper doesn&rsquo;t say. Until independent measurement compares apples to apples, the &ldquo;100× by 2035&rdquo; projection is a roadmap, not a result.</p>
<hr>
<h2 id="why-this-matters-strategically">Why This Matters Strategically</h2>
<p>Strip the &ldquo;scaling law&rdquo; framing and what&rsquo;s left is a coherent industry argument:</p>
<blockquote>
<p><em>&ldquo;You don&rsquo;t need the most advanced lithography to build competitive high-performance chips, if you reorganize circuits in 3D and treat the entire system as a single optimization target.&rdquo;</em></p>
</blockquote>
<p>This is the technical case for a China-led semiconductor strategy that doesn&rsquo;t depend on access to ASML&rsquo;s EUV machines. It&rsquo;s also a vision for how AI infrastructure could be built differently — interconnect-centric, system-co-designed, optical at the edges rather than copper everywhere.</p>
<p>Whether or not τ scaling becomes &ldquo;the next Moore&rsquo;s Law,&rdquo; it&rsquo;s a real-world demonstration that the post-Moore era has multiple paths. The question is which path delivers on its claims.</p>
<hr>
<h2 id="what-to-watch">What to Watch</h2>
<ul>
<li><strong>Kirin 2026 launch (Autumn 2026):</strong> Are the 41% efficiency and 55% density gains independently measurable?</li>
<li><strong>ISCAS 2026 paper full text:</strong> Independent review of LogicFolding&rsquo;s claimed RC reductions vs alternative explanations.</li>
<li><strong>Industry response:</strong> Do TSMC, Intel, Samsung adopt τ-style framing? Or counter with their own &ldquo;scaling principle&rdquo; branding?</li>
<li><strong>Energy data:</strong> Since τ doesn&rsquo;t solve energy, what&rsquo;s the actual J/op for AI workloads on Huawei&rsquo;s Ascend silicon vs NVIDIA&rsquo;s latest?</li>
<li><strong>Beyond Kirin:</strong> Does LogicFolding land in Ascend AI chips next? The paper claims AI-system applicability but the production demo is mobile SoC.</li>
</ul>
<hr>
<h2 id="bottom-line">Bottom Line</h2>
<p>The τ Scaling paper is <strong>a solid engineering paper with an oversized strategic narrative wrapped around it.</strong> The technical core — LogicFolding, Unified Bus, Hi-ONE, 3D Folding — is real work with measurable claims. The framing as &ldquo;the next Moore&rsquo;s Law&rdquo; oversells what is, methodologically, an incremental extension of well-known 3D integration techniques combined with system-level co-design.</p>
<p>That&rsquo;s not a criticism. Most real engineering progress is incremental. The marketing layer is what funds the engineering. What matters is whether the Kirin 2026 ships this autumn with the numbers the paper claims. If it does, China just published a credible technical roadmap for high-performance chips that doesn&rsquo;t depend on access to leading-edge lithography. That&rsquo;s a much bigger deal than &ldquo;the next Moore&rsquo;s Law.&rdquo;</p>
<hr>
<p><em>References</em></p>
<ul>
<li><em><a href="https://chinaxiv.org/abs/202605.00224">Tingbo He — A Time Scaling Theory for Multi-Layer Electronic Systems (ChinaXiv preprint)</a></em></li>
<li><em><a href="https://www.huawei.com/cn/news/2026/5/ieee-iscas-tau-scaling">Huawei official announcement — ISCAS 2026 τ scaling</a></em></li>
<li><em><a href="https://www.eefocus.com/article/2019984.html">EEFocus — Deep read of He Tingbo&rsquo;s &ldquo;Time Scaling&rdquo; paper</a></em></li>
<li><em><a href="https://www.gizmochina.com/2026/05/25/huawei-proposes-tao-law-as-alternative-to-moores-law-first-logic-folding-chip-arrives-this-autumn/">Gizmochina — Huawei proposes Tao Law as alternative to Moore&rsquo;s Law</a></em></li>
<li><em><a href="https://www.21jingji.com/article/20260525/herald/1573642c437a5e4e76a15fc1c40f0a35.html">21 Economic Net — What is the Tao Law and how is it different from Moore&rsquo;s Law</a></em></li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>RAG vs Agents: When to Use Which (With Real Examples from Our Stack)</title>
      <link>https://aibrew.ai/2026/05/rag-vs-agents-when-to-use-which-with-real-examples-from-our-stack/</link>
      <pubDate>Mon, 25 May 2026 00:00:00 +0000</pubDate>
      <guid>https://aibrew.ai/2026/05/rag-vs-agents-when-to-use-which-with-real-examples-from-our-stack/</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — RAG answers from documents. Agents take actions. Most real systems use both: RAG provides context, agents act on it. The hard part isn&amp;rsquo;t picking one — it&amp;rsquo;s knowing which layer of your problem belongs to which pattern.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;h2 id=&#34;why-this-comparison-matters-right-now&#34;&gt;Why This Comparison Matters Right Now&lt;/h2&gt;
&lt;p&gt;Two things happened in the last six months that make this comparison less academic than it used to be.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;First&lt;/strong&gt;: coding agents crossed a quality threshold around November 2025. Simon Willison&amp;rsquo;s &lt;a href=&#34;https://simonwillison.net/2026/May/19/5-minute-llms/&#34;&gt;five-minute PyCon talk&lt;/a&gt; describes it as the moment agents went from &amp;ldquo;often-work&amp;rdquo; to &amp;ldquo;mostly-work&amp;rdquo; — usable as daily drivers, not just demos. The &amp;ldquo;best model&amp;rdquo; title changed hands five times between Anthropic, OpenAI, and Google in a single month.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<blockquote>
<p><strong>TL;DR</strong> — RAG answers from documents. Agents take actions. Most real systems use both: RAG provides context, agents act on it. The hard part isn&rsquo;t picking one — it&rsquo;s knowing which layer of your problem belongs to which pattern.</p>
</blockquote>
<hr>
<h2 id="why-this-comparison-matters-right-now">Why This Comparison Matters Right Now</h2>
<p>Two things happened in the last six months that make this comparison less academic than it used to be.</p>
<p><strong>First</strong>: coding agents crossed a quality threshold around November 2025. Simon Willison&rsquo;s <a href="https://simonwillison.net/2026/May/19/5-minute-llms/">five-minute PyCon talk</a> describes it as the moment agents went from &ldquo;often-work&rdquo; to &ldquo;mostly-work&rdquo; — usable as daily drivers, not just demos. The &ldquo;best model&rdquo; title changed hands five times between Anthropic, OpenAI, and Google in a single month.</p>
<p><strong>Second</strong>: the model labs themselves are pivoting. Greg Brockman: <em>&ldquo;the model alone is no longer the product.&rdquo;</em> AI21 shuttered its model team to focus on agents. DeepSeek spun up its first &ldquo;Harness team.&rdquo; <a href="https://www.latent.space/p/ainews-all-model-labs-are-now-agent">Latent Space called this</a> <em>&ldquo;all model labs are now agent labs.&rdquo;</em></p>
<p>When the people who train the models start saying the model isn&rsquo;t the product, the question of <em>how</em> you wire models into systems becomes the actual engineering work. RAG and agents are the two dominant answers. They solve different problems, and getting the choice wrong wastes a lot of tokens.</p>
<hr>
<h2 id="the-mental-model">The Mental Model</h2>
<h3 id="rag-retrieve-then-generate">RAG: Retrieve, then Generate</h3>
<p>RAG is a fixed four-step pipeline:</p>
<pre tabindex="0"><code>User query
   │
   ▼
Embedding model → vector
   │
   ▼
Vector DB / search index → top-K relevant chunks
   │
   ▼
Chunks injected into the LLM prompt as context
   │
   ▼
LLM writes one answer, grounded in the retrieved text
</code></pre><p>One retrieval. One generation. Cheap, deterministic, easy to debug.</p>
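<p>The pipeline above can be sketched in a few lines. This is a toy version under stated assumptions: <code>embed</code> uses word counts as a stand-in for a real embedding model, the chunk list replaces a vector database, and the final LLM call is left out, so <code>build_prompt</code> only shows what would be sent. All function names and sample chunks are illustrative:</p>
<pre tabindex="0"><code class="language-python">import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query, chunks, k=2):
    """Inject retrieved chunks into a single prompt for one LLM call."""
    context = "\n".join("- " + c for c in retrieve(query, chunks, k))
    return "Context:\n" + context + "\n\nQuestion: " + query


# Made-up sample chunks, used only to exercise the pipeline.
CHUNKS = [
    "rate limits are 100 requests per minute",
    "the auth service uses signed tokens",
    "deploys happen every friday",
]

if __name__ == "__main__":
    print(build_prompt("what are the rate limits", CHUNKS, k=1))
</code></pre><p>When an answer looks wrong, the output of <code>retrieve</code> is the first thing to inspect, because the generation step can only work with the chunks it was handed.</p>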
<h3 id="agent-reason-then-act-then-reason-again">Agent: Reason, then Act, then Reason Again</h3>
<p>An agent is a reasoning loop:</p>
<pre tabindex="0"><code>User goal
   │
   ▼
┌──────────────────────────────────────────┐
│   LLM reads the goal                      │
│   ↓                                       │
│   Picks a tool (Read, Edit, Bash, ...)    │
│   ↓                                       │
│   Runtime executes the tool               │
│   ↓                                       │
│   Result feeds back to the LLM            │
│   ↓                                       │
│   LLM reasons about what to do next       │
│   ↓                                       │
│   Picks the next tool                     │
│   ↓                                       │
│   ...loop until task is done              │
└──────────────────────────────────────────┘
</code></pre><p>Every iteration burns tokens. Every step can fail. Errors compound across the loop.</p>
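<p>The loop above, reduced to a runnable sketch. The LLM is replaced by <code>scripted_policy</code>, a hard-coded stand-in that returns the next action, and the two tools operate on an in-memory dictionary instead of real files. The <code>max_steps</code> budget is an assumption added here to keep a misbehaving loop from running forever; none of these names come from a real agent framework:</p>
<pre tabindex="0"><code class="language-python">def run_agent(goal, policy, tools, max_steps=10):
    """Loop: ask the policy for an action, run the tool, feed the result back."""
    history = [("goal", goal)]
    for _ in range(max_steps):
        action = policy(history)
        if action["tool"] == "done":
            return history
        result = tools[action["tool"]](**action["args"])
        history.append((action["tool"], result))
    raise RuntimeError("step budget exhausted before the task finished")


def scripted_policy(history):
    """Stand-in for the LLM: picks the next action from how far along we are."""
    steps = [
        {"tool": "read", "args": {"path": "config.yaml"}},
        {"tool": "edit", "args": {"path": "config.yaml", "text": "new"}},
        {"tool": "done", "args": {}},
    ]
    return steps[len(history) - 1]


# In-memory stand-in for the file system.
FILES = {"config.yaml": "old"}


def read(path):
    return FILES[path]


def edit(path, text):
    FILES[path] = text
    return "ok"


if __name__ == "__main__":
    tools = {"read": read, "edit": edit}
    for entry in run_agent("update config", scripted_policy, tools):
        print(entry)
</code></pre><p>With a real model behind <code>policy</code>, each pass through the loop is a full LLM call over the growing history, which is where the token cost comes from.</p>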
<hr>
<h2 id="a-concrete-example-of-each">A Concrete Example of Each</h2>
<h3 id="rag-in-action-semantic-wiki-search">RAG in Action: Semantic Wiki Search</h3>
<p>We run a personal knowledge base — about 60 markdown files covering project notes, design decisions, and conversation transcripts. Plain <code>grep</code> doesn&rsquo;t cut it because the question and the answer rarely share keywords.</p>
<p>The solution is an MCP server that wraps a vector search:</p>
<pre tabindex="0"><code>MCP server: wiki-search
  Backend: bge-m3 embedding model
  Storage: cosine similarity index over 60+ markdown files
  Input:   natural-language query (English or Chinese)
  Output:  file path + section title + similarity score
</code></pre><p>When I ask Claude Code <em>&ldquo;what did we decide about the auto-pipeline last month?&rdquo;</em>, this happens:</p>
<ol>
<li>Claude Code recognizes it needs to search the wiki and invokes <code>search_wiki(&quot;auto-pipeline decision&quot;)</code></li>
<li>The query gets embedded into a 1024-dim vector</li>
<li>Cosine similarity returns the top 5 matching sections</li>
<li>The matching chunks get injected into Claude&rsquo;s context</li>
<li>Claude answers from the retrieved files rather than from memory, which sharply reduces the room for hallucination</li>
</ol>
<p>This is RAG end-to-end. The wiki is <strong>passive</strong>. It gets queried; it doesn&rsquo;t act.</p>
<h3 id="agent-in-action-claude-code-editing-files">Agent in Action: Claude Code Editing Files</h3>
<p>Same Claude Code instance. Different problem: <em>&ldquo;Update the baseURL in mybrew/hugo.yaml to aibrew.ai and add a note in TODO.md.&rdquo;</em></p>
<pre tabindex="0"><code>Iteration 1:
  LLM reasoning: &#34;I need to see the current config first.&#34;
  Tool: Read(&#34;mybrew/hugo.yaml&#34;)
  Result: file contents shown

Iteration 2:
  LLM reasoning: &#34;baseURL is on line 1. I&#39;ll edit it.&#34;
  Tool: Edit(old=&#34;https://mybrew.cc/&#34;, new=&#34;https://aibrew.ai/&#34;)
  Result: edit succeeded

Iteration 3:
  LLM reasoning: &#34;Now update TODO.md to mark this done.&#34;
  Tool: Read(&#34;hugo/TODO.md&#34;)
  Result: file contents shown

Iteration 4:
  LLM reasoning: &#34;Need to add the entry under &#39;Domain config&#39;.&#34;
  Tool: Edit(...)
  Result: edit succeeded

Task complete.
</code></pre><p>Four iterations. Four tool calls. Multiple reasoning steps. The agent decided <em>what</em> to do, <em>how</em> to do it, and <em>when</em> it was done — all on its own.</p>
<h3 id="a-higher-stakes-agent-game-server-control">A Higher-Stakes Agent: Game Server Control</h3>
<p>We also run an agent that controls a Terraria game server through MCP — the bridge exposes ~40 tools (give items, teleport, ban players, spawn bosses, restart server).</p>
<pre tabindex="0"><code>Player in chat: &#34;@ai give me a Zenith&#34;
  → terra_item_lookup(&#34;Zenith&#34;) → resolves to ID 4956
  → terra_give_item(player=&#34;kali&#34;, item=&#34;Zenith&#34;) → SUCCESS
  → Item appears in player&#39;s inventory
</code></pre><p>Compare to a destructive operation:</p>
<pre tabindex="0"><code>Player: &#34;@ai end the world&#34;
  → terra_world_hardmode(confirm=true) requires explicit authorization
  → Refuses without confirmation
  → If confirmed: world permanently enters hardmode (irreversible)
</code></pre><p>This is where the agent pattern gets dangerous. The LLM is now in the driver&rsquo;s seat of a real system. <strong>The blast radius of a wrong tool call is no longer &ldquo;wrong answer&rdquo; — it&rsquo;s &ldquo;wrecked world.&rdquo;</strong> Permission boundaries become first-class design.</p>
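<p>One way to make that boundary explicit is to enforce it in the runtime that executes tool calls rather than in the prompt. The sketch below is a hypothetical gate, not the actual bridge: the tool names come from the examples above, while <code>dispatch</code>, <code>DESTRUCTIVE</code>, <code>PermissionDenied</code>, and the stub handlers are assumptions made for illustration:</p>
<pre tabindex="0"><code class="language-python">DESTRUCTIVE = {"terra_world_hardmode"}  # assumed classification


class PermissionDenied(Exception):
    """Raised when a destructive tool is called without confirmation."""


def dispatch(tool, args, handlers, confirmed=False):
    """Run a tool call, refusing destructive tools without confirmation."""
    if tool not in handlers:
        raise KeyError("unknown tool: " + tool)
    if tool in DESTRUCTIVE and not confirmed:
        raise PermissionDenied(tool + " needs explicit confirmation")
    return handlers[tool](**args)


# Stub handlers standing in for the real MCP bridge.
HANDLERS = {
    "terra_give_item": lambda player, item: "gave " + item + " to " + player,
    "terra_world_hardmode": lambda: "world is now in hardmode",
}

if __name__ == "__main__":
    give_args = {"player": "kali", "item": "Zenith"}
    print(dispatch("terra_give_item", give_args, HANDLERS))
    try:
        dispatch("terra_world_hardmode", {}, HANDLERS)
    except PermissionDenied as err:
        print("refused:", err)
</code></pre><p>Keeping the check in <code>dispatch</code> means a persuasive chat message cannot talk the model past it; the confirmation has to arrive through a separate, trusted path.</p>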
<hr>
<h2 id="the-decision-framework">The Decision Framework</h2>
<p>The one-line rule:</p>
<blockquote>
<p><strong>Use RAG when the answer lives in your documents. Use an agent when the answer requires action.</strong></p>
</blockquote>
<p>Here&rsquo;s the longer version:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>RAG</th>
          <th>Agent</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Goal</strong></td>
          <td>Answer a question</td>
          <td>Complete a task</td>
      </tr>
      <tr>
          <td><strong>Interaction model</strong></td>
          <td>One-shot</td>
          <td>Multi-turn loop</td>
      </tr>
      <tr>
          <td><strong>Token cost</strong></td>
          <td>Low (1× retrieval + 1× generation)</td>
          <td>High (N× reasoning + N× tool calls)</td>
      </tr>
      <tr>
          <td><strong>Latency</strong></td>
          <td>~1–3 seconds</td>
          <td>Seconds to minutes</td>
      </tr>
      <tr>
          <td><strong>Determinism</strong></td>
          <td>High — same query → similar answer</td>
          <td>Low — same goal → different paths</td>
      </tr>
      <tr>
          <td><strong>Debuggability</strong></td>
          <td>Inspect retrieval results</td>
          <td>Trace each reasoning step</td>
      </tr>
      <tr>
          <td><strong>Failure mode</strong></td>
          <td>Wrong/missing context → bad answer</td>
          <td>Tool error compounds → drift</td>
      </tr>
      <tr>
          <td><strong>Blast radius</strong></td>
          <td>Limited to wrong answer</td>
          <td>Touches real systems</td>
      </tr>
      <tr>
          <td><strong>Best for</strong></td>
          <td>Q&amp;A, search, summarization</td>
          <td>Coding, ops, automation, workflows</td>
      </tr>
  </tbody>
</table>
<h3 id="when-you-definitely-want-rag">When You Definitely Want RAG</h3>
<ul>
<li><em>&ldquo;What does our internal API documentation say about rate limits?&rdquo;</em></li>
<li><em>&ldquo;Summarize last week&rsquo;s customer feedback.&rdquo;</em></li>
<li><em>&ldquo;What did the design discussion conclude about authentication?&rdquo;</em></li>
</ul>
<h3 id="when-you-definitely-want-an-agent">When You Definitely Want an Agent</h3>
<ul>
<li><em>&ldquo;Run the test suite and fix any failures.&rdquo;</em></li>
<li><em>&ldquo;Pull yesterday&rsquo;s unread RSS items, pick the three most interesting, and draft a roundup post.&rdquo;</em></li>
<li><em>&ldquo;Refactor this directory to use the new logging API.&rdquo;</em></li>
</ul>
<h3 id="when-you-need-both-most-real-systems">When You Need Both (Most Real Systems)</h3>
<ul>
<li><em>&ldquo;Find the related design doc, then propose a code change consistent with it.&rdquo;</em>
→ RAG to retrieve the doc, agent to make the change.</li>
<li><em>&ldquo;Look up how Pinterest handled MCP auth, then design our auth layer.&rdquo;</em>
→ RAG to gather references, agent to write code.</li>
</ul>
<hr>
<h2 id="hybrid-patterns-rag-powered-agents">Hybrid Patterns: RAG-Powered Agents</h2>
<p>Here&rsquo;s the thing most &ldquo;RAG vs Agent&rdquo; comparisons gloss over: <strong>inside any real agent, RAG is happening at multiple layers</strong>.</p>
<p>A Claude Code session, simplified:</p>
<pre tabindex="0"><code>Session start:
  └─ Load CLAUDE.md into context ............... RAG-on-startup
  └─ Load relevant MEMORY.md files ............. RAG-on-startup

User query:
  └─ Agent reasons about the goal
       │
       ├─ Tool call: search_wiki(&#34;...&#34;) ........ RAG-on-demand
       ├─ Tool call: searxng_web_search(&#34;...&#34;) . RAG-on-demand
       ├─ Tool call: Read(&#34;config.yaml&#34;) ....... Deterministic retrieval
       └─ Tool call: Edit(...) ................. Action
</code></pre><p>The agent loop is the outer shell. RAG calls happen <em>inside</em> the loop, on demand, whenever the agent decides it needs more grounding.</p>
<p>This matches what Pinterest engineers describe in their MCP rollout: the agent surfaces (chat, IDE, CLI) all talk to a common set of MCP servers, some of which are pure retrieval (Presto query, doc search) and some of which are actions (file a ticket, restart a job). The agent decides at runtime which to call.</p>
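<p>The shape of that loop can be sketched in a few lines. This is an illustration under heavy assumptions: a scripted <code>policy</code> function stands in for the LLM, the two tools are stubs, and the names <code>TOOLS</code>, <code>policy</code>, and <code>run_agent</code> are invented for the example. The point is only that retrieval tools and action tools sit in the same registry and the loop picks between them at runtime.</p>

```python
# Illustrative sketch: an agent loop whose tool set mixes retrieval and action.
# The scripted `policy` stands in for the LLM; all tool bodies are stubs.

def search_wiki(query):
    return f"wiki notes about {query}"


def edit_file(path, text):
    return f"wrote {len(text)} chars to {path}"


# Each tool is tagged with its kind so traces show retrieval vs action.
TOOLS = {
    "search_wiki": ("retrieval", search_wiki),
    "edit_file": ("action", edit_file),
}


def policy(goal, observations):
    """Stub for the model: retrieve first, then act, then stop."""
    if not observations:
        return ("search_wiki", {"query": goal})
    if len(observations) == 1:
        return ("edit_file", {"path": "notes.md", "text": observations[0]})
    return None  # nothing left to do


def run_agent(goal, max_steps=5):
    """Outer loop: ask the policy for a step, run the tool, record the result."""
    observations, trace = [], []
    for _ in range(max_steps):
        step = policy(goal, observations)
        if step is None:
            break
        name, args = step
        kind, fn = TOOLS[name]
        observations.append(fn(**args))
        trace.append((name, kind))
    return trace, observations
```

<p>Calling <code>run_agent("rate limits")</code> produces one retrieval step followed by one action step, which mirrors the on-demand pattern in the session diagram.</p>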
<hr>
<h2 id="production-case-study-pinterests-mcp-ecosystem">Production Case Study: Pinterest&rsquo;s MCP Ecosystem</h2>
<p>ByteByteGo&rsquo;s writeup of <a href="https://blog.bytebytego.com/p/how-pinterest-built-a-production">Pinterest&rsquo;s MCP rollout</a> is one of the few public accounts of MCP running in production.</p>
<h3 id="the-nm-problem">The N×M Problem</h3>
<p>Pinterest engineers work across many systems daily — Presto for data, Spark for batch jobs, Airflow for workflows, internal docs, ticketing. They wanted AI agents that could reach into these systems directly.</p>
<p>The brute-force math:</p>
<pre tabindex="0"><code>5 agent surfaces × 10 internal tools = 50 bespoke integrations
</code></pre><p>Every new surface meant another integration per tool, and every new tool another integration per surface. Each of those 50 integrations also carried its own auth flow, token lifecycle, and plumbing.</p>
<h3 id="the-mcp-bet">The MCP Bet</h3>
<p>The Model Context Protocol promised to flatten this:</p>
<pre tabindex="0"><code>5 clients + 10 servers = 15 standardized integrations
</code></pre><p>One protocol, spoken on both sides: build one client per surface, wrap each tool in one server, and every client can reach every server.</p>
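<p>The arithmetic is worth spelling out, because the gap widens as either side grows. A small sketch, using the article&rsquo;s own numbers (5 surfaces, 10 tools); the function names are made up for illustration.</p>

```python
def bespoke_integrations(surfaces, tools):
    """Point-to-point: every surface integrates with every tool."""
    return surfaces * tools


def mcp_integrations(surfaces, tools):
    """Shared protocol: one client per surface, one server per tool."""
    return surfaces + tools


def marginal_cost_of_new_tool(surfaces, tools, count_fn):
    """Extra integrations needed when one more tool is added."""
    return count_fn(surfaces, tools + 1) - count_fn(surfaces, tools)
```

<p>With 5 surfaces and 10 tools this gives 50 versus 15. Adding an eleventh tool costs 5 new integrations in the bespoke model and 1 in the protocol model.</p>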
<h3 id="what-mcp-doesnt-solve">What MCP Doesn&rsquo;t Solve</h3>
<p>Pinterest&rsquo;s hard-won lesson: the protocol is the easy part. The real engineering went into the <em>surrounding</em> infrastructure:</p>
<table>
  <thead>
      <tr>
          <th>Concern</th>
          <th>Pinterest&rsquo;s Solution</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Discovery</strong></td>
          <td>Central registry of MCP servers — name, version, owner, endpoint</td>
      </tr>
      <tr>
          <td><strong>Auth (Layer 1)</strong></td>
          <td>Service identity — which agent runtime is making this call</td>
      </tr>
      <tr>
          <td><strong>Auth (Layer 2)</strong></td>
          <td>User identity — whose permissions is the agent acting under</td>
      </tr>
      <tr>
          <td><strong>Deployment</strong></td>
          <td>Unified CI/CD pipeline for all MCP servers</td>
      </tr>
      <tr>
          <td><strong>Observability</strong></td>
          <td>Tool-call metrics from day one — usage, latency, error rate</td>
      </tr>
  </tbody>
</table>
<p>The takeaway: <strong>the more capable your agents become, the more your permission and observability layers matter.</strong> A protocol that lets any agent call any tool is also a protocol that lets any compromised agent call any tool.</p>
<p>This is also why our smaller setup (3 MCP servers: <code>searxng</code>, <code>wiki-search</code>, <code>terra_llm_bridge</code>) puts hard <code>confirm=true</code> gates on destructive operations like banning players, restarting the server, or enabling hardmode. Three servers don&rsquo;t need a registry, but they do need authorization.</p>
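<p>The registry and the two auth layers from the table can be sketched together. This is not Pinterest&rsquo;s implementation, which the source does not show. Every field name, runtime name, permission string, and the endpoint URL below is a hypothetical placeholder chosen to show the order of checks: look up the server, verify the calling runtime, then verify the user.</p>

```python
# Hypothetical sketch of a registry lookup plus two-layer authorization.
# All names, permissions, and the endpoint are illustrative assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerEntry:
    name: str
    version: str
    owner: str
    endpoint: str
    allowed_runtimes: frozenset
    required_permission: str


REGISTRY = {
    "presto-query": ServerEntry(
        name="presto-query",
        version="1.0.0",
        owner="data-platform",
        endpoint="https://mcp.example.internal/presto",
        allowed_runtimes=frozenset({"ide-agent", "chat-agent"}),
        required_permission="data:read",
    ),
}


def authorize(server_name, runtime, user_permissions):
    """Return (endpoint, reason); endpoint is None unless both layers pass."""
    entry = REGISTRY.get(server_name)
    if entry is None:
        return None, "unknown server"
    # Layer 1: service identity (which agent runtime is calling).
    if runtime not in entry.allowed_runtimes:
        return None, "runtime not allowed"
    # Layer 2: user identity (whose permissions the agent acts under).
    if entry.required_permission not in user_permissions:
        return None, "user lacks permission"
    return entry.endpoint, "ok"
```

<p>Returning a reason string alongside the result keeps denials easy to log, which ties into the observability row of the table.</p>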
<hr>
<h2 id="architecture-comparison-claude-code-vs-openclaw">Architecture Comparison: Claude Code vs OpenClaw</h2>
<p>Two of the most popular agent harnesses today take very different stances. ByteByteGo&rsquo;s <a href="https://blog.bytebytego.com/p/ep214-claude-code-vs-openclaw-5-design">EP214</a> breaks them down on five dimensions:</p>
<h3 id="1-system-scope">1. System Scope</h3>
<table>
  <thead>
      <tr>
          <th></th>
          <th>Claude Code</th>
          <th>OpenClaw</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Lifetime</td>
          <td>Short-lived process</td>
          <td>Long-running daemon</td>
      </tr>
      <tr>
          <td>Trigger</td>
          <td>User runs CLI</td>
          <td>WebSocket from Discord/Slack/WhatsApp</td>
      </tr>
      <tr>
          <td>Exit</td>
          <td>After task complete</td>
          <td>Never</td>
      </tr>
  </tbody>
</table>
<p>Claude Code is a workhorse you summon. OpenClaw is a butler that&rsquo;s always listening.</p>
<h3 id="2-agent-runtime">2. Agent Runtime</h3>
<ul>
<li><strong>Claude Code</strong>: single async loop — <code>Think → Tool Call → Observe → Repeat</code>. One task at a time per process.</li>
<li><strong>OpenClaw</strong>: per-session queues. The Gateway demultiplexes incoming messages and dispatches them to separate runtime queues.</li>
</ul>
<h3 id="3-extension-model">3. Extension Model</h3>
<ul>
<li><strong>Claude Code</strong>: Four extension primitives, all hooking into the same agent loop:
<ul>
<li><strong>MCP</strong> (external tool servers)</li>
<li><strong>Plugins</strong> (bundled tool sets)</li>
<li><strong>Skills</strong> (named procedures the model can invoke)</li>
<li><strong>Hooks</strong> (event-driven shell commands)</li>
</ul>
</li>
<li><strong>OpenClaw</strong>: Manifest-first plugins. All plugins go through a central Registry before being made available to the Agent.</li>
</ul>
<h3 id="4-memory">4. Memory</h3>
<ul>
<li><strong>Claude Code</strong>: <code>CLAUDE.md</code> is loaded into context at session start. Subdirectories can have their own <code>CLAUDE.md</code>, which is pulled in on demand when the agent works with files in that subtree.</li>
<li><strong>OpenClaw</strong>: <code>MEMORY.md</code> separated from daily notes. Hybrid vector + keyword search across structured sections.</li>
</ul>
<h3 id="5-multi-agent-topology">5. Multi-Agent Topology</h3>
<ul>
<li><strong>Claude Code</strong>: Lead → subagent pattern. Main agent delegates work to spawned subagents.</li>
<li><strong>OpenClaw</strong>: Route-and-delegate. Inbound channels route to dedicated agents that hand off to shared subagents.</li>
</ul>
<p>The deeper pattern: <strong>Claude Code optimizes for &ldquo;one session, one task.&rdquo;</strong> OpenClaw optimizes for &ldquo;many concurrent conversations, ambient presence.&rdquo; Both are correct for their respective use cases. Don&rsquo;t pick the wrong one for yours.</p>
<hr>
<h2 id="failure-modes-and-anti-patterns">Failure Modes and Anti-Patterns</h2>
<h3 id="rag-failure-modes">RAG Failure Modes</h3>
<p><strong>1. Retrieval misses the relevant chunk.</strong> Your embedding model thinks the question and the answer are semantically distant when they aren&rsquo;t. Mitigation: hybrid search (vector + keyword), reranking, query expansion.</p>
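<p>Hybrid search needs some way to merge the vector ranking with the keyword ranking. One common option is reciprocal rank fusion, which scores each document by summing 1/(k + rank) over every ranking it appears in. A minimal sketch, assuming each retriever already returns an ordered list of document IDs; the function name is illustrative, and the <code>k=60</code> default is a commonly used constant rather than anything this article prescribes.</p>

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first.

    rankings: iterable of lists, each ordered from most to least relevant.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by doc ID for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

<p>A document that ranks well in both lists beats one that tops only a single list, which is the behavior you want when the embedding model and the keyword index disagree.</p>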
<p><strong>2. Retrieval returns too many irrelevant chunks.</strong> Context window fills with noise. Mitigation: stricter top-K, similarity threshold, post-retrieval filtering.</p>
<p><strong>3. The answer isn&rsquo;t actually in your corpus.</strong> RAG can&rsquo;t fabricate truth — if the knowledge isn&rsquo;t indexed, the model still doesn&rsquo;t know. Mitigation: a confidence check, or a fallback to web search.</p>
<p><strong>4. Chunking destroyed the structure.</strong> You split a markdown file mid-table, mid-code-block, mid-argument. Mitigation: structure-aware chunking (by heading, by paragraph, by semantic unit).</p>
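<p>Structure-aware chunking can start very simply: split on headings, and never split inside a fenced code block. The sketch below assumes markdown input with ATX-style headings (lines starting with one to six <code>#</code> characters and a space). The function name is illustrative, and a production chunker would also enforce size limits.</p>

```python
import re

FENCE = "`" * 3  # the markdown code-fence marker
HEADING = re.compile(r"#{1,6} ")


def chunk_by_heading(markdown):
    """Split markdown at heading lines, keeping fenced code blocks whole."""
    chunks, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence
        # A '#' line inside a code fence is a comment, not a heading.
        is_heading = HEADING.match(line) is not None and not in_fence
        if is_heading and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]
```

<p>Because the only split points are headings outside code fences, tables and code blocks always stay inside a single chunk together with the heading that introduces them.</p>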
<h3 id="agent-failure-modes">Agent Failure Modes</h3>
<p><strong>1. Reasoning drift.</strong> The agent gets stuck in a loop, repeatedly trying variations of the same failed approach. Mitigation: max-step limits, distinct-tool-call constraints, explicit &ldquo;what have I tried&rdquo; memory.</p>
<p><strong>2. Permission overreach.</strong> The agent does too much. It was asked to fix one test, it refactored half the file. Mitigation: explicit scope in the prompt, narrow tool permissions, human-in-the-loop for destructive ops.</p>
<p><strong>3. Tool-call cascade failure.</strong> A single bad tool call (e.g., a malformed path) gets followed by five reasoning steps trying to &ldquo;fix&rdquo; the symptom rather than the root cause. Mitigation: clear error messages from tools, &ldquo;try once then escalate&rdquo; tool design.</p>
<p><strong>4. Spending money on the wrong thing.</strong> A 20-step agent loop costs at least 20× a single LLM call, and usually more, because the context grows with every step. If RAG would have answered the question, you just paid 20× or more to get a worse answer. Mitigation: ask &ldquo;could this be a single retrieval?&rdquo; before going to agent mode.</p>
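<p>Two of the mitigations for reasoning drift, a max-step limit and a &ldquo;what have I tried&rdquo; memory, fit in one small guard object. This is a sketch under assumptions: <code>LoopGuard</code> and its method are hypothetical names, and treating two calls as &ldquo;the same&rdquo; only when tool and arguments match exactly is a deliberate simplification.</p>

```python
import json


class LoopGuard:
    """Stops an agent loop on a step cap or a repeated identical tool call."""

    def __init__(self, max_steps=20):
        self.max_steps = max_steps
        self.steps = 0
        self.tried = set()  # memory of (tool, serialized args) already attempted

    def check(self, tool, args):
        """Return None to proceed, or a reason string to stop."""
        self.steps += 1
        if self.steps > self.max_steps:
            return "step limit reached"
        # Serialize with sorted keys so argument order does not matter.
        key = (tool, json.dumps(args, sort_keys=True))
        if key in self.tried:
            return "repeated identical call"
        self.tried.add(key)
        return None
```

<p>The harness would call <code>check</code> before executing each tool call and end the task, or escalate to a human, when it returns a reason.</p>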
<h3 id="the-worst-anti-pattern-agent-when-rag-works">The Worst Anti-Pattern: Agent-When-RAG-Works</h3>
<p>The single most expensive mistake teams make: building an agent for a problem that&rsquo;s actually a search problem.</p>
<p>If your users are asking <em>&ldquo;where in the docs does it say…&rdquo;</em>, you don&rsquo;t need an agent. You need a search box wired to a vector index. Stop spending tokens on multi-step reasoning to find something a single retrieval call would surface.</p>
<hr>
<h2 id="what-this-means-for-builders">What This Means for Builders</h2>
<p>A practical checklist if you&rsquo;re starting a new AI feature:</p>
<ol>
<li><strong>Frame the problem as a verb.</strong> <em>&ldquo;Answer questions about X&rdquo;</em> → RAG. <em>&ldquo;Do X on behalf of the user&rdquo;</em> → agent.</li>
<li><strong>If you can answer it with one retrieval, do.</strong> Cheaper, faster, more predictable.</li>
<li><strong>If you go agent, design permissions on day one.</strong> Not day fifty. Pinterest&rsquo;s two-layer auth wasn&rsquo;t a feature — it was a survival requirement.</li>
<li><strong>Plan for hybrid.</strong> Real agents will need RAG-style retrieval inside their loop. Pick a protocol (MCP is the obvious default) and stick to it.</li>
<li><strong>Instrument everything.</strong> Tool call counts, retrieval hit rates, drift indicators. You can&rsquo;t tune what you can&rsquo;t see.</li>
<li><strong>Set a budget per task.</strong> Both in tokens and in iterations. Agents without budgets find creative ways to spend forever on the wrong thing.</li>
</ol>
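<p>Items 5 and 6 can share one small object: something that counts tool calls while enforcing a per-task ceiling. The sketch below is illustrative only. <code>TaskBudget</code> and <code>BudgetExceeded</code> are hypothetical names, and it assumes the harness can report how many tokens each step consumed.</p>

```python
class BudgetExceeded(Exception):
    """Raised when a task runs past its token or iteration ceiling."""


class TaskBudget:
    """Tracks per-task spend and per-tool call counts."""

    def __init__(self, max_tokens, max_iterations):
        self.max_tokens = max_tokens
        self.max_iterations = max_iterations
        self.tokens = 0
        self.iterations = 0
        self.tool_calls = {}  # tool name mapped to number of calls

    def record(self, tool, tokens_used):
        """Record one loop iteration; raise if either ceiling is crossed."""
        self.iterations += 1
        self.tokens += tokens_used
        self.tool_calls[tool] = self.tool_calls.get(tool, 0) + 1
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token budget exhausted")
        if self.iterations > self.max_iterations:
            raise BudgetExceeded("iteration budget exhausted")
```

<p>The counters double as instrumentation: <code>tool_calls</code> can be exported to whatever metrics system you already run once the task finishes.</p>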
<hr>
<h2 id="closing-thought">Closing Thought</h2>
<p>The RAG-versus-agent framing made sense in 2023, when these were two distinct paradigms competing for the same job. In 2026, they&rsquo;re complementary layers of the same system.</p>
<p>The interesting question isn&rsquo;t <em>which one to use</em>. It&rsquo;s <em>which slice of your problem belongs in which layer</em>. Get that division right and you ship something useful. Get it wrong and you&rsquo;ll spend a quarter rebuilding it.</p>
<p>For most teams shipping today, the answer looks like this:</p>
<pre tabindex="0"><code>                ┌───────────────────────────────┐
                │      Agent loop (outer)        │
                │   reasoning + tool selection   │
                └──────────┬────────────────────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
        ▼                  ▼                  ▼
   RAG retrieval     Action tools       Computation
   (knowledge)       (mutate state)     (math, code)
</code></pre><p>Agent decides. RAG informs. Tools act. That&rsquo;s the whole stack.</p>
<hr>
<p><em>References</em></p>
<ul>
<li><em><a href="https://blog.bytebytego.com/p/ep216-rags-vs-agents">ByteByteGo EP216 — RAGs vs Agents</a></em></li>
<li><em><a href="https://blog.bytebytego.com/p/how-pinterest-built-a-production">ByteByteGo — How Pinterest Built a Production MCP Ecosystem</a></em></li>
<li><em><a href="https://blog.bytebytego.com/p/ep214-claude-code-vs-openclaw-5-design">ByteByteGo EP214 — Claude Code vs. OpenClaw: 5 Design Dimensions</a></em></li>
<li><em><a href="https://simonwillison.net/2026/May/19/5-minute-llms/">Simon Willison — The Last Six Months in LLMs in Five Minutes</a></em></li>
<li><em><a href="https://www.latent.space/p/ainews-all-model-labs-are-now-agent">Latent.Space — All Model Labs Are Now Agent Labs</a></em></li>
</ul>
]]></content:encoded>
    </item>
  </channel>
</rss>
