Ben Novak

RAG 3.0 — Agentic Architecture

How this search application evolved from a fixed pipeline to an autonomous agent that plans, retrieves, evaluates, and iterates.

1. RAG Evolution

Previous

RAG 2.0 — Fixed Pipeline

A rigid, linear flow: each query follows the same fixed steps regardless of complexity. Two separate LLM calls with hardcoded routing logic.

Classify (Haiku) → Embed → Search → Synthesize (Sonnet)
One search pass, no iteration
Router decides mode upfront
Same pipeline for simple & complex queries
No ability to refine or retry

Current

🧠RAG 3.0 — Agentic

Haiku classifies intent cheaply, then routes: simple lookups go straight to search, while complex queries get a Sonnet agent that plans, retrieves, and iterates autonomously.

Haiku classifies → routes to fast path or agent
HYBRID: embed + search (no Sonnet cost)
AGENTIC: Sonnet agent with tools + iteration
Right model for each task = cost optimized
Agent self-evaluates and refines if needed

2. Agentic Loop Flow

🔍

User Query

Natural-language search query enters the system

e.g. "What PPE do I need for metal grinding?" or "3M N95 respirator"

Query Classification (Claude Haiku)

Routes

Haiku classifies intent as HYBRID, LLM_AUGMENTED, or OFF_TOPIC. Fast and cheap ($0.80 per million input tokens).

Determines the optimal pipeline path. HYBRID queries skip the agent entirely. LLM_AUGMENTED queries are handed off to the Sonnet agent. OFF_TOPIC queries are rejected immediately. Also produces a refined query optimized for the search index.
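The routing described above can be sketched as a small dispatch function. This is a minimal illustration, not the application's actual code: only the three intent labels come from the text, while the `Classification` shape, the `Route` type, and the `route` function are hypothetical names.

```typescript
// Intent labels come from the text above; every other name is hypothetical.
type Intent = "HYBRID" | "LLM_AUGMENTED" | "OFF_TOPIC";

interface Classification {
  intent: Intent;
  refinedQuery: string; // query rewritten for the search index
}

type Route =
  | { kind: "fast_path"; query: string } // embed + search, no Sonnet call
  | { kind: "agent"; query: string }     // hand off to the Sonnet agent
  | { kind: "reject"; reason: string };  // rejected before any retrieval

function route(c: Classification): Route {
  switch (c.intent) {
    case "HYBRID":
      return { kind: "fast_path", query: c.refinedQuery };
    case "LLM_AUGMENTED":
      return { kind: "agent", query: c.refinedQuery };
    case "OFF_TOPIC":
      return { kind: "reject", reason: "Query is outside the product catalog domain." };
  }
}
```

Modeling the route as a discriminated union means the compiler flags any caller that forgets to handle one of the three paths.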

🧠

Agent Plans Approach (Sonnet — LLM_AUGMENTED only)

Claude Sonnet 4 receives the query with available tools and decides its strategy

For complex queries only. The agent analyzes what product categories are needed and determines which tools to call. HYBRID queries skip this step and go straight to embed + search.

🛠️

Tool Execution

Loop

Agent calls tools: search_products, compare_products, get_product_details, calculate_project_cost

Each tool call hits the same hybrid retrieval infrastructure (BM25 + HNSW vector + semantic reranking via Azure AI Search). Results are accumulated across calls.

🔄

Evaluate & Iterate

Loop

Agent reviews tool results and decides: more searches needed, or enough to answer?

If initial results are insufficient, the agent refines its query and searches again. The loop is capped at 5 tool calls and a 30-second timeout for cost control.
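A bounded loop of this kind might look like the sketch below. The two limits (5 tool calls, 30 seconds) come from the text; the `LoopDeps` interface and its callbacks are hypothetical stand-ins for the model call, tool execution, and final synthesis, injected so the control flow can be read and tested in isolation.

```typescript
// Hypothetical types: the real agent talks to Claude's tool use API.
interface ToolCall { name: string; input: unknown }

type AgentStep =
  | { type: "tool_call"; call: ToolCall }
  | { type: "final"; answer: string };

interface LoopDeps {
  nextStep: (results: unknown[]) => Promise<AgentStep>; // model decides the next action
  runTool: (call: ToolCall) => Promise<unknown>;        // executes one tool
  synthesize: (results: unknown[]) => Promise<string>;  // forced final answer
  now?: () => number;                                   // injectable clock for tests
}

const MAX_TOOL_CALLS = 5;
const TIMEOUT_MS = 30_000;

async function runAgentLoop(deps: LoopDeps): Promise<{ answer: string; toolCalls: number }> {
  const now = deps.now ?? Date.now;
  const deadline = now() + TIMEOUT_MS;
  const results: unknown[] = [];
  let toolCalls = 0;

  while (toolCalls < MAX_TOOL_CALLS && now() < deadline) {
    const step = await deps.nextStep(results);
    if (step.type === "final") return { answer: step.answer, toolCalls };
    results.push(await deps.runTool(step.call)); // results accumulate across calls
    toolCalls++;
  }
  // Budget exhausted: synthesize from whatever has been gathered so far.
  return { answer: await deps.synthesize(results), toolCalls };
}
```

Note that this sketch only checks the deadline between steps; aborting a call that is already in flight would need an additional mechanism such as an `AbortSignal`.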

Final Answer

Agent synthesizes a grounded expert answer from all gathered results

Structured response with summary, product recommendations (from actual results only), safety notes, and pro tips. Streamed to the UI via Server-Sent Events.

📦

SSE Stream Delivery

Each reasoning step, tool call, and result streams to the frontend in real-time

Users watch the agent think — every tool call and result appears as an animated timeline in the Agent Reasoning Panel.
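As an illustration of the streaming step, the sketch below serializes agent events into SSE frames. The `event:` and `data:` line format terminated by a blank line is standard SSE; the `AgentEvent` shapes and the `toSseFrame` helper are assumptions made for this example.

```typescript
// Hypothetical event shapes for the reasoning timeline.
type AgentEvent =
  | { type: "reasoning"; text: string }
  | { type: "tool_call"; name: string; input: unknown }
  | { type: "tool_result"; name: string; resultCount: number }
  | { type: "final"; answer: string };

function toSseFrame(event: AgentEvent): string {
  // JSON.stringify escapes newlines, so the payload always fits on one data line.
  return `event: ${event.type}\ndata: ${JSON.stringify(event)}\n\n`;
}
```

An API route would write these frames to a response served with `Content-Type: text/event-stream`, and the frontend can subscribe per event type to drive the timeline.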

3. Agent Tools

🔍

search_products

Hybrid search against the industrial product catalog using BM25 + vector + semantic reranking. Supports filters for category, brand, price range.

Example Input

{"query": "N95 respirator", "filters": {"categories": ["Safety"]}}

Returns: Array of scored products with facet counts
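For reference, a tool like this could be declared to Claude roughly as follows. Claude's tool use API takes a name, a description, and a JSON Schema under `input_schema`; the `query` and `filters.categories` fields match the example input above, while `brands`, `min_price`, and `max_price` are assumed field names for the brand and price-range filters.

```typescript
// Shape of a tool definition: name, description, and a JSON Schema for the input.
interface ToolDefinition {
  name: string;
  description: string;
  input_schema: {
    type: "object";
    properties: Record<string, unknown>;
    required?: string[];
  };
}

const searchProductsTool: ToolDefinition = {
  name: "search_products",
  description:
    "Hybrid search against the industrial product catalog. " +
    "Supports filters for category, brand, and price range.",
  input_schema: {
    type: "object",
    properties: {
      query: { type: "string", description: "Natural-language or keyword query" },
      filters: {
        type: "object",
        properties: {
          categories: { type: "array", items: { type: "string" } },
          brands: { type: "array", items: { type: "string" } }, // assumed field name
          min_price: { type: "number" },                         // assumed field name
          max_price: { type: "number" },                         // assumed field name
        },
      },
    },
    required: ["query"],
  },
};
```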

⚖️

compare_products

Side-by-side comparison of 2-4 products on price, brand, category, and key attributes.

Example Input

{"product_ids": ["SA-001", "SA-002", "SA-003"]}

Returns: Comparison table with all product attributes

📋

get_product_details

Full detail lookup for a specific product by ID. Checks accumulated results first, then falls back to Azure search.

Example Input

{"product_id": "EL-001"}

Returns: Complete product record with all attributes
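The cache-then-fallback lookup can be sketched as below. The `Product` shape and the `fetchFromIndex` callback (standing in for the Azure lookup) are hypothetical; the order of checks follows the description above.

```typescript
interface Product { id: string; name: string; price: number }

// `fetchFromIndex` stands in for the Azure lookup; its signature is an assumption.
async function getProductDetails(
  productId: string,
  accumulated: Map<string, Product>,
  fetchFromIndex: (id: string) => Promise<Product | undefined>,
): Promise<Product | undefined> {
  const cached = accumulated.get(productId);
  if (cached) return cached; // already retrieved by an earlier tool call

  const fetched = await fetchFromIndex(productId);
  if (fetched) accumulated.set(fetched.id, fetched); // remember for later calls
  return fetched;
}
```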

💰

calculate_project_cost

Calculates total project cost from a bill of materials. Useful for project-based queries like "What do I need to wire a circuit?"

Example Input

{"items": [{"product_id": "EL-001", "name": "...", "price": 8.49, "quantity": 2}]}

Returns: Itemized breakdown with subtotals and grand total
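A minimal sketch of the cost calculation, assuming the input shape shown in the example above. The `CostBreakdown` return type and the choice to sum in integer cents are assumptions of this sketch, not confirmed details of the tool.

```typescript
interface BomItem { product_id: string; name: string; price: number; quantity: number }

interface CostBreakdown {
  lines: Array<BomItem & { subtotal: number }>;
  grand_total: number;
}

function calculateProjectCost(items: BomItem[]): CostBreakdown {
  // Work in integer cents to avoid floating-point drift when summing prices.
  const toCents = (dollars: number): number => Math.round(dollars * 100);
  let totalCents = 0;
  const lines = items.map((item) => {
    const subtotalCents = toCents(item.price) * item.quantity;
    totalCents += subtotalCents;
    return { ...item, subtotal: subtotalCents / 100 };
  });
  return { lines, grand_total: totalCents / 100 };
}
```

Summing in cents keeps totals such as 2 × 8.49 at exactly 16.98 rather than accumulating binary floating-point error.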

4. Adaptive Strategy

The system adapts to query complexity without hardcoded routing rules: Haiku picks the path, and on the agentic path the Sonnet agent decides which tools to call and when to stop. Here are three example traces showing how different query types are handled:

"3M N95 respirator" (HYBRID)
1. search_products({query: "3M N95 respirator"}) → 12 results

Simple lookup: the fast path runs one search and returns the results directly, without invoking the Sonnet agent. ~1.5s total.

🧠 "What PPE do I need for metal grinding?" (AGENTIC)
1. search_products({query: "metal grinding PPE safety equipment"}) → 15 results
2. search_products({query: "face shield grinding eye protection"}) → 8 results
3. search_products({query: "welding gloves heat resistant"}) → 6 results
4. Final answer with 5 recommendations + safety notes

Advisory query: the agent searches multiple categories to build a comprehensive PPE kit. ~4s total.

🧠 "DeWalt vs Milwaukee 20V drill" (AGENTIC)
1. search_products({query: "DeWalt 20V drill"}) → 4 results
2. search_products({query: "Milwaukee 20V drill"}) → 3 results
3. compare_products({product_ids: ["HT-010", "HT-011"]}) → comparison table
4. Final answer with recommendation

Comparison query: the agent searches both brands, then uses the compare tool for side-by-side analysis. ~3.5s total.

5. Technology Stack

Next.js 14

Frontend & API Routes

React-based framework with SSE streaming API routes. App Router with Suspense for streaming UI state.

Claude Haiku 3.5

Query Classification

Fast, cheap intent classifier ($0.80/M). Routes HYBRID queries to direct search and LLM_AUGMENTED queries to the Sonnet agent. Runs on every query.

Claude Sonnet 4

Agentic Reasoning & Synthesis

Multi-step reasoning agent with tool_use for complex queries. Only invoked for LLM_AUGMENTED queries — HYBRID queries skip it entirely.

Claude tool_use API

Agent Tool Framework

Native tool calling interface lets the agent invoke search, compare, and calculate tools autonomously within a bounded loop.

Azure AI Search

Hybrid Retrieval Engine

BM25 keyword matching + HNSW vector index, fused via RRF, followed by semantic reranking. Supports faceted navigation and OData filters.
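To illustrate the OData filter support, the sketch below turns a filter object into an Azure AI Search `$filter` expression using `search.in` and the `ge` / `le` comparison operators. The index field names (`category`, `brand`, `price`) and the `SearchFilters` shape are assumptions; the real index schema may differ.

```typescript
interface SearchFilters {
  categories?: string[];
  brands?: string[];
  minPrice?: number;
  maxPrice?: number;
}

// Single quotes inside OData string literals are escaped by doubling them.
const quote = (s: string): string => s.replace(/'/g, "''");

function buildODataFilter(f: SearchFilters): string | undefined {
  const clauses: string[] = [];
  // '|' is used as the delimiter so values may contain commas or spaces.
  if (f.categories && f.categories.length > 0) {
    clauses.push(`search.in(category, '${f.categories.map(quote).join("|")}', '|')`);
  }
  if (f.brands && f.brands.length > 0) {
    clauses.push(`search.in(brand, '${f.brands.map(quote).join("|")}', '|')`);
  }
  if (f.minPrice !== undefined) clauses.push(`price ge ${f.minPrice}`);
  if (f.maxPrice !== undefined) clauses.push(`price le ${f.maxPrice}`);
  return clauses.length > 0 ? clauses.join(" and ") : undefined;
}
```

Returning `undefined` for an empty filter lets the caller omit the `filter` parameter entirely instead of sending an empty string.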

Azure OpenAI

Query Embedding

text-embedding-3-small generates 1536-dim vectors for semantic similarity search against the product catalog.

Server-Sent Events

Real-Time Streaming

Agent reasoning steps, tool calls, and results stream to the UI in real-time via SSE — users watch the agent think.

6. Cost Model & Safety Bounds

Two-stage architecture: Haiku classifies every query cheaply, then only complex queries invoke Sonnet. HYBRID queries never touch the agent.

Component | Model / Service | When | Cost
Query classification | Claude Haiku 3.5 | Every query | $0.80/M
Agent (reasoning + synthesis) | Claude Sonnet 4 | LLM_AUGMENTED only | $3.00/M
Query embedding | text-embedding-3-small | Every search call | $0.02/M
Retrieval + reranking | Azure AI Search | Every search call | per-query
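The effect of the two-stage design on spend can be estimated with a little arithmetic. In the sketch below the per-million rates come from the table; the token counts and the share of agentic queries are invented inputs for illustration, and the calculation ignores output-token pricing, embedding, and search costs.

```typescript
// Rates per million tokens, taken from the table above.
const RATES_PER_MILLION = { haiku: 0.80, sonnet: 3.00 } as const;

function tokenCost(tokens: number, ratePerMillion: number): number {
  return (tokens / 1_000_000) * ratePerMillion;
}

// All three fields are assumptions supplied by the caller, not measured values.
interface TrafficAssumptions {
  classifierTokens: number; // tokens per Haiku classification
  agentTokens: number;      // tokens per Sonnet agent run
  agenticShare: number;     // fraction of queries routed to the agent, 0..1
}

function expectedLlmCostPerQuery(a: TrafficAssumptions): number {
  const classify = tokenCost(a.classifierTokens, RATES_PER_MILLION.haiku); // every query
  const agent = tokenCost(a.agentTokens, RATES_PER_MILLION.sonnet);        // agentic only
  return classify + a.agenticShare * agent;
}
```

With the purely illustrative inputs of 500 classifier tokens, 8,000 agent tokens, and 30% agentic traffic, the expected LLM cost is about $0.0076 per query, versus about $0.0244 if every query ran the agent.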

5

Max Tool Calls

Prevents runaway cost on complex queries

30s

Max Timeout

Hard timeout forces synthesis from gathered results

0 calls

Off-Topic Guard

Agent responds without tools for irrelevant queries

7. Search Platform Comparison

The same RAG pipeline ships against four retrieval engines: Azure AI Search (managed), Apache Solr 9.6 (self-hosted on Azure ACI), Elasticsearch 9.1 (self-hosted on Azure ACI), and Google Vertex AI (managed). Only the retrieval call differs; the classifier, agent loop, synthesis prompt, and UI are identical across all four. The first three share an Azure OpenAI embedding pipeline, while Vertex is a managed black box that fuses semantic and keyword retrieval internally with its own embedder. This is the payoff of a platform-agnostic RAG architecture: swap engines, keep the pipeline.

Azure AI Search

Microsoft-hosted managed service. BM25 + vector + proprietary semantic reranker in one API.

  • ✓ RRF fusion (native)
  • ✓ Cross-encoder reranker (built-in)
  • ✓ Scoring profiles / facets
  • ✗ You do not own the operational model (Microsoft-managed)
NDCG@10 (hybrid_semantic): 87%

Apache Solr 9.6

Apache 2.0, self-hosted on Azure Container Instances. Largest customization surface, fully permissive license.

  • ✓ BM25 (edismax) + KNN vector
  • ⧗ RRF via app-level code (10.1 ships native)
  • ✓ Scoring profiles / facets
  • ✗ No native cross-encoder reranker
NDCG@10 (hybrid): 79%
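For the app-level RRF mentioned in the Solr card, a minimal Reciprocal Rank Fusion sketch follows. It takes ranked lists of document IDs (for example one from BM25 and one from KNN) and merges them; the function name and the default constant k = 60 are choices made for this example, not confirmed details of the project's implementation.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank of d),
// with 1-based ranks. Documents missing from a ranking contribute nothing for it.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}
```

Because RRF uses only ranks, it needs no score normalization between BM25 and cosine similarity, which is why it is practical to run in application code.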

Elasticsearch 9.1

Elastic License, self-hosted on Azure ACI. Strongest AI-native story — ELSER, inference APIs, native reranker retriever.

  • ✓ RRF retriever (native, single request)
  • ⚠ Platinum license for RRF / reranker / ELSER
  • ✓ Terms aggs / filters
  • ◇ Cross-encoder reranker available via inference API
NDCG@10 (hybrid RRF): 79%

Google Vertex AI

Fully managed Discovery Engine. Internal Gecko-class embedder, semantic + keyword fusion in one API call, no infrastructure.

  • ✓ Hybrid retrieval (single API call)
  • ✓ Internal embedder — no separate embed cost
  • ✓ Free tier covers small catalogs
  • ✗ Black-box: no separable bm25/vector or score control
NDCG@10 (hybrid): 69%

NDCG@10 across modes (measured)

Mode | Azure | Solr | Elasticsearch | Vertex AI
BM25 | 76% | 73% | 75% | n/a
Vector (KNN) | 85% | 85% | 85% | n/a
Hybrid (RRF) | 84% | 79% | 79% | 69%
+ Semantic reranker | 87% | n/a | n/a (planned) | n/a (managed)

Numbers pulled live from data/retrieval-eval-results-*.json — re-run npm run evaluate:retrieval:* to refresh. Full breakdown with latencies, per-mode per-query deltas, and ablation chains lives on the /evaluation dashboard (Section 8).
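For readers unfamiliar with the metric, NDCG@10 can be computed as in the sketch below. It uses the common exponential gain (2^rel − 1) with a log2 rank discount; whether the project's evaluation script uses this gain or a linear one is not stated in the source, so treat the formula choice as an assumption.

```typescript
// DCG@k with gain (2^rel - 1) and discount log2(rank + 1), ranks starting at 1.
function dcgAtK(relevances: number[], k: number): number {
  return relevances
    .slice(0, k)
    .reduce((sum, rel, i) => sum + (Math.pow(2, rel) - 1) / Math.log2(i + 2), 0);
}

// rankedRelevances: graded relevance of each returned result, in ranked order.
// allJudgedRelevances: relevance grades of every judged document for the query.
function ndcgAtK(rankedRelevances: number[], allJudgedRelevances: number[], k = 10): number {
  const ideal = allJudgedRelevances.slice().sort((a, b) => b - a);
  const idcg = dcgAtK(ideal, k);
  return idcg === 0 ? 0 : dcgAtK(rankedRelevances, k) / idcg;
}
```

A per-engine score like the ones in the table would then be the mean of `ndcgAtK` over all evaluation queries.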

The architectural win

90% of the pipeline is engine-agnostic. Only src/lib/search.ts, src/lib/search-solr.ts, src/lib/search-es.ts, and src/lib/search-vertex.ts differ. Same classifier, same filter-merge, same chip-bar UX, same agent loop. Four retrieval engines sit behind identical contracts (Vertex skips the embed step since it embeds internally).
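The "identical contracts" idea can be sketched as a shared interface. The `SearchEngine`, `SearchRequest`, and `SearchHit` names and the `needsEmbedding` flag are hypothetical; the actual exports of the four `src/lib/search*.ts` modules are not shown in the source.

```typescript
// Hypothetical contract shared by all engine adapters.
interface SearchRequest { query: string; top: number; filter?: string; vector?: number[] }
interface SearchHit { id: string; score: number }

interface SearchEngine {
  readonly name: string;
  readonly needsEmbedding: boolean; // false for engines that embed internally
  search(req: SearchRequest): Promise<SearchHit[]>;
}

async function retrieve(
  engine: SearchEngine,
  query: string,
  embed: (text: string) => Promise<number[]>,
  top = 10,
): Promise<SearchHit[]> {
  // Skip the embedding call (and its cost) when the engine embeds on its own.
  const vector = engine.needsEmbedding ? await embed(query) : undefined;
  return engine.search({ query, top, ...(vector ? { vector } : {}) });
}
```

Everything upstream of `retrieve` (classifier, agent loop, UI) depends only on the interface, so swapping engines amounts to passing a different adapter.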

Known follow-up: cross-encoder reranking for Solr + ES

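The 8-point NDCG@10 gap between Azure's hybrid_semantic (87%) and the Solr and ES hybrid modes (79%) is attributed mainly to Azure's cross-encoder reranker, although the table above shows the reranker adding only 3 points over Azure's own hybrid RRF (84%). Narrowing the gap for Solr and ES means deploying Cohere rerank-v3.5 on Azure AI Foundry (~$1/1k queries) and wiring it in as a post-processing step. Planned, not yet shipped.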