RAGAS Evaluation
Systematic quality measurement of our RAG pipeline using RAGAS (Retrieval Augmented Generation Assessment) metrics, scored using the official RAGAS Python framework with Claude as the evaluator LLM.
Last evaluated: April 26, 2026 • 15 test cases
What is RAGAS?
Traditional Evaluation
- Manual spot-checking of search results
- Vibes-based quality assessment
- No systematic measurement framework
- •Hard to detect regressions between changes
RAGAS Framework
- ✓ 4 standardized metrics covering retrieval + generation
- ✓ LLM-as-judge for automated, scalable evaluation
- ✓ Curated test dataset with ground truth answers
- ✓ Reproducible scores for regression detection
Implementation note: Scored using the official RAGAS Python library (v0.4) with Claude Haiku as the evaluator LLM. A two-stage pipeline runs queries through the live RAG system (TypeScript), then scores the outputs using the industry-standard RAGAS framework (Python).
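A minimal sketch of how the handoff between the two stages could look, assuming the TypeScript stage writes one JSON object per line and the Python stage loads those records before scoring. The JSONL format, the `EvalRecord` class, the `load_records` helper, and the field names (`question`, `answer`, `contexts`, `ground_truth`) are all hypothetical; the source does not describe the actual file format.

```python
import json
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One test case as emitted by the (hypothetical) stage-1 export."""
    question: str
    answer: str
    contexts: list[str]
    ground_truth: str = ""


def load_records(lines):
    """Parse JSONL lines from stage 1 into records ready for scoring."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        raw = json.loads(line)
        records.append(
            EvalRecord(
                question=raw["question"],
                answer=raw["answer"],
                contexts=list(raw.get("contexts", [])),
                ground_truth=raw.get("ground_truth", ""),
            )
        )
    return records
```

Keeping the two stages decoupled through a plain file like this would let the scoring step be re-run without hitting the live RAG system again.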
Overall Scores
- Routing: query classification accuracy
- Faithfulness: answers grounded in retrieved context
- Relevancy: answers address the actual question
- Ctx Precision: retrieved docs are relevant & well-ranked
- Ctx Recall: retrieved docs cover the ground truth
Scores by Query Type
- ⚡ Simple Lookup: 5 queries
- 🧠 Advisory: 3 queries
- ⚖️ Comparison: 2 queries
- 🚫 Off-Topic: 3 queries
- 🧪 Edge Case: 2 queries
Individual Test Results
01 ✓ 3M N95 respirator • Hybrid • 53%
Retrieved Products (20)
Generated Answer
We have two excellent 3M N95 respirator options available. Both are NIOSH-approved and filter at least 95% of airborne particles, with the key difference being whether you need extended comfort features for longer wear periods. Recommended products: 3M 8210 N95 Particulate Respirator (20 pack): Sta...
Latency: 9444ms
02 ✓ 20-amp GFCI circuit breaker • Hybrid • 93%
Retrieved Products (20)
Generated Answer
For a 20-amp GFCI circuit breaker, I recommend the Square D QO 20-Amp GFCI Circuit Breaker which provides Class A 5mA GFCI protection with self-test functionality. If you need dual protection for areas like bedrooms or kitchens, consider the AFCI/GFCI combination breaker that meets current NEC requi...
Latency: 8862ms
03 ✓ Klein Tools wire stripper • Hybrid • 78%
Retrieved Products (20)
Generated Answer
For Klein Tools wire strippers, we have the Klein Tools 11055 Wire Stripper/Cutter available. This precision tool handles both solid and stranded wire in the most common AWG ranges with hardened steel construction to prevent conductor damage. Recommended products: Klein Tools 11055 Wire Stripper/Cu...
Latency: 10334ms
04 ✓ Belden Cat6 plenum cable 1000 ft • Hybrid • 95%
Retrieved Products (20)
Generated Answer
I found the Belden Cat6 plenum cable you're looking for. The Belden 2413 DataTwist is available in 1000 ft length with CMP plenum rating, perfect for air-handling spaces without conduit. Recommended products: Belden 2413 Cat6 Plenum-Rated Cable, 1000 ft (Blue): This is exactly what you requested - ...
Latency: 8334ms
05 ✓ Axis indoor IP dome camera • Hybrid • 94%
Retrieved Products (20)
Generated Answer
For Axis indoor IP dome cameras, I can recommend the Axis P3245-LV 2 MP Network Dome Camera with IR. This indoor fixed dome camera offers 1080p HDTV resolution with Lightfinder 2.0 low-light technology and built-in OptimizedIR up to 40 m range. Recommended products: Axis P3245-LV 2 MP Network Dome ...
Latency: 9013ms
06 ✓ What PPE do I need for metal grinding? • Agentic • 72%
Retrieved Products (29)
Generated Answer
For metal grinding, you need comprehensive PPE to protect against sparks, metal particles, noise, and potential cuts. Essential equipment includes safety glasses, respiratory protection, cut-resistant or welding gloves, and hearing protection for prolonged work. Recommended products: Honeywell Uvex...
Latency: 20464ms
07 ✓ What do I need to wire a 20-amp outdoor circuit? • Agentic • 62%
Retrieved Products (33)
Generated Answer
For a 20-amp outdoor circuit, you'll need a GFCI circuit breaker for electrical panel protection, weather-resistant GFCI outlet, appropriate wire, and weatherproof cover. The circuit requires GFCI protection due to outdoor moisture exposure and specific components rated for wet locations. Recommend...
Latency: 56329ms
08 ✓ What do I need to install a Cat6A network drop end-to-end? • Agentic • 66%
Retrieved Products (23)
Generated Answer
For a complete Cat6A network drop installation, you'll need bulk cable, keystone jacks for termination, wall faceplates, patch panel ports, and patch cords to connect equipment. This setup provides 10 Gigabit capability with proper shielding for industrial environments. Recommended products: Belden...
Latency: 71982ms
09 ✓ DeWalt vs Milwaukee drill • Agentic • 47%
Retrieved Products (13)
Generated Answer
Both DeWalt and Milwaukee offer excellent brushless drills with similar performance. Milwaukee drills typically deliver higher torque and include larger batteries in kits, while DeWalt offers more compact options and the unique FLEXVOLT ADVANTAGE technology for increased power with compatible batter...
Latency: 53263ms
10 ✓ Which is better for home use: GFCI or AFCI breaker? • Agentic • 67%
Retrieved Products (18)
Generated Answer
For home use, neither GFCI nor AFCI alone is "better" - they protect against different hazards and current NEC code requires both types of protection in different areas. GFCI protects against ground faults (shock/electrocution), while AFCI protects against arc faults (fires). The best solution is a ...
Latency: 30895ms
11 ✓ What is the best pizza recipe? • Off-Topic • 100%
Latency: 1257ms
12 ✓ How do I train for a marathon? • Off-Topic • 100%
Latency: 1341ms
13 ✓ Explain quantum computing • Off-Topic • 100%
Latency: 1314ms
14 ✓ safety • Hybrid • 20%
Retrieved Products (20)
Generated Answer
Safety equipment is essential for workplace protection across multiple hazard categories. Based on our product catalog, we offer comprehensive safety solutions including respiratory protection, fall protection, head and eye protection, hand protection, hearing protection, and specialized welding saf...
Latency: 11677ms
15 ✓ I need to install an IP camera and replace a circuit breaker • Agentic • 63%
Retrieved Products (23)
Generated Answer
For your IP camera installation, the Axis P3245-LV offers excellent value at $689 with 1080p resolution, low-light capabilities, and 40m infrared range for indoor use. For circuit breaker replacement, the Square D QO 20-Amp Standard breaker ($8.49) is the most economical choice, while GFCI ($42.99) ...
Latency: 17589ms
Methodology
01 Faithfulness
Extracts atomic factual claims from the generated answer, then verifies each claim against the retrieved contexts. A claim must be directly supported — not merely plausible. Score = supported claims / total claims.
Scored by: RAGAS + Claude Haiku
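The final ratio for faithfulness can be sketched as follows. This covers only the arithmetic; in the real pipeline the claim extraction and per-claim verdicts come from the evaluator LLM. The function name `faithfulness_score` and the `claim_verdicts` input are hypothetical, and returning NaN for an answer with no claims is an assumption, not documented RAGAS behavior.

```python
def faithfulness_score(claim_verdicts):
    """Fraction of extracted claims judged as supported by the contexts.

    claim_verdicts: list of booleans, one per atomic claim
    (True = directly supported by the retrieved contexts).
    """
    if not claim_verdicts:
        return float("nan")  # assumption: no claims means the score is undefined
    return sum(claim_verdicts) / len(claim_verdicts)


# Example: 3 of 4 claims are supported by the retrieved context.
score = faithfulness_score([True, True, False, True])
```

With these verdicts the score is 3 / 4 = 0.75.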
02 Answer Relevancy
Generates hypothetical questions the answer would address, embeds them alongside the original query, and computes average cosine similarity. High similarity = answer stays on-topic.
Scored by: RAGAS + Azure OpenAI Embeddings
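A minimal sketch of the similarity step for answer relevancy, assuming the hypothetical questions have already been generated and embedded. The function names and the toy two-dimensional vectors are illustrative only; real embeddings would come from the embedding model named above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat a zero vector as unrelated
    return dot / (norm_a * norm_b)


def answer_relevancy(query_vec, generated_question_vecs):
    """Mean cosine similarity between the query and each generated question."""
    sims = [cosine_similarity(query_vec, q) for q in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy example: one generated question matches the query direction, one is orthogonal.
relevancy = answer_relevancy([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In this toy case the similarities are 1 and 0, so the mean is 0.5.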
03 Context Precision
For each retrieved document (in search rank order), judges relevance to the query. Computes precision at each rank position, rewarding relevant documents appearing at the top. Measures retrieval ranking quality.
Scored by: RAGAS + Claude Haiku
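The rank-weighted calculation for context precision can be sketched as below, assuming the per-document relevance verdicts are already available from the judge LLM. It averages precision@k over the ranks that hold a relevant document, which matches the description above; `context_precision` and its input are hypothetical names, and returning 0.0 when nothing is relevant is an assumption.

```python
def context_precision(relevance):
    """Average of precision@k taken at each rank k that holds a relevant document.

    relevance: list of booleans in search rank order (True = relevant).
    """
    hits = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / rank  # precision@rank at this position
    return weighted_sum / hits if hits else 0.0


# Same two relevant documents, ranked early versus late.
early = context_precision([True, False, True])   # (1/1 + 2/3) / 2
late = context_precision([False, True, True])    # (1/2 + 2/3) / 2
```

The same set of relevant documents scores higher when they appear earlier in the ranking, which is why this metric reflects ranking quality and not just relevance.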
04 Context Recall
Extracts factual claims from the ground truth answer, then checks whether the retrieved contexts contain supporting information. Measures how completely the retrieval covers the ideal answer. This is the only metric requiring ground truth.
Scored by: RAGAS + Claude Haiku
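A small sketch of the context recall ratio, with the judge passed in as a function so the arithmetic can be shown without an LLM. The `substring_judge` below is a deliberately naive stand-in for the evaluator model, and all names and example strings are hypothetical.

```python
def context_recall(ground_truth_claims, contexts, is_supported):
    """Fraction of ground-truth claims that the retrieved contexts support.

    is_supported(claim, contexts) -> bool stands in for the LLM judge.
    """
    if not ground_truth_claims:
        return float("nan")  # assumption: undefined without ground-truth claims
    supported = sum(1 for claim in ground_truth_claims if is_supported(claim, contexts))
    return supported / len(ground_truth_claims)


def substring_judge(claim, contexts):
    """Naive stand-in judge: a claim counts as supported if it appears verbatim."""
    return any(claim.lower() in ctx.lower() for ctx in contexts)


claims = ["rated for wet locations", "includes a cover", "ships in 2 days"]
contexts = ["This outlet is rated for wet locations and includes a cover."]
recall = context_recall(claims, contexts, substring_judge)
```

Two of the three toy claims appear in the context, so the recall here is 2 / 3.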
Evaluation Pipeline
Retrieval Ablation Study
Head-to-head comparison of retrieval modes on 12 labeled queries using standard IR metrics: 4 core strategies plus a field-boosted variant of the production mode. Unlike RAGAS (end-to-end), these isolate pure retrieval quality from generation. The production mode (Hybrid + Semantic Reranking) is highlighted; this ablation justifies that architectural choice.
| Retrieval Mode | Description | R@5 | R@10 | P@5 | P@10 | MRR | NDCG@10 | Latency |
|---|---|---|---|---|---|---|---|---|
| BM25 Only | Pure keyword search using BM25 scoring. No vectors, no reranking. | 65% | 79% | 30% | 24% | 87% | 76% | 53ms |
| Vector Only | Pure vector search using cosine similarity. No keyword matching. | 79% | 90% | 47% | 29% | 90% | 85% | 44ms |
| Hybrid (no rerank) | BM25 + vector fusion. No semantic reranking layer. | 72% | 88% | 40% | 28% | 92% | 84% | 65ms |
| Hybrid + Semantic Reranking (Production) | Production: BM25 + vector + Microsoft semantic reranker. | 77% | 91% | 45% | 30% | 92% | 87% | 115ms |
| Hybrid + Semantic + Field Boosting (Enhanced) | Production + scoring profile: name×5, sku×4, brand×3, category×1.5 weights. | 77% | 91% | 45% | 30% | 92% | 87% | 102ms |
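For readers less familiar with the column headers, the per-query metrics can be sketched as below, assuming binary relevance labels (a document is either in the ground-truth set or not). The function names are hypothetical and this is not the evaluation code behind the table; the reported numbers would be these values averaged over the labeled queries, with MRR being the mean of the per-query reciprocal rank.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Share of the ground-truth documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def precision_at_k(ranked, relevant, k):
    """Share of the top k slots occupied by ground-truth documents."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


ranked = ["a", "x", "b", "y"]
relevant = {"a", "b", "c"}
r2 = recall_at_k(ranked, relevant, 2)      # 1 of 3 ground-truth docs in the top 2
p2 = precision_at_k(ranked, relevant, 2)   # 1 of 2 slots is relevant
```

One consequence of the precision definition: a query with a single ground-truth document can reach at most 20% P@5, so low P@k values are expected when the labeled set per query is small.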
Per-Query Results
| # | Query | Type | Ground truth |
|---|---|---|---|
| 01 | 3M N95 respirator | simple lookup | 3 GT |
| 02 | 20-amp GFCI circuit breaker | simple lookup | 2 GT |
| 03 | Klein Tools wire stripper | simple lookup | 1 GT |
| 04 | Belden Cat6 plenum cable 1000 ft | simple lookup | 2 GT |
| 05 | Axis indoor IP dome camera | simple lookup | 1 GT |
| 06 | What PPE do I need for metal grinding? | advisory | 8 GT |
| 07 | What do I need to wire a 20-amp outdoor circuit? | advisory | 6 GT |
| 08 | What do I need to install a Cat6A network drop end-to-end? | advisory | 6 GT |
| 09 | DeWalt vs Milwaukee drill | comparison | 2 GT |
| 10 | Which is better for home use: GFCI or AFCI breaker? | comparison | 2 GT |
| 11 | safety | edge case | 7 GT |
| 12 | I need to install an IP camera and replace a circuit breaker | edge case | 3 GT |
IR Metrics
Embedding Model Comparison
Comparison of embedding models using Hybrid + Semantic Reranking mode (production configuration). Each model was used to embed both documents and queries, with separate Azure AI Search indexes per model.
| Embedding Model | Dims | R@5 | R@10 | P@5 | MRR | NDCG@10 |
|---|---|---|---|---|---|---|
| text-embedding-3-small (Production) | 1,536 | 77% | 91% | 45% | 92% | 87% |
| text-embedding-3-large | 3,072 | 76% | 92% | 43% | 92% | 87% |
| text-embedding-3-large (reduced) | 1,536 | 77% | 92% | 45% | 92% | 87% |
| Cohere Embed v4 | 1,536 | 77% | 91% | 45% | 92% | 86% |
Methodology: Each embedding model was used to index the full product catalog into a separate Azure AI Search index, then the same test queries were run against each index. BM25 scores are identical across models (keyword search is embedding-agnostic); differences appear in the vector and hybrid modes where embedding quality directly impacts retrieval.
Platform Comparison: Azure AI Search vs Apache Solr vs Elasticsearch vs Vertex AI
Same 12 labeled queries. Three of the four engines share the same Azure OpenAI embeddings (text-embedding-3-small, 1,536 dims); Vertex AI uses its own internal embedder. Solr and Elasticsearch are both self-hosted on Azure Container Instances; try them live at /solr-search, /es-search, and /vertex-search.
How each engine does hybrid: Azure fuses BM25 + vector natively with its own Reciprocal Rank Fusion. Elasticsearch uses the retriever.rrf API (gated behind a Platinum license — trial active for this demo; see /es-search). Solr does app-level RRF in our code (SOLR-17319 ships first-party RRF in 10.1). The first three implement the same formula; results differ primarily because of engine-level BM25 and KNN tuning. Vertex AI is a managed black-box: Google fuses semantic + keyword internally with its own embedder, with no public API to disable either side — so it gets one row (hybrid only) and no separable bm25/vector breakdown. Only Azure has a cross-encoder semantic reranker, so the hybrid_semantic row below is the reference ceiling.
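The fusion formula shared by the first three engines can be sketched as follows. This is a generic Reciprocal Rank Fusion illustration, not the app-level Solr code from this project; the function name is hypothetical, and the constant `k=60` is the commonly used default, assumed here because the source does not state which value each engine uses.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is an assumed default, not a value confirmed by the source.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id for a deterministic result.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_results = ["a", "b", "c"]
vector_results = ["b", "c", "a"]
fused = reciprocal_rank_fusion([bm25_results, vector_results])
```

Because only rank positions enter the formula, the raw BM25 and vector scores never need to be normalized against each other, which is what makes the same fusion step portable across engines.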
| Mode | Platform | R@5 | R@10 | P@5 | MRR | NDCG@10 | Latency |
|---|---|---|---|---|---|---|---|
| BM25 Only: pure keyword search using BM25 scoring. No vectors, no reranking. | Azure | 65% | 79% | 30% | 87% | 76% | 53ms |
| BM25 Only | Solr | 70% | 79% | 37% | 79% | 73% | 44ms |
| BM25 Only | Elasticsearch | 70% | 82% | 37% | 79% | 75% | 40ms |
| Vector Only: pure vector search using cosine similarity. No keyword matching. | Azure | 79% | 90% | 47% | 90% | 85% | 44ms |
| Vector Only | Solr | 79% | 90% | 47% | 90% | 85% | 39ms |
| Vector Only | Elasticsearch | 79% | 90% | 47% | 90% | 85% | 38ms |
| Hybrid (no rerank): BM25 + vector fusion. No semantic reranking layer. | Azure | 72% | 88% | 40% | 92% | 84% | 65ms |
| Hybrid (no rerank) | Solr | 76% | 85% | 42% | 88% | 79% | 74ms |
| Hybrid (no rerank) | Elasticsearch | 78% | 86% | 45% | 83% | 79% | 184ms |
| Hybrid (no rerank) | Vertex AI | 62% | 66% | 53% | 83% | 69% | 510ms |
| Hybrid + Semantic Reranking (Azure only): Microsoft semantic reranker (no Solr equivalent) | Azure | 77% | 91% | 45% | 92% | 87% | 115ms |
Why you'd pick Azure AI Search
- Managed service, no servers to patch
- Semantic reranker out of the box (NDCG@10 rises from 84% to 87% over plain hybrid in our measurements)
- Azure-native auth, billing, compliance
- Best retrieval quality in our measurements
Why you'd pick Apache Solr
- Apache 2.0 — fully permissive, no restrictions
- Largest ecosystem for custom scoring / analyzers
- All features free at every scale
- Mature cluster management (SolrCloud + ZooKeeper)
Why you'd pick Elasticsearch
- Native RRF retriever (single-request hybrid)
- Strong AI-native story: ELSER, inference APIs, cross-encoder reranker
- Largest mindshare + vendor-backed tooling (Kibana, Elastic Cloud)
- ⚠ RRF/reranker/ELSER gated behind Platinum license
Why you'd pick Vertex AI
- Fully managed — no embedder, no infra, one API
- Semantic + keyword fusion built-in (Google's Gecko-class embedder)
- Free tier covers small catalogs (~10 GB)
- ⚠ Black-box: no separable bm25/vector or scoring control