Agentic RAG with self-correction
Production RAG systems often suffer from "Contextual Blindness," where the model retrieves irrelevant data but attempts to answer anyway. Agentic RAG adds a Self-Correction (Maker-Checker) loop that validates retrieval quality before generation.
Diagram: Self-Correcting RAG Loop
Grounding Verification
In this pattern, the Checker agent doesn't just look for "an answer." It evaluates the Candidate Context against a specific rubric: "Does this context contain the facts required to satisfy the User Query?" If not, it instructs the Maker to try a different search strategy, effectively automating the retry logic a human would otherwise perform manually.
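The loop can be sketched in a few lines. This is a minimal sketch, not a framework API: `retrieve`, `grade`, `generate`, and `rewrite` are hypothetical callables standing in for your retriever, LLM judge, generator, and query rewriter, and the retry budget is arbitrary.

```python
from typing import Callable

def agentic_rag(
    query: str,
    retrieve: Callable[[str], list[str]],      # Maker: vector/keyword search
    grade: Callable[[str, list[str]], bool],   # Checker: grounding rubric
    generate: Callable[[str, list[str]], str],
    rewrite: Callable[[str, list[str]], str],  # new search strategy on failure
    max_retries: int = 3,
) -> str:
    """Maker-Checker loop: retrieve, verify grounding, and retry with a
    rewritten query whenever the Checker rejects the candidate context."""
    search_query = query
    for _ in range(max_retries):
        context = retrieve(search_query)
        if grade(query, context):            # "Does this context contain the
            return generate(query, context)  # facts required by the query?"
        search_query = rewrite(query, context)  # automate the human "retry"
    return "No grounded context found; escalating instead of guessing."
```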
Evaluating agents is fundamentally different from evaluating deterministic software
Traditional software testing assumes deterministic behavior: the same input always produces the same output. Agent systems violate that assumption. Quality evaluation for agents requires statistical approaches, diverse test datasets, and continuous measurement—not single-pass unit tests.
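As a concrete illustration, a statistical harness samples each test case several times and asserts a pass-rate threshold rather than a single exact output. The `run_agent` and `grader` callables below are hypothetical stand-ins for your own agent entry point and correctness check.

```python
from typing import Callable

def pass_rate(
    run_agent: Callable[[str], str],
    grader: Callable[[str, str], bool],
    case: dict,
    samples: int = 10,
) -> float:
    """Run one test case repeatedly and return the fraction of passing runs."""
    passes = sum(
        grader(run_agent(case["input"]), case["expected"])
        for _ in range(samples)
    )
    return passes / samples

def evaluate(cases: list[dict], run_agent, grader, threshold: float = 0.9) -> None:
    """Statistical regression test: each case must pass at least `threshold`
    of its sampled runs, not just a single run."""
    failures = [
        c["input"] for c in cases
        if pass_rate(run_agent, grader, c) < threshold
    ]
    assert not failures, f"Cases below {threshold:.0%} pass rate: {failures}"
```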
Agent evaluation approaches and when to use them
| Approach | When to use | Strengths | Limitations |
|---|---|---|---|
| Offline benchmarks | Early development, model selection, regression testing | Fast, reproducible, good for catching major regressions | May not reflect real-world usage patterns, can be gamed |
| Online A/B testing | Production deployment, comparing model or prompt versions | Measures real-world performance, captures actual user impact | Requires significant traffic, slow to converge, ethical concerns for some domains |
| Human evaluation | Complex tasks, safety-critical decisions, quality assessment | Captures nuance and context that automated metrics miss | Expensive, slow, subjective, does not scale well |
| Automated regression | Continuous integration, prompt changes, model updates | Fast, repeatable, integrates into CI/CD pipelines | Requires maintaining evaluation datasets, may miss edge cases |
| Adversarial testing | Security validation, jailbreak resistance, safety testing | Finds vulnerabilities and failure modes that normal tests miss | Cannot cover all possible attacks, requires adversarial expertise |
Effective agent evaluation requires a layered approach: automated regression tests for fast feedback during development, human evaluation for quality assurance, and adversarial testing for security validation. The evaluation dataset itself must evolve as the agent encounters new edge cases in production.
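One way to keep the dataset evolving is to append every triaged production failure to the regression set so the next evaluation run covers it. A minimal sketch, assuming a JSONL file; the path and field names are illustrative.

```python
import json
from pathlib import Path

EVAL_SET = Path("eval_cases.jsonl")  # illustrative location of the regression set

def record_edge_case(user_input: str, expected: str, tags: list[str]) -> None:
    """Append a triaged production failure to the evaluation dataset."""
    with EVAL_SET.open("a", encoding="utf-8") as f:
        f.write(json.dumps({
            "input": user_input,
            "expected": expected,
            "tags": tags,  # e.g. ["retrieval-miss", "formatting"]
        }) + "\n")
```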
Agent Development Lifecycle (ADLC) — Testing and Evaluation
Salesforce Architect guide covering the ADLC Testing & Validation phase, including evaluation dataset management, regression suites, adversarial testing, and outer-loop continuous tuning for non-deterministic systems.
Read the ADLC Testing and Evaluation section
Production reliability requires accepting and bounding non-determinism
You cannot eliminate non-determinism in agent systems—the same input will sometimes produce different outputs. Production reliability comes from bounding the blast radius of bad outputs, detecting low-confidence decisions, and having clear escalation paths when the agent is uncertain.
Strategies for handling non-deterministic agent behavior
| Strategy | How it works | Best for |
|---|---|---|
| Temperature control | Lower temperature reduces randomness; higher temperature increases creativity but also variance | Balancing consistency vs. creativity based on task requirements |
| Deterministic tool routing | Use rule-based or semantic routing to choose tools rather than letting the model decide | Reducing variance in tool selection for common, repetitive tasks |
| Validation layers | Post-processing checks that validate outputs against schemas, rules, or business logic before use | Catching hallucinations, format errors, and policy violations before they affect downstream systems |
| Confidence thresholds | Require the agent to estimate confidence and reject or escalate low-confidence results | Filtering uncertain decisions and surfacing cases that need human review |
| Retry with consensus | Run the same request multiple times and use voting or aggregation to produce a more stable result | Improving consistency for critical decisions where latency is acceptable (see the sketch after this table) |
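To make the retry-with-consensus row concrete: sample the agent several times, take a majority vote, and treat a weak majority as a low-confidence signal. The `run_agent` callable, the normalization, and the agreement threshold below are assumptions, not a specific library's API.

```python
from collections import Counter
from typing import Callable

def answer_with_consensus(
    run_agent: Callable[[str], str],
    query: str,
    runs: int = 5,
    min_agreement: float = 0.6,
) -> str | None:
    """Sample the agent several times and accept the majority answer only
    if enough runs agree; a weak majority signals low confidence."""
    # Normalize lightly so trivially different phrasings still match.
    answers = [run_agent(query).strip().lower() for _ in range(runs)]
    best, count = Counter(answers).most_common(1)[0]
    if count / runs >= min_agreement:
        return best
    return None  # no stable consensus: retry or escalate to a human
```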
Diagram: Confidence-based decision flow for agent outputs
Accept some non-determinism while bounding the blast radius
Perfection is not the goal—consistency within acceptable bounds is. Focus on detecting and handling the cases where non-determinism produces bad outcomes, rather than trying to eliminate variation entirely. Confidence thresholds, validation layers, and clear escalation patterns are more practical than attempting fully deterministic behavior.
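A minimal sketch of such a gate, assuming a self-reported (or consensus-derived) confidence score and a caller-supplied `validate` check; the thresholds are illustrative and should be tuned per task.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class Decision(Enum):
    ACCEPT = "accept"
    RETRY = "retry"
    ESCALATE = "escalate"

@dataclass
class AgentResult:
    output: str
    confidence: float  # self-reported, or derived from log-probs / consensus

def gate(
    result: AgentResult,
    validate: Callable[[str], bool],  # schema / policy validation layer
    accept_at: float = 0.85,
    escalate_below: float = 0.5,
) -> Decision:
    """Bound the blast radius: validate first, then route on confidence."""
    if not validate(result.output):
        return Decision.RETRY              # malformed or policy-violating output
    if result.confidence >= accept_at:
        return Decision.ACCEPT             # high confidence: pass downstream
    if result.confidence < escalate_below:
        return Decision.ESCALATE           # very low confidence: human review
    return Decision.RETRY                  # middle band: try again
```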
Observability, rollback, and cost management are production requirements, not afterthoughts
Running agent systems in production requires operational disciplines that go beyond prompt engineering. You need traces for debugging, metrics for performance, budgets for cost control, and rollback procedures for when agents misbehave. These must be designed before launch, not bolted on after incidents.
Operational concerns for production agent systems
| Concern | Why it matters | Key practices |
|---|---|---|
| Agent tracing | Debugging multi-agent workflows requires full execution traces, not just final outputs | Log every tool call, intermediate output, and routing decision; correlate traces across agents (see the sketch after this table) |
| Token cost tracking | Multi-step workflows can have surprisingly high token costs that only appear in production | Track tokens per agent, per tool, and per user; set budgets and alerts |
| Latency monitoring | Agent response times vary based on model, tools, and workflow complexity | Measure end-to-end latency, break down by step, and track percentiles |
| Error classification | Agent errors come from models, tools, prompts, or data—root cause requires categorization | Classify errors by type, track error rates per category, and alert on anomalies |
| Rollback procedures | Agents may produce correct results today but fail after a model or prompt change | Version prompts and model configurations, maintain canary deployments, and have rollback triggers |
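To illustrate the tracing row, a lightweight decorator can emit one structured span per tool call using only the standard library. The field names and correlation-ID scheme are assumptions; in production these spans would go to a tracing backend rather than stdout.

```python
import functools
import json
import time
import uuid

TRACE_ID = str(uuid.uuid4())  # correlates every span in one workflow run

def traced(step: str):
    """Decorator that logs each call as a structured trace span."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.monotonic()
            error = None
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                error = repr(exc)  # classify later by error type
                raise
            finally:
                print(json.dumps({
                    "trace_id": TRACE_ID,
                    "step": step,  # tool call, routing decision, ...
                    "duration_ms": round((time.monotonic() - start) * 1000),
                    "error": error,
                }))
        return inner
    return wrap

@traced("search_tool")
def search(query: str) -> list[str]:
    return []  # placeholder tool body
```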
Cost management deserves explicit attention because token usage scales with workflow complexity. Simple tasks should use cheap models, repeated queries should use caching, and expensive models should be reserved for complex reasoning steps. Model routing—choosing the right model for each subtask—can reduce costs by 60-80% without sacrificing quality.
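A model router can be as simple as a complexity heuristic gating between two model tiers. The model names and the heuristic below are placeholders; real routers typically use a trained classifier or a rule set tuned to the workload.

```python
CHEAP_MODEL = "small-fast-model"          # placeholder model identifiers
EXPENSIVE_MODEL = "large-reasoning-model"

def estimate_complexity(task: str) -> float:
    """Toy heuristic: longer, multi-step prompts score as more complex."""
    signals = ("plan", "multi-step", "analyze", "prove")
    score = min(len(task) / 2000, 1.0)
    score += 0.3 * sum(word in task.lower() for word in signals)
    return min(score, 1.0)

def route_model(task: str, threshold: float = 0.5) -> str:
    """Send simple tasks to the cheap model; reserve the expensive one
    for complex reasoning steps."""
    return EXPENSIVE_MODEL if estimate_complexity(task) >= threshold else CHEAP_MODEL
```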
Observability is a pre-production requirement, not a post-launch add-on
Design tracing, metrics, and logging into your agent architecture from day one. Without observability, you cannot debug failures, measure performance, or prove compliance. Trying to add observability after a production incident is too late—you need the data before the problem occurs.
Cloud-native evaluation & state mapping
| Concept / Tool | AWS | Azure | GCP |
|---|---|---|---|
| Statistical Evaluation | Bedrock Model Evaluation | Azure AI Studio Eval SDK | Vertex AI Model Evaluation |
| Task Ledger / State | Step Functions (Express) | Semantic Kernel Store | Reasoning Engine State |
| Durable Memory | Amazon OpenSearch Serverless | Azure AI Search (Vector) | Vertex AI Vector Search |
Knowledge Check
Test your understanding with this quiz. You need to answer all questions correctly to mark this section as complete.