Agentic RAG with self-correction
Production RAG systems often suffer from "Contextual Blindness," where the model retrieves irrelevant data but attempts to answer anyway. Agentic RAG adds a Self-Correction (Maker-Checker) loop that validates retrieval quality before generation.
Diagram: Self-Correcting RAG Loop
Grounding Verification
In this pattern, the Checker agent doesn't just look for "an answer." It evaluates the Candidate Context against a specific rubric: "Does this context contain the facts required to satisfy the User Query?" If not, it instructs the Maker to try a different search strategy, effectively automating the retry logic a human would otherwise perform manually.
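The loop can be sketched in a few lines. This is a minimal sketch, not a framework API: `retrieve`, `grade`, `generate`, and `rewrite` are hypothetical callables standing in for your retriever, LLM judge, generator, and query rewriter, and the retry budget is arbitrary.

```python
from typing import Callable

def agentic_rag(
    query: str,
    retrieve: Callable[[str], list[str]],      # Maker: vector/keyword search
    grade: Callable[[str, list[str]], bool],   # Checker: grounding rubric
    generate: Callable[[str, list[str]], str],
    rewrite: Callable[[str, list[str]], str],  # new search strategy on failure
    max_retries: int = 3,
) -> str:
    """Maker-Checker loop: retrieve, verify grounding, and retry with a
    rewritten query whenever the Checker rejects the candidate context."""
    search_query = query
    for _ in range(max_retries):
        context = retrieve(search_query)
        if grade(query, context):            # "Does this context contain the
            return generate(query, context)  # facts required by the query?"
        search_query = rewrite(query, context)  # automate the human "retry"
    return "No grounded context found; escalating instead of guessing."
```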
Evaluating agents is fundamentally different from evaluating deterministic software
Traditional software testing assumes deterministic behavior: the same input always produces the same output. Agent systems violate that assumption. Quality evaluation for agents requires statistical approaches, diverse test datasets, and continuous measurement—not single-pass unit tests.
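As a concrete illustration, a statistical harness samples each test case several times and asserts a pass-rate threshold rather than a single exact output. The `run_agent` and `grader` callables below are hypothetical stand-ins for your own agent entry point and correctness check.

```python
from typing import Callable

def pass_rate(
    run_agent: Callable[[str], str],
    grader: Callable[[str, str], bool],
    case: dict,
    samples: int = 10,
) -> float:
    """Run one test case repeatedly and return the fraction of passing runs."""
    passes = sum(
        grader(run_agent(case["input"]), case["expected"])
        for _ in range(samples)
    )
    return passes / samples

def evaluate(cases: list[dict], run_agent, grader, threshold: float = 0.9) -> None:
    """Statistical regression test: each case must pass at least `threshold`
    of its sampled runs, not just a single run."""
    failures = [
        c["input"] for c in cases
        if pass_rate(run_agent, grader, c) < threshold
    ]
    assert not failures, f"Cases below {threshold:.0%} pass rate: {failures}"
```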
Agent evaluation approaches and when to use them
| Approach | When to use | Strengths | Limitations |
|---|---|---|---|
| Offline benchmarks | Early development, model selection, regression testing | Fast, reproducible, good for catching major regressions | May not reflect real-world usage patterns, can be gamed |
| Online A/B testing | Production deployment, comparing model or prompt versions | Measures real-world performance, captures actual user impact | Requires significant traffic, slow to converge, ethical concerns for some domains |
| Human evaluation | Complex tasks, safety-critical decisions, quality assessment | Captures nuance and context that automated metrics miss | Expensive, slow, subjective, does not scale well |
| Automated regression | Continuous integration, prompt changes, model updates | Fast, repeatable, integrates into CI/CD pipelines | Requires maintaining evaluation datasets, may miss edge cases |
| Adversarial testing | Security validation, jailbreak resistance, safety testing | Finds vulnerabilities and failure modes that normal tests miss | Cannot cover all possible attacks, requires adversarial expertise |
Effective agent evaluation requires a layered approach: automated regression tests for fast feedback during development, human evaluation for quality assurance, and adversarial testing for security validation. The evaluation dataset itself must evolve as the agent encounters new edge cases in production.
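One way to keep the dataset evolving is to append every triaged production failure to the regression set so the next evaluation run covers it. A minimal sketch, assuming a JSONL file; the path and field names are illustrative.

```python
import json
from pathlib import Path

EVAL_SET = Path("eval_cases.jsonl")  # illustrative location of the regression set

def record_edge_case(user_input: str, expected: str, tags: list[str]) -> None:
    """Append a triaged production failure to the evaluation dataset."""
    with EVAL_SET.open("a", encoding="utf-8") as f:
        f.write(json.dumps({
            "input": user_input,
            "expected": expected,
            "tags": tags,  # e.g. ["retrieval-miss", "formatting"]
        }) + "\n")
```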
Agent Development Lifecycle (ADLC) — Testing and Evaluation
Salesforce Architect guide covering the ADLC Testing & Validation phase, including evaluation dataset management, regression suites, adversarial testing, and outer-loop continuous tuning for non-deterministic systems.
Read the ADLC Testing and Evaluation section
Production reliability requires accepting and bounding non-determinism
You cannot eliminate non-determinism in agent systems—the same input will sometimes produce different outputs. Production reliability comes from bounding the blast radius of bad outputs, detecting low-confidence decisions, and having clear escalation paths when the agent is uncertain.
Strategies for handling non-deterministic agent behavior
| Strategy | How it works | Best for |
|---|---|---|
| Temperature control | Lower temperature reduces randomness; higher temperature increases creativity but also variance | Balancing consistency vs. creativity based on task requirements |
| Deterministic tool routing | Use rule-based or semantic routing to choose tools rather than letting the model decide | Reducing variance in tool selection for common, repetitive tasks |
| Validation layers | Post-processing checks that validate outputs against schemas, rules, or business logic before use | Catching hallucinations, format errors, and policy violations before they affect downstream systems |
| Confidence thresholds | Require the agent to estimate confidence and reject or escalate low-confidence results | Filtering uncertain decisions and surfacing cases that need human review |
| Retry with consensus | Run the same request multiple times and use voting or aggregation to produce a more stable result | Improving consistency for critical decisions where latency is acceptable (see the sketch after this table) |
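To make the retry-with-consensus row concrete: sample the agent several times, take a majority vote, and treat a weak majority as a low-confidence signal. The `run_agent` callable, the normalization, and the agreement threshold below are assumptions, not a specific library's API.

```python
from collections import Counter
from typing import Callable

def answer_with_consensus(
    run_agent: Callable[[str], str],
    query: str,
    runs: int = 5,
    min_agreement: float = 0.6,
) -> str | None:
    """Sample the agent several times and accept the majority answer only
    if enough runs agree; a weak majority signals low confidence."""
    # Normalize lightly so trivially different phrasings still match.
    answers = [run_agent(query).strip().lower() for _ in range(runs)]
    best, count = Counter(answers).most_common(1)[0]
    if count / runs >= min_agreement:
        return best
    return None  # no stable consensus: retry or escalate to a human
```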
Diagram: Confidence-based decision flow for agent outputs
Accept some non-determinism while bounding the blast radius
Perfection is not the goal—consistency within acceptable bounds is. Focus on detecting and handling the cases where non-determinism produces bad outcomes, rather than trying to eliminate variation entirely. Confidence thresholds, validation layers, and clear escalation patterns are more practical than attempting fully deterministic behavior.
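A minimal sketch of such a gate, assuming a self-reported (or consensus-derived) confidence score and a caller-supplied `validate` check; the thresholds are illustrative and should be tuned per task.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class Decision(Enum):
    ACCEPT = "accept"
    RETRY = "retry"
    ESCALATE = "escalate"

@dataclass
class AgentResult:
    output: str
    confidence: float  # self-reported, or derived from log-probs / consensus

def gate(
    result: AgentResult,
    validate: Callable[[str], bool],  # schema / policy validation layer
    accept_at: float = 0.85,
    escalate_below: float = 0.5,
) -> Decision:
    """Bound the blast radius: validate first, then route on confidence."""
    if not validate(result.output):
        return Decision.RETRY              # malformed or policy-violating output
    if result.confidence >= accept_at:
        return Decision.ACCEPT             # high confidence: pass downstream
    if result.confidence < escalate_below:
        return Decision.ESCALATE           # very low confidence: human review
    return Decision.RETRY                  # middle band: try again
```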
Observability, rollback, and cost management are production requirements, not afterthoughts
Running agent systems in production requires operational disciplines that go beyond prompt engineering. You need traces for debugging, metrics for performance, budgets for cost control, and rollback procedures for when agents misbehave. These must be designed before launch, not bolted on after incidents.
Operational concerns for production agent systems
| Concern | Why it matters | Key practices |
|---|---|---|
| Agent tracing | Debugging multi-agent workflows requires full execution traces, not just final outputs | Log every tool call, intermediate output, and routing decision; correlate traces across agents (see the sketch after this table) |
| Token cost tracking | Multi-step workflows can have surprisingly high token costs that only appear in production | Track tokens per agent, per tool, and per user; set budgets and alerts |
| Latency monitoring | Agent response times vary based on model, tools, and workflow complexity | Measure end-to-end latency, break down by step, and track percentiles |
| Error classification | Agent errors come from models, tools, prompts, or data—root cause requires categorization | Classify errors by type, track error rates per category, and alert on anomalies |
| Rollback procedures | Agents may produce correct results today but fail after a model or prompt change | Version prompts and model configurations, maintain canary deployments, and have rollback triggers |
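To illustrate the tracing row, a lightweight decorator can emit one structured span per tool call using only the standard library. The field names and correlation-ID scheme are assumptions; in production these spans would go to a tracing backend rather than stdout.

```python
import functools
import json
import time
import uuid

TRACE_ID = str(uuid.uuid4())  # correlates every span in one workflow run

def traced(step: str):
    """Decorator that logs each call as a structured trace span."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.monotonic()
            error = None
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                error = repr(exc)  # classify later by error type
                raise
            finally:
                print(json.dumps({
                    "trace_id": TRACE_ID,
                    "step": step,  # tool call, routing decision, ...
                    "duration_ms": round((time.monotonic() - start) * 1000),
                    "error": error,
                }))
        return inner
    return wrap

@traced("search_tool")
def search(query: str) -> list[str]:
    return []  # placeholder tool body
```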
Cost management deserves explicit attention because token usage scales with workflow complexity. Simple tasks should use cheap models, repeated queries should use caching, and expensive models should be reserved for complex reasoning steps. Model routing—choosing the right model for each subtask—can reduce costs by 60-80% without sacrificing quality.
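A model router can be as simple as a complexity heuristic gating between two model tiers. The model names and the heuristic below are placeholders; real routers typically use a trained classifier or a rule set tuned to the workload.

```python
CHEAP_MODEL = "small-fast-model"          # placeholder model identifiers
EXPENSIVE_MODEL = "large-reasoning-model"

def estimate_complexity(task: str) -> float:
    """Toy heuristic: longer, multi-step prompts score as more complex."""
    signals = ("plan", "multi-step", "analyze", "prove")
    score = min(len(task) / 2000, 1.0)
    score += 0.3 * sum(word in task.lower() for word in signals)
    return min(score, 1.0)

def route_model(task: str, threshold: float = 0.5) -> str:
    """Send simple tasks to the cheap model; reserve the expensive one
    for complex reasoning steps."""
    return EXPENSIVE_MODEL if estimate_complexity(task) >= threshold else CHEAP_MODEL
```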
Observability is a pre-production requirement, not a post-launch add-on
Design tracing, metrics, and logging into your agent architecture from day one. Without observability, you cannot debug failures, measure performance, or prove compliance. Trying to add observability after a production incident is too late—you need the data before the problem occurs.
Cloud-native evaluation & state mapping
| Concept / Tool | AWS | Azure | GCP |
|---|---|---|---|
| Statistical Evaluation | Bedrock Model Evaluation | Azure AI Studio Eval SDK | Vertex AI Model Evaluation |
| Task Ledger / State | Step Functions (Express) | Semantic Kernel Store | Reasoning Engine State |
| Durable Memory | Amazon OpenSearch Serverless | Azure AI Search (Vector) | Vertex AI Vector Search |
Knowledge Check
Test your understanding with this quiz. You need to answer all questions correctly to mark this section as complete.