ADLC vs SDLC: why traditional lifecycle models are not enough
Traditional SDLC assumes deterministic behavior: given the same input and code, a system produces the same output. AI agents break that assumption. Even with identical prompts, tools, and context, an agent can choose different reasoning paths, call tools in a different order, or produce varied text because of sampling temperature, upstream model updates, or internal state the caller cannot see. ADLC extends SDLC with explicit phases for non-deterministic testing, evaluation dataset management, and continuous tuning.
Key differences between SDLC and ADLC
| Dimension | SDLC approach | ADLC approach |
|---|---|---|
| Testing | Unit and integration tests assert exact outputs for given inputs | Evaluation metrics, golden datasets, and statistical bounds over multiple runs |
| Versioning | Code commits and semantic versioning | Prompt versions, tool contract versions, model versions, and evaluation dataset versions |
| Deployment | Binaries or containers are immutable once deployed | Model endpoints can be updated upstream; prompts can change without code deployment |
| Monitoring | Error rates, latency, and resource metrics | Quality metrics, hallucination rates, approval rates, and evaluation scores over time |
| Rollback | Revert to previous binary or container image | Revert prompt, tool policy, model version, or routing rules independently |
| Regression | Test suite catches breaking changes | Evaluation suite detects quality drift; prompts may need tuning even without code changes |
Prompts are code, but they are not the only code
Modern agent systems treat prompts as versioned artifacts alongside tool contracts, evaluation datasets, and orchestration logic. The "prompt as code" discipline is necessary but not sufficient: you must also version the evaluation criteria, golden examples, and policy rules that define acceptable behavior.
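As a concrete illustration of versioning these artifacts together, the sketch below bundles a prompt version, tool contract versions, the pinned model, and the evaluation dataset into one release record. It is a minimal sketch; the class and field names are hypothetical rather than taken from any particular framework.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass(frozen=True)
class AgentRelease:
    """One versioned snapshot of everything that defines agent behavior."""
    prompt_version: str             # e.g. "triage-prompt@3.2.0"
    tool_contract_versions: dict    # tool name -> contract version
    model_version: str              # pinned model identifier
    eval_dataset_version: str       # golden dataset the release was validated against
    released_on: date = field(default_factory=date.today)

# A prompt-only change still produces a new release record, so a later
# regression can be traced to the exact artifact that changed.
release = AgentRelease(
    prompt_version="triage-prompt@3.2.0",
    tool_contract_versions={"fhir_lookup": "1.4.0", "scheduler": "2.0.1"},
    model_version="provider-model-2025-06-01",
    eval_dataset_version="triage-golden@14",
)
```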
Agent Development Lifecycle (ADLC)
Salesforce Architect guide defining the ADLC phases (Ideation, Development, Testing, Deployment, Monitoring), inner- and outer-loop activities, and non-deterministic testing strategies for production agents.
Read the Salesforce ADLC guide
ADLC phases: from ideation to production
ADLC organizes agent development into five phases: ideation, development, testing and validation, deployment, and monitoring and tuning. The inner loop (ideation, development, testing) supports rapid iteration, while the outer loop (deployment, monitoring, and feeding insights back into development) handles continuous improvement in production.
Diagram: Agent Development Lifecycle phases (inner and outer loops)
ADLC phase details
| Phase | Description | Key activities | Primary risk | Exit criteria |
|---|---|---|---|---|
| Ideation | Define the agent's scope, autonomy level, and success metrics | Identify user workflows, select tools, define policy boundaries, draft evaluation questions | Building an agent for a problem better solved by deterministic automation | Clear use case, bounded tool set, and measurable success criteria |
| Development (Inner Loop) | Rapid iteration on prompts, tool contracts, and orchestration logic | Draft prompts, implement tools, run local tests, adjust temperature and routing rules | Overfitting to a narrow test set or missing edge cases | Agent completes end-to-end workflows in a sandbox environment |
| Testing & Validation | Systematic evaluation against golden datasets and adversarial inputs | Run evaluation suites, test failure modes, validate tool contract compliance, review reasoning traces | Silent regression where quality drifts without obvious errors | Evaluation metrics meet baselines; failure modes are understood and mitigated |
| Deployment | Controlled release to production with gradual rollout and observability | Configure canary deployments, set up monitoring dashboards, document rollback procedures, train reviewers | Unexpected behavior in production due to scale, data shifts, or model drift | Successful pilot run with no critical incidents; rollback paths tested |
| Monitoring & Tuning (Outer Loop) | Continuous observation and improvement based on production signals | Track quality metrics, collect edge cases, re-run evaluations, tune prompts and policies | Drift where quality degrades gradually without clear triggers | Ongoing; tuning loops feed back into development when thresholds are breached |
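The outer-loop exit criterion (tuning when thresholds are breached) can be made explicit in code. The sketch below is a simplified illustration; the metric names and threshold values are assumptions, not prescribed by ADLC.

```python
# Hypothetical outer-loop thresholds; real values come from the success
# criteria defined during ideation.
QUALITY_FLOOR = 0.85          # minimum acceptable evaluation score
HALLUCINATION_CEILING = 0.02  # maximum tolerated hallucination rate

def outer_loop_action(eval_score: float, hallucination_rate: float) -> str:
    """Decide whether production metrics should trigger tuning or rollback."""
    if hallucination_rate > HALLUCINATION_CEILING:
        return "rollback"           # revert prompt, model, or routing independently
    if eval_score < QUALITY_FLOOR:
        return "open-tuning-loop"   # feed the signal back into development
    return "continue-monitoring"

print(outer_loop_action(eval_score=0.81, hallucination_rate=0.01))  # open-tuning-loop
```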
Testing non-deterministic systems
Testing agents requires shifting from "does this exact output match?" to "is this behavior acceptable within bounds?" A well-designed test strategy combines golden datasets for happy-path validation, adversarial inputs for failure-mode testing, regression suites for drift detection, and human evaluation for nuanced quality judgments.
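One way to express "acceptable within bounds" is to run the same golden case several times and assert on the aggregate pass rate instead of one exact output. The sketch below assumes a hypothetical `run_agent` callable and golden-case fields (`required_tool`, `must_mention`); adapt both to your own harness.

```python
import statistics

def evaluate_within_bounds(run_agent, case, n_runs=10, min_pass_rate=0.9):
    """Run one golden case repeatedly and assert the aggregate pass rate,
    rather than asserting a single exact output."""
    passes = []
    for _ in range(n_runs):
        result = run_agent(case["input"])
        # "Acceptable" here means the required tool was called and the answer
        # mentions every fact the golden case marks as mandatory.
        ok = (case["required_tool"] in result["tools_called"]
              and all(fact in result["answer"] for fact in case["must_mention"]))
        passes.append(ok)
    pass_rate = statistics.mean(passes)
    assert pass_rate >= min_pass_rate, f"pass rate {pass_rate:.0%} below bound"
    return pass_rate
```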
Testing strategies for AI agents
| Strategy | Purpose | Key techniques | Limitations |
|---|---|---|---|
| Golden paths | Validate that the agent handles expected workflows correctly | Curated input-output pairs, reference reasoning chains, tool call sequences | Does not catch edge cases or novel situations |
| Adversarial inputs | Test failure modes, safety boundaries, and robustness | Malicious prompts, out-of-scope requests, ambiguous or contradictory inputs | Hard to exhaust; may miss subtle safety issues |
| Regression suites | Detect quality drift over time as prompts or models change | Periodic evaluation runs, metric baselines, threshold alerts | Requires stable evaluation datasets and clear success metrics |
| A/B evaluation | Compare candidate prompts, models, or configurations in production | Canary deployments, interleaved trials, blind human ratings | Requires traffic volume and careful experimental design |
| Human evaluation | Assess nuanced quality, safety, and appropriateness | Expert review, crowdsourced ratings, clinical or domain-specific rubrics | Expensive, slow, and subject to bias or inconsistency |
Evaluation datasets must be versioned and treated as first-class artifacts. When you change a prompt, tool, or model, you should re-run the previous evaluation dataset to check for regression. When you discover new edge cases in production, add them to the dataset for future testing. This discipline creates a feedback loop where production experience continuously strengthens the test suite.
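A lightweight way to keep the dataset versioned is to store golden cases in an immutable, version-stamped file and create a new version whenever production surfaces a new edge case. The file layout and names below are illustrative assumptions, not a standard.

```python
import json
from pathlib import Path

def load_cases(path: Path) -> list[dict]:
    """Each JSONL line is one golden case: input, required tool calls, must-mention facts."""
    return [json.loads(line) for line in path.read_text().splitlines() if line.strip()]

def add_production_edge_case(current: Path, next_version: Path, case: dict) -> Path:
    """Write a *new* dataset version with the edge case appended, so earlier
    evaluation runs against the old version stay reproducible."""
    next_version.write_text(current.read_text() + json.dumps(case) + "\n")
    return next_version

# Example with hypothetical file names: bump golden_v14 to golden_v15 after an incident.
# add_production_edge_case(Path("eval/golden_v14.jsonl"),
#                          Path("eval/golden_v15.jsonl"),
#                          {"input": "...", "required_tool": "fhir_lookup",
#                           "must_mention": ["allergy history"]})
```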
Test tool contracts separately from agent reasoning
Tool contracts (APIs, MCP servers, FHIR operations) should have their own unit tests independent of the agent. This separation isolates failures: if the tool contract tests pass but the agent fails, the problem is in planning or tool selection. If the tool contract tests fail, the problem is in the tool implementation.
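A minimal sketch of that separation, assuming a hypothetical `lookup_patient` tool: the contract test checks only the fields and types the agent's planner depends on, with no agent in the loop.

```python
def lookup_patient(patient_id: str) -> dict:
    """Stand-in for the real tool (e.g., a FHIR Patient read); returns a minimal record."""
    return {"id": patient_id, "name": "Example Patient", "birthDate": "1970-01-01"}

def test_lookup_patient_contract():
    """If this fails, the fault is in the tool implementation, not in agent reasoning."""
    result = lookup_patient("example-123")
    assert set(result) >= {"id", "name", "birthDate"}   # planner-required fields exist
    assert isinstance(result["id"], str)                # with the expected types

test_lookup_patient_contract()
```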