ADLC vs SDLC: why traditional lifecycle models are not enough
Traditional SDLC assumes deterministic behavior: given the same input and code, a system produces the same output. AI agents break that assumption. Even with identical prompts, tools, and context, an agent can choose different reasoning paths, call tools in a different order, or produce varied text because of sampling temperature, upstream model updates, or internal state the caller cannot see. ADLC extends SDLC with explicit phases for non-deterministic testing, evaluation dataset management, and continuous tuning.
Key differences between SDLC and ADLC
| Dimension | SDLC approach | ADLC approach |
|---|---|---|
| Testing | Unit and integration tests assert exact outputs for given inputs | Evaluation metrics, golden datasets, and statistical bounds over multiple runs |
| Versioning | Code commits and semantic versioning | Prompt versions, tool contract versions, model versions, and evaluation dataset versions |
| Deployment | Binaries or containers are immutable once deployed | Model endpoints can be updated upstream; prompts can change without code deployment |
| Monitoring | Error rates, latency, and resource metrics | Quality metrics, hallucination rates, approval rates, and evaluation scores over time |
| Rollback | Revert to previous binary or container image | Revert prompt, tool policy, model version, or routing rules independently |
| Regression | Test suite catches breaking changes | Evaluation suite detects quality drift; prompts may need tuning even without code changes |
Prompts are code, but they are not the only code
Modern agent systems treat prompts as versioned artifacts alongside tool contracts, evaluation datasets, and orchestration logic. The "prompt as code" discipline is necessary but not sufficient: you must also version the evaluation criteria, golden examples, and policy rules that define acceptable behavior.
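As a concrete illustration of versioning these artifacts together, the sketch below bundles a prompt version, tool contract versions, the pinned model, and the evaluation dataset into one release record. It is a minimal sketch; the class and field names are hypothetical rather than taken from any particular framework.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass(frozen=True)
class AgentRelease:
    """One versioned snapshot of everything that defines agent behavior."""
    prompt_version: str             # e.g. "triage-prompt@3.2.0"
    tool_contract_versions: dict    # tool name -> contract version
    model_version: str              # pinned model identifier
    eval_dataset_version: str       # golden dataset the release was validated against
    released_on: date = field(default_factory=date.today)

# A prompt-only change still produces a new release record, so a later
# regression can be traced to the exact artifact that changed.
release = AgentRelease(
    prompt_version="triage-prompt@3.2.0",
    tool_contract_versions={"fhir_lookup": "1.4.0", "scheduler": "2.0.1"},
    model_version="provider-model-2025-06-01",
    eval_dataset_version="triage-golden@14",
)
```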
Agent Development Lifecycle (ADLC)
Salesforce Architect guide defining the ADLC phases (Ideation, Development, Testing, Deployment, Monitoring), inner- and outer-loop activities, and non-deterministic testing strategies for production agents.
Read the Salesforce ADLC guide
ADLC phases: from ideation to production
ADLC organizes agent development into five phases: ideation, development, testing and validation, deployment, and monitoring and tuning. The inner loop (ideation, development, testing) supports rapid iteration, while the outer loop (deployment, monitoring, and feeding insights back into development) handles continuous improvement in production.
Diagram: Agent Development Lifecycle phases (inner and outer loops)
ADLC phase details
| Phase | Description | Key activities | Primary risk | Exit criteria |
|---|---|---|---|---|
| Ideation | Define the agent's scope, autonomy level, and success metrics | Identify user workflows, select tools, define policy boundaries, draft evaluation questions | Building an agent for a problem better solved by deterministic automation | Clear use case, bounded tool set, and measurable success criteria |
| Development (Inner Loop) | Rapid iteration on prompts, tool contracts, and orchestration logic | Draft prompts, implement tools, run local tests, adjust temperature and routing rules | Overfitting to a narrow test set or missing edge cases | Agent completes end-to-end workflows in a sandbox environment |
| Testing & Validation | Systematic evaluation against golden datasets and adversarial inputs | Run evaluation suites, test failure modes, validate tool contract compliance, review reasoning traces | Silent regression where quality drifts without obvious errors | Evaluation metrics meet baselines; failure modes are understood and mitigated |
| Deployment | Controlled release to production with gradual rollout and observability | Configure canary deployments, set up monitoring dashboards, document rollback procedures, train reviewers | Unexpected behavior in production due to scale, data shifts, or model drift | Successful pilot run with no critical incidents; rollback paths tested |
| Monitoring & Tuning (Outer Loop) | Continuous observation and improvement based on production signals | Track quality metrics, collect edge cases, re-run evaluations, tune prompts and policies | Drift where quality degrades gradually without clear triggers | Ongoing; tuning loops feed back into development when thresholds are breached |
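The outer-loop exit criterion (tuning when thresholds are breached) can be made explicit in code. The sketch below is a simplified illustration; the metric names and threshold values are assumptions, not prescribed by ADLC.

```python
# Hypothetical outer-loop thresholds; real values come from the success
# criteria defined during ideation.
QUALITY_FLOOR = 0.85          # minimum acceptable evaluation score
HALLUCINATION_CEILING = 0.02  # maximum tolerated hallucination rate

def outer_loop_action(eval_score: float, hallucination_rate: float) -> str:
    """Decide whether production metrics should trigger tuning or rollback."""
    if hallucination_rate > HALLUCINATION_CEILING:
        return "rollback"           # revert prompt, model, or routing independently
    if eval_score < QUALITY_FLOOR:
        return "open-tuning-loop"   # feed the signal back into development
    return "continue-monitoring"

print(outer_loop_action(eval_score=0.81, hallucination_rate=0.01))  # open-tuning-loop
```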
Testing non-deterministic systems
Testing agents requires shifting from "does this exact output match?" to "is this behavior acceptable within bounds?" A well-designed test strategy combines golden datasets for happy-path validation, adversarial inputs for failure-mode testing, regression suites for drift detection, and human evaluation for nuanced quality judgments.
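One way to express "acceptable within bounds" is to run the same golden case several times and assert on the aggregate pass rate instead of one exact output. The sketch below assumes a hypothetical `run_agent` callable and golden-case fields (`required_tool`, `must_mention`); adapt both to your own harness.

```python
import statistics

def evaluate_within_bounds(run_agent, case, n_runs=10, min_pass_rate=0.9):
    """Run one golden case repeatedly and assert the aggregate pass rate,
    rather than asserting a single exact output."""
    passes = []
    for _ in range(n_runs):
        result = run_agent(case["input"])
        # "Acceptable" here means the required tool was called and the answer
        # mentions every fact the golden case marks as mandatory.
        ok = (case["required_tool"] in result["tools_called"]
              and all(fact in result["answer"] for fact in case["must_mention"]))
        passes.append(ok)
    pass_rate = statistics.mean(passes)
    assert pass_rate >= min_pass_rate, f"pass rate {pass_rate:.0%} below bound"
    return pass_rate
```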
Testing strategies for AI agents
| Strategy | Purpose | Key techniques | Limitations |
|---|---|---|---|
| Golden paths | Validate that the agent handles expected workflows correctly | Curated input-output pairs, reference reasoning chains, tool call sequences | Does not catch edge cases or novel situations |
| Adversarial inputs | Test failure modes, safety boundaries, and robustness | Malicious prompts, out-of-scope requests, ambiguous or contradictory inputs | Hard to exhaust; may miss subtle safety issues |
| Regression suites | Detect quality drift over time as prompts or models change | Periodic evaluation runs, metric baselines, threshold alerts | Requires stable evaluation datasets and clear success metrics |
| A/B evaluation | Compare candidate prompts, models, or configurations in production | Canary deployments, interleaved trials, blind human ratings | Requires traffic volume and careful experimental design |
| Human evaluation | Assess nuanced quality, safety, and appropriateness | Expert review, crowdsourced ratings, clinical or domain-specific rubrics | Expensive, slow, and subject to bias or inconsistency |
Evaluation datasets must be versioned and treated as first-class artifacts. When you change a prompt, tool, or model, you should re-run the previous evaluation dataset to check for regression. When you discover new edge cases in production, add them to the dataset for future testing. This discipline creates a feedback loop where production experience continuously strengthens the test suite.
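A lightweight way to keep the dataset versioned is to store golden cases in an immutable, version-stamped file and create a new version whenever production surfaces a new edge case. The file layout and names below are illustrative assumptions, not a standard.

```python
import json
from pathlib import Path

def load_cases(path: Path) -> list[dict]:
    """Each JSONL line is one golden case: input, required tool calls, must-mention facts."""
    return [json.loads(line) for line in path.read_text().splitlines() if line.strip()]

def add_production_edge_case(current: Path, next_version: Path, case: dict) -> Path:
    """Write a *new* dataset version with the edge case appended, so earlier
    evaluation runs against the old version stay reproducible."""
    next_version.write_text(current.read_text() + json.dumps(case) + "\n")
    return next_version

# Example with hypothetical file names: bump golden_v14 to golden_v15 after an incident.
# add_production_edge_case(Path("eval/golden_v14.jsonl"),
#                          Path("eval/golden_v15.jsonl"),
#                          {"input": "...", "required_tool": "fhir_lookup",
#                           "must_mention": ["allergy history"]})
```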
Test tool contracts separately from agent reasoning
Tool contracts (APIs, MCP servers, FHIR operations) should have their own unit tests independent of the agent. This separation isolates failures: if the tool contract tests pass but the agent fails, the problem is in planning or tool selection. If the tool contract tests fail, the problem is in the tool implementation.
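A minimal sketch of that separation, assuming a hypothetical `lookup_patient` tool: the contract test checks only the fields and types the agent's planner depends on, with no agent in the loop.

```python
def lookup_patient(patient_id: str) -> dict:
    """Stand-in for the real tool (e.g., a FHIR Patient read); returns a minimal record."""
    return {"id": patient_id, "name": "Example Patient", "birthDate": "1970-01-01"}

def test_lookup_patient_contract():
    """If this fails, the fault is in the tool implementation, not in agent reasoning."""
    result = lookup_patient("example-123")
    assert set(result) >= {"id", "name", "birthDate"}   # planner-required fields exist
    assert isinstance(result["id"], str)                # with the expected types

test_lookup_patient_contract()
```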