ADW — Agentic Development Workflow

An open-source multi-provider AI harness built for real codebases

Client: Open Source (Personal)
Role: Creator & Maintainer
Timeline: 3 months

231 autonomous runs · 85.7% completion rate · 60% fully autonomous PRs

Challenge

With frontier model capabilities growing at a rapid pace, the harness around the LLM has become the bottleneck. It decides what the model sees, what it is allowed to do, where deterministic code takes over, and how its output integrates with the rest of a developer's workflow. A well-engineered harness tackles two specific pain points:

LLMs as tools, not the pipeline

LLMs are tools, not the pipeline. They are great at reasoning: writing code, drafting commit messages and PR descriptions, turning ambiguous input into a decision. But their non-deterministic nature makes them unreliable at following instructions, which is why everything else (running tests, linting, git operations, deploying, etc.) should be handled by scripts that the LLM invokes.
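
To illustrate the split, here is a minimal sketch of the kind of deterministic validation step an LLM can invoke but not alter. This is an assumption for illustration, not ADW's actual code; `run_validation` and the command list are hypothetical.

```python
import subprocess

# Hypothetical sketch: the commands are fixed code, so the outcome
# never depends on model behavior. The LLM only decides *when* to run it.
VALIDATION_COMMANDS = [
    ["ruff", "check", "."],
    ["mypy", "--strict", "."],
    ["pytest", "-q"],
]

def run_validation() -> bool:
    """Run every check; return True only if all of them pass."""
    for cmd in VALIDATION_COMMANDS:
        if subprocess.run(cmd).returncode != 0:
            return False
    return True

if __name__ == "__main__":
    raise SystemExit(0 if run_validation() else 1)
```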

Human-in-the-loop (HIL) friction

The cognitive load of approving and manually starting each phase of the SDLC makes parallelizing work extremely hard and caps throughput. Inserting HIL review points only where they are critical frees up the developer's attention.

A custom harness also allows encoding the team's conventions (branch names, PR templates, validation commands, deploy gates, task-manager integrations, etc.) and procedures directly into the workflow. This creates a personalized, fine-tuned, and robust SDLC that fits the team's requirements and expectations.

My Approach

I created ADW as a Python CLI tool that acts as a harness around an LLM, using Claude Code under the hood. It was built around three principles: customization, observability, and determinism.

Determinism

The LLM is only used for two specific tasks: writing (code, specs, documentation, etc.) and decision-making. Everything else is handled by scripts that the harness invokes. At the core of the tool lives the orchestrator, which manages the whole run and invokes the next step based on the run's settings.
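
As a concrete sketch of that shape, the loop below shows how an orchestrator could drive the phases deterministically, with the LLM isolated behind a single subprocess call. The names (`PHASES`, `run_phase`, the on-disk layout) are illustrative assumptions, not ADW's actual internals; `claude -p` is Claude Code's non-interactive print mode.

```python
import json
import subprocess
from pathlib import Path

PHASES = ["plan", "build", "validate", "document", "ship"]

def run_phase(phase: str, run_dir: Path) -> str:
    """Deterministic shell around the single non-deterministic step."""
    prompt = (run_dir / phase / "prompt.md").read_text()
    # Claude Code in non-interactive ("print") mode; everything
    # before and after this call is plain code.
    result = subprocess.run(
        ["claude", "-p", prompt],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def orchestrate(run_dir: Path) -> None:
    for phase in PHASES:
        output = run_phase(phase, run_dir)
        # Checkpoint at every phase boundary: state lives on disk,
        # so a paused or interrupted run can resume from here.
        (run_dir / phase / "output.md").write_text(output)
        (run_dir / "state.json").write_text(
            json.dumps({"last_completed_phase": phase})
        )
```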

Figure: DET/LLM accounting across a full run

Figure: ADW phase flow

Design decisions

  1. LLM outputs markers, not actions. In the ship phase, the LLM emits DEPLOYMENT_STATUS: SUCCESS, PR_MERGE_APPROVED: true, etc. A deterministic bash script parses those and executes gh pr merge, npm publish, etc. The model never runs deployment; it only decides whether to (see the sketch after this list).
  2. Extensions are the customization seam. BuildExtension / DocumentExtension / ShipExtension add phase-specific deterministic logic without touching the core phase runner. This is how ship-phase gets its post-merge cleanup without build-phase knowing about branches.
  3. Every phase boundary is a checkpoint. Snapshot → task-manager comment → label update → context persist → global-index update happens at every transition. Since state lives on disk, runs are resumable and observable.
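
A minimal sketch of the marker pattern from decision 1, written in Python for consistency with the other examples here (the real parser is a bash script). The marker names come from the text above; `parse_markers` and the merge call are illustrative, not ADW's actual script.

```python
import re
import subprocess

MARKER_RE = re.compile(r"^([A-Z_]+):\s*(.+)$", re.MULTILINE)

def parse_markers(llm_output: str) -> dict[str, str]:
    """Extract the KEY: value markers the LLM emitted."""
    return {key: value.strip() for key, value in MARKER_RE.findall(llm_output)}

def maybe_merge(llm_output: str, pr_number: int) -> None:
    markers = parse_markers(llm_output)
    # The model only *decides*; deterministic code *acts*.
    if (
        markers.get("DEPLOYMENT_STATUS") == "SUCCESS"
        and markers.get("PR_MERGE_APPROVED") == "true"
    ):
        subprocess.run(
            ["gh", "pr", "merge", str(pr_number), "--squash"], check=True
        )
```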

Customization

A full run consists of five phases: Plan → Build → Validate → Document → Ship. Each phase is a discrete, resumable unit with its own prompt template, input files, and output artifacts. The tool has a three-tier config hierarchy: built-in → user-level → project-level, where each tier takes precedence over the previous one. Users can customize each phase by modifying the following files (a resolution sketch follows the list):

  • pre.sh → a pluggable script that executes before the phase begins; users can extend or override the built-in pre-script for each phase
  • prompt.md → the phase's execution prompt, where the LLM work happens; users can override the built-in prompt
  • post.sh → same as pre.sh, but runs after the phase has finished
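
A minimal sketch of how such a three-tier lookup could resolve a phase file; the tier directories are hypothetical stand-ins, not ADW's actual layout:

```python
from pathlib import Path

# Hypothetical tier locations, lowest precedence first.
TIERS = [
    Path(__file__).parent / "defaults",  # built-in
    Path.home() / ".adw",                # user-level
    Path.cwd() / ".adw",                 # project-level
]

def resolve_phase_file(phase: str, filename: str) -> Path:
    """Return the highest-precedence copy of e.g. build/prompt.md."""
    found: Path | None = None
    for tier in TIERS:  # later tiers override earlier ones
        candidate = tier / phase / filename
        if candidate.exists():
            found = candidate
    if found is None:
        raise FileNotFoundError(f"{phase}/{filename} not found in any tier")
    return found
```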

The tool ships with robust, sensible defaults for everything and works well out of the box.

Observability

Structured logging with secret redaction, a dangerous-command detector, and an audit trail of every tool call the LLM makes give full visibility into a run's details. The tool can also post comments on the task's ticket (if a task manager is configured) and saves each phase's output, along with the full run's logs, to files persisted in a run-specific folder. This also allows pausing and resuming runs when HIL is required at any point. Each run executes in a separate git worktree for isolation and concurrency.
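
To make the redaction piece concrete, here is a minimal sketch of a logging filter that masks common credential shapes before a record is written; the patterns and class name are illustrative assumptions, not ADW's implementation:

```python
import logging
import re

# Illustrative patterns for common credential shapes.
SECRET_PATTERNS = [
    re.compile(r"ghp_[A-Za-z0-9]{36}"),         # GitHub personal access token
    re.compile(r"sk-ant-[A-Za-z0-9\-_]{20,}"),  # Anthropic-style API key
]

class RedactSecrets(logging.Filter):
    """Mask anything that looks like a secret before it hits a transport."""
    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        for pattern in SECRET_PATTERNS:
            message = pattern.sub("[REDACTED]", message)
        record.msg, record.args = message, None
        return True

logger = logging.getLogger("adw")
logger.addFilter(RedactSecrets())
```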

Results & Impact

ADW has been in daily use for ~2 months (late January → end of March 2026), shipping work across 9 projects spanning 6 tech stacks: Python (backend & CLI), Node/TypeScript, Next.js, React, Swift/iOS, and Kotlin/Android. The harness is language-agnostic; it has not been tuned per-project beyond the normal three-tier config.

Throughput & reliability

  • 231 runs tracked across all projects, with a peak of 74 runs in a single week
  • 85.7% completion rate (198/231 finished without erroring out)
  • 60% of runs went from ticket to merged PR with zero human intervention — 139 runs completed the full plan → build → validate → document → ship pipeline autonomously
  • Only 9% hard failures (21 runs); the other 12 incomplete runs were intentional HIL pauses
  • ~142 PRs opened autonomously by ADW — roughly 1 in 4 of the 570 merged PRs across the 9 projects in the window

Speed

  • Median run duration: 30 minutes (p75: 39 min, p90: 55 min)
  • In the ADW repo itself, median PR cycle time from open to merge is 6 minutes

The maturity curve

Because the tool is self-hosting (built using itself), the feedback loop is tight: every shortfall in a run becomes a prompt, script, or phase-contract fix in the next version. The effect over the 2-month window is measurable:

  • First 25% of runs: 26% reached ship autonomously
  • Last 25% of runs: 84% reached ship autonomously
  • 3.2× improvement with no model upgrades — entirely from harness improvements

Visual Assets

ADW system architecture: a foundation column (Config, Git Worktrees, Task Managers, Logging & Redaction, Security/Audit) alongside a vertical flow of Triggers → Orchestration → Execution → Executor (Claude Code CLI).

Tech Stack

  • Language: Python 3.13
  • CLI: Typer, Rich
  • Web / API: FastAPI, Uvicorn, python-multipart
  • AI / LLM: Claude Code CLI (subprocess integration)
  • Integrations: GitHub, Linear, Jira (task manager adapters); GitHub & Linear webhooks
  • Observability: Structured multi-transport logging, live log streaming, secret redaction
  • Testing & tooling: pytest, pytest-asyncio, pytest-cov (80% coverage floor), mypy (strict), ruff
  • Packaging: Hatchling, uv
