Daily Brief: Agent Reliability Is a Workflow Design Problem
The durable product-builder skill is no longer picking the strongest model and hoping for the best. It is designing the workflow around the model so the agent knows what to do first, which tool shapes are safe, when to hand work off, and what evidence must exist before the run counts as done.
The strongest July 5 cluster formed around reliability failures and reliability fixes in coding-agent workflows. Digg surfaced Lucas Beyer questioning an agent run that tried to fix tests before even completing the audit, exposing how much outcome quality depends on sequencing, not just capability. At the same time, Digg surfaced Theo Browne reporting that a substantial CLAUDE.md steering file drove broken pull requests effectively to zero in his workflow, mainly by clarifying routing and model handoffs. Simon Willison sharpened the same lesson from a different angle on July 4: frontier models can get better while third-party tool use gets worse when the harness and tool schema do not match the model’s training. The pattern is consistent. Reliability is moving out of raw model choice and into workflow design.
Many teams still evaluate agent products as if the main question were which model is smartest. That misses where the practical leverage now lives. If sequencing is vague, a strong agent burns time on the wrong subtask. If tool contracts are brittle, a strong agent still fails the run. If routing is unclear, expensive models do cheap work and cheap models do risky work. Product builders who treat reliability as a systems-design job can improve quality, cost, and trust without waiting for the next model release.
Redesign one agent workflow around explicit control surfaces.
Trigger: PR prep, bug triage, audit, migration, or refactor with a real verifier.
Context: objective, repo boundary, priority order, risk class, and a standing instruction file for routing and conventions.
Tools: one planner or orchestrator, one implementation path, one audit path, and only tool shapes the chosen model reliably uses.
Verifier: tests, lint or type checks, screenshots, data diffs, and a final audit that runs before any completion claim.
Budget: model-selection rules, token cap, runtime cap, write scope, and escalation rules for flaky tools or privileged actions.
Artifacts: instruction file, plan, run log, diff, verifier outputs, and unresolved-risk note.
Stop condition: the workflow either produces passing evidence in the required order or halts with a named sequencing or tool-contract failure instead of pretending the work is complete.
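The stop condition can be made mechanical. What follows is a minimal sketch, not a reference implementation: the step names in REQUIRED_ORDER, the status strings, and the evaluate_run helper are all hypothetical, and a real harness would read them from its own run log. It shows a run counting as done only when every required step passed in the required order, and otherwise halting with a named failure.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical evidence order for one workflow; adapt to your own verifier.
REQUIRED_ORDER = ["audit", "tests", "typecheck", "final_audit"]


@dataclass
class Verdict:
    done: bool
    failure: Optional[str] = None  # named reason when the run halts


def evaluate_run(
    log: List[Tuple[str, str]],
    required: List[str] = REQUIRED_ORDER,
) -> Verdict:
    """Check a run log of (step, status) pairs against the required order.

    Status is assumed to be "pass", "fail", or "tool_error".
    """
    for position, expected in enumerate(required):
        if position >= len(log):
            return Verdict(False, f"sequencing failure: missing step '{expected}'")
        step, status = log[position]
        if step != expected:
            return Verdict(
                False, f"sequencing failure: expected '{expected}', got '{step}'"
            )
        if status == "tool_error":
            return Verdict(False, f"tool-contract failure: '{step}'")
        if status != "pass":
            return Verdict(False, f"verifier failure: '{step}'")
    return Verdict(True)


if __name__ == "__main__":
    # An agent that tried to fix tests before completing the audit.
    print(evaluate_run([("tests", "pass"), ("audit", "pass")]))
```

The design choice worth noting is that the function never returns done=True with a caveat attached. Either all evidence exists in order, or the run halts with a label a human can act on.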
Pick one agent workflow that feels “randomly unreliable” and forbid vague autonomy for a week. Write down the task order, routing rules, and required evidence explicitly. If reliability improves fast, the problem was workflow design more than model quality.
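Writing routing rules down explicitly can be as small as a lookup table. The sketch below is illustrative only: the Task fields, the risk classes, and the model labels "cheap-model" and "strong-model" are placeholders, not names of real products or settings from any tool mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    kind: str                 # hypothetical label, e.g. "lint_fix" or "migration"
    risk: str                 # assumed risk classes: "low" or "high"
    privileged: bool = False  # touches secrets, prod data, or wide write scope


# Placeholder model tiers; substitute whatever your team actually routes to.
ROUTES = {"low": "cheap-model", "high": "strong-model"}


def route(task: Task) -> str:
    """Return the handler for a task, escalating instead of guessing."""
    if task.privileged:
        return "escalate-to-human"
    if task.risk not in ROUTES:
        raise ValueError(f"unknown risk class: {task.risk!r}")
    return ROUTES[task.risk]


if __name__ == "__main__":
    print(route(Task("lint_fix", "low")))
    print(route(Task("migration", "high")))
    print(route(Task("rotate_keys", "high", privileged=True)))
```

An unknown risk class raises rather than defaulting to a tier, so a gap in the written rules shows up as an error instead of as a cheap model doing risky work.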
Full context at Digg Tech. Bring back one decision, test, or workflow change.