Shift in: bringing quality engineering into the agent loop

"There will be greater change in the next five years than in the last 40 years." — Martin Woodward, VP Developer Relations, GitHub (2024)

If you've spent any time around coding agents in 2025 and 2026, that quote stops feeling like hype and starts feeling like a description. Agents now read code, run tests, edit files, ship pull requests, and reason about failures. They are not autocomplete anymore — they are autonomous actors that perceive an environment, make decisions, and change state.

This creates a problem for those of us who care about quality. The familiar moves — shift left to catch defects early, shift right to learn from production — were designed for a world where humans read pipeline output and act on it. Agents don't read humans. They read tools. So if we want quality engineering to influence what an agent does next, we need a new direction. I call it shift in.

This post unpacks where the idea comes from, why it matters, and what it looks like in practice. Along the way, we'll define the building blocks of modern AI agents — context, tools, skills, hooks, the ReAct loop — so we share a language before we get to the punchline.

Two old testing problems, applied to agents

When I look at any AI system — assistant, agent, swarm — I find it useful to (over)simplify it down to two problems that anyone with a testing background already knows:

The communication problem: How do we tell the system what we want? In testing language, this is the gap between intent and specification. In agentic systems, it's the gap between a user's goal and the prompt, context, and instructions the model actually receives.
The oracle problem: How do we know the output is correct? In testing language, this is the test oracle — the source of truth that decides pass or fail. In agentic systems, it's the evaluation, the human-in-the-loop, the LLM-as-judge, the pipeline check, and eventually the happy customer.

Every design decision in agentic development is, in the end, an attempt to reduce one of these two problems. Better prompts, richer context, retrieval-augmented generation — all communication. Evals, guardrails, regression tests, judges — all oracle. Once you see the two columns, the building blocks slot into place.

The building blocks of an AI agent

Before we can talk about where quality fits, we need a shared vocabulary. AI tooling moves fast, and the same word often means different things in different talks. Here's how I use the terms in this post.

Large Language Model (LLM): The reasoning core of the agent. Given a context, it predicts the next tokens. Models like Claude, GPT, Gemini, and Llama are LLMs. The model is not the agent; it's the brain inside it.

Context: Everything the LLM can "see" on a given turn: the system prompt, the conversation history, the tool definitions, retrieved documents, and the latest tool result. Context windows are finite, which is why managing them is its own discipline — see.

System prompt: The instructions placed at the top of the context that define the agent's role, behavior, and constraints. Think of it as the agent's job description plus its rules of engagement.

Memory: A mechanism for retaining and retrieving information across turns or sessions. Short-term memory lives in the context window; long-term memory lives in a vector database, a file, or both.

Retrieval-Augmented Generation (RAG): Before answering, the agent searches a knowledge source for relevant chunks and pastes them into the context. RAG is how an agent stays grounded in your domain terminology, your codebase, your documentation, your tickets instead of hallucinating.

Vector database: A database that stores text as numerical embeddings so it can be searched by meaning rather than keyword. The plumbing under most RAG systems.

Tool: A callable function the agent uses to act in the world: read a file, run a command, query an API, execute code. Each tool has a name, a description, and an input schema. The LLM decides when to call which tool; the orchestrator actually runs it.

Tool use: The act of an LLM emitting a structured request to call a tool, the orchestrator executing it, and the result being fed back into the context. Tool use is what turns a chatbot into an agent.

MCP: Model Context Protocol. An open standard for connecting agents to tools and data sources without writing custom integrations every time. Think USB-C for AI tooling. Anthropic introduced it in late 2024 — see the.

Skill: A packaged capability — usually a folder with a description, instructions, and any scripts or references the agent needs to perform a specific task well. Skills let you encode "how we do X here" as something the agent can load on demand. They differ from tools in scope: a tool does one thing; a skill bundles knowledge and procedure for a whole task.

Hook: A deterministic enforcement point that fires automatically around tool use. PreToolUse runs before a tool is called (block, modify, or allow). PostToolUse runs after (validate, log, transform). Stop runs when the agent thinks it's done (force one more check). Hooks are how you bake non-negotiable rules into the loop without trusting the model to remember them.

Guardrail: A safety layer that filters what goes into the LLM and what comes out — blocking prompt injection, sanitizing PII, refusing dangerous actions. Guardrails sit on the boundary; hooks sit inside the loop.

Eval: A test for an agent. Like a unit test, but the input is fuzzy (a goal) and the output is fuzzy (a trajectory or artifact), so the assertion is fuzzy too — often a rubric scored by a judge model or a human.

Orchestration: The control logic that runs the loop: pulling the latest message, calling the LLM, executing tool calls, deciding when to stop. The orchestrator is plain code; the LLM is the reasoner. Conflating them is one of the most common mistakes in agent design.

ReAct: Reason → Act → Observe. The dominant agent architecture today. Each iteration: the LLM reasons about the goal, picks an action, the orchestrator executes it, and the result becomes a new observation that feeds the next iteration. Original paper:Yao et al., 2022.

From chat to agent: the ReAct loop

A plain chat assistant is one-shot. You send a prompt, it sends a reply, and you do something with it. The model suggests; the human executes. No feedback loop, no tools, no file access.

A coding agent is different. It's given a goal and a toolbox, then it runs in a loop. It reads files, writes code, runs tests, observes errors, and tries again. Complex tasks routinely take 20 to 80 iterations.

One iteration of the ReAct loop:

Receive: Orchestrator appends the latest message — user input or tool result — to the conversation history.
Think: LLM reads the full context, reasons, and decides: emit text, call a tool, or end the turn.
Act: Orchestrator checks the stop reason. Tool use → execute it. End turn → task complete.
Observe: Tool output is appended as a tool message. Loop restarts with enriched context.

Two things wrap this loop and matter for our story. First, guard rails on max turns, max tokens, and max cost — without them, the orchestrator would loop forever. Second, interruption points for human-in-the-loop pauses on high-risk actions like delete, deploy, and external API calls.

This is the loop quality engineers need to think about. Not the DevOps loop on the architecture diagram, that's the outer loop. The ReAct loop is the inner loop, the one that runs hundreds of times per task, the one where the agent makes its actual decisions.

Where specialized agents fit in the SDLC

Zoom out from a coding agent and look at how specialized agents map onto the familiar DevOps loop — Plan, Design, Code, Build, Test, Release, Deploy, Operate, Monitor. We can already place agent teams along it:

Specification agents on the left: producing UX briefs, user personas, journey maps, BDD/ATDD specifications, and keyword-driven test scaffolds. Testing moves earlier — classical shift left.
CD pipeline agents on the right: running acceptance tests and performance tests in staging,, deploy canary releases, observe the system and initiate rollbacks and roll-forwards. Feedback from production — classical shift right.

Familiar territory. We've been talking about shift left and shift right for years, and the moves still apply when the actor in the loop happens to be an agent. But there's a third-place quality that has to live now, and it isn't on the outside of the SDLC at all.

Three directions for quality engineering in agentic development

Same Quality Engineering practices. Different consumers of the feedback.

← Shift left — Specification agents: Humans and agents read specs. Quality Engineering shapes intent before code exists: UX design briefs, user persona/journey/story, BDD/ATDD specifications, keyword-driven acceptance test generations. All Communicating the intent. Use dedicated Agents to make the design practices unavoidable

↓ Shift in — Inside the ReAct loop: The coding agent acts based on its instructions and using skills to solve problems, then reads the tool output. Quality Engineering practices drive proper behaviors and become tools the agent calls every iteration. A well-defined Test Design skill helps the agent to define valuable test cases. Unit tests, API tests, local integration tests, and continuous testing cycles act as fast feedback within the inner loop

Shift right → — CD pipeline agents: Production serves as the Oracle to close the loop: Acceptance tests in staging, performance benchmarking in production, canary release/dark launch to control the impact radius, rollback/roll-forward to react fast, observing how real users meet the system. All quality engineering practices validating correctness and eventually giving feedback.

Shift in: Quality engineering inside the agent's loop

Here's the core idea. Coding agents perceive the world through tool output. That's their entire sensory layer. Whatever feedback we want them to act on must arrive as a structured tool result inside the ReAct loop.

That means quality engineering has a new consumer.

The shift in thesis: if you want quality to influence what the agent does next, quality must live inside the loop — as a tool the agent can call, a hook that fires automatically, a skill the agent can load, or an eval that scores the trajectory. The CI/CD pipeline becomes machine-readable sensory feedback for the agent, but what other feedback can we feed to it?.

Concretely, shift in means rebuilding four layers of QE practice with the agent as the reader.

1. Quality Engineering practices dictate behavior

A well-defined instruction file that focuses on quality aspects, a thoroughly reviewed, specific skill that tells the agent HOW to solve specific problems, or even a specialized sub-agent that gives expert opinion, or works on delegated tasks, all are needed to bring our Quality Engineering practices to the agentic world, to make our voice heard.

2. Tests as tools, the agent calls every iteration

A unit test suite that runs in a few seconds and returns a structured pass/fail report is the perfect ReAct observation. The agent writes a function, calls the test tool, reads the failures, fixes the code, and calls it again. That's the loop working. Now imagine the same with API contract tests, fast integration tests, and mutation tests on the changed lines. Every fast, deterministic, structured signal you can put inside the loop saves the agent from wandering — and saves you from cleaning up its wandering later.

3. Hooks as deterministic guard rails and compliance enforcers

A PreToolUse hook can block a destructive command before it runs. A PostToolUse hook can require a lint pass after every file edit, or it can enforce writing an audit log. A Stop hook can refuse to mark a task complete until tests pass and coverage is above the threshold. Hooks are how you stop hoping the agent remembers to follow your standards and start enforcing them. They are the place where your DoR/DoD checks, your SAST fast scans, your dependency vulnerability scans, your style rules — your whole non-negotiable layer — actually become non-negotiable.

4. Pull request gates as the bridge between the inner loop and the outer loop

The agent's local loop runs unit, API, and integration tests fast. The pull request gate adds the slower, expensive stuff: full SAST, full regression, complexity analysis, coverage gates. This is where the inner ReAct loop meets the outer DevOps loop. The agent sees the PR check result as another tool observation and responds — fixing what failed, justifying what was a false positive, asking for human review when stuck.

Notice what didn't change. We're still doing static analysis. We're still running unit tests. We're still gating on coverage. The QE practices are the same. What changed is who or what runs the command, reads the result, and decides what to do next. That's the whole game.

What this means if you do quality for a living

Three implications stand out for QE practitioners.

Your tests have a new audience: If the consumer of test output is an agent, the output format matters as much as the test logic. Cryptic stack traces that an experienced human can decode are not great signals for an agent. Structured, named, deterministic output — JSON reports, machine-readable diffs, clear assertions — gives the agent something to act on.

Your CI is becoming an API: Pipelines have always been read by humans through dashboards and PR checks. Now they're being read by agents through tool calls. The same pipeline serves both audiences, but the API surface — the part the agent calls and parses — has to be designed deliberately.Continuous delivery for ML already touched this idea; agentic development pushes it further.

Your role is upstream of the agent: Building good evals, designing tool contracts, writing skills, and instruction files that capture testing know-how, owning the hook layer — that's QE work in an agentic stack. The work doesn't disappear. It shifts in to the inner loop.

Closing the loop

Shift left was the right answer when the loop was human-driven, and the slowest part was rework after release. Shift right was the right answer when production became its own source of truth, and we needed to learn from it. Both are still right.

Shift in is the answer for the part of the loop that is now agent-driven. The agent's perception is bounded by what it can read as a tool result, so feedback on quality has to live there too. The skills don't change — testing fundamentals, the oracle problem, the communication problem, the discipline of designing for failure. What changes is the reader of the feedback, and that changes everything about how we package it.

The next five years really will see more change than the last forty. Quality engineering gets to be in the loop, literally, if we choose to put it there.

Shift in: bringing quality engineering into the agent loop

Eficode