Skip to content

Python: Add AgentLoopMiddleware for re-running agents in a loop#6174

Merged
eavanvalkenburg merged 8 commits into
microsoft:mainfrom
eavanvalkenburg:agent_loop
Jun 12, 2026
Merged

Python: Add AgentLoopMiddleware for re-running agents in a loop#6174
eavanvalkenburg merged 8 commits into
microsoft:mainfrom
eavanvalkenburg:agent_loop

Conversation

@eavanvalkenburg

@eavanvalkenburg eavanvalkenburg commented May 29, 2026

Copy link
Copy Markdown
Member

Motivation and Context

Many agent scenarios require re-running an agent until some condition is met (or indefinitely): iterative refinement, working through a todo list, draining background tasks, or judging whether the original question was actually answered. Today each of these requires bespoke orchestration around the agent run loop.

This adds a single, reusable middleware that drives the loop, so users get these patterns out of the box while keeping full control over the per-iteration input and the stopping condition.

Description

Adds AgentLoopMiddleware, an AgentMiddleware that re-runs the wrapped agent by calling call_next() repeatedly. One configurable class covers three common patterns, each with a convenience classmethod factory that forwards to the full constructor:

  • Ralph loopAgentLoopMiddleware.ralph(...): no exit criteria, bounded only by an optional max_iterations. Includes feedback tracking inspired by the Ralph technique: record_feedback accumulates a progress log exposed to every callback via the progress kwarg and (by default) injected into the next iteration's input (inject_progress); fresh_context=True restarts each pass from the original task plus the log; is_complete (marker string or callable) stops early when the agent signals completion.
  • PredicateAgentLoopMiddleware.with_predicate(should_continue, ...): loops while a callable returns True. Helper factories todos_remaining(provider) and background_tasks_running(provider) cover the TodoProvider / BackgroundAgentsProvider cases.
  • JudgeAgentLoopMiddleware.with_judge(judge_client, ...): a second chat client decides whether the original request was answered, using a JudgeVerdict structured-output response (with a text fallback for clients that don't honor structured output); loops while the answer is "no".

Additional capabilities (cross-cutting, on the base constructor):

  • Approval handlingon_approval_request auto-resolves pending user_input_request content (e.g. tools with approval_mode="always_require"), bounded by max_approval_rounds and exempt from max_iterations.
  • next_message controls the next iteration's input (defaults to a short "continue" nudge); returning None reuses the prior messages.
  • Supports both streaming and non-streaming runs, handling MiddlewareTermination cleanly.

Exports AgentLoopMiddleware, JudgeVerdict, todos_remaining, and background_tasks_running from agent_framework. Adds unit tests, a sample (samples/02-agents/middleware/agent_loop_middleware.py) demonstrating all four patterns, and documentation.

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

Copilot AI review requested due to automatic review settings May 29, 2026 15:07
@moonbox3 moonbox3 added documentation Improvements or additions to documentation python labels May 29, 2026
@moonbox3

moonbox3 commented May 29, 2026

Copy link
Copy Markdown
Contributor

Python Test Coverage

Python Test Coverage Report •
FileStmtsMissCoverMissing
packages/core/agent_framework/_harness
   _loop.py226697%460, 468, 538, 608, 651, 726
TOTAL39366447888% 

Python Unit Test Overview

Tests Skipped Failures Errors Time
7899 34 💤 0 ❌ 0 🔥 2m 2s ⏱️

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review

Reviewers: 4 | Confidence: 74%

✓ Correctness

The AgentLopMiddleware implementation is well-designed and handles both streaming and non-streaming paths correctly. The middleware properly interacts with the existing AgentContext/pipeline infrastructure, correctly manages the ResponseStream for lazy streaming execution, and handles MiddlewareTermination cleanly. The approval handling logic, judge condition, and feedback tracking all function as specified by the tests. One minor documentation/behavior mismatch exists: max_approval_rounds is documented as capping 'consecutive' rounds but the counter never resets after non-approval iterations, effectively making it a 'total' cap.

✓ Security Reliability

The AgentLopMiddleware implementation is well-structured with proper input validation (constructor rejects invalid max_iterations/max_approval_rounds), defensive type checks on call_next() results, clean MiddlewareTermination handling in streaming, and appropriate use of copy semantics for the progress log exposed to callbacks. No blocking security or reliability issues found. The approval callback invocation order (callback before cap check) is intentional per the test assertions. The judge fallback parsing is conservative (NOT_ANSWERED takes precedence). Memory growth in unbounded loops is by-design and documented.

✓ Test Coverage

The test suite is comprehensive, covering constructor validation, all three factory patterns, non-streaming and streaming loops, approval handling, provider helpers, and edge cases. However, there are several notable gaps: async callback variants are never tested directly (all tested callbacks are synchronous), provider helper predicates are unit-tested in isolation but never integrated into a full loop, streaming + judge mode combination is untested, and explicitly returning None from record_feedback to skip an entry has no coverage.

✓ Design Approach

The loop middleware is broadly aligned with the existing harness and middleware pipeline, but I found one design-level issue in judge mode: it reduces the original request to plain text before asking the judge, which breaks the stated “judge whether the original request was answered” behavior for multimodal or otherwise non-text inputs that the core message model already supports.


Automated review by eavanvalkenburg's agents

Comment thread python/packages/core/agent_framework/_harness/_loop.py Outdated
Comment thread python/packages/core/agent_framework/_harness/_loop.py Outdated
@eavanvalkenburg eavanvalkenburg marked this pull request as ready for review June 9, 2026 09:02

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review

Reviewers: 5 | Confidence: 84%

✓ Correctness

The AgentLopMiddleware implementation is well-structured and correct. The core loop logic (non-streaming and streaming), stop evaluation, feedback propagation, progress tracking, session snapshotting for fresh_context, and judge integration are all handled properly and extensively tested. No correctness bugs found.

✓ Security Reliability

The implementation is generally well-structured with proper safety caps (max_iterations defaults), defensive type checks, and progress-list copying to prevent callback mutation. The main reliability concern is an asymmetry in MiddlewareTermination handling between streaming and non-streaming paths: the streaming path catches it cleanly (preserving prior iteration results), while the non-streaming path lets it propagate, losing the aggregated transcript from completed iterations.

✓ Test Coverage

The test suite is thorough overall—covering construction, predicate patterns, judge mode, streaming, feedback tracking, fresh_context, session handling, and provider helpers. However, there are notable coverage gaps: (1) non-streaming MiddlewareTermination test (only streaming is tested), despite different implementation behavior between the two paths; (2) the usage_details accumulation logic (add_usage_details across iterations) is never exercised because RecordingChatClient produces no usage data; (3) async variants of record_feedback and next_message are untested (only should_continue has an async test).

✓ Failure Modes

The implementation is well-structured with proper error propagation. The streaming path catches MiddlewareTermination cleanly, the non-streaming path correctly relies on the framework's contextlib.suppress at the executor level, exceptions from the judge client and user predicates propagate to calers rather than being silently swallowed, and the session snapshot/restore pattern correctly rebuilds from the snapshot each time to avoid aliasing. No high-severity silent failure modes or data-loss paths were identified.

✗ Design Approach

The loop is close, but I found one blocking design bug in the non-streaming path: a downstream MiddlewareTermination can be silently turned into a duplicate iteration because the pipeline suppresses the exception and leaves the previous context.result in place. I also found one smaller design gap in the background_tasks_running helper: it reads persisted task state without using the provider’s own refresh path, so it can loop once more on stale RUNING status.

Flagged Issues

  • Non-streaming loop iterations can reuse a stale context.result after suppressed MiddlewareTermination instead of stopping cleanly (python/packages/core/agent_framework/_harness/_loop.py:436-444, python/packages/core/agent_framework/_middleware.py:921-931).

Suggestions

  • Refresh background task state through BackgroundAgentsProvider before checking for RUNING tasks, matching the provider's existing pattern (python/packages/core/agent_framework/_harness/_loop.py:767-775, python/packages/core/agent_framework/_harness/_background_agents.py:211-236, 408-428, 517-518).

Automated review by eavanvalkenburg's agents

Comment thread python/packages/core/agent_framework/_harness/_loop.py Outdated

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review

Reviewers: 4 | Confidence: 78% | Result: All clear

Reviewed: Correctness, Security Reliability, Test Coverage, Design Approach


Automated review by moonbox3's agents

Comment thread python/samples/02-agents/middleware/agent_loop_middleware.py Outdated
Comment thread python/packages/core/agent_framework/_harness/_loop.py Outdated
eavanvalkenburg and others added 8 commits June 12, 2026 08:37
Add `AgentLoopMiddleware`, an `AgentMiddleware` that re-runs the wrapped
agent in a loop. A single configurable class covers three common patterns,
each with a convenience classmethod factory:

- Ralph loop (`.ralph(...)`): no exit criteria, with feedback tracking
  (`record_feedback`/`progress`), progress injection (`inject_progress`),
  optional fresh context per iteration (`fresh_context`), and an early-stop
  completion signal (`is_complete`).
- Predicate (`.with_predicate(...)`): loop while a `should_continue` callable
  returns True (e.g. paired with `todos_remaining`/`background_tasks_running`).
- Judge (`.with_judge(...)`): a second chat client decides whether the original
  request was answered, using a `JudgeVerdict` structured-output response.

The loop also auto-resolves pending function-approval / user-input requests via
an `on_approval_request` callable (bounded by `max_approval_rounds`), and the
next iteration's input is controlled by `next_message`. Supports both streaming
and non-streaming runs.

Exports `AgentLoopMiddleware`, `JudgeVerdict`, `todos_remaining`, and
`background_tasks_running`. Adds tests, a sample, and docs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- with_judge: add criteria list with {{criteria}} templating into judge
  instructions plus an agent-side instruction; add fresh_context, additional
  judge feedback relay; default judge max_iterations.
- should_continue is now required and positional; supports (bool, str|None)
  feedback tuples surfaced to next_message/record_feedback via feedback kwarg.
- Judge forwards full multi-modal request and response messages.
- Default max_iterations=10 (explicit None = unbounded); removed is_complete and
  Ralph terminology; ShouldContinueResult is a real TypeAlias.
- Sample: stream all loops, print iteration counts via injected user-block
  boundaries (robust to function calling), <role>: content formatting, per-method
  expected output, and a looping todo sample.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Resolve pyright errors in _loop.py: drop the always-true final_result None
  check (the while loop always assigns it) and cast finish_reason to the
  AgentResponse constructor's expected type.
- Apply pyupgrade --py310-plus: import TypeAlias from typing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
pyright infers AgentResponse.finish_reason as including str and rejects the
direct assignment, while mypy considers a cast redundant. Drop the cast and
suppress only pyright with a targeted reportArgumentType ignore, satisfying
both type checkers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a second AgentLoopMiddleware sample that composes two criteria in one
should_continue predicate: a TodoProvider check (evaluated first) and a
report-style judge chat client (evaluated once todos are complete) that grades
the assembled report against shared requirements. Register it in the middleware
samples README.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rework the todo+judge sample to compose two AgentLoopMiddleware on the agent
itself (middleware=[judge_loop, todo_loop]) instead of a single hand-written
predicate. The inner todos_remaining loop drafts the report todo-by-todo and the
outer with_judge loop re-runs it until an editor chat client judges the report
publication-ready, reusing the built-in helpers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
AgentLoopMiddleware.fresh_context previously only reset context.messages,
so with an attached session each iteration still reloaded the local
transcript or re-threaded the service-side conversation id and the model
saw the accumulated history. Snapshot the session once before the loop
(via to_dict) and restore it (from_dict + field copy) between iterations,
so every pass starts from the pre-loop baseline. The final iteration's
pass is persisted (no restore after the terminating iteration), so a
subsequent agent.run continues from there.

Removed the obsolete warning, updated docstrings and core AGENTS.md, and
added tests: a snapshot/restore round-trip, a session-reset
streaming x fresh_context x inject_progress x store matrix across multiple
runs and loop iterations, and response_format parsing across the loop.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@eavanvalkenburg eavanvalkenburg added this pull request to the merge queue Jun 12, 2026
Merged via the queue into microsoft:main with commit 1acd242 Jun 12, 2026
38 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants