Python: Add AgentLoopMiddleware for re-running agents in a loop#6174
Conversation
There was a problem hiding this comment.
Automated Code Review
Reviewers: 4 | Confidence: 74%
✓ Correctness
The AgentLopMiddleware implementation is well-designed and handles both streaming and non-streaming paths correctly. The middleware properly interacts with the existing AgentContext/pipeline infrastructure, correctly manages the ResponseStream for lazy streaming execution, and handles MiddlewareTermination cleanly. The approval handling logic, judge condition, and feedback tracking all function as specified by the tests. One minor documentation/behavior mismatch exists:
max_approval_roundsis documented as capping 'consecutive' rounds but the counter never resets after non-approval iterations, effectively making it a 'total' cap.
✓ Security Reliability
The AgentLopMiddleware implementation is well-structured with proper input validation (constructor rejects invalid max_iterations/max_approval_rounds), defensive type checks on call_next() results, clean MiddlewareTermination handling in streaming, and appropriate use of copy semantics for the progress log exposed to callbacks. No blocking security or reliability issues found. The approval callback invocation order (callback before cap check) is intentional per the test assertions. The judge fallback parsing is conservative (NOT_ANSWERED takes precedence). Memory growth in unbounded loops is by-design and documented.
✓ Test Coverage
The test suite is comprehensive, covering constructor validation, all three factory patterns, non-streaming and streaming loops, approval handling, provider helpers, and edge cases. However, there are several notable gaps: async callback variants are never tested directly (all tested callbacks are synchronous), provider helper predicates are unit-tested in isolation but never integrated into a full loop, streaming + judge mode combination is untested, and explicitly returning None from record_feedback to skip an entry has no coverage.
✓ Design Approach
The loop middleware is broadly aligned with the existing harness and middleware pipeline, but I found one design-level issue in judge mode: it reduces the original request to plain text before asking the judge, which breaks the stated “judge whether the original request was answered” behavior for multimodal or otherwise non-text inputs that the core message model already supports.
Automated review by eavanvalkenburg's agents
8230a86 to
4dd8c6c
Compare
There was a problem hiding this comment.
Automated Code Review
Reviewers: 5 | Confidence: 84%
✓ Correctness
The AgentLopMiddleware implementation is well-structured and correct. The core loop logic (non-streaming and streaming), stop evaluation, feedback propagation, progress tracking, session snapshotting for fresh_context, and judge integration are all handled properly and extensively tested. No correctness bugs found.
✓ Security Reliability
The implementation is generally well-structured with proper safety caps (
max_iterationsdefaults), defensive type checks, and progress-list copying to prevent callback mutation. The main reliability concern is an asymmetry inMiddlewareTerminationhandling between streaming and non-streaming paths: the streaming path catches it cleanly (preserving prior iteration results), while the non-streaming path lets it propagate, losing the aggregated transcript from completed iterations.
✓ Test Coverage
The test suite is thorough overall—covering construction, predicate patterns, judge mode, streaming, feedback tracking, fresh_context, session handling, and provider helpers. However, there are notable coverage gaps: (1) non-streaming MiddlewareTermination test (only streaming is tested), despite different implementation behavior between the two paths; (2) the usage_details accumulation logic (add_usage_details across iterations) is never exercised because RecordingChatClient produces no usage data; (3) async variants of record_feedback and next_message are untested (only should_continue has an async test).
✓ Failure Modes
The implementation is well-structured with proper error propagation. The streaming path catches MiddlewareTermination cleanly, the non-streaming path correctly relies on the framework's contextlib.suppress at the executor level, exceptions from the judge client and user predicates propagate to calers rather than being silently swallowed, and the session snapshot/restore pattern correctly rebuilds from the snapshot each time to avoid aliasing. No high-severity silent failure modes or data-loss paths were identified.
✗ Design Approach
The loop is close, but I found one blocking design bug in the non-streaming path: a downstream
MiddlewareTerminationcan be silently turned into a duplicate iteration because the pipeline suppresses the exception and leaves the previouscontext.resultin place. I also found one smaller design gap in thebackground_tasks_runninghelper: it reads persisted task state without using the provider’s own refresh path, so it can loop once more on staleRUNINGstatus.
Flagged Issues
- Non-streaming loop iterations can reuse a stale
context.resultafter suppressedMiddlewareTerminationinstead of stopping cleanly (python/packages/core/agent_framework/_harness/_loop.py:436-444,python/packages/core/agent_framework/_middleware.py:921-931).
Suggestions
- Refresh background task state through
BackgroundAgentsProviderbefore checking forRUNINGtasks, matching the provider's existing pattern (python/packages/core/agent_framework/_harness/_loop.py:767-775,python/packages/core/agent_framework/_harness/_background_agents.py:211-236,408-428,517-518).
Automated review by eavanvalkenburg's agents
5f07e95 to
c0c1c77
Compare
Add `AgentLoopMiddleware`, an `AgentMiddleware` that re-runs the wrapped agent in a loop. A single configurable class covers three common patterns, each with a convenience classmethod factory: - Ralph loop (`.ralph(...)`): no exit criteria, with feedback tracking (`record_feedback`/`progress`), progress injection (`inject_progress`), optional fresh context per iteration (`fresh_context`), and an early-stop completion signal (`is_complete`). - Predicate (`.with_predicate(...)`): loop while a `should_continue` callable returns True (e.g. paired with `todos_remaining`/`background_tasks_running`). - Judge (`.with_judge(...)`): a second chat client decides whether the original request was answered, using a `JudgeVerdict` structured-output response. The loop also auto-resolves pending function-approval / user-input requests via an `on_approval_request` callable (bounded by `max_approval_rounds`), and the next iteration's input is controlled by `next_message`. Supports both streaming and non-streaming runs. Exports `AgentLoopMiddleware`, `JudgeVerdict`, `todos_remaining`, and `background_tasks_running`. Adds tests, a sample, and docs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- with_judge: add criteria list with {{criteria}} templating into judge
instructions plus an agent-side instruction; add fresh_context, additional
judge feedback relay; default judge max_iterations.
- should_continue is now required and positional; supports (bool, str|None)
feedback tuples surfaced to next_message/record_feedback via feedback kwarg.
- Judge forwards full multi-modal request and response messages.
- Default max_iterations=10 (explicit None = unbounded); removed is_complete and
Ralph terminology; ShouldContinueResult is a real TypeAlias.
- Sample: stream all loops, print iteration counts via injected user-block
boundaries (robust to function calling), <role>: content formatting, per-method
expected output, and a looping todo sample.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Resolve pyright errors in _loop.py: drop the always-true final_result None check (the while loop always assigns it) and cast finish_reason to the AgentResponse constructor's expected type. - Apply pyupgrade --py310-plus: import TypeAlias from typing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
pyright infers AgentResponse.finish_reason as including str and rejects the direct assignment, while mypy considers a cast redundant. Drop the cast and suppress only pyright with a targeted reportArgumentType ignore, satisfying both type checkers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a second AgentLoopMiddleware sample that composes two criteria in one should_continue predicate: a TodoProvider check (evaluated first) and a report-style judge chat client (evaluated once todos are complete) that grades the assembled report against shared requirements. Register it in the middleware samples README. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rework the todo+judge sample to compose two AgentLoopMiddleware on the agent itself (middleware=[judge_loop, todo_loop]) instead of a single hand-written predicate. The inner todos_remaining loop drafts the report todo-by-todo and the outer with_judge loop re-runs it until an editor chat client judges the report publication-ready, reusing the built-in helpers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
AgentLoopMiddleware.fresh_context previously only reset context.messages, so with an attached session each iteration still reloaded the local transcript or re-threaded the service-side conversation id and the model saw the accumulated history. Snapshot the session once before the loop (via to_dict) and restore it (from_dict + field copy) between iterations, so every pass starts from the pre-loop baseline. The final iteration's pass is persisted (no restore after the terminating iteration), so a subsequent agent.run continues from there. Removed the obsolete warning, updated docstrings and core AGENTS.md, and added tests: a snapshot/restore round-trip, a session-reset streaming x fresh_context x inject_progress x store matrix across multiple runs and loop iterations, and response_format parsing across the loop. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
c0c1c77 to
ee33936
Compare
Motivation and Context
Many agent scenarios require re-running an agent until some condition is met (or indefinitely): iterative refinement, working through a todo list, draining background tasks, or judging whether the original question was actually answered. Today each of these requires bespoke orchestration around the agent run loop.
This adds a single, reusable middleware that drives the loop, so users get these patterns out of the box while keeping full control over the per-iteration input and the stopping condition.
Description
Adds
AgentLoopMiddleware, anAgentMiddlewarethat re-runs the wrapped agent by callingcall_next()repeatedly. One configurable class covers three common patterns, each with a convenience classmethod factory that forwards to the full constructor:AgentLoopMiddleware.ralph(...): no exit criteria, bounded only by an optionalmax_iterations. Includes feedback tracking inspired by the Ralph technique:record_feedbackaccumulates a progress log exposed to every callback via theprogresskwarg and (by default) injected into the next iteration's input (inject_progress);fresh_context=Truerestarts each pass from the original task plus the log;is_complete(marker string or callable) stops early when the agent signals completion.AgentLoopMiddleware.with_predicate(should_continue, ...): loops while a callable returnsTrue. Helper factoriestodos_remaining(provider)andbackground_tasks_running(provider)cover theTodoProvider/BackgroundAgentsProvidercases.AgentLoopMiddleware.with_judge(judge_client, ...): a second chat client decides whether the original request was answered, using aJudgeVerdictstructured-output response (with a text fallback for clients that don't honor structured output); loops while the answer is "no".Additional capabilities (cross-cutting, on the base constructor):
on_approval_requestauto-resolves pendinguser_input_requestcontent (e.g. tools withapproval_mode="always_require"), bounded bymax_approval_roundsand exempt frommax_iterations.next_messagecontrols the next iteration's input (defaults to a short "continue" nudge); returningNonereuses the prior messages.MiddlewareTerminationcleanly.Exports
AgentLoopMiddleware,JudgeVerdict,todos_remaining, andbackground_tasks_runningfromagent_framework. Adds unit tests, a sample (samples/02-agents/middleware/agent_loop_middleware.py) demonstrating all four patterns, and documentation.Contribution Checklist