Python: Defensively strip pickle markers on internal deserialize paths#6507
Python: Defensively strip pickle markers on internal deserialize paths#6507BiswajeetRay7 wants to merge 2 commits into
Conversation
strip_pickle_markers() is applied at the HTTP entry points today, but the internal deserialize_value() calls in _app.py do not strip. Wrapping these with strip_pickle_markers() (already imported) removes reliance on every entry point remembering to sanitize. No known exploit; defense-in-depth hardening only.
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Updates Azure Functions agent deserialization flow to remove pickle markers before reconstructing typed objects, likely to standardize payload handling and avoid pickle-coded data.
Changes:
- Strip pickle markers before deserializing executor messages.
- Strip pickle markers before deserializing shared state snapshots.
| # Reconstruct message - deserialize_value restores the original typed objects | ||
| # from the encoded data (with type markers) | ||
| message = deserialize_value(message_data) | ||
| message = deserialize_value(strip_pickle_markers(message_data)) |
| # Deserialize shared state values to reconstruct dataclasses/Pydantic models | ||
| deserialized_state: dict[str, Any] = { | ||
| str(k): deserialize_value(v) for k, v in shared_state_snapshot.items() | ||
| str(k): deserialize_value(strip_pickle_markers(v)) for k, v in shared_state_snapshot.items() |
|
Thanks for the review. Good points — I'll update the stale comment to reflect the marker-stripping, and refactor the duplicated deserialize_value(strip_pickle_markers(...)) into a small local helper to centralize the behavior. |
Signed-off-by: Biswajeet Ray <raybiswajeet2@gmail.com>
|
Follow-up on the helper-function suggestion: on reflection I kept the two call sites inline rather than adding a _deserialize_safely() helper. With only two call sites, a module-level helper added a bit of indirection for little gain, so I opted to keep it explicit. The stale comment has been updated in a1dd6ea. Happy to refactor into a helper if you'd prefer — just let me know. |
ahmedmuhsin
left a comment
There was a problem hiding this comment.
Thanks for thinking about this defensively. I have one concern: I think these two internal deserialize sinks are the wrong layer to strip at, and stripping here would silently null legitimate typed payloads. Details inline.
| message = deserialize_value(message_data) | ||
| # Reconstruct message: strip untrusted pickle/type markers first | ||
| # (defense-in-depth), then deserialize_value restores typed objects. | ||
| message = deserialize_value(strip_pickle_markers(message_data)) |
There was a problem hiding this comment.
These two internal sinks (this line and the shared_state_snapshot deserialize on line 327) are not untrusted input. message_data and the shared-state values are produced by the orchestrator itself via serialize_value, and are expected to carry __pickled__ / __type__ markers, because that is how typed objects (dataclasses, Pydantic models, Message, AgentExecutorResponse) round-trip across the JSON activity hop.
strip_pickle_markers replaces any top-level marker dict with None. A typed message or typed shared-state value is a top-level marker dict after serialize_value, so:
deserialize_value(strip_pickle_markers(<typed object>)) -> deserialize_value(None) -> None
The net effect is that any workflow passing a typed object between executors, or storing one in shared state, would have it silently nulled. Primitives survive (no markers), which is likely why a quick run looks fine, but typed payloads break. A test with a dataclass or Pydantic message plus a typed shared-state value would show this.
On the threat model: the untrusted boundaries (the initial client input and the HITL response path) are already sanitized at the entry points in #6418, which is the right place, since that external JSON should never contain markers in the first place. Stripping on the internal deserialize conflates the trusted transport format with the untrusted boundary.
One heads up: #6418 extracts this inline activity body into a shared activity.py used by both the Azure Functions and standalone hosts, so these exact lines move. Worth coordinating so any change lands in the shared engine rather than only here.
|
You're absolutely right, and thank you for the detailed explanation. I missed that message_data and the shared-state values are produced internally by serialize_value and are expected to carry pickled / type markers — so stripping here would null legitimate typed payloads, breaking workflows that pass typed objects between executors or store them in shared state. The untrusted boundaries are already correctly sanitized at the entry points in #6418, which is the right layer. My change is incorrect, so I'll close this PR. Thanks again for catching it and for the pointer to the shared activity.py refactor. |
Summary
This is a defense-in-depth hardening change, not a fix for a known exploit.
strip_pickle_markers() (in _serialization.py) is the control that prevents untrusted pickled / type markers from reaching pickle.loads() via deserialize_value. It is applied at the HTTP entry points today (e.g. send_hitl_response), but the internal deserialize_value() calls in _app.py — reconstructing message_data and shared_state_snapshot values — do not strip.
I traced the current entry points and did not find an attacker-reachable path to these internal sinks, so this is not a reported vulnerability. The intent is purely to remove the design's reliance on every present and future entry point remembering to sanitize: a new untrusted entry point or a refactor could otherwise silently route external data into one of these unprotected deserialize_value() calls.
Description
Wrap the two internal deserialize_value() calls in _app.py with strip_pickle_markers() (already imported in this file). Negligible runtime cost.
Honest framing: no PoC, no known exploit — defense-in-depth only.
Signed-off-by: Biswajeet Ray raybiswajeet2@gmail.com