Python: Adopt azure-ai-contentunderstanding `to_llm_input` in CU context provider by changjian-wang · Pull Request #5796 · microsoft/agent-framework

changjian-wang · 2026-05-13T01:54:11Z

This pull request refactors how Content Understanding (CU) analysis results are rendered and managed, focusing on improving LLM input formatting, filtering telemetry noise, and supporting more robust vector store uploads. The main themes are a shift from custom extraction/formatting to standardized rendering using the Azure SDK, defensive handling of noisy telemetry, and more careful upload logic to avoid polluting the vector store with empty documents.

LLM Input Rendering and Formatting:

Replaces custom extraction and formatting logic with the Azure SDK's to_llm_input function for rendering AnalysisResult objects into LLM-friendly YAML-front-matter-prefixed text, ensuring consistent and standardized output. (_render_for_llm, _render_search_payload, removal of extract_sections and format_result) [1] [2] [3]
Adds _has_renderable_body to detect and skip uploads of documents that contain only YAML front matter and no meaningful body, preventing empty or useless entries in the vector store. [1] [2]
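
The body check described above can be sketched roughly as follows. This is not the provider's actual `_has_renderable_body`; the function name `has_renderable_body` and the assumption that front matter is delimited by lines containing only `---` are illustrative.

```python
def has_renderable_body(rendered: str) -> bool:
    """Return True if there is non-whitespace text after the YAML front matter.

    Assumption: front matter, when present, opens and closes with a line
    containing only '---'.
    """
    lines = rendered.strip().splitlines()
    if not lines:
        return False  # empty or whitespace-only input
    if lines[0].strip() != "---":
        return True  # no front matter, so the whole text is body
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the body.
            return any(rest.strip() for rest in lines[i + 1:])
    return False  # unterminated front matter: treat as having no body
```

A caller would skip the vector store upload whenever this returns `False`, so front-matter-only stubs never reach the store.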

Telemetry and Defensive Filtering:

Introduces a regex-based filter to remove SDK-internal telemetry lines (e.g., LLMStats:) from the rai_warnings YAML list before rendering, reducing noise in LLM inputs and outputs. [1] [2]

Vector Store Upload Logic:

Refactors vector store upload logic to use the new search_payload (which defaults to a chunking-friendly, fields-stripped format) and ensures only documents with a renderable body are uploaded. Falls back to the LLM-injection rendering if search_payload is missing.
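
A minimal sketch of the fallback described above, assuming `DocumentEntry` behaves like a dict with `result` and `search_payload` keys. The helper name `select_upload_text` and the injected `has_body` predicate are hypothetical and only illustrate the selection order.

```python
from typing import Callable, Optional, TypedDict


class DocumentEntry(TypedDict, total=False):
    # Field names follow the PR description; the real model has more fields.
    result: str
    search_payload: Optional[str]


def select_upload_text(
    entry: DocumentEntry, has_body: Callable[[str], bool]
) -> Optional[str]:
    """Pick the text to upload, or None when there is nothing worth indexing."""
    # Prefer the search payload; fall back to the LLM-injection rendering.
    text = entry.get("search_payload") or entry.get("result") or ""
    return text if has_body(text) else None
```

Passing the body check in as a callable keeps the selection logic independent of how "renderable body" is defined.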

General Refactoring and Documentation:

Updates docstrings and user-facing instructions to clarify that extracted fields are now provided as YAML front matter, not JSON, and improves comments for maintainability. [1] [2]

Internal Data Flow and Error Handling:

Updates internal data flow to store both the LLM-rendered result and the vector-search payload in each DocumentEntry, and ensures both are handled correctly in background and error paths. [1] [2] [3] [4]

These changes modernize the codebase, improve reliability, and make the document ingestion pipeline more robust and maintainable.

- Changed the type of `result` in DocumentEntry from dict to str to store LLM-ready text.
- Introduced `search_payload` in DocumentEntry for optional alternate rendering.
- Updated FileSearchConfig to include `include_fields` option for vector store uploads.
- Modified tests to reflect changes in DocumentEntry and FileSearchConfig.
- Adjusted integration tests to validate new result structure and rendering.
- Removed legacy format_result tests as rendering is now handled by the SDK.

moonbox3 · 2026-05-13T01:57:09Z

Python Test Coverage Report

File	Stmts	Miss	Cover	Missing
packages/azure-contentunderstanding/agent_framework_azure_contentunderstanding
_context_provider.py	259	41	84%	78, 81, 199–202, 305–306, 308, 312–313, 316, 320, 405, 543, 547, 601, 651–652, 675, 677–680, 845, 849, 875–880, 882–887, 895, 904–905
_models.py	38	1	97%	117
TOTAL	38975	4460	88%

Python Unit Test Overview

Tests	Skipped	Failures	Errors	Time
7819	34 💤	0 ❌	0 🔥	2m 0s ⏱️

Copilot

Pull request overview

This PR updates the Azure Content Understanding Python integration to render CU AnalysisResult objects using the Azure SDK’s standardized to_llm_input output (YAML front matter + body text), and to manage vector-store uploads more defensively by avoiding empty/stub documents and reducing noisy warning telemetry.

Changes:

Replace custom CU extraction/formatting with azure.ai.contentunderstanding.to_llm_input, storing rendered text directly in DocumentEntry.result.
Add DocumentEntry.search_payload plus FileSearchConfig.include_fields to control whether structured fields are included in vector-store uploads.
Update tests and dependency version to align with the new rendering and data model.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

File	Description
python/packages/azure-contentunderstanding/agent_framework_azure_contentunderstanding/_context_provider.py	Switch to `to_llm_input` rendering, add telemetry filtering + body-detection helper, and adjust vector store upload behavior.
python/packages/azure-contentunderstanding/agent_framework_azure_contentunderstanding/_models.py	Update `DocumentEntry` to store rendered strings and add `search_payload`; add `include_fields` option to `FileSearchConfig`.
python/packages/azure-contentunderstanding/agent_framework_azure_contentunderstanding/_extraction.py	Remove legacy extraction/formatting implementation.
python/packages/azure-contentunderstanding/pyproject.toml	Bump `azure-ai-contentunderstanding` dependency to a newer (beta) version supporting the new rendering approach.
python/packages/azure-contentunderstanding/tests/cu/test_models.py	Update model tests for `result` as `str` and new `search_payload` / `include_fields` fields.
python/packages/azure-contentunderstanding/tests/cu/test_integration.py	Update integration assertions to validate rendered string output rather than dict-based markdown extraction.
python/packages/azure-contentunderstanding/tests/cu/test_context_provider.py	Rework provider tests to validate `to_llm_input` wiring, YAML front matter expectations, and telemetry filtering behavior.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <copilot@github.com>

…on-cu-to-llm-input-adoption # Conflicts: # python/packages/azure-contentunderstanding/pyproject.toml

…ings block Address PR microsoft#5796 review comment: the previous defensive scrubber ran a global regex substitution over the full rendered string, so any markdown body bullet shaped like '- LLMStats: ...' would also be silently deleted. Add a _strip_rai_telemetry helper that confines the substitution to the front-matter rai_warnings: YAML sub-block, leaving the body verbatim. Cover the new behavior with three tests (scoped strip, body preservation, and no-op branches).
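
The scoped scrub described in this commit message can be illustrated with the sketch below. It is not the PR's `_strip_rai_telemetry`; the name `strip_rai_telemetry`, the `---` front-matter delimiters, and the exact shape of the telemetry items (`- LLMStats: ...` nested under a top-level `rai_warnings:` key) are assumptions.

```python
import re

# Assumed shape of a telemetry item inside the YAML list, e.g. "  - LLMStats: ...".
_TELEMETRY_ITEM = re.compile(r"^\s*-\s*['\"]?LLMStats:")


def strip_rai_telemetry(rendered: str) -> str:
    """Drop LLMStats items from the front-matter rai_warnings list only."""
    lines = rendered.split("\n")
    if lines[0].strip() != "---":
        return rendered  # no front matter: nothing to scrub
    # Locate the closing front-matter delimiter.
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        return rendered  # unterminated front matter: leave untouched
    kept = [lines[0]]
    in_block = False
    for line in lines[1:end]:
        if line.startswith("rai_warnings:"):
            in_block = True
        elif in_block and line.startswith((" ", "-")):
            if _TELEMETRY_ITEM.match(line):
                continue  # telemetry item inside the block: drop it
        else:
            in_block = False  # any other top-level key ends the block
        kept.append(line)
    # Closing delimiter and body are appended verbatim.
    return "\n".join(kept + lines[end:])
```

Because only the lines between the two delimiters are inspected, a markdown bullet such as `- LLMStats: ...` in the body survives unchanged.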

changjian-wang · 2026-05-21T14:06:06Z

@yungshinlintw Could you please take a look at this PR when you have a chance? Thanks!

…on-cu-to-llm-input-adoption # Conflicts: # python/packages/azure-contentunderstanding/pyproject.toml

…ring (CU context provider) Address PR microsoft#5796 review: remove the redundant search_payload field and _render_search_payload helper, drop the include_fields opt-in (already covered by output_sections), rename _resolve_pending_tokens -> _resolve_pending_analysis, and have _upload_to_vector_store read entry['result'] directly.

changjian-wang · 2026-05-29T03:37:11Z

Pushed 120cd82b8 addressing the latest review:

R2 — Renamed _resolve_pending_tokens → _resolve_pending_analysis and updated the call-site comment.
R4 — Removed search_payload / _render_search_payload and FileSearchConfig.include_fields. entry["result"] is now the single source of truth; the upload path renders once via _render_for_llm, with field inclusion driven by output_sections.
R3 — Cleaned up the now-dangling decision D2/D3/E1 comments and the design-doc reference in the tests.
R1 (rai_warnings telemetry scrub) — intentionally kept as a transitional belt; see my inline replies. Will be removed in a follow-up PR once the SDK fix ships.

All 85 unit tests in azure-contentunderstanding pass.

…CU context provider)

azure-ai-contentunderstanding 1.2.0b2 filters LLMStats telemetry from rai_warnings and emits InputPageNumber page markers in to_llm_input, so the provider's local defense is redundant.

- Bump dependency to azure-ai-contentunderstanding>=1.2.0b2 (re-lock uv.lock)
- Remove _strip_rai_telemetry and its two regexes; _render_for_llm now returns to_llm_input(...) directly
- Delete 4 workaround unit tests for the removed helper

…#5796)

…rce (CU context provider)

Mirrors the Python CU adoption (microsoft#5796) on the .NET side.

- Bump Azure.AI.ContentUnderstanding 1.2.0-beta.1 -> 1.2.0-beta.2 (and its transitive Azure.Core 1.59.0, System.ClientModel 1.14.0, Microsoft.Identity.Client.Extensions.Msal 4.84.2)
- AnalysisRenderer: drop StripTelemetry + regex; ToLlmInput now filters LLMStats upstream
- Remove redundant RenderSearchPayload / DocumentEntry.SearchPayload; vector-store upload reads Result (single render source). MarkdownResult retained for get_analyzed_document(section=Markdown)
- Remove the obsolete StripTelemetry / RenderSearchPayload / SearchPayload unit tests

Copilot AI review requested due to automatic review settings May 13, 2026 01:54

moonbox3 added the python label May 13, 2026

changjian-wang changed the title ~~after_06_maf_sample_05_large_doc_file_search_output~~ Python: Adopt azure-ai-contentunderstanding to_llm_input in CU context provider May 13, 2026

github-actions Bot changed the title ~~Python: Adopt azure-ai-contentunderstanding to_llm_input in CU context provider~~ Python: after_06_maf_sample_05_large_doc_file_search_output May 13, 2026

Copilot started reviewing on behalf of changjian-wang May 13, 2026 01:55

Copilot AI reviewed May 13, 2026

Comment thread ...s/azure-contentunderstanding/agent_framework_azure_contentunderstanding/_context_provider.py Outdated

changjian-wang changed the title ~~Python: after_06_maf_sample_05_large_doc_file_search_output~~ Python: Adopt azure-ai-contentunderstanding to_llm_input in CU context provider May 13, 2026

changjian-wang and others added 5 commits May 13, 2026 10:17

Potential fix for pull request finding

51be00b

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Merge branch 'main' into changjian-wang/python-cu-to-llm-input-adoption

6419e4c

Merge branch 'main' into changjian-wang/python-cu-to-llm-input-adoption

0ab5383

Merge branch 'main' into changjian-wang/python-cu-to-llm-input-adoption

e6c25be

Add test to ensure page markers are preserved in LLM input

420c336

Co-authored-by: Copilot <copilot@github.com>

changjian-wang marked this pull request as ready for review May 21, 2026 10:51

changjian-wang and others added 3 commits May 21, 2026 18:57

Merge remote-tracking branch 'upstream/main' into changjian-wang/pyth…

cd9cdfb

…on-cu-to-llm-input-adoption # Conflicts: # python/packages/azure-contentunderstanding/pyproject.toml

Merge branch 'main' into changjian-wang/python-cu-to-llm-input-adoption

9dbc43c

wangchangjian1130 and others added 5 commits May 21, 2026 22:08

Sync uv.lock with azure-ai-contentunderstanding>=1.2.0b1 dependency bump

63513a3

Merge remote-tracking branch 'upstream/main' into changjian-wang/pyth…

3bb10de

…on-cu-to-llm-input-adoption # Conflicts: # python/packages/azure-contentunderstanding/pyproject.toml

Merge branch 'main' into changjian-wang/python-cu-to-llm-input-adoption

1ec70d5

Merge branch 'main' into changjian-wang/python-cu-to-llm-input-adoption

037325f

Merge branch 'main' into changjian-wang/python-cu-to-llm-input-adoption

5ca50dc

yungshinlintw reviewed May 28, 2026