Python: Adopt azure-ai-contentunderstanding to_llm_input in CU context provider#5796
Open
changjian-wang wants to merge 19 commits into
- Changed the type of `result` in DocumentEntry from dict to str to store LLM-ready text.
- Introduced `search_payload` in DocumentEntry for optional alternate rendering.
- Updated FileSearchConfig to include `include_fields` option for vector store uploads.
- Modified tests to reflect changes in DocumentEntry and FileSearchConfig.
- Adjusted integration tests to validate new result structure and rendering.
- Removed legacy format_result tests as rendering is now handled by the SDK.
Pull request overview
This PR updates the Azure Content Understanding Python integration to render CU AnalysisResult objects using the Azure SDK’s standardized to_llm_input output (YAML front matter + body text), and to manage vector-store uploads more defensively by avoiding empty/stub documents and reducing noisy warning telemetry.
Changes:
- Replace custom CU extraction/formatting with `azure.ai.contentunderstanding.to_llm_input`, storing rendered text directly in `DocumentEntry.result`.
- Add `DocumentEntry.search_payload` plus `FileSearchConfig.include_fields` to control whether structured fields are included in vector-store uploads.
- Update tests and dependency version to align with the new rendering and data model.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| python/packages/azure-contentunderstanding/agent_framework_azure_contentunderstanding/_context_provider.py | Switch to to_llm_input rendering, add telemetry filtering + body-detection helper, and adjust vector store upload behavior. |
| python/packages/azure-contentunderstanding/agent_framework_azure_contentunderstanding/_models.py | Update DocumentEntry to store rendered strings and add search_payload; add include_fields option to FileSearchConfig. |
| python/packages/azure-contentunderstanding/agent_framework_azure_contentunderstanding/_extraction.py | Remove legacy extraction/formatting implementation. |
| python/packages/azure-contentunderstanding/pyproject.toml | Bump azure-ai-contentunderstanding dependency to a newer (beta) version supporting the new rendering approach. |
| python/packages/azure-contentunderstanding/tests/cu/test_models.py | Update model tests for result as str and new search_payload / include_fields fields. |
| python/packages/azure-contentunderstanding/tests/cu/test_integration.py | Update integration assertions to validate rendered string output rather than dict-based markdown extraction. |
| python/packages/azure-contentunderstanding/tests/cu/test_context_provider.py | Rework provider tests to validate to_llm_input wiring, YAML front matter expectations, and telemetry filtering behavior. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <copilot@github.com>
…on-cu-to-llm-input-adoption # Conflicts: # python/packages/azure-contentunderstanding/pyproject.toml
…ings block Address PR microsoft#5796 review comment: the previous defensive scrubber ran a global regex substitution over the full rendered string, so any markdown body bullet shaped like '- LLMStats: ...' would also be silently deleted. Add a _strip_rai_telemetry helper that confines the substitution to the front-matter rai_warnings: YAML sub-block, leaving the body verbatim. Cover the new behavior with three tests (scoped strip, body preservation, and no-op branches).
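The scoping described in this commit message can be sketched as follows. This is a minimal illustration, not the PR's actual helper: the function name `strip_rai_telemetry`, the three regexes, and the assumed front-matter layout (a `---` fenced block containing a `rai_warnings:` list of `- ...` items) are all assumptions made for the example.

```python
import re

# Hypothetical patterns; the PR's real regexes are not shown in this conversation.
_FRONT_MATTER = re.compile(r"\A---\n(.*?\n)---\n", re.DOTALL)
_RAI_BLOCK = re.compile(r"^rai_warnings:\n(?:[ \t]*- .*\n)+", re.MULTILINE)
_TELEMETRY_ITEM = re.compile(r"^[ \t]*- LLMStats:.*\n", re.MULTILINE)


def _scrub_block(block_match: re.Match) -> str:
    """Remove LLMStats items from one rai_warnings block."""
    kept = _TELEMETRY_ITEM.sub("", block_match.group(0))
    # If every item was telemetry, drop the now-empty key as well.
    return "" if kept == "rai_warnings:\n" else kept


def strip_rai_telemetry(rendered: str) -> str:
    """Drop 'LLMStats:' items from the front-matter rai_warnings list only."""
    match = _FRONT_MATTER.match(rendered)
    if match is None:
        return rendered  # no front matter: nothing to scrub
    cleaned = _RAI_BLOCK.sub(_scrub_block, match.group(1))
    # Splice the cleaned front matter back in; the body is copied verbatim.
    return rendered[: match.start(1)] + cleaned + rendered[match.end(1):]
```

Because the substitution only ever sees the text between the two `---` fences, a markdown bullet in the body that happens to start with `- LLMStats:` is left untouched, which is the failure mode the review comment pointed out in the global-regex version.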
@yungshinlintw Could you please take a look at this PR when you have a chance? Thanks!
…on-cu-to-llm-input-adoption # Conflicts: # python/packages/azure-contentunderstanding/pyproject.toml
yungshinlintw suggested changes on May 29, 2026
…ring (CU context provider) Address PR microsoft#5796 review: remove the redundant search_payload field and _render_search_payload helper, drop the include_fields opt-in (already covered by output_sections), rename _resolve_pending_tokens -> _resolve_pending_analysis, and have _upload_to_vector_store read entry['result'] directly.
Pushed
All 85 unit tests in |
eavanvalkenburg approved these changes on Jun 1, 2026
This was referenced Jun 3, 2026
…CU context provider) azure-ai-contentunderstanding 1.2.0b2 filters LLMStats telemetry from rai_warnings and emits InputPageNumber page markers in to_llm_input, so the provider's local defense is redundant. - Bump dependency to azure-ai-contentunderstanding>=1.2.0b2 (re-lock uv.lock) - Remove _strip_rai_telemetry and its two regexes; _render_for_llm now returns to_llm_input(...) directly - Delete 4 workaround unit tests for the removed helper
changjian-wang added a commit to changjian-wang/agent-framework that referenced this pull request on Jun 12, 2026
…rce (CU context provider) Mirrors the Python CU adoption (microsoft#5796) on the .NET side. - Bump Azure.AI.ContentUnderstanding 1.2.0-beta.1 -> 1.2.0-beta.2 (and its transitive Azure.Core 1.59.0, System.ClientModel 1.14.0, Microsoft.Identity.Client.Extensions.Msal 4.84.2) - AnalysisRenderer: drop StripTelemetry + regex; ToLlmInput now filters LLMStats upstream - Remove redundant RenderSearchPayload / DocumentEntry.SearchPayload; vector-store upload reads Result (single render source). MarkdownResult retained for get_analyzed_document(section=Markdown) - Remove the obsolete StripTelemetry / RenderSearchPayload / SearchPayload unit tests
This pull request refactors how Content Understanding (CU) analysis results are rendered and managed, focusing on improving LLM input formatting, filtering telemetry noise, and supporting more robust vector store uploads. The main themes are a shift from custom extraction/formatting to standardized rendering using the Azure SDK, defensive handling of noisy telemetry, and more careful upload logic to avoid polluting the vector store with empty documents.
LLM Input Rendering and Formatting:
- Adopts the Azure SDK's `to_llm_input` function for rendering `AnalysisResult` objects into LLM-friendly YAML-front-matter-prefixed text, ensuring consistent and standardized output. (`_render_for_llm`, `_render_search_payload`, removal of `extract_sections` and `format_result`)
- Adds `_has_renderable_body` to detect and skip uploads of documents that contain only YAML front matter and no meaningful body, preventing empty or useless entries in the vector store.

Telemetry and Defensive Filtering:
- Strips telemetry entries (`LLMStats:`) from the `rai_warnings` YAML list before rendering, reducing noise in LLM inputs and outputs.

Vector Store Upload Logic:
- Uploads `search_payload` (which defaults to a chunking-friendly, fields-stripped format) and ensures only documents with a renderable body are uploaded. Falls back to the LLM-injection rendering if `search_payload` is missing.

General Refactoring and Documentation:

Internal Data Flow and Error Handling:
- Stores both the LLM rendering and the search payload in `DocumentEntry`, and ensures both are handled correctly in background and error paths.

These changes modernize the codebase, improve reliability, and make the document ingestion pipeline more robust and maintainable.
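
The body-detection check mentioned under LLM Input Rendering and Formatting might look roughly like the sketch below. It is an assumption-laden illustration rather than the PR's `_has_renderable_body`: the name `has_renderable_body` and the rule it applies (front matter is a leading block fenced by `---` lines, and a body exists if any non-blank line follows the closing fence) are chosen for the example.

```python
def has_renderable_body(rendered: str) -> bool:
    """Return True if text remains after an optional YAML front-matter block."""
    lines = rendered.strip().splitlines()
    if not lines:
        return False  # empty or whitespace-only rendering
    if lines[0].strip() != "---":
        return True  # no front matter, so the non-blank text is the body
    # Line 0 is the opening fence; look for the closing one.
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            return any(line.strip() for line in lines[index + 1:])
    return False  # unterminated front matter: treat as having no body
```

An upload path could call such a check on the rendered string and skip the vector-store write when it returns `False`, so that front-matter-only stubs never become searchable entries.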