
Strip internal-only DOCX hyperlink anchors to avoid dead links #2131

Open
martian7777 wants to merge 1 commit into microsoft:main from martian7777:DOCX-internal-hyperlinks

@martian7777

Problem

Internal DOCX Table-of-Contents (TOC) and cross-reference hyperlinks (represented as <w:hyperlink w:anchor="..."> elements with no external relationship ID) were being converted into dead Markdown links (e.g., [Executive Summary](#_Toc12345)). These bookmark anchors do not resolve to anything in the final Markdown document, so the links only add noise for human readers and for LLM consumption.

Solution

  • Modified the DOCX XML pre-processing step (pre_process.py) to search for <w:hyperlink> elements that contain a w:anchor attribute but lack an external relationship r:id attribute.
  • Unwrapped these internal-only hyperlink elements so that they render as plain text in the final Markdown output, keeping their text content and formatting intact without emitting the dead link wrapper.
  • Renamed the internal _pre_process_math helper function to _pre_process_xml to better represent its expanded pre-processing responsibilities.
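The unwrapping step described above can be illustrated with a minimal, standalone sketch. It uses only the standard-library `xml.etree.ElementTree` and a hypothetical helper name, `unwrap_internal_hyperlinks`; it is not the code in pre_process.py, which may use a different parser and structure.

```python
import xml.etree.ElementTree as ET

# Standard WordprocessingML and relationship namespaces.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

HYPERLINK = f"{{{W_NS}}}hyperlink"
ANCHOR = f"{{{W_NS}}}anchor"
REL_ID = f"{{{R_NS}}}id"


def unwrap_internal_hyperlinks(xml_text: str) -> str:
    """Hypothetical helper: replace internal-only <w:hyperlink> elements
    with their children so the runs render as plain text."""
    root = ET.fromstring(xml_text)
    # Walk in reverse document order so descendants are handled before
    # their ancestors.
    for parent in reversed(list(root.iter())):
        new_children = []
        changed = False
        for child in list(parent):
            is_internal = (
                child.tag == HYPERLINK
                and ANCHOR in child.attrib
                and REL_ID not in child.attrib
            )
            if is_internal:
                # Keep the runs (text and formatting), drop the wrapper.
                new_children.extend(list(child))
                changed = True
            else:
                new_children.append(child)
        if changed:
            parent[:] = new_children
    return ET.tostring(root, encoding="unicode")


SAMPLE = f"""<w:document xmlns:w="{W_NS}" xmlns:r="{R_NS}"><w:body><w:p>
<w:hyperlink w:anchor="_Toc12345"><w:r><w:t>Executive Summary</w:t></w:r></w:hyperlink>
<w:hyperlink r:id="rId5"><w:r><w:t>External</w:t></w:r></w:hyperlink>
</w:p></w:body></w:document>"""

result = unwrap_internal_hyperlinks(SAMPLE)
```

In this sketch the hyperlink carrying `r:id` is left untouched, since it points at an external relationship, while the anchor-only hyperlink is replaced by its `<w:r>` runs. Any tail text after an unwrapped hyperlink is discarded, which is assumed to be safe here because DOCX text lives inside `<w:t>` elements.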

Testing

  • Added test_docx_internal_hyperlinks in test_module_misc.py that verifies a DOCX file containing a w:anchor hyperlink converts to plain text rather than a Markdown link.
  • Verified that all other DOCX conversion tests still pass.
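As a rough illustration of the kind of check such a test can make, the sketch below scans Markdown output for fragment-only links like `[Executive Summary](#_Toc12345)`. The helper name `find_fragment_links` and the sample strings are hypothetical; the actual test_docx_internal_hyperlinks converts a real DOCX fixture and may assert differently.

```python
import re

# Matches Markdown links whose target is a bare fragment, e.g. [text](#anchor).
FRAGMENT_LINK = re.compile(r"\[([^\]]+)\]\(#([^)\s]+)\)")


def find_fragment_links(markdown: str) -> list[tuple[str, str]]:
    """Hypothetical helper: return (text, anchor) pairs for fragment-only links."""
    return FRAGMENT_LINK.findall(markdown)


before = "See [Executive Summary](#_Toc12345) for details."
after = "See Executive Summary for details on [the site](https://example.com)."

dead_links_before = find_fragment_links(before)
dead_links_after = find_fragment_links(after)
```

A test along these lines would assert that the converted output yields no fragment-only links while the link text itself is still present.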

@martian7777

Author

This PR addresses #2125.

@martian7777

Author

@microsoft-github-policy-service agree
