Fix _is_nonsense_url dropping trailing-slash language roots (/en/) #2035

Open
jichaowang02-lang wants to merge 1 commit into unclecode:develop from jichaowang02-lang:fix/nonsense-url-trailing-slash-lang-roots

@jichaowang02-lang

Summary

AsyncUrlSeeder._is_nonsense_url filters "very short paths", but the check is inconsistent:

if len(path.strip('/')) < 3 and path not in ['/', '/en', '/de', '/fr', '/es', '/it']:
    return True

The length test strips slashes (path.strip('/')), but the whitelist membership uses the un-stripped path (which keeps trailing slashes). So the canonical trailing-slash form of a language root is dropped as nonsense:

| url | before | after |
| --- | --- | --- |
| /en | kept | kept ✅ |
| /en/ | filtered ❌ | kept ✅ |
| /de/, /fr/, /es/, /it/ | filtered ❌ | kept ✅ |
| /ab/, /x (real junk) | filtered | filtered ✅ |

Trailing slashes are the conventional directory form emitted by sitemaps and by urljoin normalization, so with filter_nonsense_urls enabled (the default) legitimate localized landing pages (/en/, /de/, …) were silently discarded from seeding results.
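
The mismatch can be reproduced in isolation. The following is a minimal sketch, not the actual method body: `old_short_path_check` is a hypothetical stand-in that copies only the quoted condition, and it assumes the path comes from `urllib.parse.urlparse`. The `example.com` URLs are placeholders.

```python
from urllib.parse import urlparse

# Whitelist copied from the condition quoted above.
LANGUAGE_ROOTS = ['/', '/en', '/de', '/fr', '/es', '/it']


def old_short_path_check(url: str) -> bool:
    """Hypothetical stand-in for the pre-fix check: True means 'filtered'."""
    path = urlparse(url).path
    # Length is measured on the slash-stripped path, but whitelist
    # membership is tested on the raw path, which keeps its trailing slash.
    return len(path.strip('/')) < 3 and path not in LANGUAGE_ROOTS


for url in (
    "https://example.com/en",
    "https://example.com/en/",
    "https://example.com/ab/",
):
    verdict = "filtered" if old_short_path_check(url) else "kept"
    print(f"{url} -> {verdict}")
```

With this stand-in, `/en` is kept while `/en/` and `/ab/` are both filtered, matching the "before" column.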

Fix

Compare the slash-stripped path against the whitelist too, so a language root is kept with or without a trailing slash. Genuinely short non-language paths are still filtered.
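
One way to express this, as a sketch only: the diff itself is not reproduced in this description, so `is_short_nonsense_path` is a hypothetical helper name and the normalization step is an assumption about how the comparison could be written.

```python
# Whitelist as quoted in the original check.
LANGUAGE_ROOTS = ['/', '/en', '/de', '/fr', '/es', '/it']


def is_short_nonsense_path(path: str) -> bool:
    """Hypothetical helper: True if `path` should be filtered as very short."""
    stripped = path.strip('/')
    if len(stripped) >= 3:
        return False  # not a "very short path"; this check does not apply
    # Normalize to one leading slash and no trailing slash before the
    # whitelist lookup, so '/en' and '/en/' are treated the same.
    # Note: an empty path also normalizes to '/' in this sketch.
    normalized = '/' + stripped
    return normalized not in LANGUAGE_ROOTS


for p in ('/en', '/en/', '/de/', '/ab/', '/x'):
    verdict = "filtered" if is_short_nonsense_path(p) else "kept"
    print(f"{p} -> {verdict}")
```

Here `/en`, `/en/` and `/de/` are kept, while `/ab/` and `/x` are still filtered, matching the "after" column.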

Testing

$ pytest tests/unit/test_nonsense_url_language_roots.py -q
11 passed

Adds tests/unit/test_nonsense_url_language_roots.py; the trailing-slash language-root cases fail on the current code (5 failed) and pass with this fix, while short-junk/utility URLs stay filtered.

The "very short path" filter measured length on the slash-stripped path
(`len(path.strip('/')) < 3`) but matched the *un-stripped* path against the
language-root whitelist (`path not in ['/', '/en', ...]`). So the canonical
trailing-slash form of a language root — `/en/`, `/de/`, `/fr/`, `/es/`,
`/it/`, which is what sitemaps and urljoin normalization commonly emit — was
filtered out as nonsense, even though `/en` was correctly kept. With
filter_nonsense_urls on (the default), these valid localized landing pages
were silently dropped from seeding results.

Compare the slash-stripped path against the whitelist as well, so a language
root is kept regardless of a trailing slash; genuinely short junk paths
(`/ab/`, `/x`) are still filtered.

Adds tests/unit/test_nonsense_url_language_roots.py; the trailing-slash cases
fail on the old code and pass with this fix.
Copilot AI review requested due to automatic review settings June 23, 2026 00:59

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.
