Fix _is_nonsense_url dropping trailing-slash language roots (/en/) by jichaowang02-lang · Pull Request #2035 · unclecode/crawl4ai

jichaowang02-lang · 2026-06-23T00:59:40Z

Summary

AsyncUrlSeeder._is_nonsense_url filters "very short paths", but the check is inconsistent:

if len(path.strip('/')) < 3 and path not in ['/', '/en', '/de', '/fr', '/es', '/it']:
    return True

The length test strips slashes (path.strip('/')), but the whitelist membership uses the un-stripped path (which keeps trailing slashes). So the canonical trailing-slash form of a language root is dropped as nonsense:

| url | before | after |
| --- | --- | --- |
| `/en` | kept | kept ✅ |
| `/en/` | filtered ❌ | kept ✅ |
| `/de/`, `/fr/`, `/es/`, `/it/` | filtered ❌ | kept ✅ |
| `/ab/`, `/x` (real junk) | filtered | filtered ✅ |

Trailing slashes are the conventional directory form emitted by sitemaps and by urljoin normalization, so with filter_nonsense_urls enabled (the default) legitimate localized landing pages (/en/, /de/, …) were silently discarded from seeding results.
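
To make the mismatch concrete, here is a minimal standalone reproduction. It copies only the short-path condition quoted above into a hypothetical helper, `old_short_path_check`; the real `_is_nonsense_url` performs other checks that are not shown here, and `example.com` and `pricing.html` are placeholders.

```python
from urllib.parse import urljoin, urlparse

# Whitelist exactly as quoted in the summary above.
LANGUAGE_ROOTS = ['/', '/en', '/de', '/fr', '/es', '/it']


def old_short_path_check(path: str) -> bool:
    """Return True when the pre-fix condition would filter the path.

    Hypothetical helper: isolates only the "very short path" condition.
    """
    return len(path.strip('/')) < 3 and path not in LANGUAGE_ROOTS


# Resolving "." against a page under /en/ yields the directory form,
# which carries a trailing slash.
directory_url = urljoin("https://example.com/en/pricing.html", ".")
directory_path = urlparse(directory_url).path

print(directory_path, old_short_path_check(directory_path))  # /en/ True  (dropped)
print("/en", old_short_path_check("/en"))                    # /en False (kept)
```

The length test sees `en` (2 characters) in both cases, but only the un-slashed `/en` matches a whitelist entry, so `/en/` falls through to the filter.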

Fix

Compare the slash-stripped path against the whitelist too, so a language root is kept with or without a trailing slash. Genuinely short non-language paths are still filtered.
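
The idea can be sketched as follows. This is an illustration under assumptions, not the actual diff: `is_short_nonsense_path` and `LANGUAGE_CODES` are hypothetical names, and the sketch keeps the original un-stripped membership test so that `/` behaves as before.

```python
# Whitelist as quoted in the summary, plus the same language roots
# with their slashes stripped (hypothetical constant for this sketch).
LANGUAGE_ROOTS = ['/', '/en', '/de', '/fr', '/es', '/it']
LANGUAGE_CODES = {'en', 'de', 'fr', 'es', 'it'}


def is_short_nonsense_path(path: str) -> bool:
    """Return True when a very short path should be filtered.

    Both the length test and the whitelist test look at the
    slash-stripped path, so '/en' and '/en/' are treated alike.
    """
    stripped = path.strip('/')
    if len(stripped) >= 3:
        return False  # not a "very short" path; other checks decide
    whitelisted = path in LANGUAGE_ROOTS or stripped in LANGUAGE_CODES
    return not whitelisted


for candidate in ['/en', '/en/', '/de/', '/ab/', '/x', '/']:
    verdict = 'filtered' if is_short_nonsense_path(candidate) else 'kept'
    print(f'{candidate}: {verdict}')
```

With this shape, the language roots are kept in both forms while `/ab/` and `/x` remain filtered, matching the before/after table in the summary.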

Testing

$ pytest tests/unit/test_nonsense_url_language_roots.py -q
11 passed

Adds tests/unit/test_nonsense_url_language_roots.py; the trailing-slash language-root cases fail on the current code (5 failed) and pass with this fix, while short-junk/utility URLs stay filtered.

The "very short path" filter measured length on the slash-stripped path (`len(path.strip('/')) < 3`) but matched the *un-stripped* path against the language-root whitelist (`path not in ['/', '/en', ...]`). So the canonical trailing-slash form of a language root (`/en/`, `/de/`, `/fr/`, `/es/`, `/it/`, which is what sitemaps and urljoin normalization commonly emit) was filtered out as nonsense, even though `/en` was correctly kept. With filter_nonsense_urls on (the default), these valid localized landing pages were silently dropped from seeding results.

Compare the slash-stripped path against the whitelist as well, so a language root is kept regardless of a trailing slash; genuinely short junk paths (`/ab/`, `/x`) are still filtered.

Adds tests/unit/test_nonsense_url_language_roots.py; the trailing-slash cases fail on the old code and pass with this fix.

Copilot

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

Copilot AI review requested due to automatic review settings June 23, 2026 00:59

Copilot AI reviewed Jun 23, 2026

jichaowang02-lang wants to merge 1 commit into unclecode:develop from jichaowang02-lang:fix/nonsense-url-trailing-slash-lang-roots
