[WIP] Update CLDR data to 44.1.0 and add hour/duration support#1343

Open
serhii73 wants to merge 3 commits into master from cldr-update-44

@serhii73

Collaborator

Summary

  • Upgrades CLDR source data from 31.0.1 → 44.1.0: 288 new locales added, all existing locale JSONs updated
  • Adds hour/duration unit to relative-type for all locales (e.g. "2 hours ago" now parses correctly in more languages)
  • Regenerates all dateparser/data/date_translation_data/*.py files from the new source data
  • Modernizes dateparser_scripts/ to use pathlib.Path and the unified cldr-json repository layout
  • Restores Ukrainian words removed by the CLDR upgrade (uk.yaml)
  • Updates docs/supported_locales.rst and languages_info.py
  • Updates 34 test inputs to reflect CLDR 44.1.0 abbreviation/name changes
  • Adds /cldr-json/ to .gitignore

Finishes #1216 (by @Gallaecio). The original PR was a draft with merge conflicts; this re-implements the same work cleanly on top of current master, preserving the possessive-quantifier optimization from #1335.
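
As a rough illustration of the pathlib-style traversal mentioned in the summary, the sketch below collects one JSON file per locale directory. The helper name `find_locale_files` and the one-directory-per-locale layout are assumptions made for illustration; they are not taken from dateparser_scripts/.

```python
from pathlib import Path


def find_locale_files(root: Path, filename: str) -> dict:
    """Map each locale directory name under `root` to its `filename`, if present."""
    found = {}
    # Sort so the result is deterministic regardless of filesystem order.
    for locale_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        candidate = locale_dir / filename  # "/" joins path components
        if candidate.is_file():
            found[locale_dir.name] = candidate
    return found
```

Compared with string-based `os.path.join` calls, the `/` operator and methods such as `is_file()` keep path handling on one object, which is presumably the motivation for the change.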

Test plan

  • All 1376 existing tests pass (python -m pytest tests/test_languages.py)
  • Test inputs updated where CLDR 44.1.0 changed abbreviations (bs-Latn, ce, kl, qu, so, sr, sw, zu, am, as, brx, hy, ig, kok, mr, nn, de, eu, gu, mk, chr, bs-Cyrl)
  • Basic smoke test: dateparser.parse("2 hours ago") and dateparser.parse("vor 2 Stunden") both return valid dates
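
The smoke test in the last item could be scripted roughly as follows. This is a minimal sketch: `failed_phrases` and `run_smoke_test` are hypothetical helper names, and only `dateparser.parse` and the two phrases come from the test plan.

```python
from datetime import datetime


def failed_phrases(parse, phrases):
    """Return the phrases for which `parse` did not produce a datetime."""
    return [p for p in phrases if not isinstance(parse(p), datetime)]


def run_smoke_test():
    """Run the check against dateparser; requires dateparser to be installed."""
    import dateparser  # imported lazily so the helper above stays dependency-free

    return failed_phrases(dateparser.parse, ["2 hours ago", "vor 2 Stunden"])
```

An empty list from `run_smoke_test()` would mean both phrases parsed to a datetime.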

🤖 Generated with Claude Code

- Upgrade CLDR source data from version 31.0.1 to 44.1.0: 288 new
  locales added, all existing locale JSONs updated in
  dateparser_data/cldr_language_data/date_translation_data/
- Add hour/duration unit to relative-type data for all locales
- Regenerate all dateparser/data/date_translation_data/*.py from the
  new source data, keeping master's possessive-quantifier optimization
  in relative-type-regex patterns
- Modernize dateparser_scripts/ to use pathlib.Path and unified
  cldr-json repository layout (utils.py, write_complete_data.py,
  get_cldr_data.py, order_languages.py)
- Restore Ukrainian words removed by the CLDR upgrade (uk.yaml)
- Update docs/supported_locales.rst and languages_info.py
- Update 34 test inputs in test_languages.py to reflect CLDR 44.1.0
  abbreviation/name changes (bs-Latn, ce, kl, qu, so, sr, sw, zu,
  am, as, brx, hy, ig, kok, mr, nn, de, eu, gu, mk, chr, bs-Cyrl)
- Add /cldr-json/ to .gitignore

Finishes PR #1216 (Gallaecio:cldr-update).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@serhii73 serhii73 mentioned this pull request Jun 24, 2026
@codecov

codecov Bot commented Jun 24, 2026


Codecov Report

❌ Patch coverage is 1.12360% with 88 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.53%. Comparing base (33e913c) to head (123c81e).

Files with missing lines                               Patch %   Lines
dateparser/data/date_translation_data/aa.py            0.00%     1 Missing ⚠️
dateparser/data/date_translation_data/ab.py            0.00%     1 Missing ⚠️
dateparser/data/date_translation_data/an.py            0.00%     1 Missing ⚠️
dateparser/data/date_translation_data/ann.py           0.00%     1 Missing ⚠️
dateparser/data/date_translation_data/apc.py           0.00%     1 Missing ⚠️
dateparser/data/date_translation_data/arn.py           0.00%     1 Missing ⚠️
dateparser/data/date_translation_data/az-Arab.py       0.00%     1 Missing ⚠️
dateparser/data/date_translation_data/ba.py            0.00%     1 Missing ⚠️
dateparser/data/date_translation_data/bal-Arab.py      0.00%     1 Missing ⚠️
dateparser/data/date_translation_data/bal-Latn.py      0.00%     1 Missing ⚠️
... and 78 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1343      +/-   ##
==========================================
- Coverage   97.11%   92.53%   -4.59%     
==========================================
  Files         235      379     +144     
  Lines        2909     3053     +144     
==========================================
  Hits         2825     2825              
- Misses         84      228     +144     


serhii73 and others added 2 commits June 24, 2026 13:12
- en.yaml: restore 'in {0} weeks time' and 'in {0} weeks\' time' patterns
  for 'in \1 week' that CLDR 44 dropped; regenerate en.py
- af.yaml: restore 'sek' (second abbreviation) dropped by CLDR 44; regenerate af.py
- en-US: CLDR 44 has no en-US.json (US settings merged into base en);
  add minimal en-US.json, regenerate en-US.py, add en-US back to
  languages_info.py locale list for 'en'
- tests/test_freshness_date_parser.py: update Cherokee (chr) test input
  from uppercase to lowercase Cherokee encoding used by CLDR 44 patterns
- tests/test_search.py: update Danish detection test to text that includes
  'tirsdag' and 'januar' (distinctly Danish vs Swedish), since CLDR 44
  sv.py changes caused the old text to be ambiguous

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
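
To make the en.yaml item concrete, here is a minimal sketch of how variants such as 'in {0} weeks time' could be rewritten to the canonical 'in \1 week' form. The helper names and the numeric sub-pattern are assumptions for illustration; the actual generation script and the generated patterns may differ.

```python
import re

NUMBER = r"(\d+[.,]?\d*)"  # assumed numeric sub-pattern, for illustration only


def to_regex(pattern: str) -> re.Pattern:
    """Turn a '{0}'-style pattern into a compiled, case-insensitive regex."""
    # Escape the literal text first, then swap the escaped placeholder
    # for a capture group so that "\1" can refer to the number.
    escaped = re.escape(pattern).replace(re.escape("{0}"), NUMBER)
    return re.compile(escaped, re.IGNORECASE)


def normalize(text: str, canonical: str, variants: list) -> str:
    """Rewrite the first matching variant in `text` into the canonical form."""
    for variant in variants:
        new_text, count = to_regex(variant).subn(canonical, text)
        if count:
            return new_text
    return text
```

For example, `normalize("in 2 weeks' time", r"in \1 week", ["in {0} weeks time", "in {0} weeks' time"])` yields `"in 2 week"`, which is the kind of simplified string a later parsing step can work with.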
The en-US locale is loaded via en.py's locale_specific["en-US"] section,
not as a standalone language file. Adding it to en.yaml ensures the
locale_specific entry gets the correct name ("en-US") and date_order
(MDY). Remove the standalone en-US.json that was added by mistake.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
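
The relationship between the base language data and a `locale_specific` entry can be pictured with a toy dictionary. This is a sketch under stated assumptions: the keys `name`, `date_order` and `locale_specific` follow the commit message, while `resolve_locale`, the shallow merge and the sample values are hypothetical and do not reproduce the real en.py.

```python
def resolve_locale(info: dict, locale: str) -> dict:
    """Overlay a locale-specific section (if any) on top of the base language data."""
    base = {k: v for k, v in info.items() if k != "locale_specific"}
    overrides = info.get("locale_specific", {}).get(locale, {})
    # Later keys win, so locale-specific values replace base values.
    return {**base, **overrides}


# Toy stand-in for generated language data; values are illustrative only.
en_info = {
    "name": "en",
    "date_order": "DMY",
    "locale_specific": {
        "en-US": {"name": "en-US", "date_order": "MDY"},
    },
}
```

With this toy data, resolving "en-US" picks up the MDY order from the overlay, and a locale without an entry keeps the base values unchanged.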
Comment on lines -499 to +504
"en-US",
"en-VC",
"en-VG",
"en-VI",
"en-VU",
"en-WS",
"en-US",

Contributor

This looks weirdly unnecessary.


json_dict = {}
json_dict = OrderedDict()

Contributor

Is this needed?

@@ -61,7 +61,7 @@ et
eu
ewo
fa 'fa-AF'
ff 'ff-CM', 'ff-GN', 'ff-MR'
ff

Contributor

What does this mean in user terms? Would a user that was passing this somewhere now get an error? If so, could we handle this differently, provide support for these locales but with the same data as ff? (if that was not the case already)
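
One way to read this suggestion is a fallback from a regional locale to its base language when the regional entry is no longer listed. The sketch below is hypothetical: `nearest_available` and the sample set of locales are invented for illustration and do not describe how dateparser resolves locales.

```python
def nearest_available(locale: str, available: set):
    """Return `locale` or its closest parent found in `available`, else None."""
    parts = locale.split("-")
    while parts:
        candidate = "-".join(parts)
        if candidate in available:
            return candidate
        parts.pop()  # drop the last subtag, e.g. "ff-CM" -> "ff"
    return None
```

Under this scheme a caller passing "ff-CM" would be served the "ff" data instead of getting an error.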

Contributor

Are the changes in test expectations necessary? (here and in the other test files)

serhii73 changed the title from "Update CLDR data to 44.1.0 and add hour/duration support" to "[WIP] Update CLDR data to 44.1.0 and add hour/duration support" Jun 25, 2026