feat(tokenizers): add experimental Korean tokenizer #1018
Merged
…hitespace coverage
The Korean tokenizer uses Intl.Segmenter('ko', { granularity: 'word' }), which
segments Korean on whitespace (UAX#29). Space-less compounds such as the Hangul
for "Seoul National University" therefore stay a single token and are not split
into "Seoul" + "university" (unlike the Japanese/Mandarin tokenizers, which
dictionary-split their equivalents).
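
The segmentation behavior described above can be checked directly with the standard Intl.Segmenter API. The sketch below is not the code in packages/tokenizers/src/korean.ts: koreanWords is a hypothetical helper, the isWordLike filter is an assumption about how punctuation and spaces would be dropped, and the sample strings are illustrative (서울대학교 is the Hangul for "Seoul National University"). It assumes a runtime with full ICU data, such as a recent Node.js.

```typescript
// Minimal local typing so the sketch compiles even if the project's TS "lib"
// setting predates the ES2022 Intl typings. Intl.Segmenter itself is standard.
type WordSegment = { segment: string; isWordLike?: boolean };
type SegmenterCtor = new (
  locale: string,
  options: { granularity: "word" }
) => { segment(input: string): Iterable<WordSegment> };
const Segmenter = (Intl as unknown as { Segmenter: SegmenterCtor }).Segmenter;

// Hypothetical helper: returns the word-like segments of a Korean string.
function koreanWords(text: string): string[] {
  const segmenter = new Segmenter("ko", { granularity: "word" });
  const words: string[] = [];
  for (const part of segmenter.segment(text)) {
    // Skip whitespace and punctuation segments.
    if (part.isWordLike) words.push(part.segment);
  }
  return words;
}

// A space-less compound stays one token; a space produces a break.
console.log(koreanWords("서울대학교"));        // expected: ["서울대학교"]
console.log(koreanWords("서울대학교 도서관")); // expected: ["서울대학교", "도서관"]
```

The compound is never split into a "Seoul" part and a "university" part, which is the property the added assertions pin down.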
The existing city-name searches already pass because Orama matches the radix
index by prefix (default exact: false), so the city name matches the whole
compound token. Add comments explaining this and two assertions that make the
behavior explicit and regression-proof:
- searching the shared "university" suffix returns 0 hits (compounds are not
split), the inverse of japanese.test.ts where the suffix matches all three
universities;
- a space-separated multi-word document is found when searching its middle
word, proving whitespace word segmentation works.
No tokenizer behavior changed.
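
The interaction between whole-compound tokens and prefix matching can be illustrated with a toy model. This is only a sketch under stated assumptions: it does not use Orama's radix index or the tests in korean.test.ts, the documents are hypothetical fixtures, tokenize is a whitespace-split stand-in for the tokenizer, and prefixSearch is a hypothetical name for a simplified prefix lookup.

```typescript
// Hypothetical documents; the PR's actual test fixtures are not shown here.
const docs: string[] = [
  "서울대학교",     // "Seoul National University", written without spaces
  "부산대학교",     // "Pusan National University", written without spaces
  "맑은 가을 하늘", // a space-separated three-word phrase
];

// Stand-in tokenizer: whitespace split, mirroring how the commit message
// describes word segmentation for Korean.
function tokenize(text: string): string[] {
  return text.split(/\s+/).filter((token) => token.length > 0);
}

// Toy prefix search: a document matches when any of its tokens starts with
// the search term. This imitates prefix matching only in spirit.
function prefixSearch(term: string): string[] {
  return docs.filter((doc) =>
    tokenize(doc).some((token) => token.startsWith(term))
  );
}

console.log(prefixSearch("서울"));   // ["서울대학교"]: city name is a prefix of the compound
console.log(prefixSearch("대학교")); // []: the shared suffix is not a prefix of any token
console.log(prefixSearch("가을"));   // ["맑은 가을 하늘"]: middle word is its own token
```

The three calls correspond to the three cases in the commit message: the city-name search that already passes, the suffix search that returns 0 hits, and the middle-word search over a space-separated document.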
Why add Korean support?
@orama/tokenizers currently supports experimental CJK tokenizers for Japanese and Mandarin, but not Korean.
This makes Korean text search inconsistent compared to other CJK languages.
Adding a Korean tokenizer:
@orama/tokenizers/korean)
What changed
New tokenizer
- packages/tokenizers/src/korean.ts
- Intl.Segmenter("ko", { granularity: "word" })
Exports and package wiring
- packages/tokenizers/src/index.ts: korean tokenizer export
- packages/tokenizers/package.json: ./korean export entries (ESM/CJS types + runtime files), korean keyword
Tests
- packages/tokenizers/tests/korean.test.ts