feat(tokenizers): add experimental Korean tokenizer by dayongkr · Pull Request #1018 · oramasearch/orama

dayongkr · 2026-02-16T05:32:47Z

Why add Korean support?

@orama/tokenizers currently ships experimental CJK tokenizers for Japanese and
Mandarin, but not Korean.
As a result, Korean text search behaves inconsistently with the other CJK languages.

Adding a Korean tokenizer:

- enables proper Korean word segmentation for indexing and search
- keeps language support more consistent across CJK
- provides an official import path for Korean users (@orama/tokenizers/korean)

What changed

New tokenizer

Added packages/tokenizers/src/korean.ts
- Implements Korean tokenization using Intl.Segmenter("ko", { granularity: "word" })
- Follows the same structure and behavior as existing Japanese/Mandarin tokenizers
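
For readers unfamiliar with Intl.Segmenter, here is a minimal sketch of word-granularity segmentation. It is not the code in packages/tokenizers/src/korean.ts: the helper name segmentKorean is hypothetical, and it assumes a runtime with full ICU data (recent Node.js) and a TypeScript lib setting that includes the ES2022 Intl typings.

```typescript
// Hypothetical helper for illustration only; not the PR's implementation.
function segmentKorean(text: string): string[] {
  const segmenter = new Intl.Segmenter("ko", { granularity: "word" });
  const tokens: string[] = [];
  for (const part of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments,
    // so only real words are kept.
    if (part.isWordLike) {
      tokens.push(part.segment);
    }
  }
  return tokens;
}

// Two space-separated Hangul words ("Seoul" and "university").
console.log(segmentKorean("서울 대학교")); // [ '서울', '대학교' ]
```

Filtering on isWordLike is one way to discard the whitespace segments that the segmenter also yields; whether the real tokenizer filters the same way is not stated in this PR description.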

Exports and package wiring

Updated packages/tokenizers/src/index.ts
- Added korean tokenizer export
Updated packages/tokenizers/package.json
- Added ./korean export entries (ESM/CJS types + runtime files)
- Included Korean test in the package test script
- Added korean keyword

Tests

Added packages/tokenizers/tests/korean.test.ts
- Validates Korean tokenization/search behavior with Korean city and university names
- Keeps style aligned with existing Japanese/Mandarin tokenizer tests

…hitespace coverage

The Korean tokenizer uses Intl.Segmenter('ko', { granularity: 'word' }), which segments Korean on whitespace (UAX#29). Space-less compounds such as the Hangul for "Seoul National University" therefore stay a single token and are not split into "Seoul" + "university" (unlike the Japanese/Mandarin tokenizers, which dictionary-split their equivalents).

The existing city-name searches already pass because Orama matches the radix index by prefix (default exact: false), so the city name matches the whole compound token.

Add comments explaining this and two assertions that make the behavior explicit and regression-proof:
- searching the shared "university" suffix returns 0 hits (compounds are not split), the inverse of japanese.test.ts where the suffix matches all three universities
- a space-separated multi-word document is found when searching its middle word, proving whitespace word segmentation works

No tokenizer behavior changed.
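
The interaction between whitespace segmentation and prefix matching can be sketched without Orama. In the sketch below, prefixHits is a hypothetical stand-in that uses startsWith to imitate a prefix lookup; it is not Orama's radix index or its search API. The Hangul strings are illustrative examples, not the fixtures from korean.test.ts, and the single-token result for compounds is taken from the commit message and assumes a runtime with full ICU data.

```typescript
// Split text into word-like segments (whitespace and punctuation dropped).
function words(text: string): string[] {
  const segmenter = new Intl.Segmenter("ko", { granularity: "word" });
  return Array.from(segmenter.segment(text))
    .filter((part) => part.isWordLike)
    .map((part) => part.segment);
}

// Hypothetical stand-in for a prefix lookup: a document is a hit when
// any of its tokens starts with the search term.
function prefixHits(docs: string[], term: string): number {
  return docs.filter((doc) =>
    words(doc).some((token) => token.startsWith(term))
  ).length;
}

// Space-less compounds: city name immediately followed by "university".
const compounds = ["서울대학교", "부산대학교", "제주대학교"];
// A space-separated phrase with a distinct middle word.
const spaced = ["서울 시립 대학교"];

console.log(prefixHits(compounds, "서울"));   // city name is a prefix of one compound
console.log(prefixHits(compounds, "대학교")); // shared suffix is never a token prefix
console.log(prefixHits(spaced, "시립"));      // middle word is its own token
```

Under these assumptions the three calls report 1, 0 and 1 hits, which mirrors the two assertions described above: the shared suffix finds nothing in compounds, while a middle word is found once whitespace separates it.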

dayongkr and others added 2 commits February 16, 2026 14:18

feat(tokenizers): add experimental korean tokenizer

d45b5de

thatjuan merged commit d184b20 into oramasearch:main Jun 27, 2026
1 check passed

thatjuan merged 2 commits into oramasearch:main from dayongkr:add-korean-tokenizer
