Update C++ tokenizer to match rag_tokenizer.py #3357

Merged
yingfeng merged 1 commit into infiniflow:main from qinling0210:update_c++_tokenizer
May 6, 2026

Conversation

@qinling0210
Contributor

@qinling0210 qinling0210 commented May 6, 2026

What problem does this PR solve?

Update C++ tokenizer to match rag_tokenizer.py in #3356

Type of change

  • Bug Fix (non-breaking change which fixes an issue)

@qinling0210 qinling0210 self-assigned this May 6, 2026
@qinling0210 qinling0210 added the ci PR can be test label May 6, 2026
@qinling0210 qinling0210 marked this pull request as ready for review May 6, 2026 08:00
@coderabbitai

coderabbitai Bot commented May 6, 2026

📝 Walkthrough


This PR implements language-aware stemming support across the RAG tokenizer. A new CLI option enables users to specify the tokenization language, which maps to Snowball stemmer languages. Position-aware tokenization infrastructure tracks token spans through transformations, and lemmatization is conditionally applied based on the configured language.

Changes

Language-Aware Stemming & Position-Aware Tokenization

| Layer / File(s) | Summary |
| --- | --- |
| Language Mapping & Configuration<br>src/common/analyzer/rag_analyzer.cppm, src/common/analyzer/rag_analyzer_impl.cpp | SNOWBALL_LANGUAGE_MAP (49–70) maps language names to the Snowball Language enum. Header declares new SetLanguage() and InitStemmer() methods; adds private use_lemmatizer_ flag to control conditional stemming. |
| Public Language APIs<br>src/common/analyzer/rag_analyzer_impl.cpp | SetLanguage(const std::string &language) (700–719) maps the input language via SNOWBALL_LANGUAGE_MAP, initializes the stemmer, sets lemmatizer usage, and logs. InitStemmer(Language language) (687–696) initializes the stemmer and configures the lemmatizer. |
| Position-Aware Tokenization Infrastructure<br>src/common/analyzer/rag_analyzer_impl.cpp | New methods TokenizeWithPosition(), TokenizeInnerWithPosition(), EnglishNormalizeWithPosition(), FineGrainedTokenizeWithPosition() and helpers (MapToOriginalPosition(), CalculateTokensLength()) track token spans. Helper functions PCRE2GlobalReplaceWithPosition() and PCRE2GlobalReplace() enable regex substitution with position remapping. |
| Conditional Lemmatization Integration<br>src/common/analyzer/rag_analyzer_impl.cpp | Replaces direct lemmatization with conditional stemming based on the use_lemmatizer_ flag in English normalization (1391–1398), tokenization paths (1755–1766, 1877–1890), and position-aware flows (2210–2218). |
| Python CLI Wiring<br>python/infinity_sdk/infinity/rag_tokenizer.py | Adds -l/--language argument (573–574) to the parser; invokes tokenizer.set_language(args.language) after initialization (579–582). |
| Tests & Validation<br>src/unit_test/common/analyzer/rag_analyzer_ut.cpp | New test test_set_language_dutch (311–343) validates Dutch language tokenization by comparing C++ output against the Python tokenizer for "huizen" and verifying the stemmed form "huiz". |
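
As a rough, self-contained sketch of the flow the table describes (language name → Snowball language → lemmatizer on/off); the enum values and map entries below are illustrative assumptions, not the PR's actual definitions:

```cpp
// Toy model of the selection logic: normalize the requested language, look it
// up in a Snowball map, and enable the WordNet lemmatizer only for English.
#include <algorithm>
#include <cctype>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

enum Language { STEM_LANG_UNKNOWN, STEM_LANG_ENGLISH, STEM_LANG_DUTCH };  // illustrative values only

static const std::vector<std::pair<std::string, Language>> kSnowballMap = {
    {"english", STEM_LANG_ENGLISH},
    {"dutch", STEM_LANG_DUTCH},
};

int main() {
    std::string lang = "Dutch";
    std::transform(lang.begin(), lang.end(), lang.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    Language stem_lang = STEM_LANG_UNKNOWN;
    for (const auto &p : kSnowballMap)
        if (p.first == lang) { stem_lang = p.second; break; }
    // The PR enables the lemmatizer only for English; other Snowball languages
    // go straight to the stemmer.
    bool use_lemmatizer = (stem_lang == STEM_LANG_ENGLISH);
    std::cout << "snowball language id: " << stem_lang
              << ", lemmatizer: " << std::boolalpha << use_lemmatizer << "\n";
}
```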

Sequence Diagram

sequenceDiagram
    actor User
    participant CLI as Python CLI
    participant PySDK as RagTokenizer
    participant RAGAnalyzer as C++ RAGAnalyzer
    participant Stemmer as Snowball Stemmer
    participant Tokenizer as Tokenization<br/>Pipeline

    User->>CLI: rag_tokenizer.py -l dutch
    CLI->>PySDK: args.language = 'dutch'
    PySDK->>PySDK: tokenizer.set_language('dutch')
    PySDK->>RAGAnalyzer: SetLanguage('dutch')
    RAGAnalyzer->>RAGAnalyzer: Map 'dutch' via<br/>SNOWBALL_LANGUAGE_MAP
    RAGAnalyzer->>Stemmer: InitStemmer(Language.DUTCH)
    Stemmer->>Stemmer: Init(DUTCH)
    RAGAnalyzer->>RAGAnalyzer: set use_lemmatizer_<br/>for language
    PySDK->>Tokenizer: Tokenize(text)
    Tokenizer->>Tokenizer: Normalize & stem<br/>with use_lemmatizer_
    Tokenizer->>Tokenizer: TokenizeWithPosition()<br/>preserves spans
    Tokenizer-->>User: Stemmed tokens<br/>+ position mappings

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • infiniflow/infinity#3356: Both PRs implement language-aware stemming by adding language configuration APIs (RagTokenizer.set_language in Python and RAGAnalyzer.SetLanguage in C++) with corresponding normalization and stemming logic across the tokenization pipeline.

Poem

🐰 From Dutch to Snowball, our stems now bloom bright,
Position-aware paths guide each token's flight.
Lemmas bow gracefully to language's song,
While mappers ensure every span spans along.
Hopping through pipelines with precision so keen! 🌿

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main objective of the pull request: updating the C++ tokenizer to match the Python rag_tokenizer implementation. |
| Description check | ✅ Passed | The description addresses the problem being solved and correctly categorizes the change as a bug fix, but lacks implementation details, test descriptions, and specific context about what matching behavior entails. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
src/common/analyzer/rag_analyzer_impl.cpp (1)

1391-1399: ⚡ Quick win

Extract the lowercase→optional lemmatize→stem flow into one helper.

This exact branch now exists in four places. Any future parity fix will have to touch all four paths, which makes Tokenize() and TokenizeWithPosition() easy to drift apart again. A shared helper here would remove that risk.

Also applies to: 1758-1765, 1880-1887, 2210-2217
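
A sketch of the suggested helper might look like the following. The member names (use_lemmatizer_, wordnet_lemma_, stemmer_) come from this review comment; the exact Lemmatize/Stem signatures and the helper name are assumptions, not code from the PR:

```cpp
// Sketch only: centralize the lowercase -> optional lemmatize -> stem flow so
// Tokenize(), TokenizeWithPosition(), and the normalization paths share it.
std::string RAGAnalyzer::NormalizeAndStem(const std::string &lowercase_term) {
    // Only English goes through the WordNet lemmatizer; other Snowball
    // languages feed the lowercased token straight into the stemmer.
    const std::string to_stem =
        use_lemmatizer_ ? wordnet_lemma_->Lemmatize(lowercase_term) : lowercase_term;
    std::string stemmed;
    stemmer_->Stem(to_stem, stemmed);  // call shape assumed from the review prompts
    return stemmed;
}
```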

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/common/analyzer/rag_analyzer_impl.cpp` around lines 1391 - 1399, Extract
the repeated lowercase→optional lemmatize→stem sequence into a single private
helper method (e.g., NormalizeAndStem or LemmatizeThenStem) on the analyzer
class and use it from Tokenize(), TokenizeWithPosition(), and the other three
call sites; the helper should take the raw token (or already-lowercased string),
apply lowercase if needed, call wordnet_lemma_->Lemmatize when use_lemmatizer_
is true, then call stemmer_->Stem and return the final stem string, and replace
the duplicated blocks around wordnet_lemma_->Lemmatize and stemmer_->Stem with
calls to this helper at all four locations (the blocks at the shown diff and the
ranges 1758-1765, 1880-1887, 2210-2217).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/common/analyzer/rag_analyzer_impl.cpp`:
- Around line 51-69: SNOWBALL_LANGUAGE_MAP currently contains entries (e.g.,
"arabic") mapped to STEM_LANG_UNKNOWN which causes SetLanguage to silently keep
previous stemmer; remove any entries that map to STEM_LANG_UNKNOWN and change
SetLanguage to validate lookups against SNOWBALL_LANGUAGE_MAP (or a helper like
GetSnowballLanguage) and fail fast (return an error/throw/clear initialization)
when a requested language is not found or maps to STEM_LANG_UNKNOWN so
unsupported/typo'd languages are rejected rather than silently using the wrong
stemmer; update caller code that relies on SetLanguage to handle the failure
path (mirroring rag_tokenizer.py behavior) and ensure tests cover an unsupported
language case.

In `@src/unit_test/common/analyzer/rag_analyzer_ut.cpp`:
- Around line 318-329: The test currently treats any output from running
python_cmd via popen as valid; update the code around python_cmd/popen/pclose to
fail fast: assert that popen(...) returned a non-null FILE* (pipe) before
reading (e.g., ASSERT_NE/ASSERT_TRUE on pipe), collect stdout into python_result
as before, then capture pclose(...)'s return value and ASSERT_EQ that it
indicates successful execution (zero exit status) before using python_result for
comparisons; reference the variables/functions python_cmd, pipe (FILE* from
popen), python_result, and the pclose return value to locate and modify the
test.

---

Nitpick comments:
In `@src/common/analyzer/rag_analyzer_impl.cpp`:
- Around line 1391-1399: Extract the repeated lowercase→optional lemmatize→stem
sequence into a single private helper method (e.g., NormalizeAndStem or
LemmatizeThenStem) on the analyzer class and use it from Tokenize(),
TokenizeWithPosition(), and the other three call sites; the helper should take
the raw token (or already-lowercased string), apply lowercase if needed, call
wordnet_lemma_->Lemmatize when use_lemmatizer_ is true, then call stemmer_->Stem
and return the final stem string, and replace the duplicated blocks around
wordnet_lemma_->Lemmatize and stemmer_->Stem with calls to this helper at all
four locations (the blocks at the shown diff and the ranges 1758-1765,
1880-1887, 2210-2217).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e965f8e3-c8c7-4369-b83c-786ff885e9c7

📥 Commits

Reviewing files that changed from the base of the PR and between c2e21a0 and f3fdc5e.

📒 Files selected for processing (4)
  • python/infinity_sdk/infinity/rag_tokenizer.py
  • src/common/analyzer/rag_analyzer.cppm
  • src/common/analyzer/rag_analyzer_impl.cpp
  • src/unit_test/common/analyzer/rag_analyzer_ut.cpp


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/common/analyzer/rag_analyzer_impl.cpp (1)

2318-2326: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Position-frame mismatch: UTF8Length mixed with byte offsets.

start_pos and the surrounding fine_positions use byte offsets throughout this file (e.g., the sibling else branch at line 2330 uses t.size()), but here sub_end = sub_pos + UTF8Length(t) advances by character count instead of bytes. For any multi-byte UTF-8 token this produces increasingly wrong spans and breaks position alignment with the original input.

🐛 Proposed fix
                 if (need_append_stk) {
                     unsigned sub_pos = start_pos;
                     for (auto &t : stk) {
-                        unsigned sub_end = sub_pos + UTF8Length(t);
+                        unsigned sub_end = sub_pos + static_cast<unsigned>(t.size());
                         fine_tokens.push_back(t);
                         fine_positions.emplace_back(sub_pos, sub_end);
                         sub_pos = sub_end;
                     }
                 }
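For reference, a small self-contained example of why the two counts diverge for multi-byte tokens; the UTF8Length below is a stand-in written for this illustration, not the project's actual helper:

```cpp
// Character count vs. byte count for UTF-8 strings.
#include <iostream>
#include <string>

static unsigned UTF8Length(const std::string &s) {
    unsigned n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80) ++n;  // count only non-continuation bytes
    return n;
}

int main() {
    std::string ascii_token = "huizen";  // 6 characters, 6 bytes: counts agree
    std::string cjk_token = "位置";      // 2 characters, 6 bytes: counts diverge
    std::cout << UTF8Length(cjk_token) << " chars vs "
              << cjk_token.size() << " bytes\n";  // prints "2 chars vs 6 bytes"
}
```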
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/common/analyzer/rag_analyzer_impl.cpp` around lines 2318 - 2326, The code
advances token end positions using UTF8Length(t) which gives character count
while the rest of the module (and fine_positions using start_pos) expects byte
offsets; this causes wrong spans for multi-byte UTF‑8 tokens. In the
need_append_stk block (referencing need_append_stk, start_pos, stk, fine_tokens,
fine_positions and function UTF8Length), replace the length computation with the
token's byte size (use t.size()) so sub_end advances by bytes and maintains
alignment with the rest of the file; only use UTF8Length if you also convert all
position bookkeeping to character offsets.
🧹 Nitpick comments (1)
src/common/analyzer/rag_analyzer_impl.cpp (1)

1890-1896: 💤 Low value

Missing bound check on end_pos before indexing final_pos_mapping.

This branch checks start_pos < final_pos_mapping.size() but then unconditionally reads final_pos_mapping[end_pos]. The analogous block at lines 1913–1914 correctly checks both endpoints. While the math currently keeps end_pos <= strline.size() (in-bounds), the asymmetry is fragile and a potential OOB read if any upstream calculation drifts.

🛡️ Proposed fix
-                        if (start_pos < final_pos_mapping.size()) {
-                            positions.emplace_back(final_pos_mapping[start_pos], final_pos_mapping[end_pos]);
+                        if (start_pos < final_pos_mapping.size() && end_pos < final_pos_mapping.size()) {
+                            positions.emplace_back(final_pos_mapping[start_pos], final_pos_mapping[end_pos]);
                         } else {
                             positions.emplace_back(static_cast<unsigned>(line.size()), static_cast<unsigned>(line.size()));
                         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/common/analyzer/rag_analyzer_impl.cpp` around lines 1890 - 1896, The code
maps start_pos and end_pos through final_pos_mapping but only checks start_pos
for bounds; update the branch around the positions.emplace_back call to verify
both start_pos < final_pos_mapping.size() AND end_pos < final_pos_mapping.size()
before indexing final_pos_mapping[end_pos], and if either is out-of-bounds use
the existing fallback (static_cast<unsigned>(line.size()),
static_cast<unsigned>(line.size())); modify the block that references
start_pos/end_pos/final_pos_mapping/positions/line to mirror the guarded check
used elsewhere (e.g., the analogous check near lines 1913–1914).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/common/analyzer/rag_analyzer_impl.cpp`:
- Around line 687-719: SetLanguage/InitStemmer call stemmer_->Init repeatedly
which leaks the previous StemFunc/SN_env; fix by ensuring the previous state is
deinitialized before re-initializing: either call stemmer_->DeInit() at the
start of RAGAnalyzer::SetLanguage (and/or RAGAnalyzer::InitStemmer) before
calling stemmer_->Init(stem_lang) or modify Stemmer::Init to invoke DeInit() at
its top so Init is safe to call multiple times; update references in
RAGAnalyzer::SetLanguage, RAGAnalyzer::InitStemmer and Stemmer::Init/DeInit
accordingly.

In `@src/common/analyzer/stemmer/stem_UTF_8_arabic.cpp`:
- Around line 38-83: The current hand-written Arabic stemmer treats UTF-8 as
single bytes and mis-handles diacritics/prefixes/suffixes and memory boundaries;
replace this file with the canonical Snowball-generated stem_UTF_8_arabic.c
(from libstemmer) or reimplement using a proper UTF-8-aware Arabic stemmer.
Specifically, remove the byte-oriented code in remove_diacritics,
normalize_hamza_alef, remove_prefix_al, remove_plural_suffix and
arabic_UTF_8_stem and instead use the Snowball-generated logic that handles
multi-byte UTF-8 codepoints and correct slicing; also fix environment handling
in arabic_UTF_8_create_env/close_env so z->lb is set correctly and z->p is
allocated/freed per the Snowball skeleton. Ensure the new implementation
provides the standard among/find_among usage and matches rag_tokenizer.py’s
NLTK/libstemmer behavior.

---

Outside diff comments:
In `@src/common/analyzer/rag_analyzer_impl.cpp`:
- Around line 2318-2326: The code advances token end positions using
UTF8Length(t) which gives character count while the rest of the module (and
fine_positions using start_pos) expects byte offsets; this causes wrong spans
for multi-byte UTF‑8 tokens. In the need_append_stk block (referencing
need_append_stk, start_pos, stk, fine_tokens, fine_positions and function
UTF8Length), replace the length computation with the token's byte size (use
t.size()) so sub_end advances by bytes and maintains alignment with the rest of
the file; only use UTF8Length if you also convert all position bookkeeping to
character offsets.

---

Nitpick comments:
In `@src/common/analyzer/rag_analyzer_impl.cpp`:
- Around line 1890-1896: The code maps start_pos and end_pos through
final_pos_mapping but only checks start_pos for bounds; update the branch around
the positions.emplace_back call to verify both start_pos <
final_pos_mapping.size() AND end_pos < final_pos_mapping.size() before indexing
final_pos_mapping[end_pos], and if either is out-of-bounds use the existing
fallback (static_cast<unsigned>(line.size()),
static_cast<unsigned>(line.size())); modify the block that references
start_pos/end_pos/final_pos_mapping/positions/line to mirror the guarded check
used elsewhere (e.g., the analogous check near lines 1913–1914).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 34bf1a51-4be6-4421-a7df-94b142066287

📥 Commits

Reviewing files that changed from the base of the PR and between f3fdc5e and c6303a9.

📒 Files selected for processing (6)
  • src/common/analyzer/rag_analyzer_impl.cpp
  • src/common/analyzer/stemmer/stem_UTF_8_arabic.cpp
  • src/common/analyzer/stemmer/stem_UTF_8_arabic.h
  • src/common/analyzer/stemmer/stemmer.cppm
  • src/common/analyzer/stemmer/stemmer_impl.cpp
  • src/unit_test/common/analyzer/rag_analyzer_ut.cpp

Comment on lines +687 to +719
void RAGAnalyzer::InitStemmer(Language language) {
    stemmer_->Init(language);
    use_lemmatizer_ = (language == STEM_LANG_ENGLISH);
}

void RAGAnalyzer::SetLanguage(const std::string &language) {
    std::string lang_key = language;
    // Convert to lowercase
    std::transform(lang_key.begin(), lang_key.end(), lang_key.begin(), [](unsigned char c) { return std::tolower(c); });
    // Trim whitespace
    lang_key.erase(lang_key.find_last_not_of(" \t") + 1);
    lang_key.erase(0, lang_key.find_first_not_of(" \t"));

    Language stem_lang = STEM_LANG_UNKNOWN;
    std::string snowball_lang;
    for (const auto &pair : SNOWBALL_LANGUAGE_MAP) {
        if (pair.first == lang_key) {
            stem_lang = pair.second;
            snowball_lang = pair.first;
            break;
        }
    }

    if (stem_lang != STEM_LANG_UNKNOWN) {
        stemmer_->Init(stem_lang);
        use_lemmatizer_ = (stem_lang == STEM_LANG_ENGLISH);
        LOG_DEBUG(fmt::format("Tokenizer language set to '{}' (Snowball: {}, lemmatizer: {})", language, snowball_lang, use_lemmatizer_));
    } else {
        // Unsupported language (Chinese, Japanese, Korean, etc.) –
        // keep defaults. CJK text uses dictionary segmentation, not stemming.
        LOG_DEBUG(fmt::format("Language '{}' has no Snowball stemmer; keeping defaults", language));
    }
}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Memory leak: re-Init without DeInit leaks the prior StemFunc and SN_env.

Stemmer::Init unconditionally does stem_function_ = static_cast<void *>(new StemFunc); ... ->env = ...->create(); (see stemmer_impl.cpp lines 87–113). Because the constructor already calls InitStemmer(STEM_LANG_ENGLISH), every subsequent SetLanguage(...) (e.g., the new Dutch/Arabic tests) leaks the previous StemFunc heap allocation and its associated SN_env. Call DeInit first (or fix Stemmer::Init to deinit before re-init).

🛡️ Proposed fix
 void RAGAnalyzer::InitStemmer(Language language) {
+    stemmer_->DeInit();
     stemmer_->Init(language);
     use_lemmatizer_ = (language == STEM_LANG_ENGLISH);
 }
@@
     if (stem_lang != STEM_LANG_UNKNOWN) {
+        stemmer_->DeInit();
         stemmer_->Init(stem_lang);
         use_lemmatizer_ = (stem_lang == STEM_LANG_ENGLISH);

Alternatively, fix the root cause inside Stemmer::Init by invoking DeInit() at the start before allocating a new StemFunc.
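
That alternative might look roughly like the following sketch. Everything beyond stem_function_, DeInit(), and the allocation pattern quoted above is an assumption about stemmer_impl.cpp, not its actual code:

```cpp
// Sketch: make Init() idempotent by releasing prior state first, so callers
// such as SetLanguage() can switch languages without leaking.
void Stemmer::Init(Language language) {
    DeInit();  // release the previous StemFunc / SN_env before re-initializing
    stem_function_ = static_cast<void *>(new StemFunc);
    // ... select the Snowball create/stem/close functions for `language` and
    //     create the SN_env, as in the existing implementation ...
}
```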

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Current:

void RAGAnalyzer::InitStemmer(Language language) {
    stemmer_->Init(language);
    use_lemmatizer_ = (language == STEM_LANG_ENGLISH);
}

void RAGAnalyzer::SetLanguage(const std::string &language) {
    std::string lang_key = language;
    // Convert to lowercase
    std::transform(lang_key.begin(), lang_key.end(), lang_key.begin(), [](unsigned char c) { return std::tolower(c); });
    // Trim whitespace
    lang_key.erase(lang_key.find_last_not_of(" \t") + 1);
    lang_key.erase(0, lang_key.find_first_not_of(" \t"));

    Language stem_lang = STEM_LANG_UNKNOWN;
    std::string snowball_lang;
    for (const auto &pair : SNOWBALL_LANGUAGE_MAP) {
        if (pair.first == lang_key) {
            stem_lang = pair.second;
            snowball_lang = pair.first;
            break;
        }
    }

    if (stem_lang != STEM_LANG_UNKNOWN) {
        stemmer_->Init(stem_lang);
        use_lemmatizer_ = (stem_lang == STEM_LANG_ENGLISH);
        LOG_DEBUG(fmt::format("Tokenizer language set to '{}' (Snowball: {}, lemmatizer: {})", language, snowball_lang, use_lemmatizer_));
    } else {
        // Unsupported language (Chinese, Japanese, Korean, etc.) –
        // keep defaults. CJK text uses dictionary segmentation, not stemming.
        LOG_DEBUG(fmt::format("Language '{}' has no Snowball stemmer; keeping defaults", language));
    }
}

Suggested:

void RAGAnalyzer::InitStemmer(Language language) {
    stemmer_->DeInit();
    stemmer_->Init(language);
    use_lemmatizer_ = (language == STEM_LANG_ENGLISH);
}

void RAGAnalyzer::SetLanguage(const std::string &language) {
    std::string lang_key = language;
    // Convert to lowercase
    std::transform(lang_key.begin(), lang_key.end(), lang_key.begin(), [](unsigned char c) { return std::tolower(c); });
    // Trim whitespace
    lang_key.erase(lang_key.find_last_not_of(" \t") + 1);
    lang_key.erase(0, lang_key.find_first_not_of(" \t"));

    Language stem_lang = STEM_LANG_UNKNOWN;
    std::string snowball_lang;
    for (const auto &pair : SNOWBALL_LANGUAGE_MAP) {
        if (pair.first == lang_key) {
            stem_lang = pair.second;
            snowball_lang = pair.first;
            break;
        }
    }

    if (stem_lang != STEM_LANG_UNKNOWN) {
        stemmer_->DeInit();
        stemmer_->Init(stem_lang);
        use_lemmatizer_ = (stem_lang == STEM_LANG_ENGLISH);
        LOG_DEBUG(fmt::format("Tokenizer language set to '{}' (Snowball: {}, lemmatizer: {})", language, snowball_lang, use_lemmatizer_));
    } else {
        // Unsupported language (Chinese, Japanese, Korean, etc.) –
        // keep defaults. CJK text uses dictionary segmentation, not stemming.
        LOG_DEBUG(fmt::format("Language '{}' has no Snowball stemmer; keeping defaults", language));
    }
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/common/analyzer/rag_analyzer_impl.cpp` around lines 687 - 719,
SetLanguage/InitStemmer call stemmer_->Init repeatedly which leaks the previous
StemFunc/SN_env; fix by ensuring the previous state is deinitialized before
re-initializing: either call stemmer_->DeInit() at the start of
RAGAnalyzer::SetLanguage (and/or RAGAnalyzer::InitStemmer) before calling
stemmer_->Init(stem_lang) or modify Stemmer::Init to invoke DeInit() at its top
so Init is safe to call multiple times; update references in
RAGAnalyzer::SetLanguage, RAGAnalyzer::InitStemmer and Stemmer::Init/DeInit
accordingly.

Comment on lines +38 to +83
static void remove_diacritics(struct SN_env *z) {
    int i, j;
    for (i = 0, j = 0; i < z->c; i++) {
        unsigned char ch = z->p[i];
        if (!((ch >= 0x4B && ch <= 0x5F) || ch == 0x70)) {
            z->p[j++] = ch;
        }
    }
    z->c = j;
}

static void normalize_hamza_alef(struct SN_env *z) {
    for (int i = 0; i < z->c; i++) {
        unsigned char ch = z->p[i];
        if (ch == 0x22 || ch == 0x23 || ch == 0x24 || ch == 0x25) {
            z->p[i] = 0x27;
        }
    }
}

static void remove_prefix_al(struct SN_env *z) {
    if (z->c >= 2 && z->p[0] == 0x27 && z->p[1] == 0x4C) {
        slice_from(z, 2);
        slice_del(z);
    }
}

static void remove_plural_suffix(struct SN_env *z) {
    if (z->c >= 1) {
        unsigned char last = z->p[z->c - 1];
        if (last == 0x29 || last == 0x49) {
            z->c--;
        }
    }
}

int arabic_UTF_8_stem(struct SN_env *z) {
    if (z->c < 2)
        return 0;

    remove_diacritics(z);
    normalize_hamza_alef(z);
    remove_prefix_al(z);
    remove_plural_suffix(z);

    return 1;


⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

🧩 Analysis chain

🌐 Web query:

Snowball Arabic stemmer source code site

💡 Result:

The Snowball Arabic stemmer source code is in the official Snowball repository as the Snowball Framework algorithm file: https://github.com/snowballstem/snowball/blob/master/algorithms/arabic.sbl [1]. For the human-readable Snowball spec of the Arabic stemming algorithm, Snowball's official site provides: https://snowballstem.org/algorithms/arabic/stemmer.html [2]. If you need a ready-to-use generated stemmer (rather than the .sbl algorithm definition), Snowball's main repo explains that the Snowball compiler translates the .sbl program into source code in many target languages (C, Java, Python, etc.) [3].



🏁 Script executed:

#!/bin/bash
# Confirm none of the surrounding stemmers in this repo are hand-written;
# they should all be Snowball-generated and follow the same skeleton.
fd -e cpp -e c 'stem_UTF_8_' --exec sh -c 'echo "=== $1 ==="; head -n 30 "$1"' _ {}

Repository: infiniflow/infinity

Length of output: 15018


🏁 Script executed:

cat -n src/common/analyzer/stemmer/stem_UTF_8_arabic.cpp

Repository: infiniflow/infinity

Length of output: 2565


🏁 Script executed:

python3 << 'EOF'
# Verify UTF-8 encoding of Arabic characters mentioned in the review

characters = {
    "Diacritic U+064B": "064B",  # FATHATAN
    "Diacritic U+065F": "065F",  # Last diacritic in range
    "Hamza on Alef U+0622": "0622",
    "Hamza on Alef with Madda U+0623": "0623",
    "Hamza on Alef with Hamza below U+0624": "0624",
    "Hamza on Alef with Hamza above U+0625": "0625",
    "Alef U+0627": "0627",
    "Arabic Waw (و) U+0648": "0648",
    "Arabic Yeh (ي) U+064A": "064A",
    "Arabic Teh Marbuta (ة) U+0629": "0629",
}

print("UTF-8 Encoding Verification:")
print("=" * 60)

for name, codepoint in characters.items():
    # Convert hex codepoint to integer
    code = int(codepoint, 16)
    # Encode as UTF-8
    char = chr(code)
    utf8_bytes = char.encode('utf-8')
    hex_bytes = ' '.join(f'0x{b:02X}' for b in utf8_bytes)
    print(f"{name}: U+{codepoint}")
    print(f"  UTF-8 bytes: {hex_bytes}")
    print()

print("ASCII values for byte constants in review:")
print("=" * 60)
ascii_bytes = {
    "0x4B": 75,   # 'K'
    "0x5F": 95,   # '_'
    "0x70": 112,  # 'p'
    "0x22": 34,   # '"'
    "0x23": 35,   # '#'
    "0x24": 36,   # '$'
    "0x25": 37,   # '%'
    "0x27": 39,   # "'"
    "0x4C": 76,   # 'L'
    "0x29": 41,   # ')'
    "0x49": 73,   # 'I'
}

for byte_hex, ascii_val in ascii_bytes.items():
    print(f"{byte_hex} = {ascii_val} ('{chr(ascii_val) if 32 <= ascii_val < 127 else '?'}')")
EOF

Repository: infiniflow/infinity

Length of output: 1023


🏁 Script executed:

# Search for any Snowball-generated Arabic stemmer or documentation
find src/common/analyzer/stemmer -type f -name "*arabic*" | head -20

Repository: infiniflow/infinity

Length of output: 161


🏁 Script executed:

# Check the header file for the struct definition
cat -n src/common/analyzer/stemmer/stem_UTF_8_arabic.h 2>/dev/null || echo "Header file not found"

Repository: infiniflow/infinity

Length of output: 442


🏁 Script executed:

# Check the api.h for SN_env definition and understanding
cat -n src/common/analyzer/stemmer/api.h 2>/dev/null | head -50

Repository: infiniflow/infinity

Length of output: 1035


Critical: the Arabic stemmer treats UTF-8 bytes as ASCII and will not produce correct Arabic stems.

Every constant in this file is an ASCII byte, but Arabic letters and diacritics are multi-byte UTF-8 sequences:

  • remove_diacritics (lines 38–47): checks ch >= 0x4B && ch <= 0x5F and ch == 0x70 (ASCII 'K'..'_', 'p'). Arabic diacritics U+064B..U+065F encode as 0xD9 0x8B..0x9F. The current code never matches a real diacritic and corrupts UTF-8 by deleting arbitrary continuation bytes.
  • normalize_hamza_alef (lines 49–56): checks 0x22..0x25 (ASCII `"#$%`) and writes 0x27 (ASCII `'`). Arabic characters U+0622..U+0625 encode as 0xD8 0xA2..0xA5 and Alef U+0627 as 0xD8 0xA7; rewriting single bytes produces invalid UTF-8.
  • remove_prefix_al (lines 58–63): checks 0x27, 0x4C (ASCII `'` and `L`), but Arabic "ال" is 0xD8 0xA7 0xD9 0x84. Additionally the slicing logic is inverted: slice_from(z, 2) sets bra=2, ket=z->c, then slice_del removes [bra, ket), keeping only the first 2 bytes and deleting the rest — the opposite of removing a prefix.
  • remove_plural_suffix (lines 65–72): checks ASCII 0x29 and 0x49 against the last byte; Arabic plural suffixes (ون/ين/ات) are multi-byte and do not end in those codes.

Additionally:

  • arabic_UTF_8_create_env sets z->lb = 256 (used as a stemming boundary, not a buffer size), which will misbehave during processing.
  • z->p is not allocated in create_env but is never freed in close_env, leaking the buffer.
  • Unlike other Snowball-generated stemmers in this directory, this is hand-written and lacks the standard Snowball skeleton with among, find_among, and proper region marking.

This implementation cannot match rag_tokenizer.py (which uses NLTK's Snowball-generated Arabic stemmer). Either adopt the Snowball-generated stem_UTF_8_arabic.c from the canonical libstemmer distribution or replace with a tested Arabic stemmer implementation.
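
For illustration only (not a proposed fix), here is the byte-level view of the "ال" prefix described above; the sample word and the check itself are hypothetical:

```cpp
// Demonstrates why single-byte ASCII comparisons can never match Arabic text:
// the prefix "ال" occupies four UTF-8 bytes (0xD8 0xA7 0xD9 0x84), two codepoints.
#include <cstdio>
#include <string>

int main() {
    const std::string word = "الكتاب";  // hypothetical sample: "al-" prefix + "kitab"
    bool has_al_prefix =
        word.size() >= 4 &&
        (unsigned char)word[0] == 0xD8 && (unsigned char)word[1] == 0xA7 &&
        (unsigned char)word[2] == 0xD9 && (unsigned char)word[3] == 0x84;
    std::printf("starts with \"al-\": %s\n", has_al_prefix ? "yes" : "no");  // prints "yes"
}
```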

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/common/analyzer/stemmer/stem_UTF_8_arabic.cpp` around lines 38 - 83, The
current hand-written Arabic stemmer treats UTF-8 as single bytes and mis-handles
diacritics/prefixes/suffixes and memory boundaries; replace this file with the
canonical Snowball-generated stem_UTF_8_arabic.c (from libstemmer) or
reimplement using a proper UTF-8-aware Arabic stemmer. Specifically, remove the
byte-oriented code in remove_diacritics, normalize_hamza_alef, remove_prefix_al,
remove_plural_suffix and arabic_UTF_8_stem and instead use the
Snowball-generated logic that handles multi-byte UTF-8 codepoints and correct
slicing; also fix environment handling in arabic_UTF_8_create_env/close_env so
z->lb is set correctly and z->p is allocated/freed per the Snowball skeleton.
Ensure the new implementation provides the standard among/find_among usage and
matches rag_tokenizer.py’s NLTK/libstemmer behavior.

@qinling0210 qinling0210 marked this pull request as draft May 6, 2026 09:59
@qinling0210 qinling0210 force-pushed the update_c++_tokenizer branch from c6303a9 to e878669 on May 6, 2026 11:17
@qinling0210 qinling0210 marked this pull request as ready for review May 6, 2026 11:26

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
python/infinity_sdk/infinity/rag_tokenizer.py (2)

158-188: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Silent fallback hides typos and unsupported -l values.

When users pass an unrecognized language to -l/--language (e.g. -l englsih or -l "Chinese"), set_language only emits a logging.debug(...) message and silently keeps the default English stemmer/lemmatizer. With the default logging config, that message is never visible, so the user has no indication their selection was ignored, and downstream the C++ unit test (test_set_language_dutch) compares against this output assuming the language took effect.

Consider either constraining -l to known values via argparse choices=, or surfacing a WARNING so unsupported inputs are observable. For example:

📝 Proposed fix
-            logging.debug(
-                "Language '%s' has no Snowball stemmer; keeping defaults",
-                language,
-            )
+            logging.warning(
+                "Language '%s' has no Snowball stemmer; keeping English defaults",
+                language,
+            )

And/or:

-    parser.add_argument('-l', '--language', help='Language for stemming (e.g., english, dutch)')
+    parser.add_argument(
+        '-l', '--language',
+        choices=sorted(_SNOWBALL_LANGUAGE_MAP.keys()),
+        help='Language for stemming (e.g., english, dutch)',
+    )

Also applies to: 573-573

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/infinity_sdk/infinity/rag_tokenizer.py` around lines 158 - 188, The
set_language method silently falls back to defaults for unknown languages;
change the fallback logging to a visible warning and include the invalid value
and accepted languages for clarity: in set_language (look for
_SNOWBALL_LANGUAGE_MAP, self.stemmer, self.lemmatizer, self._use_lemmatizer)
replace the logging.debug(...) in the unsupported-language branch with
logging.warning(...) and format the message to show the user-provided language
and list of supported keys (e.g. sorted(_SNOWBALL_LANGUAGE_MAP.keys())); also
make the same change at the other occurrence referenced (same
unsupported-language debug at the other location).

17-26: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

PEP 723 requires-python allows Python 3.10, but the project requires 3.11+.

The metadata block declares requires-python = ">=3.10,<3.15", which permits 3.10 even though the SDK targets 3.11+. Tighten the lower bound to >=3.11 so uv run resolves a compatible interpreter.

📝 Proposed fix
 # /// script
-# requires-python = ">=3.10,<3.15"
+# requires-python = ">=3.11,<3.15"
 # dependencies = [

As per coding guidelines: "Use Python 3.11+ for SDK and Python-based tests".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/infinity_sdk/infinity/rag_tokenizer.py` around lines 17 - 26, Update
the PEP 723 metadata block to tighten the Python requirement: change the
requires-python value in the metadata (the "requires-python" key inside the ///
script PEP 723 block) from ">=3.10,<3.15" to ">=3.11,<3.15" so the SDK and uv
run resolve Python 3.11+ as required by the project.
♻️ Duplicate comments (3)
src/common/analyzer/rag_analyzer_impl.cpp (2)

687-719: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Re-Initing the stemmer without DeInit still leaks StemFunc/SN_env.

The constructor calls InitStemmer(STEM_LANG_ENGLISH), and any subsequent SetLanguage(...) (e.g. the new Dutch test, or repeated calls in long-lived processes) calls stemmer_->Init(stem_lang) again. Per stemmer_impl.cpp, Stemmer::Init unconditionally allocates a fresh StemFunc and Snowball SN_env, so the previous allocation leaks each time. Either call stemmer_->DeInit() at the top of both InitStemmer and SetLanguage, or guard the leak inside Stemmer::Init. Same finding as the prior review.

🛡️ Proposed fix
 void RAGAnalyzer::InitStemmer(Language language) {
+    stemmer_->DeInit();
     stemmer_->Init(language);
     use_lemmatizer_ = (language == STEM_LANG_ENGLISH);
 }
@@
     if (stem_lang != STEM_LANG_UNKNOWN) {
+        stemmer_->DeInit();
         stemmer_->Init(stem_lang);
         use_lemmatizer_ = (stem_lang == STEM_LANG_ENGLISH);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/common/analyzer/rag_analyzer_impl.cpp` around lines 687 - 719, The
stemmer is being re-initialized without freeing previous resources, causing
StemFunc/SN_env leaks; before any call to stemmer_->Init(...) in
RAGAnalyzer::InitStemmer and RAGAnalyzer::SetLanguage, call stemmer_->DeInit()
(or check/guard inside Stemmer::Init) to release prior allocations, then call
Init and set use_lemmatizer_ as before so repeated SetLanguage or InitStemmer
calls do not leak.

49-69: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

"arabic" still silently maps to STEM_LANG_UNKNOWN and diverges from rag_tokenizer.py.

SNOWBALL_LANGUAGE_MAP lists {"arabic", STEM_LANG_UNKNOWN}, so SetLanguage("arabic") matches the entry, sets stem_lang = STEM_LANG_UNKNOWN, and falls into the else branch — leaving the stemmer at its previous value (English by default) and only logging at debug level. Meanwhile, rag_tokenizer.py maps "arabic" to the Arabic Snowball stemmer, so the new test_tokenize_consistency_with_python will silently diverge on any Arabic input, and the PR commit log indicates Arabic support was an explicit goal.

Pick one of:

  • Add a real Arabic stemmer language (extend Language enum and Snowball wiring) and map to it.
  • Drop the "arabic" row from the map (and any others that map to STEM_LANG_UNKNOWN) and have SetLanguage return/log a WARNING-level error when an explicitly requested language is unsupported instead of silently keeping the previous stemmer.

This was raised previously and the divergence remains.

Also applies to: 692-719
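
The second option (rejecting an explicitly requested but unsupported language rather than silently keeping the previous stemmer) could look roughly like this; the bool return, the NormalizeLanguageKey helper, and the warning macro are assumptions for the sketch, not the PR's code:

```cpp
// Sketch of a fail-fast SetLanguage. Assumes a hypothetical
// NormalizeLanguageKey() (lowercase + trim) and a warning-level log macro.
bool RAGAnalyzer::SetLanguage(const std::string &language) {
    const std::string lang_key = NormalizeLanguageKey(language);
    for (const auto &pair : SNOWBALL_LANGUAGE_MAP) {
        if (pair.first == lang_key && pair.second != STEM_LANG_UNKNOWN) {
            stemmer_->DeInit();  // see the re-init leak finding at lines 687-719
            stemmer_->Init(pair.second);
            use_lemmatizer_ = (pair.second == STEM_LANG_ENGLISH);
            return true;
        }
    }
    // Warning-level so typos and unsupported languages are visible to callers.
    LOG_WARN(fmt::format("Unsupported tokenizer language '{}'; keeping previous stemmer", language));
    return false;
}
```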

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/common/analyzer/rag_analyzer_impl.cpp` around lines 49 - 69, The
SNOWBALL_LANGUAGE_MAP currently contains {"arabic", STEM_LANG_UNKNOWN} which
causes SetLanguage to silently keep the previous stemmer; remove any entries
that map to STEM_LANG_UNKNOWN (including "arabic") from SNOWBALL_LANGUAGE_MAP
and update SetLanguage (the function that looks up stem_lang from
SNOWBALL_LANGUAGE_MAP) to detect a missing/unsupported mapping and log/return a
WARNING-level message (instead of silently leaving the previous stemmer) when an
explicitly requested language is not supported; ensure the change references
SNOWBALL_LANGUAGE_MAP and SetLanguage so the
test_tokenize_consistency_with_python will surface unsupported-language
mismatches.
src/unit_test/common/analyzer/rag_analyzer_ut.cpp (1)

321-329: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Still missing fail-fast handling of popen/pclose.

popen failure (nullptr) and a non-zero pclose exit status are still not asserted, so a Python helper that fails to start or exits with an error will produce an empty python_result and the test will instead fail on EXPECT_EQ with a confusing message rather than reporting the actual cause. Same suggestion as before applies.
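
A sketch of that fail-fast pattern inside the test, reusing the test's python_cmd/python_result names; buffer handling and error messages here are illustrative:

```cpp
// Illustrative fragment for the test body; needs <cstdio>, <string>, and
// <sys/wait.h> for WIFEXITED/WEXITSTATUS.
FILE *pipe = popen(python_cmd.c_str(), "r");
ASSERT_NE(pipe, nullptr) << "popen failed for: " << python_cmd;
std::string python_result;
char buffer[256];
while (fgets(buffer, sizeof(buffer), pipe) != nullptr) {
    python_result += buffer;
}
int status = pclose(pipe);
ASSERT_TRUE(WIFEXITED(status) && WEXITSTATUS(status) == 0)
    << "Python helper failed (" << python_cmd << "), output: " << python_result;
```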

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/unit_test/common/analyzer/rag_analyzer_ut.cpp` around lines 321 - 329,
Check and fail-fast on popen/pclose: after calling popen(python_cmd.c_str(),
"r") assert that pipe != nullptr (e.g., EXPECT_NE(pipe, nullptr) or add a FAIL()
with the python_cmd) so test fails immediately if Python can't start; after
pclose(pipe) capture and assert the return status is zero (e.g.,
EXPECT_EQ(pclose_ret, 0) or use WEXITSTATUS/WIFEXITED to report the actual exit
code) and include python_result and python_cmd in the failure message to surface
why the helper failed instead of letting an empty python_result cause a later
confusing EXPECT_EQ.
🧹 Nitpick comments (5)
src/common/analyzer/rag_analyzer.cppm (1)

27-27: 💤 Low value

Move import :logger; to the implementation unit if it's only used there.

logger appears to be needed only inside rag_analyzer_impl.cpp (for LOG_DEBUG in SetLanguage). Importing it in the public module interface unnecessarily widens the dependency surface for every consumer of :rag_analyzer. Consider moving the import into :rag_analyzer.impl only.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/common/analyzer/rag_analyzer.cppm` at line 27, The public module
interface (rag_analyzer.cppm) currently imports :logger unnecessarily; remove
the line "import :logger;" from the module interface and instead add that import
to the implementation unit (rag_analyzer_impl.cpp / :rag_analyzer.impl) where
LOG_DEBUG is used (e.g., in SetLanguage). Ensure the implementation module has
the import :logger; and rebuild to confirm no other interface symbols depend on
logger.
src/common/analyzer/rag_analyzer_impl.cpp (2)

692-708: 💤 Low value

Linear scan over an array; trim only handles ASCII space/tab.

Two small refinements on SetLanguage:

  1. The lookup walks SNOWBALL_LANGUAGE_MAP linearly. For 16 entries it's fine, but a static const std::unordered_map<std::string_view, Language> (or std::flat_map) reads more cleanly and is O(1).
  2. lang_key.erase(lang_key.find_last_not_of(" \t") + 1); only trims spaces and tabs. If callers ever pass values from files/HTTP headers, \r, \n, \f, \v will leak through and cause a silent miss. Prefer " \t\n\r\f\v".
📝 Proposed fix (trim only)
-    lang_key.erase(lang_key.find_last_not_of(" \t") + 1);
-    lang_key.erase(0, lang_key.find_first_not_of(" \t"));
+    static constexpr std::string_view ws = " \t\n\r\f\v";
+    lang_key.erase(lang_key.find_last_not_of(ws) + 1);
+    lang_key.erase(0, lang_key.find_first_not_of(ws));
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/common/analyzer/rag_analyzer_impl.cpp` around lines 692 - 708, The
SetLanguage function currently linear-scans SNOWBALL_LANGUAGE_MAP and trims only
" \t", causing potential misses and inefficiency; replace the array scan by
using a static const std::unordered_map<std::string_view, Language> (or
std::flat_map) keyed by the normalized lang_key and look up in O(1) (update uses
of SNOWBALL_LANGUAGE_MAP accordingly), and broaden trimming of lang_key to
remove all common whitespace characters by using " \t\n\r\f\v" when calling
find_last_not_of/find_first_not_of so CR/LF and other whitespace won't cause
silent misses.

1391-1398: 💤 Low value

Conditional lemmatization is correctly threaded across all four normalization sites.

Each path now skips the WordNet lemmatizer when use_lemmatizer_ is false and feeds the lowercased token straight into the Snowball stemmer, matching the Python implementation's behavior for non-English Snowball languages. Logic is consistent across EnglishNormalize, Tokenize, TokenizeWithPosition, and EnglishNormalizeWithPosition.

One minor consolidation opportunity: this 4-line pattern is duplicated in four places — extracting a small NormalizeOne(std::string_view lowercase_term, std::string &out) helper would tighten future maintenance, but it's optional.

Also applies to: 1758-1765, 1880-1887, 2210-2217

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/common/analyzer/rag_analyzer_impl.cpp` around lines 1391 - 1398, The
four-place duplication of the pattern that conditionally lemmatizes then stems
(checking use_lemmatizer_, calling wordnet_lemma_->Lemmatize, then
stemmer_->Stem) should be consolidated into a small helper
NormalizeOne(std::string_view lowercase_term, std::string &out): implement
NormalizeOne to set term_to_stem = use_lemmatizer_ ?
wordnet_lemma_->Lemmatize(lowercase_term) : std::string(lowercase_term) and then
call stemmer_->Stem(term_to_stem, out); replace the duplicated blocks in
EnglishNormalize, Tokenize, TokenizeWithPosition, and
EnglishNormalizeWithPosition with calls to NormalizeOne to reduce duplication
and ease maintenance.
src/unit_test/common/analyzer/rag_analyzer_ut.cpp (2)

312-340: ⚡ Quick win

Test scope is narrow; consider parameterizing across languages.

The test only exercises Dutch on a single word. Given the new SetLanguage API supports many languages and "arabic" in particular is known to silently fall back to defaults (see rag_analyzer_impl.cpp), adding a small table of (language, input, expected_stem_substr) cases would catch regressions and language/Python parity issues much earlier. This is purely an enhancement; not blocking.
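
A table-driven variant could look roughly like this; only the Dutch row is grounded in the existing test, and the Tokenize/SetLanguage call shapes are assumed from it:

```cpp
// Sketch of a parameterized body for the existing test fixture.
struct LanguageCase {
    std::string language;
    std::string input;
    std::string expected_stem_substr;
};

const std::vector<LanguageCase> cases = {
    {"dutch", "huizen", "huiz"},
    // {"arabic", ..., ...},  // add once Arabic stemming is actually wired up
};

for (const auto &c : cases) {
    analyzer_->SetLanguage(c.language);
    std::string cxx_result = analyzer_->Tokenize(c.input);  // signature assumed
    EXPECT_NE(cxx_result.find(c.expected_stem_substr), std::string::npos)
        << "language=" << c.language << " input=" << c.input;
}
```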

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/unit_test/common/analyzer/rag_analyzer_ut.cpp` around lines 312 - 340,
Replace the single-case Dutch-only test in TEST_F(RAGAnalyzerTest,
test_set_language_dutch) with a table-driven loop: define an array of tuples
(language, input, expected_stem_substr) including at least dutch and arabic,
then for each entry call the external Python tokenizer (using
rag_tokenizer_path_ and rag_tokenizer.py) to capture python_result, call
analyzer_->SetLanguage(language) and analyzer_->Tokenize(input) to get
cxx_result, and assert that cxx_result contains expected_stem_substr and equals
python_result; keep the same logging and newline trimming steps but run them
per-case to catch language-specific regressions.

318-318: 💤 Low value

Quoting/escaping is fragile if the test input is ever extended.

The shell command embeds "huizen" directly. If anyone later parameterizes the input with strings containing quotes, backticks, $, or backslashes, this will break or behave unexpectedly. Consider a small helper that shell-quotes the argument (e.g. wrapping in single quotes and escaping any embedded single quote), or driving Python through a temp input file like the other tests do.
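
A minimal shell-quoting helper of the kind suggested in option (A) might look like this; it is a sketch, not code from the PR, and the `input` variable in the usage line is hypothetical:

```cpp
// Wrap s in single quotes and escape embedded single quotes with the standard
// close-quote / escaped-quote / reopen-quote idiom, so arbitrary test inputs
// survive the shell.
std::string ShellQuote(const std::string &s) {
    std::string quoted = "'";
    for (char c : s) {
        if (c == '\'') {
            quoted += "'\\''";  // emits '\'' : close, escaped quote, reopen
        } else {
            quoted += c;
        }
    }
    quoted += "'";
    return quoted;
}

// Usage when composing the command:
// std::string python_cmd = "uv run " + rag_tokenizer_path_ +
//                          "/rag_tokenizer.py -l dutch " + ShellQuote(input);
```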

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/unit_test/common/analyzer/rag_analyzer_ut.cpp` at line 318, The test
builds python_cmd as a fragile string: std::string python_cmd = "uv run " +
rag_tokenizer_path_ + "/rag_tokenizer.py " + "-l dutch \"huizen\"";; change this
to safely pass the test input by either (A) using a small shell-quoting helper
(e.g. std::string shell_quote(const std::string& s) that wraps in single quotes
and escapes embedded single quotes) and use it when composing python_cmd, or (B)
write the input text to a temporary file and invoke rag_tokenizer.py with that
filename (mirroring other tests) so rag_tokenizer_path_ and rag_tokenizer.py are
used without embedding raw quoted text; update the python_cmd construction
accordingly and replace the literal "huizen" usage.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@python/infinity_sdk/infinity/rag_tokenizer.py`:
- Around line 158-188: The set_language method silently falls back to defaults
for unknown languages; change the fallback logging to a visible warning and
include the invalid value and accepted languages for clarity: in set_language
(look for _SNOWBALL_LANGUAGE_MAP, self.stemmer, self.lemmatizer,
self._use_lemmatizer) replace the logging.debug(...) in the unsupported-language
branch with logging.warning(...) and format the message to show the
user-provided language and list of supported keys (e.g.
sorted(_SNOWBALL_LANGUAGE_MAP.keys())); also make the same change at the other
occurrence referenced (same unsupported-language debug at the other location).
- Around line 17-26: Update the PEP 723 metadata block to tighten the Python
requirement: change the requires-python value in the metadata (the
"requires-python" key inside the /// script PEP 723 block) from ">=3.10,<3.15"
to ">=3.11,<3.15" so the SDK and uv run resolve Python 3.11+ as required by the
project.

---

Duplicate comments:
In `@src/common/analyzer/rag_analyzer_impl.cpp`:
- Around line 687-719: The stemmer is being re-initialized without freeing
previous resources, causing StemFunc/SN_env leaks; before any call to
stemmer_->Init(...) in RAGAnalyzer::InitStemmer and RAGAnalyzer::SetLanguage,
call stemmer_->DeInit() (or check/guard inside Stemmer::Init) to release prior
allocations, then call Init and set use_lemmatizer_ as before so repeated
SetLanguage or InitStemmer calls do not leak.
- Around line 49-69: The SNOWBALL_LANGUAGE_MAP currently contains {"arabic",
STEM_LANG_UNKNOWN} which causes SetLanguage to silently keep the previous
stemmer; remove any entries that map to STEM_LANG_UNKNOWN (including "arabic")
from SNOWBALL_LANGUAGE_MAP and update SetLanguage (the function that looks up
stem_lang from SNOWBALL_LANGUAGE_MAP) to detect a missing/unsupported mapping
and log/return a WARNING-level message (instead of silently leaving the previous
stemmer) when an explicitly requested language is not supported; ensure the
change references SNOWBALL_LANGUAGE_MAP and SetLanguage so the
test_tokenize_consistency_with_python will surface unsupported-language
mismatches.

In `@src/unit_test/common/analyzer/rag_analyzer_ut.cpp`:
- Around line 321-329: Check and fail-fast on popen/pclose: after calling
popen(python_cmd.c_str(), "r") assert that pipe != nullptr (e.g.,
EXPECT_NE(pipe, nullptr) or add a FAIL() with the python_cmd) so test fails
immediately if Python can't start; after pclose(pipe) capture and assert the
return status is zero (e.g., EXPECT_EQ(pclose_ret, 0) or use
WEXITSTATUS/WIFEXITED to report the actual exit code) and include python_result
and python_cmd in the failure message to surface why the helper failed instead
of letting an empty python_result cause a later confusing EXPECT_EQ.

---

Nitpick comments:
In `@src/common/analyzer/rag_analyzer_impl.cpp`:
- Around line 692-708: The SetLanguage function currently linear-scans
SNOWBALL_LANGUAGE_MAP and trims only " \t", causing potential misses and
inefficiency; replace the array scan by using a static const
std::unordered_map<std::string_view, Language> (or std::flat_map) keyed by the
normalized lang_key and look up in O(1) (update uses of SNOWBALL_LANGUAGE_MAP
accordingly), and broaden trimming of lang_key to remove all common whitespace
characters by using " \t\n\r\f\v" when calling
find_last_not_of/find_first_not_of so CR/LF and other whitespace won't cause
silent misses.
- Around line 1391-1398: The four-place duplication of the pattern that
conditionally lemmatizes then stems (checking use_lemmatizer_, calling
wordnet_lemma_->Lemmatize, then stemmer_->Stem) should be consolidated into a
small helper NormalizeOne(std::string_view lowercase_term, std::string &out):
implement NormalizeOne to set term_to_stem = use_lemmatizer_ ?
wordnet_lemma_->Lemmatize(lowercase_term) : std::string(lowercase_term) and then
call stemmer_->Stem(term_to_stem, out); replace the duplicated blocks in
EnglishNormalize, Tokenize, TokenizeWithPosition, and
EnglishNormalizeWithPosition with calls to NormalizeOne to reduce duplication
and ease maintenance.

In `@src/common/analyzer/rag_analyzer.cppm`:
- Line 27: The public module interface (rag_analyzer.cppm) currently imports
:logger unnecessarily; remove the line "import :logger;" from the module
interface and instead add that import to the implementation unit
(rag_analyzer_impl.cpp / :rag_analyzer.impl) where LOG_DEBUG is used (e.g., in
SetLanguage). Ensure the implementation module has the import :logger; and
rebuild to confirm no other interface symbols depend on logger.

In `@src/unit_test/common/analyzer/rag_analyzer_ut.cpp`:
- Around line 312-340: Replace the single-case Dutch-only test in
TEST_F(RAGAnalyzerTest, test_set_language_dutch) with a table-driven loop:
define an array of tuples (language, input, expected_stem_substr) including at
least dutch and arabic, then for each entry call the external Python tokenizer
(using rag_tokenizer_path_ and rag_tokenizer.py) to capture python_result, call
analyzer_->SetLanguage(language) and analyzer_->Tokenize(input) to get
cxx_result, and assert that cxx_result contains expected_stem_substr and equals
python_result; keep the same logging and newline trimming steps but run them
per-case to catch language-specific regressions.
- Line 318: The test builds python_cmd as a fragile string: std::string
python_cmd = "uv run " + rag_tokenizer_path_ + "/rag_tokenizer.py " + "-l dutch
\"huizen\"";; change this to safely pass the test input by either (A) using a
small shell-quoting helper (e.g. std::string shell_quote(const std::string& s)
that wraps in single quotes and escapes embedded single quotes) and use it when
composing python_cmd, or (B) write the input text to a temporary file and invoke
rag_tokenizer.py with that filename (mirroring other tests) so
rag_tokenizer_path_ and rag_tokenizer.py are used without embedding raw quoted
text; update the python_cmd construction accordingly and replace the literal
"huizen" usage.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 22a0fd4b-51c7-40b7-9aeb-7b2a4e4d46ad

📥 Commits

Reviewing files that changed from the base of the PR and between c6303a9 and e878669.

📒 Files selected for processing (4)
  • python/infinity_sdk/infinity/rag_tokenizer.py
  • src/common/analyzer/rag_analyzer.cppm
  • src/common/analyzer/rag_analyzer_impl.cpp
  • src/unit_test/common/analyzer/rag_analyzer_ut.cpp

@yingfeng yingfeng merged commit 5458531 into infiniflow:main May 6, 2026
101 checks passed

Labels

ci PR can be test

2 participants