pg_pinyin includes:
- SQL baseline (
sql/pinyin.sql) - Rust extension (
src/lib.rs)
CREATE EXTENSION pg_pinyin installs the Rust-backed core API:
pinyin_char_romanize(text)pinyin_char_romanize(text, suffix text)pinyin_word_romanize(text)pinyin_word_romanize(text, suffix text)pinyin_word_romanize(tokenizer_input anyelement)(overload; usepdbtokenizer input such asname::pdb.icu::text[])pinyin_word_romanize(tokenizer_input anyelement, suffix text)(overload with user-table suffix)pinyin_regex_phrase(text, slope integer DEFAULT NULL, max_expansions integer DEFAULT NULL, generated_pinyin boolean DEFAULT false)(pg_searchquery helper; installed byCREATE EXTENSION pg_pinyinwhenpg_searchis already enabled in the database, returnspdb.query)
pinyin_regex_phrase is a Rust-backend public API, but its return type is pdb.query, so pg_search must be enabled in the database before CREATE EXTENSION pg_pinyin. PostgreSQL extension scripts cannot reliably enable another extension while they are being installed. If pg_pinyin is installed before pg_search, the romanization APIs are still installed and pinyin_regex_phrase is installed as an error stub with a clear exception.
CREATE EXTENSION pg_pinyin also installs pinyin_regex_phrase_patterns(text, generated_pinyin boolean DEFAULT false). It is a Rust-backed internal helper for pinyin_regex_phrase; application SQL should normally call pinyin_regex_phrase(...) instead.
pinyin_regex_phrase_patterns returns an empty text[] when the input is empty, whitespace-only, or cannot be parsed as pinyin tokens. SQL NULL input still returns SQL NULL because the function is strict.
sql/word.sql is not installed automatically by CREATE EXTENSION pg_pinyin. It is the SQL-backend companion to sql/pinyin.sql; load sql/pinyin.sql first and use it only in databases where pg_search is available. pg_search 0.24.0 must be preloaded before CREATE EXTENSION pg_search, for example by starting PostgreSQL with -c shared_preload_libraries=pg_search.
sql_pinyin_regex_phrase(text, slope integer DEFAULT NULL, max_expansions integer DEFAULT NULL, generated_pinyin boolean DEFAULT false)(SQL tokenization, returnspdb.query)sql_pinyin_regex_phrase_patterns(text, generated_pinyin boolean DEFAULT false)(SQL helper returning regex phrase tokens astext[])sql_pinyin_regex_phrase_patternsfollows the same empty-array behavior.sql_pinyin_regex_phraseand the Rust-backedpinyin_regex_phrasemap an empty pattern array topdb.empty()so they are safe to use with@@@; callers that want an empty user query to leave other filters unaffected should omit the@@@predicate or usepdb.all()explicitly.
Recommended usage:
- char romanization +
pg_trgm - word romanization +
pg_search
Optional pg_search query helper:
CREATE EXTENSION IF NOT EXISTS pg_search;
CREATE EXTENSION IF NOT EXISTS pg_pinyin;
\i sql/pinyin.sql
\i sql/word.sql
CREATE TABLE voice (
id bigserial PRIMARY KEY,
description text NOT NULL,
pinyin text GENERATED ALWAYS AS (public.pinyin_word_romanize(description)) STORED
);
CREATE INDEX voice_pinyin_bm25_idx
ON voice
USING bm25 (id, pinyin)
WITH (key_field='id');
-- Rust backend: CREATE EXTENSION pg_pinyin exports pinyin_regex_phrase()
-- when pg_search is already enabled in the database.
SELECT *
FROM voice
WHERE pinyin @@@ public.pinyin_regex_phrase('zhengshuang');
-- SQL backend fallback: use this when pg_pinyin is not installed.
SELECT *
FROM voice
WHERE pinyin @@@ public.sql_pinyin_regex_phrase('zhengshuang');CREATE EXTENSION IF NOT EXISTS pg_pinyin;
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE TABLE voice (
id bigserial PRIMARY KEY,
description text NOT NULL,
pinyin text GENERATED ALWAYS AS (public.pinyin_char_romanize(description)) STORED
);
CREATE INDEX voice_pinyin_trgm_idx ON voice USING gin (pinyin gin_trgm_ops);
INSERT INTO voice (description) VALUES ('郑爽ABC');
SELECT id, description, pinyin FROM voice;You can provide custom dictionary tables in schema pinyin by suffix:
pinyin.pinyin_mapping_suffix1pinyin.pinyin_words_suffix1
When calling ...(..., '_suffix1'), romanization uses a merged dictionary:
- base tables (
pinyin_mapping/pinyin_words) - suffix tables (
pinyin_mapping_suffix1/pinyin_words_suffix1) with higher priority
Example:
CREATE TABLE IF NOT EXISTS pinyin.pinyin_mapping_suffix1 (
character text PRIMARY KEY,
pinyin text NOT NULL
);
CREATE TABLE IF NOT EXISTS pinyin.pinyin_words_suffix1 (
word text PRIMARY KEY,
pinyin text NOT NULL
);
INSERT INTO pinyin.pinyin_mapping_suffix1 (character, pinyin)
VALUES ('郑', '|zhengx|')
ON CONFLICT (character) DO UPDATE SET pinyin = EXCLUDED.pinyin;
INSERT INTO pinyin.pinyin_words_suffix1 (word, pinyin)
VALUES ('郑爽', '|zhengx| |shuangx|')
ON CONFLICT (word) DO UPDATE SET pinyin = EXCLUDED.pinyin;
SELECT public.pinyin_char_romanize('郑爽ABC', '_suffix1');
SELECT public.pinyin_word_romanize('郑爽ABC'::pdb.icu::text[], '_suffix1');The Rust extension now embeds these dictionaries at build time:
sql/data/pinyin_mapping.csvsql/data/pinyin_token.csvsql/data/pinyin_words.csv
During CREATE EXTENSION pg_pinyin, it seeds dictionary tables under schema pinyin
using PostgreSQL COPY from embedded CSV payloads (with SQL INSERT fallback).
No separate sql/load_data.sql step is required for extension usage.
Data prep logic is in this repo:
scripts/data/generate_extension_data.py(optimized pipeline)scripts/generate_data.sh(one-shot entrypoint)
The project includes mozillazg/pinyin-data as submodule at:
third_party/pinyin-data
Initialize submodule:
git submodule update --init third_party/pinyin-dataGenerate all extension data in one command:
./scripts/generate_data.shNotes:
- char/token data is generated from
third_party/pinyin-data. - word data uses
hanzi_pinyin_words.csvwhen available; otherwise an emptypinyin_words.csvis created.
Generated outputs:
sql/data/pinyin_mapping.csvsql/data/pinyin_token.csvsql/data/pinyin_words.csv
If needed, override source repo:
PINYIN_DATA_DIR=/path/to/pinyin-data ./scripts/generate_data.shpsql "$PGURL" -f sql/pinyin.sql
psql "$PGURL" \
-v mapping_file='/absolute/path/sql/data/pinyin_mapping.csv' \
-v token_file='/absolute/path/sql/data/pinyin_token.csv' \
-v words_file='/absolute/path/sql/data/pinyin_words.csv' \
-f sql/load_data.sqlpgTAP:
./test/pgtap/run.shRust extension tests:
cargo pgrx test pg18 --features pg18Dockerfiles:
docker/Dockerfile.test-trixiedocker/Dockerfile.release-trixie
Defaults now use upstream addresses (no mirror rewrite):
- base image:
postgres:18.3-trixie - apt source: base image defaults
- rustup/cargo source: upstream defaults
Build test image:
docker build -f docker/Dockerfile.test-trixie -t pg_pinyin/test:trixie .
docker run -d --name pg_pinyin_test \
-e POSTGRES_HOST_AUTH_METHOD=trust \
pg_pinyin/test:trixie \
postgres -c shared_preload_libraries=pg_search
# optional: use local mirrors for one-off local builds
docker build -f docker/Dockerfile.test-trixie -t pg_pinyin/test:trixie \
--build-arg DEBIAN_MIRROR=http://mirrors.tuna.tsinghua.edu.cn/debian \
--build-arg DEBIAN_SECURITY_MIRROR=http://mirrors.tuna.tsinghua.edu.cn/debian-security \
--build-arg RUSTUP_DIST_SERVER=https://mirrors.tuna.tsinghua.edu.cn/rustup \
--build-arg RUSTUP_UPDATE_ROOT=https://mirrors.tuna.tsinghua.edu.cn/rustup/rustup \
--build-arg CARGO_REGISTRIES_CRATES_IO_INDEX=sparse+https://mirrors.tuna.tsinghua.edu.cn/crates.io-index/ \
.
# optional: pin pg_search version at build time
# docker build --build-arg PG_SEARCH_VERSION=0.24.0 -f docker/Dockerfile.test-trixie -t pg_pinyin/test:trixie .Build release image:
docker build -f docker/Dockerfile.release-trixie -t pg_pinyin/release:trixie .The Dockerfiles use BuildKit cache mounts for Rust download/index caches. If needed, ensure BuildKit is enabled:
DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.test-trixie -t pg_pinyin/test:trixie .Tokenization-only benchmark script:
scripts/benchmark_pg18.sh
It measures:
- SQL char tokenizer:
characters2romanize(name)(cold+warm) - Rust char tokenizer:
pinyin_char_romanize(name)(cold+warm) - Rust char tokenizer (user suffix overlay):
pinyin_char_romanize(name, '_<suffix>')(cold+warm) - SQL word tokenizer:
icu_romanize(name::pdb.icu::text[])(cold+warm, ifpg_searchexists) - Rust word tokenizer with tokenizer input:
pinyin_word_romanize(name::pdb.icu::text[])(cold+warm) - Rust word tokenizer with suffix overlay:
pinyin_word_romanize(name::pdb.icu::text[], '_<suffix>')(cold+warm) - Rust word tokenizer with plain text input:
pinyin_word_romanize(name)(cold+warm) - SQL backend query builder:
sql_pinyin_regex_phrase_patterns(query)plussql_pinyin_regex_phrase(query)forpdb.queryconstruction (warm, fromsql/word.sql) - Rust backend query-token builder:
pinyin_regex_phrase_patterns(query)(warm, fromCREATE EXTENSION pg_pinyin) - Rust backend
pg_searchquery builder:pinyin_regex_phrase(query)(warm, exported byCREATE EXTENSION pg_pinyinwhenpg_searchis already enabled in the database) - Standalone Rust query-token builder:
cargo run --release --bin benchmark_pinyin_regex_phrase -- --mode tokens - Standalone Python query-token builder:
scripts/benchmark_pinyin_regex_phrase.py --mode tokensusing the same tokenization output
All benchmark queries use EXPLAIN (ANALYZE, BUFFERS, MEMORY, SUMMARY).
By default, scripts/benchmark_pg18.sh drops and recreates a dedicated benchmark database (BENCHMARK_DATABASE=pg_pinyin_benchmark) before each run, then refreshes runtime state before each benchmark task. This avoids collisions with previously loaded extensions, SQL helper definitions, BM25 indexes, bench tables, and Rust dictionary caches. Set BENCHMARK_FRESH_DATABASE=0 only when you intentionally want to reuse the database named by PGURL.
Run:
ROWS=2000 REGEX_BENCH_ROWS=20000 USER_TABLE_SUFFIX=_bench PGURL=postgres://localhost/postgres ./scripts/benchmark_pg18.shStandalone helper benchmarks:
cargo run --release --bin benchmark_pinyin_regex_phrase -- --rows 20000 --runs 5 --mode tokens
python3 scripts/benchmark_pinyin_regex_phrase.py --rows 20000 --runs 5Session command:
ROWS=2000 REGEX_BENCH_ROWS=20000 USER_TABLE_SUFFIX=_bench PGURL=postgres://postgres@localhost:5432/postgres ./scripts/benchmark_pg18.shLatest run (PG18, pg_search=0.24.0, pg_pinyin=0.0.4, fresh benchmark database, ROWS=2000, REGEX_BENCH_ROWS=20000, 2026-06-08):
Character mode:
| Scenario | Cold | Warm | Speedup vs SQL (cold / warm) |
|---|---|---|---|
SQL baseline (characters2romanize) |
8291.975 |
8072.434 |
1.0x / 1.0x |
Rust (pinyin_char_romanize) |
90.025 |
32.087 |
92.1x / 251.6x |
Rust + suffix (pinyin_char_romanize(name, '_bench')) |
176.412 |
33.863 |
47.0x / 238.4x |
Word mode (pg_search tokenizer input):
| Scenario | Cold | Warm | Speedup vs SQL (cold / warm) |
|---|---|---|---|
SQL baseline (icu_romanize(name::pdb.icu::text[])) |
255.649 |
249.832 |
1.0x / 1.0x |
Rust (pinyin_word_romanize(name::pdb.icu::text[])) |
332.474 |
71.259 |
0.8x / 3.5x |
Rust + suffix (pinyin_word_romanize(name::pdb.icu::text[], '_bench')) |
788.152 |
69.230 |
0.3x / 3.6x |
Rust plain text (pinyin_word_romanize(name)) |
334.510 |
36.570 |
0.8x / 6.8x |
Query-token builder (pinyin_regex_phrase_patterns, 20,000 rows):
| Scenario | Cold-ish / Best | Warm / Best | Notes |
|---|---|---|---|
SQL backend tokens (sql_pinyin_regex_phrase_patterns) |
1555.603 ms |
1572.450 ms |
PostgreSQL SQL helper from sql/word.sql |
Rust backend tokens (pinyin_regex_phrase_patterns) |
41.784 ms |
43.334 ms |
PostgreSQL UDF path, includes text[] return overhead |
Rust backend generated-pinyin tokens (pinyin_regex_phrase_patterns(query, true)) |
- | 44.439 ms |
PostgreSQL UDF path |
Rust backend pg_search query (pinyin_regex_phrase) |
- | 60.588 ms |
Public helper exported by CREATE EXTENSION pg_pinyin |
Rust backend slope/max (pinyin_regex_phrase(query, 2, 4096)) |
- | 57.838 ms |
Public helper exported by CREATE EXTENSION pg_pinyin |
SQL backend + pg_search query (sql_pinyin_regex_phrase) |
- | 1611.476 ms |
Builds pdb.query |
SQL backend + slope/max (sql_pinyin_regex_phrase(query, 2, 4096)) |
- | 1607.444 ms |
Builds pdb.query |
SQL backend generated-pinyin query (sql_pinyin_regex_phrase(query, NULL, NULL, true)) |
- | 1568.413 ms |
Builds pdb.query |
Standalone query-token builder (tokens mode, 20,000 rows, no PostgreSQL executor or SQL array overhead):
| Scenario | Best | Median | Best per row | Checksum |
|---|---|---|---|---|
| Rust standalone | 3.538 ms |
3.581 ms |
0.177 us |
219996 |
| Python standalone | 33.353 ms |
33.631 ms |
1.668 us |
219996 |
Server-side full pg_search query benchmark (20,000 rows from bench_pinyin_regex_queries, no client round trip per query):
| Scenario | Execution Time | Notes |
|---|---|---|
Rust parser in PostgreSQL + pg_search, memoize enabled |
1027.056 ms |
Query mix has 9 distinct queries; PostgreSQL memoizes repeated LATERAL results |
Rust parser in PostgreSQL + pg_search, memoize disabled and JIT off |
5739.186 ms |
Executes 20,000 BM25 lookups |
Full pg_search query benchmark (20,000 client queries against a 2,000-row BM25 table, same result checksum):
| Scenario | Best | Median | Best per query | Checksum |
|---|---|---|---|---|
Python client parse + text[] patterns + pg_search |
7439.585 ms |
7503.080 ms |
371.979 us |
1337644 |
Rust in-Postgres parse + pg_search |
7529.758 ms |
7530.354 ms |
376.488 us |
1337644 |
The full client-query benchmark shows that once each query executes a real pg_search index lookup and a client/server round trip, parser cost is not the dominant factor. Keeping parsing in PostgreSQL is still useful when callers want a single parameterized SQL API, server-side batch queries, no client-side token dictionary, and no dynamic SQL construction.
Times above are Execution Time in milliseconds from EXPLAIN (ANALYZE, BUFFERS, MEMORY, SUMMARY).
cold runs for Rust base paths force a dictionary version bump before execution to simulate first-use cache load.
Suffix dictionaries are cached on first use and reused across statements. If suffix tables are updated, clear cache with public.pinyin_clear_suffix_cache('_suffix') (or public.pinyin_clear_suffix_cache() for all).
The standalone Rust/Python query-token numbers intentionally exclude PostgreSQL executor, UDF, and SQL array materialization overhead; they compare only the tokenization and pattern construction path.
- Tidy up the data generation pipeline and expand the word dictionary coverage.
Support user-provided dictionaries and allow romanization against a specific dictionary set.- Provide a smooth upgrade path for extension dictionaries and user dictionaries.
- Improve English handling (including stemming).
- Provide better examples without
pg_search. - Improve performance and memory balance (for example, evaluate frozen hash structures vs table lookups).
All dictionaries remain runtime-editable:
pinyin.pinyin_mappingpinyin.pinyin_wordspinyin.pinyin_token
No extension rebuild is required after table updates.
If you use the SQL-based romanization method (sql/pinyin.sql), cite:
- CN115905297A: 一种支持拼音检索和排序的方法及系统
BibTeX:
@patent{CN115905297A,
author = {Liang Zhanzhao},
title = {一种支持拼音检索和排序的方法及系统},
number = {CN115905297A},
country = {CN},
year = {2023},
url = {https://patents.google.com/patent/CN115905297A/zh}
}- Hanzi word-to-pinyin TSV source: tsroten/dragonmapper
hanzi_pinyin_words.tsv