psychofict/hwpkit

hwpkit

The pure-Python toolkit for Korean HWP & HWPX (Hancom Office) documents.
Read it, edit it, extract its text — no Hancom, no Windows, no COM automation.

📖 Documentation · PyPI · Changelog


Korean government agencies, universities, courts, and most Korean enterprises run on HWP (한글, Hangul Word Processor), the .hwp and .hwpx formats from Hancom Office. The rest of the world's document tooling (python-docx, pdfplumber, unstructured, LibreOffice) can't read them; the few tools that can need a Windows box with Hancom installed, driven over COM.

hwpkit is the missing piece. Pure Python, cross-platform, zero external apps — install it and start reading, editing, and extracting Korean documents in three lines.

pip install "hwpkit[full]"    # quoted so shells like zsh don't expand the brackets
from hwpkit import extract_text_from_file
print(extract_text_from_file("계약서.hwp"))      # …or .hwpx — auto-detected

That's it. No Hancom license. No Windows. No headless office server.

Why hwpkit

  • 📄 Both formats, one API. Binary .hwp (HWP 5.0) and XML .hwpx (OWPML) — extract, edit, and insert images through the same calls. open_document() hands you the same object either way, so you learn it once and never branch on format.
  • 🐍 Pure Python, runs anywhere. Linux, macOS, Windows, containers, Lambda. No Hancom, no pywin32, no COM, no LibreOffice subprocess.
  • 🤖 Built for LLM / RAG. Clean Korean text out of any .hwp/.hwpx, ready to chunk and embed. The preprocessing step your retrieval pipeline was missing.
  • ✍️ Edit without corrupting. Fill government & university forms, tick checkboxes, rewrite cells — and hwpkit rewrites the binary container while preserving the directory-tree structure Hancom validates on open. (This is the genuinely hard part. We solved it. See the gotchas.)
  • 🖋️ Insert seals & signatures. Stamp a 도장/직인/서명 image into a form cell — into both .hwp and .hwpx.
  • 🪶 Tiny core. The base install is just olefile; lxml and Pillow are optional extras, loaded lazily only when you use .hwpx or images.
  • ⚖️ MIT licensed. Use it anywhere, commercially or not.

Install

Python 3.9+.

pip install hwpkit              # core: binary .hwp read + edit
pip install "hwpkit[hwpx]"      # + .hwpx (OWPML) support  (adds lxml)
pip install "hwpkit[image]"     # + seal / signature insertion  (adds Pillow)
pip install "hwpkit[full]"      # everything (quotes keep zsh from expanding the brackets)
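
Because the extras are optional, a script can check up front which ones are importable before it touches .hwpx files or images. The following is a minimal standard-library sketch, not part of hwpkit: `available_extras` is a hypothetical helper, and the mapping of extras to import names (`lxml` for `hwpx`, `PIL` for `image`) is an assumption inferred from the install notes above.

```python
from importlib.util import find_spec

# Assumed mapping: hwpkit extra name -> top-level module that extra installs.
EXTRA_MODULES = {"hwpx": "lxml", "image": "PIL"}


def available_extras(modules=EXTRA_MODULES):
    """Return {extra_name: bool}, True when the extra's module can be imported."""
    return {extra: find_spec(module) is not None for extra, module in modules.items()}


if __name__ == "__main__":
    for extra, ok in available_extras().items():
        print(f"{extra}: {'installed' if ok else 'missing'}")
```

`find_spec` only locates the module without importing it, so the check stays cheap.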

Quickstart

Extract text from any Korean document — for RAG, search, or an LLM:

from hwpkit import extract_text_from_file

text = extract_text_from_file("notice.hwpx")   # .hwp or .hwpx, auto-detected

# from the shell: works on both formats
hwpkit-text contract.hwp | llm "Summarize the key obligations in Korean"

Fill a form, tick a checkbox, stamp a seal — one API, either format:

from hwpkit import open_document

doc = open_document("template.hwp")            # or "template.hwpx" — auto-detected
print(doc.describe())                           # list paragraphs to find field indices
doc.inject_text(24, "홍길동")                    # fill an empty cell
doc.swap_in_para_text(40, "□ 석사", "☑ 석사")    # tick a checkbox
doc.replace_text(75, "2026. 05. 19.")           # overwrite a cell
doc.place_image(42, "seal.png", width_mm=30)    # stamp a 도장 / signature
doc.save("out.hwp")

open_document returns an HwpFile or HwpxFile depending on the file — both expose the same methods, so your code never branches on format.
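
Since both classes expose the same methods, a form-filling helper can be written once against that shared surface. This is a rough sketch under assumptions: `FormDocument` and `fill_fields` are hypothetical names, only `inject_text` and `save` from the Quickstart are used, their return types are not known here, and paragraph indices are assumed to stay stable between calls (as the Quickstart's fixed indices suggest).

```python
from typing import Mapping, Protocol


class FormDocument(Protocol):
    """The subset of methods this sketch relies on (names taken from the Quickstart)."""

    def inject_text(self, index: int, text: str) -> object: ...

    def save(self, path: str) -> object: ...


def fill_fields(doc: FormDocument, fields: Mapping[int, str], out_path: str) -> int:
    """Inject each value at its paragraph index, save, and return how many fields were written."""
    for index in sorted(fields):          # deterministic order, lowest index first
        doc.inject_text(index, fields[index])
    doc.save(out_path)
    return len(fields)
```

With hwpkit installed, usage would look like `fill_fields(open_document("template.hwp"), {24: "홍길동"}, "out.hwp")`; the same call should work unchanged for a .hwpx template.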

Prefer plain functions? The originals are still there: fill_hwp(...), inject_text(records, i, text), and the file-to-file place_image("in.hwp", "out.hwp", "seal.png", paragraph_index=42).

Find which paragraph is which field:

hwpkit-inspect template.hwp        # one line per record, with a text preview

Built for Korean RAG & LLM pipelines

Korean enterprises ship contracts, policies, regulations, government notices, court filings, internal memos, and academic papers as .hwp / .hwpx. If your retrieval pipeline can't read HWP, it simply can't index Korean enterprise data — and the standard stack (pdfplumber, python-docx, unstructured) doesn't cover it.

hwpkit is a clean, dependency-light text source you can plug into anything — no LLM SDK required:

import glob
from hwpkit import extract_text_from_file

# vector_db stands in for your own vector-store client; it is not part of hwpkit.
for path in glob.glob("corpus/**/*.hwp*", recursive=True):   # .hwp and .hwpx
    vector_db.add(doc_id=path, content=extract_text_from_file(path))

Extraction strips inline controls (tables, images, footnote refs, autonumbers, page-number controls, bookmarks) and returns clean, one-line-per-paragraph text — table-cell content included — ready for chunkers, embeddings, or any context window.
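
Because the output is one paragraph per line, a chunker can split on newlines and pack whole paragraphs together. A minimal sketch, independent of hwpkit: `chunk_paragraphs` is a hypothetical helper, and the budget is counted in characters rather than tokens, which you would likely swap for your embedding model's tokenizer.

```python
def chunk_paragraphs(text, max_chars=500):
    """Group newline-separated paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own oversized chunk.
    """
    chunks, current, size = [], [], 0
    for para in text.splitlines():
        para = para.strip()
        if not para:
            continue                      # skip blank lines
        # +1 accounts for the newline that joins paragraphs inside a chunk
        added = len(para) + (1 if current else 0)
        if current and size + added > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
            added = len(para)
        current.append(para)
        size += added
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Keeping paragraphs intact means a table cell or clause is never cut mid-sentence, at the cost of uneven chunk sizes.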

Who uses this

  • AI / RAG engineers indexing Korean documents into vector DBs.
  • Gov-tech & RPA teams auto-filling 관공서·대학 forms at scale.
  • Data engineers migrating HWP archives to text / structured data.
  • Anyone who needs to edit a .hwp without clicking through Hancom.

hwpkit vs the alternatives

pyhwp pyhwpx olefile hwpkit
Pure Python (no Hancom / no Windows) (needs Hancom + COM)
Extract text — .hwp
Extract text — .hwpx
Edit text without corrupting the file (via Hancom)
Rewrite a stream that grew / shrank n/a
Insert an image (seal / signature) (via Hancom)
One API across .hwp and .hwpx
Runs in CI / Linux / containers

hwpkit is the only option that does all of it in portable, pure Python — no Hancom, no Windows, no COM bridge.

How it works (for the curious)

Under the hood, .hwp is a Microsoft Compound File Binary (MS-CFB) container of deflate-compressed record streams; .hwpx is a ZIP of OWPML XML. olefile can only rewrite a stream if its byte length is unchanged — almost never true when you inject Korean text. hwpkit rewrites the whole container while preserving the red-black-tree directory topology Hancom validates on open, and patches the record-level layout caches so text re-flows correctly.
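
The two containers can be told apart from their first bytes. The sketch below illustrates that idea only and is not necessarily how hwpkit auto-detects formats: `sniff_format` and `sniff_file` are hypothetical names. It identifies the container, not the document type, so any other CFB file (such as a legacy .doc) or ZIP archive (such as a .docx) would match as well.

```python
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # Compound File Binary signature
ZIP_MAGIC = b"PK\x03\x04"                      # ZIP local file header signature


def sniff_format(head: bytes) -> str:
    """Classify a file by its leading bytes: 'hwp', 'hwpx', or 'unknown'."""
    if head.startswith(CFB_MAGIC):
        return "hwp"
    if head.startswith(ZIP_MAGIC):
        return "hwpx"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 8 bytes of a file and classify its container."""
    with open(path, "rb") as f:
        return sniff_format(f.read(8))
```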

  • docs/OBJECT_MODEL.md — how .hwp records map onto their .hwpx (OWPML) twins; the shared HWPUNIT geometry.
  • docs/RECORD_FORMAT.md — the binary record format.
  • docs/GOTCHAS.md — the traps that take a week to find the first time (layout cache, 7-slot CharShape, RB-tree validation, image sizing). Read this if Hancom is rejecting your output.
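
To give a feel for what a record stream looks like, here is a rough sketch of walking one. It rests on assumptions drawn from the publicly documented HWP 5.0 layout rather than from hwpkit's code: streams are raw deflate (no zlib header), and each record starts with a 32-bit little-endian header packing a 10-bit tag, a 10-bit level, and a 12-bit size, where a size of 0xFFF means the real length follows in the next 4 bytes. `inflate` and `iter_records` are hypothetical names, and docs/RECORD_FORMAT.md remains the authoritative description.

```python
import struct
import zlib


def inflate(raw: bytes) -> bytes:
    """Decompress a raw-deflate stream (negative wbits means no zlib header)."""
    return zlib.decompress(raw, -15)


def iter_records(data: bytes):
    """Yield (tag_id, level, payload) for each record in a decompressed stream.

    No validation is done; truncated input may raise struct.error.
    """
    pos = 0
    while pos + 4 <= len(data):
        (header,) = struct.unpack_from("<I", data, pos)
        pos += 4
        tag_id = header & 0x3FF           # bits 0-9
        level = (header >> 10) & 0x3FF    # bits 10-19
        size = (header >> 20) & 0xFFF     # bits 20-31
        if size == 0xFFF:                 # extended size: real length in next 4 bytes
            (size,) = struct.unpack_from("<I", data, pos)
            pos += 4
        yield tag_id, level, data[pos:pos + size]
        pos += size
```

The variable-length payloads are why injecting text changes a stream's byte length, which in turn is why the whole container has to be rewritten.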

For semantic HWP → XML (OWPML) conversion, see pyhwp.

Contributing

Issues, ideas, and PRs welcome. If hwpkit saved you from a Windows VM and a Hancom license, a ⭐ on GitHub helps others find it.

License

MIT — see LICENSE.


Made by Ebenworks
