feat(rag): add PowerPoint (.pptx) extraction with VLM image support#1224
Conversation
PPTX files were explicitly excluded from GAIA's RAG pipeline — users had to save as PDF first. This adds native extraction of text (shapes, tables, speaker notes) and embedded images (via VLM when available), mirroring the existing PDF pipeline pattern. New module pptx_utils.py handles slide-level extraction with group shape recursion and markdown table formatting. A zip bomb guard rejects files whose uncompressed size exceeds 500 MB before handing them to python-pptx.
|
@claude review this PR. |
|
@claude review this PR |
|
Solid PR — the per-slide loop, VLM integration, 🟡 Error handling diverges from the PDF pattern it mirrors. PDF failures raise typed exceptions ( 🟢 Markdown table rows are single-line. 🟢 Two One process note: this adds a new document type feeding RAG retrieval. Unit tests cover extraction mechanics well, but if you haven't already, a quick None of these are blockers — the core extraction logic looks correct and well-tested. The error-typing one is the only thing I'd genuinely recommend addressing for parity with PDF. |
|
Solid implementation that cleanly mirrors the PDF pipeline — the zip bomb guard, VLM integration, and Issues🟡 pptx_utils functions imported inside the wrong try block (
|
|
Solid, well-scoped feature. The PPTX pipeline faithfully mirrors Two minor issues worth fixing before merge, then a handful of nits. Issues Found🟡
|
|
Clean, well-scoped addition that makes PPTX indexing work the way users expect — the existing PDF pipeline carries over naturally (same VLMClient integration, Issues🟡
|
Move `import zipfile` to module-level, surface `pptx_status` in index_document stats, harden test VLM patch with try/finally, use modern type hints, add logger.debug for slide_num, warn on group shape depth limit, and fix txBox → tx_box naming.
|
Solid feature addition with a clean mirror of the existing PDF pipeline. One docstring claim about SummaryThe PPTX path faithfully replicates the PDF pipeline — same Issues Found🟡
|
…y extraction When PowerPoint is installed (Windows), automatically converts PPTX to PDF and feeds it through the existing PDF pipeline -- capturing charts, SmartArt, and visual layout that python-pptx cannot access. Speaker notes are still extracted via python-pptx since they do not appear in PDF renders. Falls back gracefully to python-pptx native extraction when PowerPoint is not available (Linux, Mac, or no Office installed).
|
PPTX extraction lands cleanly with solid test coverage and a thoughtful fallback chain. Two issues need to be addressed before merge: a temp-directory leak on every non-Windows PPTX indexing call, and a path injection risk in the PowerShell COM script. SummaryThe implementation correctly mirrors the PDF pipeline — Issues🟡 Important — Temp directory leaked when conversion returns
|
Builds a test PPTX programmatically (no sensitive files), indexes it through the full pipeline (COM conversion, PDF extraction, VLM, embeddings), then verifies LLM answers contain expected facts (cost, location, team members, speaker notes). Skips when Lemonade is not running.
|
Clean feature addition that mirrors the PDF pipeline precisely — same Issues Found🔴 Critical — Path injection in PowerShell script (
|
itomek
left a comment
There was a problem hiding this comment.
The extraction logic is in good shape and you've already cleared most of the earlier review notes (imports reordered so the text/notes helpers load before the VLM try-block, count_images_in_slide failure now logs, zipfile hoisted to module scope, _iter_shapes depth-skip now warns, slide_num now used). Faithful PDF-pipeline parity, dep declared (python-pptx in both rag extras), docs (chat.mdx/rag.mdx) and the allowlist (files.py/utils.py) all updated consistently, plus 17 unit tests. Approving.
Two items worth fixing here or filing a follow-up for, noted inline — neither leaves the machine, so not blocking, but both are real:
- Temp-dir leak on the non-Windows path: convert_pptx_to_pdf returns None on Linux/macOS, so the unconditional mkdtemp is never cleaned up — an empty gaia_pptx_* dir leaks on every PPTX index. One-line elif fix.
- PowerShell path interpolation in convert_pptx_to_pdf breaks on apostrophe paths; escape single quotes before interpolation.
Plus a multi-line-cell markdown hardening and a missing test for the zip-bomb guard.
One process note per CLAUDE.md: this feeds a new document type into RAG retrieval, so a quick gaia eval agent --category rag_quality against a PPTX-backed scenario would confirm the extracted text actually retrieves well end-to-end, not just that extraction runs.
Generated by Claude Code
There was a problem hiding this comment.
Temp-dir leak on the non-Windows path: tmp_dir = mkdtemp(...) runs unconditionally, but the only shutil.rmtree is inside the if pdf_conversion_path: ... finally. On Linux/macOS convert_pptx_to_pdf returns None, so that branch is skipped and the empty gaia_pptx_* dir leaks on every PPTX index. Add an elif tmp_dir: shutil.rmtree(tmp_dir, ignore_errors=True) after the conversion block. (Worth fixing here or filing an issue — it recurs on the primary path.)
Generated by Claude Code
There was a problem hiding this comment.
convert_pptx_to_pdf interpolates pptx_abs/pdf_abs into single-quoted PowerShell string literals with no escaping. A path containing an apostrophe (valid on Windows, e.g. C:\Users\O'Brien\deck.pptx) closes the literal early; since the path is user-controlled this breaks conversion. Standard fix is doubling single quotes — pptx_abs.replace("'", "''") (and same for pdf_abs) — before interpolation. _table_to_markdown is also worth hardening for multi-line cell text (.replace("\n", "
")), and the zip-bomb guard has no test yet.
Generated by Claude Code
- featureTitle now lists all blocked Office formats (doc, docx, ppt, xls) so the auto-filed feature-request title matches the blocklist - drop the #1224 reference from a test name so the assertion stands alone
Uploading a .pptx told users to "save as PDF first." Now GAIA indexes PowerPoint natively — extracting text from shapes, tables, and speaker notes, with VLM analysis of embedded images when available. The implementation mirrors the existing PDF pipeline (same VLMClient integration,
[Page N]markers, merge strategy, metadata structure) so the downstream chunking and retrieval paths work unchanged.A zip bomb guard checks uncompressed size before opening (500 MB limit), and WMF/EMF metafiles are skipped gracefully since PIL cannot decode them.
Test plan