[SPARK-57135][SQL] Support reading CSV files inside tar archives by akshatshenoi-db · Pull Request #56193 · apache/spark

akshatshenoi-db · 2026-05-28T21:29:56Z

What changes were proposed in this pull request?

Adds support for reading CSV files packaged in tar archives (.tar, .tar.gz, .tgz) directly through the CSV data source, by streaming each archive entry through the CSV parser without unpacking it to disk. Gated behind a new config spark.sql.files.archive.reader.enabled (default false).

ArchiveReader (new): a small streaming core. ArchiveReader(path) selects an implementation by file extension, and readEntries(conf)(parseEntry) opens the archive once, hands each non-skipped entry to parseEntry as a bounded, non-closing InputStream, and concatenates the per-entry results into a single iterator. It advances to the next entry only after the current one is fully consumed, so at most one entry is in flight and memory stays bounded regardless of archive size. Directories and dot-prefixed entries (macOS ._*, .DS_Store, ...) are skipped; the stream is closed on exhaustion, on close(), and (defensively) on task completion. ArchiveReader is an abstract base; TarArchiveReader is the only implementation today. .tar.gz is auto-decompressed by Hadoop's codec factory; .tgz (not a registered codec extension) is unwrapped with GZIPInputStream.
CSVFileFormat: archives are non-splittable (isSplitable returns false), so each archive is read as a single split; buildReader streams every entry through UnivocityParser (parseStream for multiLine, otherwise parseIterator over a LineReader-backed line iterator). Each entry is treated as the start of its own file, so headers are validated and dropped per entry, exactly as for standalone CSV files.
CSVDataSource: a readArchive path streams entries through the same per-entry parser / header-checker construction used for a standalone CSV read. It lives on the V1 CSVFileFormat read path only; the V2 file data source calls readFile directly and is intentionally left untouched.
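The one-entry-at-a-time flow described above can be sketched as follows. This is a minimal illustration in Python using the standard tarfile module, not the PR's Scala code; read_entries and parse_csv_entry are hypothetical stand-ins for readEntries and the per-entry parser, and the sketch assumes UTF-8 CSV entries.

```python
import csv
import tarfile


def read_entries(archive, parse_entry):
    """Yield parsed rows from each data entry of a tar stream, one entry at a time."""
    # "r|*" opens the archive as a forward-only stream and auto-detects gzip,
    # so entries are never unpacked to disk.
    with tarfile.open(fileobj=archive, mode="r|*") as tar:
        for member in tar:
            basename = member.name.rsplit("/", 1)[-1]
            # Skip directories and dot-prefixed entries (._*, .DS_Store, ...).
            if not member.isfile() or basename.startswith("."):
                continue
            entry = tar.extractfile(member)
            # The generator moves on to the next entry only after the caller
            # has drained the rows of the current one.
            yield from parse_entry(member.name, entry)


def parse_csv_entry(name, stream):
    """Treat each entry as its own CSV file: consume the header, then emit rows."""
    # Hypothetical per-entry parser; decodes the entry line by line.
    lines = (line.decode("utf-8") for line in stream)
    reader = csv.reader(lines)
    header = next(reader, None)
    for row in reader:
        yield dict(zip(header, row))
```

Because read_entries is a generator, at most one entry is open at a time and nothing is buffered beyond the row currently being parsed, which is the property the description relies on for bounded memory.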

The streaming approach avoids local disk entirely; the trade-off is that it only supports formats parseable from a sequential stream, so this PR scopes the feature to CSV over tar. Formats that need random access within a file (Parquet/ORC footers) cannot stream from a tar and are out of scope.

The ArchiveReader abstraction -- extension-dispatched apply, one subclass per archive format, and a format-agnostic lineIterator -- is a deliberate seam: other file formats (e.g. JSON, text, XML) and other archive formats are intended to be added later as additive subclasses/bindings, without reworking this core.

This change was reviewed by Alden Lau on the ingestion core team.

Why are the changes needed?

A common ingestion pattern packs many small CSV files into tar archives to reduce file/namespace pressure on object stores and HDFS. Today these cannot be read without unpacking them externally first. This lets users point the CSV reader directly at a tar archive. Streaming (rather than materializing entries to local disk) keeps the read bounded in memory and adds no local-disk requirement.

Does this PR introduce any user-facing change?

Yes. A new config spark.sql.files.archive.reader.enabled (default false) is added. When enabled, the CSV data source reads .tar/.tar.gz/.tgz paths by streaming their entries during a scan. With the default false, behavior is unchanged.

How was this patch tested?

New tests:

ArchiveReaderSuite (unit): isArchivePath dispatch and readEntries -- entry ordering, gzip handling (.tar.gz and .tgz), directory/dotfile skipping, lazy one-entry-at-a-time advance, the non-closing entry stream, idempotent close(), and TaskContext cleanup.
End-to-end CSV reads of .tar/.tar.gz/.tgz through the data source, asserting parity with reading the same entries as loose files in a directory. The format- and archive-agnostic harness (ArchiveReadSuiteBase + TarArchiveReadBase) is bound to CSV by CSVArchiveReadBase, split into header (CSVHeaderTarArchiveReadSuite) and headerless (CSVHeaderlessTarArchiveReadSuite) suites so the shared tests run in both modes. Coverage includes multi-entry reads, column pruning, a mixed archive/loose partitioned layout, empty archives, single-partition splittability, ignoreCorruptFiles, mismatched headers, custom delimiter, and multiline quoted fields.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8)

HyukjinKwon · 2026-05-28T22:00:18Z

Don't we already support compression codecs in CSV, JSON, and text? I think we should rather add an option there instead of introducing a new data source.

pan3793 · 2026-05-29T04:58:06Z

In addition to gzip tarballs, can it be extended to support other codecs? At least, I think zstd should be supported; a similar request was raised on the Hadoop dev list recently:

https://lists.apache.org/thread/ntlx40h3vn6k7q3y5qf22vm815nw8lkz

HyukjinKwon · 2026-06-01T22:37:08Z

@akshatshenoi-eng I think we should support all of this in Text and JSON as well, sharing the same codebase. Also, do you mind explaining how it's going to work? E.g., if a partitioned table is all tar-gzipped, would Spark recognize the structure? Or would you read them all into a single DataFrame?

In addition, how do we handle the physical partitions? Would we distribute them quite well?

akshatshenoi-db · 2026-06-01T23:17:24Z

Text and JSON support are both planned. I made ArchiveReader and streamArchiveEntries format-agnostic, so adding that support should be straightforward, since both have stream-based parsers (same as CSV). Parquet, ORC, Avro, XML, and Excel are also planned; I'm still figuring out Parquet, since it can't be streamed the way CSV can. I just wanted to start with CSV to validate the streaming design end-to-end before scaling to other formats.

Spark recognizes the partition structure correctly. Partition discovery happens at the directory level, independent of file format. If the layout is:

s3://bucket/dt=2024-01-01/data.tar.gz
s3://bucket/dt=2024-01-02/data.tar.gz

each archive becomes a PartitionedFile with its partition values already attached (dt=2024-01-01, etc.). When the archive is streamed, every row produced from its entries inherits those partition values automatically.

Each archive is a single Spark partition because tar is a sequential stream (isSplitable returns false, so Spark can't carve it into byte-range splits). The distribution across executors scales with the number of archive files: 10 archives → 10 tasks, which distribute across the cluster normally. The current limitation is that a single large archive isn't parallelized, but handling that is also on the roadmap.
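To illustrate how rows streamed from an archive can inherit directory-level partition values, here is a simplified Python sketch. It is not Spark's partition discovery (which also handles base paths and type inference); partition_values and rows_with_partitions are hypothetical helper names.

```python
def partition_values(path):
    """Extract Hive-style key=value directory segments from a file path."""
    values = {}
    # Only directory segments carry partition values; the file name is excluded.
    for segment in path.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


def rows_with_partitions(path, rows):
    """Attach the archive's partition values to every row streamed from it."""
    parts = partition_values(path)
    for row in rows:
        yield {**row, **parts}
```

For the layout above, every row read from s3://bucket/dt=2024-01-01/data.tar.gz would carry dt=2024-01-01, regardless of which entry inside the archive produced it.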

Sorry for anything else that may be vague or not yet implemented. My intern project is enabling multi-file archive read support for tar, tar.gz, zip, and 7z.

cloud-fan

1 blocking, 3 non-blocking, 1 nit. Reusable streaming core; the one blocker is framework completeness (schema inference), not direction — CSV-only is fine for now.

Design / architecture (1)

CSVFileFormat.scala:49: the core (ArchiveReader/lineIterator) is already reusable; only the per-FileFormat gating is copy-paste — consider lifting into a shared TextBasedFileFormat trait — see inline

Correctness (3)

CSVFileFormat.scala:140: schema inference isn't archive-aware — inferring from a .tar without a schema reads raw tar bytes; solve once in the reusable layer — see inline
CSVDataSource.scala:117: ignoreCorruptFiles is archive-granular (whole archive skipped vs per loose file) — see inline
ArchiveReader.scala:154: shouldSkipEntry skips only .-prefixed, not _-prefixed (e.g. _SUCCESS) — see inline

Nits: 1 minor item (see inline comments).

PR description suggestions

Document that schema inference isn't supported for archives (an explicit schema is required).
Document that ignoreCorruptFiles is archive-granular (a corrupt archive is skipped whole).

cloud-fan · 2026-06-05T03:50:27Z

    val parsedOptions = getCsvOptions(sparkSession, options)
+    // A tar archive is decompressed/unpacked as a sequential stream, so it must be read as a
+    // single split rather than carved into byte ranges.
+    if (parsedOptions.archiveFormatEnabled && ArchiveReader.isArchivePath(path)) {


On reusability for later formats (per the offline discussion — CSV-only now is fine): the reusable core is in good shape. ArchiveReader/TarArchiveReader/lineIterator live in execution.datasources and are format-agnostic, so a future JsonDataSource.readArchive can call ArchiveReader(path).readEntries(conf) { (name, in) => ... } directly, and archiveFormatEnabled already sits on the shared FileSourceOptions.

The one piece that isn't general is this per-FileFormat gating — the isSplitable archive guard here plus the buildReader archive-vs-normal branch. JsonFileFormat/TextFileFormat have structurally identical isSplitable/buildReader today, so they'd copy these verbatim. Optionally lift the gating into a shared helper/trait on TextBasedFileFormat (e.g. ArchiveReadableFileFormat) so a new format supplies only its per-entry parser. Minor; the bigger reuse gap is schema inference (see the comment at the archive branch in buildReader).

cloud-fan · 2026-06-05T03:50:27Z

+      // A tar archive (always a single split, see `isSplitable`) is streamed entry by entry when
+      // archive reads are enabled; otherwise the file is parsed directly.
+      if (parsedOptions.archiveFormatEnabled && ArchiveReader.isArchivePath(file.toPath)) {
+        CSVDataSource(parsedOptions).readArchive(


Schema inference is not archive-aware — the key framework-completeness gap. inferSchema → CSVDataSource.inferSchema → infer reads raw file bytes (TextFileFormat for non-multiline, BinaryFileRDD for multiline); none go through ArchiveReader. So with the config enabled and no user-supplied schema (spark.read.csv("data.tar")), inference reads tar headers/bytes as CSV and yields a garbage schema rather than erroring. Every test passes .schema(...) explicitly, so this path is untested, and the description doesn't state a schema is required.

This matters most for reuse: JsonDataSource/TextFileFormat infer through the same createBaseDataset/BinaryFileRDD raw-bytes path, so leaving it unsolved means every future format re-hits (or re-fixes) the same bug. Worth solving once in the reusable layer — a shared "infer from the first archive entry via ArchiveReader" path, or at minimum a shared clear "schema required for archive reads" error — plus a test and a description note.
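The difference between the two inference paths can be sketched in Python with the standard tarfile module. This is only an illustration of the idea, not Spark's inference code; both function names are hypothetical, and "schema" is reduced here to the CSV header of the first non-empty entry.

```python
import csv
import tarfile


def infer_header_from_archive(archive):
    """Return the CSV header of the first non-empty data entry, or None."""
    with tarfile.open(fileobj=archive, mode="r|*") as tar:
        for member in tar:
            basename = member.name.rsplit("/", 1)[-1]
            if not member.isfile() or basename.startswith("."):
                continue
            line = tar.extractfile(member).readline()
            if line:
                return next(csv.reader([line.decode("utf-8")]))
    return None


def infer_header_from_raw_bytes(archive):
    """The archive-unaware path: treat the first raw line of the file as CSV."""
    # latin-1 never fails to decode, so the tar header bytes leak into the result.
    first_line = archive.readline().decode("latin-1")
    return first_line.rstrip("\n").split(",")
```

On an uncompressed tar, the second function returns column names polluted with the 512-byte tar header of the first entry instead of raising, which mirrors the "garbage schema rather than erroring" behavior described above.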

cloud-fan · 2026-06-05T03:50:27Z

+      getHeaderChecker: (Boolean, String) => CSVHeaderChecker)(
+      parseEntry: (UnivocityParser, CSVHeaderChecker, InputStream) => Iterator[InternalRow])
+    : Iterator[InternalRow] = {
+    ArchiveReader(file.toPath).readEntries(conf) { (entryName, in) =>


ignoreCorruptFiles ends up archive-granular here. An archive is a single non-splittable PartitionedFile, so FileScanRDD's per-file corrupt handling skips the whole archive on any throw (corrupt gzip/tar, an IO error mid-entry, or a FAILFAST malformed record), whereas a directory of loose files skips only the bad file and keeps the rest. The PR claims directory parity, but the corrupt test only covers a whole-corrupt archive — not a bad entry among good ones. Worth a test that asserts the actual behavior and a one-line note that corrupt handling is archive-granular.
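A toy Python model of the granularity difference, under the assumption that per-file corrupt handling abandons the rest of a file on the first error and moves on to the next file. It does not reproduce FileScanRDD; scan, good, corrupt, and as_archive are all hypothetical names.

```python
from itertools import chain


def scan(files):
    """Toy per-file corrupt handling: on an error, skip the rest of that file only."""
    rows = []
    for read_file in files:
        try:
            for row in read_file():
                rows.append(row)
        except ValueError:
            # Give up on this file and continue with the next one.
            continue
    return rows


def good(*values):
    """A readable file (or archive entry) that yields the given rows."""
    return lambda: iter(values)


def corrupt():
    """A file (or archive entry) that fails as soon as it is read."""
    raise ValueError("corrupt entry")
    yield  # unreachable; makes this function a generator


def as_archive(*entries):
    """Model an archive: a single file whose rows are its entries' rows, concatenated."""
    return lambda: chain.from_iterable(entry() for entry in entries)


loose_rows = scan([good(1, 2), corrupt, good(3)])
archive_rows = scan([as_archive(good(1, 2), corrupt, good(3))])
```

In this model the loose-file scan keeps the rows of the good file that follows the corrupt one, while the archive scan loses them because the whole archive is one unit of failure. A test with a bad entry among good ones would pin down which rows the real implementation retains.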

cloud-fan · 2026-06-05T03:50:27Z

+    if (entry.isDirectory) return true
+    val name = entry.getName
+    val basename = name.substring(name.lastIndexOf('/') + 1)
+    basename.startsWith(".")


shouldSkipEntry skips only .-prefixed entries, but Spark's loose-file listing filters both .- and _-prefixed names via HadoopFSUtils.shouldFilterOutPathName (InMemoryFileIndex). So an entry named _SUCCESS / _committed_* inside an archive is read as data, breaking the "parse like a directory of the same files" parity the suite asserts. Mirror the _ filter (ideally reuse shouldFilterOutPathName). Good news: this lives in the shared TarArchiveReader, so the fix benefits every future format for free.
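A minimal Python sketch of the suggested filter, checking both prefixes on the entry's basename. It is deliberately simplified: Spark's shouldFilterOutPathName may apply additional special cases that are not reproduced here, and should_skip_entry is a hypothetical name.

```python
def should_skip_entry(name, is_directory=False):
    """Skip directories and entries whose basename starts with '.' or '_'."""
    if is_directory:
        return True
    # Only the last path segment decides; parent directory names are ignored.
    basename = name[name.rfind("/") + 1:]
    return basename.startswith((".", "_"))
```

With this rule, entries such as data/_SUCCESS and data/.DS_Store are skipped, while data/part-0.csv is read.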

cloud-fan · 2026-06-05T03:50:27Z

+    val newline = options.lineSeparatorInRead.getOrElse(
+      Array(options.asParserSettings.getFormat.getNormalizedNewline.toByte))
+    ArchiveReader.lineIterator(in, options.lineSeparatorInRead).map { line =>
+      line.append(newline, 0, newline.length)


Nit: the comment says the trailing newline is appended "so UnivocityParser does not raise EOF on the final line," but the non-archive readFile path feeds lines to the same parseIterator (via HadoopFileLinesReader) without re-appending a terminator. So either the EOF claim is inaccurate (and this append is unnecessary), or there's a subtlety worth spelling out. The single-byte getNormalizedNewline append also diverges from readFile for non-UTF-8 multi-byte charsets. Could you double-check — if the append isn't needed, dropping it keeps the archive path identical to readFile.

akshatshenoi-db force-pushed the archive-format branch from 3f8d192 to e31d86a Compare May 29, 2026 19:00

akshatshenoi-db changed the title ~~[SPARK-57135][SQL] Add ArchiveFormat for reading .tar/.tar.gz/.tgz archives as files~~ [SPARK-57135][SQL] Support reading CSV files inside tar archives May 29, 2026

akshatshenoi-db force-pushed the archive-format branch 2 times, most recently from 99b7166 to 670e233 Compare May 29, 2026 21:27

akshatshenoi-db mentioned this pull request Jun 1, 2026

[SPARK-57135][SQL] Infer CSV schema from tar archives (stacked on #56193) #56254

Open

akshatshenoi-db force-pushed the archive-format branch from 4cb725a to 7ebfffb Compare June 4, 2026 21:00

akshatshenoi-db force-pushed the archive-format branch from 7ebfffb to 8ec5f07 Compare June 4, 2026 23:48

cloud-fan reviewed Jun 5, 2026

akshatshenoi-db wants to merge 1 commit into apache:master from akshatshenoi-db:archive-format

4 participants
