[SPARK-57135][SQL] Infer CSV schema from tar archives (stacked on #56193) by akshatshenoi-db · Pull Request #56254 · apache/spark

akshatshenoi-db · 2026-06-01T21:45:59Z

What changes were proposed in this pull request?

Stacked on #56193 (CSV tar archive read support). Please review/merge that PR first;
until it merges, this PR's diff will also include its commits. The inference-specific
changes are in the top commit.

Adds CSV schema inference for tar archives (.tar/.tar.gz/.tgz), building on the
archive read support in #56193. When spark.sql.files.archive.enabled is set and an input
path is a tar archive, CSVDataSource.inferSchema partitions inputs into archives vs
non-archives, infers each archive by streaming its entries (entries are tokenized like
standalone CSV files and never unpacked to disk), and merges the result with any non-archive
files' inferred schema. The enablement flag is read from FileSourceOptions.archiveFormatEnabled
(added in the base PR). The config doc is updated to note archives are supported during both
scan and schema inference.
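The flow described above (partition inputs into archives and loose files, read each archive entry in place, merge the per-source schemas) can be sketched in a few lines. This is only an illustration using Python's standard library, not the Scala code in this PR: the function names, the `ARCHIVE_SUFFIXES` tuple, and the three-step `int -> float -> str` widening order are all assumptions made for the example and are much simpler than Spark's CSV type inference.

```python
import csv
import io
import tarfile

# Hypothetical names and a toy type lattice, chosen for this sketch only.
ARCHIVE_SUFFIXES = (".tar", ".tar.gz", ".tgz")
_RANK = {"int": 0, "float": 1, "str": 2}


def infer_value(token):
    """Return the narrowest sketch type that can hold one CSV token."""
    for name, cast in (("int", int), ("float", float)):
        try:
            cast(token)
            return name
        except ValueError:
            continue
    return "str"


def merge_schemas(left, right):
    """Merge two {column: type} dicts, widening columns that disagree."""
    merged = dict(left)
    for column, kind in right.items():
        if column in merged:
            merged[column] = max(merged[column], kind, key=_RANK.get)
        else:
            merged[column] = kind
    return merged


def infer_csv_stream(text_stream):
    """Infer {column: type} from one CSV stream whose first row is a header."""
    reader = csv.reader(text_stream)
    header = next(reader, None)
    if header is None:
        return {}
    schema = {}
    for row in reader:
        row_schema = {c: infer_value(v) for c, v in zip(header, row)}
        schema = merge_schemas(schema, row_schema)
    # Columns that never had a value fall back to "str" in this sketch.
    return {c: schema.get(c, "str") for c in header}


def infer_archive(fileobj):
    """Infer a merged schema from every regular file in a tar archive.

    Entries are read directly from the archive; nothing is extracted to disk.
    """
    schema = {}
    # "r:*" lets tarfile detect the compression (none or gzip here).
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue
            with io.TextIOWrapper(
                archive.extractfile(member), encoding="utf-8", newline=""
            ) as text:
                schema = merge_schemas(schema, infer_csv_stream(text))
    return schema


def infer_paths(paths):
    """Partition paths into archives and loose CSV files, then merge schemas."""
    archives = [p for p in paths if p.endswith(ARCHIVE_SUFFIXES)]
    loose = [p for p in paths if not p.endswith(ARCHIVE_SUFFIXES)]
    schema = {}
    for path in archives:
        with open(path, "rb") as f:
            schema = merge_schemas(schema, infer_archive(f))
    for path in loose:
        with open(path, newline="", encoding="utf-8") as f:
            schema = merge_schemas(schema, infer_csv_stream(f))
    return schema
```

Calling `infer_paths` with a mix of archive and loose-file paths returns a single schema in which a column seen with different types is widened to the broader one.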

Why are the changes needed?

The archive feature was split into two PRs to keep each reviewable: #56193 adds reading, and
this PR adds schema inference so that inferSchema=true works for archives the same
way it does for a directory of CSV files.

Does this PR introduce any user-facing change?

No. The capability is behind the spark.sql.files.archive.enabled config (default false,
introduced in #56193); this PR only extends that opt-in feature to schema inference.

How was this patch tested?

Added inference parity tests to CSVArchiveReadBase (run by CSVTarArchiveReadSuite):

- an archive infers the same schema as a directory of the same files;
- all archive formats (.tar/.tar.gz/.tgz) infer the same schema.
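The shape of these parity checks can be illustrated outside Spark. The following is a rough standard-library Python sketch, not the Scala tests in CSVArchiveReadBase: `check_parity` and the other helper names are hypothetical, and the `int`/`float`/`str` inference is a stand-in for Spark's real CSV type inference. It writes the same CSV files to a directory, packs them as `.tar`, `.tar.gz`, and `.tgz`, and asserts that every archive yields the schema inferred from the directory.

```python
import csv
import io
import os
import tarfile
import tempfile


def infer_type(token):
    """Narrowest of int/float/str for one token (sketch-only type lattice)."""
    for name, cast in (("int", int), ("float", float)):
        try:
            cast(token)
            return name
        except ValueError:
            pass
    return "str"


def infer_rows(lines, schema):
    """Fold one headered CSV stream into schema, widening int -> float -> str."""
    order = ["int", "float", "str"]
    reader = csv.reader(lines)
    header = next(reader, [])
    for row in reader:
        for column, token in zip(header, row):
            seen = schema.get(column, "int")
            schema[column] = max(seen, infer_type(token), key=order.index)
    return schema


def infer_directory(directory):
    """Infer a schema from every CSV file in a directory."""
    schema = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        with open(path, newline="", encoding="utf-8") as f:
            infer_rows(f, schema)
    return schema


def infer_tar(path):
    """Infer a schema from a tar archive's entries without extracting them."""
    schema = {}
    with tarfile.open(path, "r:*") as archive:  # compression is auto-detected
        for member in archive:
            if member.isfile():
                with io.TextIOWrapper(
                    archive.extractfile(member), encoding="utf-8", newline=""
                ) as text:
                    infer_rows(text, schema)
    return schema


def check_parity(files):
    """files maps entry name to CSV text; returns the schema if parity holds."""
    with tempfile.TemporaryDirectory() as tmp:
        loose_dir = os.path.join(tmp, "loose")
        os.mkdir(loose_dir)
        for name, text in files.items():
            path = os.path.join(loose_dir, name)
            with open(path, "w", encoding="utf-8", newline="") as f:
                f.write(text)
        expected = infer_directory(loose_dir)
        formats = ((".tar", "w"), (".tar.gz", "w:gz"), (".tgz", "w:gz"))
        for suffix, mode in formats:
            archive_path = os.path.join(tmp, "data" + suffix)
            with tarfile.open(archive_path, mode) as archive:
                archive.add(loose_dir, arcname="data")
            assert infer_tar(archive_path) == expected, suffix
        return expected
```

Using the directory result as the expected value keeps the check independent of any particular schema: the archives only have to agree with the loose-file path, which is the property the tests above describe.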

Was this patch authored or co-authored using generative AI tooling?

Yes, authored with assistance from generative AI tooling.

Adds support for reading CSV files packaged in tar archives (.tar, .tar.gz, .tgz) by streaming each archive entry through the CSV parser without unpacking to disk. Gated behind spark.sql.files.archive.reader.enabled (default false).

Stacked on the archive read PR (apache#56193). Adds schema inference for tar archives: CSVDataSource.inferSchema streams each archive's entries (never unpacked to disk) and merges the result with any non-archive files' inferred schema. Honors ignoreCorruptFiles and ignoreMissingFiles at archive granularity, matching the loose-file path. Gated on the existing spark.sql.files.archive.reader.enabled conf (read from FileSourceOptions.archiveFormatEnabled). Adds inference tests: directory parity, all archive formats agree, corrupt-archive skip, cross-entry type widening, and mixed archive + loose-file inference.

akshatshenoi-db force-pushed the archive-format-schema-inference branch from bc9d61f to 32ef4c9 Compare June 5, 2026 21:23

akshatshenoi-db force-pushed the archive-format-schema-inference branch from 32ef4c9 to d457830 Compare June 5, 2026 22:01

akshatshenoi-db wants to merge 2 commits into apache:master from akshatshenoi-db:archive-format-schema-inference
