[SPARK-57135][SQL] Infer CSV schema from tar archives (stacked on #56193) #56254

Open
akshatshenoi-db wants to merge 2 commits into apache:master from akshatshenoi-db:archive-format-schema-inference

@akshatshenoi-db

What changes were proposed in this pull request?

Stacked on #56193 (CSV tar archive read support). Please review/merge that PR first;
until it merges, this PR's diff will also include its commits. The inference-specific
changes are in the top commit.

Adds CSV schema inference for tar archives (.tar/.tar.gz/.tgz), building on the
archive read support in #56193. When spark.sql.files.archive.enabled is set and an input
path is a tar archive, CSVDataSource.inferSchema partitions inputs into archives vs
non-archives, infers each archive by streaming its entries (entries are tokenized like
standalone CSV files and never unpacked to disk), and merges the result with any non-archive
files' inferred schema. The enablement flag is read from FileSourceOptions.archiveFormatEnabled
(added in the base PR). The config doc is updated to note archives are supported during both
scan and schema inference.
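
To illustrate the flow described above (partition inputs, stream each archive's entries, merge with the loose-file schema), here is a minimal Python sketch using only the standard library. It is not the Spark implementation: every function name is hypothetical, and the three-type widening order (int, float, str) is a simplifying assumption standing in for Spark's CSV type inference.

```python
import csv
import io
import tarfile

ARCHIVE_SUFFIXES = (".tar", ".tar.gz", ".tgz")
# Assumed widening order for this sketch only: int < float < str.
_RANK = {"int": 0, "float": 1, "str": 2}


def infer_type(value):
    """Return the narrowest type name in this sketch that can hold one token."""
    for name, cast in (("int", int), ("float", float)):
        try:
            cast(value)
            return name
        except ValueError:
            continue
    return "str"


def widen(a, b):
    """Pick the wider of two type names."""
    return a if _RANK[a] >= _RANK[b] else b


def merge_schemas(left, right):
    """Merge two {column: type} dicts, widening types on shared columns."""
    merged = dict(left)
    for col, typ in right.items():
        merged[col] = widen(merged[col], typ) if col in merged else typ
    return merged


def infer_csv_stream(text_stream):
    """Infer {column: type} from a header-first CSV text stream."""
    reader = csv.reader(text_stream)
    header = next(reader, None)
    if header is None:
        return {}
    schema = {}
    for row in reader:
        for col, value in zip(header, row):
            typ = infer_type(value)
            schema[col] = widen(schema[col], typ) if col in schema else typ
    for col in header:
        schema.setdefault(col, "str")  # columns with no data rows fall back to str
    return schema


def infer_archive(fileobj):
    """Read each regular tar entry as a CSV stream; nothing is extracted to disk."""
    schema = {}
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            text = io.TextIOWrapper(tar.extractfile(member), encoding="utf-8", newline="")
            schema = merge_schemas(schema, infer_csv_stream(text))
    return schema


def infer_schema(inputs):
    """inputs: list of (name, binary file object). Partition, infer, then merge."""
    archives = [f for n, f in inputs if n.endswith(ARCHIVE_SUFFIXES)]
    loose = [f for n, f in inputs if not n.endswith(ARCHIVE_SUFFIXES)]
    schema = {}
    for f in loose:
        text = io.TextIOWrapper(f, encoding="utf-8", newline="")
        schema = merge_schemas(schema, infer_csv_stream(text))
    for f in archives:
        schema = merge_schemas(schema, infer_archive(f))
    return schema
```

In this sketch the archive path and the loose-file path share the same per-stream inference function, which is what makes the final merge a plain schema union with widening.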

Why are the changes needed?

The archive feature was split into two PRs to keep each reviewable: #56193 adds reading, and this
PR adds schema inference so that inferSchema=true works for archives the same
way it does for a directory of CSV files.

Does this PR introduce any user-facing change?

No. The capability is behind the spark.sql.files.archive.enabled config (default false,
introduced in #56193); this PR only extends that opt-in feature to schema inference.

How was this patch tested?

Added inference parity tests to CSVArchiveReadBase (run by CSVTarArchiveReadSuite):

  • an archive infers the same schema as a directory of the same files;
  • all archive formats (.tar/.tar.gz/.tgz) infer the same schema.
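
The two parity properties listed above can be sketched outside Spark as follows. This is a rough standard-library Python illustration, not the code in CSVArchiveReadBase; the helper names are hypothetical. It relies on the assumption that inference depends only on the tokens read, so identical rows from a directory, a .tar, and a gzip-compressed tar imply identical inferred schemas.

```python
import csv
import io
import tarfile


def pack(entries, compressed=False):
    """Build an in-memory tar (optionally gzip-compressed) from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz" if compressed else "w") as tar:
        for name, data in entries.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def rows_from_archive(fileobj):
    """Tokenize every regular entry of a tar archive as CSV, keyed by entry name."""
    rows = {}
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:  # "r:*" detects gzip
        for member in tar:
            if member.isfile():
                text = io.TextIOWrapper(
                    tar.extractfile(member), encoding="utf-8", newline=""
                )
                rows[member.name] = list(csv.reader(text))
    return rows


def rows_from_directory(entries):
    """Tokenize the same files as if they were loose CSV files in a directory."""
    return {
        name: list(csv.reader(io.StringIO(data.decode("utf-8"), newline="")))
        for name, data in entries.items()
    }
```

A parity check then compares rows_from_directory(entries) with rows_from_archive(pack(entries)) and rows_from_archive(pack(entries, compressed=True)); all three should be equal.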

Was this patch authored or co-authored using generative AI tooling?

Yes, authored with assistance from generative AI tooling.

Adds support for reading CSV files packaged in tar archives (.tar, .tar.gz, .tgz) by streaming each archive entry through the CSV parser without unpacking to disk. Gated behind spark.sql.files.archive.reader.enabled (default false).
@akshatshenoi-db force-pushed the archive-format-schema-inference branch from bc9d61f to 32ef4c9 on June 5, 2026 21:23
Stacked on the archive read PR (apache#56193). Adds schema inference for tar archives:
CSVDataSource.inferSchema streams each archive's entries (never unpacked to disk) and
merges the result with any non-archive files' inferred schema. Honors ignoreCorruptFiles
and ignoreMissingFiles at archive granularity, matching the loose-file path. Gated on the
existing spark.sql.files.archive.reader.enabled conf (read from
FileSourceOptions.archiveFormatEnabled).

Adds inference tests: directory parity, all archive formats agree, corrupt-archive skip,
cross-entry type widening, and mixed archive + loose-file inference.
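
As a rough illustration of handling corrupt inputs at archive granularity, the standard-library Python sketch below skips a whole archive when it cannot be read and the flag is set, and re-raises otherwise. The function names and the ignore_corrupt_files parameter are hypothetical stand-ins for the Spark option. Discarding an archive's partial result on failure is an assumption of this sketch, not a statement about the PR's behavior. For brevity the "schema" here is just the union of header column names.

```python
import csv
import io
import tarfile


def archive_columns(fileobj):
    """Collect header column names from every regular entry of one tar archive."""
    columns = []
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            text = io.TextIOWrapper(
                tar.extractfile(member), encoding="utf-8", newline=""
            )
            for name in next(csv.reader(text), []):
                if name not in columns:
                    columns.append(name)
    return columns


def infer_columns(archives, ignore_corrupt_files=False):
    """archives: list of (name, binary file object). Returns (columns, skipped)."""
    merged, skipped = [], []
    for name, fileobj in archives:
        try:
            cols = archive_columns(fileobj)
        except (tarfile.TarError, EOFError, OSError):
            if not ignore_corrupt_files:
                raise
            skipped.append(name)  # the whole archive is dropped, not single entries
            continue
        for col in cols:
            if col not in merged:
                merged.append(col)
    return merged, skipped
```

The try block wraps the entire archive read, so a failure in any entry discards that archive's contribution while the other inputs still feed the merged result.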
@akshatshenoi-db force-pushed the archive-format-schema-inference branch from 32ef4c9 to d457830 on June 5, 2026 22:01