[SPARK-57135][SQL] Infer CSV schema from tar archives (stacked on #56193) #56254
Open
akshatshenoi-db wants to merge 2 commits into
Adds support for reading CSV files packaged in tar archives (.tar, .tar.gz, .tgz) by streaming each archive entry through the CSV parser without unpacking to disk. Gated behind spark.sql.files.archive.reader.enabled (default false).
Stacked on the archive read PR (apache#56193). Adds schema inference for tar archives: CSVDataSource.inferSchema streams each archive's entries (never unpacked to disk) and merges the result with any non-archive files' inferred schema. Honors ignoreCorruptFiles and ignoreMissingFiles at archive granularity, matching the loose-file path. Gated on the existing spark.sql.files.archive.reader.enabled conf (read from FileSourceOptions.archiveFormatEnabled). Adds inference tests: directory parity, all archive formats agree, corrupt-archive skip, cross-entry type widening, and mixed archive + loose-file inference.
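One way to read "at archive granularity" is that a corrupt archive is skipped as a whole rather than entry by entry. The sketch below shows that interpretation with the Python standard library; it is an assumption about the semantics, not a description of the Spark code, and `read_archive_entries` and its `ignore_corrupt` parameter are hypothetical names.

```python
import io
import tarfile


def read_archive_entries(data, ignore_corrupt):
    """Return [(name, bytes)] for one archive, or [] if it is corrupt and skipped.

    Entries are buffered so that a failure partway through discards the
    whole archive instead of keeping a partial result.
    """
    entries = []
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r|*") as archive:
            for member in archive:
                if member.isfile():
                    entries.append((member.name, archive.extractfile(member).read()))
    except (tarfile.TarError, EOFError, OSError):
        if ignore_corrupt:
            return []  # skip this archive as a unit
        raise
    return entries
```

With `ignore_corrupt=False` the error propagates to the caller, which mirrors how a strict read would fail on the first unreadable archive.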
What changes were proposed in this pull request?
Adds CSV schema inference for tar archives (.tar/.tar.gz/.tgz), building on the
archive read support in #56193. When spark.sql.files.archive.enabled is set and an input
path is a tar archive, CSVDataSource.inferSchema partitions inputs into archives and
non-archives, infers each archive by streaming its entries (entries are tokenized like
standalone CSV files and never unpacked to disk), and merges the result with any non-archive
files' inferred schema. The enablement flag is read from
FileSourceOptions.archiveFormatEnabled (added in the base PR). The config doc is updated to note archives are supported during both
scan and schema inference.
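The partition-then-merge flow described above can be sketched in a few lines of standard-library Python. This is a conceptual illustration only: detecting archives by file suffix, the three-type lattice (int, float, str), and every function name here are assumptions made for the example, not Spark's actual inference rules.

```python
import csv
import io

ARCHIVE_SUFFIXES = (".tar", ".tar.gz", ".tgz")  # formats named in this PR

# Hypothetical widening order for the sketch: int -> float -> str.
_RANK = {"int": 0, "float": 1, "str": 2}


def partition_paths(paths):
    """Split input paths into (archives, loose_files) by file suffix."""
    archives = [p for p in paths if p.lower().endswith(ARCHIVE_SUFFIXES)]
    loose = [p for p in paths if not p.lower().endswith(ARCHIVE_SUFFIXES)]
    return archives, loose


def widen(left, right):
    """Return the wider of two types; None means 'no type seen yet'."""
    if left is None:
        return right
    return left if _RANK[left] >= _RANK[right] else right


def infer_value_type(value):
    """Pick the narrowest type that can represent a single CSV field."""
    for name, cast in (("int", int), ("float", float)):
        try:
            cast(value)
            return name
        except (TypeError, ValueError):
            continue
    return "str"


def infer_schema(csv_text):
    """Infer {column: type} from CSV text that starts with a header row."""
    schema = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            schema[column] = widen(schema.get(column), infer_value_type(value))
    return schema


def merge_schemas(a, b):
    """Merge two schemas column by column, widening on conflict."""
    merged = dict(a)
    for column, typ in b.items():
        merged[column] = widen(merged.get(column), typ)
    return merged


archives, loose = partition_paths(["a.tar", "b.csv", "c.tar.gz", "d.TGZ"])
from_archive_entry = infer_schema("id,score\n1,2\n")
from_loose_file = infer_schema("id,score\n2,3.5\n")
merged = merge_schemas(from_archive_entry, from_loose_file)
```

In this toy run the `score` column is an int in the archive entry and a float in the loose file, so the merged schema widens it to float while `id` stays an int.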
Why are the changes needed?
The archive feature was split into two PRs to keep each reviewable: #56193 adds reading, and this
PR adds schema inference so that
inferSchema/inferSchema=true works for archives the same way it does for a directory of CSV files.
Does this PR introduce any user-facing change?
No. The capability is behind the
spark.sql.files.archive.enabled config (default false, introduced in #56193); this PR only extends that opt-in feature to schema inference.
How was this patch tested?
Added inference parity tests to
CSVArchiveReadBase (run by CSVTarArchiveReadSuite), including a check that all archive formats (.tar/.tar.gz/.tgz) infer the same schema.
Was this patch authored or co-authored using generative AI tooling?
Yes, authored with assistance from generative AI tooling.