feat: add review-datafusion-pr Claude Code skill#3974

Open
andygrove wants to merge 3 commits into apache:main from andygrove:feat/review-datafusion-pr-skill

Conversation

@andygrove
Member

Which issue does this PR close?

Closes #.

Rationale for this change

We already have a review-comet-pr skill that helps reviewers check PRs in this repo for Spark compatibility and implementation correctness. A similar workflow applies when reviewing PRs in the upstream apache/datafusion repository, particularly for the datafusion-spark compatible function library and for core DataFusion changes that may affect Comet.

The upstream repo has a different test approach. It uses .slt (sqllogictest) files written in DataFusion SQL syntax, so the tests cannot be run directly in Spark. A reviewer needs to manually run equivalent queries in Spark to verify that the DataFusion implementation produces the same result.
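For context, a sqllogictest case pairs each query with its expected output after a `----` separator. A minimal sketch, using `expm1` as an illustrative function (the exact result formatting should be verified against the existing files under datafusion/sqllogictest/test_files/spark/):

```
# sqllogictest (.slt) syntax: result type signature, query, `----`, expected rows
query R
SELECT expm1(1::DOUBLE);
----
1.718281828459045
```

Because this runs only against DataFusion, nothing in the file itself proves Spark produces the same value — hence the manual cross-check step below.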

This skill packages that workflow so it is consistent across reviews and so new reviewers have a concrete checklist to follow.

What changes are included in this PR?

Adds a new Claude Code skill at .claude/skills/review-datafusion-pr/SKILL.md. The skill covers:

  • PR classification into a Spark expression track, a Comet API impact track, or both
  • Reading the Spark source and Spark tests as the canonical reference for expression behavior
  • Reviewing the Rust implementation under datafusion/spark/src/function/
  • Reviewing the .slt test file against the testing guide in datafusion/sqllogictest/test_files/spark/README.md
  • A manual Spark cross-check step with translation notes from DataFusion SQL to Spark SQL, since .slt tests cannot prove Spark equivalence on their own
  • A checklist for breaking API changes in the DataFusion crates that Comet depends on (datafusion, datafusion-datasource, datafusion-physical-expr-adapter, datafusion-spark)
  • CI status, documentation, and common review issues
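As one concrete example of the translation step, the same check expressed in each dialect might look like this (the function choice is illustrative; Spark SQL accepts `CAST(... AS ...)` everywhere, while the `.slt` files typically use the `::` shorthand):

```sql
-- DataFusion SQL, as it would appear in an .slt file
SELECT expm1(1::DOUBLE);

-- Spark SQL equivalent, run manually in spark-sql or spark-shell to cross-check
SELECT expm1(CAST(1 AS DOUBLE));
```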

How are these changes tested?

Manual review of the skill content. The skill is guidance for human reviewers and is not executed by CI.

@andygrove andygrove marked this pull request as ready for review April 17, 2026 13:23
@parthchandra
Contributor

Would the DataFusion repo be a better place for this? (I've no objection to having it in the Comet repo too).

@andygrove
Member Author

> Would the DataFusion repo be a better place for this? (I've no objection to having it in the Comet repo too).

It's a question worth discussing. The skill is reviewing specifically for any impact on the Comet project.

@parthchandra
Contributor

> > Would the DataFusion repo be a better place for this? (I've no objection to having it in the Comet repo too).
>
> It's a question worth discussing. The skill is reviewing specifically for any impact on the Comet project.

My point exactly. DF reviewers may not always remember to consider the impact on Comet, and this would certainly be useful for them and invaluable for us.

- [ ] Both scalar and array inputs are exercised (the README requires this)
- [ ] All accepted Spark input types are tested with explicit casts (`0::INT`, `0::BIGINT`, etc.) — DataFusion and Spark do not infer types the same way
- [ ] Null input is tested
- [ ] Edge cases: empty string, boundary values (e.g., `INT_MIN`), `NaN`, `Infinity`, `-0.0`, negative values for numeric functions
Contributor


Suggested change

Before:
- [ ] Edge cases: empty string, boundary values (e.g., `INT_MIN`), `NaN`, `Infinity`, `-0.0`, negative values for numeric functions

After:
- [ ] Edge cases: empty string, boundary values (e.g., `INT_MIN`), `NaN`, `Infinity`, `-0.0`, `-Infinity`, `+0.0`, negative values for numeric functions
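A sketch of what the cast and edge-case coverage could look like in practice, again using `expm1` as an illustrative function (the expected values follow IEEE 754 semantics; whether integer inputs are accepted, and the exact result formatting, should be checked against the existing spark test files):

```
# explicit cast per accepted input type
query R
SELECT expm1(0::INT);
----
0

# null propagation
query R
SELECT expm1(NULL);
----
NULL

# an IEEE special value: expm1(-Infinity) = -1
query R
SELECT expm1('-Infinity'::DOUBLE);
----
-1
```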

- [ ] Edge cases: empty string, boundary values (e.g., `INT_MIN`), `NaN`, `Infinity`, `-0.0`, negative values for numeric functions
- [ ] ANSI mode behavior is wrapped in `set datafusion.execution.enable_ansi_mode = true/false` pairs where Spark differs between modes
- [ ] Test only contains `SELECT` statements for the function under test, with no unrelated setup
- [ ] Header comments cite the upstream source if ported (the existing files show the pattern)
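The ANSI-mode pairing in the checklist might be structured along these lines (a sketch only; the setting name comes from the checklist item, and the queries themselves depend on the function under review):

```
statement ok
set datafusion.execution.enable_ansi_mode = true;

# queries exercising the ANSI-on behavior go here

statement ok
set datafusion.execution.enable_ansi_mode = false;

# the same queries, with the ANSI-off expectations, go here
```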
Contributor


For nested types, empty values should also be tested where applicable, e.g. an empty array, an empty map, and a mix of empty and non-empty entries.
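To make that concrete, the extra rows could look something like this (the function names are placeholders; whichever nested-type function is under review would be substituted):

```sql
-- empty array and empty map inputs
SELECT some_array_fn(array());
SELECT some_map_fn(map());

-- a mix of empty and non-empty entries
SELECT some_array_fn(x) FROM (VALUES (array()), (array(1, 2))) AS t(x);
```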

