Commit 076b738

docs: Update README and benchmark results for 0.15.0 release (#3995)

1 parent dc16751

34 files changed: 1039 additions & 1878 deletions

README.md (51 additions & 55 deletions)
````diff
@@ -40,75 +40,76 @@ Apache DataFusion Comet is a high-performance accelerator for Apache Spark, buil
 performance of Apache Spark workloads while leveraging commodity hardware and seamlessly integrating with the
 Spark ecosystem without requiring any code changes.
 
-Comet also accelerates Apache Iceberg, when performing Parquet scans from Spark.
+**Comet provides a 2x speedup for TPC-H @ 1TB, resulting in 50% cost savings.**
 
-[Apache DataFusion]: https://datafusion.apache.org
-
-# Benefits of Using Comet
-
-## Run Spark Queries at DataFusion Speeds
-
-Comet delivers a performance speedup for many queries, enabling faster data processing and shorter time-to-insights.
-
-The following chart shows the time it takes to run the 22 TPC-H queries against 100 GB of data in Parquet format
-using a single executor with 8 cores. See the [Comet Benchmarking Guide](https://datafusion.apache.org/comet/contributor-guide/benchmarking.html)
-for details of the environment used for these benchmarks.
-
-When using Comet, the overall run time is reduced from 687 seconds to 302 seconds, a 2.2x speedup.
-
-![](docs/source/_static/images/benchmark-results/0.11.0/tpch_allqueries.png)
+That 2x speedup gives you a choice: finish the same Spark workload in half the time on the cluster you already have,
+or match your current Spark performance on roughly half the resources. Either way, the gain translates directly into
+lower cloud bills, reduced on-prem capacity, and lower energy usage, with no changes to your existing Spark SQL,
+DataFrame, or PySpark code. Comet runs on commodity hardware: no GPUs, FPGAs, or other specialized accelerators are
+required, so the savings come from better utilization of the infrastructure you already run on.
 
-Here is a breakdown showing relative performance of Spark and Comet for each TPC-H query.
+![](docs/source/_static/images/benchmark-results/0.15.0/tpch_allqueries.png)
 
-![](docs/source/_static/images/benchmark-results/0.11.0/tpch_queries_compare.png)
+![](docs/source/_static/images/benchmark-results/0.15.0/tpch_queries_compare.png)
 
-The following charts shows how much Comet currently accelerates each query from the benchmark.
+See the [Comet Benchmarking Guide](https://datafusion.apache.org/comet/contributor-guide/benchmarking.html) for more details.
 
-### Relative speedup
-
-![](docs/source/_static/images/benchmark-results/0.11.0/tpch_queries_speedup_rel.png)
+[Apache DataFusion]: https://datafusion.apache.org
 
-### Absolute speedup
+## What Comet Accelerates
 
-![](docs/source/_static/images/benchmark-results/0.11.0/tpch_queries_speedup_abs.png)
+Comet replaces Spark operators and expressions with native Rust implementations that run on Apache DataFusion.
+It uses Apache Arrow for zero-copy data transfer between the JVM and native code.
 
-These benchmarks can be reproduced in any environment using the documentation in the
-[Comet Benchmarking Guide](https://datafusion.apache.org/comet/contributor-guide/benchmarking.html). We encourage
-you to run your own benchmarks.
+- **Parquet scans**: native Parquet reader integrated with Spark's query planner
+- **Apache Iceberg**: accelerated Parquet scans when reading Iceberg tables from Spark
+  (see the [Iceberg guide](https://datafusion.apache.org/comet/user-guide/iceberg.html))
+- **Shuffle**: native columnar shuffle with support for hash and range partitioning
+- **Expressions**: hundreds of supported Spark expressions across math, string, datetime, array,
+  map, JSON, hash, and predicate categories
+- **Aggregations**: hash aggregate with support for `FILTER (WHERE ...)` clauses
+- **Joins**: hash join, sort-merge join, and broadcast join
 
-Results for our benchmark derived from TPC-DS are available in the [benchmarking guide](https://datafusion.apache.org/comet/contributor-guide/benchmark-results/tpc-ds.html).
+For the authoritative lists, see the [supported expressions](https://datafusion.apache.org/comet/user-guide/expressions.html)
+and [supported operators](https://datafusion.apache.org/comet/user-guide/operators.html) pages.
 
-## Use Commodity Hardware
+## Drop-In Integration
 
-Comet leverages commodity hardware, eliminating the need for costly hardware upgrades or
-specialized hardware accelerators, such as GPUs or FPGA. By maximizing the utilization of commodity hardware, Comet
-ensures cost-effectiveness and scalability for your Spark deployments.
+Comet is designed as a drop-in accelerator for Apache Spark, allowing you to integrate Comet into your existing
+Spark deployments and workflows seamlessly. With no code changes required, you can immediately harness the
+benefits of Comet's acceleration capabilities without disrupting your Spark applications.
 
-## Spark Compatibility
+## Getting Started
 
-Comet aims for 100% compatibility with all supported versions of Apache Spark, allowing you to integrate Comet into
-your existing Spark deployments and workflows seamlessly. With no code changes required, you can immediately harness
-the benefits of Comet's acceleration capabilities without disrupting your Spark applications.
+Comet supports Apache Spark 3.4 and 3.5, and provides experimental support for Spark 4.0. See the
+[installation guide](https://datafusion.apache.org/comet/user-guide/installation.html) for the detailed
+version, Java, and Scala compatibility matrix.
 
-## Tight Integration with Apache DataFusion
+Install Comet by adding the jar for your Spark and Scala version to the Spark classpath and enabling the plugin.
+A typical configuration looks like:
 
-Comet tightly integrates with the core Apache DataFusion project, leveraging its powerful execution engine. With
-seamless interoperability between Comet and DataFusion, you can achieve optimal performance and efficiency in your
-Spark workloads.
+```shell
+export COMET_JAR=/path/to/comet-spark-spark3.5_2.12-<version>.jar
 
-## Active Community
+$SPARK_HOME/bin/spark-shell \
+  --jars $COMET_JAR \
+  --conf spark.driver.extraClassPath=$COMET_JAR \
+  --conf spark.executor.extraClassPath=$COMET_JAR \
+  --conf spark.plugins=org.apache.spark.CometPlugin \
+  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
+  --conf spark.comet.explainFallback.enabled=true \
+  --conf spark.memory.offHeap.enabled=true \
+  --conf spark.memory.offHeap.size=4g
+```
 
-Comet boasts a vibrant and active community of developers, contributors, and users dedicated to advancing the
-capabilities of Apache DataFusion and accelerating the performance of Apache Spark.
+For full installation instructions, published jar downloads, and configuration reference, see the
+[installation guide](https://datafusion.apache.org/comet/user-guide/installation.html) and the
+[configuration reference](https://datafusion.apache.org/comet/user-guide/configs.html).
 
-## Getting Started
+## Community
 
-To get started with Apache DataFusion Comet, follow the
-[installation instructions](https://datafusion.apache.org/comet/user-guide/installation.html). Join the
-[DataFusion Slack and Discord channels](https://datafusion.apache.org/contributor-guide/communication.html) to connect
-with other users, ask questions, and share your experiences with Comet.
-
-Follow [Apache DataFusion Comet Overview](https://datafusion.apache.org/comet/about/index.html#comet-overview) to get more detailed information
+Join the [DataFusion Slack and Discord channels](https://datafusion.apache.org/contributor-guide/communication.html)
+to connect with other users, ask questions, and share your experiences with Comet.
 
 ## Contributing
 
@@ -120,8 +121,3 @@ shaping the future of Comet. Check out our
 ## License
 
 Apache DataFusion Comet is licensed under the Apache License 2.0. See the [LICENSE.txt](LICENSE.txt) file for details.
-
-## Acknowledgments
-
-We would like to express our gratitude to the Apache DataFusion community for their support and contributions to
-Comet. Together, we're building a faster, more efficient future for big data processing with Apache Spark.
````
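The README's headline claim pairs a 2x speedup with 50% cost savings. That pairing is simple arithmetic: for a fixed workload on a fixed cluster, compute cost scales with runtime, so the saved fraction is 1 - 1/speedup. A minimal illustrative sketch (awk is used only for the floating-point math; the numbers come from the README text, not new benchmark output):

```shell
# Fraction of compute cost saved when a fixed workload runs `s` times
# faster on the same cluster (cost scales with runtime).
cost_savings() { awk -v s="$1" 'BEGIN { printf "%.0f%%\n", (1 - 1/s) * 100 }'; }

cost_savings 2     # 2x speedup halves runtime: prints 50%
cost_savings 1.5   # a smaller 1.5x speedup still saves a third: prints 33%
```

The same relationship explains the "half the time or half the resources" framing: at a 2x speedup, either the cluster runs half as long or half the cluster runs just as long.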

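The spark-shell configuration added in this README revision can be sanity-checked once Comet is installed. The sketch below is a hypothetical variant of that block (placeholder jar and data paths; assumes a local Spark distribution): it runs a single `EXPLAIN` through `spark-sql` with the same flags. Comet operators (e.g. `CometScan`) appearing in the plan indicate native execution, and with `spark.comet.explainFallback.enabled=true` Comet reports why any operator fell back to regular Spark.

```shell
# Sketch: verify Comet is active (placeholder paths; requires an installed
# Spark distribution plus the Comet jar for your Spark/Scala version).
export COMET_JAR=/path/to/comet-spark-spark3.5_2.12-<version>.jar

$SPARK_HOME/bin/spark-sql \
  --jars $COMET_JAR \
  --conf spark.driver.extraClassPath=$COMET_JAR \
  --conf spark.executor.extraClassPath=$COMET_JAR \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
  --conf spark.comet.explainFallback.enabled=true \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=4g \
  -e "EXPLAIN SELECT COUNT(*) FROM parquet.\`/tmp/example.parquet\`"
# Comet operators in the output plan confirm native execution; plain Spark
# operators indicate a fallback, which the explainFallback log explains.
```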
docs/.gitignore (1 addition & 2 deletions)

```diff
@@ -18,5 +18,4 @@
 build
 temp
 venv/
-.python-version
-comet-*
+.python-version
```
8 binary files changed (contents not shown).

0 commit comments