Some of the initial differentiators are described at the bottom of our design doc at https://github.com/ray-project/deltacat/blob/main/deltacat/c.... But yes, controlling file I/O was also an important part of this, since it allowed us to (1) run more targeted downloads/reads of only the Parquet row groups and columns participating in compaction and (2) track dirty/clean files to skip unnecessary re-writes of "clean" files that weren't altered by compaction. We also got better leverage out of catalog metadata (e.g., primary key indexes, if available) to filter out more files in the initial scan and to copy clean files into the compacted variant by reference (when supported by the underlying catalog format).
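To make the first point concrete, here's a rough sketch (not DeltaCat's actual code) of what a targeted read looks like with PyArrow: only the row groups touched by the delta being compacted get read, and only the columns compaction needs. The `primary_key_cols` and `row_group_is_dirty` names are hypothetical placeholders.

    import pyarrow.parquet as pq

    def read_dirty_row_groups(path, primary_key_cols, row_group_is_dirty):
        pf = pq.ParquetFile(path)
        tables = []
        for i in range(pf.metadata.num_row_groups):
            # Skip "clean" row groups untouched by the delta being compacted.
            if not row_group_is_dirty(pf.metadata.row_group(i)):
                continue
            # Read only the columns compaction actually needs.
            tables.append(pf.read_row_group(i, columns=primary_key_cols))
        return tables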
The trick with doing compaction in Python was to ensure that the most performance-sensitive code was delegated to more optimal C++ (e.g., Ray and Arrow) and Rust (e.g., Daft) code paths. If we did all of our per-record processing ops in pure Python, compaction would indeed be much slower.
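As an illustration (column names made up, not DeltaCat's implementation), here's the same "keep the latest record per primary key" step written as a per-record Python loop vs. delegated to Arrow's C++ hash-aggregation kernel:

    import pyarrow as pa

    table = pa.table({
        "pk": [1, 2, 1, 3, 2],
        "event_time": [10, 11, 12, 13, 14],
    })

    # Slow path: pure-Python per-record processing.
    latest = {}
    for pk, ts in zip(table["pk"].to_pylist(), table["event_time"].to_pylist()):
        if pk not in latest or ts > latest[pk]:
            latest[pk] = ts

    # Fast path: the heavy lifting runs in Arrow's C++ group-by/aggregate kernels.
    deduped = table.group_by("pk").aggregate([("event_time", "max")])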