Some of the initial differentiators are described at the bottom of our design doc at https://github.com/ray-project/deltacat/blob/main/deltacat/c.... But yes, controlling file I/O was also an important part of this, since it allowed us to (1) run more targeted downloads/reads of only the Parquet row groups and columns participating in compaction and (2) track dirty/clean files to skip unnecessary re-writes of "clean" files that weren't altered by compaction. We also better leverage catalog metadata (e.g., primary key indexes, if available) to filter out more files in the initial scan, and to copy clean files into the compacted variant by reference (when supported by the underlying catalog format).
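
For anyone curious what those targeted reads look like, here's a minimal PyArrow sketch (not DeltaCAT's actual code; the file name, row-group indices, and column names are made up for illustration):

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("partition-00042.parquet")

    # Inspecting footer metadata is cheap relative to reading data pages.
    print(pf.metadata.num_row_groups, pf.schema_arrow)

    # Read only the row groups and columns that participate in compaction.
    table = pf.read_row_groups(
        row_groups=[0, 3],                    # groups actually touched by compaction
        columns=["primary_key", "sort_key"],  # columns needed for dedupe/sort
    )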

The trick with doing compaction in Python was to ensure that the most performance-sensitive code was delegated to more optimized C++ (e.g., Ray and Arrow) and Rust (e.g., Daft) code paths. If we did all of our per-record processing ops in pure Python, compaction would indeed be much slower.
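
As a rough illustration of that principle (not the project's real code path), the orchestration stays in Python while the per-record work runs in Arrow's C++ kernels; the table and column names below are hypothetical:

    import pyarrow as pa
    import pyarrow.compute as pc

    table = pa.table({
        "primary_key": ["a", "b", "a", "c"],
        "version":     [1, 1, 2, 1],
    })

    # Sorting, comparison, and filtering all execute as vectorized C++ kernels,
    # not as a per-record Python loop.
    sorted_table = table.sort_by([("primary_key", "ascending"),
                                  ("version", "descending")])
    mask = pc.greater(sorted_table["version"], 1)
    latest = sorted_table.filter(mask)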



This is one of the first times I've heard of people using Daft in the wild. Would you be able to elaborate on where Daft came in handy?

Edit: Nvm, I kept reading! Thanks for the interesting post!


Thanks a lot for the explanation. Sounds a lot like how PySpark allows for declarative definitions of computations that are actually executed in Java.
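
For example, something like the following is declarative Python, but Spark only builds a plan from it and executes that plan in the JVM once an action runs (the input path and column names are placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("declarative-example").getOrCreate()

    df = spark.read.parquet("events.parquet")
    result = (
        df.filter(F.col("status") == "active")  # planned/optimized on the JVM side
          .groupBy("user_id")
          .agg(F.count("*").alias("event_count"))
    )
    result.show()  # lazy evaluation: execution happens here, in the JVM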



