Since you know a bunch about this, I'm going to ask you a question that I was about to research: If I have a dataset in memory in Arrow, but I want to cache it to disk to read back in later, what is the most efficient way to do that at this moment in time? Is it to write to parquet and read the parquet back into memory, or is there a more efficient way to write the native Arrow format such that it can be read back in directly? I think this sounds kind of like Flight, except that my understanding is that it's intended for moving the data across a network rather than temporally across a disk.
I'm not an expert in the nuts and bolts of Arrow, but I think you have two options:
- Save to feather format. Feather is essentially the Arrow in-memory format written to disk. It's uncompressed, so if you have super fast IO it'll read back into memory faster, or at least with minimal CPU usage.
- Save to compressed parquet format. Because you're often IO bound rather than CPU bound, this may read back into memory faster, at the cost of the CPU spent decompressing.
On a modern machine with a fast SSD, I'm not sure which would be faster. If you're saving to remote blob storage, e.g. S3, parquet will almost certainly be faster.
You're probably looking for the Arrow IPC format [1], which writes the data in close to the same format as the in-memory layout. On some platforms, reading it back is just an mmap and can be done with zero copying. Parquet, on the other hand, is a somewhat more complex format, and there will be some amount of encoding and decoding on read/write. Flight is an RPC framework that essentially sends Arrow data around in IPC format.