Since you know a bunch about this, I'm going to ask you a question that I was about to research: If I have a dataset in memory in Arrow, but I want to cache it to disk to read back in later, what is the most efficient way to do that at this moment in time? Is it to write to parquet and read the parquet back into memory, or is there a more efficient way to write the native Arrow format such that it can be read back in directly? I think this sounds kind of like Flight, except that my understanding is that it's intended for moving the data across a network rather than temporally across a disk.
I'm not an expert in the nuts and bolts of Arrow, but I think you have two options:
- Save to feather format. Feather is essentially the Arrow in-memory format written to disk. It's uncompressed, so if you have super fast IO it'll read back into memory faster, or at least with minimal CPU usage.
- Save to compressed parquet format. Because you're often IO bound rather than CPU bound, this may read back into memory faster, at the cost of the CPU spent decompressing.
On a modern machine with a fast SSD, I'm not sure which would be faster. If you're saving to remote blob storage, e.g. S3, parquet will almost certainly be faster.
You're probably looking for the Arrow IPC format [1], which writes the data in close to the same format as the in-memory layout. On some platforms, reading it back is just an mmap and can be done with zero copying. Parquet, on the other hand, is a somewhat more complex format, and there will be some amount of encoding and decoding on read/write. Flight is an RPC framework that essentially sends Arrow data around in IPC format.