1. outperforms a bunch of binary formats (as well as all other text formats)
2. has zero dependencies with fast compile times (even the derive macros are custom and don't use syn family crates as build-time deps)
Mixing units from milliseconds down to picoseconds made the columns too wide. All of the benchmark data is also available as JSON in the benchmark_results directory if you want to drop it into Sheets or some other analysis tool.
After trying to like rkyv, I came to the conclusion that the serialization/deserialization API is not ergonomic due to its heavy use of generics.
My first thought was "How does it handle untrusted input?" and they have a page dedicated to it: https://rkyv.org/validation.html
But the phrasing on that page does not exactly inspire confidence ("...good defaults that will work for most archived types...", "...it's not possible or feasible to ensure data integrity with these use cases..."). Is this actually usable for untrusted data or is it mostly used in scenarios where you already know the data is fine?
As for the second quote, the surrounding context explains that the validator will by default return an error if you point to a single object using two different pointers with different opinions on what the pointee's type is. This doesn't sound like a safety issue, since the validator is being too conservative rather than not conservative enough.
The first quote is probably in part referring to the second quote. If that is all it is referring to, then there is no safety issue. If there are other similar issues but rkyv chooses to reject valid archives rather than accept invalid ones, then there also is no safety issue. However, that isn't unambiguous, so I can't say for certain that it isn't possible to misuse the library from safe Rust.
Author here, you’re correct. You can customize your validation context for your specific needs. For example, if you don’t have allocation available (i.e. `#![no_std]` without the alloc crate) then you’ll probably need to write your own mapping system to handle shared pointers. Or you can just not use them if that works better for you. That’s also a large part of why rkyv uses generics so heavily.
If your data is read-only then pointing to the same object from two locations is (usually) fine. But rkyv also supports in-place mutability, which requires validating that no two pointers overlap each other. Otherwise you could have simultaneous mutable borrows of the same value, which is UB.
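To ground the validation discussion in code: below is a minimal sketch of the two access paths from safe Rust, assuming rkyv ~0.7 with the `validation` feature enabled (the API has moved around between versions). `check_archived_root` walks and validates untrusted bytes before handing out a reference, while `archived_root` skips validation and is therefore `unsafe`.

```rust
// Sketch only: assumes rkyv ~0.7 with the `validation` feature.
use rkyv::{Archive, Serialize};

#[derive(Archive, Serialize)]
// Derive CheckBytes for the generated ArchivedEvent so it can be validated.
#[archive(check_bytes)]
struct Event {
    id: u64,
    payload: Vec<u8>,
}

fn main() {
    let event = Event { id: 7, payload: vec![1, 2, 3] };
    // Serialize into an aligned buffer (256 bytes of scratch space).
    let bytes = rkyv::to_bytes::<_, 256>(&event).expect("serialize");

    // Untrusted input: validate the whole archive before touching it.
    let archived = rkyv::check_archived_root::<Event>(&bytes[..])
        .expect("archive failed validation");
    assert_eq!(archived.id, 7);

    // Trusted input only: no validation at all, hence `unsafe`.
    let unchecked = unsafe { rkyv::archived_root::<Event>(&bytes[..]) };
    assert_eq!(unchecked.id, 7);
}
```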
I tried Rkyv and one aspect that I came to dislike was the lack of an explicit schema. I know that some people consider this a feature, but after using it I wish there was a more clear path for migrations.
Sorry to be pedantic, but there’s nothing “zero copy” in a computer. Every single operation copies a bunch of things around.
Performance in computing is about managing copy granularity by trying to align spatial and temporal locality (optimal batching), which relies on end-to-end analysis rather than just one step in the computation (whether a specific operation is a copy, a virtual copy (CoW), a move, or a virtual move by reference, etc.). Avoiding a copy has time costs, such as indirection and incidental coupling: you give up time to save space. Sometimes that's smart, sometimes not.
We have countless stories about how removing poor caching and going brute force speeds up an app.
Specifically for serialization and storage, we also have increasingly many examples where storing something compressed and computing from the compressed form on demand is faster than keeping an uncompressed copy you can access without transformation, because of how the ratio of memory/IO bandwidth to compute bandwidth has shifted over time (i.e. slow IO, fast compute).
For example, it's often better to pack data into varints or delta-compress sequences than to naively store them raw in RAM, as long as they're read roughly in sequence or in groups.
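As a concrete (and deliberately simple) illustration of that idea, not taken from any particular crate: delta-encode a sorted `u64` sequence into LEB128 varints and walk it back in order. The space saving only pays off when reads are roughly sequential, which is the point above.

```rust
// Pack a sorted sequence as varint-encoded deltas (illustrative sketch).
fn pack_deltas(sorted: &[u64]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0u64;
    for &v in sorted {
        let mut delta = v - prev; // small if neighboring values are close
        prev = v;
        loop {
            let byte = (delta & 0x7f) as u8;
            delta >>= 7;
            if delta == 0 {
                out.push(byte);
                break;
            }
            out.push(byte | 0x80); // continuation bit
        }
    }
    out
}

// Decode sequentially, accumulating deltas back into absolute values.
fn unpack_deltas(mut bytes: &[u8]) -> Vec<u64> {
    let mut out = Vec::new();
    let mut prev = 0u64;
    while !bytes.is_empty() {
        let mut delta = 0u64;
        let mut shift = 0;
        loop {
            let byte = bytes[0];
            bytes = &bytes[1..];
            delta |= ((byte & 0x7f) as u64) << shift;
            shift += 7;
            if byte & 0x80 == 0 {
                break;
            }
        }
        prev += delta;
        out.push(prev);
    }
    out
}

fn main() {
    let ids: Vec<u64> = (0..10_000u64).map(|i| 1_000_000 + i * 3).collect();
    let packed = pack_deltas(&ids);
    // 10k u64s are 80,000 bytes raw; closely spaced values pack ~8x smaller.
    println!("raw: {} bytes, packed: {} bytes", ids.len() * 8, packed.len());
    assert_eq!(unpack_deltas(&packed), ids);
}
```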
Avoiding a copy and a transform feels nice, but often isn't a win. It should only be applied when specific use cases show that compute is the bottleneck.
Just to chime in on a detail: often the "zero-copy" term is used in the context of data "forwarding" use cases (serving data loaded from disks or passed through a proxy etc).
The idea there is to have zero "extra" copies of the data.
Obviously if you're reading data from disk and sending it over the network there is at least one copy going on: the copy from the disk to the network card buffer.
And unless your network controller buffer is directly addressable by the disk controller, there is also going to be another copy of the data sitting in RAM.
That's often still considered zero copy IO mainly because the CPU is not involved in the copying (which happens via DMA).
In case of network to network forwarding the data can even stay within the same buffers on the same network controller.
Protocols that involve sending large blobs of uninterpreted bytes are particularly amenable to this kind of optimization.
For example, you can send a large blob of data from disk and pump it onto a TCP connection without that data going through the CPU at all. Practical necessities such as encryption make that even more challenging, but there are some network controllers supporting hardware encryption, so at least in principle that's still possible.
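As a rough, Linux-only sketch of that disk-to-TCP path in Rust, going through the `libc` crate's sendfile binding directly (the file name and port are placeholders, and error handling is minimal):

```rust
// Sketch: stream a file to a TCP peer via sendfile(2); Linux-only,
// requires the `libc` crate. File name and port are made up.
use std::fs::File;
use std::net::TcpListener;
use std::os::unix::io::AsRawFd;

fn main() -> std::io::Result<()> {
    let file = File::open("blob.bin")?; // hypothetical payload on disk
    let len = file.metadata()?.len() as usize;

    let listener = TcpListener::bind("127.0.0.1:9000")?;
    let (socket, _peer) = listener.accept()?;

    let mut sent = 0usize;
    while sent < len {
        // The kernel moves file pages into the socket buffers; the bytes
        // never pass through a userspace buffer in this process.
        let n = unsafe {
            libc::sendfile(
                socket.as_raw_fd(),
                file.as_raw_fd(),
                std::ptr::null_mut(), // use and advance the file offset
                len - sent,
            )
        };
        if n < 0 {
            return Err(std::io::Error::last_os_error());
        }
        if n == 0 {
            break; // unexpected EOF (e.g. file truncated underneath us)
        }
        sent += n as usize;
    }
    Ok(())
}
```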
However, when you also throw data serialization into the mix, you often end up getting the CPU involved in converting data between the raw form you read from disk and the encoded/serialized form in the network response.
For example consider a server that returns a stream of data using the gRPC ByteStream API which is a gRPC stream of ReadResponse messages that contain a single `bytes` field.
The protobuf encoding of that field is just a field tag plus the field size plus the raw bytes of the payload (see the sketch below). In principle it could be possible to implement a gRPC serializer that avoids materializing the body of that data field in memory and instead relies on sendfile or other primitives that can potentially apply further optimizations, or at least take the CPU out of the loop.
But doing that is hard, and affects the API of the serializer and every library that uses it.
I'm not aware of any open-source protobuf+gRPC implementations that do this (I may be wrong; it's been a long time since I checked and I may be missing something).
I don't mind that the rest of the serialized message would need to be unpacked/repacked to/from different representations, as generally the size of that part is dwarfed by the large blobs.
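To illustrate why the blob itself never needs re-encoding, here is a hedged sketch of the wire framing for a length-delimited `bytes` field: the header is just a varint key plus a varint length, and the payload follows verbatim. The field number below is illustrative, not taken from the actual ByteStream proto.

```rust
// Illustrative only: hand-rolled protobuf framing for a `bytes` field.
fn encode_varint(mut v: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80); // continuation bit
    }
}

// Build only the few header bytes for field `field_number`, wire type 2
// (length-delimited); the payload itself is appended verbatim elsewhere.
fn bytes_field_header(field_number: u32, payload_len: usize) -> Vec<u8> {
    let mut header = Vec::new();
    encode_varint(((field_number as u64) << 3) | 2, &mut header); // key
    encode_varint(payload_len as u64, &mut header); // length prefix
    header
}

fn main() {
    let payload = vec![0u8; 4 << 20]; // stand-in for a 4 MiB blob on disk
    let header = bytes_field_header(1, payload.len());
    // A zero-copy-aware serializer could write `header` from memory and then
    // hand the payload to sendfile/writev instead of concatenating them.
    println!("header: {:?}, payload: {} bytes", header, payload.len());
}
```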
> In principle it could be possible to implement a gRPC serializer
FWIW it isn't that much work to rig up a C++ gRPC implementation that just does whatever you want with some buffers. That is, you provide SerializationTraits for your type, and you provide the function that converts your type to what is the functional equivalent of an iovec. This helps minimize copies because, as in your example, it's often useful to add a preamble to some large byte buffer; doing so in a naive codelab style with the vanilla protobuf API would probably result in extra copies. However, I also note that recent releases of protobuf C++ allow bytes fields to be represented by an absl::Cord, which enables you to refer to existing large buffers instead of copying them. In either case you need a pointer that can be sent with `sendmsg`, not `sendfile`.
We use Rkyv heavily at Climatiq to embed already-parsed data into our binaries, which we then serve on an edge network. For the happy path it works very nicely and is extremely fast. However, the fact that there's a separate archived version of each struct does become a bit of a bother.
apache/arrow-rs: https://github.com/apache/arrow-rs
From https://arrow.apache.org/faq/ :
> How does Arrow relate to Flatbuffers?
> Flatbuffers is a low-level building block for binary data serialization. It is not adapted to the representation of large, structured, homogenous data, and does not sit at the right abstraction layer for data analysis tasks.
> Arrow is a data layer aimed directly at the needs of data analysis, providing a comprehensive collection of data types required to analytics, built-in support for “null” values (representing missing data), and an expanding toolbox of I/O and computing facilities.
> The Arrow file format does use Flatbuffers under the hood to serialize schemas and other metadata needed to implement the Arrow binary IPC protocol, but the Arrow data format uses its own representation for optimal access and computation.