This kind of "expected next field" optimization has a long history in protobuf, but results are mixed.
The generated code in C++ used to check for the expected next field before falling back to the switch() (example here: https://github.com/protocolbuffers/protobuf/blob/460e7dd7c47...) but this was removed in 2016 when load tests found that it hurt more than it helped.
One tricky part of making this optimization work is making a good guess about what the next field should be. Miguel's article alludes to this:
> Each field specifies which fields to try next. This allows the compiler to perform field scheduling, by carefully deciding which order to try fields in based both on their declaration order and a rough estimation of their “hotness”, much like branch scheduling happens in a program compiler. This avoids almost all of the work of looking up the next field in the common case, because we have already pre-loaded the correct guess.
> I haven’t managed to nail down a good algorithm for this yet, but I am working on a system for implementing a type of “branch prediction” for PGO, that tries to provide better predictions for the next fields to try based on what has been seen before.
One delightful thing about the tail-call parser design is that the CPU's branch predictor effectively takes over the job of guessing. With a tail call parser, the dispatch sequence ends up looking like this:
cmp QWORD PTR [rdi+0x8],rsi # Bounds check
jbe .fallback
movzx r10d,WORD PTR [rsi] # Load two bytes of tag
mov eax,r10d
and eax,0xf8
mov r9,QWORD PTR [rcx+rax*2] # Load table data
xor r9,r10
mov rax,QWORD PTR [rcx+rax*2+0x8] # Load field parser function
jmp rax # Tail call to field parser
That "jmp rax" instruction is an indirect branch that can be predicted by the CPU. The CPU effectively guesses for us!
And unlike any kind of guess we might have performed ahead-of-time, the branch predictor will constantly adapt to whatever patterns it is seeing in the data. This is good, because statically guessing ahead-of-time is hard.
> This kind of "expected next field" optimization has a long history in protobuf
You could probably even trace the history of the idea all the way to Van Jacobson’s 30-instruction TCP fastpath[1]. Or, a bit closer to home, I’ve found that an interpreter for a stack+accumulator VM (which, compared to the pure stack option, is prone to blowing up the bytecode count and thus the dispatch cost with the constant PUSH-accumulator instructions) goes significantly faster if you change the (non-shared) dispatch so that each dispatch site first tests for the common PUSH opcode with an ordinary conditional branch before falling back to the indirect jump, which feels somewhat analogous to the next-field optimization and avoids polluting the indirect branch predictor with the very common PUSH predictions. (It’s still slower than not having those PUSHes in the first place.)