It sounds like the cost of context switching between these very different workloads (crypto, disk I/O, network I/O, protocol buffer parsing and formatting) is something Twisted could do better at.
Any idea where the overhead comes from? Twisted, or the Python interpreter itself? Is this a GIL performance issue? Or perhaps even lower -- something here is really hostile to CPU cache?
I realize this is a matter of taste, but my favorite async framework is still the kernel. Write small programs that do one thing well (and thus have pretty uniform workloads) and then let the kernel balance resource usage.
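A minimal sketch of that style, assuming (hypothetically) the workloads split cleanly into filter-like stages; each stage is its own small process, and the kernel schedules them against each other:

    # compress.py -- one stage, one job: read stdin, write stdout.
    # The kernel's scheduler balances this against the other stages.
    import sys
    import zlib

    data = sys.stdin.buffer.read()
    sys.stdout.buffer.write(zlib.compress(data))

Then wire the stages together with a plain pipeline, e.g. cat blocks | python compress.py | python encrypt.py > out (encrypt.py being another hypothetical stage of the same shape).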
After continuing to hit walls with the standard Python Protocol Buffer library, we wrote our own, which was 5x faster than even the barely documented C++-backed Python Protocol Buffer compiled-module support.
Ugh yeah, the standard Python protobuf library is pure Python and horribly slow. And it requires code generation -- in a dynamic language! The C++ one is faster, but it also requires code generation and is just nasty to work with.
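For anyone who hasn't seen it, the standard library's flow looks roughly like this (message and field names are made up; the .proto has to go through protoc first):

    # person.proto:
    #   message Person { required string name = 1; optional int32 id = 2; }
    # generated with:
    #   protoc --python_out=. person.proto

    import person_pb2  # the module protoc generates

    p = person_pb2.Person()
    p.name = "alice"
    p.id = 42
    wire = p.SerializeToString()  # encode to the protobuf wire format

    q = person_pb2.Person()
    q.ParseFromString(wire)       # decode it again -- all in pure Python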
Not that this matters much to you at this point, but I have a small C/Python protobuf library that's 10x faster than the standard Python protobuf: https://github.com/acg/lwpb
PS. I see you're in the SLC area -- me too. We should talk tech shop in person sometime!
Hi! Yes, you should totally swing by. Shoot us an email :)
I think the performance issues we were hitting honestly came down to plain Python runtime overhead. A Python function call is really expensive, and with Twisted's Deferred handling it's very hard to eliminate the massive number of function calls each I/O event triggers.
We also had a really slow CPU, so we did our best to eliminate as many Python function calls in hotspots as possible, but yeah, that was challenging.
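For a feel of where those calls come from, here's a sketch of the kind of chain every I/O event walks (the stage functions are made up; succeed and addCallback are real Twisted APIs):

    from twisted.internet.defer import succeed

    # hypothetical per-event stages; each hop is at least one
    # Python-level function call, on every single I/O event
    def decrypt(blob):
        return blob

    def parse(blob):
        return blob

    def handle(msg):
        return msg

    d = succeed(b"packet")   # a Deferred that already has its result
    d.addCallback(decrypt)   # each addCallback fires synchronously here
    d.addCallback(parse)
    d.addCallback(handle)

And that's before counting Deferred's own internal bookkeeping calls.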
> A Python function call is really expensive, and with Twisted's Deferred handling it's very hard to eliminate the massive number of function calls each I/O event triggers.
Ah, makes total sense. Any particular reason you chose to use Twisted in a hardware appliance? Is there a web interface that's supposed to be super responsive?
The main codebase actually started out targeting desktop-class hardware, but the poor uptime of users' work machines (now laptops, soon tablets) broke our business model, hence the dedicated hardware.
Did you ever get a chance to evaluate what it does under PyPy? In my experience, there are quite a few such cases where it eliminates most of the overhead, and it's not uncommon to see it generate what appears to be optimal x86_64.
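If anyone wants to try it, a call-heavy microbenchmark run under both interpreters shows the gap; this is just a sketch, and the numbers will vary wildly by machine:

    import timeit

    def noop():
        pass

    def loop(n=1000000):
        # a million pure-Python calls; PyPy's JIT should collapse
        # most of this, while CPython pays full price per call
        for _ in range(n):
            noop()

    # run the same file as: python bench.py, then: pypy bench.py
    print("%.3f s for 1M calls" % timeit.timeit(loop, number=1))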