On top of what others have said here about TPUs and their kin, you can make things really scream by taping out an ASIC for a specific frozen neural network (i.e. including the weights and parameters).
If the network never has to change, say a fixed model for image segmentation or object recognition, then it's hard to beat the efficiency of a custom silicon design that bakes the weights in as transistors.
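To give a rough feel for why frozen weights help, here's a small software analogy (not a hardware design). When a weight is a known constant, multiplying by it reduces to a fixed set of shifts and adds, and a zero weight needs no logic at all. The weight values and function names below are made up for illustration, and the weights are assumed to be integer-quantized.

```python
# Hypothetical frozen, integer-quantized weights for one neuron (illustrative values).
FROZEN_WEIGHTS = [3, -5, 0, 8]


def shift_add_terms(weight):
    """Return the bit positions set in |weight|; each one stands for a fixed shift."""
    magnitude = abs(weight)
    return [bit for bit in range(magnitude.bit_length()) if (magnitude >> bit) & 1]


def multiply_by_constant(x, weight):
    """Compute x * weight using only shifts and adds, the way fixed wiring could."""
    total = sum(x << shift for shift in shift_add_terms(weight))
    return -total if weight < 0 else total


def frozen_neuron(inputs):
    """Dot product against FROZEN_WEIGHTS without a general-purpose multiplier."""
    return sum(multiply_by_constant(x, w) for x, w in zip(inputs, FROZEN_WEIGHTS))


if __name__ == "__main__":
    for w in FROZEN_WEIGHTS:
        print(w, "-> shifts", shift_add_terms(w))
    print(frozen_neuron([1, 2, 3, 4]))
```

In this sketch the weight 8 costs a single shift and the weight 0 costs nothing, which is the kind of saving you'd hope a baked-in design gets per weight, without having to fetch weights from memory.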