> Reserving full GPU instances for these models leads to allocating 17.7% of our GPUs to serve only 1.35% of requests
> Deployment results show that Aegaeon reduces the number of GPUs required for serving these models from 1,192 to 213, highlighting an 82% GPU resource saving.
In other words, 82.3% of their GPUs were serving 98.65% of all traffic. If they simply shrank the cluster (keeping only the 213 GPUs for the long tail), the popular models' share rises to about 96.3% of the GPUs, still serving 98.65% of traffic. If instead they reallocated the roughly 979 freed GPUs, which is more likely, then about 96.8% of their GPUs are serving 98.65% of all requests, or roughly 18% more capacity for popular requests on the same hardware.
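A quick back-of-the-envelope check of those shares, assuming only the figures quoted above (17.7% of GPUs for the long tail, 1,192 GPUs before Aegaeon, 213 after); the absolute cluster size is inferred from those ratios, not a number Alibaba reports:

```python
tail_share = 0.177            # fraction of GPUs reserved for cold models
tail_gpus_before = 1_192      # GPUs serving cold models before Aegaeon
tail_gpus_after = 213         # GPUs serving cold models after Aegaeon

total_gpus = tail_gpus_before / tail_share      # ~6,734 GPUs (inferred estimate)
popular_gpus = total_gpus - tail_gpus_before    # ~5,542 GPUs serving hot models

# Scenario 1: shrink the cluster, keep popular capacity unchanged.
shrunk_total = popular_gpus + tail_gpus_after
print(f"popular share after shrinking:    {popular_gpus / shrunk_total:.1%}")   # ~96.3%

# Scenario 2: keep the cluster, reallocate freed GPUs to popular models.
freed = tail_gpus_before - tail_gpus_after
popular_after = popular_gpus + freed
print(f"popular share after reallocating: {popular_after / total_gpus:.1%}")    # ~96.8%
print(f"extra capacity for popular models: {freed / popular_gpus:.1%}")         # ~17.7%
```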