Several years ago, I implemented an application-specific load balancer where the LB kept two queues: free_workers and open_requests.
On startup, each worker registers itself with the LB and gets added to the free_workers queue.
When requests arrive at the LB, the LB checks if there are free workers available. If yes, it dequeues a free worker and dispatches the request to that worker.
If no free workers are available, the LB adds the request to the open_requests queue.
When a worker finishes its work, it notifies the LB, which adds the worker back to the end of the free_workers queue and initiates another round of dispatching.
The metric to watch is the size of the open_requests queue.
(There were a few more nitty-gritty details, but that was the concept at a high level.)
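In rough Python, the core dispatch loop might look something like the sketch below. This is just a minimal rendering of the idea, not the real implementation; names like TwoQueueLB, EchoWorker, and the worker.handle() call are placeholders I'm assuming for illustration.

```python
from collections import deque


class TwoQueueLB:
    """Minimal sketch of the two-queue dispatcher described above."""

    def __init__(self):
        self.free_workers = deque()   # workers with no in-flight request
        self.open_requests = deque()  # requests waiting for a worker

    def register_worker(self, worker):
        # On startup, each worker registers and joins the free pool.
        self.free_workers.append(worker)
        self._dispatch()

    def handle_request(self, request):
        # Incoming requests queue at the tail of open_requests;
        # _dispatch pairs them with free workers if any are available.
        self.open_requests.append(request)
        self._dispatch()

    def worker_done(self, worker):
        # A finished worker rejoins the end of the free pool,
        # then we try to dispatch any queued requests.
        self.free_workers.append(worker)
        self._dispatch()

    def _dispatch(self):
        # Pair queued requests with free workers until one queue empties.
        while self.free_workers and self.open_requests:
            worker = self.free_workers.popleft()
            request = self.open_requests.popleft()
            worker.handle(request)  # placeholder: send the request to the worker

    def backlog(self):
        # The metric to watch: how many requests are still waiting.
        return len(self.open_requests)


class EchoWorker:
    """Placeholder worker that just prints what it receives."""

    def handle(self, request):
        print(f"handling {request}")


lb = TwoQueueLB()
w = EchoWorker()
lb.register_worker(w)
lb.handle_request("req-1")   # dispatched immediately
lb.handle_request("req-2")   # queued: the only worker is busy
print(lb.backlog())          # 1
lb.worker_done(w)            # worker reports back; req-2 is dispatched
print(lb.backlog())          # 0
```

The single _dispatch() helper is the main point of the sketch: every event (worker registered, request arrived, worker finished) funnels through the same pairing loop, so at rest the two queues can never both be non-empty.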
But then your LB is a single point of failure! Even if you were to spin up duplicate copies of the LB that could be failed over to in real time, there is still the drawback that you are bounded by the number of requests you can store on a single machine.
I think what you actually built is a message queue engine that forwards the data to the consumers, not a load balancer.