They list the error rate directly on the Git repo; it was never good, even when it was the best available.
I saw mediocre results from the biggest model even when I gave it a video of Tom Scott speaking at the Royal Institution, where I could be extremely confident about the quality of the recording.
WER is a decent metric for comparing models, but there's a difference between mistranscribing "effect" as "affect" and the kind of hallucinations Whisper produces. I've run thousands of hours of audio through it to compare it against other models, and the kind of thing you see Whisper inventing out of whole cloth is phrases like "please like and subscribe" during periods of silence. To me that suggests it was trained on a lot of YouTube.
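For context on why WER can hide this: it's just word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words, so a harmless homophone swap and a hallucinated phrase both count as generic edit operations. A minimal self-contained sketch, not tied to Whisper or any particular toolkit:

    def wer(reference: str, hypothesis: str) -> float:
        ref = reference.split()
        hyp = hypothesis.split()
        # d[i][j] = edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution/match
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    # A homophone swap scores as one substitution:
    print(wer("the effect was clear", "the affect was clear"))  # 0.25
    # A hallucinated phrase scores as plain insertions:
    print(wer("thanks for watching",
              "thanks for watching please like and subscribe"))  # ~1.33

Both show up as numbers on the same scale, even though one is a cosmetic error and the other is invented speech.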
I would be very concerned about any LLM being used for "transcription", since it may inject things that nobody said, as in this recent item:
https://news.ycombinator.com/item?id=41968191