We benchmarked the leading speech-to-text (STT) providers, focusing specifically on healthcare applications. Our benchmark used real-world examples to assess transcription accuracy in medical contexts, where precision is crucial.
Benchmark results
The average WER results of our tasks show that Deepgram is the leading speech-to-text provider for healthcare in this benchmark.
Methodology
Dataset
We wanted to evaluate the tools’ transcription accuracy in a specific domain, so we designed two tasks:
Task 1: Healthcare voice data
- Total number of samples: 100
- Total duration: 9 minutes and 25 seconds
- Average duration per sample: 5.65 seconds
- Content: Healthcare voice data including medical terminology, patient interactions, and clinical discussions
- Variety: Different speakers, varying audio quality, and diverse medical contexts spoken in English
Audio specifications:
- Format: WAV
- Channels: 1 (Mono)
- Sample width: 16-bit
- Sample rate: 16 kHz
- Consistent bitrate: 256 kbps
- Duration range: ~4.5 to 11.5 seconds per file
Task 2: An anatomy lecture
- Total number of samples: 1
- Total duration: 8 minutes and 35 seconds
- Content: An anatomy lecture given by a doctor, including medical terminology
- Variety: One speaker speaking in English; music plays in the background during the first half of the recording
Audio specifications:
- Format: WAV
- Channels: 2 (Stereo)
- Sample width: 16-bit
- Sample rate: 48 kHz
- Consistent bitrate: 1536 kbps
Evaluation metrics
We used Word Error Rate (WER) and Character Error Rate (CER) as evaluation metrics for transcription accuracy. Word Error Rate is calculated as:
WER = (S + D + I) / N
Where:
- S = Number of substitutions
- D = Number of deletions
- I = Number of insertions
- N = Total number of words in the ground truth
The formula calculates the minimum number of word-level operations needed to transform the hypothesis into the reference, divided by the number of words in the reference. Lower WER indicates better accuracy, with 0% being a perfect match.
The Character Error Rate (CER) is calculated by dividing the total number of character-level errors (including insertions, deletions, and substitutions) by the total number of characters in the reference text.
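To make the metrics concrete, below is a minimal Python sketch that computes WER and CER from scratch via token-level edit distance. The sentence pair in the example is hypothetical and not taken from our dataset.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Minimum number of substitutions, deletions, and insertions
    needed to transform the hypothesis tokens into the reference tokens."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)]

def wer(reference: str, hypothesis: str) -> float:
    # (S + D + I) / N over words
    return edit_distance(reference.split(), hypothesis.split()) / len(reference.split())

def cer(reference: str, hypothesis: str) -> float:
    # Same edit distance, computed over characters instead of words
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

# One substituted word in a five-word reference -> WER = 0.20
print(wer("the patient has acute bronchitis",
          "the patient has acute bronchiolitis"))
```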
We used each provider’s speech-to-text API to transcribe the audio files. The maximum file size each provider accepts per request is shown in the table below:
| Provider | Maximum file size |
| --- | --- |
| Amazon AWS Transcribe | 2 GB |
| AssemblyAI | 5 GB |
| Deepgram Nova 2 | 2 GB |
| Google Cloud Speech-to-Text | 10 MB |
| Microsoft Azure Speech | 1 GB |
| OpenAI Whisper L-V2 | 25 MB |
| Rev.ai | 1 GB |
| Speechmatics | 1 GB |
Note: Providers with smaller file size limits (such as Google and OpenAI) require larger audio files to be split into smaller chunks before processing. We did this for Task 2.
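For reference, here is a minimal sketch of how such chunking can be done with the pydub library. The file name, chunk length, and output naming scheme are illustrative assumptions, not the exact script used in the benchmark.

```python
from pathlib import Path
from pydub import AudioSegment  # assumes pydub is installed

def split_wav(path: str, chunk_seconds: int = 60) -> list[str]:
    """Split a long WAV file into fixed-length chunks so each piece
    stays under a provider's per-request upload limit."""
    audio = AudioSegment.from_wav(path)
    chunk_ms = chunk_seconds * 1000
    stem = Path(path).with_suffix("")
    chunk_paths = []
    for start in range(0, len(audio), chunk_ms):  # pydub indexes by milliseconds
        out_path = f"{stem}_{start // chunk_ms:03d}.wav"
        audio[start:start + chunk_ms].export(out_path, format="wav")
        chunk_paths.append(out_path)
    return chunk_paths

# Hypothetical usage: split the lecture into ~60-second pieces before
# sending it to providers with 10 MB / 25 MB limits.
chunks = split_wav("anatomy_lecture.wav", chunk_seconds=60)
```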
Speech recognition
Speech recognition enables computers to transcribe audio into text with the help of machine learning algorithms. A transcription service’s API can be called from various programming languages to batch-transcribe audio files, and these platforms typically support both real-time and asynchronous transcription.
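As an illustration, here is a minimal sketch of batch transcription through one of the benchmarked providers, OpenAI’s hosted Whisper API, using the official openai Python SDK. The folder and file names are placeholders, and the script assumes an OPENAI_API_KEY environment variable is set.

```python
from pathlib import Path
from openai import OpenAI  # assumes the official openai SDK is installed

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Transcribe every WAV file in a placeholder folder and save the text next to it
for wav_path in sorted(Path("healthcare_samples").glob("*.wav")):
    with wav_path.open("rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",  # hosted Whisper model
            file=audio_file,
        )
    wav_path.with_suffix(".txt").write_text(transcript.text)
```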
Speech recognition technology has numerous applications, including transcription, voice assistants, and language translation.
Benefits of using speech recognition for transcription
- Fast transcription of audio files
- Time and effort savings
- Real-time transcription and translation
- Accessibility for individuals with disabilities
How do speech-to-text AI tools work?
The transcription process includes the following steps:
- Audio data is uploaded or streamed to the speech-to-text tool
- Machine learning algorithms analyze the audio data and identify patterns in speech
- The tool converts the recognized speech into text using a speech-to-text engine
- The transcribed text is returned to the user
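These steps can also be reproduced locally. Below is a minimal sketch using the open-source whisper package (a local model, not one of the hosted APIs benchmarked above); the audio file name is a placeholder.

```python
import whisper  # open-source Whisper package: pip install openai-whisper

# 1. Load a pretrained speech-to-text model
model = whisper.load_model("base")

# 2-3. The model analyzes the audio and converts recognized speech to text
result = model.transcribe("patient_note.wav")  # placeholder file name

# 4. Display the transcribed text to the user
print(result["text"])
```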

