We benchmarked the leading speech-to-text (STT) providers, focusing specifically on healthcare applications. Our benchmark used real-world examples to assess transcription accuracy in medical contexts, where precision is crucial.
Benchmark results
The average WER results of our tasks show that Deepgram is the leading speech-to-text provider for healthcare in this benchmark.
Methodology
Dataset
We wanted to evaluate the tools’ transcription accuracy in a specific domain, so we designed two tasks:
Task 1: Healthcare voice data
- Total number of samples: 100
- Total duration: 9 minutes and 25 seconds
- Average duration per sample: 5.65 seconds
- Content: Healthcare voice data including medical terminology, patient interactions, and clinical discussions
- Variety: Different speakers, varying audio quality, and diverse medical contexts spoken in English
Audio specifications:
- Format: WAV
- Channels: 1 (Mono)
- Sample width: 16-bit
- Sample rate: 16 kHz
- Consistent bitrate: 256 kbps
- Duration range: ~4.5 to 11.5 seconds per file
Task 2: An anatomy lecture
- Total number of samples: 1
- Total duration: 8 minutes and 35 seconds
- Content: An anatomy lecture given by a doctor, including medical terminology
- Variety: One speaker speaking in English; music plays in the background during the first half of the recording
Audio specifications:
- Format: WAV
- Channels: 2 (Stereo)
- Sample width: 16-bit
- Sample rate: 48 kHz
- Consistent bitrate: 1536 kbps
Evaluation metrics
We used Word Error Rate (WER) and Character Error Rate (CER) as evaluation metrics for transcription accuracy. Word Error Rate is calculated as:
WER = (S + D + I) / N
Where:
- S = Number of substitutions
- D = Number of deletions
- I = Number of insertions
- N = Total number of words in the ground truth
The formula calculates the minimum number of word-level operations needed to transform the hypothesis into the reference, divided by the number of words in the reference. Lower WER indicates better accuracy, with 0% being a perfect match.
The Character Error Rate (CER) is calculated by dividing the total number of character-level errors (including insertions, deletions, and substitutions) by the total number of characters in the reference text.
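To make the metrics concrete, below is a minimal Python sketch that computes WER and CER from scratch via token-level edit distance. The sentence pair in the example is hypothetical and not taken from our dataset.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Minimum number of substitutions, deletions, and insertions
    needed to transform the hypothesis tokens into the reference tokens."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)]

def wer(reference: str, hypothesis: str) -> float:
    # (S + D + I) / N over words
    return edit_distance(reference.split(), hypothesis.split()) / len(reference.split())

def cer(reference: str, hypothesis: str) -> float:
    # Same edit distance, computed over characters instead of words
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

# One substituted word in a five-word reference -> WER = 0.20
print(wer("the patient has acute bronchitis",
          "the patient has acute bronchiolitis"))
```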
We used each provider’s speech-to-text API to transcribe the audio files. The maximum file size each provider accepts per request is shown in the table below:
| Provider | Maximum file size |
| --- | --- |
| Amazon AWS Transcribe | 2 GB |
| AssemblyAI | 5 GB |
| Deepgram Nova 2 | 2 GB |
| Google Cloud Speech-to-Text | 10 MB |
| Microsoft Azure Speech | 1 GB |
| OpenAI Whisper L-V2 | 25 MB |
| Rev.ai | 1 GB |
| Speechmatics | 1 GB |
Note: Providers with smaller file size limits (such as Google and OpenAI) require larger audio files to be split into smaller chunks before processing. We did this for Task 2.
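For reference, here is a minimal sketch of how such chunking can be done with the pydub library. The file name, chunk length, and output naming scheme are illustrative assumptions, not the exact script used in the benchmark.

```python
from pathlib import Path
from pydub import AudioSegment  # assumes pydub is installed

def split_wav(path: str, chunk_seconds: int = 60) -> list[str]:
    """Split a long WAV file into fixed-length chunks so each piece
    stays under a provider's per-request upload limit."""
    audio = AudioSegment.from_wav(path)
    chunk_ms = chunk_seconds * 1000
    stem = Path(path).with_suffix("")
    chunk_paths = []
    for start in range(0, len(audio), chunk_ms):  # pydub indexes by milliseconds
        out_path = f"{stem}_{start // chunk_ms:03d}.wav"
        audio[start:start + chunk_ms].export(out_path, format="wav")
        chunk_paths.append(out_path)
    return chunk_paths

# Hypothetical usage: split the lecture into ~60-second pieces before
# sending it to providers with 10 MB / 25 MB limits.
chunks = split_wav("anatomy_lecture.wav", chunk_seconds=60)
```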
Speech recognition
Speech recognition enables computers to transcribe audio into text with the help of machine learning algorithms. A transcription service’s API can be called from various programming languages to batch-transcribe audio files, and these platforms typically support both real-time and asynchronous transcription.
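As an illustration, here is a minimal sketch of batch transcription through one of the benchmarked providers, OpenAI’s hosted Whisper API, using the official openai Python SDK. The folder and file names are placeholders, and the script assumes an OPENAI_API_KEY environment variable is set.

```python
from pathlib import Path
from openai import OpenAI  # assumes the official openai SDK is installed

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Transcribe every WAV file in a placeholder folder and save the text next to it
for wav_path in sorted(Path("healthcare_samples").glob("*.wav")):
    with wav_path.open("rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",  # hosted Whisper model
            file=audio_file,
        )
    wav_path.with_suffix(".txt").write_text(transcript.text)
```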
Speech recognition technology has numerous applications, including transcription, voice assistants, and language translation.
Benefits of using speech recognition for transcription
- Fast transcription of audio files
- Time and effort savings
- Real-time transcription and translation
- Accessibility for individuals with disabilities
How do speech-to-text AI tools work?
The transcription process includes the following steps:
- Audio data is uploaded or streamed to the speech-to-text tool
- Machine learning algorithms analyze the audio data and identify patterns in speech
- The tool converts the recognized speech into text using a speech-to-text engine
- The transcribed text is returned to the user
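These steps can also be reproduced locally. Below is a minimal sketch using the open-source whisper package (a local model, not one of the hosted APIs benchmarked above); the audio file name is a placeholder.

```python
import whisper  # open-source Whisper package: pip install openai-whisper

# 1. Load a pretrained speech-to-text model
model = whisper.load_model("base")

# 2-3. The model analyzes the audio and converts recognized speech to text
result = model.transcribe("patient_note.wav")  # placeholder file name

# 4. Display the transcribed text to the user
print(result["text"])
```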

