AIMultiple Research
We follow ethical norms & our process for objectivity.
This research is not funded by any sponsors.
Updated on Apr 2, 2025

Speech-to-Text Benchmark: Deepgram vs. Whisper in 2025

Cem Dilmegani

We benchmarked the leading speech-to-text (STT) providers, focusing specifically on healthcare applications. Our benchmark used real-world examples to assess transcription accuracy in medical contexts, where precision is crucial.

Benchmark results

The average WER results of our tasks show that Deepgram is the leading speech-to-text provider for healthcare in this benchmark.

Methodology

Dataset

We wanted to evaluate the tools’ transcription accuracy in a specific area, so we decided to conduct two tasks:

Task 1: Healthcare voice data

  • Total number of samples: 100
  • Total duration: 9 minutes and 25 seconds
  • Average duration per sample: 5.65 seconds
  • Content: Healthcare voice data including medical terminology, patient interactions, and clinical discussions
  • Variety: Different speakers, varying audio quality, and diverse medical contexts spoken in English

Audio specifications:

  • Format: WAV
  • Channels: 1 (Mono)
  • Sample width: 16-bit
  • Sample rate: 16 kHz
  • Consistent bitrate: 256 kbps
  • Duration range: ~4.5 to 11.5 seconds per file
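These PCM specifications can be verified programmatically. The sketch below, using only Python's standard `wave` module, builds a silent WAV with the Task 1 parameters and confirms the bitrate arithmetic (16 kHz × 16-bit × mono = 256 kbps); `make_silence_wav` and `bitrate_kbps` are illustrative helpers of our own, not part of any provider's SDK:

```python
import io
import wave

def make_silence_wav(seconds=1.0, rate=16000, channels=1, sampwidth=2):
    """Write a silent WAV matching the Task 1 spec into an in-memory buffer."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sampwidth)  # 2 bytes per sample = 16-bit
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * channels * int(rate * seconds))
    buf.seek(0)
    return buf

def bitrate_kbps(wav_source):
    """Uncompressed PCM bitrate: sample rate x bits per sample x channels."""
    with wave.open(wav_source, "rb") as w:
        return w.getframerate() * w.getsampwidth() * 8 * w.getnchannels() // 1000

print(bitrate_kbps(make_silence_wav()))  # → 256
```

The same arithmetic for the Task 2 files (48 kHz × 16-bit × stereo) yields the 1536 kbps listed below.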

Task 2: An anatomy lecture

  • Total number of samples: 1
  • Total duration: 8 minutes and 35 seconds
  • Content: An anatomy lecture given by a doctor, including medical terminology
  • Variety: One speaker speaking in English; music plays in the background during the first half of the recording.

Audio specifications:

  • Format: WAV
  • Channels: 2 (Stereo)
  • Sample width: 16-bit
  • Sample rate: 48 kHz
  • Consistent bitrate: 1536 kbps

Evaluation metrics

We used Word Error Rate (WER) and Character Error Rate (CER) as evaluation metrics for transcription accuracy. Word Error Rate is calculated as:

WER = (S + D + I) / N

Where:

  • S = Number of substitutions
  • D = Number of deletions
  • I = Number of insertions
  • N = Total number of words in the ground truth

The formula calculates the minimum number of word-level operations needed to transform the hypothesis into the reference, divided by the number of words in the reference. Lower WER indicates better accuracy, with 0% being a perfect match.

The Character Error Rate (CER) is calculated by dividing the total number of character-level errors (including insertions, deletions, and substitutions) by the total number of characters in the reference text.
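Both metrics reduce to an edit distance between the reference and the hypothesis, computed over words for WER and over characters for CER. A minimal sketch in Python (the function names are our own):

```python
def edit_distance(ref, hyp):
    """Minimum substitutions + deletions + insertions (Levenshtein) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution (or match)
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    """(S + D + I) / N over words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """(S + D + I) / N over characters."""
    return edit_distance(reference, hypothesis) / len(reference)

# One substituted word out of four in the reference:
print(wer("the patient has hypertension", "the patient had hypertension"))  # → 0.25
```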

We used speech-to-text APIs to transcribe audio files to text.

The maximum audio file size each provider accepts per request is shown in the table:

Last Updated at 02-11-2025
Provider                        Maximum file size
Amazon AWS Transcribe           2 GB
AssemblyAI                      5 GB
Deepgram Nova 2                 2 GB
Google Cloud Speech-to-Text     10 MB
Microsoft Azure Speech          1 GB
OpenAI Whisper L-V2             25 MB
Rev.ai                          1 GB
Speechmatics                    1 GB

Note: For providers with smaller file size limits (like Google and OpenAI), larger audio files need to be split into smaller chunks before processing. We performed that in Task 2.
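A naive way to perform such chunking with Python's standard `wave` module is sketched below; `split_wav` is an illustrative helper, not what we necessarily used. Note that cutting at fixed time boundaries can split a word in half, so production pipelines usually cut on silence instead:

```python
import io
import wave

def split_wav(src, chunk_seconds=60):
    """Split a WAV file or stream into chunks of at most chunk_seconds each."""
    chunks = []
    with wave.open(src, "rb") as w:
        params = w.getparams()
        frames_per_chunk = int(w.getframerate() * chunk_seconds)
        while True:
            frames = w.readframes(frames_per_chunk)
            if not frames:
                break
            buf = io.BytesIO()
            with wave.open(buf, "wb") as out:
                out.setparams(params)  # nframes is corrected on close
                out.writeframes(frames)
            buf.seek(0)
            chunks.append(buf)
    return chunks
```

Each returned chunk is a complete, standalone WAV stream that can be submitted to a size-limited API.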

Speech recognition

Speech recognition enables computers to transcribe audio files into text with the help of machine learning algorithms. A transcription service's API can be used from various programming languages for batch transcription. These platforms support both real-time and asynchronous transcription.

Speech recognition technology has numerous applications, including transcription, voice assistants, and language translation.

Benefits of using speech recognition for transcription

  • Fast transcription of audio files
  • Time and effort savings
  • Real-time transcription and translation
  • Accessibility for individuals with disabilities

How do speech-to-text AI tools work?

The transcription process includes:

  • Audio data is uploaded or streamed to the speech-to-text tool
  • Machine learning algorithms analyze the audio data and identify patterns in speech
  • The tool converts the speech to text using a speech-to-text engine
  • The transcribed text is displayed to the user
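The steps above can be sketched as a minimal batch pipeline. `recognize` here is a hypothetical stand-in for any real provider's API call, and `transcribe_batch` is our own illustrative wrapper:

```python
from pathlib import Path

def recognize(audio_bytes: bytes) -> str:
    """Hypothetical stand-in for a provider API call (in practice, an HTTP request)."""
    return f"<transcript of {len(audio_bytes)} bytes>"

def transcribe_batch(audio_dir: str) -> dict:
    """Upload each audio file, run recognition, and collect the text per file."""
    results = {}
    for path in sorted(Path(audio_dir).glob("*.wav")):
        results[path.name] = recognize(path.read_bytes())
    return results
```

Swapping `recognize` for a real SDK call is all that changes between providers; the surrounding upload-and-collect loop stays the same.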

FAQ

What are the applications of speech recognition technology?

Transcription of audio and video recordings can be used in:

  • Voice assistants and virtual assistants
  • Language translation and interpretation
  • Speech-to-text (ASR) systems for individuals with disabilities

What are the features of leading speech-to-text providers?

Their pre-trained models enable automatic speech recognition (ASR) for recorded audio and video files. High-accuracy transcriptions include automatic punctuation and topic detection.
You can choose either an open-source engine or a speech recognition provider from a service your company already works with (e.g., Google Cloud, AWS Transcribe) as the transcription solution that fits your company's needs. Some providers also offer free credits, but we recommend caution regarding data security.

How to convert audio files to text?

A speech-to-text API can help transcribe audio files into text.

Processing and analysis of audio data:

  • Audio data is processed using techniques such as noise reduction and echo cancellation
  • The audio data is then analyzed using machine learning algorithms to identify patterns in speech
  • The algorithms use acoustic models and language models to recognize spoken words and phrases

Converting speech to text using machine learning algorithms:

  • Machine learning algorithms are trained on large datasets of audio and text data
  • The algorithms learn to recognize patterns in speech and convert them into text
  • The algorithms can be fine-tuned and customized for specific use cases and languages


Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per SimilarWeb), including 55% of the Fortune 500, every month.

Cem's work has been cited by leading global publications including Business Insider, Forbes, Washington Post, global firms like Deloitte, HPE and NGOs like World Economic Forum and supranational organizations like European Commission. You can see more reputable companies and resources that referenced AIMultiple.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led technology strategy and procurement of a telco while reporting to the CEO. He has also led commercial growth of deep tech company Hypatos that reached a 7 digit annual recurring revenue and a 9 digit valuation from 0 within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.
Şevval is an AIMultiple industry analyst specializing in AI coding tools, AI agents and quantum technologies.
