Automatic Speech Recognition: Amazon Transcribe vs. Google’s Speech-to-Text
My two cents, based on hands-on experience with both services for English-language transcription.

Technically speaking, Automatic Speech Recognition (ASR) converts content in a given language from one form to another: the source is audio, the destination is text, and both are in the same language. I had the opportunity to experiment with both Amazon Transcribe and Google Cloud Platform’s (GCP) Speech-to-Text service to transcribe US-English audio and video. Below, I compare the two services on a few criteria.
1. Speed/API call time
From my observation, GCP’s Speech-to-Text is at least 2–3 times faster than Amazon Transcribe on average. For a 20-second audio clip, Amazon Transcribe may take anywhere from 20s to 50s, whereas Speech-to-Text may take anywhere from 5s to 25s. I also observed that, for a set of audio clips of the same duration, transcription times are more dispersed for Speech-to-Text than for Amazon Transcribe. In other words, Google’s transcription time varies more from clip to clip, while Amazon’s times are slower but cluster more tightly around a higher average.
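To make the "average speed versus dispersion" comparison concrete, here is a minimal sketch of how such timings could be collected and summarized. It assumes each service is wrapped in a function that takes an audio path and returns text; the names `time_call` and `summarize` are hypothetical helpers, not part of either SDK.

```python
import statistics
import time
from typing import Callable, Sequence


def time_call(transcribe: Callable[[str], str], audio_path: str) -> float:
    """Return the wall-clock seconds taken by one transcription call.

    `transcribe` is any wrapper (hypothetical) around a service client.
    """
    start = time.perf_counter()
    transcribe(audio_path)
    return time.perf_counter() - start


def summarize(latencies: Sequence[float]) -> dict:
    """Summarize latencies: mean, sample standard deviation, and their ratio."""
    mean = statistics.mean(latencies)
    # stdev needs at least two samples; treat a single sample as zero spread.
    stdev = statistics.stdev(latencies) if len(latencies) > 1 else 0.0
    return {
        "mean_s": mean,
        "stdev_s": stdev,
        # Coefficient of variation: spread relative to the mean, so services
        # with different average speeds can be compared on dispersion.
        "cv": stdev / mean if mean else 0.0,
    }
```

Running `summarize` over the timings of each service for clips of equal duration gives one mean and one spread figure per service, which is one way to check the pattern described above.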
2. Accuracy
I will only touch on accuracy for technical terms and acronyms. Google’s Speech-to-Text is much better at recognizing them than Amazon Transcribe. For terms such as “S3” and “dev”, Amazon Transcribe may produce “s three” and “depth”, whereas Google’s service transcribes them as they are written here.
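One rough way to quantify this is to check how many expected technical terms appear verbatim in each transcript. The sketch below is an assumption about how one might measure it, not how I ran my comparison; `term_recall` is a hypothetical helper.

```python
import re
from typing import Iterable


def term_recall(transcript: str, expected_terms: Iterable[str]) -> float:
    """Fraction of expected terms found as whole words (case-insensitive)."""
    terms = list(expected_terms)
    if not terms:
        return 0.0
    hits = 0
    for term in terms:
        # Lookarounds act as word boundaries that also work for terms
        # containing digits or symbols, such as "S3".
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, transcript, flags=re.IGNORECASE):
            hits += 1
    return hits / len(terms)
```

For the terms `["S3", "dev"]`, a transcript that renders them as “s three” and “depth” scores 0.0, while one that keeps them as written scores 1.0.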
3. Filler sounds removal
Google automatically removes filler sounds such as “ah”, “um”, and “mhm” from the transcript, whereas Amazon keeps them in the text.
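If you want the two outputs to be comparable, you can strip the fillers from Amazon’s text yourself. The following is a minimal sketch, assuming a simple word list is good enough; the `FILLERS` tuple and `strip_fillers` function are hypothetical and would need extending for real transcripts.

```python
import re

# Filler sounds named in this article; extend as needed.
FILLERS = ("ah", "um", "mhm")

# Match a filler as a whole word, plus an optional trailing comma or
# period and any whitespace that follows it.
_FILLER_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, FILLERS)) + r")\b[,.]?\s*",
    flags=re.IGNORECASE,
)


def strip_fillers(text: str) -> str:
    """Remove filler words and collapse any leftover runs of whitespace."""
    cleaned = _FILLER_RE.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

Because the pattern only matches whole words, words that merely contain a filler (for example “human” or “umbrella”) are left untouched.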
4. Automatic Punctuation
Amazon Transcribe’s automatic punctuation seems to be much more accurate than that of Google’s Speech-to-Text. This…