Mastering Speech-to-Text and Summarisation with Amazon Transcribe and Bedrock

Charlie Douglas
AWS in Plain English
Mar 2, 2024 · 6 min read


Image created by DALL-E

In the rapidly evolving landscape of technology, the leap from the rudimentary speech-to-text capabilities of a decade ago to today's sophisticated models is nothing short of revolutionary. Even as mainstream voice assistants like Siri still struggle with basic commands in 2024, advancements in tools like Amazon Transcribe have dramatically simplified the conversion of speech to text. This breakthrough is not just a technical marvel; it heralds a new era of possibilities in automating work that once required manual transcription and review.

This opens up a world of use cases. What I have been exploring is automatically summarising a recorded telephone conversation and sending out an email with the conversation summary. Microsoft Copilot, of course, offers this very functionality built right into MS Teams. Other companies are getting in on the act too, but none of these options yet suits a call-centre environment, so a different approach is required.

What I propose for this solution is based on three AWS services: Amazon Transcribe, Amazon Bedrock, and Amazon S3. Building this out for a production environment would involve an event-driven architecture that automatically detects when a raw file has landed in S3 and kicks off the process. For this guide, we're just looking at the basics: transcribing the conversation from speech to text and producing a summary.
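To make the event-driven idea concrete, here is a minimal sketch of what an S3-triggered Lambda handler could look like. This is illustrative only: the function and helper names are my own, and it assumes the standard S3 `ObjectCreated` event notification shape that Lambda receives.

```python
import urllib.parse

def extract_s3_location(event):
    """Pull the bucket name and object key out of an S3 event notification."""
    record = event['Records'][0]
    bucket = record['s3']['bucket']['name']
    # Object keys arrive URL-encoded (spaces become '+'), so decode them first
    key = urllib.parse.unquote_plus(record['s3']['object']['key'])
    return bucket, key

def lambda_handler(event, context):
    bucket, key = extract_s3_location(event)
    # From here you would kick off the Transcribe job for s3://{bucket}/{key},
    # following the same steps this guide walks through manually
    print(f"New recording landed: s3://{bucket}/{key}")
    return {'bucket': bucket, 'key': key}
```

Wiring this up would just need an S3 event notification (or EventBridge rule) on the bucket pointing at the Lambda function.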

As with my previous tutorials, you will need to ensure your Python environment is working and you have set up the AWS SDK correctly.

What are Amazon Transcribe and Amazon Bedrock?

Amazon Transcribe is an automatic speech recognition service that developers can easily integrate into their own applications.

Amazon Bedrock is AWS's generative AI platform. It gives developers access to a range of foundation models to build into their applications.

It's worth noting at this stage that Amazon Transcribe has generative AI functionality in preview at the moment which will replicate some of the functionality we are manually building here. It's due for general release later this year.

Getting started with Amazon Transcribe

For the purpose of this tutorial you will need a sample of a recorded conversation. I found the call recordings freely available here useful for testing purposes. Download a few and save them into a folder called 'data'.

Step 1: Import required packages — these will be explained as we use them.

import boto3
import requests
import json
import time
import uuid

Step 2: Upload your audio to S3

This is basic functionality using the AWS SDK for Python (boto3) that allows you to upload a file to a specific S3 bucket. You initiate the S3 service with boto3.client('s3'). Assigning this to a variable lets you reference it throughout the function. s3.upload_file requires you to define the:

  • file path — the location of your source data
  • bucket name — the specific name of your S3 bucket
  • object name — what your file will be called when uploaded

def upload_file_to_s3(file_path, bucket_name, object_name):
    s3 = boto3.client('s3')
    try:
        s3.upload_file(file_path, bucket_name, object_name)
        print(f"File uploaded successfully to S3 bucket: {bucket_name}")
    except Exception as e:
        print(f"Error uploading file to S3 bucket: {e}")

Simply call your function and provide the variable details to upload your file to S3.

# Upload file to S3
file_path = 'data/youraudioname.mp3'
bucket_name = 'your_s3_bucket'
object_name = 'youraudioname.mp3'
upload_file_to_s3(file_path, bucket_name, object_name)

Your result should show as "File uploaded successfully to S3 bucket". If you received an error, it is most likely a permissions issue, meaning your SDK setup isn't correct.

Step 3: Start the Amazon Transcribe job

The Amazon Transcribe client is created in exactly the same way as the S3 client. The specific function required is start_transcription_job, and it needs to be passed the name of the job, the file location from the previous step, the media format, and the language.

I am generating a unique job name using the uuid library. This creates a unique string which is appended to the end of the object name and will look something like: youraudioname.mp3-30b57e6d-2109-4dd8-8088-c0615c96321d.

The media_uri is generated by concatenating the bucket name and object name together along with the s3:// prefix.

transcribe = boto3.client('transcribe')

# Start a transcription job
job_name = object_name + '-' + str(uuid.uuid4())
media_uri = "s3://" + bucket_name + '/' + object_name
transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': media_uri},
    MediaFormat='mp3',
    LanguageCode='en-US'
)

When you've run this code you should get a display of information. The key part to look for is the job's current status, which will likely be 'IN_PROGRESS'. This can be validated in the AWS Console as well. You can see that my jobs had completed by the time I managed to grab a screenshot!

Step 4: Display the transcribed output

The first step in retrieving your transcribed output is checking the status of the job. Using the time library, the code polls until the job has completed or failed before moving on to the next stage.

The output of a Transcribe job is stored in JSON format, which means we need to parse it to retrieve legible text. If the job has completed, we index into the JSON to retrieve the specific attributes required, which are then printed out.

# Wait for the transcription job to complete
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(5)

# Fetch and display the transcription
if status['TranscriptionJob']['TranscriptionJobStatus'] == 'COMPLETED':
    transcript_url = status['TranscriptionJob']['Transcript']['TranscriptFileUri']
    response = requests.get(transcript_url)
    transcript_json = response.json()

    # Extract and display the transcript in a user-friendly format
    transcript_text = transcript_json['results']['transcripts'][0]['transcript']
    print(transcript_text)
else:
    print("Transcription job failed")
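To make the indexing step concrete, here is a trimmed illustration of the JSON shape a completed Transcribe job returns (the real output also contains word-level items with timings and confidence scores, omitted here for brevity):

```python
# A trimmed, illustrative example of a Transcribe job's output JSON
sample = {
    "jobName": "youraudioname.mp3-30b57e6d",
    "results": {
        "transcripts": [
            {"transcript": "Hello, thanks for calling. How can I help?"}
        ],
        # an "items" list with word-level timings would also appear here
    },
    "status": "COMPLETED",
}

# The same indexing used above: results -> transcripts -> first entry -> transcript
text = sample["results"]["transcripts"][0]["transcript"]
print(text)
```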

Step 5: Summarise the output with Amazon Bedrock

We’ll use a similar process to my previous article on Getting Started with Amazon Bedrock.

As with connecting to S3 and Transcribe, the process again uses boto3.client, but this time the service name is bedrock-runtime. I'm using Anthropic's Claude v2 model, and the prompt simply asks it to summarise the conversation before passing in the transcript. Note that Claude v2's text-completion format requires the prompt to start with "\n\nHuman:" and end with "\n\nAssistant:". I've limited the output to 500 tokens (around 2,000 characters), which you can play around with to get a length that suits your use case.

bedrock = boto3.client(service_name='bedrock-runtime')

modelId = 'anthropic.claude-v2'
accept = 'application/json'
contentType = 'application/json'
body = json.dumps({
    # Claude v2 expects the Human/Assistant turn markers around the prompt
    "prompt": "\n\nHuman: Please provide a summary of this conversation: " + transcript_text + "\n\nAssistant:",
    "max_tokens_to_sample": 500
})

response = bedrock.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
response_body = json.loads(response.get('body').read())

print(response_body.get('completion'))

Wrapping Up

AWS's Python SDK makes development with their services incredibly simple. In around 50 lines of code we've uploaded an audio file, transcribed it with Amazon Transcribe, and summarised it with Amazon Bedrock. As a next step, you could use Amazon Comprehend to run sentiment analysis and find the key entities mentioned within the conversation.
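That Comprehend step could look something like the sketch below. The function name is my own, and note that Comprehend's detect_sentiment accepts at most 5,000 bytes of UTF-8 text per request, so longer transcripts need truncating (or chunking) first; only the truncation helper here is exercised without AWS credentials.

```python
MAX_BYTES = 5000  # Comprehend's per-request limit for detect_sentiment

def truncate_for_comprehend(text, limit=MAX_BYTES):
    """Trim text to Comprehend's byte limit without splitting a
    multi-byte UTF-8 character."""
    encoded = text.encode('utf-8')[:limit]
    return encoded.decode('utf-8', errors='ignore')

def analyse_transcript(transcript_text):
    import boto3  # assumes the same SDK setup used throughout this guide
    comprehend = boto3.client('comprehend')
    text = truncate_for_comprehend(transcript_text)
    sentiment = comprehend.detect_sentiment(Text=text, LanguageCode='en')
    entities = comprehend.detect_entities(Text=text, LanguageCode='en')
    print("Overall sentiment:", sentiment['Sentiment'])
    for entity in entities['Entities']:
        print(f"{entity['Type']}: {entity['Text']}")
```

Calling analyse_transcript(transcript_text) after Step 4 would print an overall sentiment (POSITIVE, NEGATIVE, NEUTRAL, or MIXED) plus the people, organisations, and dates mentioned in the call.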


Data analytics, data science, AI/ML professional within financial services.