Using Google’s Speech-to-Text API with Python

Grettel Juárez
4 min read · Mar 26, 2021


This post walks through the steps and Python code for using the Google Cloud Platform speech transcription service.


Speech transcription refers to the conversion of speech audio to text. This can be applied to many use cases such as voice assistants, dictation, customer service call center documentation, or creation of meeting notes in a business setting. It is not difficult to see the value this can bring to individuals and businesses.

AWS has long been a leader in this space, and Google, IBM, and Microsoft have developed their own services as well. There are also lesser-known options to consider. For example, Dragon Transcription Engine is known to be exceptional for medical transcription. Whatever your use case, it is worth considering the options available. See the bottom of this post for ways to quickly try out the Google and AWS options and compare them.

In this post, I will share the steps to implement the Google Speech-to-Text API using Python, along with tips I learned in the process. Google’s API offers three methods of transcription: synchronous, asynchronous, and streaming recognition. For this post, I will be focusing on the asynchronous recognition type using the REST API. Requests are made to the API and a response is returned as JSON.

The high-level steps for this implementation are as follows:

  • Set up Google Cloud services
  • Prepare/Transcribe file

The full code is available on GitHub here.

Set up Google Cloud services

Two services will be needed:

  • Speech-to-Text: Generate key to access conversion API
  • Storage: For asynchronous requests, the audio file to be converted must be read from a cloud storage bucket

1. Create Speech-to-Text service

First, you will need to set up the Speech-to-Text API and download your credentials as a JSON file. Follow the instructions for setting up the API in Google Cloud’s quick start documentation here.

If you don’t already have a Google Cloud Platform account, the quick start will take you through creating one.

Tip: If setting the JSON file path from the terminal, as described in the documentation, does not work for you, you can try the code below in your Python notebook:

import os
# Placeholder: replace with the path to the JSON key file you downloaded
json_path = "/path/to/your-credentials.json"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = json_path

2. Verify API setup

Again, follow the instructions in the documentation linked in the previous step to complete these additional steps:

  • “Install the client library”
  • “Make an audio transcription request”

This uses a sample audio file that Google already hosts in one of its storage buckets, so it confirms the API setup before you upload anything of your own.

3. Create storage

Next, a Google Cloud bucket must be created. The transcription service will read your audio file from this bucket.

This can be done on the Google Cloud console here.

Alternatively, the bucket can be created programmatically with the google-cloud-storage client library.
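
A minimal sketch of creating the bucket from Python is shown here. The function name create_bucket, the "US" default location, and the optional client argument are assumptions made for illustration; the client argument only exists so the function can be tried with a stand-in object. It assumes google-cloud-storage is installed and your credentials are configured.

```python
def create_bucket(bucket_name, location="US", client=None):
    """Create a Cloud Storage bucket and return its name.

    bucket_name must be globally unique. "US" is only an example location.
    """
    if client is None:
        # Imported lazily so the default path needs google-cloud-storage,
        # while a stand-in client can be passed for a dry run.
        from google.cloud import storage
        client = storage.Client()
    bucket = client.create_bucket(bucket_name, location=location)
    return bucket.name
```

Calling create_bucket("my-audio-bucket") with a name of your choosing should create the bucket in the project tied to your credentials.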

Prepare/Transcribe file

Now that we have confirmed a successful request and set up a storage bucket, we can transcribe our own file. The next step is to prepare the file.

First, install the necessary libraries if they are not already available.

pip install --upgrade google-cloud-speech
pip install --upgrade google-cloud-storage

Import:

from google.cloud import speech
from google.cloud import storage

1. Meet file requirements

An input file must meet a couple of requirements for transcription to complete successfully. For a .wav file, those are:

  • Sample rate must be 16 kHz (16,000 Hz) or greater
  • .wav file must be single channel (mono)

More details for your specific use case may be found in the best practices documentation here.
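
To see whether a file already meets these two requirements, the WAV header can be inspected with Python's built-in wave module. This is a small sketch; the function name check_wav and the min_rate parameter are hypothetical names chosen for this example.

```python
import wave


def check_wav(path, min_rate=16000):
    """Return (channels, sample_rate, ok) for a WAV file.

    ok is True only when the file is mono and sampled at min_rate Hz or more.
    """
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
    return channels, rate, (channels == 1 and rate >= min_rate)
```

A stereo file will come back with ok set to False, which is the cue to convert it before uploading.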

If you are working with a .wav file, the code below converts it to a single channel.
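
As a minimal sketch, assuming 16-bit PCM audio, the conversion can be done with the standard library alone. The function name to_mono is hypothetical, and averaging the channels is one common approach rather than the only one. A library such as pydub would be a reasonable alternative for other sample widths.

```python
import array
import sys
import wave


def to_mono(src_path, dst_path):
    """Write a mono copy of a 16-bit PCM WAV file by averaging its channels."""
    with wave.open(src_path, "rb") as src:
        channels = src.getnchannels()
        width = src.getsampwidth()
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())
    if width != 2:
        raise ValueError("this sketch only handles 16-bit PCM audio")

    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Samples are interleaved (L, R, L, R, ...), so average each group.
    mono = array.array(
        "h",
        (sum(samples[i:i + channels]) // channels
         for i in range(0, len(samples), channels)),
    )
    if sys.byteorder == "big":
        mono.byteswap()

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate)  # the sample rate is left unchanged
        dst.writeframes(mono.tobytes())
```

The sample rate is carried over as-is, so a file recorded below 16 kHz would still need resampling with a separate tool.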

2. Upload file to storage bucket

The file must be in Cloud Storage before it can be transcribed asynchronously. The chunk_size setting uploads a large file in pieces rather than in a single request.
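
A hedged sketch of the upload step with google-cloud-storage follows. The function name upload_blob, the 5 MiB default, and the optional client argument are assumptions made for this example. The library expects chunk_size to be a multiple of 256 KiB, which 5 MiB satisfies.

```python
def upload_blob(bucket_name, source_file, blob_name,
                chunk_size=5 * 1024 * 1024, client=None):
    """Upload a local file to a bucket and return its gs:// URI.

    chunk_size must be a multiple of 256 KiB; 5 MiB is an example value.
    """
    if client is None:
        # Imported lazily; pass a stand-in client to try this without GCP.
        from google.cloud import storage
        client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name, chunk_size=chunk_size)
    blob.upload_from_filename(source_file)
    return f"gs://{bucket_name}/{blob_name}"
```

The returned gs:// URI is what the transcription request needs in order to locate the audio.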

3. Transcribe

The code below is for single speaker/text transcription. Note a few configurations here:

  • Model is specified as “video” because my audio files were extracted from mp4 videos. Please see documentation here.
  • Punctuation is turned on explicitly

Tip: I recommend setting the model explicitly. The quality of my transcripts improved significantly once I did.
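
The configuration described above can be sketched as follows, using the asynchronous long_running_recognize call from google-cloud-speech. The function name transcribe_gcs, the "en-US" default, and the one-hour timeout are assumptions made for illustration. The sketch also assumes a WAV file, for which the encoding and sample rate can be left out because the service reads them from the file header.

```python
def transcribe_gcs(gcs_uri, language_code="en-US", timeout=3600, client=None):
    """Transcribe a single-speaker audio file stored at a gs:// URI."""
    if client is None:
        # Imported lazily; pass a stand-in client to try this without GCP.
        from google.cloud import speech
        client = speech.SpeechClient()

    config = {
        "language_code": language_code,
        "model": "video",                      # audio extracted from video
        "enable_automatic_punctuation": True,  # punctuation turned on explicitly
    }
    audio = {"uri": gcs_uri}

    # Asynchronous request: returns an operation that is polled for the result.
    operation = client.long_running_recognize(config=config, audio=audio)
    response = operation.result(timeout=timeout)

    # Each result covers a consecutive portion of the audio;
    # the first alternative is the most likely transcription.
    return " ".join(
        result.alternatives[0].transcript.strip()
        for result in response.results
        if result.alternatives
    )
```

For long recordings the timeout may need to be raised, since result() blocks until the operation finishes or the timeout expires.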

4. Write transcript to file
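
One simple way to do this, sketched under the assumption that the transcript is already a single string, is to save it under the audio file's name with a .txt extension. The function name write_transcript and the out_dir parameter are hypothetical.

```python
from pathlib import Path


def write_transcript(audio_file, transcript, out_dir="."):
    """Save transcript text as <audio file name>.txt and return the path."""
    out_path = Path(out_dir) / (Path(audio_file).stem + ".txt")
    out_path.write_text(transcript, encoding="utf-8")
    return out_path
```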

5. Delete file from bucket

Delete the file from the bucket once it is no longer needed to avoid unnecessary storage charges.
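
A minimal sketch of the cleanup step with google-cloud-storage is shown here. The signature takes the bucket name and the blob name in that order, and the optional client argument is an assumption added so the function can be tried with a stand-in object.

```python
def delete_blob(bucket_name, blob_name, client=None):
    """Delete one object from a Cloud Storage bucket."""
    if client is None:
        # Imported lazily; pass a stand-in client to try this without GCP.
        from google.cloud import storage
        client = storage.Client()
    client.bucket(bucket_name).blob(blob_name).delete()
```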

6. Put it all together
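
The upload, transcribe, and write steps can be combined into one function, sketched here in compact form. The name transcribe_file, the two optional client arguments, and the fixed "en-US" language are assumptions made for this example, and the input is assumed to be a mono WAV file that already meets the requirements above.

```python
from pathlib import Path


def transcribe_file(audio_path, bucket_name,
                    storage_client=None, speech_client=None):
    """Upload a WAV file, transcribe it, and save the text as a .txt file."""
    if storage_client is None:
        from google.cloud import storage
        storage_client = storage.Client()
    if speech_client is None:
        from google.cloud import speech
        speech_client = speech.SpeechClient()

    audio_path = Path(audio_path)

    # Upload the file to the bucket under its own file name.
    blob = storage_client.bucket(bucket_name).blob(audio_path.name)
    blob.upload_from_filename(str(audio_path))

    # Start the asynchronous request and wait for it to finish.
    operation = speech_client.long_running_recognize(
        config={
            "language_code": "en-US",
            "model": "video",
            "enable_automatic_punctuation": True,
        },
        audio={"uri": f"gs://{bucket_name}/{audio_path.name}"},
    )
    response = operation.result(timeout=3600)
    text = "\n".join(
        r.alternatives[0].transcript.strip()
        for r in response.results
        if r.alternatives
    )

    # Write the transcript next to the audio file.
    out_path = audio_path.with_suffix(".txt")
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

The uploaded audio stays in the bucket after this runs, so the cleanup step is still a separate call.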

The result will be a file with the same name as the audio file, but with a .txt extension.

Delete file from storage when ready:

# remove audio file from bucket
delete_blob(bucket, audio_file)

The code used here, along with the inputs and outputs, is available on my GitHub here.

In addition to these steps, my notebook on GitHub includes:

  • Using beta API for multiple speakers
  • Using beta API for word timestamps

Lastly, it’s worth mentioning that Google Cloud provides an accessible way to try out the transcription from the console. If you have just a handful of files to transcribe or are unsure whether the Google speech-to-text API is the right choice for your project, you may want to check out the demo section on the Google Cloud page.

Similarly, Amazon Transcribe is simple to try on single files via the AWS platform. Comparing its results against Google's on the same audio can help you decide between the two. AWS has a great tutorial with step-by-step instructions for testing out their service as well.

Conclusion

This article provided step-by-step instructions on how to use the Google Speech-to-Text API with Python. I hope it was helpful!
