[TUT] Speech to Text from a File with Google Cloud Speech API

This blog will explain how to use Google Cloud’s Speech API to convert an audio recording of someone speaking into text on Android. We’ll use Kotlin and the Google Speech client library.

Google offers a speech recognition API that can convert your spoken language into a textual representation. The API has methods for long-running recognition of pre-recorded audio, short synchronous recognition of pre-recorded audio, and streaming recognition. Today we are looking at synchronous recognition of a pre-recorded audio source.

First off, add the dependencies to your build.gradle. The Google Cloud Speech API is needed, as well as Google’s Remote Procedure Call (gRPC) library – gRPC uses protobufs and is how the Speech API sends and receives data.

implementation 'io.grpc:grpc-okhttp:1.10.0'
implementation 'com.google.cloud:google-cloud-speech:0.41.0-alpha'

We will be using coroutines as our background threading model, so also add this dependency:

implementation "org.jetbrains.kotlinx:kotlinx-coroutines-core:1.0.1"

Although the API itself is officially released, the client library is still in alpha, and Google gives this warning:

The Speech-to-Text v1 is officially released and is generally available from the https://speech.googleapis.com/v1/speech endpoint. The Client Libraries are released as Alpha and will likely be changed in backward-incompatible ways. The client libraries are currently not recommended for production use.

If you have not done so already, you need to set up a Google Cloud project and enable the Speech-to-Text API (this is so Google can charge you if you go over the usage limit 🙂 ).

Once you’ve done that, there is one more caveat before we get to the juicy details. Google Cloud uses an API scheme where keys give you access to each API. Best practice dictates keeping these keys on a server behind your own security and using your own API to access them. This blog is not about that, so we are taking a shortcut.

We declare the speechClient as a field of the class. We use Kotlin’s lazy initialisation to instantiate the field the first time we access it. The example is doing this inside a Fragment, hence the use of the activity.

private val speechClient : SpeechClient by lazy {
        // NOTE: TODO STOPSHIP The line below uses an embedded credential (res/raw/credential.json).
        //       You should not package a credential with a real application.
        //       Instead, you should get a credential securely from a server.
        activity?.applicationContext?.resources?.openRawResource(R.raw.credential).use {
            SpeechClient.create(
                SpeechSettings.newBuilder()
                    .setCredentialsProvider { GoogleCredentials.fromStream(it) }
                    .build())
        }
    }
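
One housekeeping note: the SpeechClient holds background gRPC resources, so it is worth shutting it down once you are finished with it. A minimal sketch, assuming the Fragment above owns the client:

override fun onDestroy() {
    super.onDestroy()
    // SpeechClient implements AutoCloseable; shutdown() releases its background gRPC channels.
    // Note: touching the lazy field here will create the client if it was never accessed.
    speechClient.shutdown()
}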

Once we have our SpeechClient created from our Google Cloud credentials we can get to the recognition. Recognition involves asynchronous calls and network use, so it is best to do it off the Android main thread. Here we call the speech recognition in a coroutine.

GlobalScope.launch {
            Log.d("TUT", "I'm working in thread ${Thread.currentThread().name}")
            analyze(fileByteString)
        }
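
Since recognize is a blocking network call, you may prefer to name the dispatcher explicitly rather than rely on the default. A variation of the same launch, using the coroutines dependency we added earlier:

GlobalScope.launch(Dispatchers.IO) {
    // Dispatchers.IO is sized for blocking I/O work, which is what recognize() does.
    analyze(fileByteString)
}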

fileByteString is a ByteString of your voice file. See here for further info about ByteString. Creating the variable will look something like this, depending on where your audio is stored:

val fileByteString = ByteString.copyFrom(File("/path/to/your/file.wav").readBytes())
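
If your recording ships inside the app rather than sitting on the filesystem, ByteString can also be read straight from an InputStream. A sketch, assuming a hypothetical res/raw/sample_audio resource:

val fileByteString = resources.openRawResource(R.raw.sample_audio).use { stream ->
    // ByteString.readFrom consumes the whole stream into an immutable ByteString.
    ByteString.readFrom(stream)
}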

Now we can create the analyze method, passing in our file.

private fun analyze(fileByteString: ByteString) {
        val req = RecognizeRequest.newBuilder()
            .setConfig(
                RecognitionConfig.newBuilder()                
                    .setEncoding(RecognitionConfig.AudioEncoding.AMR_WB)
                    .setLanguageCode("en-US")
                    .setSampleRateHertz(16000)
                    .build()
            )
            .setAudio(
                RecognitionAudio.newBuilder()
                    .setContent(fileByteString)
                    .build()
            )
            .build()
        val response = speechClient.recognize(req)

        Log.d("TUT", "Response, count ${response.resultsCount}")
        val results = response.resultsList
        for (result in results) {
            val alternative = result.alternativesList[0]
            val text = alternative.transcript
            Log.d("TUT", "Transcription: $text")
        }
    }

Here we are passing in an audio file that uses the AMR_WB codec, and the config declares AMR_WB as the encoding. Google’s best practices documentation covers which encodings and sample rates give the best results. If the encoding or sample rate you declare does not match the actual file, you will still get a result – but it will be empty.
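
For comparison, if your recording were a 16 kHz, 16-bit PCM WAV file instead, the config would simply declare a different encoding. A sketch (adjust the sample rate to match your actual file):

val wavConfig = RecognitionConfig.newBuilder()
    // LINEAR16 is uncompressed 16-bit PCM, the encoding inside a standard WAV file.
    .setEncoding(RecognitionConfig.AudioEncoding.LINEAR16)
    .setLanguageCode("en-US")
    .setSampleRateHertz(16000)
    .build()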

We’re using the RecognizeRequest builder as we want to transcribe audio under a minute long. For audio over 1 minute, you need a LongRunningRecognizeRequest and to host that audio on Google Cloud Storage.
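
For the curious, the long-running call looks similar. A rough sketch, assuming the audio has already been uploaded to a (hypothetical) Cloud Storage bucket:

val longRunningConfig = RecognitionConfig.newBuilder()
    .setEncoding(RecognitionConfig.AudioEncoding.AMR_WB)
    .setLanguageCode("en-US")
    .setSampleRateHertz(16000)
    .build()
val longRunningAudio = RecognitionAudio.newBuilder()
    // The audio must live in Google Cloud Storage; this URI is a placeholder.
    .setUri("gs://your-bucket/your-long-recording.amr")
    .build()
// longRunningRecognizeAsync returns an OperationFuture; get() blocks until the
// transcription finishes, so this also belongs off the main thread.
val longRunningResponse = speechClient
    .longRunningRecognizeAsync(longRunningConfig, longRunningAudio)
    .get()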

In the config, you should set the encoding, language and sample rate to match those of your audio file. In the RecognitionAudio builder, we use setContent to pass in the ByteString audio file to be transcribed (under 1 minute). We then call the recognize method on our speech client. The result is a synchronous RecognizeResponse, which contains a list of transcribed results. For each result there is a further list of alternative transcriptions; we select the first of these and read its text from transcript.
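
If you want the whole transcription as a single string rather than logging each chunk, joining the top alternatives works. A small sketch that could sit at the end of analyze:

// Each result covers a consecutive chunk of the audio; the first alternative is the most likely.
val fullTranscript = response.resultsList.joinToString(separator = " ") { result ->
    result.alternativesList.first().transcript
}
Log.d("TUT", "Full transcription: $fullTranscript")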

That’s it! You can now retrieve a text representation of any spoken audio file that is encoded correctly. If you are going to production with this and expect many users to be accessing it, I recommend you read the quotas and limits section of the docs.

Code is available on GitHub here.