
Powerful auto-transcription using AI (OpenAI's Whisper)

Today, I'll show you how to tap into the "world's most powerful speech-to-text API" from our own applications. We'll be using Deepgram, which is based on OpenAI's Whisper speech-to-text (STT) technology. Deepgram claims to have trained its AI model on 10,000+ years' worth of audio data.

(For more of my posts about text-to-speech and speech-to-text, and about AI and machine learning, click here.)

So how do we leverage this magical power? Of course, anyone can try out the UI on their web site with a few samples, but the real fun is writing your own program that can transcribe any audio file, whenever we want, using their powerful AI platform.

The full source code is below, but I have also shared the code publicly at my Google Colab space here, from where it can be downloaded as a .py or .ipynb file.

Of course, to use their AI service from our apps, we need to get a developer API key, which is available for free (for now), with limitations, from the site: deepgram.com

Without a valid API key, my code is not going to run, locally or in Colab. So, get your own key (I cannot share mine, as my tokens would be used every time that key makes a query, and after a limited time it costs money), and then assign the key's value to the DEEPGRAM_API_KEY variable as a string.
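To give you an idea, here's a minimal setup sketch. It assumes version 2 of Deepgram's Python SDK (the one my code uses); the key string below is just a placeholder:

```python
# Minimal setup sketch, assuming Deepgram's Python SDK v2
# (install with: pip install deepgram-sdk==2.*)
from deepgram import Deepgram

DEEPGRAM_API_KEY = 'paste-your-own-key-here'  # placeholder: get your free key at deepgram.com
dg_client = Deepgram(DEEPGRAM_API_KEY)        # authenticated client used for all requests
```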

Further instructions on the setup, and explanations, are in my Google Colab space, but to summarize what it does: the program calls the AI engine that does the heavy lifting of transcribing an audio file using the Whisper technology created by OpenAI. Deepgram's implementation of it improves on the performance metrics, removes some size limitations, and adds more language support and speaker diarization. Diarization is the ability to recognize different speakers in an audio recording (e.g. conferences, meetings, conversations, interviews), which allows the engine to include a unique tag for each speaker in the transcript.

While it does a very good job transcribing pre-recorded audio, whether from a local file or via a URL to a remote location, such as in streaming format (e.g. podcasts), it can also do real-time transcribing of a live stream, which I was able to verify in another program I wrote. In this blog, though, we'll just discuss transcribing pre-recorded files, both local and remote. The Colab notebook I shared above includes all the code required for that.

But for completeness, here's the full source as well. Feel free to play with it and modify it after you read the rest of this post.
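Here's a condensed sketch of the core remote-URL flow (assuming Deepgram's Python SDK v2 and the dg_client setup from above; the fully commented version is in my Colab):

```python
# Condensed sketch of the remote-URL flow (fully commented version is in my Colab).
FULLPATH = 'https://res.cloudinary.com/deepgram/video/upload/v1663090406/dg-audio/Upgrading-phone-plan_pmfsfm.m4a'

def main(url):
    """Send a remote audio URL to Deepgram and return the JSON response."""
    source = {'url': url}                              # remote/streaming audio source
    options = {'punctuate': True, 'model': 'whisper'}  # request the hosted Whisper model
    return dg_client.transcription.sync_prerecorded(source, options)

response = main(FULLPATH)
```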

Ok, so let's put it to the test. Let's load a streaming, m4a-formatted audio file of a support call between a caller (a man) and a support rep (a woman) over the internet. To do that, in the code above we assign the url 'https://res.cloudinary.com/deepgram/video/upload/v1663090406/dg-audio/Upgrading-phone-plan_pmfsfm.m4a' to FULLPATH, and then we call our custom function main() with it. That takes care of the authentication and communication with the backend, and talks to the AI engine to get the results back as a JSON payload.

Then our program parses the JSON content and extracts what we care about. In my case, I only want the values of the keys 'transcript' and 'confidence' (for accuracy), and the program enumerates all occurrences of the key 'word' in the 'words' array, which gives me the number of words it processed. When you look at the sample JSON response I share below, it'll make more sense.
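That parsing boils down to a few lines (a sketch against the v2 response shape; each 'words' entry also carries per-word timing, which we'll use later):

```python
# Sketch: pulling out the values we care about (v2 response shape)
alt = response['results']['channels'][0]['alternatives'][0]
print('Transcript:', alt['transcript'])
print(f"Confidence: {alt['confidence']:.1%}")

# Each entry in 'words' has the word itself plus its timing in seconds.
for i, w in enumerate(alt['words'], start=1):
    print(f"{i:3}: {w['word']:15} {w['start']:.2f}s - {w['end']:.2f}s")
print('Count of words detected:', len(alt['words']))
```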

You can hear the actual audio file we’re using below (2 mins long):

The output from my program is shown in the shell below:

The program, running on my PC, took 1.2 seconds and got a whole bunch of information back from the AI model in JSON format over the internet. Then I extracted the transcript (shown in the output above in condensed form; I'll share it in full below), the AI's overall confidence in the transcription (99.9%), and the count of words it processed (339). That was impressive.

By the way, 'Seconds to execute' and 'Count of words detected' are not part of the Deepgram API; I implemented them in my code, using the standard Python time library to measure the block's execution, and by parsing the JSON that their AI returned to me. I'll explain more of the overall process below.
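The timing itself is just a few lines of standard library code (a sketch):

```python
import time

start = time.time()          # wall-clock start, just before the API call
response = main(FULLPATH)    # the Deepgram round trip we're measuring
elapsed = time.time() - start
print(f'Seconds to execute: {elapsed:.1f}')
```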

For now, let's take a look at the transcription I got back in 1.2 seconds, without even listening to the audio, and compare it with the audio above! Here's the transcript by the AI:

And thank you for calling Premier Phone Service. This call may be recorded for quality and training purposes. My name is Beth, and I'll be assisting you. How are you today? I'm pretty good. Thanks. How are you? I'm doing well. Thank you. May I have your name? Sure. My name's Tom Idle. Can you spell that last name for me? Yeah. Yeah. I d l e. Okay. L e at the end. I was picturing it idle, like American idle, I d o l. Yeah. That that happens a lot. It's not really a common name. Okay, mister Eidle. How can I help you today? Yeah. I need some information on upgrading my service plan. Sure, I can absolutely help you with that today. Can you tell me what plan you have currently? I think it's a silver And let me get my glasses so I can read this. Yeah. Yeah. It's the silver plan. Okay. Alright, silver plan. And how many people do you have on your plan right now? Three. I've got my brother, Billy, my mom, cat, and I guess I count two. So, yeah, that's three. Great. And how can I help you with your plan today, sir? Oh, you can call me Tom. There's no need for the sir. I'm sorry, Tom. It's just an old habit. How can I help you with your plan? Well, on my plan right now, I can only have three people on it, and I'm wanting to add more, so I'm wondering if I can switch my plan up or upgrade it somehow. How many more people are you wanting to add to your plan? Well, here's the thing. I need to add three more people. So far. I wanted to add my friend Margaret, my daughter, Anna, and my son Todd. Alright. We do have a few options that support six users. One is our gold. The other is our platinum plan. Okay. So how much are those gonna cost me? Well, the gold plan is

That was a file located on a remote server, in a streaming audio format. But I also said I'd demonstrate transcribing a local audio file. Let's do that next.

Here's an audio recording (in a British accent, no less) by a fan of Shakespeare's play Henry V. It's 3.5 minutes long, and you can listen to it yourself below:

The file is local to my PC, in MP3 format, and I want to get a transcription of this beautiful narration.

Configuring a local file vs. a remote file for AI processing is a little different. Instead of sending a file name and path (which is local to my PC only), I have to open the file and send the engine a handle to the binary data. I also need to set the source type to "buffer" and specify its MIME type (this is per Deepgram's SDK documentation, so we have to comply fully or it won't work). The code I share here and in Colab has inline comments to explain further. At any rate, once these are set, we can call the dg_client.transcription.sync_prerecorded() function with these parameters. And if all goes well, we get a detailed JSON response, just as we did above with the remote URL.
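Here's what that looks like (a sketch; in SDK v2 the source dict takes 'buffer' and 'mimetype' keys, and the file name below is hypothetical):

```python
# Sketch: transcribing a local MP3 via an SDK v2 'buffer' source
AUDIO_FILE = 'henry_v_act4_scene3.mp3'   # hypothetical local file name

with open(AUDIO_FILE, 'rb') as audio:    # binary handle, not a path
    source = {'buffer': audio, 'mimetype': 'audio/mpeg'}   # per the SDK docs
    options = {'punctuate': True, 'model': 'whisper'}
    response = dg_client.transcription.sync_prerecorded(source, options)
```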

Then it's up to my program to decide what to do with it. Fortunately, as before, the JSON content has everything I need, and a little more.

Once we run the program, the output in the shell in summary looks like this:

We see the time it took to transcribe (2.8 s), the overall confidence it has (99.5%), and that it detected 418 words spoken in the audio. The actual transcription shown in the editor screenshot above is just a condensed summary line, but it's much longer, so if I click on it and copy the text to the clipboard, I get the full transcription. However, my program also saves the entire transcription locally as a file (as shown in the source code), including details such as the start and end time of each word, and much more. So it's convenient for me to open it, analyze it, and reuse the transcription or the information within, in whatever format I want.

The entire JSON content is shown later in the post. The output file name is in this format: '<audio_file.ext>.json', and it's placed in a subdirectory of the current working directory called 'transcripts'. So it's really easy to locate any of the JSON files and tell, just by looking at the name, which audio file it corresponds to. Of course, you can change the code to save however you wish.
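Here's roughly what my save_to_disk() helper does (a sketch following the naming scheme just described):

```python
import json
import os

def save_to_disk(audio_path, response):
    """Sketch: save the raw JSON response as ./transcripts/<audio_file.ext>.json"""
    os.makedirs('transcripts', exist_ok=True)   # create the folder once
    out_path = os.path.join('transcripts',
                            os.path.basename(audio_path) + '.json')
    with open(out_path, 'w') as f:
        json.dump(response, f, indent=2)        # keep it human-readable
    return out_path
```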

We can open it with the custom function load_from_disk(TRANSCRIPTION_FILE), where TRANSCRIPTION_FILE is the full input path of the local audio file.
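A sketch of that helper, plus the re-parsing step (the saved-JSON path here is hypothetical, derived from the naming scheme above):

```python
import json

def load_from_disk(json_path):
    """Sketch: reload a previously saved Deepgram response from disk."""
    with open(json_path) as f:
        return json.load(f)

# No new API call needed: just re-parse the saved response.
saved = load_from_disk('transcripts/henry_v_act4_scene3.mp3.json')  # hypothetical path
alt = saved['results']['channels'][0]['alternatives'][0]
print(alt['transcript'])
print('Count of words detected:', len(alt['words']))
```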

Because everything is saved in the JSON file, we don't need to make another API call to the AI to extract details of the transcription; we simply need to parse the JSON file's content. And the output I want in the shell is just the transcribed text and the number of words transcribed. Here's the output from the JSON file for Henry V:

Henry v, Act four, scene three.
What's he that wishes so? My cousin, West Mulland, nay, my fair cousin.
If we are marked to die, we are a now to do our country loss.
And if to live, the fewer men, the greater share of honor, God's will, I pray thee, wish not one man more.
By Job, I am not covetous for gold.
Nor care I who doth feed upon my cost.
It yearns me not if men my garments wear Such outward things dwell not in my desires.
But if it be a sin to covet honor, I am the most offending soul alive.
No.
Faith my cousins wish not a man from England.
God's peace.
I would not lose so great an honor as one man more me thinks would share from me for the best hope I have Oh, do not wish one more.
Rather, proclaim it Westmoland through my host that he which hath no stomach to this fight let him depart.
His passport shall be made and crowns for convoy put into his purse.
We would not die in that man's company that fears his fellowship to die with us.
This day is called the feast of Christian.
He that outlives this day and comes safe home, will stand a tiptoe when the day is named and rouse him at the name of Crispian.
He that shall live this day and see old age will yearly on the vigil feast his neighbors and say, tomorrow is Saint Crispian.
Then he will strip his sleeve and show his scars and say, these wounds I had on Crispin's day.
Old men forget, yet all shall be forgot, but he'll remember with advantages what feats he did that day.
Then shall our names, familiar in his mouth as household words Harry the king, Bedford, and Exeter Wuric and Tollbert, Salisbury, and Gloucester, be in their flowing cups freshly remembered.
This story shall the good man teach his son, and Crispin Crispian shall now go by from this day to the ending of the world, but we in it shall be remembered.
We, few, we, happy few, We band of brothers for he today that sheds his blood with me shall be my brother.
Be he near so vile this day shall gentle his condition.
And gentlemen in England, now a bed, shall think themselves accursed they were not here and hold their manhood's cheap whilst any speaks had fought with us upon St.
Crispin's day.

Count of words detected: 418

Wonderful…we’ve got everything we need!

As mentioned earlier, I'll share the raw JSON output from the Henry V audio snippet, which the program saved via the custom function save_to_disk(); the output is shown below. Its size depends on the number of words detected in the audio.

And that's all for now. Using Deepgram's Whisper API, we can transcribe large audio files quickly, very accurately, and with clean, reasonably short code. Given its support for large audio streams and files, and for many languages, you can see how useful this can be for training, video production, and translation, both on demand and in real time.

Are there any caveats? Well, first, after you burn through your tokens, you'll have to get more, which must be bought. While not expensive per token, they can add up, so be aware of that while testing. They do give you $200 of credit to start with. As for technical performance, it's very, very good. However, it does NOT transcribe songs, even when the lyrics are clear and not drowned out by the instruments. In fact, it transcribed 2 seconds' worth of lyrics and then simply stopped transcribing. I suspect this has more to do with copyright than with any technical limitation. As with any AI, it's not going to be 100% correct all the time, but keeping in mind the speed and high accuracy, it's suitable for transcribing most reasonably clear audio files.

Be sure to read my next blog, where I share the trick to enable diarization and parse the results so you can neatly show a full transcript, line by line, by speaker, from an audio file with multiple speakers! This is powerful stuff!

For more of my posts about text-to-speech and speech-to-text, and about AI and machine learning, click here.

JSON output from Henry V audio clip:
