
AI Transcription with Diarization

This is the second post in a 3-part series on Deepgram’s audio-to-text transcriber using their latest AI engine, Whisper. Be sure to read the posts in this order, if you haven’t already, to follow along best:

  1. Powerful Auto-Transcription Using AI (OpenAI’s Whisper) (previous post)
  2. AI Transcription With Diarization (this post)
  3. Summarization & Detecting Topics by Deepgram Whisper AI (next post)

This is a continuation of the post about Deepgram’s AI technology for transcribing real-time or pre-recorded audio in virtually any format, including streaming. Read that post first if you haven’t already, as this one builds directly on its code.

In this post, I demonstrate how we can extend the code in the previous post to enable diarization and how to parse through the JSON response we get from the AI model.

The following audio clip is from an all-female NASA mission communication recording, which includes multiple speakers.

In my previous post on the topic, I demonstrated with code how to transcribe this type of audio. However, the transcription had no information about which speaker said which word(s)…it was one blob of text. With Deepgram, we can identify each speaker with a number from 0 to n. And if a speaker speaks again later, his/her words are attributed to the same speaker ID instead of creating a new one. This process is called diarization, and it is incredibly powerful and useful. The accuracy is very good for most clear audio with some distinguishable tone/voice differences between speakers.

So, how do we do this? First, we need to set the options before we send the request to the model; specifically, the diarize parameter. The options variable from the previous code would look something like the below (you can change the other parameters, but “diarize” must be set to True).

options = { "punctuate": True,
"numerals": True,
"utterances": True,
"diarize": True,
"model": "nova",
"language": "en-US",
}
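
For reference, here is a minimal sketch of sending the request with these options. It assumes the v2-style Deepgram Python SDK setup from the previous post; the API key, audio file name, and mimetype below are placeholders, not values from the original code.

from deepgram import Deepgram
import json

DEEPGRAM_API_KEY = 'YOUR_API_KEY'  # placeholder; keep your real key out of source control
AUDIO_FILE = 'nasa_eva.mp3'        # hypothetical local audio file

deepgram = Deepgram(DEEPGRAM_API_KEY)

with open(AUDIO_FILE, 'rb') as audio:
    source = {'buffer': audio, 'mimetype': 'audio/mp3'}
    # synchronous pre-recorded transcription, with diarization enabled via options
    response = deepgram.transcription.sync_prerecorded(source, options)

# save the full JSON response so we can inspect and parse it later
with open('transcript.json', 'w') as f:
    json.dump(response, f, indent=4)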

When we make a call with these options, we get a much larger JSON response back. It includes everything we saw in the previous post, plus speaker-by-speaker identification and word-by-word attribution by speaker, including the AI’s confidence level in detecting the speaker and its confidence level in detecting the word spoken by that speaker.

For example, a snippet of one person saying “My name is Beth.” would produce the following JSON:

                        {
                            "word": "my",
                            "start": 4.88,
                            "end": 5.04,
                            "confidence": 0.9999336,
                            "speaker": 0,
                            "speaker_confidence": 0.41414207,
                            "punctuated_word": "My"
                        },
                        {
                            "word": "name",
                            "start": 5.04,
                            "end": 5.2,
                            "confidence": 0.9999909,
                            "speaker": 0,
                            "speaker_confidence": 0.41414207,
                            "punctuated_word": "name"
                        },
                        {
                            "word": "is",
                            "start": 5.2,
                            "end": 5.3599997,
                            "confidence": 0.9996911,
                            "speaker": 0,
                            "speaker_confidence": 0.41414207,
                            "punctuated_word": "is"
                        },
                        {
                            "word": "beth",
                            "start": 5.3599997,
                            "end": 5.6,
                            "confidence": 0.93816376,
                            "speaker": 0,
                            "speaker_confidence": 0.41414207,
                            "punctuated_word": "Beth,"
                        }

Notice the speaker key and its value starting at 0 for the first detected speaker. Now, if the other person responds “I’m pretty good. Thanks.”, that results in the following JSON:

                            "word": "i'm",
                            "start": 8.544999,
                            "end": 8.705,
                            "confidence": 0.9308567,
                            "speaker": 1,
                            "speaker_confidence": 0.26664168,
                            "punctuated_word": "I'm"
                        },
                        {
                            "word": "pretty",
                            "start": 8.705,
                            "end": 9.025,
                            "confidence": 0.9995352,
                            "speaker": 1,
                            "speaker_confidence": 0.26664168,
                            "punctuated_word": "pretty"
                        },
                        {
                            "word": "good",
                            "start": 9.025,
                            "end": 9.264999,
                            "confidence": 0.99478686,
                            "speaker": 1,
                            "speaker_confidence": 0.26664168,
                            "punctuated_word": "good."
                        },
                        {
                            "word": "thanks",
                            "start": 9.264999,
                            "end": 9.745,
                            "confidence": 0.9965024,
                            "speaker": 1,
                            "speaker_confidence": 0.26664168,
                            "punctuated_word": "Thanks."
                        }

Now we see the “speaker” key’s value has changed to 1. Depending on the audio, it can go to 2, 3, 4, and beyond. Just by looking at this, I realize it’s going to be painful to find all the unique speakers in this huge JSON, stitch together all the words, prefix them with the right speaker number, and keep them in the right order.
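
To illustrate what that word-by-word stitching would look like, here is a rough sketch that walks the word list and starts a new line whenever the speaker number changes. The path to the word list assumes Deepgram’s usual results -> channels -> alternatives -> words nesting, and data is the saved JSON response from the previous post.

# naive word-level grouping: start a new line whenever the speaker changes
words = data['results']['channels'][0]['alternatives'][0]['words']

lines = []
current_speaker = None
for w in words:
    if w['speaker'] != current_speaker:
        current_speaker = w['speaker']
        lines.append(f'[Speaker:{current_speaker}]')
    lines[-1] += ' ' + w['punctuated_word']

for line in lines:
    print(line)

It works, but it rebuilds the transcript one word at a time, which is exactly the tedium I wanted to avoid.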

I was not able to find any documentation on parsing this payload in their SDK repo, but by looking at the response (the code to save and display the full JSON content is in the previous post), we can figure out the way. Fortunately, we also get a new array under the “utterances” key, which looks like this:

   "utterances": [
        {
            "start": 0.0,
            "end": 7.7,
            "confidence": 0.9634944,
            "channel": 0,
            "transcript": "And thank you for calling Premier Phone Service. This call may be recorded for quality and training purposes. My name is Beth, and I'll be assisting you. How are you today?",
            "words": [
                {
                    "word": "and",
                    "start": 0.0,
                    "end": 0.24,
                    "confidence": 0.60129,
                    "speaker": 0,
                    "speaker_confidence": 0.6758549,
                    "punctuated_word": "And"
                },
               ...etc. (structure continues for each word, for each speaker)

Well, this is a lot better! Now we can extract multiple sentences per speaker as one transcript block at a time. We can also calculate the number of speakers in the conversation, even if the same speaker spoke many times, by collecting the unique speaker IDs in a set (set() is the smartest choice because, by definition of its data structure, it won’t allow dupes).

In the end, I want the transcript formatted like this:

[Speaker: 0] <words spoken by speaker 0 until another speaker speaks (if any)>

[Speaker: 1] <words spoken by speaker 1 until another speaker speaks (if any)>

and so on.

So, the code block to show the transcript by speaker is:

utterances = data['results']['utterances']
speakers = set() # so, no duplicate speakers will be added
for utterance in utterances:
    speaker = utterance['speaker']
    transcript = utterance['transcript']
    print(f'[Speaker:{speaker}] {transcript}')
    speakers.add(speaker) # iterate and keep adding to the set

And to find the number of speakers the AI detected, we can now simply check the ‘speakers’ set’s size:

# get count of speakers (which is length of the set which doesn’t allow dupes)
num_speakers = len(speakers)
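
For the “Count of words detected” number in the summary line you’ll see below, one option (my own sketch, not necessarily how the previous post’s code computes it) is to sum the length of each utterance’s words list:

# count every word object across all utterances
num_words = sum(len(utterance['words']) for utterance in utterances)
print(f'Count of words detected: {num_words} | Total speakers: {num_speakers}')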

The rest of the code is the same. Putting it all together, if we run this on the above audio clip, we get this diarized full transcript from my code:

*** Diarized transcript ***:

Seconds to execute: 2.627

** Overall accuracy ** : 99.134%

[Speaker:0] And
[Speaker:0] Jessica, Christina,
[Speaker:0] we are so proud of you.
[Speaker:0] You're gonna do great today. We'll be waiting for you here in a couple of hours when you get home. I'm gonna hand you over to Stephanie now.
[Speaker:0] Have a great great EVA.
[Speaker:1] Drew, thank you so much. And our pleasure working with you this morning. And I'm working on getting that EV hatch open. And I can report.
[Speaker:0] It's opened and stowed.
[Speaker:1] Thank you, Drew. Thank you so much.
[Speaker:2] On your DCMs,
[Speaker:2] Take your power switches to bat. Stagger switch throws and expect a warning tone.
[Speaker:3] Final steps before they begin the space walk. Copy. Check display switch functional.
[Speaker:3] Tracy, how important is this this
[Speaker:3] the guiding it through is Sounds like seems like a lot to remember on your own. Absolutely.
[Speaker:2] Take power e b 1, e b 2, 2 switches to off, o f f. Yeah. Christina,
[Speaker:2] Jessica have enough work with their hands and feet and their brain outside that it really helps to have someone like Stephanie. Do power both off?
[Speaker:1] DCMs,
[Speaker:2] connect your SCUs from your DCMs and stow the SCUs in the pouch.
[Speaker:2] Commentator of So not only does Stephanie
[Speaker:3] 38 AM central time, a little ahead of schedule about 12 minutes, but That gets us started on today's historic space walk.
[Speaker:3] Andrew Morgan there, wishing the crew luck? Related in pouch and DCM cover closed.
[Speaker:2] Copy. EV 2.

Count of words detected: 235 | Total speakers: 4

Boom! We have broken up each phrase by specific speaker, and we can also tell, without even listening to the audio, how many speakers are involved. As before, we get the accuracy %, word count, and performance. If we turn this block of code into a function, we can call it when transcribing for the first time, or when loading a previously saved transcript file, to get its metrics and details just as we did in the previous code example.
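
As a rough sketch of that functionized version (the function name and the idea of reading a previously saved transcript.json file are my own, not from the original code):

import json

def show_diarized_transcript(data):
    """Print a diarized transcript from a Deepgram response dict and
    return the number of distinct speakers detected."""
    utterances = data['results']['utterances']
    speakers = set()  # no duplicate speakers will be added
    for utterance in utterances:
        speaker = utterance['speaker']
        print(f"[Speaker:{speaker}] {utterance['transcript']}")
        speakers.add(speaker)
    return len(speakers)

# works on a fresh API response...
# num_speakers = show_diarized_transcript(response)

# ...or on a previously saved transcript file
with open('transcript.json') as f:
    num_speakers = show_diarized_transcript(json.load(f))
print(f'Total speakers: {num_speakers}')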

I hope this was helpful for those working with this API and JSON parsing.

