Are transcripts and captions generated by Artificial Intelligence ready for prime time?

Published on June 23, 2023

Summary

Transcripts and captions are useful for deaf and hard of hearing people, and many others. Relying on automation, or artificial intelligence, to generate them leads to inaccuracies.

The short of it is: NOPE! But let’s explore this a bit more.

Artificial Intelligence is all the rage. AI everything. AI Captions. AI Transcription. I’ve seen several articles stating that Artificial Intelligence is going to revolutionize captions and transcripts. It’s as if people are just waking up to the fact that you can do transcriptions and captions with AI. Folks! This isn’t new!

Two human-like robotic hands typing on a split keyboard.
Photo by Scott Graham on Unsplash

It isn’t new, but it was labelled differently. In general, these kinds of features have been known as “automatic”. In other words, created by a machine.

What are captions and transcripts anyway?

I know that most people are familiar with captions and transcripts. But some of us aren’t, so let’s describe them.

Transcripts

Transcripts are a text version of the spoken content in audio or video. They are often provided as text on the page where the media is included. They can be “verbatim,” including all the speaking tics such as “uhm,” “yeah,” and “so.” Or they can be cleaned up for readability.

There are many ways to provide a transcript; I won’t list them all here, as that isn’t the point of this post. You can get more information about transcripts on the W3C’s website.
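To make “cleaned up for readability” concrete, here’s a toy sketch in Python of what automated cleanup might look like. The filler-word list is made up for illustration, and the sketch also shows why a human pass still matters: a naive filter can’t tell a filler “so” from a meaningful one.

```python
import re

# Hypothetical list of filler words ("speaking tics") to strip out.
FILLERS = ["uhm", "um", "uh", "yeah", "so", "you know"]

def clean_transcript(verbatim: str) -> str:
    """Remove filler words from a verbatim transcript for readability."""
    pattern = r"\b(?:" + "|".join(re.escape(f) for f in FILLERS) + r")\b[,.]?\s*"
    cleaned = re.sub(pattern, "", verbatim, flags=re.IGNORECASE)
    # Collapse any doubled spaces left behind by the removals.
    return re.sub(r"\s{2,}", " ", cleaned).strip()

print(clean_transcript("Uhm, so, captions are, you know, really important."))
# -> "captions are, really important."
```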

Captions

Captions are a text version of the spoken content in video media. Captions are typically overlaid onto the video. They can be “burned in” to the video and always visible (known as open captions). Or they can be displayed when a viewer chooses to turn them on (known as closed captions). Captions should also include information about relevant sounds in the audio track, for example a dog barking or music playing.

Captions are different from subtitles. Subtitles are intended to provide a translation of the spoken content into other languages.

You can get more information about captions on the W3C website.
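A quick aside on mechanics: on the web, closed captions are commonly delivered as a WebVTT file that the video player loads through an HTML <track kind="captions"> element, which is what lets viewers toggle them on and off. Here’s a minimal sketch that writes such a file; the cue timings and text are made-up examples.

```python
# Minimal sketch: write closed captions as a WebVTT file.
# The cues below (timings and text) are made-up examples.

cues = [
    ("00:00:01.000", "00:00:04.000", "Welcome to the show."),
    ("00:00:04.500", "00:00:06.000", "[dog barking]"),  # relevant non-speech sound
    ("00:00:06.500", "00:00:09.000", "Let's talk about captions."),
]

with open("captions.vtt", "w", encoding="utf-8") as f:
    f.write("WEBVTT\n\n")  # required header, followed by a blank line
    for start, end, text in cues:
        f.write(f"{start} --> {end}\n{text}\n\n")  # one cue: timing line, then text
```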

Why are captions and transcripts important?

They are used primarily by Deaf, deaf, and hard of hearing folks to access content. We block access to our content if we don’t provide captions and transcripts. In other words, when we don’t provide transcripts and/or captions, we choose to exclude Deaf, deaf, and hard of hearing folks. And if we look at the US alone, that was nearly 38 million people as of 2021.

Never mind the fact that captions and transcripts are also critical for many neurodiverse folks.

And then, if we look at the “accessibility is good for everyone” aspect of this, we can also list benefits for many other groups:

  • People who aren’t native speakers of the primary language in your audio content.
  • People in distracting environments, e.g. a bus, a park, or another public space.
  • People needing to keep quiet while a child is sleeping or someone in the household is sick.
  • You, the content creator, wanting to access and reference your own content later down the road.
  • The list goes on.

A bit of history

Google/YouTube was the first to introduce automatic captions, in 2009. And the quality of those captions was not great. In fact, it led the D/deaf community to refer to them as “craptions”: captions so bad they are crap.

There have been a lot of organizations leveraging machines/automation for transcripts over the years. Some better than others (e.g. Otter.ai). Some really bad.

These days you can find automated captions on YouTube, in Zoom calls, in MS Teams calls, and in many more places. You can use automated transcription through services such as Otter.ai and REV, or even through Adobe Premiere.

But the quality of the captions and transcripts these products generate is… often worthy of the “craption” label.

Inaccurate captions

There are so many bad captions or bad transcripts out there!

One of my favorite AI transcription mistakes is that when I say my name, “Nic Steenhout,” it writes “Nixine out”. It’s also a bit ironic that one of the words most often mistranscribed or miscaptioned is the word “deaf”. AI tends to write it as “death”. There are so many examples of bad automated captions and transcripts that it would take a whole other post to do a roundup of what’s out there.

And that’s if the AI caption engine doesn’t just get confused and skip entire passages of what’s being said.

A post by Consumer Reports points out that in their tests, there were between 5 and 12 errors per 100 words of automated captions. That’s an error rate of up to 12%.

A fun game from 3PlayMedia - guess the correct caption.

Another fun game: Next time you’re in a video call, get all the participants to skip headphones and headsets and just use their device’s speakers and microphone. Have everyone turn on automated captions. And run your meeting that way.

Some factors that make for inaccurate automated captions

  • Technical words, e.g. tech, medical, engineering, or web terminology.
  • Background noise, e.g. AC fans, construction noise, a bad microphone, etc.
  • “Non-standard” accents. And by that I mean any accent that isn’t US Midwest.
  • More than one person speaking at the same time.

Why are accurate captions and transcripts important?

Simple: for every 1% loss in accuracy, there’s up to 20% loss in understanding.

You’re making content. Arguably, you wish for people to access, and understand, your content. If you don’t provide captions or transcripts, you are blocking out a significant proportion of your potential audience. If you don’t provide accurate captions or transcripts, you aren’t sure they get what you’re saying.

Let’s spell it out. Looking back at Consumer Reports’ numbers, we have an error rate between 5% and 12%. For every 1% error, there’s up to 20% loss in understanding. That means there’s between 100% and 240% loss of understanding. In other words: your audience isn’t likely to understand any of your content if you rely on AI captions or transcripts.

AI Captions - Not ready for prime time

Bottom line: AI captions and transcripts aren’t ready for prime time. They aren’t good enough to rely on to provide usable content.

Sure, they are improving. But it’ll be a while before they are good enough to be used as a standalone solution.

AI Transcripts - a foundation to speed up work

While AI Captions or transcripts aren’t ready to be used on their own if you intend your message to be available to large parts of your audience, they still have their place.

I use automatic transcription for a first pass on the transcript work for my podcast, the A11y Rules Soundbite. Once I have that machine-generated transcript, it’s relatively quick and painless to go through and manually edit the episode’s transcript. This gives me a quick, affordable, and accurate transcript, which can then be used as a base for captions.
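I won’t prescribe a tool, but to make this “machine first pass, human edit” workflow concrete, here’s a sketch using the open-source Whisper speech-to-text model (one option among many, not necessarily what I use for every episode; the file names are made-up examples):

```python
# Sketch of the "AI first pass, then human edit" workflow using the
# open-source Whisper speech-to-text model (pip install openai-whisper).
# "episode.mp3" is a made-up example file name.
import whisper

model = whisper.load_model("base")        # small and fast; larger models are more accurate
result = model.transcribe("episode.mp3")  # returns a dict with the full text and timed segments

# Save the machine-generated draft. A human still edits this before publishing.
with open("episode-draft-transcript.txt", "w", encoding="utf-8") as f:
    f.write(result["text"])
```

The important part isn’t the tool: it’s that the machine output is treated as a draft to edit, never published as-is.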

But what about real time events, Nic?

Sure, the method I describe above is fine if we have on-demand audio or video (e.g. prerecorded stuff people can watch or listen to when they want). But it doesn’t work for real-time events. If you’re in a meeting or at a conference, there’s no time to get an AI transcription and then manually edit it for the audience.

The only way to go is to rely on an experienced human transcriptionist. This is what’s known as CART, or Communication Access Realtime Translation.

Now what?

By all means, use Artificial Intelligence tools to generate captions and transcripts, but make sure to manually edit the output. And if you have real-time events, don’t rely on machines to do complex work they aren’t capable of doing correctly yet.