There has been a lot of talk lately about automatic transcription, or AI (artificial intelligence) transcription. This includes speech-to-text software, and it means that a transcription of your voice or recording is made automatically rather than by a human. I’ve recently worked with clients who use voice-to-text software and have also received an automatic closed-captioning file from a video meeting platform, so I’m taking the opportunity to share my experiences with both. Last time, I covered features to look out for with speech-to-text software; this time I’m talking about automatic closed-captioning.
Using closed-captioning to create live subtitles and texts
A client for whom I’m transcribing focus groups (that is, discussion groups of several people with one facilitator) had one group that included a participant living with a hearing impairment. They turned on the closed-captioning feature in the video meeting platform they were using, so that the participant could read what the other participants were saying. As it recorded everyone’s speech in real time and then generated a text afterwards, my client sent it to me to see if I’d find it useful.
As I’ve been thinking about offering an automatic transcription editing service alongside my full transcription service, I was really interested in seeing how this worked.
What does real-time closed-captioning or automatic transcription look like?
In my opinion, automatic real-time closed-captioning is not there yet in terms of generating a good, usable transcription. Here are the shortcomings I noticed in the transcript (you’ll notice some of these if you turn on the subtitles on the news, etc. – which are very rarely produced by humans these days).
- Time stamps were added every few seconds, which is great for some clients, but my focus group transcription clients usually only want them every ten minutes.
- There was no differentiation of speakers, although new utterances were usually started on a new line (this could be a new utterance by a new speaker or a new utterance by the same speaker).
- If two people spoke at once the speech was jumbled.
- Even captioning of the slowest, clearest and most “accentless” (Received Pronunciation) speaker was full of errors including homophones, missed words and repeated words.
- If someone had an accent (regional or English as an additional language), it pretty well failed to cope at all.
- If someone spoke quickly, it pretty well failed to cope at all.
- Ums and ers were not recorded, which is understandable in terms of a participant needing to know what the others were saying, but is not useful when your client has requested a full verbatim transcription (see my article on the types of transcription here).
In summary, the transcription produced for this session by the closed-captioning software would not have been of any use to the researcher without extensive editing.
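One of the issues above, the over-frequent time stamps, is mechanical enough to tidy up with a short script, unlike the misheard words, which still need a human. The following is only a minimal sketch in Python. It assumes the caption file has already been read into a list of (start time in seconds, text) pairs; the function names are hypothetical, and the actual export format of any particular meeting platform is not covered here.

```python
def format_timestamp(seconds):
    """Render a number of seconds as HH:MM:SS."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def thin_timestamps(cues, interval_seconds=600):
    """Keep the caption text but emit a time stamp only once per interval.

    cues: iterable of (start_seconds, text) pairs in chronological order
          (an assumed, already-parsed form of the caption file).
    interval_seconds: gap between time stamps; 600 is ten minutes.
    """
    lines = []
    next_mark = 0  # the first caption always gets a time stamp
    for start, text in cues:
        if start >= next_mark:
            # Stamp the first caption that falls in each new interval,
            # using that caption's own start time.
            lines.append(f"[{format_timestamp(start)}]")
            next_mark = (int(start) // interval_seconds + 1) * interval_seconds
        lines.append(text)
    return lines


if __name__ == "__main__":
    sample = [
        (0, "Hello everyone."),
        (4, "Thanks for coming."),
        (601, "Next question."),
        (1250, "Final thoughts."),
    ]
    print("\n".join(thin_timestamps(sample)))
```

With the sample captions, a time stamp appears before the first line and then only before the first caption in each later ten-minute window. Changing interval_seconds would suit clients who want a different spacing. This only addresses the time stamps; speaker labels, overlapping speech and misrecognised words would still have to be corrected by hand.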
I have also had a look at the automatic transcription on various video-playing platforms such as YouTube, and the same issues have appeared there, too.
Is it quicker to edit an automatically generated transcription than to transcribe it from scratch?
With this particular client, while the participants varied over the groups, I had transcribed a fair few groups and had an idea of how many audio minutes I was transcribing per hour. It’s also worth noting that I’m experienced in editing other people’s transcriptions, as I used to be the go-to transcriber for tricky sessions at a big worldwide conference.
Bearing those points in mind, using the closed-caption transcript and editing it to the same standard as one I had done from scratch took exactly the same time as transcribing it from scratch would have taken! There was less actual straight typing, but more mouse work and clicking, so I don’t think it saved me much risk of RSI, either.
I will keep looking at this issue over the next few years, as automatic closed-captioning and the transcripts it produces are bound to improve as voice recognition technology advances.
In this article, I have discussed the use of automatic closed-captioning and whether it can be used to generate transcripts that replace, or serve as a basis for, human transcription.
If you have experience of using automatic closed-captioning, particularly in languages other than English, please comment with anything else you’ve noticed that it would be useful for people to know.
Other relevant articles on this website
Automatic transcription – some real-world case studies 1: voice-to-text software
Why you need to be human to produce a good transcription
Using a transcription app rather than a human transcriber – pros and cons
What are the types of transcription?
What information does my transcriber need?
How to be a good transcription customer
How long does transcription take?
Recording and sending audio files for researchers and journalists
How to get into transcribing as a job
The technology transcribers use