There’s a lot of talk about automatic transcription, or AI (artificial intelligence) transcription, also known as voice-to-text software. This means that a transcription of your voice or recording is made automatically rather than with human input. What’s the state of the art at the moment? I’ve recently had two relevant experiences: working with a client who uses voice-to-text software to generate text that I edit, and receiving an automatic closed-captioning file from a video meeting platform. So I thought it would be a good opportunity to share my experiences with both. This article looks at speech-to-text software and the next one will examine closed-captioning.
Using voice-to-text software to create text documents
A couple of my regular editing clients use voice-to-text software to create documents, which they then send to me to edit. I have also worked with a number of students who, because they live with a visual impairment or a physical issue (for example, RSI that makes typing painful and difficult), have used this method to generate sometimes very long documents.
What common features of voice-to-text-generated documents can an editor look out for?
Here’s what I’ve noticed about documents produced when the client is using voice-to-text software:
- The outcome is a lot more accurate when using more sophisticated voice-to-text software that can “learn” the speaker’s voice, rather than out-of-the-box, one-size-fits-all software.
- The outcome is also a lot more accurate when the speaker uses “standard” (Received Pronunciation), slowly and clearly spoken English (in this case; I’m guessing it’s the same with other languages but would love to know for sure). The software can struggle with accents and fast speakers.
- The most common issues with voice-to-text software are:
- Homophones – the software doesn’t know which spelling the speaker wants to use out of two alternatives that sound the same – bear/bare, which/witch, etc. This is really common and can lead to some very odd sentences and potentially embarrassing issues. Note that these can’t be spotted by having the software read the text back to the speaker, as the words sound the same.
- Added words – the software registers two separate words when there’s only one: “repeated distractions” becomes “repeat and distractions”.
- Missed words and parts of words – if the speaker speaks quickly and skips over short words or swallows the middle of words, they might not register in the software: “paddle boards” becomes “pad boards”; “fruit and nut” becomes “fruited nut” or “fruit nut”.
- Missed punctuation – this usually has to be spoken in a set formula by the speaker. If they don’t do that, the punctuation won’t be there.
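For the homophone problem in particular, a technically minded editor or client could run a quick automated check before (or alongside) the human edit. Here’s a minimal sketch in Python: it flags words from a known-homophone list so each occurrence can be verified by hand. The word list here is a deliberately tiny illustrative sample of my own, not a complete homophone dictionary, and any real check would need a much fuller list.

```python
import re

# A small illustrative sample of homophone pairs; a real check would
# need a much larger dictionary.
HOMOPHONES = {
    "bear": "bare", "bare": "bear",
    "which": "witch", "witch": "which",
    "there": "their / they're", "their": "there / they're",
}

def flag_homophones(text):
    """Return (word, position) pairs for words an editor should verify by hand."""
    flagged = []
    for match in re.finditer(r"[A-Za-z']+", text):
        word = match.group().lower()
        if word in HOMOPHONES:
            flagged.append((word, match.start()))
    return flagged

sample = "The bear facts are that the witch hunt continues."
for word, pos in flag_homophones(sample):
    print(f"Check '{word}' at position {pos} (could be: {HOMOPHONES[word]})")
```

A script like this can only point at *candidates* – it has no idea which spelling the speaker meant, so it replaces nothing and simply gives the editor a checklist of places to look.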
These issues are quite different from the usual ones met in editing people’s texts, whether English is their first or an additional language. Just as particular Language 1s will bleed through into writers’ other languages (as an L1 English speaker, I am likely to put French and Spanish sentences into an incorrect English word order, for example), dictated English has its own little oddities and patterns that you need to look out for.
How can the speaker and editor combat issues with speech-to-text documents?
There are a few things the speaker and then the editor can do to mitigate these issues.
- The speaker could speak slowly and clearly, enunciating all the words and their endings and putting the punctuation in as required.
- If there is an option to “teach” the software the speaker’s voice, I recommend doing that for optimum results.
- Always have someone check a speech-to-text-generated text.
- The speaker/client could let the editor know that they’ve used such software, so the editor can be on high alert for the features listed above (remember that a spellchecker won’t necessarily notice correctly spelled homophones).
- The editor could watch for oddly worded sentences as well as the grammar, spelling and punctuation issues they usually look out for.
In this article, I have discussed voice-to-text software that is sometimes used to generate documents, what the client/speaker can do to make sure the text they generate is as accurate as possible and what the editor of such documents can look out for.
If you have experience of using speech-to-text software, particularly in languages other than English, please comment with anything else you’ve noticed that it would be useful for people to know.
A friend talks about this issue with regard to an interview she conducted – read about her experience in this guest post.
Next time, I’ll talk about my experience of automatic closed-captioning on a video meeting platform.
Other relevant articles on this website
Automatic transcription – some real-world case studies 2: automatic closed-captioning (coming soon!)