I had the pleasure of being on Circulating Ideas with Steve Thomas. We talked about a bunch of things including open textbooks, accessibility, alternate formats, and being a systems librarian. He’s a great host and an interesting person to chat with. The interview went up last week.
Without a transcript, a podcast isn’t accessible to Deaf and some Hard of Hearing people. It felt strange to be talking about accessibility and universal design in an audio-only format, so I decided to produce a transcript.
I heard the folks from Pop Up Archive present at code4lib in Portland. Pop Up Archive makes sound searchable using speech-to-text technology. Their clients are mostly public radio broadcasters who are looking to make their sound archives searchable. I remember thinking at code4lib that this could be an interesting tool to help make politics more accessible and transparent. For example, transcripts could be made available fairly quickly after a municipal committee (or provincial or federal committee) met. The transcript is almost the byproduct of this process.
I was curious how it could be used to produce a transcript. I was also curious about how accurate the machine transcript would be, as well as how long it would take me to clean up. First, you upload the sound file. Next, you can add metadata about the file you uploaded. Then Pop Up Archive processes your sound file. Processing takes about as long as the file runs, in my case 39 minutes. The machine transcript was about 80% accurate. Finally, you can edit the machine transcript on their platform. It took me about 2 hours to clean up a 39-minute interview.
I like the interface. It was intuitive, and once I’d learned the keyboard shortcuts I was able to clean up the file more quickly. On my work monitor I couldn’t see the highlighting of the line that was being played, but it’s much clearer on my laptop. I would’ve appreciated a global find and replace feature. It’s possible to export in various file formats: audio file, text without timestamps, text with timestamps, SRT format (captions), XML format (W3C transcript), or JSON. I grabbed the text with timestamps and then plopped it into Word to use spellcheck to catch misspelt words. Steve spent another hour editing it to make it easier to read (I say “like” and “so…” quite a bit) and formatting it so it’s clear who was saying what. He also added links, which took another 30 minutes.
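For anyone who hasn’t seen the SRT caption format mentioned above, here is a minimal Python sketch of how timestamped transcript segments map onto it. This is not Pop Up Archive’s exporter. The function names and the sample segments are made up for illustration, and the only thing the sketch relies on is the standard SRT layout: a running number, a start and end time written as HH:MM:SS,mmm, the caption text, and a blank line.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Turn (start_seconds, end_seconds, text) tuples into SRT caption text."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # Joining with a newline leaves one blank line between caption blocks.
    return "\n".join(blocks)


# Hypothetical segments for illustration; not taken from the real transcript.
segments = [
    (0.0, 3.5, "Welcome to the show."),
    (3.5, 7.25, "Thanks for having me."),
]

print(to_srt(segments))
```

A caption file like this can be loaded alongside a video or audio player that supports SRT, which is why that export option is handy if the interview is ever republished as video.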
I’m sure there’s a more efficient workflow, but I was really impressed with the machine transcript that Pop Up Archive generated. According to this company, it takes a professional transcriber 1 hour to transcribe 15 minutes of clearly recorded audio and then additional time to proofread.
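To put those numbers side by side, here is a rough back-of-the-envelope calculation in Python. It uses only the figures from this post: a 39-minute interview, about 2 hours of cleanup on my end, and the quoted rate of 1 hour of professional transcription per 15 minutes of audio. The variable and function names are my own, and the proofreading time for the professional route is left out because no figure was given for it.

```python
# All times are in minutes and come from the figures quoted in this post.
AUDIO_MINUTES = 39
CLEANUP_MINUTES = 120  # my cleanup of the machine transcript


def professional_minutes(audio_minutes, work_per_15_min=60):
    """Estimate manual transcription time at 1 hour of work per 15 minutes of audio."""
    return audio_minutes / 15 * work_per_15_min


professional = professional_minutes(AUDIO_MINUTES)

print(f"Cleanup of machine transcript: {CLEANUP_MINUTES} min")
print(f"Professional transcription, before proofreading: {professional:.0f} min")
print(f"Difference: {professional - CLEANUP_MINUTES:.0f} min")
```

By this estimate a professional would need about 156 minutes before proofreading, compared with my 120 minutes of cleanup. It’s a rough comparison: the machine processing time was unattended, and Steve’s readability edit and links are extra polish that would be needed either way.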
With improvements in speech-to-text technology and machine transcripts, I think tools like this can make it easier for podcasters to produce transcripts. I can also see this being used (along with human editors) as a faster way to produce transcripts for audio and video as part of a disability accommodation in education.
Here’s the final transcript.