Text-to-speech isn't new, of course. If you've ever listened to a satnav, heard your phone read a notification aloud, or used an automated phone menu, you've experienced it. But if your expectations are set by those robotic, halting voices, you're in for a surprise. Modern AI voice generation produces speech that sounds convincingly human: natural pauses, emotional inflection, the kind of rhythm that makes you do a double-take when you find out it's synthetic. The technology has moved from "obviously a computer" to "wait, is that a real person?"
Speech-to-text has had a similar leap. Voice dictation used to be a frustrating exercise in correction. You'd spend as long fixing the transcription as you would have spent just typing. Current tools handle accents, background noise, filler words, and natural speech patterns with enough accuracy to be properly useful rather than just a party trick.
This Thing gives you hands-on experience with both directions. You'll generate a spoken narration from text you've written, and you'll convert your own spoken words into a written transcription. By the end, you'll have a good sense of how capable these tools are, where they fall short, and how they might fit into your own life.
How AI voice technology works
As with image generation, you don't need a deep technical understanding to use these tools well, but knowing the basics helps you understand what you're hearing and why the results vary.
The current tool landscape
You've got several options for both text-to-speech and speech-to-text. Here's what's available and worth trying.
Ethics, privacy, and voice cloning
Voice AI raises some distinctive ethical questions worth thinking about before you dive into the activity.
Voice cloning is the big one. Several platforms, including ElevenLabs on its paid tiers, allow you to clone a voice from a short audio sample. Think about what that means: if you can create a convincing replica of someone's voice from a few minutes of recording, that same technology can be used for fraud and impersonation. There have already been real cases of voice cloning being used in scam phone calls, where victims believed they were speaking to a family member in distress. ElevenLabs and other providers have introduced safeguards (consent verification, abuse detection, content policies), but the capability exists and is becoming more accessible.
Synthetic media and trust. As AI voices become indistinguishable from real ones, the question of trust becomes more pressing. How do you know a voice message is really from the person it claims to be? How should organisations handle AI-generated audio in official communications? There's no settled answer yet, but being aware of the question is part of being AI literate.
None of this should put you off experimenting. The tools are useful and worth understanding. But voice AI is one area where the gap between "impressive demo" and "serious societal question" is particularly narrow.
Resources to explore
ElevenLabs. Your primary text-to-speech tool for the activity. Free tier with 10,000 characters per month (~10 minutes of audio). Sign up with email, Google, or Apple.
Natural Reader. A simpler alternative for text-to-speech with a free tier. No account required for basic use. Good for comparing voice quality.
Otter.ai. Free speech-to-text with speaker identification and meeting features. Limited free transcription minutes per month. Web and mobile.
Google Docs voice typing. Free speech-to-text built into Google Docs (Tools > Voice typing). Requires Chrome browser and internet connection.
Phone dictation. Available on iOS and Android at no cost. Tap the microphone icon on your keyboard. Surprisingly accurate for general dictation.
OpenAI Whisper. The open-source speech recognition model behind many current transcription tools. For background reading, not required for the activity.
Activity: both sides of the voice coin
This activity has two parts. You'll start by turning text into speech, then flip the process and turn speech into text.
Part 1: Text-to-speech
- Write your script. Write a short piece of text, somewhere between 100 and 300 words, that you'd like to hear spoken aloud. Choose something personal to you, not workplace material.
- Generate your narration. Head to elevenlabs.io, sign up for a free account, paste in your script, and choose a voice from the library. Try at least two different voices and save your favourite version.
- Compare with a baseline (optional). Try the same text in your phone's built-in text-to-speech or Natural Reader. The quality difference is often striking.
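If you'd like to check your script before pasting it in, a few lines of Python will do it. This is only a rough sketch: the function name `check_script` is made up for illustration, and it assumes that each generation uses up as many characters of the free allowance as the script contains (spaces and punctuation included), which may not match exactly how the service counts usage.

```python
FREE_TIER_CHARS = 10_000  # monthly free-tier allowance quoted in the resources list


def check_script(text: str, min_words: int = 100, max_words: int = 300) -> dict:
    """Summarise a script's length against the activity's 100-300 word guideline."""
    words = len(text.split())   # whitespace-separated words
    chars = len(text)           # every character, including spaces
    return {
        "words": words,
        "characters": chars,
        "within_word_range": min_words <= words <= max_words,
        # Assumption: one generation costs len(text) characters of the allowance.
        "generations_in_budget": FREE_TIER_CHARS // chars if chars else 0,
    }


if __name__ == "__main__":
    sample = "hello world " * 75
    print(check_script(sample))
```

The last figure is the useful one when you're trying several voices: every voice you try is another full pass over the script, so a longer script leaves you fewer attempts.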
Part 2: Speech-to-text
- Record yourself speaking. Using your phone's voice recorder app, record yourself speaking for about one minute. Talk naturally about a personal topic. Don't read from a script, and don't worry about filler words or pauses.
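- Transcribe your recording. Choose a speech-to-text tool (Otter.ai, Google Docs voice typing, or phone dictation) and run your recording through it. Otter.ai can import an audio file; Google Docs voice typing and phone dictation listen through the microphone, so play your recording back near the device instead.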
- Evaluate the transcription. Read through it carefully and compare it to what you actually said. Look at overall accuracy, tricky spots, filler words, punctuation, and formatting.
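If you want a number to go with your read-through, one common measure is word error rate: the count of word substitutions, insertions, and deletions needed to turn the transcription into what you actually said, divided by the number of words you said. The sketch below is one simple way to compute it, not the method any particular tool uses. It assumes you have typed out a corrected version of your recording as the reference, and the helper names (`tokenize`, `word_error_rate`) are illustrative. It ignores capitalisation and punctuation, so it says nothing about those parts of the evaluation.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep only words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # word missing from the transcription
                curr[j - 1] + 1,     # extra word in the transcription
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    said = "I went to the shops on Saturday"
    transcribed = "I went to shops on Saturday"
    print(f"{word_error_rate(said, transcribed):.0%}")
```

A lower figure is better, and zero means every word matched. Whether you count filler words such as "um" as errors depends on what you put in your reference text, so decide that before you compare.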
Your output
A document or blog post containing:
- Your written script (the text you used for text-to-speech)
- The audio file from ElevenLabs (your favourite version, noting which voice you used)
- The transcription from Part 2, with any errors highlighted or annotated
- A written reflection: how convincing was the AI-generated speech? How accurate was the transcription? What surprised you? Can you think of practical uses for either technology? What concerns, if any, do you have?
Why this matters
The practical uses are fairly obvious: accessible content for people who prefer listening to reading, transcription that saves hours of manual typing, narration for videos and presentations, dictation that's finally accurate enough to be a real alternative to typing.
But voice is also deeply personal. We recognise people by their voices. We make judgements about trustworthiness and authority based on how someone sounds. When AI can produce speech that's indistinguishable from a real person, that changes how we relate to audio content. It's worth understanding the technology precisely because it touches something so personal.
When you combine both directions (speech-to-text and text-to-speech), you also start to see the foundations for things like conversational AI assistants, real-time translation, and accessibility tools that could change how people interact with technology. The tools you've used today are early pieces of that bigger picture.
Claim your Open Badge
Submit your written script, the ElevenLabs audio file, your transcription with error annotations, and your written reflection as evidence for your Thing 10 badge via cred.scot.