Thing 10: AI voice and speech

Text-to-speech isn't new, of course. If you've ever listened to a satnav, heard your phone read a notification aloud, or used an automated phone menu, you've experienced it. But if your expectations are set by those robotic, halting voices, you're in for a surprise. Modern AI voice generation produces speech that sounds convincingly human: natural pauses, emotional inflection, the kind of rhythm that makes you do a double-take when you find out it's synthetic. The technology has moved from "obviously a computer" to "wait, is that a real person?"

Speech-to-text has had a similar leap. Voice dictation used to be a frustrating exercise in correction. You'd spend as long fixing the transcription as you would have spent just typing. Current tools handle accents, background noise, filler words, and natural speech patterns with enough accuracy to be properly useful rather than just a party trick.

This Thing gives you hands-on experience with both directions. You'll generate a spoken narration from text you've written, and you'll convert your own spoken words into a written transcription. By the end, you'll have a good sense of how capable these tools are, where they fall short, and how they might fit into your own life.

How AI voice technology works

Illustration: abstract sound waves and digital processing, representing AI voice and speech technology. Caption: AI has transformed both text-to-speech and speech-to-text, producing results that are often hard to distinguish from human performance.

As with image generation, you don't need a deep technical understanding to use these tools well, but knowing the basics helps you understand what you're hearing and why the results vary.

Text-to-speech

Traditional text-to-speech systems worked by stitching together pre-recorded fragments of human speech (individual sounds, syllables, or words) and playing them back in sequence. This is why they sounded choppy and unnatural. The transitions between fragments were always slightly off, and the system had no understanding of meaning or context, so emphasis and intonation followed fixed rules and often landed in the wrong place.

Modern AI text-to-speech works differently. Models like those used by ElevenLabs are trained on vast amounts of human speech data. They learn not just how individual words sound but how people actually talk: the rhythm of a sentence, where emphasis naturally falls, how tone shifts with punctuation, how a question sounds different from a statement. When you give these models a piece of text, they generate speech from scratch rather than stitching together fragments. The result is fluid, natural-sounding audio that captures many of the subtle qualities of human speech.

The quality gap between the best AI voices and obviously synthetic speech is now enormous. The best models can convey warmth, urgency, and hesitation in ways that are often hard to distinguish from a human recording, at least for short passages. Longer passages sometimes reveal patterns: a slight uniformity in pacing, a tendency to handle unusual words or abbreviations oddly, or an emotional flatness that creeps in when there's no obvious tone to match.

Speech-to-text

Speech-to-text (also called transcription or speech recognition) faces a different challenge: making sense of the messy, imperfect way humans actually talk. We mumble, we speak over each other, we change direction mid-sentence, we use filler words, and we rely heavily on context that isn't in the words themselves.

AI transcription models, the most influential being OpenAI's Whisper, are trained on hundreds of thousands of hours of transcribed audio. They learn to handle accents, background noise, specialist vocabulary, and the general chaos of spoken language. Modern systems get more than 95% of words right for clear speech in quiet environments (fewer than one error in every twenty words), and they're getting better in noisier conditions too.

Current transcription tools still struggle with heavy accents, overlapping speakers, very technical terminology, and genuinely poor audio quality. They also make mistakes that a human transcriber wouldn't, such as substituting a similar-sounding word that makes no sense in context or mishearing proper nouns. Look out for these errors in your activity.

The current tool landscape

You've got several options for both text-to-speech and speech-to-text. Here's what's available and worth trying.

Text-to-speech tools

ElevenLabs (elevenlabs.io) is the standout tool in this category. It offers the most natural-sounding AI voices currently available to consumers, with a library of thousands of voices in over 70 languages. The free tier gives you 10,000 characters per month, roughly equivalent to ten minutes of generated audio. That's more than enough for this activity, though you'll want to be thoughtful about how you use your allowance rather than generating dozens of test clips. You can sign up with email, Google, or Apple. The free tier doesn't include voice cloning or commercial usage rights, but it gives you full access to the voice library and the core text-to-speech engine.
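
If you're comfortable with a little Python, the allowance arithmetic above can be sketched in a few lines. This is a rough estimate only: it uses the 10,000 characters and ten minutes quoted above, and it assumes that every character (spaces and punctuation included) counts against the allowance and that each take is charged in full. The function name estimate_usage is made up for this illustration and has nothing to do with ElevenLabs' own software.

```python
MONTHLY_CHARACTERS = 10_000  # free-tier allowance quoted in this Thing
MONTHLY_MINUTES = 10         # rough audio equivalent quoted in this Thing


def estimate_usage(script: str, takes: int = 1) -> dict:
    """Estimate how much of the monthly allowance a script will use.

    Assumption: every character counts, and every take is charged in full.
    """
    characters = len(script) * takes
    share = characters / MONTHLY_CHARACTERS
    return {
        "characters": characters,
        "share_of_allowance": share,
        "approx_minutes": share * MONTHLY_MINUTES,
    }


if __name__ == "__main__":
    # A stand-in script; paste your own text here instead.
    script = "Hello, and welcome to my allotment diary. " * 20
    usage = estimate_usage(script, takes=2)  # two takes, e.g. two voices
    print(
        f"{usage['characters']} characters, "
        f"{usage['share_of_allowance']:.0%} of the monthly allowance, "
        f"about {usage['approx_minutes']:.1f} minutes of audio"
    )
```

Running it on your own script before you generate anything gives you a feel for how many takes you can afford in a month.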

Natural Reader (naturalreaders.com) offers a straightforward free tier with a selection of AI voices. It's less sophisticated than ElevenLabs (the voices are good but noticeably more "AI" in character) but it's simple to use and doesn't require an account for basic generation. It's a solid backup if you want to compare voices across platforms.

Your phone's built-in text-to-speech is worth trying as a baseline comparison. Both iOS (Settings > Accessibility > Spoken Content) and Android (Settings > Accessibility > Text-to-Speech; the exact menu name varies by device) have built-in speech engines that have improved significantly in recent years. Comparing these with a dedicated AI voice tool is a quick way to appreciate how far the technology has come.

ChatGPT's Read Aloud feature allows you to have any ChatGPT response spoken aloud using OpenAI's voice models. It's not a standalone text-to-speech tool, but the voice quality is high and it gives you another point of comparison if you're already using ChatGPT.

Speech-to-text tools

Otter.ai (otter.ai) is a popular transcription tool with a free tier that includes a limited number of transcription minutes per month. It's designed primarily for meeting transcription (it can identify different speakers, highlight key points, and generate summaries) but it works perfectly well for transcribing a simple voice recording. Available on web and mobile.

Your phone's built-in dictation is likely the most accessible option you already have. On iPhone, tap the microphone icon on any keyboard. On Android, use Google Voice Typing (the microphone on the Gboard keyboard). Both have become surprisingly accurate for general dictation, and they work offline on most recent devices. The advantage is zero setup; the limitation is that they're designed for dictation (speaking in a structured way) rather than transcribing natural, conversational speech.

Google Docs voice typing (Tools > Voice typing in any Google Doc) provides free speech-to-text directly in your browser. It requires a Chrome browser and an internet connection but handles extended dictation well.

Microsoft Word dictation offers similar functionality built into Word Online and the desktop app (Home > Dictate). If you already use Microsoft 365, this is worth trying.

Ethics, privacy, and voice cloning

Voice AI raises some distinctive ethical questions worth thinking about before you dive into the activity.

Voice cloning is the big one. Several platforms, including ElevenLabs on its paid tiers, allow you to clone a voice from a short audio sample. Think about what that means: if you can create a convincing replica of someone's voice from a few minutes of recording, that same technology can be used for fraud and impersonation. There have already been real cases of voice cloning being used in scam phone calls, where victims believed they were speaking to a family member in distress. ElevenLabs and other providers have introduced safeguards (consent verification, abuse detection, content policies) but the capability exists and is becoming more accessible.

Privacy and your recordings. When you use a cloud-based transcription tool like Otter.ai, your audio is uploaded to their servers for processing. For this activity, you'll be recording yourself talking about personal interests, not confidential work material, so the privacy risk is low. But it's worth knowing that anything you record and upload to a cloud service is leaving your device. For sensitive content in a professional context, this matters. We'll look at local AI options later in the programme (Thing 22) that keep everything on your own machine.

Synthetic media and trust. As AI voices become indistinguishable from real ones, the question of trust becomes more pressing. How do you know a voice message is really from the person it claims to be? How should organisations handle AI-generated audio in official communications? There's no settled answer yet, but being aware of the question is part of being AI literate.

None of this should put you off experimenting. The tools are useful and worth understanding. But voice AI is one area where the gap between "impressive demo" and "serious societal question" is particularly narrow.

Resources to explore

ElevenLabs

Your primary text-to-speech tool for the activity. Free tier with 10,000 characters per month (~10 minutes of audio). Sign up with email, Google, or Apple.

Natural Reader

A simpler alternative for text-to-speech with a free tier. No account required for basic use. Good for comparing voice quality.

Otter.ai

Free speech-to-text with speaker identification and meeting features. Limited free transcription minutes per month. Web and mobile.

Google Docs voice typing

Free speech-to-text built into Google Docs (Tools > Voice typing). Requires Chrome browser and internet connection.

Built-in phone dictation

Available on iOS and Android at no cost. Tap the microphone icon on your keyboard. Surprisingly accurate for general dictation.

OpenAI Whisper

The open-source speech recognition model behind many current transcription tools. For background reading, not required for the activity.

Activity: both sides of the voice coin

Time: 30–45 minutes. Tools: ElevenLabs + a speech-to-text tool.

This activity has two parts. You'll start by turning text into speech, then flip the process and turn speech into text.

Part 1: Text-to-speech

Write your script. Write a short piece of text, somewhere between 100 and 300 words, that you'd like to hear spoken aloud. Choose something personal to you, not workplace material.
Generate your narration. Head to elevenlabs.io, sign up for a free account, paste in your script, and choose a voice from the library. Try at least two different voices and save your favourite version.
Compare with a baseline (optional). Try the same text in your phone's built-in text-to-speech or Natural Reader. The quality difference is often striking.

Ideas for your script

An "about me" paragraph for a personal blog, dating profile, or social media bio
A short introduction you might give at a community group or hobby club
A welcome message for a personal project or podcast idea you've been thinking about
A reading of a favourite quote or passage from a book you love (keep it short)
A voicemail greeting that's more interesting than "leave a message after the tone"

The content matters less than having something you care about hearing spoken well. You'll be better placed to judge the quality of the AI voice if you know exactly how the words should sound.

What to listen for

How natural does the pacing feel? Does it sound like someone reading, or like someone speaking?
Does the emphasis fall in the right places? Are there words or phrases that sound oddly stressed or flat?
How does it handle punctuation? Do commas create natural pauses? Do question marks lift the intonation?
Would you believe this was a human recording if you didn't know otherwise?

Part 2: Speech-to-text

Record yourself speaking. Using your phone's voice recorder app, record yourself speaking for about one minute. Talk naturally about a personal topic. Don't read from a script, and don't worry about filler words or pauses.
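Transcribe your recording. Choose a speech-to-text tool and run your recording through it. Otter.ai can import an audio file. Google Docs voice typing and phone dictation listen live through the microphone, so either play your recording back next to the microphone or say your piece again.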
Evaluate the transcription. Read through it carefully and compare it to what you actually said. Look at overall accuracy, tricky spots, filler words, punctuation, and formatting.

Topic ideas for your recording

A hobby or interest you're passionate about
A film, book, or TV series you've enjoyed recently
A holiday or day trip you'd recommend
Your thoughts on how this 23 Things AI programme is going so far

Speak at your normal pace and don't worry about the occasional "um." The point is to give the transcription tool real, natural speech to work with.

What to look for in your transcription

Overall accuracy: what percentage of words did it get right? A rough estimate is fine.
Tricky spots: did it struggle with any specific words? Proper nouns, technical terms, slang, or dialect words are common trouble areas.
Filler words: did it capture your "ums" and "ahs," or silently remove them?
Punctuation and formatting: how well did it guess where sentences begin and end? Is the transcription readable as text?
Speaker identification: if you used Otter.ai, did it correctly identify you as a single speaker?
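
If you'd like a number instead of a rough estimate, the usual measure is word error rate: the number of word-level substitutions, omissions, and additions needed to turn the transcript into what you actually said, divided by the number of words you said. The Python sketch below is one minimal way to work it out. It assumes you have typed up a corrected version of what you said to compare against, and it ignores capitalisation and punctuation. The function names are made up for this illustration.

```python
import re


def words(text: str) -> list[str]:
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(spoken: str, transcribed: str) -> float:
    """Word-level edit distance divided by the number of words spoken."""
    ref, hyp = words(spoken), words(transcribed)
    if not ref:
        raise ValueError("the spoken text contains no words")
    # Dynamic-programming edit distance, one row at a time, over words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # word missing from the transcript
                current[j - 1] + 1,      # extra word in the transcript
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    said = "I went hill walking in the Cairngorms last weekend"
    got = "I went hill walking in the care in gorms last weekend"
    wer = word_error_rate(said, got)
    print(f"Word error rate: {wer:.0%}, rough accuracy: {1 - wer:.0%}")
```

One misheard proper noun can cost several words at once, which is why a transcript can look nearly perfect and still score lower than you'd expect.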

Privacy reminder: use personal examples or fictional scenarios, never actual work materials or confidential documents.

Going further

If you have time and want to explore more, try one of these extensions:

The accent test. If you have a regional accent, try dictating the same passage in your natural voice and then in something closer to Received Pronunciation. Does the tool's accuracy change? Any difference hints at which accents were best represented in the training data.
The language test. If you speak another language, try generating speech and transcribing in that language. ElevenLabs supports over 70 languages, and comparing quality across languages is revealing.
The emotion experiment. In ElevenLabs, try adding emotional direction to your text using the voice settings. Adjust the stability and similarity sliders, or try different voices with different emotional qualities for the same text. How much can you change the feel of a narration without changing a single word?

Your output

A document or blog post containing:

Your written script (the text you used for text-to-speech)
The audio file from ElevenLabs (your favourite version, noting which voice you used)
The transcription from Part 2, with any errors highlighted or annotated
A written reflection: how convincing was the AI-generated speech? How accurate was the transcription? What surprised you? Can you think of practical uses for either technology? What concerns, if any, do you have?
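
Highlighting transcription errors by hand works perfectly well. If you'd rather have a first pass done for you, the sketch below uses difflib from Python's standard library to mark where the transcript differs from a corrected version of what you said. The bracket notation and the function name annotate are choices made for this illustration. The comparison is exact, so differences in capitalisation or punctuation will also be flagged.

```python
import difflib


def annotate(spoken: str, transcribed: str) -> str:
    """Return the transcript with differences from the spoken words marked.

    [-word-] marks words that were said but are missing from the transcript.
    {+word+} marks words in the transcript that were not said.
    """
    ref, hyp = spoken.split(), transcribed.split()
    matcher = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(hyp[j1:j2])
            continue
        if tag in ("delete", "replace"):
            out.append("[-" + " ".join(ref[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            out.append("{+" + " ".join(hyp[j1:j2]) + "+}")
    return " ".join(out)


if __name__ == "__main__":
    print(annotate("we visited Oban in May", "we visited open in May"))
```

Treat the output as a starting point and read it through yourself, since an automatic comparison can't tell you which errors actually changed the meaning.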

Why this matters

The practical uses are fairly obvious: accessible content for people who prefer listening to reading, transcription that saves hours of manual typing, narration for videos and presentations, dictation that's finally accurate enough to be a real alternative to typing.

But voice is also deeply personal. We recognise people by their voices. We make judgements about trustworthiness and authority based on how someone sounds. When AI can produce speech that's indistinguishable from a real person's, that changes how we relate to audio content. It's worth understanding the technology precisely because it touches something so personal.

When you combine both directions (speech-to-text and text-to-speech), you also start to see the foundations for things like conversational AI assistants, real-time translation, and accessibility tools that could change how people interact with technology. The tools you've used today are early pieces of that bigger picture.

Claim your Open Badge

Submit your written script, the ElevenLabs audio file, your transcription with error annotations, and your written reflection as evidence for your Thing 10 badge via cred.scot.
