Thing 11

AI audio editing

Last reviewed: March 2026 · 30–45 minutes

In Thing 10, you generated speech from text and converted speech back into text. In Thing 11, you're going to try something that might change how you think about recording: editing audio by changing the words on screen.

If you've ever tried to edit audio or video using traditional software (Audacity, GarageBand, Adobe Premiere, or anything similar), you'll know it involves staring at a waveform. That wobbly line represents your audio, and to cut a section, you have to find the right spot by eye, play it back, scrub through the timeline, and hope you don't accidentally chop the beginning off a word. It works, but it's fiddly, time-consuming, and feels completely disconnected from what was actually said.

AI audio editing flips this. The tool transcribes your audio, shows you the text, and lets you edit the audio by editing the text. Want to remove that "umm" in the middle of your sentence? Delete it from the transcript. Want to cut an entire paragraph you rambled through? Select the text and press delete. The audio follows the text. It sounds too simple to be real until you try it, at which point you wonder why audio editing ever worked any other way.

Text-based editing is only part of the picture, though. AI can also clean up your audio quality after the fact: removing background noise, reducing echo, and making a recording from your kitchen sound like it was made in a proper studio. These enhancement tools have changed what counts as "good enough" recording conditions. You no longer need a quiet room and an expensive microphone to produce audio that sounds professional.
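To get an intuition for the enhancement side, here is a deliberately simplified noise gate in Python: any sample quieter than a threshold is treated as background hiss and silenced. Real tools like Adobe's Enhance Speech use learned models rather than a fixed threshold, so treat this as a toy illustration of the idea, not a description of how those products work.

```python
# Toy noise gate: a much-simplified version of one idea behind noise removal.
# Real AI enhancement tools use trained models, not a fixed amplitude cutoff.

def noise_gate(samples, threshold=0.05):
    """Silence any audio sample whose amplitude falls below the threshold."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]

# Quiet background hiss (small values) is zeroed; speech (large values) passes:
print(noise_gate([0.01, -0.02, 0.6, -0.7, 0.03]))  # [0.0, 0.0, 0.6, -0.7, 0.0]
```

Even this crude version shows why results vary: set the threshold too high and you clip quiet speech, too low and the hiss stays. The AI versions make that judgement adaptively.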

This Thing gives you hands-on experience with both: editing audio through text, and enhancing audio quality with AI. By the end, you'll have a good sense of how capable these tools are, where they fall short, and whether they're useful in your own life.


How AI audio editing works

[Image: an abstract illustration suggesting audio editing, with geometric shapes, waveforms, and editing tools arranged on a desk workspace]
AI audio editing lets you work with speech as text rather than waveforms, while enhancement tools clean up recordings automatically.

As with image generation and voice synthesis, you don't need a deep technical understanding to use these tools well. But knowing the basics helps you understand what you're hearing and why results vary.
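One basic worth knowing: when a tool transcribes your audio, it records a start and end time for every word. Editing the text then reduces to arithmetic on those timestamps. The sketch below (plain Python, with made-up timings; Descript's internals will differ) shows how deleting a word from a transcript becomes a list of audio spans to splice back together.

```python
# Toy sketch of text-based editing: hypothetical word-level timestamps,
# not Descript's actual data format.

transcript = [
    {"word": "So,",   "start": 0.0, "end": 0.4},
    {"word": "umm,",  "start": 0.4, "end": 0.9},
    {"word": "today", "start": 0.9, "end": 1.3},
    {"word": "we",    "start": 1.3, "end": 1.5},
    {"word": "start", "start": 1.5, "end": 2.0},
]

def keep_spans(words, deleted_indices):
    """Merge the timestamps of the words we keep into contiguous audio spans."""
    spans = []
    for i, w in enumerate(words):
        if i in deleted_indices:
            continue
        if spans and abs(spans[-1][1] - w["start"]) < 1e-9:
            spans[-1][1] = w["end"]               # word continues the current span
        else:
            spans.append([w["start"], w["end"]])  # gap found: start a new span
    return [tuple(s) for s in spans]

# Deleting the "umm" (index 1) leaves two spans the editor splices together:
print(keep_spans(transcript, {1}))  # [(0.0, 0.4), (0.9, 2.0)]
```

This is also why results vary: if the transcription mis-times a word boundary, the cut lands mid-syllable, which is exactly the glitch you'll listen for in the activity below.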


The current tool landscape

You've got two main categories of tool here: one for text-based editing, one for audio enhancement. Here's what's available and worth trying.


Resources to explore

Descript

Your primary text-based editing tool for the activity. Free tier with 60 media minutes per month. Requires desktop app (Windows/macOS).

Adobe Podcast Enhance Speech

One-click audio enhancement. Free tier processes files up to 30 minutes with one hour of daily processing. No software to install.

Auphonic

Automated audio post-production with levelling, noise reduction, and loudness normalisation. Free tier of two hours per month.

Audacity

Free, open-source audio editor with traditional waveform editing. Steeper learning curve but full control over your audio.

Krisp

Real-time background noise removal during calls and recordings. Useful if you frequently work from noisy environments.


Activity: edit and enhance your audio

30–45 minutes · Descript + Adobe Podcast

This activity has four parts. You'll record a short audio clip, then use two different AI tools to process it: one for text-based editing and one for audio enhancement. The comparison will show you two distinct ways AI is changing audio editing.

What you'll need: a computer with a microphone (built-in is fine), a free Descript account, and a free Adobe account.

Part 1: Record your source material

Record yourself talking for two to three minutes. Use your computer's built-in microphone or your phone. Don't worry about finding a quiet room or using good equipment. A slightly imperfect recording is actually better for this exercise because it gives the AI something to work with.

Don't script it; just talk naturally. The goal is a recording that sounds like a real person speaking off the cuff, complete with the "ums," pauses, restarts, and background noise that come with that.

Part 2: Text-based editing in Descript

  1. Install and sign up. Download and install Descript from descript.com if you haven't already, and create a free account.
  2. Upload your recording. Create a new project and upload your audio file.
  3. Wait for transcription. Descript will transcribe your audio. This usually takes a minute or two.
  4. Explore the transcript. Read through it and notice how it maps to your recording. Click on any word and the audio will play from that point.
  5. Remove filler words. Descript highlights "ums" and "uhs" automatically. Try removing them individually or use the filler word removal feature to clear them in bulk.
  6. Try a bigger edit. Delete a sentence or two from the middle of your recording. Play back the result and listen for how smooth (or not) the edit sounds.
  7. Try Studio Sound (optional). If you have AI credits remaining, apply Studio Sound to hear the audio enhancement.
  8. Export your edit. Export the edited audio.

Part 3: Audio enhancement with Adobe Podcast

  1. Open the tool. Go to podcast.adobe.com/enhance and sign in with a free Adobe account.
  2. Upload your original. Upload your original recording (the unedited version, not the Descript export).
  3. Wait for processing. The AI typically takes under a minute for a short recording.
  4. Compare the versions. Listen to the enhanced version and toggle between the original and enhanced audio to hear the difference.
  5. Download the result. Download the enhanced version.

Part 4: Compare and reflect

You should now have three versions of your recording: the original, the Descript-edited version, and the Adobe-enhanced version. Listen to all three and write a short reflection (200–300 words) covering these questions:

  • What did the text-based editing in Descript change about the content of your recording? Did removing filler words and pauses make it sound more polished, or did it lose some natural character?
  • What did the Adobe Podcast enhancement change about the sound quality? Could you hear a difference in background noise, clarity, or overall polish?
  • Which type of improvement (content editing or quality enhancement) made the bigger difference to your recording?
  • Can you think of situations in your own life where either of these tools would be useful? Think about any audio or video content you create, even informally: voice messages, presentation recordings, training materials, or social media content.

Privacy reminder: use personal examples or fictional scenarios, never actual work materials or confidential documents.

Your output

A document or blog post containing:

  • Your original recording
  • The Descript-edited version (with filler words removed and at least one section cut)
  • The Adobe Podcast-enhanced version
  • A written reflection (200–300 words) comparing the three versions

Things to notice

As you work through the activity, pay attention to a few things that reveal how these tools work and where they have limitations.


Why this matters

Audio and video content is increasingly part of professional life. Organisations use it for training, internal communications, social media, and knowledge sharing. But the traditional barrier to creating polished audio has always been the editing. It was slow, required specialist software knowledge, and felt like a completely separate skill from the actual content creation.

AI audio editing removes that barrier in two ways. Text-based editing means you can edit a recording as naturally as you'd edit a document, a skill everyone already has. And AI enhancement means you don't need professional equipment or a treated recording space to produce audio that sounds clean and clear.

You don't need to be a "content creator" for this to be relevant. If you've ever recorded a voice note and wished you could tidy it up before sending it, or given a presentation that was recorded and wished the audio quality was better, or thought about starting a podcast but felt put off by the editing, these tools are directly useful.

You'll see this same pattern throughout the programme: AI is steadily removing the technical barriers between having an idea and executing it. You don't need to learn audio engineering to produce clean audio, just as you don't need to learn graphic design to create images (Thing 9) or voice acting to generate narration (Thing 10). The skills that matter more and more are the creative and editorial ones: knowing what you want to say, recognising what sounds good, and making thoughtful decisions about the tools you use.


Claim your Open Badge

Submit your original recording, the Descript-edited version, the Adobe Podcast-enhanced version, and your written reflection as evidence for your Thing 11 badge via cred.scot.


What's next

You've generated images from text (Thing 9), created and transcribed speech (Thing 10), and edited audio using AI (Thing 11). In Thing 12, you'll move into what might be the most unexpectedly fun part of the programme: AI music generation. You'll use tools that can compose complete songs (lyrics, vocals, instrumentation, the lot) from a short text description. If the image generators in Thing 9 made you do a double-take, wait until you hear what AI can do with music.