How Speech Recognition Works: From Sound Waves to Words

Siri isn't listening to words — it's measuring sound wave patterns. Here's how speech recognition works, why accents confuse voice assistants, and what kids should know about AI and bias.

Ask Siri to set a timer and she gets it right almost every time. Ask her to recognize the word your 7-year-old just made up — a nonsense word from a video game — and she fails completely. Not because she didn’t hear it. Because she’s never seen that pattern before.

That gap — between what voice assistants do well and where they fall apart — is a window into how they actually work. And once you understand the mechanism, a lot of your kid’s frustrations with these devices suddenly make complete sense.

Why This Is Worth Teaching

Voice interfaces are everywhere. Alexa, Siri, Google Assistant, voice search, voice-to-text on phones, automated customer service lines, voice control in cars. The expectation that you can just speak to a device and be understood is already a normal part of childhood.

But children who don’t understand how this works are more vulnerable to the gaps. They take it personally when a device doesn’t understand their accent. They assume the system is “dumb” rather than understanding that it reflects the data it was trained on. They miss the larger, more important point: these systems were built mostly by and for certain populations, and that matters.

A 2020 study in the Proceedings of the National Academy of Sciences tested five major commercial speech recognition systems (from Amazon, Apple, Google, IBM, and Microsoft) and found an average word error rate of 35% for Black speakers compared with 19% for white speakers, nearly twice as high. That’s a bias problem rooted in training data, and a child who understands how speech recognition works is equipped to recognize and discuss it.

Explained Like You’re 5: Sound Is Just Wiggling Air

When you talk, your vocal cords vibrate. Those vibrations push air molecules back and forth in waves. Those waves travel through the air and reach a microphone. The microphone has a thin membrane that wiggles in response to the air pressure changes. That wiggling gets converted into an electrical signal — a rapidly changing voltage — which gets converted into numbers.

So your voice, by the time it reaches a computer, is a long sequence of numbers representing how much the microphone membrane moved at each tiny slice of time. About 44,100 slices per second for CD-quality audio; voice systems often get by with fewer. That’s it. That’s what the computer starts with.

No words. No meaning. Just numbers.

How It Actually Works

Step 1: Waveform capture. The microphone converts sound pressure into a digital signal: a sequence of numbers (samples) representing amplitude over time. At 16,000 samples per second (typical for voice processing), one second of speech is 16,000 numbers.
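To make "a sequence of numbers" concrete, here is a minimal Python sketch that fakes a microphone by generating a pure tone at the 16,000-samples-per-second rate mentioned above. The function names and the 220 Hz tone are illustrative choices, not part of any real voice assistant.

```python
import math

SAMPLE_RATE = 16_000  # samples per second, the voice-processing rate from Step 1


def sample_tone(freq_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Return a pure tone as a list of amplitude samples between -1.0 and 1.0."""
    n_samples = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate) for n in range(n_samples)]


def to_int16(samples):
    """Scale float samples to 16-bit integers, a common storage format for audio."""
    return [round(s * 32767) for s in samples]


one_second = to_int16(sample_tone(220.0, 1.0))
print(len(one_second))   # how many numbers make up one second of sound
print(one_second[:8])    # the first few numbers the computer actually receives
```

Printing the first few values is a nice moment to show a kid: the "sound" is nothing more than this list.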

Step 2: Feature extraction — the spectrogram. Raw waveforms are hard to classify directly. Instead, the system converts the waveform into a spectrogram: a visual/mathematical representation of which frequencies are present at each moment in time. Think of it as sheet music for the voice — showing not just volume, but which “notes” (frequencies) are dominant.
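For readers who want to see the idea in code, the sketch below builds a tiny spectrogram using only the Python standard library. It is deliberately slow and simple (a plain discrete Fourier transform instead of the fast version real systems use), and the frame size, hop size, and 1000 Hz test tone are assumptions chosen for illustration.

```python
import cmath
import math

SAMPLE_RATE = 16_000
FRAME_SIZE = 256   # samples per frame (16 ms at 16 kHz)
HOP_SIZE = 128     # step between frame starts (50% overlap)


def hann(n, size):
    """Hann window value: fades each frame in and out to reduce edge artifacts."""
    return 0.5 - 0.5 * math.cos(2 * math.pi * n / size)


def frame_magnitudes(frame):
    """Strength of each frequency bin in one frame, via a plain (slow) DFT."""
    size = len(frame)
    windowed = [x * hann(n, size) for n, x in enumerate(frame)]
    magnitudes = []
    for k in range(size // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * n / size)
                    for n, x in enumerate(windowed))
        magnitudes.append(abs(total))
    return magnitudes


def spectrogram(samples):
    """One row of frequency magnitudes per overlapping frame of audio."""
    return [frame_magnitudes(samples[start:start + FRAME_SIZE])
            for start in range(0, len(samples) - FRAME_SIZE + 1, HOP_SIZE)]


def bin_to_hz(k):
    """Convert a frequency-bin index into the frequency it represents."""
    return k * SAMPLE_RATE / FRAME_SIZE


# A 1000 Hz test tone stands in for recorded speech.
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(1024)]
spec = spectrogram(tone)
loudest_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), "frames x", len(spec[0]), "frequency bins")
print("loudest frequency in first frame:", bin_to_hz(loudest_bin), "Hz")
```

Each row of `spec` is one moment in time and each column is one "note", which is the sheet-music picture described above. For the test tone, the loudest column should land on 1000 Hz.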

A particularly useful format is a set of Mel-frequency cepstral coefficients (MFCCs): a compact mathematical summary of the spectral shape, designed to emphasize the frequency ranges most important for human speech perception.
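The "Mel" part can be shown in a few lines. The sketch below uses one commonly cited formula for the mel scale and prints frequency bands that are evenly spaced on that scale. It covers only the frequency-warping idea, not a full MFCC computation, and the choice of 10 bands up to 8000 Hz is an assumption for illustration.

```python
import math


def hz_to_mel(freq_hz):
    """Convert frequency in Hz to mels (a widely used formula for the mel scale)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)


def mel_band_edges(low_hz, high_hz, n_bands):
    """Band edges evenly spaced on the mel scale, returned in Hz."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / n_bands
    return [mel_to_hz(low_mel + i * step) for i in range(n_bands + 1)]


edges = mel_band_edges(0.0, 8000.0, 10)
for start, end in zip(edges, edges[1:]):
    print(f"{start:7.0f} Hz to {end:7.0f} Hz  (width {end - start:6.0f} Hz)")
```

The bands come out narrow at low frequencies and wide at high ones, mirroring how human hearing is more sensitive to small pitch differences in the lower range.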

Step 3: Pattern matching with a neural network. The spectrogram (or MFCC features) gets fed into a neural network trained on hundreds of thousands of hours of labeled speech. The network learns which spectral patterns correspond to which sounds (phonemes), which phoneme sequences correspond to words, and which word sequences are grammatically probable.
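A real network has millions of learned numbers, but the core move (turn features into a probability for each sound) fits in a toy example. Everything below is hypothetical: three made-up phonemes, three made-up features, and hand-picked weights standing in for what training would normally learn.

```python
import math

PHONEMES = ["s", "ah", "t"]

# Hypothetical hand-picked weights: one row per phoneme, one column per feature
# (energy in a low, mid, and high frequency band). A trained network would learn
# these values from labeled speech instead.
WEIGHTS = [
    [-1.0, 0.0, 2.0],   # "s": hissy, mostly high-frequency energy
    [2.0, 1.0, -1.0],   # "ah": vowel, mostly low and mid energy
    [0.0, -1.0, 1.0],   # "t": short burst, some high-frequency energy
]


def softmax(scores):
    """Turn raw scores into probabilities that add up to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def phoneme_probabilities(features):
    """Score each phoneme with a weighted sum, then convert scores to probabilities."""
    scores = [sum(w * x for w, x in zip(row, features)) for row in WEIGHTS]
    return dict(zip(PHONEMES, softmax(scores)))


probs = phoneme_probabilities([0.1, 0.2, 0.9])   # mostly high-frequency energy
best = max(probs, key=probs.get)
print(best, {p: round(v, 2) for p, v in probs.items()})
```

Notice that the output is never "this is an s". It is a set of probabilities, and the system goes with the most likely one.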

Modern systems like OpenAI’s Whisper use transformer architectures — the same type of architecture underlying language models — trained end-to-end on speech data. Instead of breaking the problem into separate acoustic modeling and language modeling steps, they learn the full mapping from audio to text in one shot.

Step 4: Language model post-processing. Because acoustics alone are ambiguous (many sounds can produce similar spectrograms), most systems apply a language model to pick the most probable word sequence given the acoustic evidence. “Recognize speech” and “wreck a nice beach” sound similar acoustically; the language model picks the one that makes more sense in context.
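Here is a toy version of that tie-breaking step. The scores are invented for illustration: the acoustic numbers say the two phrases sound almost identical, and the language-model numbers say one of them is a far more plausible thing to say.

```python
import math

# Hypothetical acoustic scores (log-probabilities): how well each phrase fits the sound.
ACOUSTIC_LOGPROB = {
    "recognize speech": -4.1,
    "wreck a nice beach": -4.0,   # a slightly better acoustic match in this made-up case
}

# Hypothetical language-model probabilities: how plausible each phrase is as English.
LM_PROB = {
    "recognize speech": 1e-4,
    "wreck a nice beach": 1e-7,
}

LM_WEIGHT = 1.0  # assumed setting: how much the language model counts


def combined_score(candidate, lm_weight=LM_WEIGHT):
    """Acoustic evidence plus weighted language-model evidence, in log space."""
    return ACOUSTIC_LOGPROB[candidate] + lm_weight * math.log(LM_PROB[candidate])


def best_transcript(candidates, lm_weight=LM_WEIGHT):
    """Pick the candidate with the highest combined score."""
    return max(candidates, key=lambda c: combined_score(c, lm_weight))


candidates = list(ACOUSTIC_LOGPROB)
print("acoustics only:     ", best_transcript(candidates, lm_weight=0.0))
print("with language model:", best_transcript(candidates))
```

With the language model switched off, the slightly better-sounding match wins. With it on, the phrase that makes sense wins.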

Why Kids Should Know This Today

Understanding speech recognition matters for three reasons: practical, educational, and ethical.

Practical: Kids who understand that accents and background noise degrade accuracy can adapt their behavior — speaking more clearly, using the system more strategically. They’re also less likely to be frustrated by failures they now understand.

Educational: Speech-to-text technology is embedded in many assistive tools — dictation features for students with dyslexia, ADHD, or fine motor challenges. Parents making decisions about accommodations deserve to understand what these tools can and can’t do.

Ethical: This is the most important one. The accuracy gap across languages and accents is not a technical inevitability. It’s a consequence of who built the training data and whose voices were over- or underrepresented. Stanford-led research published in 2020 (source 1) found that automated speech recognition (ASR) systems produced word error rates nearly double for Black speakers compared with white speakers. A kid who understands this is equipped to ask “who built this, and whose voices did they use?”

How to Teach Your Kid About This

Ages 5–8: The Telephone Whispering Game

Play the classic telephone game — whisper a phrase down a line of people and see how it arrives distorted. Then explain: “When your voice reaches Alexa, it’s had to travel through the air, through the microphone, get turned into numbers, and then the computer has to guess what those numbers mean. Like in telephone, sometimes the message gets a little scrambled.”

Then experiment: speak clearly vs. speak softly with background TV. How does accuracy change?

Ages 9–12: Accent Experiment

Search YouTube for “voice assistant accent challenge” — there are dozens of videos where people with different regional and international accents test how well Siri or Google Assistant understands them. Watch a few together. Then ask: “Why does the assistant understand some accents better than others?”

The answer: it depends entirely on the training data. A system trained mostly on American English news broadcasts will be better at understanding newscaster American English than Caribbean English or Scottish English.

Then ask the harder question: “Is that fair? What would need to change to fix it?”

Ages 13+: Explore OpenAI’s Whisper

Whisper is OpenAI’s open-source speech recognition system, released for free. It runs locally and supports 99 languages. A teenager with Python experience can install it in one command and start transcribing audio files.

More importantly: Whisper’s model card (the documentation that explains how it was built) explicitly discusses its limitations and the languages where it performs worst. Reading model documentation critically — understanding what the model can’t do and why — is a skill every technically literate person needs.
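For the hands-on part, a first script might look like the sketch below. It assumes the `openai-whisper` Python package and the `ffmpeg` tool are installed; the file name and the choice of the small "base" model are placeholders.

```python
# Assumed setup: pip install -U openai-whisper
# Whisper also needs the ffmpeg command-line tool installed to decode audio files.
import sys


def transcribe_file(path, model_name="base"):
    """Transcribe one audio file with a locally run Whisper model."""
    import whisper  # imported here so the rest of the script loads without it

    model = whisper.load_model(model_name)   # downloads model weights on first use
    result = model.transcribe(path)
    return result["text"], result["language"]


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage (hypothetical file name): python transcribe.py recording.mp3
    text, language = transcribe_file(sys.argv[1])
    print("Detected language:", language)
    print(text)
```

A good follow-up experiment is to record the same sentence spoken by different family members and compare the transcripts.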

Speech Recognition Accuracy by System and Language/Accent

| System | Standard American English WER | African American English WER | Spanish (US) WER | Mandarin WER |
|---|---|---|---|---|
| Google (Cloud Speech-to-Text) | ~5% | ~9–15% | ~8% | ~6% |
| Apple (Siri) | ~5% | ~10–14% | ~9% | ~7% |
| Amazon (Alexa/Transcribe) | ~6% | ~12–16% | ~10% | ~8% |
| OpenAI Whisper (large) | ~3% | ~6–8% | ~5% | ~4% |
| Microsoft Azure | ~5% | ~10–13% | ~8% | ~6% |

WER = Word Error Rate (lower is better). Figures are approximate and vary by study and test conditions. AAE figures are attributed to the 2020 PNAS study on racial disparities in speech recognition (source 1); general figures are from published benchmarks circa 2024.

In the approximate figures above, OpenAI’s Whisper outperforms the proprietary systems on underrepresented languages and accents, likely because it was trained on a more diverse multilingual dataset (680,000 hours of audio from the internet in 99 languages).

Real-World Examples Kids Encounter Every Day

Siri and Google Assistant — every voice command your child gives goes through this entire pipeline in under a second. The accuracy depends on: microphone quality, ambient noise, how clearly they speak, and how well their accent matches the training data.

Voice-to-text in messaging apps — tap the microphone icon in iMessage or WhatsApp. The speech-to-text system runs (usually on-device for privacy) and converts speech to text. Notice how it handles fast speech, mumbling, or unusual names.

YouTube auto-captions: Google’s speech recognition generates captions automatically for millions of videos. The quality varies enormously by accent, audio quality, and topic. Turning on auto-captions for a video whose speaker has a strong accent can be a fascinating demonstration of where the system breaks down.

Video game voice chat — some games use voice recognition for commands (rare) or apply noise-cancellation to voice chat (common). Noise cancellation uses the same spectral analysis principles.

What to Watch for Over 3 Months

Month 1: Does your child notice when voice assistants fail, and do they have any theory about why? “It just doesn’t work” vs. “It’s not recognizing my accent” vs. “The background noise is confusing it” — the latter two show growing sophistication.

Month 2: After the accent experiment, does your child bring up training data bias in any other context? This concept — that AI systems reflect the data they were trained on, and that data reflects choices made by humans — applies to recommendation algorithms, facial recognition, and hiring tools. Once the pattern is seen, it’s seen everywhere.

Month 3: Can your child explain the difference between what a voice assistant “hears” (a sound wave) and what it “understands” (nothing — it pattern-matches)? The insight that understanding is simulated, not real, applies to all of AI and is perhaps the most foundational AI literacy concept.

FAQ

Why does Alexa sometimes hear things I didn’t say?

Because it’s always listening for its wake word (“Alexa”) by matching incoming audio against a stored acoustic pattern. Any sufficiently similar pattern — in TV audio, nearby conversation, or certain music frequencies — can accidentally match the wake word and trigger the device. The system is doing probability matching, not comprehension.
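The "close enough counts" behavior can be sketched in a few lines. The stored pattern, the threshold, and the example sounds below are all invented for illustration; real wake-word detectors use small neural networks rather than a single stored vector.

```python
import math

# Hypothetical stored "fingerprint" of the wake word: a short feature vector.
WAKE_WORD_TEMPLATE = [0.9, 0.1, 0.7, 0.3]
THRESHOLD = 0.95  # assumed similarity needed to trigger the device


def cosine_similarity(a, b):
    """Similarity between two feature vectors: 1.0 means pointing the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def wakes_device(features, threshold=THRESHOLD):
    """True when incoming audio is similar enough to the stored pattern."""
    return cosine_similarity(features, WAKE_WORD_TEMPLATE) >= threshold


sounds = {
    "someone says the wake word": [0.88, 0.12, 0.72, 0.28],
    "similar-sounding word on TV": [0.8, 0.3, 0.75, 0.2],
    "dishwasher noise": [0.1, 0.9, 0.2, 0.8],
}
for label, features in sounds.items():
    score = cosine_similarity(features, WAKE_WORD_TEMPLATE)
    print(f"{label}: similarity {score:.2f}, triggers: {wakes_device(features)}")
```

The TV example crosses the threshold even though nobody said the wake word, which is exactly the accidental trigger described above. Raising the threshold reduces false alarms but makes the device miss more real requests.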

Why is my kid’s voice assistant less accurate than mine?

Children’s voices have different fundamental frequencies, speaking rates, and pronunciation patterns compared to adult voices. Most commercial systems were trained predominantly on adult speech, so accuracy drops for younger voices. Some newer systems have been specifically trained on children’s speech data, but coverage is still uneven.

Does the voice assistant record everything I say?

Most devices are always “listening” in a low-power mode, but only for the wake word. When the wake word is detected, the full recording begins and is sent to cloud servers for processing. Many companies store these recordings for a period of time to improve their models. Review your device’s privacy settings — most allow you to delete stored voice recordings.

Can accents be “fixed” by the user?

To some extent, yes. Speaking more slowly, clearly, and in a slightly more formal register usually improves accuracy. But this is asking users to adapt to a system’s limitations — a legitimate criticism. The better fix is building more diverse training datasets.

What language is Alexa best at?

English — specifically standard American English — because that’s where the majority of training data came from. Whisper (OpenAI) handles a much broader range of languages more accurately, which is why it’s preferred for transcription work across multiple languages.

What’s word error rate and should I care?

Word Error Rate (WER) counts the words a transcription got wrong (substituted, left out, or added by mistake) and divides by the number of words actually spoken. A 5% WER means roughly 5 errors for every 100 spoken words. That sounds small, but in a 200-word paragraph it’s 10 errors. For voicemail transcription, tolerable. For medical dictation or legal proceedings, 5% is unacceptably high. Context matters enormously.
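For the curious, WER is simple enough to compute yourself. The sketch below uses the standard word-level edit distance; the function name and the example sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = fewest edits to turn the first i-1 reference words
    # into the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # reference word was dropped
                            curr[j - 1] + 1,       # extra word was inserted
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1] / len(ref)


said = "set a timer for ten minutes"
heard = "set the timer for two minutes"
print(f"WER: {word_error_rate(said, heard):.0%}")
```

In this example two of the six spoken words were transcribed wrong, so the error rate is one in three. Kids can try it on their own dictation tests by typing what they said and what the phone wrote.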


About the author: Ricky Flores is the founder of HiWave Makers and an electrical engineer with 15+ years of experience building consumer technology at Apple, Samsung, and Texas Instruments. He writes about how kids learn to build, think, and create in a tech-saturated world. Read more at hiwavemakers.com.


Sources

  1. Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., Toups, C., Rickford, J. R., Jurafsky, D., & Goel, S. (2020). “Racial disparities in automated speech recognition.” Proceedings of the National Academy of Sciences, 117(14), pp. 7684–7689. https://doi.org/10.1073/pnas.1915768117
  2. Tatman, R. (2017). “Gender and Dialect Bias in YouTube’s Automatic Captions.” Proceedings of the ACL Workshop on Ethics in NLP, pp. 53–59. https://aclanthology.org/W17-1606/
  3. Radford, A., Kim, J. W., Xu, T., et al. (2022). “Robust Speech Recognition via Large-Scale Weak Supervision.” OpenAI Technical Report. https://arxiv.org/abs/2212.04356
  4. Hinton, G., Deng, L., Yu, D., et al. (2012). “Deep Neural Networks for Acoustic Modeling in Speech Recognition.” IEEE Signal Processing Magazine, 29(6), pp. 82–97. https://doi.org/10.1109/MSP.2012.2205597
  5. Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations.” Advances in Neural Information Processing Systems, 33. https://arxiv.org/abs/2006.11477
  6. National Institute of Standards and Technology (NIST). (2023). Speech Recognition Technology Evaluation Results. U.S. Department of Commerce. https://www.nist.gov/programs-projects/speech-recognition