Automatic Speech Recognition
Technology that converts spoken language into text, enabling machines to understand and transcribe human speech.
Also known as: ASR, Speech-to-Text, STT, Speech Recognition, Voice Recognition
Category: AI
Tags: ai, speech-processing, machine-learning, natural-language-processing, technology
Explanation
Automatic Speech Recognition (ASR), also known as speech-to-text, is the technology that converts spoken language into written text. It is one of the foundational technologies enabling voice interfaces, transcription services, and human-computer interaction through natural speech.
**How modern ASR works:**
Contemporary ASR systems typically use deep learning approaches:
1. **Audio preprocessing**: Raw audio is converted into spectrograms or mel-frequency cepstral coefficients (MFCCs)
2. **Acoustic model**: A neural network (often a transformer or conformer architecture) maps audio features to phonemes, characters, or subword units
3. **Language model**: Provides linguistic context to disambiguate similar-sounding words and phrases
4. **Decoder**: Combines acoustic and language model outputs to produce the most likely text transcription
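Steps 1 and 4 above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not how any particular ASR system is implemented: the function names `log_spectrogram` and `pick_transcript`, the `lm_weight` parameter, and all scores are hypothetical, and production systems use an FFT library and mel filterbanks rather than the naive DFT shown here.

```python
import cmath
import math


def log_spectrogram(samples, frame_len=200, hop_len=80):
    """Step 1 sketch: Hann-windowed frames -> log-magnitude spectrum per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):  # non-negative frequency bins only
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(frame))
            spectrum.append(math.log(abs(coeff) + 1e-8))  # avoid log(0)
        frames.append(spectrum)
    return frames


def pick_transcript(candidates, lm_weight=0.5):
    """Step 4 sketch: pick the candidate with the best combined log-score."""
    best = max(candidates, key=lambda c: c["acoustic"] + lm_weight * c["lm"])
    return best["text"]


# A 400 Hz test tone, 0.1 s long, sampled at 8 kHz (assumed values).
tone = [math.sin(2 * math.pi * 400 * n / 8000) for n in range(800)]
features = log_spectrogram(tone)
print(len(features), len(features[0]))  # number of frames, bins per frame

# Made-up log-probabilities for two similar-sounding hypotheses.
candidates = [
    {"text": "wreck a nice beach", "acoustic": -4.0, "lm": -6.5},
    {"text": "recognize speech", "acoustic": -4.2, "lm": -2.0},
]
print(pick_transcript(candidates))  # prints "recognize speech"
```

With `lm_weight=0` the acoustically stronger but less plausible "wreck a nice beach" would win, which is the disambiguation role the language model plays in step 3.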
**Key milestones:**
- **1952**: Audrey system (Bell Labs) recognized spoken digits
- **1970s-1990s**: Hidden Markov Models dominated ASR research
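- **2012**: Deep neural network acoustic models began replacing Gaussian mixture models in HMM-based systems, sharply reducing word error rates
- **2022**: OpenAI's Whisper approached human-level accuracy on English speech and supported transcription in many languages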
- **2023+**: Multimodal models handle speech alongside text and images
**Challenges:**
- **Accents and dialects**: Performance varies significantly across speech varieties
- **Noisy environments**: Background noise degrades recognition accuracy
- **Domain-specific vocabulary**: Technical jargon, proper nouns, and neologisms are often misrecognized
- **Code-switching**: Speakers who switch between languages mid-sentence are hard to transcribe, because many systems assume a single language per utterance
- **Real-time constraints**: Streaming ASR must balance latency with accuracy
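The accuracy differences described above are commonly quantified with word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the system output into a reference transcript, divided by the number of reference words. The sketch below is one straightforward way to compute it; the function name `word_error_rate` is hypothetical, and it assumes whitespace tokenization with no text normalization, which real evaluations usually add.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of an extra word
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.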
**Applications:**
- **Voice assistants**: Siri, Alexa, Google Assistant
- **Transcription services**: Meeting notes, medical dictation, legal proceedings
- **Accessibility**: Real-time captioning for people who are deaf or hard of hearing
- **Voice search**: Hands-free information retrieval
- **Content creation**: Voice-to-text for writing, note-taking, and documentation
- **Call center analytics**: Automated analysis of customer service interactions
**Relationship to other technologies:**
ASR is often combined with speaker diarization (who said what), natural language understanding (what it means), and text-to-speech (responding verbally) to create complete voice-powered applications.
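As a rough illustration of combining ASR with speaker diarization, the sketch below attaches a speaker label to each transcribed word by timestamp overlap. The tuple layouts, the function name `label_words`, and the sample data are assumptions made for this example; actual ASR and diarization tools each define their own output formats.

```python
def label_words(words, turns):
    """Attach to each word the speaker whose turn overlaps it the most.

    words: list of (text, start_sec, end_sec) from a hypothetical ASR step
    turns: list of (speaker, start_sec, end_sec) from a hypothetical diarizer
    """
    labeled = []
    for text, w_start, w_end in words:
        best_speaker, best_overlap = None, 0.0
        for speaker, t_start, t_end in turns:
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((best_speaker, text))  # None if no turn overlaps
    return labeled


words = [("hello", 0.0, 0.4), ("there", 0.5, 0.9), ("hi", 1.1, 1.3)]
turns = [("A", 0.0, 1.0), ("B", 1.0, 2.0)]
print(label_words(words, turns))
# prints [('A', 'hello'), ('A', 'there'), ('B', 'hi')]
```

The labeled words could then be passed to a natural language understanding step, one speaker turn at a time.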