Benchmarking Voice Agents on Code-Switched Speech Reveals Flaws in Current ASR Models
7
AI Analysis:
Significant industry methodology update showing practical flaws in current ASR deployments; the low hype score reflects that this is highly specialized research, not general consumer news.
Article Summary
This article introduces a new benchmark and dataset (AU-Harness) designed to test Automatic Speech Recognition (ASR) performance on code-switched speech, in which speakers alternate between languages within a single utterance, a common feature of global enterprise communication. Using real-world IT Service Management (ITSM) and Human Resources (HR) scenarios across Spanish, French, German, and English pairings, the researchers evaluate major models including ElevenLabs Scribe V2, Google Gemini 3 Flash, and Whisper. The evaluation uses three metrics: Word Error Rate (WER) for raw transcription accuracy, Semantic WER (SWER) for meaning preservation, and Answer Error Rate (AER) for downstream comprehension failures. The key finding is that while certain specialized models lead on raw WER, models optimized for language understanding (such as Gemini) often perform better on the meaning-sensitive metrics (SWER and AER). The data pipeline and methodology are fully released to the community.
Key Points
- The study established a critical benchmark (AU-Harness) for measuring ASR performance on code-switched speech, essential for global enterprise voice agents.
- Semantic metrics (SWER and AER) are shown to be more valuable indicators of failure than standard Word Error Rate (WER), as they test meaning preservation for downstream tasks.
- Top-performing models vary by metric and language pair; models optimized for general language reasoning, such as Gemini, excelled in meaning-sensitive tests (AER), even if their raw WER was not the lowest.
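
For readers unfamiliar with the baseline metric, the following is a minimal sketch of how standard Word Error Rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is not taken from the benchmark's released pipeline; the function name `word_error_rate`, the lowercase-and-split normalization, and the sample sentences are assumptions made for illustration. SWER and AER are not sketched here because the summary does not describe how they are scored.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: text is normalized only by lowercasing and splitting on
    whitespace; a real evaluation would likely normalize more carefully.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical code-switched utterance (English with one Spanish word).
reference = "please reset my contraseña for the VPN"
hypothesis = "please reset my contrasena for the VPN"
wer = word_error_rate(reference, hypothesis)  # one substitution in seven words
```

In this made-up example a single mis-transcribed word yields a WER of one in seven. A small WER like this says nothing about whether the meaning survived for a downstream task, which is the gap the summary attributes to SWER and AER.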

