Benchmarking Voice Agents on Code-Switched Speech Reveals Flaws in Current ASR Models

Code-switching, Automatic Speech Recognition (ASR), Voice Agents, Bilingual speech, Word Error Rate (WER), Semantic Word Error Rate (SWER), IT Service Management (ITSM)
June 09, 2026
Viqus Verdict: 7
Functional Accuracy Over Raw Transcription
Media Hype 5/10
Real Impact 7/10

Article Summary

This article introduces a benchmark and dataset (AU-Harness) designed to test Automatic Speech Recognition (ASR) performance on code-switched speech, in which a speaker alternates between languages within a conversation or a single utterance. Such speech is a common feature of global enterprise communication. Using real-world IT Service Management (ITSM) and Human Resources (HR) scenarios across Spanish, French, German, and English pairings, the researchers evaluate models including ElevenLabs Scribe V2, Google Gemini 3 Flash, and Whisper. The evaluation uses three metrics: Word Error Rate (WER) for raw transcription accuracy, Semantic WER (SWER) for meaning preservation, and Answer Error Rate (AER) for downstream comprehension failures. All three are error rates, so lower is better. The key finding is that while certain specialized models lead on raw WER, models optimized for language understanding (such as Gemini) often perform better on the meaning-sensitive metrics (AER and SWER). The data pipeline and methodology are fully released to the community.
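To make the first metric concrete, here is a minimal sketch of the standard WER calculation: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is not the AU-Harness implementation. The function name, the lowercase-and-split normalization, and the example utterances are all assumptions made for illustration, and the summary does not specify how SWER or AER are computed, so they are not sketched here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance / number of reference words.

    Normalization here (lowercasing, whitespace split) is an assumption;
    real evaluations usually apply a more careful text normalizer.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical code-switched utterance (English with a Spanish noun).
reference = "please reset my contraseña before the meeting"
minor_error = "please reset my contraseña before a meeting"      # "the" -> "a"
critical_error = "please reset my contrato before the meeting"   # "password" -> "contract"

wer_minor = word_error_rate(reference, minor_error)
wer_critical = word_error_rate(reference, critical_error)
print(f"minor: {wer_minor:.3f}, critical: {wer_critical:.3f}")
```

Both hypotheses contain exactly one substituted word out of seven, so they receive the same WER, even though only the second changes what the user is asking for. This is the gap that meaning-sensitive metrics such as SWER and AER are intended to expose.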

Key Points

  • The study establishes a benchmark (AU-Harness) for measuring ASR performance on code-switched speech, a capability that global enterprise voice agents depend on.
  • Semantic metrics (SWER and AER) are shown to be more informative indicators of failure than standard Word Error Rate (WER), because they test whether meaning is preserved for downstream tasks.
  • Top-performing models vary by metric and language pair; models optimized for general language reasoning, such as Gemini, did best on meaning-sensitive tests (AER), even when their raw WER was not the lowest.

Why It Matters

For professionals building or adopting global AI contact centers and automated support systems, this benchmark moves the conversation beyond word-for-word accuracy and toward 'functional accuracy': does the AI understand the intent and context, even if it misses a minor word? The results indicate that choosing a model solely on the lowest WER is insufficient. Because AER and SWER are error rates, system architects should prioritize models that score low on both to ensure operational reliability in diverse, multilingual corporate environments. This raises the standard for deployed AI voice systems.
