Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech

Speaking more than one language is common for a large share of the world's population. Many bilingual speakers switch between languages, sometimes in the middle of a sentence, as a natural part of daily conversation. This phenomenon, known as code-switching, shows up everywhere, from casual chats with friends to critical customer service calls and IT helpdesk conversations. People use whichever language feels most comfortable or appropriate in the moment.

The rise of AI-powered voice agents promises to reshape how businesses interact with their global customer base. These agents can handle high call volumes, offer 24/7 support, and potentially cut operational costs significantly. However, a critical question remains: can these advanced AI voice agents truly handle the complexity of bilingual customers, especially when they code-switch? Traditional Automatic Speech Recognition (ASR) systems, the foundational technology for any voice agent, often stumble when faced with this linguistic fluidity.

Recognizing this gap, researchers at ServiceNow-AI recently conducted a significant study titled "Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech." This benchmark, published on June 9, 2026, aimed to specifically evaluate how modern ASR models, including "frontier ASRs" and Large Audio Language Models (LALMs), perform on code-switched speech in enterprise contexts. Understanding these benchmarks is key for anyone looking to deploy effective AI voice solutions in a multilingual environment.

What Exactly is Code-Switching?

Code-switching refers to the practice of alternating between two or more languages or language varieties within a single conversation or even a single sentence. It's a common and vital form of language use in multilingual communities, driven by factors like comfort, context, and speaker identity. For example, a customer might start a sentence in Spanish, insert a technical term in English, and then finish the thought back in Spanish.

Linguists often distinguish between two main types of code-switching relevant to ASR:

Inter-sentential Code-Switching: This happens between sentences, where a speaker completes one sentence in one language and then starts the next in another. For example, "The meeting is at 2 PM. ¿Tienes alguna pregunta?" (Do you have any questions?).
Intra-sentential Code-Switching: This occurs within a single sentence and is much more challenging for ASR systems. An example might be, "No recuerdo mi bank password." (I don't remember my bank password). Here, the grammar and phonetics can shift abruptly and word boundaries can blur, which makes the switch hard for models to track.
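
As a rough illustration of the distinction, the sketch below labels a short exchange from per-word language tags. The tags are hand-assigned and the function name `classify_switching` is an assumption made for this example; a real system would need a language tagger to produce them.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (word, language tag); tags are hand-assigned here


def classify_switching(sentences: List[Sentence]) -> str:
    """Label an exchange as monolingual, inter-sentential, or intra-sentential."""
    languages_per_sentence = [{lang for _, lang in sentence} for sentence in sentences]
    # Any single sentence that mixes languages counts as intra-sentential switching.
    if any(len(langs) > 1 for langs in languages_per_sentence):
        return "intra-sentential"
    # Otherwise, a change of language between sentences is inter-sentential.
    all_languages = set().union(*languages_per_sentence) if languages_per_sentence else set()
    if len(all_languages) > 1:
        return "inter-sentential"
    return "monolingual"


inter = [
    [("The", "en"), ("meeting", "en"), ("is", "en"), ("at", "en"), ("2", "en"), ("PM", "en")],
    [("Tienes", "es"), ("alguna", "es"), ("pregunta", "es")],
]
intra = [[("No", "es"), ("recuerdo", "es"), ("mi", "es"), ("bank", "en"), ("password", "en")]]

print(classify_switching(inter))  # inter-sentential
print(classify_switching(intra))  # intra-sentential
```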

Why Code-Switching is an ASR Nightmare

While ASR technology has come a long way, handling code-switched speech presents unique and significant hurdles. Traditional ASR systems are usually designed to operate within a single language framework. This monolingual bias leads to several problems:

Language Confusion and Identification Errors: ASR systems need to discern not only the words being spoken but also the language they are uttered in. When languages are mixed, especially mid-sentence, the system struggles to identify the correct language and ends up decoding part of the audio with the wrong language's assumptions.
Unbalanced Training Data: Most large ASR datasets are monolingual and dominated by English, so high-quality, diverse code-switched speech and text data for training is scarce. Models simply haven't "heard" enough of these transitions to handle them naturally.
Phonetic Overlap and Borrowing: When speakers code-switch, words from one language might adopt the pronunciation patterns of another. For instance, an English word embedded in a Hindi sentence might be pronounced with a Hindi accent, confusing models trained on canonical pronunciations.
Acoustic Model Confusion at Switch Points: Acoustic models learn language-specific phonetic distributions. When a speaker switches languages, the model's confidence can drop sharply at the language boundary, leading to increased error rates. For structurally distant language pairs like Chinese and Bahasa Malay, Word Error Rates (WER) can exceed 114% due to massive insertion errors. A WER above 100% is possible because inserted words count as errors while the denominator stays fixed at the number of reference words.
Downstream Pipeline Damage: ASR is often the first step in a voice agent pipeline. Errors in transcription propagate forward, affecting subsequent tasks like natural language understanding, intent recognition, and response generation, leading to misrouted tickets or misunderstood customer queries.
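
To see how insertions drive WER past 100%, here is a minimal sketch of the standard word-level edit-distance computation. The example sentences are invented for illustration and do not come from any benchmark.

```python
from typing import List


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i] + [0] * len(hypothesis)
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current[j] = min(substitution, deletion, insertion)
        previous = current
    return previous[-1] / len(reference)


reference = "no recuerdo mi bank password".split()
clean = "no recuerdo mi banco password".split()
# A made-up hypothesis that hallucinates extra words around the switch point.
noisy = "no no recuerdo me me bang pass pass word word si".split()

print(word_error_rate(reference, clean))  # 0.2
print(word_error_rate(reference, noisy))  # 1.8
```

The noisy hypothesis contains 11 words against a 5-word reference, so the nine edits needed to repair it produce a WER of 180%.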

Enter Frontier ASR: The Next Generation

To overcome these challenges, modern ASR systems are moving beyond simple language identification (LID) and monolingual routing. The "frontier ASRs" leverage advanced techniques to process language fluidity natively:

Deep Learning and Neural Networks: Transformer-based encoder-decoder architectures in particular have significantly improved the ability of ASR systems to handle code-switching. These models are trained on massive datasets to learn the intricacies of language mixing.
Multilingual Acoustic Models: Instead of separate models for each language, modern systems use shared neural frameworks and multilingual acoustic models trained on data from multiple languages. This allows the system to switch between languages as needed. Techniques like explicit language conditioning, adapter modules, and low-rank adaptation (LoRA) help balance performance across high- and low-resource languages.
End-to-End Architectures: End-to-end multilingual models are proving more effective for real-time and intra-sentential code-switching compared to older "cascade" architectures that first identify the language and then route to a specific ASR model. Cascade systems often introduce latency and struggle with mid-sentence switches.
Synthetic Data Generation: Given the scarcity of natural code-switched datasets, researchers are increasingly using advanced methods to generate realistic synthetic speech and text data for training. Projects like LinguaMaster and SwitchLingua are creating large-scale, diverse datasets to bridge this gap.
Large Audio Language Models (LALMs): These models are at the forefront, integrating large language model capabilities with audio processing for more robust understanding and generation in multilingual contexts.
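
To make the cascade limitation concrete, here is a toy sketch in which the "audio" is already a list of words and each "recognizer" is only a vocabulary lookup. The lexicons and the functions `identify_language`, `cascade_transcribe`, and `shared_transcribe` are hypothetical stand-ins, not any vendor's API. The only point is that a single utterance-level language decision loses the words of the embedded language.

```python
from typing import Dict, List, Set

# Hypothetical monolingual vocabularies standing in for per-language ASR models.
LEXICONS: Dict[str, Set[str]] = {
    "es": {"no", "recuerdo", "mi", "contraseña"},
    "en": {"i", "forgot", "my", "bank", "password"},
}


def identify_language(words: List[str]) -> str:
    """Utterance-level LID: pick the language whose lexicon covers the most words."""
    return max(LEXICONS, key=lambda lang: sum(w in LEXICONS[lang] for w in words))


def cascade_transcribe(words: List[str]) -> List[str]:
    """Cascade: decide one language, then route the whole utterance to that recognizer."""
    lexicon = LEXICONS[identify_language(words)]
    return [w if w in lexicon else "<unk>" for w in words]


def shared_transcribe(words: List[str]) -> List[str]:
    """Single recognizer with one vocabulary shared across both languages."""
    shared = set().union(*LEXICONS.values())
    return [w if w in shared else "<unk>" for w in words]


utterance = "no recuerdo mi bank password".split()
print(cascade_transcribe(utterance))  # ['no', 'recuerdo', 'mi', '<unk>', '<unk>']
print(shared_transcribe(utterance))   # ['no', 'recuerdo', 'mi', 'bank', 'password']
```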

Companies like NVIDIA, with their NeMo framework, offer multilingual and code-switched ASR models. Meta's Omnilingual ASR aims to provide transcription for over 1,600 languages, including many low-resource ones. AethexAI's Kora 1 platform focuses on voice AI for emerging markets, including code-switching support.

Benchmarking the Bilingual Challenge: The ServiceNow-AI Study

The study by Shama Gupta, Lindsay Brin, and Fanny Riols from ServiceNow-AI addresses the practical need for robust code-switching ASR in enterprise settings. Their work highlights that while bilingualism is widespread, there's been little focus on how voice agents handle code-switched speech in real-world business scenarios, where transcription errors can have significant operational consequences.

Methodology and Focus

The ServiceNow-AI team built a custom benchmark and dataset to evaluate ASR models.

Language Pairs: The benchmark focused on four language pairs highly relevant to their customer base: Spanish-English, French-English, Canadian French-English, and German-English. The non-English language was generally considered the "matrix" (primary) language, with English words or phrases embedded within it.
Data Source: The benchmark uses synthetic audio generated via a Text-to-Speech (TTS) model. This synthetic data was created from an internal corpus of IT support and Human Resources (HR) interactions, covering common scenarios like employee inquiries about benefits or password resets. While synthetic data allows for controlled experiments, the authors acknowledge it might not fully capture the nuances of natural code-switched speech.
Evaluation Metrics: To provide a comprehensive view, the study reported three key metrics:
- Word Error Rate (WER): The standard metric for ASR accuracy, computed as the number of substituted, deleted, and inserted words divided by the number of words in the reference transcript.
- Semantic Word Error Rate (SWER): This metric aims to capture how well the meaning of the utterance is preserved, even if the exact words are not transcribed perfectly.
- Answer Error Rate (AER): This measures the ability of the system to derive the correct answer or intent from the transcription, which is crucial for downstream tasks in voice agents.
Models Evaluated: The benchmark included seven ASR systems, encompassing a mix of Large Audio Language Models (LALMs), other "frontier ASRs," and open-source solutions. While specific model names aren't fully detailed in the abstract, other similar benchmarks mention systems like ElevenLabs Scribe v2, Gemini 3 Flash, AssemblyAI, OpenAI Whisper, Deepgram Nova-3, and Voxtral-Mini-4B.

Key Findings

The study revealed several important insights into ASR performance on code-switched speech:

Variable Performance: The "cost of codeswitching" (the degradation in performance compared to monolingual speech) varied significantly depending on the specific language pair and the ASR model tested.
Surprising Error Concentration: Counterintuitively, errors in code-switched utterances tended to concentrate on the English portions rather than the non-English (matrix language) parts. This is unexpected because English is typically a high-resource language that models handle well in monolingual settings. Researchers suggest this might be due to English segments often containing technical vocabulary or named entities that are harder to transcribe, or simply because any embedded-language segment creates a challenging context requiring the model to adapt to a different phonological and lexical register mid-utterance.
Top Performers: Among the systems benchmarked, ElevenLabs Scribe v2, Gemini 3 Flash, and AssemblyAI demonstrated the smallest performance deltas between code-switched and monolingual speech, indicating better robustness to bilingual input. Scribe v2 notably outperformed its own L2 baseline, suggesting genuine strength in handling code-switched audio.
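
One way to check where errors concentrate is to tag each reference word with its language, align the hypothesis against the reference, and count errors per tag. The sketch below does this with a plain edit-distance alignment. The function name, the tagging scheme, and the example utterance are assumptions for illustration, not the study's analysis code.

```python
from collections import Counter
from typing import List, Tuple


def errors_by_language(reference: List[Tuple[str, str]], hypothesis: List[str]) -> Counter:
    """Count substituted or deleted reference words per language tag.

    Insertions have no reference word, so they are tallied under "insertion".
    """
    n, m = len(reference), len(hypothesis)
    # dist[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = reference[i - 1][0] != hypothesis[j - 1]
            dist[i][j] = min(dist[i - 1][j - 1] + cost, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Walk back through the table to recover one optimal alignment.
    errors: Counter = Counter()
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            word, lang = reference[i - 1]
            cost = word != hypothesis[j - 1]
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                if cost:
                    errors[lang] += 1  # substitution of a reference word
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            errors[reference[i - 1][1]] += 1  # deletion of a reference word
            i -= 1
        else:
            errors["insertion"] += 1
            j -= 1
    return errors


tagged_reference = [("no", "es"), ("recuerdo", "es"), ("mi", "es"), ("bank", "en"), ("password", "en")]
hypothesis = "no recuerdo mi banco".split()
print(errors_by_language(tagged_reference, hypothesis))  # Counter({'en': 2})
```

When several alignments are equally good, the backtrace picks one of them, so per-language counts on ambiguous utterances should be read as approximate.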

Limitations of the Benchmark

The authors acknowledge certain limitations:

The use of synthetic audio means the benchmark might not fully capture the subtle prosodic and phonological characteristics of natural code-switched speech.
Models were evaluated using only "auto language detection," without explicit language hints or forced language tokens that some systems offer. This choice reflects a common production setting but might not show a model's absolute best performance if such configurations were used.

The benchmark and data are released through ServiceNow-AI's AU-Harness, their evaluation framework for voice models.

Beyond the Benchmark: Other Efforts and Tools

The ServiceNow-AI study is part of a broader effort to improve code-switched ASR. Other notable contributions include:

SwitchLingua: Introduced in December 2025, this is a large-scale, multilingual, and multi-ethnic code-switching dataset. It includes 420K textual samples across 12 languages and over 80 hours of audio from diverse speakers. SwitchLingua also proposes a new metric, Semantic-Aware Error Rate (SAER), to better assess ASR performance in code-switching scenarios by incorporating semantic information.
CS-FLEURS: Presented in August 2025, this is a massively multilingual code-switched dataset for ASR and Speech Translation (ST), covering 52 languages and 113 unique code-switched pairs.
BAAI/CS-Dialogue: A 104-hour dataset of spontaneous Mandarin-English code-switching dialogues, providing natural conversational data.
Perle.ai Benchmark: Another recent benchmark (May 2026) evaluates five commercial ASR providers on Arabic-English, Persian-English, and German-English code-switching, using both WER and BERTScore (which is argued to be more reliable for languages with transliteration variance). ElevenLabs Scribe v2 also performed strongly in this evaluation.
Commercial Solutions: Many platforms are integrating advanced multilingual and code-switching capabilities into their voice AI agents. These include Retell AI, Google Dialogflow CX (with Gemini-2 live translation), IBM Watson Assistant, Twilio Voice, and Gladia. Many now offer instant language detection and mid-conversation switching.

The Real-World Impact on Voice Agents

The ability of voice agents to handle code-switched speech is not just a technical challenge; it has significant real-world implications:

Enhanced Customer Experience: Bilingual customers prefer to communicate in a way that feels natural to them. Agents that can accurately understand code-switched speech can provide faster, more accurate, and more empathetic support, making customers feel valued and understood.
Global Business Expansion: For businesses operating in multilingual regions or serving diverse customer bases, robust code-switching ASR is essential. It removes language barriers, allowing companies to scale their support without needing to hire a vast number of human agents for every language combination.
Inclusivity: By accommodating natural speech patterns, AI voice agents become more inclusive, reaching a wider demographic and ensuring that technology serves all users, regardless of their linguistic habits.
Operational Efficiency: Accurate transcriptions from code-switched interactions lead to better downstream processing, reducing errors in ticket routing, data entry, and analytical insights.

The Road Ahead for Bilingual Voice AI

The journey to perfect code-switched ASR is ongoing. Continuous research and development are vital to refine these technologies, addressing the variability in accents, dialects, and the inherently fluid nature of human language. Future work will likely focus on:

Developing more diverse and natural code-switched datasets, moving beyond synthetic data to capture real-world conversational nuances.
Improving language-aware decoders that can dynamically switch dictionaries and phonetic sets mid-sentence.
Leveraging large language models (LLMs) more effectively for contextual understanding and generation of code-switched text, as seen in methods like the Simplified Equivalence Constraint Theory (SECT) prompting strategy.
Optimizing models for real-time performance and deployment on diverse hardware, including CPUs, as demonstrated by Gladia's modular ensemble approach.

Conclusion

The challenge of enabling voice agents to handle bilingual customers who code-switch is complex but crucial for the future of AI in global communication. The pioneering work by ServiceNow-AI and others in benchmarking "frontier ASR" systems provides invaluable insights into current capabilities and areas for improvement. While significant progress has been made with advanced deep learning models and end-to-end architectures, the unique linguistic complexities of code-switching continue to push the boundaries of ASR technology. As AI continues to evolve, the goal remains to build voice agents that can communicate as naturally and effortlessly as humans do, truly breaking down language barriers and making AI accessible to everyone.