Voice-based educational tools increasingly rely on automatic speech recognition (ASR) to engage young learners. However, current ASR systems are less accurate on children's speech than on adults' speech, and accuracy drops further for children who speak a language other than English at home (i.e., children learning English as a second language). These biases, inherent in many ASR models, may prevent children from fully benefiting from such technologies.

This project investigates how well commercial ASR systems understand Spanish-English bilingual children and how the children's linguistic profiles influence recognition accuracy. Our team has already collected audio recordings from 250 Spanish-English bilingual children ages 3 to 7, primarily of Mexican origin, along with a comparison corpus of age-matched monolingual English-speaking children. The corpus includes word-level, sentence-level, and discourse-level utterances. It supports three lines of work: identifying the features that drive recognition disparities, indexing cultural models among bilingual versus monolingual speakers, and theory-guided fine-tuning of speech models.

Highlights

  • ASR accuracy improves with age and English proficiency, but varies widely across commercial models.

  • Children with more consistent speech (i.e., lower variability in voice onset time, VOT) are transcribed more accurately. English-like pronunciations boost accuracy in some systems, but not all.

  • Disparities remain: bilingual children show higher error rates than those reported for monolingual children in existing corpora.

Figure: Bar chart comparing word error rates (WER) for sentence repetition across Amazon, Google, IBM, and Whisper.
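Word error rate is the standard ASR accuracy metric: WER = (S + D + I) / N, where S, D, and I are the word substitutions, deletions, and insertions needed to turn the system's transcript into the reference transcript, and N is the number of reference words. A minimal sketch follows, assuming simple lowercase, whitespace-split normalization; the project's actual text normalization (punctuation, disfluencies, child-specific forms) is not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: normalization is only lowercasing and whitespace splitting.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = minimum edits turning the first i-1 reference words into hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution (or match when cost == 0)
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("the dog is running fast", "the dog running fast")` counts one deletion over five reference words, giving 0.2. Because insertions are counted, WER can exceed 1.0 when a system produces many extra words.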

Next Steps

We are building a benchmark dataset of bilingual children’s speech, including diverse Spanish dialects (e.g., Mexican, Caribbean, Andean) and validated assessments of language proficiency. This dataset will support three core research goals:

  • Linguistic profiling: Examine how phonological and syntactic features differ across Spanish dialects and bilingual experiences.

  • Model refinement: Use theory-guided fine-tuning to improve ASR accuracy, leveraging sociolinguistic insights and child-specific features.

  • Cultural analysis: Apply Critical Discourse Analysis (CDA) to examine how children’s interactions with AI reinforce or reshape their beliefs about language and identity.

Publications and Presentations

  • Thomas, T., Takahesu-Tabori, A., Stoehr, A., & Xu, Y. (2025). The impact of voice onset time variability on ASR performance in bilingual Spanish-English children. Oral presentation at ISB15, Donostia–San Sebastián.

  • Thomas, T., Takahesu-Tabori, A., Stoehr, A., & Xu, Y. (2024). The impact of bilingual language proficiency on ASR accuracy in children. ISBPAC, Swansea.

  • Thomas, T., Stoehr, A., & Xu, Y. (2024). Bilingual proficiency and ASR performance. Midwest Speech and Language Days, Ann Arbor.

  • Thomas, T., Stoehr, A., & Xu, Y. (2024). ASR in Spanish-English bilingual children. CDS Conference, Pasadena.
