Second language acquisition (SLA) research has extensively studied cross-linguistic transfer, the influence of linguistic structure of a speaker’s native language [L1] on the successful acquisition of a foreign language [L2]. Effects of such transfer can be positive (facilitating acquisition) or negative (impeding acquisition). We find that NLP literature has not given enough attention to the phenomenon of negative transfer. To understand patterns of both positive and negative transfer between L1 and L2, we model sequential second language acquisition in LMs. Further, we build Multilingual Age Ordered CHILDES (MAO-CHILDES) – a dataset consisting of 5 typologically diverse languages, i.e., German, French, Polish, Indonesian, and Japanese – to understand the degree to which native Child-Directed Speech (CDS) [L1] can help or conflict with English language acquisition [L2]. To examine the impact of native CDS, we use the TILT-based cross-lingual transfer learning approach established by Papadimitriou and Jurafsky (2020) and find that, as in human SLA, language family distance predicts more negative transfer. Additionally, we find that conversational speech data shows greater facilitation for language acquisition than scripted speech data. Our findings call for further research using our novel Transformer-based SLA models, and we would like to encourage it by releasing our code, data, and models.
TACL
AfriSpeech-200: Pan-African Accented Speech Dataset for Clinical and General Domain ASR
Olatunji, Tobi, Afonja, Tejumade, Yadavalli, Aditya, Emezue, Chris Chinenye, Singh, Sahib, Dossou, Bonaventure F. P., Osuchukwu, Joanne, Osei, Salomey, Tonja, Atnafu Lambebo, Etori, Naome, and Mbataku, Clinton
Transactions of the Association for Computational Linguistics Dec 2023
Africa has a very poor doctor-to-patient ratio. At very busy clinics, doctors could see 30+ patients per day (a heavy patient burden compared with developed countries), but productivity tools such as clinical automatic speech recognition (ASR) are lacking for these overworked clinicians. In contrast, clinical ASR is mature, even ubiquitous, in developed nations, and clinician-reported performance of commercial clinical ASR systems is generally satisfactory. Furthermore, the recent performance of general domain ASR is approaching human accuracy. However, several gaps exist. Several publications have highlighted racial bias in speech-to-text algorithms, and performance on minority accents lags significantly. To our knowledge, there is no publicly available research or benchmark on accented African clinical ASR, and speech data is non-existent for the majority of African accents. We release AfriSpeech, 200 hours of Pan-African English speech comprising 67,577 clips from 2,463 unique speakers across 120 indigenous accents from 13 countries for clinical and general domain ASR, together with a benchmark test set and publicly available pre-trained models with SOTA performance on the AfriSpeech benchmark.
ACL Findings
X-RiSAWOZ: High-Quality End-to-End Multilingual Dialogue Datasets and Few-shot Agents
Moradshahi, Mehrad, Shen, Tianhao, Bali, Kalika, Choudhury, Monojit, Chalendar, Gael, Goel, Anmol, Kim, Sungkyun, Kodali, Prashant, Kumaraguru, Ponnurangam, Semmar, Nasredine, Semnani, Sina, Seo, Jiwon, Seshadri, Vivek, Shrivastava, Manish, Sun, Michael, Yadavalli, Aditya, You, Chaobin, Xiong, Deyi, and Lam, Monica
Task-oriented dialogue research has mainly focused on a few popular languages like English and Chinese, due to the high dataset creation cost for a new language. To reduce the cost, we apply manual editing to automatically translated data. We create a new multilingual benchmark, X-RiSAWOZ, by translating the Chinese RiSAWOZ to 4 languages: English, French, Hindi, Korean; and a code-mixed English-Hindi language. X-RiSAWOZ has more than 18,000 human-verified dialogue utterances for each language, and unlike most multilingual prior work, is an end-to-end dataset for building fully-functioning agents. The many difficulties we encountered in creating X-RiSAWOZ led us to develop a toolset to accelerate the post-editing of a new language dataset after translation. This toolset improves machine translation with a hybrid entity alignment technique that combines neural with dictionary-based methods, along with many automated and semi-automated validation checks. We establish strong baselines for X-RiSAWOZ by training dialogue agents in the zero- and few-shot settings where limited gold data is available in the target language. Our results suggest that our translation and post-editing methodology and toolset can be used to create new high-quality multilingual dialogue agents cost-effectively. Our dataset, code, and toolkit are released open-source.
2022
SLT
How Do Phonological Properties Affect Bilingual Automatic Speech Recognition?
Jain, Shelly*, Yadavalli, Aditya*, Mirishkar, Ganesh, and Vuppala, Anil Kumar
In IEEE Spoken Language Technology Workshop Dec 2022
Multilingual Automatic Speech Recognition (ASR) for Indian languages is an obvious technique for leveraging their similarities. We present a detailed analysis of how phonological similarities and differences between languages affect Time Delay Neural Network (TDNN) and End-to-End (E2E) ASR. To study this, we select genealogically similar pairs from five Indian languages and train bilingual acoustic models. We compare these against corresponding monolingual acoustic models and find similar phoneme distributions within speech to be the primary factor for improving model performance, with phoneme overlap being secondary. The influence of phonological properties on performance is visible in both cases. Word Error Rate (WER) of E2E decreased by a median of 2.35%, and up to 8.5% when the phonological similarity was greatest. WER of TDNN increased by 11.69% when the similarity was lowest. Thus, it is clear that the choice of supplementary language is important for model performance.
Interspeech
Multi-Task End-to-End Model for Telugu Dialect and Speech Recognition
Yadavalli, Aditya, Mirishkar, Ganesh, and Vuppala, Anil Kumar
Conventional Automatic Speech Recognition (ASR) systems are susceptible to dialect variations within a language, thereby adversely affecting the ASR. Therefore, the current practice is to use dialect-specific ASRs. However, dialect-specific information or data is hard to obtain, making it difficult to build dialect-specific ASRs. Furthermore, it is cumbersome to maintain multiple dialect-specific ASR systems for each language. We build a unified multi-dialect End-to-End ASR that removes the need for a dialect recognition block and the need to maintain multiple dialect-specific ASRs for three Telugu regional dialects: Telangana, Coastal Andhra, and Rayalaseema. We find that pooling the data and training a multi-dialect ASR benefits the low-resource dialect the most – an improvement of over 9.71% in relative Word Error Rate (WER). Subsequently, we experiment with multi-task ASRs where the primary task is to transcribe the audio and the secondary task is to predict the dialect. We do this by adding a Dialect ID to the output targets. Such a model outperforms naive multi-dialect ASRs by up to 8.24% in relative WER. Additionally, we test this model on a dialect recognition task and find that it outperforms strong baselines by 6.14% in accuracy.
NAACL-SRW
Exploring the Effect of Dialect Mismatched Language Models in Telugu Automatic Speech Recognition
Yadavalli, Aditya, Mirishkar, Ganesh, and Vuppala, Anil Kumar
In North American Chapter of the Association for Computational Linguistics Student Research Workshop Dec 2022
Previous research has found that Acoustic Models (AM) of an Automatic Speech Recognition (ASR) system are susceptible to dialect variations within a language, thereby adversely affecting the ASR. To counter this, researchers have proposed to build a dialect-specific AM while keeping the Language Model (LM) constant for all the dialects. This study explores the effect of dialect mismatched LM by considering three different Telugu regional dialects: Telangana, Coastal Andhra, and Rayalaseema. We show that dialect variations that surface in the form of a different lexicon, grammar, and occasionally semantics can significantly degrade the performance of the LM under mismatched conditions. Therefore, this degradation has an adverse effect on the ASR even when dialect-specific AM is used. We show a degradation of up to 13.13 perplexity points when LM is used under mismatched conditions. Furthermore, we show a degradation of over 9% and over 15% in Character Error Rate (CER) and Word Error Rate (WER), respectively, in the ASR systems when using mismatched LMs over matched LMs.
IC3
Investigation of Subword-Based Bilingual Automatic Speech Recognition for Indian Languages
Yadavalli, Aditya, Jain, Shelly, Mirishkar, Ganesh, and Vuppala, Anil Kumar
In 2022 Thirteenth International Conference on Contemporary Computing (IC3-2022) Dec 2022
2021
ICON
IE-CPS Lexicon: An Automatic Speech Recognition Oriented Indian English Pronunciation Dictionary
Dementia is a syndrome, chronic or progressive, that usually affects the cognitive functioning of the subjects. Alzheimer’s, a neurodegenerative disorder, is the leading cause of dementia. One of the many symptoms of Alzheimer’s Dementia is the inability to speak and understand language clearly. The last decade has seen a surge in the research done in Alzheimer’s Dementia detection using linguistic and acoustic features. This paper takes up the Alzheimer’s Dementia classification task of the ADReSS INTERSPEECH-2020 challenge, “Alzheimer’s Dementia Recognition through Spontaneous Speech: The ADReSS Challenge”. It uses eight different acoustic features to find the attributes in the human speech production system (vocal tract and excitation source) affected by Alzheimer’s Dementia. In this study, the Alzheimer’s Dementia classification is performed using five different Machine Learning models on the ADReSS INTERSPEECH-2020 challenge dataset. Since most of the studies in the previous literature have used linguistic features successfully for Alzheimer’s Dementia classification, the current study also demonstrates the performance of the BERT model for the dementia classification task. The maximum accuracy obtained by the acoustic features is 64.5%, and the BERT model provides a classification accuracy of 79.1% over the test dataset. Finally, the score-level fusion of the acoustic model with the BERT model shows an improvement of 6.1% in classification accuracy over the BERT model, which indicates the complementary nature of acoustic features to linguistic features.