publications
2024
- [COMPASS] Speaking in Terms of Money: Financial Knowledge Acquisition via Speech Data Generation. Bhat, Advait, Kulkarni, Nidhi, Husain, Safiya, Yadavalli, Aditya, Kaur, Jivat Neet, Shukla, Anurag, Shelar, Monali, and Seshadri, Vivek. ACM J. Comput. Sustain. Soc., 2024.
Earning a living often leaves low-income individuals with little time for learning new skills, perpetuating a cycle where the need for immediate income restricts access to learning. In this study, we investigate if digital work, specifically speech data generation, can facilitate domain-specific knowledge acquisition. For the purposes of this study, we focus on finance and banking. We conducted a two-week financial literacy program with low-income individuals (n = 55) in Wagholi, a semi-urban area in Pune, India. Participants read aloud and recorded a nine-lesson financial curriculum, earning ₹2,000 (≈$24) for ≈90 minutes of voice recording. By conducting pre- and post-tests, we found a significant increase in participants’ financial knowledge with a high effect size (Cohen’s d = 1.32) and medium normalized score gain (Hake’s g = 0.58). Fourteen follow-up interviews indicated the work was accessible and conveniently integrated into participants’ daily lives. Additionally, the program triggered attitude change among participants and community dialogue about critical financial concepts. Our results suggest that digital work can become an effective method for knowledge acquisition and should be tested at a larger scale.
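The effect-size figures reported above can be reproduced from pre- and post-test scores with a few lines of Python. The sketch below, using invented scores rather than the study's data, computes Cohen's d from the pooled standard deviation and Hake's normalized gain as the average gain relative to the maximum possible gain.

import statistics

def cohens_d(pre_scores, post_scores):
    """Effect size: difference of means divided by the pooled standard deviation."""
    mean_pre, mean_post = statistics.mean(pre_scores), statistics.mean(post_scores)
    sd_pre, sd_post = statistics.stdev(pre_scores), statistics.stdev(post_scores)
    pooled_sd = ((sd_pre ** 2 + sd_post ** 2) / 2) ** 0.5
    return (mean_post - mean_pre) / pooled_sd

def hakes_gain(pre_scores, post_scores, max_score):
    """Normalized gain: average gain divided by the maximum possible gain."""
    mean_pre, mean_post = statistics.mean(pre_scores), statistics.mean(post_scores)
    return (mean_post - mean_pre) / (max_score - mean_pre)

# Illustrative scores out of 20 for five participants (not the study's data).
pre = [8, 10, 7, 12, 9]
post = [15, 16, 13, 18, 14]
print(f"Cohen's d = {cohens_d(pre, post):.2f}")
print(f"Hake's g  = {hakes_gain(pre, post, max_score=20):.2f}")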
- [EMNLP] PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data. Watts, Ishaan, Gumma, Varun, Yadavalli, Aditya, Seshadri, Vivek, Swaminathan, Manohar, and Sitaram, Sunayana. In Proc. EMNLP, 2024.
Evaluation of multilingual Large Language Models (LLMs) is challenging due to a variety of factors – the lack of benchmarks with sufficient linguistic diversity, contamination of popular benchmarks into LLM pre-training data and the lack of local, cultural nuances in translated benchmarks. In this work, we study human and LLM-based evaluation in a multilingual, multi-cultural setting. We evaluate 30 models across 10 Indic languages by conducting 90K human evaluations and 30K LLM-based evaluations and find that models such as GPT-4o and Llama-3 70B consistently perform best for most Indic languages. We build leaderboards for two evaluation settings, pairwise comparison and direct assessment, and analyse the agreement between humans and LLMs. We find that humans and LLMs agree fairly well in the pairwise setting, but the agreement drops for direct assessment evaluation, especially for languages such as Bengali and Odia. We also check for various biases in human and LLM-based evaluation and find evidence of self-bias in the GPT-based evaluator. Our work presents a significant step towards scaling up multilingual evaluation of LLMs.
@article{Watts2024PARIKSHAAL,
  title = {PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data},
  author = {Watts, Ishaan and Gumma, Varun and Yadavalli, Aditya and Seshadri, Vivek and Swaminathan, Manohar and Sitaram, Sunayana},
  journal = {In Proc. EMNLP},
  year = {2024}
}
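As a rough illustration of the human-LLM agreement analysis in the PARIKSHA entry above, the sketch below computes simple percentage agreement between human and LLM judges in the pairwise-comparison setting. The judgment data is invented, and the paper may use additional or different agreement statistics.

def pairwise_agreement(human_prefs, llm_prefs):
    """Fraction of pairwise comparisons where the LLM judge picks the same winner as the human."""
    assert len(human_prefs) == len(llm_prefs)
    matches = sum(h == m for h, m in zip(human_prefs, llm_prefs))
    return matches / len(human_prefs)

# Each judgment is 'A', 'B', or 'tie' for one pair of model responses.
human = ["A", "B", "A", "tie", "B", "A"]
llm = ["A", "B", "B", "tie", "B", "A"]
print(f"Pairwise agreement: {pairwise_agreement(human, llm):.2%}")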
- [FAccT] Akal Badi ya Bias: An Exploratory Study of Gender Bias in Hindi Language Technology. Hada, Rishav, Husain, Safiya, Gumma, Varun, Diddee, Harshita, Yadavalli, Aditya, Seth, Agrima, Kulkarni, Nidhi, Gadiraju, Ujwal, Vashistha, Aditya, Seshadri, Vivek, and Bali, Kalika. In Proc. FAccT, 2024.
Existing research in measuring and mitigating gender bias predominantly centers on English, overlooking the intricate challenges posed by non-English languages and the Global South. This paper presents the first comprehensive study delving into the nuanced landscape of gender bias in Hindi, the third most spoken language globally. Our study employs diverse mining techniques, computational models, and field studies, and sheds light on the limitations of current methodologies. Given the challenges faced with mining gender biased statements in Hindi using existing methods, we conducted field studies to bootstrap the collection of such sentences. Through field studies involving rural and low-income community women, we uncover diverse perceptions of gender bias, underscoring the necessity for context-specific approaches. This paper advocates for a community-centric research design, amplifying voices often marginalized in previous studies. Our findings not only contribute to the understanding of gender bias in Hindi but also establish a foundation for further exploration of Indic languages. By exploring the intricacies of this understudied context, we call for thoughtful engagement with gender bias, promoting inclusivity and equity in linguistic and cultural contexts beyond the Global North.
@article{10.1145/3630106.3659017,
  title = {Akal Badi ya Bias: An Exploratory Study of Gender Bias in Hindi Language Technology},
  author = {Hada, Rishav and Husain, Safiya and Gumma, Varun and Diddee, Harshita and Yadavalli, Aditya and Seth, Agrima and Kulkarni, Nidhi and Gadiraju, Ujwal and Vashistha, Aditya and Seshadri, Vivek and Bali, Kalika},
  journal = {In Proc. FAccT},
  year = {2024},
  pages = {1926--1939},
  numpages = {14},
  keywords = {Community centric, Gender bias, Global South, Hindi, India, Indic languages},
  location = {Rio de Janeiro, Brazil}
}
- [COMPUTEL] MunTTS: A Text-to-Speech System for Mundari. Gumma, Varun, Hada, Rishav, Yadavalli, Aditya, Gogoi, Pamir, Mondal, Ishani, Seshadri, Vivek, and Bali, Kalika. In Proc. COMPUTEL, 2024.
We present MunTTS, an end-to-end text-to-speech (TTS) system specifically for Mundari, a low-resource Indian language of the Austro-Asiatic family. Our work addresses the gap in linguistic technology for underrepresented languages by collecting and processing data to build a speech synthesis system. We begin our study by gathering a substantial dataset of Mundari text and speech and train end-to-end speech models. We also delve into the methods used for training our models, ensuring they are efficient and effective despite the data constraints. We evaluate our system with native speakers and objective metrics, demonstrating its potential as a tool for preserving and promoting the Mundari language in the digital age.
@article{Gumma2024MunTTSAT,
  title = {MunTTS: A Text-to-Speech System for Mundari},
  author = {Gumma, Varun and Hada, Rishav and Yadavalli, Aditya and Gogoi, Pamir and Mondal, Ishani and Seshadri, Vivek and Bali, Kalika},
  journal = {In Proc. COMPUTEL},
  year = {2024}
}
- [EACL Findings] AccentFold: A Journey through African Accents for Zero-Shot ASR Adaptation to Target Accents. Owodunni, Abraham*, Yadavalli, Aditya*, Emezue, Chris*, Olatunji, Tobi*, and Mbataku, Clinton. In Proc. EACL Findings, 2024.
Despite advancements in speech recognition, accented speech remains challenging. While previous approaches have focused on modeling techniques or creating accented speech datasets, gathering sufficient data for the multitude of accents, particularly in the African context, remains impractical due to their sheer diversity and associated budget constraints. To address these challenges, we propose AccentFold, a method that exploits spatial relationships between learned accent embeddings to improve downstream Automatic Speech Recognition (ASR). Our exploratory analysis of speech embeddings representing 100+ African accents reveals interesting spatial accent relationships highlighting geographic and genealogical similarities, capturing consistent phonological and morphological regularities, all learned empirically from speech. Furthermore, we discover accent relationships previously uncharacterized by the Ethnologue. Through empirical evaluation, we demonstrate the effectiveness of AccentFold by showing that, for out-of-distribution (OOD) accents, sampling accent subsets for training based on AccentFold information outperforms strong baselines with a relative WER improvement of 4.6%. AccentFold presents a promising approach for improving ASR performance on accented speech, particularly in the context of African accents, where data scarcity and budget constraints pose significant challenges. Our findings emphasize the potential of leveraging linguistic relationships to improve zero-shot ASR adaptation to target accents.
@article{Owodunni2024AccentFoldAJ,
  title = {AccentFold: A Journey through African Accents for Zero-Shot ASR Adaptation to Target Accents},
  author = {Owodunni, Abraham* and Yadavalli, Aditya* and Emezue, Chris* and Olatunji, Tobi* and Mbataku, Clinton},
  journal = {In Proc. EACL Findings},
  year = {2024}
}
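To make the core idea of AccentFold concrete, the sketch below selects the training accents whose learned embeddings lie closest to a target accent in embedding space. The embeddings here are random stand-ins and the accent names are purely illustrative; the paper derives its embeddings from speech models trained on African-accented data.

import numpy as np

rng = np.random.default_rng(0)
# Random stand-ins for learned accent embeddings (64-dimensional).
accent_embeddings = {name: rng.normal(size=64)
                     for name in ["yoruba", "igbo", "hausa", "swahili", "zulu", "twi"]}

def closest_accents(target, embeddings, k=3):
    """Return the k accents whose embeddings are nearest to the target accent."""
    t = embeddings[target]
    dists = {a: np.linalg.norm(e - t) for a, e in embeddings.items() if a != target}
    return sorted(dists, key=dists.get)[:k]

# Choose a training subset for an out-of-distribution target accent.
print(closest_accents("yoruba", accent_embeddings, k=3))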
2023
- [ACL] SLABERT Talk Pretty One Day: Modeling Second Language Acquisition with BERT. Yadavalli, Aditya*, Yadavalli, Alekhya*, and Tobin, Vera. In Proc. ACL, 2023.
Second language acquisition (SLA) research has extensively studied cross-linguistic transfer, the influence of linguistic structure of a speaker’s native language [L1] on the successful acquisition of a foreign language [L2]. Effects of such transfer can be positive (facilitating acquisition) or negative (impeding acquisition). We find that NLP literature has not given enough attention to the phenomenon of negative transfer. To understand patterns of both positive and negative transfer between L1 and L2, we model sequential second language acquisition in LMs. Further, we build a Multilingual Age Ordered CHILDES (MAO-CHILDES) – a dataset consisting of 5 typologically diverse languages, i.e., German, French, Polish, Indonesian, and Japanese – to understand the degree to which native Child-Directed Speech (CDS) [L1] can help or conflict with English language acquisition [L2]. To examine the impact of native CDS, we use the TILT-based cross-lingual transfer learning approach established by Papadimitriou and Jurafsky (2020) and find that, as in human SLA, language family distance predicts more negative transfer. Additionally, we find that conversational speech data shows greater facilitation for language acquisition than scripted speech data. Our findings call for further research using our novel Transformer-based SLA models, and we would like to encourage it by releasing our code, data, and models.
@article{Yadavalli2023SLABERTTP,
  title = {SLABERT Talk Pretty One Day: Modeling Second Language Acquisition with BERT},
  author = {Yadavalli, Aditya* and Yadavalli, Alekhya* and Tobin, Vera},
  journal = {In Proc. ACL},
  year = {2023},
  volume = {abs/2305.19589}
}
- [TACL] AfriSpeech-200: Pan-African Accented Speech Dataset for Clinical and General Domain ASR. Olatunji, Tobi, Afonja, Tejumade, Yadavalli, Aditya, Emezue, Chris Chinenye, Singh, Sahib, Dossou, Bonaventure F. P., Osuchukwu, Joanne, Osei, Salomey, Tonja, Atnafu Lambebo, Etori, Naome, and Mbataku, Clinton. Transactions of the Association for Computational Linguistics, 2023.
Africa has a very poor doctor-to-patient ratio. At very busy clinics, doctors could see 30+ patients per day, a heavy patient burden compared with developed countries, but productivity tools such as clinical automatic speech recognition (ASR) are lacking for these overworked clinicians. However, clinical ASR is mature, even ubiquitous, in developed nations, and clinician-reported performance of commercial clinical ASR systems is generally satisfactory. Furthermore, the recent performance of general domain ASR is approaching human accuracy. However, several gaps exist. Several publications have highlighted racial bias with speech-to-text algorithms, and performance on minority accents lags significantly. To our knowledge, there is no publicly available research or benchmark on accented African clinical ASR, and speech data is non-existent for the majority of African accents. We release AfriSpeech, 200 hours of Pan-African English speech (67,577 clips from 2,463 unique speakers across 120 indigenous accents from 13 countries) for clinical and general domain ASR, together with a benchmark test set and publicly available pre-trained models with SOTA performance on the AfriSpeech benchmark.
@article{Olatunji2023,
  title = {AfriSpeech-200: Pan-African Accented Speech Dataset for Clinical and General Domain ASR},
  author = {Olatunji, Tobi and Afonja, Tejumade and Yadavalli, Aditya and Emezue, Chris Chinenye and Singh, Sahib and Dossou, Bonaventure F. P. and Osuchukwu, Joanne and Osei, Salomey and Tonja, Atnafu Lambebo and Etori, Naome and Mbataku, Clinton},
  journal = {Transactions of the Association for Computational Linguistics},
  year = {2023},
  volume = {11},
  pages = {1669--1685}
}
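Benchmark evaluation of the kind described in the AfriSpeech entry above boils down to scoring hypotheses against reference transcripts with word error rate. A minimal sketch using the jiwer package is shown below; the transcripts are invented, and a real evaluation would iterate over the released AfriSpeech test split.

from jiwer import wer  # pip install jiwer

references = [
    "the patient reports persistent headaches",
    "prescribe amoxicillin five hundred milligrams",
]
hypotheses = [
    "the patient reports persistent headache",
    "prescribed amoxicillin five hundred milligrams",
]

# jiwer aggregates errors over all pairs before dividing by the total reference words.
print(f"WER: {wer(references, hypotheses):.2%}")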
- [ACL Findings] X-RiSAWOZ: High-Quality End-to-End Multilingual Dialogue Datasets and Few-shot Agents. Moradshahi, Mehrad, Shen, Tianhao, Bali, Kalika, Choudhury, Monojit, de Chalendar, Gael, Goel, Anmol, Kim, Sungkyun, Kodali, Prashant, Kumaraguru, Ponnurangam, Semmar, Nasredine, Semnani, Sina, Seo, Jiwon, Seshadri, Vivek, Shrivastava, Manish, Sun, Michael, Yadavalli, Aditya, You, Chaobin, Xiong, Deyi, and Lam, Monica. In Proc. ACL Findings, 2023.
Task-oriented dialogue research has mainly focused on a few popular languages like English and Chinese, due to the high dataset creation cost for a new language. To reduce the cost, we apply manual editing to automatically translated data. We create a new multilingual benchmark, X-RiSAWOZ, by translating the Chinese RiSAWOZ to 4 languages: English, French, Hindi, Korean; and a code-mixed English-Hindi language. X-RiSAWOZ has more than 18,000 human-verified dialogue utterances for each language, and unlike most multilingual prior work, is an end-to-end dataset for building fully-functioning agents. The many difficulties we encountered in creating X-RiSAWOZ led us to develop a toolset to accelerate the post-editing of a new language dataset after translation. This toolset improves machine translation with a hybrid entity alignment technique that combines neural with dictionary-based methods, along with many automated and semi-automated validation checks. We establish strong baselines for X-RiSAWOZ by training dialogue agents in the zero- and few-shot settings where limited gold data is available in the target language. Our results suggest that our translation and post-editing methodology and toolset can be used to create new high-quality multilingual dialogue agents cost-effectively. Our dataset, code, and toolkit are released open-source.
@article{Moradshahi2023X-RiSAWOZ,
  title = {X-RiSAWOZ: High-Quality End-to-End Multilingual Dialogue Datasets and Few-shot Agents},
  author = {Moradshahi, Mehrad and Shen, Tianhao and Bali, Kalika and Choudhury, Monojit and de Chalendar, Gael and Goel, Anmol and Kim, Sungkyun and Kodali, Prashant and Kumaraguru, Ponnurangam and Semmar, Nasredine and Semnani, Sina and Seo, Jiwon and Seshadri, Vivek and Shrivastava, Manish and Sun, Michael and Yadavalli, Aditya and You, Chaobin and Xiong, Deyi and Lam, Monica},
  journal = {In Proc. ACL Findings},
  year = {2023}
}
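The X-RiSAWOZ toolset described above combines neural and dictionary-based entity alignment during post-editing. The sketch below shows only a dictionary-based half, with a simple string-similarity fallback standing in for the neural matcher; the dictionary entries, candidate spans, and overall logic are invented for illustration.

import difflib

# Hypothetical bilingual entity dictionary (source entity -> expected translation).
entity_dict = {"和平饭店": "Peace Hotel", "豫园": "Yu Garden"}

def align_entity(source_entity, candidate_spans):
    """Map a source-language entity to its span in the translated utterance."""
    target = entity_dict.get(source_entity)
    if target is None:
        return None  # a neural matcher would handle out-of-dictionary entities
    if target in candidate_spans:
        return target
    # Fuzzy fallback catches surface variants such as casing or minor rewording.
    matches = difflib.get_close_matches(target, candidate_spans, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(align_entity("和平饭店", ["Peace hotel", "downtown", "tomorrow"]))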
2022
- [SLT] How Do Phonological Properties Affect Bilingual Automatic Speech Recognition? Jain, Shelly*, Yadavalli, Aditya*, Mirishkar, Ganesh, and Vuppala, Anil Kumar. In IEEE Spoken Language Technology Workshop, 2022.
Multilingual Automatic Speech Recognition (ASR) for Indian languages is an obvious technique for leveraging their similarities. We present a detailed analysis of how phonological similarities and differences between languages affect Time Delay Neural Network (TDNN) and End-to-End (E2E) ASR. To study this, we select genealogically similar pairs from five Indian languages and train bilingual acoustic models. We compare these against corresponding monolingual acoustic models and find similar phoneme distributions within speech to be the primary factor for improving model performance, with phoneme overlap being secondary. The influence of phonological properties on performance is visible in both cases. Word Error Rate (WER) of E2E decreased by a median of 2.35%, and up to 8.5% when the phonological similarity was greatest. WER of TDNN increased by 11.69% when the similarity was lowest. Thus, it is clear that the choice of supplementary language is important for model performance.
@inproceedings{Shelly2022SLT,
  author = {Jain, Shelly* and Yadavalli, Aditya* and Mirishkar, Ganesh and Vuppala, Anil Kumar},
  booktitle = {IEEE Spoken Language Technology Workshop},
  title = {How Do Phonological Properties Affect Bilingual Automatic Speech Recognition?},
  year = {2022}
}
- [Interspeech] Multi-Task End-to-End Model for Telugu Dialect and Speech Recognition. Yadavalli, Aditya, Mirishkar, Ganesh, and Vuppala, Anil Kumar. In Proc. Interspeech, 2022.
Conventional Automatic Speech Recognition (ASR) systems are susceptible to dialect variations within a language, thereby adversely affecting the ASR. Therefore, the current practice is to use dialect-specific ASRs. However, dialect-specific information or data is hard to obtain making it difficult to build dialect-specific ASRs. Furthermore, it is cumbersome to maintain multiple dialect-specific ASR systems for each language. We build a unified multi-dialect End-to-End ASR that removes the need for a dialect recognition block and the need to maintain multiple dialect-specific ASRs for three Telugu regional dialects: Telangana, Coastal Andhra, and Rayalaseema. We find that pooling the data and training a multi-dialect ASR benefits the low-resource dialect the most – an improvement of over 9.71% in relative Word Error Rate (WER). Subsequently, we experiment with multi-task ASRs where the primary task is to transcribe the audio and the secondary task is to predict the dialect. We do this by adding a Dialect ID to the output targets. Such a model outperforms naive multi-dialect ASRs by up to 8.24% in relative WER. Additionally, we test this model on a dialect recognition task and find that it outperforms strong baselines by 6.14% in accuracy.
@inproceedings{Aditya2022Interspeech,
  author = {Yadavalli, Aditya and Mirishkar, Ganesh and Vuppala, Anil Kumar},
  booktitle = {Proc. Interspeech},
  title = {Multi-Task End-to-End Model for Telugu Dialect and Speech Recognition},
  pages = {1387--1391},
  doi = {10.21437/Interspeech.2022-10739},
  year = {2022}
}
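A small sketch of the multi-task target construction described in the entry above: a dialect ID is added to the transcription targets so that a single end-to-end model learns both to transcribe and to predict the dialect. The tag names are invented, the tag is prepended here for simplicity, and the example transcript is illustrative; the paper's exact target format may differ.

# Hypothetical dialect tags added to the output targets of the E2E ASR model.
DIALECT_TOKENS = {"telangana": "<tg>", "coastal_andhra": "<ca>", "rayalaseema": "<ra>"}

def make_target(transcript, dialect):
    """Attach the dialect ID to the transcription target."""
    return f"{DIALECT_TOKENS[dialect]} {transcript}"

print(make_target("ఈ రోజు వాతావరణం బాగుంది", "telangana"))
# -> "<tg> ఈ రోజు వాతావరణం బాగుంది"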
- [NAACL-SRW] Exploring the Effect of Dialect Mismatched Language Models in Telugu Automatic Speech Recognition. Yadavalli, Aditya, Mirishkar, Ganesh, and Vuppala, Anil Kumar. In North American Chapter of the Association for Computational Linguistics Student Research Workshop, 2022.
Previous research has found that Acoustic Models (AM) of an Automatic Speech Recognition (ASR) system are susceptible to dialect variations within a language, thereby adversely affecting the ASR. To counter this, researchers have proposed to build a dialect-specific AM while keeping the Language Model (LM) constant for all the dialects. This study explores the effect of dialect mismatched LM by considering three different Telugu regional dialects: Telangana, Coastal Andhra, and Rayalaseema. We show that dialect variations that surface in the form of a different lexicon, grammar, and occasionally semantics can significantly degrade the performance of the LM under mismatched conditions. Therefore, this degradation has an adverse effect on the ASR even when dialect-specific AM is used. We show a degradation of up to 13.13 perplexity points when LM is used under mismatched conditions. Furthermore, we show a degradation of over 9% and over 15% in Character Error Rate (CER) and Word Error Rate (WER), respectively, in the ASR systems when using mismatched LMs over matched LMs.
@inproceedings{Aditya2022NAACLSRW,
  author = {Yadavalli, Aditya and Mirishkar, Ganesh and Vuppala, Anil Kumar},
  booktitle = {North American Chapter of the Association for Computational Linguistics Student Research Workshop},
  title = {Exploring the Effect of Dialect Mismatched Language Models in Telugu Automatic Speech Recognition},
  publisher = {Association for Computational Linguistics (ACL)},
  year = {2022}
}
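The perplexity comparison in the entry above can be illustrated with a toy language model: estimate it on text from one dialect and evaluate it on matched versus mismatched text. The sketch below uses a unigram model with add-one smoothing and made-up sentences in place of the large dialect corpora and stronger LMs used in the paper.

import math
from collections import Counter

def train_unigram(corpus, alpha=1.0):
    """Return an add-one-smoothed unigram probability function estimated from a list of sentences."""
    tokens = " ".join(corpus).split()
    counts, total, vocab = Counter(tokens), len(tokens), len(set(tokens))
    return lambda w: (counts[w] + alpha) / (total + alpha * (vocab + 1))

def perplexity(prob, corpus):
    """Per-token perplexity of the model on a list of sentences."""
    tokens = " ".join(corpus).split()
    return math.exp(-sum(math.log(prob(w)) for w in tokens) / len(tokens))

lm = train_unigram(["i am going to the market", "the market opens early"])
print(perplexity(lm, ["the market opens early"]))       # matched text: lower perplexity
print(perplexity(lm, ["yonder bazaar shuts at noon"]))  # mismatched text: higher perplexity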
- [IC3] Investigation of Subword-Based Bilingual Automatic Speech Recognition for Indian Languages. Yadavalli, Aditya, Jain, Shelly, Mirishkar, Ganesh, and Vuppala, Anil Kumar. In 2022 Thirteenth International Conference on Contemporary Computing (IC3-2022), 2022.
2021
- [ICON] IE-CPS Lexicon: An Automatic Speech Recognition Oriented Indian English Pronunciation Dictionary. Jain, Shelly*, Yadavalli, Aditya*, Mirishkar, Ganesh*, Yarra, Chiranjeevi, and Vuppala, Anil Kumar. In Proceedings of the 18th International Conference on Natural Language Processing (ICON), 2021.
@inproceedings{Shelly2021ICON,
  title = {{IE-CPS Lexicon}: An Automatic Speech Recognition Oriented Indian English Pronunciation Dictionary},
  booktitle = {Proceedings of the 18th International Conference on Natural Language Processing (ICON)},
  author = {Jain, Shelly* and Yadavalli, Aditya* and Mirishkar, Ganesh* and Yarra, Chiranjeevi and Vuppala, Anil Kumar},
  year = {2021},
  publisher = {NLP Association of India (NLPAI)},
  eventtitle = {Proceedings of the 18th International Conference on Natural Language Processing (ICON)}
}
- [ICON] An Investigation of Hybrid Architectures for Low Resource Multilingual Speech Recognition System in Indian Context. Mirishkar, Ganesh, Yadavalli, Aditya, and Vuppala, Anil Kumar. In Proceedings of the 18th International Conference on Natural Language Processing (ICON), 2021.
@inproceedings{MirishkarHybrid,
  title = {An Investigation of Hybrid Architectures for Low Resource Multilingual Speech Recognition System in Indian Context},
  booktitle = {Proceedings of the 18th International Conference on Natural Language Processing (ICON)},
  author = {Mirishkar, Ganesh and Yadavalli, Aditya and Vuppala, Anil Kumar},
  year = {2021},
  publisher = {NLP Association of India (NLPAI)},
  eventtitle = {Proceedings of the 18th International Conference on Natural Language Processing (ICON)}
}
- [IC3] Acoustic Features, BERT Model and Their Complementary Nature for Alzheimer’s Dementia Detection. Vats, Nayan Anand, Yadavalli, Aditya, Gurugubelli, Krishna, and Vuppala, Anil Kumar. In 2021 Thirteenth International Conference on Contemporary Computing (IC3-2021), 2021.
Dementia is a chronic or progressive syndrome that usually affects the cognitive functioning of subjects. Alzheimer’s, a neurodegenerative disorder, is the leading cause of dementia. One of the many symptoms of Alzheimer’s Dementia is the inability to speak and understand language clearly. The last decade has seen a surge in the research done on Alzheimer’s Dementia detection using linguistic and acoustic features. This paper takes up the Alzheimer’s Dementia classification task of the ADReSS INTERSPEECH-2020 challenge, "Alzheimer’s Dementia Recognition through Spontaneous Speech: The ADReSS Challenge". It uses eight different acoustic features to find the attributes of the human speech production system (vocal tract and excitation source) affected by Alzheimer’s Dementia. In this study, Alzheimer’s Dementia classification is performed using five different machine learning models on the ADReSS INTERSPEECH-2020 challenge dataset. Since most studies in the previous literature have used linguistic features successfully for Alzheimer’s Dementia classification, the current study also demonstrates the performance of the BERT model for the dementia classification task. The maximum accuracy obtained by the acoustic features is 64.5%, and the BERT model provides a classification accuracy of 79.1% over the test dataset. Finally, the score-level fusion of the acoustic model with the BERT model shows an improvement of 6.1% in classification accuracy over the BERT model alone, which indicates the complementary nature of acoustic features to linguistic features.
@inproceedings{10.1145/3474124.3474162,
  author = {Vats, Nayan Anand and Yadavalli, Aditya and Gurugubelli, Krishna and Vuppala, Anil Kumar},
  title = {ACOUSTIC FEATURES, BERT Model AND THEIR COMPLEMENTARY NATURE FOR ALZHEIMER’S DEMENTIA DETECTION},
  year = {2021},
  isbn = {9781450389204},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3474124.3474162},
  doi = {10.1145/3474124.3474162},
  booktitle = {2021 Thirteenth International Conference on Contemporary Computing (IC3-2021)},
  pages = {267--272},
  numpages = {6},
  keywords = {Dementia, Alzheimer’s, BERT, Acoustic features},
  location = {Noida, India},
  series = {IC3 '21}
}
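Score-level fusion, as used in the final experiment of the entry above, is simply a weighted combination of the two classifiers' posterior scores before thresholding. The sketch below uses invented probabilities and an arbitrary weight; the paper's actual fusion weights are not reproduced here.

def fuse_scores(p_acoustic, p_bert, weight=0.3):
    """Weighted sum of the acoustic-model and BERT-model probabilities for the dementia class."""
    return weight * p_acoustic + (1.0 - weight) * p_bert

# Per-subject probabilities of the dementia class from each model (illustrative).
acoustic_probs = [0.62, 0.41, 0.55]
bert_probs = [0.80, 0.30, 0.48]

fused = [fuse_scores(a, b) for a, b in zip(acoustic_probs, bert_probs)]
predictions = [int(p >= 0.5) for p in fused]
print(fused, predictions)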