Surveys

Survey talk 1, Tue 31 Aug 11:30 CEST Room A+B

Abstract

In the last decade, speech technologies for typical speech have matured, enabling a multitude of services and technologies, including voice-enabled conversational interfaces and dictation, and successfully underpinning the use of state-of-the-art NLP techniques. This ever more pervasive offering allows for an often far more convenient and natural way of interacting with machines and systems. However, it also represents an ever-growing gap experienced by people with atypical (dysarthric) voices: people with even just mild-to-moderate speech disorders cannot achieve satisfactory performance with current automatic speech recognition (ASR) systems, and hence they are falling further and further behind in their ability to use modern devices and interfaces. This talk will present the major challenges in porting mainstream ASR methodologies to work for atypical speech, discuss recent advances, and offer thoughts on where research effort should be focused to have real impact for this community of potential users. Being able to speak a query or dictate an email offers great convenience to most of us, but for this group of people it can have significant implications for their ability to take a full part in society and for their quality of life.

Bio

Dr Heidi Christensen is a Senior Lecturer in Computer Science at the University of Sheffield, United Kingdom. Her research interests are in the application of AI-based voice technologies to healthcare and focus on two main areas: i) the automatic recognition of atypical speech, and ii) the detection and monitoring of people’s physical and mental health, including verbal and non-verbal expressions of emotion, anxiety, depression and neurodegenerative conditions in, e.g., therapeutic or diagnostic settings.

Survey talk 2, Wed 1 Sep 13:00 CEST Room A+B

Abstract

The investigation of acoustic biomarkers of respiratory diseases has gained societal and public health importance following the onset of the COVID-19 pandemic. Pre-pandemic efforts focused on developing smartphone-friendly diagnostic tools for the detection of chronic pulmonary diseases, tuberculosis, and asthmatic conditions using cough sounds. During the past two years, several research efforts of varying scale have been undertaken by the speech and signal processing community for analyzing the acoustic symptoms of COVID-19. The motivation for developing acoustic tools for COVID-19 diagnostics arises from the key limitations of cost, time, and safety of the current gold standard in COVID-19 testing, namely reverse transcription polymerase chain reaction (RT-PCR) testing.

In this talk, I will survey the major efforts undertaken by groups across the world in i) developing data resources of acoustic signals for COVID-19 diagnostics, and ii) designing models and learning algorithms for tool development. The landscape of data resources ranges from controlled hospital recordings to crowdsourced smartphone-based data. While the primary signal modality recorded is cough data, the impact of COVID-19 on other modalities such as breathing, speech, and symptom data is also studied. In the talk, I will also discuss the considerations in designing data representations and machine learning models for COVID-19 detection from acoustic data. Pointers to open-source data resources and tools will be highlighted, with the aim of encouraging budding researchers to pursue this important direction.
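For readers new to this area, the sketch below illustrates the kind of baseline pipeline the abstract alludes to: summarising a recording with standard acoustic features and fitting a simple classifier. It is a minimal, self-contained illustration using synthetic waveforms and randomly assigned labels, not the data resources, features, or models surveyed in the talk.

```python
# Minimal illustrative baseline: summarise each recording with MFCC statistics
# and fit a linear classifier. Waveforms and labels here are synthetic stand-ins,
# not any of the COVID-19 acoustic datasets discussed in the talk.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

SR = 16000  # assumed sampling rate

def mfcc_stats(waveform, sr=SR, n_mfcc=20):
    """Represent a clip by the mean and standard deviation of its MFCCs."""
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Synthetic 2-second "recordings" with random binary labels (1 = positive).
rng = np.random.default_rng(0)
clips = [rng.standard_normal(2 * SR) for _ in range(40)]
labels = rng.integers(0, 2, size=40)

X = np.stack([mfcc_stats(c) for c in clips])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print("Training AUC:", roc_auc_score(labels, clf.predict_proba(X)[:, 1]))
```

Real pipelines in this literature replace the synthetic clips with curated cough, breathing, or speech recordings and report cross-validated performance rather than training-set AUC.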

The talk will conclude with remarks on the progress made by our group's Coswara project, where combining information from several modalities shows the potential to surpass the regulatory requirements for a rapid acoustic-based point-of-care testing (POCT) tool.

Bio

Sriram Ganapathy is a faculty member in the Department of Electrical Engineering, Indian Institute of Science, Bangalore, where he heads the Learning and Extraction of Acoustic Patterns (LEAP) lab. Prior to joining the Indian Institute of Science, he was a research staff member at the IBM Watson Research Center, Yorktown Heights, USA. He received his Doctor of Philosophy from the Center for Language and Speech Processing, Johns Hopkins University. He obtained his Bachelor of Technology from the College of Engineering, Trivandrum, India, and his Master of Engineering from the Indian Institute of Science, Bangalore. He has also worked as a Research Assistant at the Idiap Research Institute, Switzerland.

At the LEAP lab, his research interests include signal processing, digital health, machine learning methodologies for speech analytics, and auditory neuroscience. He is a subject editor for the journal Speech Communication, a member of ISCA, and a senior member of the IEEE. He is the recipient of young scientist awards from the Department of Science and Technology (DST), India, the Department of Atomic Energy (DAE), India, and the Pratiksha Trust, Indian Institute of Science, Bangalore. Over the past 10 years, he has published more than 100 peer-reviewed journal and conference papers in the areas of deep learning and speech/audio processing.

Survey talk 3, Thu 2 Sep 13:00 CEST Room A+B

Abstract

Speech is usually recorded as an acoustic signal, but it often appears in context with other signals. In addition to the acoustic signal, we may have available a corresponding visual scene, the video of the speaker, physiological signals such as the speaker's movements or neural recordings, or other related signals. It is often possible to learn a better speech model or representation by considering the context provided by these additional signals, or to learn with less training data. Typical approaches to training from multi-modal data are based on the idea that models or representations of each modality should be in some sense predictive of the other modalities. Multi-modal approaches can also take advantage of the fact that the sources of noise or nuisance variables are different in different measurement modalities, so an additional (non-acoustic) modality can help learn a speech representation that suppresses such noise. This talk will survey several lines of work in this area, both older and newer. It will cover some basic techniques from machine learning and statistics, as well as specific models and applications for speech.
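As a concrete (and deliberately simplified) illustration of the "each modality should predict the other" idea mentioned above, the sketch below pairs an acoustic encoder with a visual encoder and trains them with a symmetric contrastive loss, so that matched audio-video pairs score higher than mismatched ones. The encoder architectures, feature dimensions, and loss choice are illustrative assumptions, not the specific models covered in the talk.

```python
# A simplified two-view model: matched (audio, video) pairs should be mutually
# predictive, enforced here with a symmetric contrastive (InfoNCE-style) loss.
# Dimensions, architectures, and the loss are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoViewModel(nn.Module):
    def __init__(self, audio_dim=40, video_dim=512, embed_dim=128):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU(),
                                       nn.Linear(256, embed_dim))
        self.video_enc = nn.Sequential(nn.Linear(video_dim, 256), nn.ReLU(),
                                       nn.Linear(256, embed_dim))

    def forward(self, audio_feats, video_feats):
        a = F.normalize(self.audio_enc(audio_feats), dim=-1)
        v = F.normalize(self.video_enc(video_feats), dim=-1)
        return a, v

def symmetric_contrastive_loss(a, v, temperature=0.07):
    # Each row of `a` should match the same row of `v` better than any other row.
    logits = a @ v.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy batch of paired audio and video feature vectors.
model = TwoViewModel()
audio_feats = torch.randn(8, 40)    # e.g., frame-averaged acoustic features
video_feats = torch.randn(8, 512)   # e.g., pooled visual features
a, v = model(audio_feats, video_feats)
loss = symmetric_contrastive_loss(a, v)
loss.backward()  # gradients flow into both encoders
print(loss.item())
```

Classical multi-view methods such as CCA pursue the same goal by maximising correlation between the two learned views; the contrastive formulation here is just one widely used modern variant.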

Bio

Karen Livescu is an Associate Professor at TTI-Chicago. She completed her PhD in electrical engineering and computer science at MIT. Her main research interests are in speech and language processing, as well as related problems in machine learning. Some specific interests include multi-view representation learning, visually grounded speech models, acoustic word embeddings, new models for speech recognition and understanding, unsupervised and weakly supervised models for speech and text, and sign language recognition from video. Her professional activities include serving as a program chair of ICLR 2019, ASRU 2015/2017/2019, and Interspeech 2022, and on the editorial boards of IEEE OJ-SP and IEEE TPAMI. She is an ISCA fellow and an IEEE SPS Distinguished Lecturer.

Survey talk 4, Fri 3 Sep 13:00 CEST Room A+B

Abstract

In recent years, the ease with which we can collect audio (and, to a lesser extent, visual information) with wearables has improved dramatically. These devices allow unprecedented access to the speech that children produce and the speech that they hear. Although many conclusions drawn from short observations seem to generalize to these naturalistic datasets, others appear questionable in light of human annotations of data collected with wearables. Making the most of such recordings also requires dedicated tool development.

Bio

Alejandrina Cristia is a senior researcher at the Centre National de la Recherche Scientifique (CNRS), leader of the Language Acquisition Across Cultures team, and director of the Laboratoire de Sciences Cognitives et Psycholinguistique (LSCP), co-hosted by the Ecole Normale Supérieure, EHESS, and PSL. In 2021, she is an invited researcher in the Foundations of Learning Program of the Abdul Latif Jameel Poverty Action Lab (J-PAL) and a guest researcher at the Max Planck Institute for Evolutionary Anthropology. Her long-term aim is to answer the following questions: What are the linguistic representations that infants and adults have? Why and how are they formed? How may learnability biases shape the world’s languages? To answer these questions, she combines multiple methodologies, including analyses of spoken corpora, behavioral studies, neuroimaging (NIRS), and computational modeling. This interdisciplinary approach has resulted in over 100 publications in psychology, linguistics, and development journals, as well as in IEEE and similar conferences. With an interest in cumulative, collaborative, and transparent science, she contributed to the creation of the first meta-meta-analysis platform (metalab.stanford.edu) and several international networks, including, most saliently, the LangVIEW consortium that is leading [/L+/, the First truly global summer/winter school on language acquisition](https://www.dpss.unipd.it/summer-school-2021/home). She received the 2017 James S. McDonnell Scholar Award in Understanding Human Cognition, the 2020 Médaille de Bronze CNRS Section Linguistique, and an ERC Consolidator Award (2021-2026) for the [ExELang](exelang.fr) project.