The Speakers

 

 

Didier Meuwly

Development and validation of an automatic approach addressing the forensic question of identity of source -- the contribution of the speaker recognition field

ABSTRACT - This presentation will first introduce the types of biometric data and datasets used in the different forensic applications [identity verification, identification (closed- and open-set), investigation/surveillance, intelligence and interpretation of the evidence], using practical examples. It will then describe the methodological difference between the biometric identification process developed for access control and the inference of identity of source developed to interpret forensic evidence. Finally, it will focus on the development and validation of an automatic approach addressing the forensic question of identity of source and highlight the contribution of the speaker recognition field to this development.
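
To make the evidential framing above concrete, the sketch below illustrates a score-based likelihood-ratio computation of the kind used to address the identity-of-source question. It is a minimal, illustrative example only; the calibration scores, the Gaussian score models and the comparison score are assumptions for illustration and are not taken from the presentation.

    # Illustrative sketch: score-based likelihood ratio (LR) for the
    # forensic question of identity of source. The calibration scores and
    # Gaussian score models below are assumptions, not material from the talk.
    from scipy.stats import norm

    # Hypothetical calibration scores from a speaker recognition system.
    same_source_scores = [2.1, 2.8, 3.0, 2.5, 2.9]       # Hp: same speaker
    diff_source_scores = [-1.2, -0.4, 0.1, -0.8, -0.3]   # Hd: different speakers

    # Fit a simple Gaussian score model under each proposition.
    hp_model = norm(*norm.fit(same_source_scores))
    hd_model = norm(*norm.fit(diff_source_scores))

    def likelihood_ratio(score: float) -> float:
        """LR = p(score | same source) / p(score | different sources)."""
        return hp_model.pdf(score) / hd_model.pdf(score)

    # A hypothetical comparison score between trace and reference material.
    print(likelihood_ratio(2.4))  # LR > 1 supports Hp, LR < 1 supports Hd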


Didier Meuwly shares his time between the Netherlands Forensic Institute, where he is a principal scientist, and the University of Twente, where he holds the chair of Forensic Biometrics. He specialises in the probabilistic evaluation of forensic biometric evidence. Didier has served as a criminalist in several international terrorist cases at the request of the ICTY, STL, UN, UK and CH. He has authored and co-authored more than 60 scientific articles and chapters in the forensic science field. He is an associate and guest editor of Forensic Science International (FSI), a member of the R&D standing committee of ENFSI and a member of ISO TC 272 Forensic Sciences.

 

Joon Son Chung

ABSTRACT - Supervised learning with deep neural networks has brought phenomenal advances to speech recognition systems, but such systems rely heavily on annotated training datasets. On the other hand, humans naturally develop an understanding of the world through multiple senses, even without explicit supervision.


We attempt to mimic this human ability by leveraging the natural co-occurrence between audio and visual modalities. For example, a video of someone playing a guitar co-occurs with the sound of a guitar. Similarly, a person’s appearance is related to the person’s voice characteristics, and the words that they speak are correlated with their lip motion.


We use unlabelled audio and video for self-supervised learning of speech and speaker representations. We will discuss the use of the learnt representations for speech-related downstream tasks such as automatic speech recognition, speaker recognition and lip reading.
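
As a rough illustration of how audio-visual co-occurrence can drive self-supervised learning, the sketch below computes a generic symmetric cross-modal contrastive loss that pulls together audio and video embeddings from the same clip and pushes apart embeddings from different clips. The tensor shapes, temperature and random embeddings are illustrative assumptions, not the specific formulation used in this work.

    # Generic cross-modal contrastive (InfoNCE-style) loss sketch.
    # Shapes and the temperature value are illustrative assumptions.
    import torch
    import torch.nn.functional as F

    def audio_visual_contrastive_loss(audio_emb, video_emb, temperature=0.07):
        """audio_emb, video_emb: (batch, dim) embeddings of co-occurring clips.
        Matching audio/video pairs lie on the diagonal of the similarity matrix."""
        audio_emb = F.normalize(audio_emb, dim=-1)
        video_emb = F.normalize(video_emb, dim=-1)
        logits = audio_emb @ video_emb.t() / temperature   # (batch, batch)
        targets = torch.arange(logits.size(0))             # i-th audio <-> i-th video
        loss_a2v = F.cross_entropy(logits, targets)
        loss_v2a = F.cross_entropy(logits.t(), targets)
        return 0.5 * (loss_a2v + loss_v2a)

    # Toy usage with random embeddings standing in for encoder outputs.
    loss = audio_visual_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
    print(loss.item())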


Joon Son Chung is an assistant professor at the School of Electrical Engineering, KAIST, where he directs the Multimodal AI Lab. Previously, he was a research team lead at Naver Corporation, where he managed the development of speech recognition models for various applications including Clova Note. He received his BA and PhD from the University of Oxford, working with Prof. Andrew Zisserman. He has published in top-tier venues including TPAMI and IJCV, and has received best paper awards at Interspeech and ACCV. His research interests include speaker recognition, multimodal learning, visual speech synthesis and audio-visual speech recognition.

 

Jesus Villalba-Lopez

Towards Speech Processing Robust to Adversarial Deceptions

ABSTRACT - As speech AI systems become increasingly integrated into our daily lives, ensuring their robustness against malicious attacks is paramount. While preventing spoofing attacks remains a primary objective for the speaker recognition community, recent advances in deep learning have facilitated the emergence of novel threat models targeting speech processing systems. This talk delves into the intricate world of adversarial attacks, where subtle perturbations in input data can lead to erroneous outputs, and poisoning attacks, where maliciously crafted training data corrupts the model's learning process. We explore the vulnerabilities present in speech AI systems, examining them alongside strategies for detecting and defending against attacks. By comprehensively understanding these threats, we empower ourselves to fortify speech AI systems against nefarious exploitation, thereby safeguarding the integrity and reliability of this transformative technology.
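
To make the notion of a subtle adversarial perturbation concrete, the sketch below applies a standard fast gradient sign method (FGSM) step to an input waveform. The toy classifier, the epsilon budget and the tensor shapes are illustrative assumptions and do not represent the specific attacks or defences discussed in the talk.

    # Illustrative FGSM-style adversarial perturbation of an input waveform.
    # The toy classifier and epsilon budget are assumptions for illustration only.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(16000, 64), nn.ReLU(), nn.Linear(64, 10))
    loss_fn = nn.CrossEntropyLoss()

    waveform = torch.randn(1, 16000, requires_grad=True)  # 1 s of 16 kHz audio
    true_label = torch.tensor([3])                        # hypothetical class index

    # Gradient of the loss with respect to the input, not the model weights.
    loss = loss_fn(model(waveform), true_label)
    loss.backward()

    epsilon = 0.001  # perturbation budget, small enough to be hard to hear
    adversarial = (waveform + epsilon * waveform.grad.sign()).detach()

    # The perturbed audio sounds nearly identical but can change the decision.
    print(model(adversarial).argmax(dim=-1))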

Jesus Villalba is an assistant research professor in the Department of Electrical and Computer Engineering at Johns Hopkins University, and an affiliate of the Center for Language and Speech Processing.

His current research interests relate to information extraction from speech, such as speaker identity, language, age, and emotion. He is also interested in speaker diarization and unsupervised learning for speech-related applications.

Villalba received his MS in telecommunications engineering (2004) and his PhD in biomedical engineering (2014) from the University of Zaragoza, Spain. His thesis focused on several topics related to speaker recognition in adverse environments.

Villalba joined the Johns Hopkins Center for Language and Speech Processing as a postdoctoral fellow in October of 2016. He was appointed assistant research professor in 2019.

 

Carlos Busso

Toward Robust and Discriminative Emotional Speech Representations 

ABSTRACT - Human speech communication involves a complex orchestration of cognitive, physiological, physical, cultural, and social processes where emotions play an essential role. Emotion is at the core of speech technology, changing the acoustic properties and impacting speech-based interfaces from analysis to synthesis and recognition. For example, understanding the acoustic variability in emotional speech can be instrumental in mitigating the reduced performance often observed with emotional speech for tasks such as automatic speech recognition (ASR) and speaker verification and identification. Emotions can also improve the naturalness of human-computer interactions, especially in speech synthesis and voice conversion, where natural human voices are generated to convey the emotional nuances that make human-machine communication effective. Furthermore, since emotions change the intended meaning of the message, identifying a user's emotion can be crucial for spoken dialogue and conversational systems. It is critical for the advancement of speech technology to computationally characterize emotion in speech and obtain robust and discriminative feature representations. This keynote will describe key observations that need to be considered to create emotional speech representations, including the importance of modeling temporal information, self-supervised learning (SSL) strategies to leverage unlabeled data, efficient techniques to adapt regular SSL speech representations to capture the externalization of emotion in speech, and a novel distance-based formulation to build emotional speech representations. The seminar will describe the potential of these feature representations in speech-based technologies.
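
As a point of reference for the distance-based perspective mentioned above, the sketch below shows a textbook triplet objective over emotion embeddings, which pulls same-emotion utterances together and pushes different-emotion utterances apart. It is a generic illustration under assumed embedding shapes, not the novel formulation described in the keynote.

    # Generic distance-based (triplet) objective for emotion embeddings.
    # A textbook formulation for illustration, not the keynote's method.
    import torch
    import torch.nn.functional as F

    def emotion_triplet_loss(anchor, positive, negative, margin=0.2):
        """anchor/positive share an emotion label; negative has a different one."""
        d_pos = F.pairwise_distance(anchor, positive)
        d_neg = F.pairwise_distance(anchor, negative)
        return F.relu(d_pos - d_neg + margin).mean()

    # Toy usage with random embeddings standing in for an emotion encoder.
    anchor, positive, negative = torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128)
    print(emotion_triplet_loss(anchor, positive, negative).item())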

Carlos Busso is a Professor in the Electrical and Computer Engineering Department at the University of Texas at Dallas, where he is also the director of the Multimodal Signal Processing (MSP) Laboratory. His research interests are in human-centered multimodal machine intelligence and applications, with a focus on the broad areas of speech processing, affective computing, and machine learning methods for multimodal processing. He has worked on speech emotion recognition, multimodal behavior modeling for socially interactive agents, and robust multimodal speech processing. He is a recipient of an NSF CAREER Award. In 2014, he received the ICMI Ten-Year Technical Impact Award. In 2015, his student (N. Li) received the third prize of the IEEE ITSS Best Dissertation Award. He also received the Hewlett Packard Best Paper Award at IEEE ICME 2011 (with J. Jain) and the Best Paper Award at AAAC ACII 2017 (with Yannakakis and Cowie). He received the Best of IEEE Transactions on Affective Computing Paper Collection in 2021 (with R. Lotfian) and the Best Paper Award from IEEE Transactions on Affective Computing in 2022 (with Yannakakis and Cowie). In 2023, he received the Distinguished Alumni Award in the Mid-Career/Academia category from the Signal and Image Processing Institute (SIPI) at the University of Southern California, as well as the ACM ICMI Community Service Award. He is currently serving as an associate editor of the IEEE Transactions on Affective Computing. He is a member of the AAAC, a senior member of the ACM, and a Fellow of the IEEE and ISCA.

 

Craig S. Greenberg

...

Craig Greenberg is a Mathematician at the National Institute of Standards and Technology (NIST), where he oversees NIST’s Speaker Recognition Evaluation and Language Recognition Evaluation series and researches the measurement and evaluation of Artificial Intelligence (AI), along with other topics in AI and machine learning. Prior to joining NIST, Dr. Greenberg worked as an English language annotator at the Institute for Research in Cognitive Science, as a programmer at the Linguistic Data Consortium, and as a research assistant in computational linguistics at the University of Pennsylvania. Dr. Greenberg received his PhD in 2020 from the University of Massachusetts Amherst with a dissertation on uncertainty and exact and approximate inference in flat and hierarchical clustering, his M.S. degree in Computer Science from the University of Massachusetts Amherst in 2016, his M.S. degree in Applied Mathematics from Johns Hopkins University in 2012, his B.A. (Hons.) degree in Logic, Information, & Computation from the University of Pennsylvania in 2007, and his B.M. degree in Percussion Performance from Vanderbilt University in 2003. Among his accolades, Dr. Greenberg has received two official letters of commendation for his contributions to speaker recognition evaluation.