The Full Schedule
Tuesday 18th
Registration 8:00 - 9:00
9:00 - 9:30
Opening Ceremony
9:30 - 10:30
Keynote 1: Development and validation of an automatic approach addressing the forensic question of identity of source -- the contribution of the speaker recognition field
This presentation will first introduce the types of biometric data and datasets used in the different forensic applications [identity verification, identification (closed- and open-set), investigation/surveillance, intelligence, and interpretation of the evidence], using practical examples. It will then describe the methodological difference between the biometric identification process developed for access control and the forensic inference of identity of source developed to interpret the evidence. Finally, it will focus on the development and validation of an automatic approach addressing the forensic question of identity of source and highlight the contribution of the speaker recognition field to this development.
Didier Meuwly
Break 10:30 - 11:00
11:00 - 12:20
Oral Presentations: Forensic Speaker Recognition
Lunch 12:20 - 13:50
13:50 - 15:10
Oral Presentations: Speaker and Language Recognition
Break 15:10 - 15:40
15:40 - 16:40
Keynote 2
Craig S. Greenberg
16:40 - 18:20
Oral Presentations: Speaker Verification
Welcome reception 19:00 - 22:00
Wednesday 19th
8:30 - 9:30
Oral Presentations: Speech Pathologies
9:30 - 10:30
Keynote 3: Towards Speech Processing Robust to Adversarial Deceptions
As speech AI systems become increasingly integrated into our daily lives, ensuring their robustness against malicious attacks is paramount. While preventing spoofing attacks remains a primary objective for the speaker recognition community, recent advances in deep learning have facilitated the emergence of novel threat models targeting speech processing systems. This talk delves into the intricate world of adversarial attacks, where subtle perturbations in input data can lead to erroneous outputs, and poisoning attacks, where maliciously crafted training data corrupts the model's learning process. We explore the vulnerabilities present in speech AI systems, examining them alongside strategies for detecting and defending against attacks. By comprehensively understanding these threats, we empower ourselves to fortify speech AI systems against nefarious exploitation, thereby safeguarding the integrity and reliability of this transformative technology.
Jesus Villalba-Lopez
Break 10:30 - 11:00
11:00 - 12:40
Oral Presentations: Spoofing and Adversarial Attacks
Lunch 12:40 - 14:20
14:20 - 15:20
Discussion Panel
Break 15:20 - 15:50
15:50 - 18:00
Oral Presentations: Speaker Diarization
Banquet 18:00 - 22:00
Thursday 20th
8:30 - 9:50
Oral Presentations: The Emotion Recognition Challenge
9:50 - 10:50
Keynote 4: Toward Robust and Discriminative Emotional Speech Representations
Human speech communication involves a complex orchestration of cognitive, physiological, physical, cultural, and social processes in which emotions play an essential role. Emotion is at the core of speech technology, changing the acoustic properties of speech and impacting speech-based interfaces from analysis to synthesis and recognition. For example, understanding the acoustic variability in emotional speech can be instrumental in mitigating the reduced performance often observed with emotional speech in tasks such as automatic speech recognition (ASR) and speaker verification and identification. Emotions can also improve the naturalness of human-computer interactions, especially in speech synthesis and voice conversion, where natural human voices are generated to convey the emotional nuances that make human-machine communication effective. Furthermore, since emotions change the intended meaning of the message, identifying a user's emotion can be crucial for spoken dialogue and conversational systems. It is critical for the advancement of speech technology to computationally characterize emotion in speech and obtain robust and discriminative feature representations. This keynote will describe key observations that need to be considered to create emotional speech representations, including the importance of modeling temporal information, self-supervised learning (SSL) strategies to leverage unlabeled data, efficient techniques to adapt regular SSL speech representations to capture the externalization of emotion in speech, and a novel distance-based formulation to build emotional speech representations. The seminar will describe the potential of these feature representations in speech-based technologies.
Carlos Busso
Break 10:50 - 11:20
11:20 - 12:50
Oral Presentations: The Emotion Recognition Challenge
Social event - Sugar Shack visit: from 12:50
Friday 21st
8:30 - 9:50
Oral Presentations: Applications and Multimedia
Break 9:50 - 10:20
10:20 - 11:20
Keynote 5
Supervised learning with deep neural networks has brought phenomenal advances to speech recognition systems, but such systems rely heavily on annotated training datasets. On the other hand, humans naturally develop an understanding about the world through multiple senses even without explicit supervision.
We attempt to mimic this human ability by leveraging the natural co-occurrence between audio and visual modalities. For example, a video of someone playing a guitar co-occurs with the sound of a guitar. Similarly, a person’s appearance is related to the person’s voice characteristics, and the words that they speak are correlated to their lip motion.
We use unlabelled audio and video for self-supervised learning of speech and speaker representations. We will discuss the use of the learnt representations for speech-related downstream tasks such as automatic speech recognition, speaker recognition and lip reading.
Joon Son Chung
11:20 - 12:40
Oral Presentations: Speech Synthesis
12:40 - 13:00