The Full Schedule

[15th May] This schedule is not definitive; it may be subject to minor modifications in the coming days.

Tuesday 18th

Registration 8:00 - 9:00

9:00 - 9:30

Opening Ceremony

9:30 - 10:30

Keynote 1: Development and validation of an automatic approach addressing the forensic question of identity of source -- the contribution of the speaker recognition field

This presentation will first introduce the types of biometric data and datasets used in different forensic applications [identity verification, identification (closed- and open-set), investigation/surveillance, intelligence, and interpretation of the evidence], using practical examples. It will then describe the methodological differences between the biometric identification process developed for access control and the forensic inference of identity of source developed to interpret the evidence. Finally, it will focus on the development and validation of an automatic approach addressing the forensic question of identity of source and highlight the contribution of the speaker recognition field to this development.

Didier Meuwly

Break 10:30 - 11:00

11:00 - 12:20

Oral Presentations: Forensic Speaker Recognition

Lunch 12:20 - 13:50

13:50 - 15:10

Oral Presentations: Speaker and Language Recognition

Break 15:10 - 15:40

15:40 - 16:40

Keynote 2

Craig S. Greenberg

16:40 - 18:20

Oral Presentations: Speaker Verification

Welcome reception 19:00 - 22:00

Wednesday 19th

8:30 - 9:30

Oral Presentations: Speech Pathologies

9:30 - 10:30

Keynote 3: Towards Speech Processing Robust to Adversarial Deceptions

As speech AI systems become increasingly integrated into our daily lives, ensuring their robustness against malicious attacks is paramount. While preventing spoofing attacks remains a primary objective for the speaker recognition community, recent advances in deep learning have facilitated the emergence of novel threat models targeting speech processing systems. This talk delves into the intricate world of adversarial attacks, where subtle perturbations in input data can lead to erroneous outputs, and poisoning attacks, where maliciously crafted training data corrupts the model's learning process. We explore the vulnerabilities present in speech AI systems, examining them alongside strategies for detecting and defending against attacks. By comprehensively understanding these threats, we empower ourselves to fortify speech AI systems against nefarious exploitation, thereby safeguarding the integrity and reliability of this transformative technology.

Jesus Villalba-Lopez

Break 10:30 - 11:00

11:00 - 12:40

Oral Presentations: Spoofing and Adversarial Attacks

Lunch 12:40 - 14:20

14:20 - 15:20

Discussion Panel 

Break 15:20 - 15:50

15:50 - 18:00

Oral Presentations: Speaker Diarization

Banquet 18:00 - 22:00

Thursday 20th

8:30 - 9:50

Oral Presentations: The Emotion Recognition Challenge

9:50 - 10:50

Keynote 4: Toward Robust and Discriminative Emotional Speech Representations

Human speech communication involves a complex orchestration of cognitive, physiological, physical, cultural, and social processes where emotions play an essential role. Emotion is at the core of speech technology, changing the acoustic properties and impacting speech-based interfaces from analysis to synthesis and recognition. For example, understanding the acoustic variability in emotional speech can be instrumental in mitigating the reduced performance often observed with emotional speech for tasks such as automatic speech recognition (ASR) and speaker verification and identification. Emotions can also improve the naturalness of human-computer interactions, especially in speech synthesis and voice conversion, where natural human voices are generated to convey the emotional nuances that make human-machine communication effective. Furthermore, since emotions change the intended meaning of the message, identifying a user's emotion can be crucial for spoken dialogue and conversational systems. It is critical for the advancement of speech technology to computationally characterize emotion in speech and to obtain robust and discriminative feature representations. This keynote will describe key observations that need to be considered to create emotional speech representations, including the importance of modeling temporal information, self-supervised learning (SSL) strategies to leverage unlabeled data, efficient techniques to adapt regular SSL speech representations to capture the externalization of emotion in speech, and a novel distance-based formulation to build emotional speech representations. The seminar will describe the potential of these feature representations in speech-based technologies.

Carlos Busso

Break 10:50 - 11:20

11:20 - 12:50

Oral Presentations: The Emotion Recognition Challenge

Social event - Sugar Shack visit: from 12:50

Friday 21st

8:30 - 9:50

Oral Presentations: Applications and Multimedia

Break 9:50 - 10:20

10:20 - 11:20

Keynote 5

Supervised learning with deep neural networks has brought phenomenal advances to speech recognition systems, but such systems rely heavily on annotated training datasets. On the other hand, humans naturally develop an understanding of the world through multiple senses even without explicit supervision.

We attempt to mimic this human ability by leveraging the natural co-occurrence between audio and visual modalities. For example, a video of someone playing a guitar co-occurs with the sound of a guitar. Similarly, a person’s appearance is related to the person’s voice characteristics, and the words that they speak are correlated with their lip motion.

We use unlabelled audio and video for self-supervised learning of speech and speaker representations. We will discuss the use of the learnt representations for speech-related downstream tasks such as automatic speech recognition, speaker recognition and lip reading.

Joon Son Chung

11:20 - 12:40

Oral Presentations: Speech Synthesis

12:40 - 13:00

Closing Ceremony