The Full Schedule

Tuesday 18th

Registration 8:00 - 9:00

9:00 - 9:30

Opening Ceremony

9:30 - 10:30

Keynote 1: Development and validation of an automatic approach addressing the forensic question of identity of source -- the contribution of the speaker recognition field

This presentation will first introduce the types of biometric data and datasets used in the different forensic applications [identity verification, identification (closed- and open-set), investigation/surveillance, intelligence and interpretation of the evidence], using practical examples. It will then describe the methodological difference between the biometric identification process developed for access control and the forensic inference of identity of source developed to interpret the evidence. Finally, it will focus on the development and validation of an automatic approach addressing the forensic question of identity of source and highlight the contribution of the speaker recognition field to this development.

Didier Meuwly

Break 10:30 - 11:00

11:00 - 12:20

Oral Presentations: Forensic Speaker Recognition

Exploring individual speaker behaviour within a forensic automatic speaker recognition system (20m)

Forensic speaker recognition with BA-LR: calibration and evaluation on a forensically realistic database (20m)

ROXSD: The ROXANNE Multimodal and Simulated Dataset for Advancing Criminal Investigations (20m)

Exploring speaker similarity based selection of relevant populations for forensic automatic speaker recognition (20m)

Lunch 12:20 - 13:50

13:50 - 15:10

Oral Presentations: Speech Synthesis

Converting Anyone's Voice: End-to-End Expressive Voice Conversion with A Conditional Diffusion Model (20m)

Mixed-EVC: Mixed Emotion Synthesis and Control in Voice Conversion (20m)

Automatic Voice Identification after Speech Resynthesis using PPG (20m)

Exploring speech style spaces with language models: Emotional TTS without emotion labels (20m)

Break 15:10 - 15:40

15:40 - 16:40

Keynote 2: A Brief History of the NIST Speaker Recognition Evaluations

NIST conducted its first evaluation of speaker recognition technology in 1996, involving 10 systems completing 4000 trials. In the ensuing nearly 30 years, NIST has conducted approximately 20 Speaker Recognition Evaluations (SREs), with recent SREs typically involving hundreds of systems completing millions of trials. In this talk, we will discuss the history of the NIST SREs, including how the practice of evaluating speaker recognition technology has evolved, current challenges in speaker recognition evaluation, and some possible future directions.

Craig S. Greenberg

16:40 - 18:20

Oral Presentations: Speaker Verification

Attention-based Comparison on Aligned Utterances for Text-Dependent Speaker Verification (20m)

Additive Margin in Contrastive Self-Supervised Frameworks to Learn Discriminative Speaker Representations (20m)

An investigative study of the effect of several regularization techniques on label noise robustness of self-supervised speaker verification systems (20m)

Using Pretrained Language Models for Improved Speaker Identification (20m)

A Phonetic Analysis of Speaker Verification Systems through Phoneme selection and Integrated Gradients (20m)

Welcome reception 18:30 - 22:00

Wednesday 19th

8:30 - 9:30

Oral Presentations: Speech Pathologies

Discovering Invariant Patterns of Cognitive Decline Via an Automated Analysis of the Cookie Thief Picture Description Task (20m)

A Comparison of Differential Performance Metrics for the Evaluation of Automatic Speaker Verification Fairness (20m)
9:30 - 10:30

Keynote 3: Towards Speech Processing Robust to Adversarial Deceptions

As speech AI systems become increasingly integrated into our daily lives, ensuring their robustness against malicious attacks is paramount. While preventing spoofing attacks remains a primary objective for the speaker recognition community, recent advances in deep learning have facilitated the emergence of novel threat models targeting speech processing systems. This talk delves into the intricate world of adversarial attacks, where subtle perturbations in input data can lead to erroneous outputs, and poisoning attacks, where maliciously crafted training data corrupts the model's learning process. We explore the vulnerabilities present in speech AI systems, examining them alongside strategies for detecting and defending against attacks. By comprehensively understanding these threats, we empower ourselves to fortify speech AI systems against nefarious exploitation, thereby safeguarding the integrity and reliability of this transformative technology.

Jesus Villalba-Lopez

Break 10:30 - 11:00

11:00 - 12:40

Oral Presentations: Spoofing and Adversarial Attacks

Device Feature based on Graph Fourier Transformation with Logarithmic Processing For Detection of Replay Speech Attacks (20m)

Spoofing detection in the wild: an investigation of approaches to improve generalisation (20m)

Meaningful Embeddings for Explainable Countermeasures (20m)

a-DCF: an architecture agnostic metric with application to spoofing-robust speaker verification (20m)

Unraveling Adversarial Examples against Speaker Identification - Techniques for Attack Detection and Victim Model Classification (20m)

Lunch 12:40 - 14:20

14:20 - 15:20

Discussion Panel

Break 15:20 - 15:50

15:50 - 18:10

Oral Presentations: Speaker Diarization

On Speaker Attribution with SURT (20m)

Improving Speaker Assignment in Speaker-Attributed ASR for Real Meeting Applications (20m)

Leveraging Speaker Embeddings in End-to-End Neural Diarization for Two-Speaker Scenarios (20m)

PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings (20m)

Do End-to-End Neural Diarization Attractors Need to Encode Speaker Characteristic Information? (20m)

Speaker Embeddings With Weakly Supervised Voice Activity Detection For Efficient Speaker Diarization (20m)

3MAS: a multitask, multilabel, multidataset semi-supervised audio segmentation model (20m)

Banquet 18:10 - 22:00

Thursday 20th

8:30 - 9:50

Oral Presentations: The Emotion Recognition Challenge

Odyssey 2024 - Speech Emotion Recognition Challenge: Dataset, Baseline Framework, and Results (20m)

TalTech Systems for the Odyssey 2024 Emotion Recognition Challenge (20m)

1st Place Solution to Odyssey Emotion Recognition Challenge Task1: Tackling Class Imbalance Problem (20m)

Double Multi-Head Attention Multimodal System for Odyssey 2024 Speech Emotion Recognition Challenge (20m)

9:50 - 10:50

Keynote 4: Toward Robust and Discriminative Emotional Speech Representations

Human speech communication involves a complex orchestration of cognitive, physiological, physical, cultural, and social processes where emotions play an essential role. Emotion is at the core of speech technology, changing the acoustic properties and impacting speech-based interfaces from analysis to synthesis and recognition. For example, understanding the acoustic variability in emotional speech can be instrumental in mitigating the reduced performance often observed with emotional speech for tasks such as automatic speech recognition (ASR) and speaker verification and identification. Emotions can also improve the naturalness of human-computer interactions, especially in speech synthesis and voice conversion, where natural human voices are generated to convey the emotional nuances that make human-machine communication effective. Furthermore, since emotions change the intended meaning of the message, identifying a user's emotion can be crucial for spoken dialogue and conversational systems. It is critical for the advancement of speech technology to computationally characterize emotion in speech and obtain robust and discriminative feature representations. This keynote will describe key observations that need to be considered to create emotional speech representations, including the importance of modeling temporal information, self-supervised learning (SSL) strategies to leverage unlabeled data, efficient techniques to adapt regular SSL speech representations to capture the externalization of emotion in speech, and novel distance-based formulations to build emotional speech representations. The seminar will describe the potential of these feature representations in speech-based technologies.

Carlos Busso

Break 10:50 - 11:20

11:20 - 12:20

Oral Presentations: The Emotion Recognition Challenge

The ViVoLab System for the Odyssey Emotion Recognition Challenge 2024 Evaluation (10m)

The CONILIUM proposition for Odyssey Emotion Challenge: Leveraging major class with complex annotations (10m)

Multimodal Audio-Language Model for Speech Emotion Recognition (10m)

IRIT-MFU Multi-modal systems for emotion classification for Odyssey 2024 challenge (10m)

Adapting WavLM for Speech Emotion Recognition (10m)

MSP-Podcast SER Challenge 2024: L'antenne du Ventoux Multimodal Self-Supervised Learning for Speech Emotion Recognition (10m)

12:20 - 12:50

Panel discussion with the participants of the challenge

Friday 21st

8:30 - 9:50

Oral Presentations: Speaker and Language Recognition

Low-resource speech recognition and dialect identification of Irish in a multi-task framework (20m)

Normalizing Flows for Speaker and Language Recognition Backend (20m)

Joint Language and Speaker Classification in Naturalistic Bilingual Adult-Toddler Interactions (20m)

MAGLIC: The Maghrebi Language Identification Corpus (20m)

Break 9:50 - 10:20

10:20 - 11:20

Keynote 5

Supervised learning with deep neural networks has brought phenomenal advances to speech  recognition systems, but such systems rely heavily on annotated training datasets. On the other hand, humans naturally develop an understanding about the world through multiple senses even without explicit supervision.

We attempt to mimic this human ability by leveraging the natural co-occurrence between audio and visual modalities. For example, a video of someone playing a guitar co-occurs with the sound of a guitar. Similarly, a person’s appearance is related to the person’s voice characteristics, and the words that they speak are correlated to their lip motion.

We use unlabelled audio and video for self-supervised learning of speech and speaker representations. We will discuss the use of the learnt representations for speech-related downstream tasks such as automatic speech recognition, speaker recognition and lip reading.

Joon Son Chung

11:20 - 12:00

Oral Presentations: Applications and Multimedia

Optimizing Auditory Immersion Safety on Edge Devices: An On-Device Sound Event Detection System (20m)

Cross-Modal Transformers for Audio-Visual Person Verification (20m)

12:00 - 12:20

Closing Ceremony