The Full Schedule
Tuesday 18th
Registration 8:00 - 9:00
9:00 - 9:30
Opening Ceremony
9:30 - 10:30
Keynote 1: Development and validation of an automatic approach addressing the forensic question of identity of source -- the contribution of the speaker recognition field
This presentation will first introduce the types of biometric data and datasets used in the different forensic applications [identity verification, identification (closed- and open-set), investigation/surveillance, intelligence and interpretation of the evidence], using practical examples. It will then describe the methodological difference between the biometric identification process developed for access control and the forensic inference of identity of source developed to interpret the evidence. Finally, it will focus on the development and validation of an automatic approach addressing the forensic question of identity of source and highlight the contribution of the speaker recognition field to this development.
Didier Meuwly
Break 10:30 - 11:00
11:00 - 12:20
Oral Presentations: Forensic Speaker Recognition
Exploring individual speaker behaviour within a forensic automatic speaker recognition system (20m)
Vincent Hughes (University of York)
Philip Harrison (University of York)
Poppy Welch (University of York)
Finnian Kelly (Oxford Wave Research)
Forensic speaker recognition with BA-LR: calibration and evaluation on a forensically realistic database (20m)
Imen Ben-Amor (Université d'Avignon)
Jean-Francois Bonastre (Université d’Avignon)
David van der Vloed (Netherlands Forensic Institute)
ROXSD: The ROXANNE Multimodal and Simulated Dataset for Advancing Criminal Investigations (20m)
Petr Motlicek (Idiap)
Srikanth Madikeri (Idiap)
Pradeep Rangappa (Idiap)
Johan Rohdin (Brno University of Technology)
Daniel Kudenko (L3S Research Center Leibniz University Hannover)
Zahra Ahmadi (L3S Research Center)
Hoang H. Nguyen (L3S Research Center, Leibniz Universität Hannover)
Aravind Krishnan (Saarland University)
Dawei Zhu (Saarland University)
Dietrich Klakow (Saarland University)
Exploring speaker similarity based selection of relevant populations for forensic automatic speaker recognition (20m)
Linda Gerlach (University of Cambridge, Oxford Wave Research)
Finnian Kelly (Oxford Wave Research)
Kirsty McDougall (University of Cambridge)
Anil Alexander (Oxford Wave Research)
Lunch 12:20 - 13:50
13:50 - 15:10
Oral Presentations: Speech Synthesis
Converting Anyone's Voice: End-to-End Expressive Voice Conversion with A Conditional Diffusion Model (20m)
Zongyang Du (The University of Texas at Dallas)
Junchen Lu (National University of Singapore)
Kun Zhou (Alibaba Group)
Berrak Sisman (The University of Texas at Dallas)
Mixed-EVC: Mixed Emotion Synthesis and Control in Voice Conversion (20m)
Kun Zhou (Alibaba Group)
Berrak Sisman (The University of Texas at Dallas)
Carlos Busso (University of Texas at Dallas)
Bin Ma (Alibaba)
Haizhou Li (The Chinese University of Hong Kong (Shenzhen))
Automatic Voice Identification after Speech Resynthesis using PPG (20m)
Thibault Gaudier (Le Mans Université)
Marie Tahon (LIUM)
Anthony Larcher (Université du Mans - LIUM)
Yannick Estève (LIA - Avignon University)
Exploring speech style spaces with language models: Emotional TTS without emotion labels (20m)
Shreeram Suresh Chandra (The University of Texas at Dallas)
Zongyang Du (The University of Texas at Dallas)
Berrak Sisman (The University of Texas at Dallas)
Break 15:10 - 15:40
15:40 - 16:40
Keynote 2: A Brief History of the NIST Speaker Recognition Evaluations
NIST conducted its first evaluation of speaker recognition technology in 1996, involving 10 systems completing 4000 trials. In the ensuing nearly 30 years, NIST has conducted approximately 20 Speaker Recognition Evaluations (SREs), with recent SREs typically involving hundreds of systems completing millions of trials. In this talk, we will discuss the history of the NIST SREs, including how the practice of evaluating speaker recognition technology has evolved, current challenges in speaker recognition evaluation, and some possible future directions.
Craig S. Greenberg
16:40 - 18:20
Oral Presentations: Speaker Verification
Attention-based Comparison on Aligned Utterances for Text-Dependent Speaker Verification (20m)
Nathan Griot (LIA - Laboratoire Informatique d'Avignon)
Mohammad Mohammadamini (Avignon University)
Additive Margin in Contrastive Self-Supervised Frameworks to Learn Discriminative Speaker Representations (20m)
Theo Lepage (LRE-EPITA)
Reda Dehak (ESLR - EPITA)
An investigative study of the effect of several regularization techniques on label noise robustness of self-supervised speaker verification systems (20m)
Abderrahim Fathan (Computer Research Institute of Montreal (CRIM), Montreal, Quebec, Canada)
Jahangir Alam (Computer Research Institute of Montreal (CRIM), Montreal (Quebec) Canada)
Using Pretrained Language Models for Improved Speaker Identification (20m)
Oleksandra Zamana (Tallinn University of Technology)
Tanel Alumae (Tallinn University of Technology)
A Phonetic Analysis of Speaker Verification Systems through Phoneme selection and Integrated Gradients (20m)
Thomas Thebaud (Johns Hopkins University)
Gabriel Hernández Sierra (CENATAV)
Sarah Samson Juan (Universiti Malaysia Sarawak)
Marie Tahon (LIUM)
Welcome reception 18:30 - 22:00
Wednesday 19th
8:30 - 9:30
Oral Presentations: Speech Pathologies
Discovering Invariant Patterns of Cognitive Decline Via an Automated Analysis of the Cookie Thief Picture Description Task (20m)
Anna Favaro (Johns Hopkins University)
Najim Dehak (Johns Hopkins University)
Thomas Thebaud (Johns Hopkins University)
Jesus Villalba (Johns Hopkins University)
Laureano Moro-Velazquez (Johns Hopkins University)
A Comparison of Differential Performance Metrics for the Evaluation of Automatic Speaker Verification Fairness (20m)
Oubaida Chouchane (EURECOM)
Christoph Busch (Hochschule Darmstadt)
Chiara Galdi (EURECOM, France)
Nicholas Evans (EURECOM)
Massimiliano Todisco (EURECOM)
Noise Robust Whisper Features for Dysarthric Automatic Speech Recognition (20m)
Japan Bhatt (DAIICT)
Harsh Patel (DAIICT)
Hemant Patil (DAIICT, Gujarat)
9:30 - 10:30
Keynote 3: Towards Speech Processing Robust to Adversarial Deceptions
As speech AI systems become increasingly integrated into our daily lives, ensuring their robustness against malicious attacks is paramount. While preventing spoofing attacks remains a primary objective for the speaker recognition community, recent advances in deep learning have facilitated the emergence of novel threat models targeting speech processing systems. This talk delves into the intricate world of adversarial attacks, where subtle perturbations in input data can lead to erroneous outputs, and poisoning attacks, where maliciously crafted training data corrupts the model's learning process. We explore the vulnerabilities present in speech AI systems, examining them alongside strategies for detecting and defending against attacks. By comprehensively understanding these threats, we empower ourselves to fortify speech AI systems against nefarious exploitation, thereby safeguarding the integrity and reliability of this transformative technology.
Jesus Villalba-Lopez
Break 10:30 - 11:00
11:00 - 12:40
Oral Presentations: Spoofing and Adversarial Attacks
Device Feature based on Graph Fourier Transformation with Logarithmic Processing For Detection of Replay Speech Attacks (20m)
Mingrui He (Donghua University)
Longting Xu (Donghua University)
Wang Han (Donghua University)
Mingjun Zhang (DongHua University)
Rohan Kumar Das (Fortemedia)
Spoofing detection in the wild: an investigation of approaches to improve generalisation (20m)
Anh-Tuan Dao (LIA)
Nicholas Evans (EURECOM)
Meaningful Embeddings for Explainable Countermeasures (20m)
Itshak Lapidot (Afeka, Tel-Aviv College of Engineering)
a-DCF: an architecture agnostic metric with application to spoofing-robust speaker verification (20m)
Hye-jin Shim (Carnegie Mellon University)
Jee-weon Jung (Carnegie Mellon University)
Tomi Kinnunen (University of Eastern Finland)
Nicholas Evans (EURECOM)
Jean-Francois Bonastre (Université d’Avignon)
Itshak Lapidot (Afeka, Tel-Aviv College of Engineering)
Unraveling Adversarial Examples against Speaker Identification - Techniques for Attack Detection and Victim Model Classification (20m)
Sonal Joshi (Johns Hopkins University)
Thomas Thebaud (Johns Hopkins University)
Jesus Villalba (Johns Hopkins University)
Najim Dehak (Johns Hopkins University)
Lunch 12:40 - 14:20
14:20 - 15:20
Discussion Panel
Break 15:20 - 15:50
15:50 - 18:10
Oral Presentations: Speaker Diarization
On Speaker Attribution with SURT (20m)
Desh Raj (Johns Hopkins University)
Matthew Wiesner (Johns Hopkins University)
Paola Garcia (Johns Hopkins University)
Daniel Povey (Xiaomi, Inc.)
Sanjeev Khudanpur (Johns Hopkins University)
Improving Speaker Assignment in Speaker-Attributed ASR for Real Meeting Applications (20m)
Can Cui (Inria)
Imran Sheikh (Vivoka)
Mostafa Sadeghi (INRIA)
Emmanuel Vincent (Inria)
Leveraging Speaker Embeddings in End-to-End Neural Diarization for Two-Speaker Scenarios (20m)
Juan Ignacio Alvarez-Trejos (Universidad Autonoma de Madrid)
Beltrán Labrador (Audias - Universidad Autónoma de Madrid)
Alicia Lozano-Diez (Universidad Autonoma de Madrid)
PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings (20m)
Joonas Kalda (Tallinn University of Technology)
Ricard Marxer (Université de Toulon, Aix Marseille Univ, CNRS, LIS, Toulon)
Tanel Alumae (Tallinn University of Technology)
Hervé Bredin (CNRS)
Do End-to-End Neural Diarization Attractors Need to Encode Speaker Characteristic Information? (20m)
Lin Zhang (National Institute of Informatics)
Themos Stafylakis (Omilia - Conversational Intelligence)
Federico Landini (Brno University of Technology)
Mireia Diez (Brno University of Technology)
Anna Silnova (Brno University of Technology)
Lukáš Burget (Brno University of Technology)
Speaker Embeddings With Weakly Supervised Voice Activity Detection For Efficient Speaker Diarization (20m)
Jenthe Thienpondt (IDLab, Ghent University)
Kris Demuynck (Ghent University)
3MAS: a multitask, multilabel, multidataset semi-supervised audio segmentation model (20m)
Martin Lebourdais (IRIT/CNRS)
Théo Mariotte (LTCI, Télécom Paris, Institut Polytechnique de Paris)
Marie Tahon (LIUM)
Alfonso Ortega (Universidad de Zaragoza)
Anthony Larcher (Université du Mans - LIUM)
Banquet 18:10 - 22:00
Thursday 20th
8:30 - 9:50
Oral Presentations: The Emotion Recognition Challenge
Odyssey 2024 - Speech Emotion Recognition Challenge: Dataset, Baseline Framework, and Results (20m)
Lucas Goncalves (The University of Texas at Dallas)
Ali Salman (University of Texas at Dallas)
Abinay Reddy Naini (The University of Texas at Dallas)
Thomas Thebaud (Johns Hopkins University)
Laureano Moro-Velazquez (Johns Hopkins University)
Paola Garcia (Johns Hopkins University)
Najim Dehak (Johns Hopkins University)
Berrak Sisman (The University of Texas at Dallas)
Carlos Busso (University of Texas at Dallas)
TalTech Systems for the Odyssey 2024 Emotion Recognition Challenge (20m)
Henry Härm (Tallinn University of Technology)
Tanel Alumae (Tallinn University of Technology)
1st Place Solution to Odyssey Emotion Recognition Challenge Task1: Tackling Class Imbalance Problem (20m)
Mingjie Chen (University of Sheffield)
Hezhao Zhang (The University of Sheffield)
Yuanchao Li (University of Edinburgh)
Jiachen Luo (Queen Mary University of London)
Wen Wu (University of Cambridge)
Ziyang Ma (Shanghai Jiao Tong University)
Peter Bell (University of Edinburgh)
Catherine Lai (University of Edinburgh)
Joshua D. Reiss (Queen Mary University of London)
Lin Wang (Centre for Intelligent Sensing, Queen Mary University of London)
Phil Woodland (Machine Intelligence Laboratory, Cambridge University Department of Engineering)
Xie Chen (Shanghai Jiao Tong University)
Huy Phan (Amazon Alexa)
Thomas Hain (University of Sheffield)
Double Multi-Head Attention Multimodal System for Odyssey 2024 Speech Emotion Recognition Challenge (20m)
Federico Costa (Universitat Politècnica de Catalunya)
9:50 - 10:50
Keynote 4: Toward Robust and Discriminative Emotional Speech Representations
Human speech communication involves a complex orchestration of cognitive, physiological, physical, cultural, and social processes where emotions play an essential role. Emotion is at the core of speech technology, changing the acoustic properties and impacting speech-based interfaces from analysis to synthesis and recognition. For example, understanding the acoustic variability in emotional speech can be instrumental in mitigating the reduced performance often observed with emotional speech for tasks such as automatic speech recognition (ASR) and speaker verification and identification. Emotions can also improve the naturalness of human-computer interactions, especially in speech synthesis and voice conversion, where natural human voices are generated to convey the emotional nuances that make human-machine communication effective. Furthermore, since emotions change the intended meaning of the message, identifying a user's emotion can be crucial for spoken dialogue and conversational systems. It is critical for the advancement of speech technology to computationally characterize emotion in speech and obtain robust and discriminative feature representations. This keynote will describe key observations that need to be considered to create emotional speech representations, including the importance of modeling temporal information, self-supervised learning (SSL) strategies to leverage unlabeled data, efficient techniques to adapt regular SSL speech representations to capture the externalization of emotion in speech, and a novel distance-based formulation to build emotional speech representations. The seminar will describe the potential of these feature representations in speech-based technologies.
Carlos Busso
Break 10:50 - 11:20
11:20 - 12:20
Oral Presentations: The Emotion Recognition Challenge
The ViVoLab System for the Odyssey Emotion Recognition Challenge 2024 Evaluation (10m)
Miguel Pastor (Universidad de Zaragoza)
Alfonso Ortega (Universidad de Zaragoza)
Antonio Miguel (University of Zaragoza)
Dayana Ribas (ViVoLab, University of Zaragoza)
The CONILIUM proposition for Odyssey Emotion Challenge: Leveraging major class with complex annotations (10m)
Meysam Shamsi (LIUM)
Lara Gauder (University of Buenos Aires)
Marie Tahon (LIUM)
Multimodal Audio-Language Model for Speech Emotion Recognition (10m)
Jaime Bellver (Universidad Politécnica de Madrid)
Fernando Fernández-Martínez (Universidad Politécnica de Madrid)
Luis Fernando D'Haro (Speech Technology and Machine Learning Group - Universidad Politécnica de Madrid)
IRIT-MFU Multi-modal systems for emotion classification for Odyssey 2024 challenge (10m)
Adrien Lafore (IRIT)
Jérôme Farinas (IRIT)
Sebastião Quintas (IRIT, Université de Toulouse, CNRS, Toulouse, France)
Hervé Bredin (CNRS)
Thomas Pellegrini (IRIT)
Isabelle Ferrané (IRIT - University of Toulouse)
Adapting WavLM for Speech Emotion Recognition (10m)
Daria Diatlova (VK)
Anton Udalov (VK Lab)
Vitalii Shutov (VK)
Egor Spirin (VK Lab)
MSP-Podcast SER Challenge 2024: L'antenne du Ventoux Multimodal Self-Supervised Learning for Speech Emotion Recognition (10m)
Jarod Duret (LIA)
Yannick Estève (LIA - Avignon University)
Mickael Rouvier (LIA - Avignon University)
12:20 - 12:50
Panel discussion with the participants of the challenge
Friday 21st
8:30 - 9:50
Oral Presentations: Speaker and Language Recognition
Low-resource speech recognition and dialect identification of Irish in a multi-task framework (20m)
Liam Lonergan (Trinity College Dublin)
Mengjie Qian (Cambridge University)
Christer Gobl (Trinity College Dublin)
Ailbhe Ni Chasaide (Trinity College Dublin)
Normalizing Flows for Speaker and Language Recognition Backend (20m)
Amrutha Prasad (Idiap Research Institute)
Petr Motlicek (Idiap)
Srikanth Madikeri (Idiap)
Joint Language and Speaker Classification in Naturalistic Bilingual Adult-Toddler Interactions (20m)
Satwik Dutta (The University of Texas at Dallas)
Iván López-Espejo (University of Granada)
John Hansen (Univ. of Texas at Dallas)
MAGLIC: The Maghrebi Language Identification Corpus (20m)
Karen Jones (University of Pennsylvania)
Stephanie Strassel (Linguistic Data Consortium)
Break 9:50 - 10:20
10:20 - 11:20
Keynote 5
Supervised learning with deep neural networks has brought phenomenal advances to speech recognition systems, but such systems rely heavily on annotated training datasets. On the other hand, humans naturally develop an understanding of the world through multiple senses even without explicit supervision.
We attempt to mimic this human ability by leveraging the natural co-occurrence between audio and visual modalities. For example, a video of someone playing a guitar co-occurs with the sound of a guitar. Similarly, a person’s appearance is related to the person’s voice characteristics, and the words that they speak are correlated to their lip motion.
We use unlabelled audio and video for self-supervised learning of speech and speaker representations. We will discuss the use of the learnt representations for speech-related downstream tasks such as automatic speech recognition, speaker recognition and lip reading.
Joon Son Chung
11:20 - 12:00
Oral Presentations: Applications and Multimedia
Optimizing Auditory Immersion Safety on Edge Devices: An On-Device Sound Event Detection System (20m)
Reza Amini Gougeh (McGill University)
Zeljko Zilic (McGill University)
Cross-Modal Transformers for Audio-Visual Person Verification (20m)
Gnana Praveen Rajasekhar (Computer Research Institute of Montreal)
Jahangir Alam (Computer Research Institute of Montreal (CRIM), Montreal (Quebec) Canada)
12:00 - 12:20