Emotion Recognition Challenge

We are glad to invite you to participate in the Emotion Recognition Challenge, which compares emotion recognition systems submitted by teams from around the world. The challenge uses recordings from the MSP-Podcast corpus, which contains speech segments obtained from audio-sharing websites. The speaking turns have been perceptually annotated by at least five raters with categorical and attribute-based emotional labels.

This page was last updated on March 15.

The Tasks

The challenge consists of two independent tasks. Each team can participate in one or both.

Task 1 - Categorical emotion recognition

Classification across the eight emotional classes provided: anger, happiness, sadness, fear, surprise, contempt, disgust, and neutral state. The test set for the challenge has a balanced distribution across the emotional categories. 

Task 2 - Emotional attribute prediction 

Prediction of the emotional attributes for arousal (calm to active), valence (negative to positive), and dominance (weak to strong). The emotional attributes are annotated with continuous values between 1 and 7 for each dimension. 

Important Dates

The schedule is as follows:

Result submission deadline: March 1st, 2024 (AoE)

Paper submission deadline, describing the proposed methods: March 22nd, 2024

Organizers

How to participate? 

There is no registration fee. 

To register your team and obtain the full dataset, complete the following academic license and send it to msp-lab@utdallas.edu.

This includes your signature at the end of the third page.


Only one group per laboratory will be allowed. Participant groups enrolled several times using different names will be disqualified. 

Participants can submit their papers to our special session.

Submission

You can submit your results through the submission website.



On the two tabs, you can see a leaderboard that is automatically updated after submission.

After you submit, you get an email back with the results.

Each team can submit only three times for each task.

Below, you can see the .csv files that we expect to receive:

Submission for Task 1

The letters must be {A, C, D, F, H, N, S, U}. 

Any other letter will create an error.  

A = Anger, C = Contempt, D = Disgust, F = Fear, H = Happiness, N = Neutral, S = Sadness, U = Surprise.
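
Before uploading, it may help to sanity-check your file locally. Below is a minimal Python sketch, assuming hypothetical column names FileName and EmoClass (follow the sample submission file further below for the official format):

    import pandas as pd

    ALLOWED = {"A", "C", "D", "F", "H", "N", "S", "U"}

    def check_task1(path: str) -> None:
        # Assumed columns: FileName (test segment) and EmoClass (one of the 8 letters).
        sub = pd.read_csv(path)
        bad = set(sub["EmoClass"]) - ALLOWED
        if bad:
            raise ValueError(f"Invalid labels found: {sorted(bad)}")
        print(f"{len(sub)} predictions, all labels valid.")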


We decided to use the Macro-F1 score for task 1.


Macro-F1 score calculation: For each of the 8 emotion classes, compute precision and recall. The F1 score for each class is the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall). The macro-F1 score is the average of the per-class F1 scores, i.e., the sum of the 8 class F1 scores divided by 8.
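
The official scoring script is not reproduced on this page, but the computation above maps directly to scikit-learn. A minimal sketch, again assuming the hypothetical FileName and EmoClass columns:

    import pandas as pd
    from sklearn.metrics import f1_score

    LABELS = ["A", "C", "D", "F", "H", "N", "S", "U"]

    def macro_f1(reference_csv: str, submission_csv: str) -> float:
        # Assumed columns in both files: FileName and EmoClass.
        ref = pd.read_csv(reference_csv)
        sub = pd.read_csv(submission_csv)
        merged = ref.merge(sub, on="FileName", suffixes=("_true", "_pred"))
        # Per-class F1 = 2 * precision * recall / (precision + recall),
        # then averaged over the 8 classes ("macro" averaging).
        return f1_score(merged["EmoClass_true"], merged["EmoClass_pred"],
                        labels=LABELS, average="macro")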

Here is a sample submission file expected for Task 1

Submission for Task 2

The values are from 1 to 7. 

Values outside this range will create an error.

Arousal (1 = calm, 7 = active), Valence (1 = negative, 7 = positive), and Dominance (1 = weak, 7 = strong).
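
As with Task 1, a quick local check can catch out-of-range values before submission. A minimal Python sketch, assuming hypothetical column names FileName, Arousal, Valence, and Dominance (the sample submission file further below shows the official format):

    import pandas as pd

    def check_task2(path: str) -> None:
        # Assumed columns: FileName, Arousal, Valence, Dominance.
        sub = pd.read_csv(path)
        for col in ("Arousal", "Valence", "Dominance"):
            if not sub[col].between(1, 7).all():
                raise ValueError(f"{col} contains values outside the 1-7 range")
        print(f"{len(sub)} predictions, all values within range.")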


We will use the concordance correlation coefficient (CCC) for task 2. The ranking will be based on the average CCC across the three emotional attributes.
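
For reference, CCC measures agreement between predictions and reference labels by combining their correlation with penalties for differences in mean and variance. A minimal NumPy sketch of the metric and of the ranking criterion (the official scoring script is not reproduced here; the attribute names are illustrative keys):

    import numpy as np

    def ccc(y_true, y_pred):
        # Concordance correlation coefficient:
        # CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        cov = np.mean((y_true - y_true.mean()) * (y_pred - y_pred.mean()))
        return 2 * cov / (y_true.var() + y_pred.var() + (y_true.mean() - y_pred.mean()) ** 2)

    def average_ccc(reference, prediction):
        # Ranking criterion: mean CCC over the three emotional attributes.
        return float(np.mean([ccc(reference[a], prediction[a])
                              for a in ("Arousal", "Valence", "Dominance")]))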



Here is a sample submission file expected for Task 2

The Dataset

The training set has 68,119 speaking turns. The development set has 19,815 speaking segments from 454 speakers. The test set comprises 2,347 unique segments from 187 speakers, for which the labels have not been made publicly available. The segments for the test set have been curated to maintain a balanced representation based on primary categorical emotions. The protocol used to collect this corpus is described in this paper:


Reza Lotfian and Carlos Busso, "Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings," IEEE Transactions on Affective Computing, vol. 10, no. 4, pp. 471-483, October-December 2019.

Award

An award certificate and $1,000 will be given to the top-performing team for each task.


To qualify for the award, a paper needs to be submitted to and accepted at the conference.

Challenge Special Session at Odyssey 2024

The Emotion Recognition Challenge is a special session at Odyssey 2024. Please attend Odyssey 2024 and come to our poster to hear the challenge summary.

Participating teams whose papers are accepted will also present their work there, and the papers will be published in the workshop’s proceedings.

FAQ


Registering


As our team members are from different universities, we understand everyone needs a license. Do we need to provide a license from each university, or is one from the registrant sufficient?

Yes, every institution should have a license.


Should the license be signed?

Yes, at the end of the third page.


The license, under clause 6, requires researchers to send unpublished papers to UTD. Is this really needed?

No. The first 3 pages are a standard data transfer agreement signed by several US institutions to facilitate the sharing of data (all the legal terms have already been agreed upon). We cannot change any of these three pages. Any amendments to these first three pages are made in the attachments. If you check attachment 2, you will read:

"Pursuant to Section 6, Page 1, Recipient is not required to send publications or public disclosures to Provider for comment. Recipient can make publicly available the results of the Project without prior review or comment by Provider."

Therefore, you do not need to send papers to UTD. 



Use of SSL models


Would it be allowed to use public models, such as wav2vec and emotion2vec, for generating audio representations? Additionally, if transcripts are not provided, is it allowed to use ASR models to obtain the transcripts?

Yes, general audio representation models (e.g., wav2vec) are allowed.

No, models trained for emotion recognition (e.g., emotion2vec) are not.


Is it allowed to use a pre-trained speech self-supervised model for fine-tuning/feature extraction within the MSP-Podcast corpus?

Yes.


Can we use pre-trained text generative models such as ChatGPT?

Yes.


Data and labels


Would it be allowed to use additional training data in addition to the provided training set?

No, for fairness.


Is it allowed to use additional datasets for training (publicly available and described in the submitted solution) other than the MSP-Podcast corpus?

No, for fairness.


Would it be allowed to annotate the provided data with additional labels by ourselves, such as gender?

No. (By the way, gender is already included in the data, so you can use it.)


Is there a "No agreement" emotion class in the test3 set?

No, the test set has the 8 emotional classes and is balanced across them.


Is the test set part of the MSP-Podcast corpus? (If so, and since speech self-supervised models are allowed, we must not use models that saw the MSP-Podcast corpus during training.)

The test set for the challenge is the test3 set of the MSP-Podcast corpus, which first appears in version 1.11. Labels for this set have never been released.


Is the data you provide as organizers different from the original MSP-Podcast corpus?

The data does not include the test1 and test2 sets, and samples without a speaker ID have been removed to make the task speaker independent.


Is data augmentation allowed?

Yes. 

The use of augmentation datasets such as RIRS or MUSAN is also allowed.


Can an Automatic Speech Recognition system be used to retrieve and use linguistic information from the audio?

Yes, ASR and text-based approaches can be used.


Submission and Baseline


Is there a platform to submit the model? How will the evaluation proceed?

You only need to submit the results, using the formatting described under the "Submission" section.

On the two tabs, you can see a leaderboard that is automatically updated after submission. 

After you submit, you get an email back with the results.


What are the metrics that will be used?

We decided to use the Macro-F1 score for task 1, and the concordance correlation coefficient (CCC) for task 2.


What is the file format expected for the submission?

We are expecting .csv files with the labels or values for each file of the test set.  

Sample submission files are available for each task under the "submission" section.


Can you confirm that we have to submit our results between the 20th of February and the 1st of March, and not before the 20th?

You can start submitting very soon (you will receive an email from us). 

The last day that you can submit is March 1 AoE.

 

Is there a maximum number of submissions and, if yes, how many?

Yes, three per task.


Will we have access to the teams' rankings and scores during the submission period?

You can see the leaderboard. The table should update very soon after you submit your answer.  

https://lab-msp.com/MSP-Podcast_Competition/leaderboard.php


Is there an official baseline?

Yes, we ran a simple baseline, detailed in this document.

Contact

Do you have more questions? Please get in touch with us at this address: msp-lab@utdallas.edu