I make algorithms that understand sounds
Emotion is an important attribute of audio signals, whether speech or music. For speech, it can enrich speech analytics with emotional and behavioral content; for music, it enables a more representative and detailed description of the musical content, so that song retrieval and recommendation applications can be emotion-aware. Recognizing emotions directly from audio has therefore gained increasing interest in recent years. The demo described in this article shows how emotions in musical signals can be estimated using Python and simple ML models, and then, for each music signal, how the estimated emotions can be mapped to colors that control the lighting of the room where the music is played. Changing the room’s lighting obviously requires a smart lighting device: in this demo we are using the Yeelight color bulb.
Let’s start with a quick description of how we solve the “music to emotions” task.
Audio data and ground truth
Valence and energy (the term “arousal” is more common in the bibliography, especially for speech emotion recognition, but here we use “energy” to be aligned with the Spotify attributes) are dimensional representations of emotional states, initially adopted by psychologists as an alternative to discrete emotions (sadness, happiness, etc.), and they have become common in emotion recognition. Valence relates to the positivity (or negativity) of the emotional state, while energy (or arousal) is the strength of the respective emotion.
The Spotify valence and energy values are continuous in the (0, 1) range (valence ranging from negative to positive and energy from weak to strong). In this demo, we have chosen to quantize the valence and energy values into three discrete classes each: valence into negative, neutral and positive, and energy into weak, normal and strong. Towards this end, different thresholds have been used to maintain a relative class balance for both classification tasks.
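The quantization step can be sketched as follows. Note that the thresholds below are illustrative assumptions, not the exact values used in the demo; in practice they are tuned so that the three classes stay roughly balanced:

```python
# Map a continuous Spotify valence/energy value in (0, 1) to one of
# three classes. The thresholds are hypothetical placeholders; the real
# ones are chosen to keep the class distribution roughly balanced.

def discretize(value, low_thresh=0.4, high_thresh=0.6):
    """Return 0, 1 or 2 for the low / middle / high class."""
    if value < low_thresh:
        return 0
    if value < high_thresh:
        return 1
    return 2

VALENCE_NAMES = ["negative", "neutral", "positive"]
ENERGY_NAMES = ["weak", "normal", "strong"]

print(VALENCE_NAMES[discretize(0.82)])  # positive
print(ENERGY_NAMES[discretize(0.31)])   # weak
```

The same function serves both tasks; only the class names (and, in practice, the tuned thresholds) differ.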
In addition, as shown in the experiments below, the initial audio signals have been downsampled to 8K, 16K and 32K. As we show later, we have selected 8K, since the higher sampling frequencies do not significantly improve performance. Also, from each song, a fixed-size segment has been used to extract features and train the respective models. The segment size varies across the experiments presented below, but for this particular demo a default segment size of 5 seconds has been selected, as this is a rational segment size upon which music emotion predictions can be made: longer segments would mean the “user” has to wait longer for emotional predictions to be returned. On the other hand, as we will see below, shorter segments lead to higher emotion recognition error rates.
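The preprocessing described above (downsampling plus fixed-size segment extraction) can be sketched like this; the segment position (centre of the song) is an assumption for illustration, and a synthetic waveform stands in for a real audio file:

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

TARGET_SR = 8000   # selected sampling rate (8K)
SEGMENT_SEC = 5    # selected segment length in seconds

def preprocess(signal, orig_sr, target_sr=TARGET_SR, seg_sec=SEGMENT_SEC):
    """Downsample the signal and return a fixed-size segment
    (taken from the middle of the song, as an illustrative choice)."""
    # resample_poly needs an integer up/down ratio
    g = gcd(orig_sr, target_sr)
    x = resample_poly(signal, target_sr // g, orig_sr // g)
    n = seg_sec * target_sr
    start = max(0, (len(x) - n) // 2)
    return x[start:start + n]

# e.g. a 30-second synthetic "song" at 44.1 kHz
song = np.random.randn(30 * 44100)
seg = preprocess(song, 44100)
print(len(seg))  # 40000 samples = 5 s at 8 kHz
```

In the actual demo the signal would of course come from a decoded audio file rather than random noise.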
To sum up, our dataset consists of 5K music segments (from 5K songs), organized in 3 emotional energy classes (low, normal, high) and 3 valence classes (negative, neutral, positive), taken at various sampling frequencies and segment sizes. The final parameters selected for this demo are an 8K sampling frequency and a 5-second segment size. These samples are then used to train two models (one for energy and one for valence).
Features and Classifier Training
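A minimal sketch of this step might look like the following, assuming each 5-second segment has already been reduced to a fixed-length feature vector (e.g. statistics of short-term audio features) and assuming an SVM classifier; the feature dimensionality, SVM parameters and random data below are illustrative stand-ins, not the demo’s exact setup:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Random data standing in for real per-segment audio feature vectors:
rng = np.random.RandomState(0)
X = rng.randn(500, 68)        # 500 segments, 68 features (illustrative)
y = rng.randint(0, 3, 500)    # 3 classes, e.g. low / normal / high energy

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0)

# One such model is trained for energy and one for valence
clf = SVC(kernel="rbf", C=10, gamma="auto")
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(pred.shape)  # one class prediction per test segment
```

With real features, the same two-model pipeline yields the energy and valence predictions used throughout the rest of this article.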
Music emotion recognition performance results
How accurate can the two classifiers be for the selected parameters (5-second length, 8KHz sampling)? During training and parameter tuning a typical cross-validation procedure has been followed: the data has been split into train and test subsets and performance measures have been evaluated for each random train/test split. The overall confusion matrices for both classification tasks are shown below. Rows represent ground-truth classes and columns predictions; e.g. 5.69% of the data has been misclassified as “neutral”, while its true energy label was “low”. As expected, emotional energy is an “easier” classification task, with an overall F1 close to 70%, while the F1 for valence is around 55%. In both cases, the possibility of an “extreme” error (e.g. misclassification of high energy as low) is very low.
(energy, F1 = 67%)
(valence, F1 = 55%)
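The evaluation procedure described above can be sketched as follows, using scikit-learn’s cross-validation and metrics utilities; the random data and fold count are illustrative stand-ins for the real features and splits:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Random stand-in data; in the demo, X would hold the segment features
# and y the energy (or valence) class labels.
rng = np.random.RandomState(1)
X = rng.randn(300, 20)
y = rng.randint(0, 3, 300)

cm = np.zeros((3, 3))
f1s = []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True,
                              random_state=1).split(X, y):
    clf = SVC().fit(X[tr], y[tr])
    pred = clf.predict(X[te])
    # accumulate the confusion matrix over all folds
    cm += confusion_matrix(y[te], pred, labels=[0, 1, 2])
    f1s.append(f1_score(y[te], pred, average="macro"))

# rows: ground-truth classes, columns: predictions (as percentages)
cm_percent = 100 * cm / cm.sum()
print(np.round(cm_percent, 2))
print(round(float(np.mean(f1s)), 3))  # average macro F1 over folds
```

Expressing the accumulated confusion matrix in percentages of the whole dataset gives exactly the kind of per-cell error figure quoted above (e.g. the “low energy predicted as neutral” cell).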
Does signal quality (sampling rate) matter? The obvious expectation is that higher sampling frequencies would lead to better overall performance. However, experimental results with 8K, 16K and 32K sampling frequencies have shown that the average performance (over many segment lengths, from 5 to 60 seconds) does not change significantly. In particular, compared to the 8K signal quality, overall performance is just 0.5% better for 16K and 2% better for 32K. This justifies using 8K for this particular demo, as the aforementioned performance boost does not justify the computational cost (32K means 4 times more feature extraction computation than 8K signals). So the short answer is that the sampling frequency does not significantly affect the overall prediction accuracy.
How does segment length affect performance? Obviously, the longer the segment, the more information the classifier has to estimate music emotion. But how does segment length affect the overall F1 measure? The results in the following figure show that the overall performance (F1) for energy can even exceed 70% for 30-second segments, while valence seems to asymptotically reach 60%. Of course, the segment length cannot grow arbitrarily; it is limited by the application: in our case, where we want to automatically control our color bulbs based on the estimated emotion, a rational delay would be less than 10 seconds (as explained before, 5 seconds is the selected segment length).
Emotion to color
So we have shown above that we can build two models for predicting emotional energy and valence in 5-second music segments, with accuracies around 70% and 55% respectively. How can these two values (energy and valence) be used to change the color of the lighting in the room where the music is played?
To define a continuous 2D valence-energy plane where each point is mapped to a single color, we start from a discrete mapping of reference points to colors. As an example, anger, i.e. point (-0.5, 0.5) in the valence-energy plane, is mapped to red, and happiness (point (0.5, 0.5)) to yellow, etc. All intermediate points are assigned a linear combination of the basic color mappings, weighted by their Euclidean distance from the reference points. The result is the continuous 2D valence-energy color plane shown below. The horizontal axis varies in the negative–positive range (valence), while the vertical axis corresponds to emotional energy (weak to strong). This particular colormap is the default mapping of (valence, energy) value pairs to colors, based on a set of simple configurable rules, as described above (note: in the Python repo of this article, this can be changed by editing the “emo_map” dictionary in main.py).
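A simplified sketch of this interpolation follows. The reference points and colors below are a reduced, illustrative stand-in for the configurable “emo_map” dictionary (the real one may contain more emotions), and inverse-distance weighting is used as one concrete way to realize the distance-based linear combination described above:

```python
import numpy as np

# Illustrative reference (valence, energy) points and their RGB colors —
# a simplified stand-in for the "emo_map" dictionary in the repo.
emo_map = {
    (-0.5,  0.5): (255, 0, 0),    # anger     -> red
    ( 0.5,  0.5): (255, 255, 0),  # happiness -> yellow
    (-0.5, -0.5): (0, 0, 255),    # sadness   -> blue
    ( 0.5, -0.5): (0, 255, 0),    # calm      -> green
}

def emotion_to_color(valence, energy):
    """Blend the reference colors with weights inversely proportional
    to the Euclidean distance from (valence, energy)."""
    pts = np.array(list(emo_map.keys()), dtype=float)
    cols = np.array(list(emo_map.values()), dtype=float)
    d = np.linalg.norm(pts - np.array([valence, energy]), axis=1)
    if np.any(d < 1e-9):  # exactly on a reference point
        return tuple(int(c) for c in cols[np.argmin(d)])
    w = 1.0 / d
    w /= w.sum()
    return tuple(int(round(c)) for c in (w @ cols))

print(emotion_to_color(-0.5, 0.5))  # (255, 0, 0): pure red (anger)
print(emotion_to_color(0.0, 0.0))   # an even blend of all four colors
```

The resulting RGB triple can then be sent to the smart bulb to tint the room according to the music’s estimated emotion.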