Using Machine Learning Principles to Understand Song Popularity
CS578/STAT590: Introduction to Machine Learning
Fall 2014
Final Project Report
Stephen Mussmann, John Moore, Brandon Coventry
Handed In: December 20
Can Lyrical Content Predict Song Trajectories on the Billboard 100?
Music is an important part of popular culture, reflecting the views and attitudes of the
general populace. The music industry is an economic giant, bringing in 7 billion dollars
in revenue in 2013. The standard metric of pop song success is the Billboard
“Hot” 100, a weekly ranking of the top 100 pop songs. The ability to predict song
rankings on the charts would be extremely informative to the music community. In
this project, we attempt to use the lyrics of a song to predict its ranking for
the next week. This is accomplished using a two-step process. First, song lyrics are
used to predict song popularity trajectory labels. Second, we use the trajectory labels
to predict song rankings. In this report, we begin by discussing the nature and
extraction of the Billboard 100 dataset and the song lyrics dataset. Next, we discuss
the lyrical features and models used to predict trajectory labels of songs on the Billboard
100. We then discuss the trajectories in the Billboard 100 dataset and the time
series modeling of them. Finally, we combine the two steps to create a full algorithm.
Results show marginal increases in predictive power and emphasize the difficulty of
predicting complex human interactions.
Since the late 1940s, Billboard has been publishing a list of the top or “hot” songs of the year.
This list persists to the present day, and Billboard now publishes its top 100 songs each week. Billboard
is highly regarded as one of the epicenters of popular music rankings and is thus integral
to an artist’s success. The music industry is huge economically, with 7 billion dollars in
revenue in 2013 [5]. From a purely monetary standpoint, artists and recording labels should
be interested in predicting song rankings and trajectories to ensure songs generate the most
revenue. Music, however, is intensely varied, with much divergence between
individual tastes. Music has several aspects, including rhythm and beat, lyrical content, and
melodic content. Thus, music is very difficult to model. A study by Chon et
al. [3] illustrates this difficulty: it attempted to predict the trajectories of albums
on the Billboard jazz charts using current and previous album sales and nearest neighbor
metrics. The results failed to show that album sales had good predictive power, most likely
due to chart complexity. Therefore, models with better predictive power are desired. Henard
et al. [6] showed that lyrical content can be a strong predictor of whether or not a song will
reach the number one position, making it an ideal choice to include in the chart trajectory model.
This project consists of two steps. First, we use lyrical content to predict trajectory
labels. In our case, trajectory labels are simply binary labels that indicate whether or not
a song matches a given trajectory attribute (e.g., whether the song reaches the top 50). Second, we use these
trajectory labels to predict song rankings. The goal of this project is to determine whether lyrical
content can aid in the prediction of rankings.
We generate a new dataset using the 500GB Million Song Dataset (MSD) [5,6], which
contains song metadata up to the year 2008. To obtain relevant lyrical information, MSD was
used in conjunction with the musiXmatch (mXm) dataset [7]. Both datasets can be
accessed from the Python programming language using SQL libraries and calls. While MSD
allows access to metadata for a large variety of songs, copyright issues limit the lyrical content
available. To meet legal requirements, the mXm dataset supplies lyrics in a “bag-of-words”
format for 237,662 tracks. Because of this format, we are only able to generate
features in which word order is not considered. To generate feature sets from the data, the entire
MSD was searched for songs on the charts between 1999 and 2008. We chose this span of
time in order to increase the size of our generated dataset. Although music trends could have
changed over the ten-year span, this is accounted for by the “year” feature of our feature
set, which is discussed later. Once a song was found, the mXm dataset was queried for
its lyrics. If the lyrics were present, the song’s lyrical features and metadata were added
to our generated dataset. If the lyrics were not present, the song was excluded. For
this reason, a total of 940 songs were included in our new dataset, although there were 3732
songs in our time period. Since not all songs are included in our generated dataset, this
could potentially impede our ability to predict trajectory labels. However, our results show
later that even with missing data, our method still improves ranking prediction.
Lyrical Analysis
Our features combine metadata from MSD with derived natural language features.
Natural language processing was largely done using the NLTK toolkit for Python [2],
with sentiment analysis done using the TextBlob library [12]. A study by
Danescu-Niculescu-Mizil et al. [4] has shown that memorability in movie quotes stems from word
scarcity, i.e., phrases are more memorable if they are “rare.” We hypothesize that this is
also the case with song lyrics and may affect song placement on the Billboard Hot 100. As
discussed earlier, Henard and Rossetti [6] showed that lyrical content was a strong predictor
of whether or not a song would hit number one on the Billboard Hot 100. Therefore, our
features must capture the memorability of a song by looking at the scarcity of its lyrics. At
each stage, stop words were removed from analysis.
Figure 1: Top ten most frequent lyrics present in our training set
First, the term frequency-inverse document frequency (TF-IDF) was calculated. The TF-IDF
metric quantifies the relative importance of a word to a document and is calculated as follows:
$$\text{TF-IDF} = (\#\text{ appearances of a word in a song}) \times \log\left(\frac{\text{total \# of songs}}{\text{total \# of songs containing the word}}\right)$$
This was calculated for each word in the song dataset. Intuitively, the memorability of a
song should be dependent on the contribution of individual words, and a song containing
more rare words will have a higher memorability.
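As an illustration, the TF-IDF calculation can be sketched directly on bag-of-words counts. This is a minimal sketch rather than the code used in the project: the function name `tf_idf`, the toy songs, and the choice of the natural logarithm are all assumptions, since the report does not state the log base.

```python
import math
from collections import Counter

def tf_idf(songs):
    """Compute TF-IDF for every word of every song.

    songs: dict mapping a song id to a Counter of word -> count
           (a bag of words, so word order is ignored).
    Returns a dict mapping song id -> {word: TF-IDF score}.
    """
    n_songs = len(songs)
    # Document frequency: the number of songs that contain each word.
    doc_freq = Counter()
    for counts in songs.values():
        doc_freq.update(counts.keys())
    return {
        song_id: {
            # (# appearances in this song) * log(total songs / songs with word)
            word: count * math.log(n_songs / doc_freq[word])
            for word, count in counts.items()
        }
        for song_id, counts in songs.items()
    }

# Hypothetical toy bags of words, not data from the project.
songs = {
    "song_a": Counter({"love": 4, "time": 1}),
    "song_b": Counter({"love": 2, "yeah": 6}),
    "song_c": Counter({"night": 3, "time": 2}),
}
scores = tf_idf(songs)
```

In this toy example, "yeah" appears in only one of the three songs, so each of its occurrences is weighted by log(3), whereas "love" appears in two songs and is weighted by the smaller factor log(3/2).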
While individual word frequencies should carry a great deal of information about a song’s
memorability, we must also account for the memorability of the
entire song itself. Therefore, we need a metric that combines the contributions of all the lyrics
in a song. Shannon entropy, which is a measure of variability, provides a useful metric for
this. Shannon entropy, henceforth shortened to entropy, quantifies the variability of a set of
random variables. Qualitatively, entropy should allow us to determine whether a song as a whole
is memorable. Songs with a higher entropy should correspond to the presence of more
rare words, which should theoretically make the song more memorable. Entropy is calculated as
$$E(S) = -\sum_{i} p_i \cdot \log_2(p_i)$$
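The entropy feature can be sketched as follows, using the standard Shannon definition with a base-2 logarithm. The report does not define p_i explicitly, so this sketch assumes p_i is the relative frequency of the i-th distinct word in a song's bag of words; the function name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(word_counts):
    """Shannon entropy (in bits) of a song's word distribution.

    word_counts: Counter of word -> count for one song.
    Assumption: p_i is the relative frequency of word i in the song.
    """
    total = sum(word_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in word_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy

# Hypothetical examples: a repetitive song and a more varied one.
repetitive = Counter({"yeah": 8})
varied = Counter({"love": 2, "time": 2, "night": 2, "heart": 2})
```

Under this assumption, a song that repeats a single word has zero entropy, while a song that spreads its lyrics evenly over four words has an entropy of 2 bits.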
Next, we defined several features that look at the lyrical content and sentiment of a song.
From a cursory listening to songs currently in the top 10, several themes, such as love, desire,
and time, are very frequently present. Figure 1 shows the ten most frequently
encountered words in our time period of interest. The top ten ranges from words with
very clear meanings, such as love and time, to non-informational words such as yeah. This
supports the general observation that songs on the top 100 are not entirely full of meaning. To
help classify song lyrical content, we first looked at the frequency of the three most used words
in a given song. Second, we looked at the number of words that are between 3 and 8 letters
in length. Finally, we computed the average word length in a given song. Longer words
are usually associated with more complex meanings and thus will carry useful contextual
information.
Figure 2: Sentiment Analysis Distributions. (a) Distribution of Polarity. (b) Distribution of Subjectivity.
Figure 3: Sentiment Polarity vs Year on Chart
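The three word-level features described in this section can be sketched from a bag of words as shown here. The report does not say whether counts are taken over word occurrences or over distinct words, so this sketch assumes occurrences, and it assumes the "frequency of the top three words" is their combined share of all occurrences. The function name `word_features` and the dictionary keys are hypothetical.

```python
from collections import Counter

def word_features(word_counts):
    """Word-level lyrical features for one song (stop words already removed).

    word_counts: Counter of word -> count.
    All quantities are weighted by occurrences (an assumption).
    """
    total = sum(word_counts.values())
    if total == 0:
        return {"top3_frequency": 0.0, "mid_length_words": 0, "avg_word_length": 0.0}
    # Combined share of the three most used words.
    top3 = sum(count for _, count in word_counts.most_common(3))
    # Number of word occurrences with 3 to 8 letters (inclusive).
    mid_length = sum(c for w, c in word_counts.items() if 3 <= len(w) <= 8)
    # Average word length over all occurrences.
    avg_length = sum(len(w) * c for w, c in word_counts.items()) / total
    return {
        "top3_frequency": top3 / total,
        "mid_length_words": mid_length,
        "avg_word_length": avg_length,
    }

# Hypothetical toy song.
features = word_features(Counter({"love": 4, "yeah": 3, "time": 2, "oh": 1}))
```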
Next, we hypothesized that song sentiment might play a role in a song’s performance. Music
is traditionally an outlet for social commentary, where periods of economic depression or social
unrest are reflected in popular songs. We calculated song sentiment polarity, the relative
positivity or negativity of a song, as well as song subjectivity. Figures 2a and 2b show
the distributions of songs according to polarity and subjectivity. Figure 3 shows the
average sentiment polarity per year. For each year, a two-sample t-test was run to assess the
difference between the baseline year, in our case 1999, and the current year. Interestingly,
sentiment polarity was depressed between 2002 and 2005 (p < 0.05). There was
also a significant decrease in sentiment polarity from baseline in 2008 (p < 0.05).
Qualitatively, this might reflect national trends in social trajectory or economic downturns
as expressed in the artists’ words, and sentiment may therefore be a useful component of chart prediction. Song
subjectivity remained statistically unchanged.
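The year-versus-baseline comparison can be illustrated with a two-sample t statistic. The report does not say which variant of the t-test was used, so this sketch assumes Welch's unequal-variance form; the function name `welch_t` and the polarity values are hypothetical. Only the statistic and the approximate degrees of freedom are computed here, since converting them to a p-value requires the t distribution from a statistics library.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic and approximate degrees of freedom.

    Each sample needs at least two observations with non-zero variance.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation.
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, dof

# Hypothetical polarity scores for songs in a baseline year and a later year.
baseline_year = [0.20, 0.25, 0.15, 0.30, 0.22]
later_year = [0.10, 0.12, 0.05, 0.18, 0.09]
t_stat, dof = welch_t(later_year, baseline_year)
```

A negative `t_stat` here indicates that the later year has lower average polarity than the baseline year.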
While we hypothesize that lyrical content is important for song ranking, we know that
other variables also contribute. To account for these, metadata from the MSD was utilized.
Parameters included artist familiarity, artist hotness, the year the song was released, the
energy and tempo of a song, and the overall “loudness” of a song. We judged these
features most likely to have a non-trivial role in chart placement.
Trajectory Label Prediction
In order to use our features effectively, we iteratively refined our feature sets until we reached
feature sets that were predictive of a few trajectory labels. Every feature value is continuous,
so we used Weka’s [7] built-in discretization method to threshold features into different bins.
This procedure optimizes the number of equal-width bins.
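Equal-width discretization itself is simple to illustrate. This sketch is not Weka's implementation: it takes the number of bins as a fixed argument and does not reproduce the step that optimizes the bin count. The function name `equal_width_bins` and the sample values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map continuous values to bin indices 0..n_bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls in a single bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # min(...) keeps the maximum value in the last bin instead of a new one.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical values of a continuous feature in [0, 1].
familiarity = [0.0, 0.1, 0.5, 0.9, 1.0]
bins = equal_width_bins(familiarity, n_bins=4)
```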
On each iteration, we predicted each of 24 different trajectory labels and evaluated
whether we needed to use more or fewer features. First, we tried using only 4 features:
artist familiarity, danceability, sentiment polarity, and sentiment subjectivity.
On this first iteration, trajectory labels were not predicted well: our predictions
using 10-fold cross validation resulted in accuracies at or below the baseline of a given class
label. A given label’s baseline is defined as the frequency of its most common class divided by the
total number of examples. Intuitively, the baseline is the maximum accuracy possible when
making a prediction independent of the data. On our second iteration, we added TF-IDF
features with stop words pruned and the rest of our features that capture lyrical content.
We tested our algorithms using this feature set, which ended up containing about 3800 features.
Again, in this high-dimensional feature space, none of our models’ accuracies reached the baselines.
On our third iteration, we used a feature selection algorithm known as correlation-based
feature selection. We chose this method for its simplicity, speed, and ease of access through
the Weka toolkit. [1]
We ran this feature selection algorithm separately for each class label, since
selection depends on the class label. Each resulting feature set contained between 20 and 200
features. We could then run various algorithms on our new feature sets, and some labels were
finally predicted about 10%-20% above our baseline, as seen in Figure 4. The models refer to
whichever model had the highest accuracy for a given label, and label attributes were selected if the
accuracy of one of our models showed a 3% gain over the baseline accuracy, since
any gain above 3% is most likely not due to noise.
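The baseline and the 3% selection rule can be sketched as follows. The function names and the accuracy values are hypothetical, and the sketch assumes the gain is measured as an absolute difference in accuracy.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy of always predicting the most common class of a label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

def labels_beating_baseline(accuracies, baselines, min_gain=0.03):
    """Return the trajectory labels whose model accuracy exceeds the
    baseline by at least min_gain (absolute accuracy, an assumption)."""
    return sorted(
        label for label, acc in accuracies.items()
        if acc - baselines[label] >= min_gain
    )

# Hypothetical accuracies for two trajectory labels, not results from the report.
baselines = {"label_a": 0.70, "label_b": 0.60}
accuracies = {"label_a": 0.80, "label_b": 0.61}
selected = labels_beating_baseline(accuracies, baselines)
```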
We used naive Bayes with Laplacian smoothing, C4.5 decision trees with default parameters,
and a stack of naive Bayes, decision forests, and AdaBoosted decision stumps in Weka.
Stacking is a technique that combines many classifiers into one by weighting each
classifier’s class label decision through the use of a meta classifier. In our case, logistic regression was
our meta classifier. Standard (default) parameters were used in Weka for each algorithm.
Naive Bayes did surprisingly well, obtaining some of our highest prediction accuracies.
Stacking was on par with naive Bayes and predicted with similar accuracies, while
decision trees had slightly worse accuracies. This may be due to decision trees overfitting,
since even after the dimensionality was reduced, decision trees still had to deal with a large
(200) feature space in some cases. Naive Bayes, by contrast, would be less likely to overfit.
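To make the naive Bayes step concrete, here is a minimal categorical naive Bayes with Laplace (add-one) smoothing over discretized features. It is an illustrative sketch and not Weka's implementation; the class name `NaiveBayes`, its methods, and the toy rows are hypothetical.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Categorical naive Bayes with Laplace smoothing (alpha = 1 by default)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, rows, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.n = len(labels)
        # Distinct values seen for each feature (used in the smoothing term).
        self.values = [set(row[j] for row in rows) for j in range(len(rows[0]))]
        # counts[(j, c)][v]: rows of class c whose feature j equals v.
        self.counts = defaultdict(Counter)
        for row, c in zip(rows, labels):
            for j, v in enumerate(row):
                self.counts[(j, c)][v] += 1
        return self

    def predict(self, row):
        best, best_score = None, -math.inf
        for c in self.classes:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.class_counts[c] / self.n)
            for j, v in enumerate(row):
                num = self.counts[(j, c)][v] + self.alpha
                den = self.class_counts[c] + self.alpha * len(self.values[j])
                score += math.log(num / den)
            if score > best_score:
                best, best_score = c, score
        return best

# Hypothetical discretized features; here the first feature determines the label.
rows = [(0, 1), (0, 0), (1, 1), (1, 0)]
labels = [1, 1, 0, 0]
model = NaiveBayes().fit(rows, labels)
```

The smoothing term keeps a feature value that never co-occurred with a class from driving that class's probability to zero, which matters when only a few hundred songs are available.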
Figure 4: Our Models vs Baselines (x-axis: Trajectory Label Number, for labels #2, #3, #5, #9, #15, and #19)
Trajectory Extraction and Characterization
We chose to use song data from 1999-2008 from the Billboard Hot 100 to ensure that our
training set had an adequate number of songs with lyrics. Care was taken to ensure that the
population of training data consisted of relatively recent songs. The Hot 100 for each week
was scraped from the Billboard website and consists of a ranked list of songs for
that week. To meet the goal of ranking projection, song data was transformed into trajectory
information: if a song was on the chart for a series of consecutive weeks, the
rankings for each of those weeks were stored as a trajectory.
Importantly, if a song was on the charts at the beginning or the end of our time period, we
do not include it in trajectory analysis. In this case, it is most likely that the full trajectory
extends into a period of time for which we do not have information. Another concern is the
inspection paradox, which is more likely to remove long running trajectories, adding a skew
to our dataset. However, because we are examining ten years of data and most songs are on
the charts for a significantly shorter time, the skew should be minimal.
Notably, in some cases, a song appears on the chart for a few consecutive weeks, disappears, then reappears later. For these cases, the song was represented as multiple trajectories,
one for each appearance on the chart.
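The extraction rules above can be sketched as follows: a trajectory is a run of consecutive weeks on the chart, a gap starts a new trajectory, and runs that touch the first or last week of the period are dropped. The input format (one dict per week) and the function name `extract_trajectories` are assumptions made for illustration.

```python
def extract_trajectories(charts):
    """Split weekly charts into per-song trajectories.

    charts: list of dicts, one per week in chronological order,
            each mapping a song to its rank that week.
    Returns a list of (song, start_week, ranks) tuples.
    """
    open_runs = {}   # song -> (start_week, [ranks]) for runs still on the chart
    finished = []
    for week, chart in enumerate(charts):
        # Close the run of every song that dropped off the chart this week.
        for song in [s for s in open_runs if s not in chart]:
            finished.append((song, *open_runs.pop(song)))
        # Extend existing runs and open new ones.
        for song, rank in chart.items():
            if song not in open_runs:
                open_runs[song] = (week, [])
            open_runs[song][1].append(rank)
    # Runs still open touch the last week and are discarded; runs that
    # began in week 0 touch the first week and are discarded as well.
    return [(song, start, ranks) for song, start, ranks in finished if start > 0]

# Hypothetical six-week chart history with four songs.
charts = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 5},
    {"c": 4},
    {"b": 9, "c": 3},
    {"b": 8},
    {"d": 7},
]
trajectories = extract_trajectories(charts)
```

In this toy history, song "b" leaves and re-enters the chart, so its second appearance becomes a separate trajectory, while its first appearance is dropped because it touches the first week.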
Using the above method, 3732 trajectories were extracted. Figure 5a displays all of the
trajectories overlaid. Interestingly, there is a large empty block in the upper right
corner of the plot. It appears that Billboard removes from the Hot 100 any song that has been on the chart for
more than 20 weeks and falls below rank 50. In order to show individual trajectory paths,
the trajectory data was sampled, and every 100th trajectory is displayed in Figure 5b.
Figure 5c shows the histogram of the lengths of time songs are on the chart. Note the anomaly
at length 20 caused by the phenomenon mentioned above. The small number of
songs with a length greater than 50 weeks were truncated to 50 weeks.
Intuitively, many songs exhibit a “U” shape, starting with a low ranking, advancing
up the chart, and then falling off the chart after a certain period of time. Figure 5d shows
the frequency of song rankings. The green line plots the frequencies of the ranking at the
beginning of a song’s trajectory, the blue line plots the frequencies of the ranking at the
peak of a song’s trajectory, and the red line plots the frequencies of the ranking at the end
of a song’s trajectory. We see an anomaly in the red line around rank 50 because of the
phenomenon of removing songs lower than rank 50 after 20 weeks. It is interesting to note
that the peak frequencies are approximately uniform.
Figure 5: Characterization of Trajectories. (a) All Trajectories. (b) Every 100 Trajectories. (c) Frequencies of Trajectory Lengths. (d) Frequencies of Rankings.
Feature Labels
As mentioned above, the trajectory labels act as an intermediate
between lyric feature extraction and ranking prediction. To accomplish this, 24 binary trajectory
labels were calculated. A label should provide information that aids in prediction and
should itself be predictable from lyrical content. Each of these issues will be discussed
later. Table 1 lists the label attributes.
Trajectory Segmentation
Initially, a parametric model was used to predict trajectories, as it was hypothesized that the
model could take advantage of the “U” structure. However, the trajectories do not seem to
fit a global parametric model. Some songs reach their peak much sooner and some
reach their peak much later. Some trajectories stay at their peak for many weeks and some
stay at their peak ranking for only one week.
Label Numbers
Label Meaning
The song is in the top {10,20,50} for a week
The song spends {5,10,20} weeks in the top {10,20,50}
The song spends {5,10,20} weeks on the chart
The song starts in the top {25,50,75}
The song reaches its peak after week {5,10}
The average change in the song’s ranking is greater than {5,10}
The median change in the song’s ranking is greater than {5,10}
Table 1: Trajectory Labels
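The 24 binary labels in Table 1 can be computed from a single trajectory along these lines. The table leaves several details open, so this sketch makes explicit assumptions: "spends N weeks" means at least N weeks, "change" means the absolute week-to-week change in rank, and the peak is the first week at the best rank. The function name and the label keys are hypothetical and do not correspond to the label numbers used in the report.

```python
from statistics import mean, median

def trajectory_labels(ranks):
    """Compute 24 binary labels for one trajectory.

    ranks: list of weekly ranks, where rank 1 is the top of the chart.
    """
    labels = {}
    for top in (10, 20, 50):
        weeks_in_top = sum(1 for r in ranks if r <= top)
        labels[f"in_top_{top}"] = weeks_in_top >= 1
        for weeks in (5, 10, 20):
            labels[f"{weeks}_weeks_in_top_{top}"] = weeks_in_top >= weeks
    for weeks in (5, 10, 20):
        labels[f"{weeks}_weeks_on_chart"] = len(ranks) >= weeks
    for top in (25, 50, 75):
        labels[f"starts_in_top_{top}"] = ranks[0] <= top
    # 1-indexed week in which the best rank is first reached.
    peak_week = ranks.index(min(ranks)) + 1
    # Absolute week-to-week changes; [0] covers one-week trajectories.
    changes = [abs(b - a) for a, b in zip(ranks, ranks[1:])] or [0]
    for threshold in (5, 10):
        labels[f"peak_after_week_{threshold}"] = peak_week > threshold
        labels[f"mean_change_over_{threshold}"] = mean(changes) > threshold
        labels[f"median_change_over_{threshold}"] = median(changes) > threshold
    return labels

# Hypothetical eight-week trajectory.
example = trajectory_labels([60, 40, 20, 8, 5, 9, 30, 70])
```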
In the end, we chose a more feasible approach. To predict a given week, we used only
the previous few weeks. This approach is inspired by the autoregressive model used
in time series analysis. Thus, to extract the data for prediction, we chose small segments of
the trajectories and attempted to use the first points to predict the final point. To maximize
the number of weeks that can be predicted, only a small sample of past weeks
was used. This is because, unless a song is a massive success, its time on the
chart is relatively short, and using too many past weeks would not allow for long-term future
predictions. Thus, we try using 1, 2, and 3 previous weeks for prediction.
We also attempt to predict the time a song leaves the chart. Initially, a song that had
left the chart was given a ranking of 101. However, since songs that are not in the top 50
after 20 weeks seem to be removed, this exit ranking gave rise to a large amount of
error. To correct for this, we changed our trajectory segmentation program in two key ways.
First, if we are at a point on a trajectory that is after 20 weeks and the song leaves the chart,
we assign the ranking 51 instead of 101. Second, we do not include segments that would
involve predicting week 21.
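The segmentation rules can be sketched as follows. This is an illustration under stated assumptions, not the project's segmentation program: the function name `segment_trajectory` is hypothetical, the time feature is taken to be the number of weeks already observed, and "after 20 weeks" is read as a trajectory of at least 20 weeks.

```python
def segment_trajectory(ranks, n_prev=2):
    """Cut one trajectory into (previous_ranks, weeks_on_chart, target) segments.

    ranks: weekly ranks of one trajectory, in order.
    n_prev: number of previous weeks used as features.
    """
    # The week after the trajectory ends is the exit week: rank 51 if the
    # song has been on the chart for at least 20 weeks, otherwise rank 101.
    exit_rank = 51 if len(ranks) >= 20 else 101
    padded = list(ranks) + [exit_rank]
    segments = []
    for i in range(n_prev, len(padded)):
        target_week = i + 1          # 1-indexed week being predicted
        if target_week == 21:
            continue                 # segments predicting week 21 are excluded
        segments.append((tuple(padded[i - n_prev:i]), i, padded[i]))
    return segments

# Hypothetical short trajectory: two segments, the last one predicting the exit.
short = segment_trajectory([90, 80, 70])
```

For the three-week trajectory above, the second segment uses ranks (80, 70) to predict the exit rank of 101.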
Trajectory Modeling
For predicting the trajectories, two models and a baseline were developed. Our baseline
is the random walk model, in which our predicted ranking for a week is the previous week’s
ranking. We also try the autoregressive model, which is equivalent to a linear
regression model on the trajectory segments. Finally, we use the k-nearest neighbors (kNN) model.
In addition to using the previous weeks’ rankings, we also considered using the number of
weeks since the song entered the chart as a feature. In a time series sense, this allows our
models to be time dependent. As autoregression is a windowed linear regression, a squared
error cost function is used. We used 10-fold cross validation to measure performance. Results
can be seen in Table 2. For preliminary results, we use a value of k = 20 for the kNN model.
In all cases, casting our model as a function of time improves results. We can also see
that our two models do not improve much relative to the baseline from 2 weeks to 3 weeks.
The baseline was altered to account for the number of previous weeks used for prediction.
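The random walk baseline and the kNN model can be sketched on trajectory segments as follows. Several choices here are assumptions, since the report does not specify them: neighbors are found with unscaled Euclidean distance, the kNN prediction is the mean of the neighbors' targets, and error is reported as root mean squared error. All function names and the toy segments are hypothetical.

```python
import math

def random_walk_predict(prev_ranks):
    """Baseline: next week's rank equals the most recent rank."""
    return prev_ranks[-1]

def knn_predict(train, query, k=20):
    """Mean target of the k training segments closest to the query.

    train: list of (feature_tuple, target) pairs; a feature tuple holds the
           previous ranks and, optionally, the number of weeks on the chart.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return sum(target for _, target in nearest) / len(nearest)

def rmse(predictions, targets):
    """Root mean squared error between predictions and true ranks."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical training segments: (two previous ranks) -> next rank.
train = [((10, 12), 14), ((50, 48), 46), ((11, 13), 15), ((90, 95), 100)]
knn_guess = knn_predict(train, (10, 13), k=2)
walk_guess = random_walk_predict((10, 13))
```

With k = 2, the query (10, 13) is matched to the two segments that also sit near rank 10 and are moving down the chart, so kNN predicts a further drop, whereas the random walk simply repeats the last rank.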
1wk & time
2wk & time
3wk & time
Random Walk 9.2600
Autoregressive 9.2377
Table 2: Trajectory Modelling Results
To ease comparison, the predicted weeks were made the same for each dataset, and the
results are displayed in Table 3. The time-dependent model was used in all cases for this table
because of its improved performance.
Random Walk 8.1040 8.1067 8.0962
7.5420 7.1850 7.1662
Autoregressive 7.7251 7.3513 7.2992
Table 3: Trajectory Modelling Results, Same Predicted Weeks
While the increase from 1 week to 2 weeks makes a significant improvement, the increase
from 2 weeks to 3 weeks does not and actually hurts the accuracy of kNN. Further, since
using 3 weeks limits us by not allowing us to predict the third week, we will opt to use the
trajectory segments with a history of 2 rankings and the time. Next, we will attempt to
choose the optimal k using 10-fold cross validation, see Table 4.
Error 7.8012 7.6261 7.5929 7.5798 7.5736 7.5670 7.5804
Table 4: Optimizing the Value of K in kNN
Thus, we see that k is optimal around the values of 40, 50, and 75. In the future, when
we utilize the trajectory labels, we will be using smaller effective datasets. For this reason,
we will lean towards smaller values of k. Thus, we choose k = 40.
Utilizing Trajectory Labels
Recall that our goal is to use the trajectory labels predicted from the features to aid in
predicting the ranking of a song for the next unknown week. To utilize the labels, for each
label we train two separate models, one for each value of the binary label. Then, at prediction time, we
use the trajectory label to decide which learned model makes the prediction. We repeat this
process for each of the 24 trajectory labels and show the results in Table 5.
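The idea of training one model per label value and dispatching on the label at prediction time can be sketched as follows. For brevity, a one-feature least-squares line on the previous rank stands in for the regression model, which in the report uses two previous ranks and the time on the chart. All names and the toy examples are hypothetical.

```python
def fit_line(pairs):
    """Least-squares fit of target = a * x + b for a list of (x, target) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = sxy / sxx if sxx else 0.0
    return a, mean_y - a * mean_x

def fit_per_label(examples):
    """Train one model per value of a binary trajectory label.

    examples: list of (label, last_rank, next_rank) triples.
    Returns {label_value: (a, b)}.
    """
    models = {}
    for value in (False, True):
        subset = [(x, y) for label, x, y in examples if label == value]
        if subset:
            models[value] = fit_line(subset)
    return models

def predict(models, label, last_rank):
    """Use the trajectory label to pick which learned model predicts."""
    a, b = models[label]
    return a * last_rank + b

# Hypothetical data: songs with the label tend to climb, the others tend to fall.
examples = [
    (True, 20, 15), (True, 10, 5), (True, 30, 25),
    (False, 60, 70), (False, 80, 90),
]
models = fit_per_label(examples)
```

Splitting the data this way lets each model specialize, at the cost of a smaller effective training set per model.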
It appears that the eight trajectory labels that perform the best for both regression and
kNN are 4, 9, 13, 14, 15, 19, 22 and 24.
Trajectory Label
Table 5: Using the Trajectory Label to Improve Results
In the previous sections, we found trajectory labels that could be predicted from the lyrical
content (step one) and trajectory labels that could predict the next week’s ranking (step
two). We take the intersection of the trajectory label sets from steps one and two to form the set
of most predictive labels used in our results. These labels are 9, 15, and 19.
Note that we can only test on a subset of songs that had lyrics. Since the dataset is
smaller, we will re-evaluate the value of k. See Table 6. Once again, the objective is to find
the smallest k that does well, and since there is not much improvement after k = 25, we use
k = 25.
Error 7.4228 7.1130 7.0134 6.9850 6.9568 9.9493 6.9485 6.9408 6.9412 6.9628
Table 6: Optimizing the Value of K in kNN
Figure 6: Improvement using two-step prediction based on improvement for one-step predictions
Once again we use 10-fold cross validation to test the results, which can be seen in Table 7.
Our first observation is that the baseline error is lower. This could be because the songs
that have lyrics are more likely to stay on the charts longer and are, thus, easier to predict.
Secondly, it appears that kNN is not really affected while the regression model sees marginal
improvement, especially on label 19.
Although this is a marginal increase in the accuracy, we argue that it is not simply due to
randomness. Firstly, the differences in the accuracy are greater than the sampling variance
due to the cross validation. Secondly, improvement in the two-step process for a trajectory
label is closely tied to the effectiveness of the lyrics to predict that trajectory label and
the effectiveness of the trajectory label to predict the next ranking. A scatterplot of the
trajectory labels is shown in Figure 6, where the y-axis is the effectiveness of the lyrics to predict
the trajectory label, the x-axis is the effectiveness of the trajectory label, and the grayness
is the relative predictive power of the two-step algorithm. We use a gray scale scheme so
that the darker the point, the less predictive it is. As we can see, trajectory labels that are
effective for both steps in our process are lighter, indicating higher final accuracy.
Trajectory Label
Regression Regression using Lyrics kNN KNN using Lyrics
Table 7: Using the Trajectory Label to Improve Results
In this project, we attempted to use the lyrics of a song to predict song rankings in the
Billboard Hot 100. We introduced a two-step process in which we first predict a set of
trajectory labels given lyrical feature data. This feature data was refined iteratively to
obtain the most predictive features, and we found sets of features that are able to predict
some trajectory labels relatively well. For our second step, we used the trajectory label
predictions in a time series model to predict the next week’s ranking of a song. Random
walk was our baseline, and we used kNN and regression to predict rankings. We are able
to show that this two-step process does, indeed, very slightly improve ranking predictions.
Thus, lyrical information can aid in song ranking prediction.
