CS578/STAT590: Introduction to Machine Learning, Fall 2014
Final Project Report
Stephen Mussmann, John Moore, Brandon Coventry
Handed In: December 20

Using Machine Learning Principles to Understand Song Popularity
Can Lyrical Content Predict Song Trajectories on the Billboard 100?

Abstract

Music is an important part of popular culture, reflecting the views and attitudes of the general populace. The music industry is an economic giant, bringing in 7 billion dollars of revenue in 2013. The standard metric of pop-song success is the Billboard "Hot 100", a weekly ranking of the top 100 pop songs. The ability to predict song rankings on the charts would be extremely informative to the music community. In this project, we attempt to use the lyrics of songs to predict each song's ranking for the following week. This is accomplished in two steps. First, song lyrics are used to predict song popularity trajectory labels. Second, we use those trajectory labels to predict song rankings. In this report, we begin by discussing the nature and extraction of the Billboard 100 dataset and the song lyrics dataset. Next, we discuss the lyrical features and the models used to predict trajectory labels of songs on the Billboard 100. We then characterize the trajectories in the Billboard 100 dataset and describe time series models for them. Finally, we combine the two steps into a full algorithm. Results show marginal increases in predictive power and emphasize the difficulty of predicting complex human interactions.

1 Introduction

Since the late 1940s, Billboard has published a list of the top, or "hot", songs. The list persists to the present day, with Billboard now publishing its top 100 songs every week. Billboard is highly regarded as one of the epicenters of popular music ranking and is thus integral to an artist's success. The music industry is economically huge, with 7 billion dollars of revenue in 2013 [5]. From a purely monetary standpoint, artists and recording labels should be interested in predicting song rankings and trajectories to ensure that their songs generate the most revenue. Music, however, is intensely varied, with much divergence between individual tastes, and it carries several aspects, including rhythm and beat, lyrical content, and melodic content. It is therefore very difficult to model. A study by Chon et al. [3] characterized this difficulty by attempting to predict the trajectories of albums on the Billboard jazz charts using current and previous album sales and nearest-neighbor methods; the results failed to show that album sales had good predictive power, most likely due to chart complexity. Models with better predictive power are therefore desired. Henard and Rossetti [6] showed that lyrical content can be a strong predictor of whether or not a song will reach the number one position, making it an ideal input for a chart trajectory model.

This project consists of two steps. First, we use lyrical content to predict trajectory labels: binary labels that indicate whether or not a song's trajectory matches a given attribute (e.g., whether the song reaches the top 50). Second, we use these trajectory labels to predict song rankings. The goal of this project is to determine whether lyrical content can aid in the prediction of rankings.

2 Datasets

We generate a new dataset from the 500 GB Million Song Dataset (MSD) [8, 9], which contains song metadata up to the year 2008. To obtain lyrical information, the MSD was used in conjunction with the musiXmatch (mXm) dataset [10]. Both datasets can be accessed from the Python programming language using SQLite libraries and queries. While the MSD provides metadata for a large variety of songs, copyright issues limit the lyrical content that can be distributed. To meet legal requirements, the mXm dataset supplies lyrics in a "bag-of-words" format for 237,662 tracks; because of this format, we can only generate features for which word order does not matter. To build the feature sets, the entire MSD was searched for songs that appeared on the charts between 1999 and 2008. We chose this span of time to increase the size of our generated dataset; although music trends could have changed over the ten-year span, this is accounted for by the "year" feature of our feature set, discussed later. Once a song was found, the mXm dataset was queried for its lyrics. If the lyrics were present, the song's lyrical features and metadata were added to our generated dataset; if not, the song was excluded. For this reason, only 940 of the 3732 songs from our time period are included in the new dataset. Since not all songs are included, this could impede our ability to predict trajectory labels. However, our results show that even with the missing data, our method still improves ranking prediction.
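The dataset construction can be summarized in a short sketch. The following is a minimal illustration rather than the authors' actual script: it assumes the published mXm SQLite release (a `lyrics` table keyed by MSD `track_id`) and a hypothetical `billboard_tracks` mapping from charting songs to MSD track IDs produced by the Billboard scrape.

```python
# Minimal sketch of the dataset join described above (not the authors'
# actual code). The mXm SQLite release stores one (word, count) row per
# word per track in a table named "lyrics".
import sqlite3

# Hypothetical output of the Billboard scrape: chart song -> MSD track ID.
billboard_tracks = {"TRXXXXX12345ABCDE": {"title": "Example Song", "year": 2004}}

def lyrics_for_track(conn, track_id):
    """Return the bag-of-words lyrics {word: count} for one MSD track."""
    rows = conn.execute(
        "SELECT word, count FROM lyrics WHERE track_id = ?", (track_id,))
    return dict(rows.fetchall())

conn = sqlite3.connect("mxm_dataset.db")
dataset = []
for track_id, meta in billboard_tracks.items():
    bow = lyrics_for_track(conn, track_id)
    if bow:
        dataset.append((track_id, meta, bow))
    # songs with no lyrics in mXm are dropped (940 of 3732 songs remain)
```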
3 Lyrics

3.1 Lyrical Analysis

Our features consist of metadata from the MSD as well as derived natural language features. Natural language processing was largely done using the nltk toolkit for Python [2], with sentiment analysis done using the textblob toolbox [11]. A study by Danescu-Niculescu-Mizil et al. [4] has shown that the memorability of movie quotes stems from word scarcity, i.e., phrases are more memorable if they are "rare". We hypothesize that this is also the case for song lyrics and may affect a song's placement on the Billboard Hot 100. As discussed earlier, Henard and Rossetti [6] showed that lyrical content is a strong predictor of whether or not a song will hit number one on the Billboard Hot 100. Our features must therefore capture the memorability of a song by looking at the scarcity of its lyrics. At each stage, stop words were removed from the analysis.

[Figure 1: Top ten most frequent lyrics present in our training set]

First, term frequency-inverse document frequency (TF-IDF) was calculated for each word in the song dataset. The TF-IDF metric quantifies the relative importance of a word to a document:

$$\mathrm{TF\text{-}IDF}(w, s) = f_{w,s} \times \log\left(\frac{N}{n_w}\right) \qquad (1)$$

where $f_{w,s}$ is the number of appearances of word $w$ in song $s$, $N$ is the total number of songs, and $n_w$ is the number of songs containing $w$. Intuitively, the memorability of a song should depend on the contributions of its individual words, and a song containing more rare words will have a higher memorability. While individual word frequencies should carry a great amount of information about a song's memorability, we must also account for the memorability of the song as a whole, so we need a metric that integrates over all the lyrics of a song. Shannon entropy, henceforth shortened to entropy, quantifies the variability of a distribution and provides such a metric. Qualitatively, entropy should allow us to determine whether the song itself is memorable: a higher entropy corresponds to the presence of more rare words, which should theoretically make the song more memorable. Entropy is calculated as

$$E(S) = -\sum_i p_i \log_2(p_i) \qquad (2)$$

where $p_i$ is the relative frequency of word $i$ in song $S$.
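Both scarcity features reduce to a few lines over the bag-of-words representation. A minimal sketch with toy data (stop-word removal is assumed to have already happened):

```python
# TF-IDF (Eq. 1) and Shannon entropy (Eq. 2) over mXm-style bags of words.
import math

def tfidf(bow, doc_freq, n_songs):
    """TF-IDF score for every word in one song: count * log(N / n_w)."""
    return {w: c * math.log(n_songs / doc_freq[w]) for w, c in bow.items()}

def entropy(bow):
    """Shannon entropy of the song's word distribution."""
    total = sum(bow.values())
    return -sum((c / total) * math.log2(c / total) for c in bow.values())

songs = [{"love": 4, "yeah": 2}, {"love": 1, "rain": 3}]  # toy "songs"
doc_freq = {"love": 2, "yeah": 1, "rain": 1}              # songs per word
print(tfidf(songs[0], doc_freq, len(songs)))  # rarer "yeah" scores higher
print(entropy(songs[0]))                      # ~0.918 bits
```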
Next, we defined several features that look at the lyrical content and sentiment of a song. From a cursory listening of songs currently in the top 10, several themes, such as love, desire, and time, are very frequently present. Figure 1 shows the ten most frequently encountered words in our time period of interest. The top ten contains both words with very clear meanings, such as "love" and "time", and non-informational words such as "yeah"; this supports the general observation that songs on the top 100 are not entirely full of meaning. To help characterize a song's lyrical content, we first use the frequencies of the three most-used words in the song. Second, we count the number of words between 3 and 8 letters in length. Finally, we compute the average word length for the song, since longer words are usually associated with more complex meanings and thus carry useful contextual information.

Next, we hypothesized that song sentiment might play a role in a song's performance: music is traditionally an outlet for social commentary, with periods of economic depression or social unrest reflected in popular songs. We calculated each song's sentiment polarity, i.e. its relative positivity or negativity, as well as its subjectivity. Figures 2a and 2b show the distributions of songs by polarity and subjectivity.

[Figure 2: Sentiment Analysis Distributions. (a) Distribution of Polarity; (b) Distribution of Subjectivity]
[Figure 3: Sentiment Polarity vs Year on Chart]

Figure 3 shows the average sentiment polarity per year. For each year, a two-sample t-test was run to assess the difference between the baseline year, in our case 1999, and that year. Interestingly, sentiment polarity was depressed between 2002 and 2005 (p < 0.05), and there was also a significant decrease from baseline in 2008 (p < 0.05). Qualitatively, this might reflect national trends of social tragedy or economic downturn reflected in artists' words, making polarity a potentially important component of chart prediction. Song subjectivity remained statistically unchanged across years.
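A sketch of these sentiment features and the per-year tests follows. TextBlob's sentiment API is real; rebuilding a pseudo-text from the bag of words is our assumption about how the scoring was applied, since mXm does not preserve word order.

```python
# Sentiment polarity/subjectivity per song, plus two-sample t-tests of each
# year's polarities against the 1999 baseline, as described above.
from textblob import TextBlob
from scipy import stats

def sentiment(bow):
    """(polarity, subjectivity) for one bag-of-words song."""
    pseudo_text = " ".join(w for w, c in bow.items() for _ in range(c))
    s = TextBlob(pseudo_text).sentiment
    return s.polarity, s.subjectivity

def depressed_years(polarity_by_year, baseline_year=1999, alpha=0.05):
    """Years whose mean polarity differs significantly from the baseline.

    polarity_by_year: {year: [polarity of each song charting that year]}
    (a hypothetical structure built from the generated dataset).
    """
    base = polarity_by_year[baseline_year]
    flagged = []
    for year, values in sorted(polarity_by_year.items()):
        if year == baseline_year:
            continue
        _, p = stats.ttest_ind(base, values)  # two-sample t-test
        if p < alpha:
            flagged.append(year)
    return flagged
```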
While we hypothesize that lyrical content is important for song ranking, we know that other variables also contribute. To account for these, metadata from the MSD was used: artist familiarity, artist hotness, the year the song was released, the energy and tempo of the song, and the song's overall "loudness". We judged these features to be the ones most likely to play a non-trivial role in chart placement.

3.2 Trajectory Label Prediction

To use the feature sets effectively, we refined them iteratively until we reached feature sets that were predictive of a few trajectory labels. Every feature value is continuous, so we used Weka's [7] built-in discretization method, which optimizes the number of equal-width bins, to threshold each feature into bins. On each iteration, we predicted each of the 24 trajectory labels and evaluated whether we needed more or fewer features.

First, we tried using only 4 features: artist familiarity, danceability, sentiment polarity, and sentiment subjectivity. On this first iteration, the trajectory labels were not predicted well: under 10-fold cross validation, accuracies were at or below the baseline of each class label. A label's baseline is defined as the frequency of its most common class divided by the total number of examples; intuitively, the baseline is the maximum accuracy achievable by a prediction made independently of the data. On the second iteration, we added the TF-IDF features (with stop words pruned) and the rest of our lyrical-content features, giving roughly 3800 features in total. In this high-dimensional feature space, our models again failed to reach the baselines. On the third iteration, we applied correlation-based feature selection [1], chosen for its simplicity, speed, and availability in the Weka toolkit. We ran the feature selection separately for each class label, since the selection depends on the label; the resulting feature sets contained between 20 and 200 features. With these feature sets, some labels were finally predicted 10%-20% above their baselines, as seen in Figure 4. The figure reports whichever model had the highest accuracy for a given label, and a label was considered predictable if one of our models beat its baseline accuracy by at least 3%, since a gain above 3% is most likely not due to noise.

We used naive Bayes with Laplace smoothing, C4.5 decision trees with default parameters, and a stack of naive Bayes, decision forests, and AdaBoosted decision stumps in Weka. Stacking combines many classifiers into one by weighting each classifier's class-label decision through a meta-classifier; in our case, logistic regression was the meta-classifier. Default Weka parameters were used for each algorithm. Naive Bayes did surprisingly well, obtaining some of our highest prediction accuracies. Stacking was on par with naive Bayes, while decision trees had slightly worse accuracies. This may be due to the decision trees overfitting: even after dimensionality reduction, the trees still faced feature spaces of up to 200 features, where naive Bayes is less prone to overfitting.

[Figure 4: Our Models vs Baselines. Best-model accuracy vs. baseline accuracy for trajectory labels #2, #3, #5, #9, #15, #19]
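The experiments above were run in Weka. Purely as an illustration, a rough scikit-learn analogue might look like the following; this is our substitution, not the original tooling, with equal-width binning in place of Weka's discretization and univariate mutual-information selection standing in for correlation-based feature selection.

```python
# Illustrative scikit-learn analogue of the Weka setup above (our
# substitution): equal-width discretization, per-label feature selection,
# then naive Bayes, a decision tree, and a stacked ensemble with logistic
# regression as the meta-classifier, all scored by 10-fold CV.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import CategoricalNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.model_selection import cross_val_score

def pipeline(classifier, n_features=50, n_bins=5):
    return make_pipeline(
        KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="uniform"),
        SelectKBest(mutual_info_classif, k=n_features),
        classifier,
    )

stack = StackingClassifier(  # logistic regression as the meta-classifier
    estimators=[
        ("nb", CategoricalNB(alpha=1.0, min_categories=5)),
        ("forest", RandomForestClassifier()),
        ("stumps", AdaBoostClassifier(DecisionTreeClassifier(max_depth=1))),
    ],
    final_estimator=LogisticRegression(),
)
classifiers = {
    "naive_bayes": CategoricalNB(alpha=1.0, min_categories=5),  # Laplace smoothing
    "decision_tree": DecisionTreeClassifier(),  # stand-in for C4.5
    "stacking": stack,
}
# For each of the 24 binary labels y_label:
# scores = {name: cross_val_score(pipeline(c), X, y_label, cv=10).mean()
#           for name, c in classifiers.items()}
```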
4 Trajectories

4.1 Trajectory Extraction and Characterization

We chose to use song data from the Billboard Hot 100 for 1999-2008 to ensure that our training set had an adequate number of songs with lyrics; care was also taken to ensure that the training population consisted of relatively recent songs. The Hot 100 for each week was scraped from the Billboard website and contains a ranked list of the songs for that week. To meet the goal of ranking prediction, the song data was transformed into trajectory information: if a song was on the Billboard chart for a series of consecutive weeks, the rankings for those weeks were stored as a trajectory. Importantly, if a song was on the charts at the very beginning or end of our time period, we do not include it in the trajectory analysis, since its full trajectory most likely extends into a period for which we have no information. A related concern is the inspection paradox: this truncation is more likely to remove long-running trajectories, adding a skew to our dataset. However, because we examine ten years of data and most songs are on the charts for a significantly shorter time, the skew should be minimal. Notably, in some cases a song appears on the chart for a few consecutive weeks, disappears, and then reappears later; such a song is represented as multiple trajectories, one for each appearance on the chart. Using the above method, 3732 trajectories were extracted.

[Figure 5: Characterization of Trajectories. (a) All Trajectories; (b) Every 100th Trajectory; (c) Frequencies of Trajectory Lengths; (d) Frequencies of Rankings]

Figure 5a displays all trajectories overlaid. Interestingly, there is a large empty block in the upper right corner of the plot: it appears that the Billboard 100 removes any song that has been on the charts for more than 20 weeks and falls below rank 50. To show individual trajectory paths, the trajectories were sampled and every 100th trajectory is displayed in Figure 5b. Figure 5c shows the histogram of the number of weeks songs stay on the chart; note the anomaly at length 20 caused by the removal rule just mentioned. The small number of songs with lengths greater than 50 weeks was truncated to 50 weeks. Intuitively, many songs exhibit a "U" shape, entering with a low ranking, advancing up the chart, and then falling off after a certain period of time. Figure 5d shows the frequencies of song rankings: the green line plots the frequencies of rankings at the beginning of songs' trajectories, the blue line at the peak, and the red line at the end. The anomaly in the red line around rank 50 again reflects the removal of songs below rank 50 after 20 weeks. It is interesting to note that the peak frequencies are approximately uniform.

4.2 Feature Labels

As mentioned above, trajectory labels act as an intermediate between lyric feature extraction and ranking prediction. To this end, 24 binary trajectory labels were calculated. A useful label should both provide information that aids prediction and be predictable from lyrical content; each of these requirements is examined later. Table 1 lists the label attributes.

Label Numbers   Label Meaning
1-3             The song is in the top {10, 20, 50} for a week
4-12            The song spends {5, 10, 20} weeks in the top {10, 20, 50}
13-15           The song spends {5, 10, 20} weeks on the chart
16-18           The song starts in the top {25, 50, 75}
19-20           The song reaches its peak after week {5, 10}
21-22           The average change in the song's ranking is greater than {5, 10}
23-24           The median change in the song's ranking is greater than {5, 10}

Table 1: Trajectory Labels
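Each label in Table 1 is a simple function of a trajectory's weekly ranks (rank 1 is best). A minimal sketch for three representative label families; the remaining thresholds follow the same pattern:

```python
# Binary trajectory labels computed from a trajectory (list of weekly ranks).
def in_top_k_for_a_week(traj, k):      # labels 1-3, k in {10, 20, 50}
    return any(rank <= k for rank in traj)

def weeks_in_top_k(traj, weeks, k):    # labels 4-12
    return sum(rank <= k for rank in traj) >= weeks

def peaks_after_week(traj, week):      # labels 19-20
    return traj.index(min(traj)) + 1 > week

traj = [80, 61, 40, 22, 15, 15, 30, 55, 79]  # toy trajectory
print(in_top_k_for_a_week(traj, 20))  # True  (reaches rank 15)
print(weeks_in_top_k(traj, 5, 50))    # True  (5 weeks at rank <= 50)
print(peaks_after_week(traj, 5))      # False (first peak is week 5)
```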
4.3 Trajectory Segmentation

Initially, a parametric model was considered for predicting trajectories, as it was hypothesized that such a model could take advantage of the "U" structure. However, the trajectories do not fit a global parametric model well: some songs reach their peak much sooner than others, some trajectories stay at their peak for many weeks, and some hold their peak ranking for only one week. In the end, we chose a more feasible approach: to predict a given week, we use only the previous few weeks. This approach is inspired by the autoregressive model used in time series analysis. To extract the data for prediction, we take small segments of the trajectories and use a segment's first points to predict its final point. To maximize the number of weeks that can be projected, only a small window of past weeks is used: unless a song is a massive success, its time on the chart is relatively short, and using too many past weeks would rule out long-term future predictions. We therefore try using 1, 2, and 3 previous weeks for prediction.

We also attempt to predict when a song leaves the chart. Initially, a song that had left the chart was given a ranking of 101. However, since songs outside the top 50 after 20 weeks appear to be removed, this exit ranking gave rise to a large amount of error. To correct for this, we changed our trajectory segmentation in two ways. First, if a point on a trajectory falls after week 20 and the song leaves the chart, we assign the exit ranking 51 instead of 101. Second, we do not include segments that would involve predicting week 21.

4.4 Trajectory Modeling

For predicting the trajectories, two models and a baseline were developed. Our baseline is the random walk model, in which the predicted ranking for a week is simply the previous week's ranking. We also try the autoregressive model, which is equivalent to linear regression on the trajectory segments, and the k-nearest-neighbors (kNN) model. In addition to the previous weeks' rankings, we also considered including the number of weeks since the song entered the chart; in a time series sense, this allows our models to be time dependent. Since autoregression is a windowed linear regression, a squared-error cost function is used. We measured performance with 10-fold cross validation; the results are shown in Table 2, using k = 20 for these preliminary kNN results.

Model            1wk      1wk & time   2wk      2wk & time   3wk      3wk & time
Random Walk      9.2600   9.2594       8.5816   8.5838       8.0966   8.1056
kNN              9.4150   8.7141       7.8314   7.6334       7.2489   7.1678
Autoregressive   9.2377   8.9191       7.9186   7.8611       7.3249   7.2944

Table 2: Trajectory Modeling Results

In all cases, casting our model as a function of time improves the results. We can also see that our two models do not improve much relative to the baseline when going from 2 weeks to 3 weeks. Note that the baseline differs across columns because the set of weeks that can be predicted changes with the number of previous weeks used. To ease comparison, Table 3 reports results where the predicted weeks are the same for every history length; the time-dependent variant of each model is used throughout for its better performance.

Model            1wk      2wk      3wk
Random Walk      8.1040   8.1067   8.0962
kNN              7.5420   7.1850   7.1662
Autoregressive   7.7251   7.3513   7.2992

Table 3: Trajectory Modeling Results, Same Predicted Weeks

While going from 1 week to 2 weeks of history yields a significant improvement, going from 2 weeks to 3 weeks yields only a marginal one. Further, since using 3 weeks of history prevents us from predicting a song's third week on the chart, we opt for trajectory segments with a history of 2 rankings plus the time.
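A compact sketch of this segment-based setup follows, with linear regression playing the autoregressive role and kNN as the second model. The error metric here is mean absolute rank error, which is our assumption about the units reported in Tables 2 and 3.

```python
# Build (previous ranks + week index) -> next rank examples, then compare
# the autoregressive (linear) model and kNN with cross validation.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

def segments(trajectories, history=2):
    """Extract fixed-length segments: first `history` ranks predict the next."""
    X, y = [], []
    for traj in trajectories:
        for t in range(history, len(traj)):
            X.append(traj[t - history:t] + [t])  # previous ranks + time feature
            y.append(traj[t])
    return np.array(X), np.array(y)

trajectories = [[80, 61, 40, 22, 15, 15, 30, 55, 79],
                [95, 70, 52, 33, 21, 18, 25, 41, 60, 77]]  # toy data
X, y = segments(trajectories)
for model in (LinearRegression(), KNeighborsRegressor(n_neighbors=3)):
    error = -cross_val_score(model, X, y, cv=3,
                             scoring="neg_mean_absolute_error").mean()
    print(type(model).__name__, round(error, 2))
```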
Next, we choose the optimal k for kNN using 10-fold cross validation; see Table 4.

k       10       20       30       40       50       75       100
Error   7.8012   7.6261   7.5929   7.5798   7.5736   7.5670   7.5804

Table 4: Optimizing the Value of k in kNN

We see that k is near-optimal around the values of 40, 50, and 75. Later, when we utilize the trajectory labels, we will be working with smaller effective datasets, so we lean towards smaller values of k and choose k = 40.

4.5 Utilizing Trajectory Labels

Recall that our goal is to use the trajectory labels predicted from the lyrical features to aid in predicting a song's ranking for the next, unknown week. To utilize a label, we train two separate models, one for each binary value of the label; then, when predicting, we use the trajectory label to decide which learned model makes the prediction. We repeat this process for each of the 24 trajectory labels and show the results in Table 5. The eight trajectory labels that perform best for both regression and kNN appear to be 4, 9, 13, 14, 15, 19, 22, and 24.

Trajectory Label   Regression   kNN
Baseline           7.8611       7.5798
1                  7.7530       7.3943
2                  7.7171       7.2647
3                  7.7758       7.3677
4                  7.7560       7.4346
5                  7.7013       7.2849
6                  7.5303       6.8759
7                  7.7749       7.4815
8                  7.7103       7.3938
9                  7.5939       7.1668
10                 7.8488       7.5544
11                 7.7922       7.5480
12                 7.6506       7.4278
13                 7.3703       6.9821
14                 7.3452       6.7658
15                 7.5097       7.1314
16                 7.8461       7.5873
17                 7.8226       7.5614
18                 7.8359       7.5672
19                 7.5858       7.1187
20                 7.7159       7.3807
21                 7.7735       7.5017
22                 7.5922       7.2009
23                 7.7193       7.4006
24                 7.5326       7.1587

Table 5: Using the Trajectory Label to Improve Results
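This routing scheme can be written as a small wrapper: one model per binary label value, with the label selecting which model answers. A minimal sketch, assuming numpy feature arrays and the kNN regressor chosen above:

```python
# Train one regressor per binary trajectory-label value and route each
# prediction through the model matching that example's label.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

class LabelRoutedModel:
    def __init__(self, make_model=lambda: KNeighborsRegressor(n_neighbors=40)):
        self.models = {0: make_model(), 1: make_model()}

    def fit(self, X, y, labels):
        for value, model in self.models.items():
            mask = labels == value
            model.fit(X[mask], y[mask])
        return self

    def predict(self, X, labels):
        predictions = np.empty(len(X))
        for value, model in self.models.items():
            mask = labels == value
            if mask.any():
                predictions[mask] = model.predict(X[mask])
        return predictions
```

In Section 5, the routing label at prediction time comes from the lyric-based classifier of step one, which is what couples the two stages of the full algorithm.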
5 Results

In the previous sections, we found trajectory labels that can be predicted from lyrical content (step one) and trajectory labels that help predict the next week's ranking (step two). We take the intersection of the trajectory label sets from steps one and two to form the set of most predictive labels used in our results: labels 9, 15, and 19. Note that we can only test on the subset of songs that have lyrics. Since this dataset is smaller, we re-evaluate the value of k; see Table 6. Once again, the objective is to find the smallest k that does well, and since there is not much improvement after k = 25, we use k = 25.

k       5        10       15       20       25       30       35       40       45       50
Error   7.4228   7.1130   7.0134   6.9850   6.9568   9.9493   6.9485   6.9408   6.9412   6.9628

Table 6: Optimizing the Value of k in kNN

Once again, we use 10-fold cross validation to test the results, which can be seen in Table 7. Our first observation is that the baseline error is lower; this could be because songs with available lyrics are more likely to stay on the charts longer and are thus easier to predict. Second, kNN is not really affected, while the regression model sees a marginal improvement, especially with label 19. Although this is a marginal increase in accuracy, we argue that it is not simply due to randomness. First, the differences in accuracy are greater than the sampling variance of the cross validation. Second, the improvement of the two-step process for a trajectory label is closely tied to the effectiveness of the lyrics at predicting that trajectory label and the effectiveness of the trajectory label at predicting the next ranking. This relationship is shown in the scatterplot of Figure 6, which plots every trajectory label with the effectiveness of the lyrics at predicting the label on the y-axis, the effectiveness of the label at predicting the next ranking on the x-axis, and the relative predictive power of the two-step algorithm as the shade of gray: the darker the point, the less predictive it is. As we can see, trajectory labels that are effective for both steps of our process are lighter, indicating higher final accuracy.

[Figure 6: Improvement from two-step prediction as a function of the improvement of each one-step prediction]

Trajectory Label   Regression   Regression using Lyrics   kNN      kNN using Lyrics
9                  7.1684       7.1532                    6.9754   6.9971
15                 7.1554       7.1406                    6.9561   7.0070
19                 7.1603       7.1325                    6.9666   6.9665

Table 7: Using the Trajectory Labels to Improve Results

6 Conclusion

In this project, we attempted to use the lyrics of a song to predict song rankings on the Billboard Hot 100. We introduced a two-step process in which we first predict a set of trajectory labels from lyrical feature data. This feature data was refined iteratively to obtain the most predictive features, and we found sets of features that are able to predict some trajectory labels relatively well. For the second step, we used the trajectory label predictions in a time series model to predict the next week's ranking of a song, with a random walk as the baseline and kNN and regression as the prediction models. We were able to show that this two-step process does, indeed, very slightly improve ranking predictions. Thus, lyrical information can aid in song ranking prediction.

References

[1] Mark Hall (2009). Correlation-based Feature Selection for Machine Learning.
[2] Steven Bird, Edward Loper, and Ewan Klein (2009). Natural Language Processing with Python. O'Reilly Media Inc.
[3] Song Hui Chon, Malcolm Slaney, and Jonathan Berger (2006). Predicting success from music sales data: a statistical and adaptive approach. In Proceedings of the 1st ACM Workshop on Audio and Music Computing Multimedia (AMCMM '06), pp. 83-88.
[4] Cristian Danescu-Niculescu-Mizil, Justin Cheng, Jon Kleinberg, and Lillian Lee (2012). You had me at hello: How phrasing affects memorability. In Proceedings of ACL, pp. 892-901.
[5] Joshua P. Friedlander (2014). News and Notes on 2013 RIAA Music Industry Shipment and Revenue Statistics. RIAA.
[6] David H. Henard and Christian L. Rossetti. All you need is love? Communication insights from pop music's number-one hits. Journal of Advertising Research, 54(2), pp. 178-191.
[7] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1).
[8] Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere (2011). The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011).
[9] Million Song Dataset, official website by Thierry Bertin-Mahieux: http://labrosa.ee.columbia.edu/millionsong/
[10] musiXmatch dataset, the official lyrics collection for the Million Song Dataset: http://labrosa.ee.columbia.edu/millionsong/musixmatch
[11] TextBlob package: https://textblob.readthedocs.org/en/dev/