Genus X

Classify music genre using machine learning.

Learn More

What is it?

Genus X is a machine learning project that classifies music into 10 different genres based on different parameters of the song given by this data set.

...



Visuals

/>

Popularity to Tempo

Starting off our project, we were trying to get as much data from the dataset as possible. To make it easier, we used different graphs that represent correlations between data columns inside the dataset. On this graph, we can see the correlation between the song tempo that is most commonly used in each individual genre and the popularity of that song.

/>

Loudness to Tempo

This scatterplot helps to show the massive overlap between certain features of songs, in this case tempo and loudness. While there are some clear clusters of genres at certain loudness ranges, many genres overlap throughout these clusters, which makes it especially difficult for a computer to differentiate between genres.

/>

Duration to Loudness to Popularity

This 3d scatterplot takes into account song duration, loudness, and popularity on the x-y-z axis as well as a song's energy determining the size of the point. Although there still is overlap, this helps to show that as more variables are taken into account, each song begins to appear as more unique, allowing more patterns to be seen across genres.

/>

Heat Map of All Variables

Here is a heat map that shows the correlation between different variables in each song. Some variables show greater correlation than others, such as energy and loudness, while other variables are less correlated. Greater correlation would make it easier to find patterns across genres, so because there are many variables with correlation close to zero, the computer will have a difficult time comparing many of the song parameters.




Our Project

...

Logistic Regression

 Logistic Regression is a Machine Learning classification algorithm that  is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.).


Multinomial Logistic Regression

Multinomial Logistic Regression is a type of Logistic Regression that deals with cases when the target or independent variable has three or more possible values. In our case, there are 10.


...

Pros and Cons of LR

Pros:

- Logistic regression is easier to implement, interpret, and very efficient to train.

- Good accuracy for many simple data sets and it performs well when the dataset is linearly separable.

- It can easily extend to multiple classes(multinomial regression) and a natural probabilistic view of class predictions.

Cons:

- It can only be used to predict discrete functions. (Classes)

- It constructs linear boundaries. (In opposed to curvilinear)

- It is tough to obtain complex relationships using logistic regression.


...

Accuracy of Logistic Regression

Confusion matrix on the right, you can see the accuracy of Logistic Regression model that we used for the data set. On the y-axis are the actual values, and on the x-axis are predictions of these genres that were made used by the LR model. Thus, the higher the number is, the more accurate the predictions are.

The best and the worst predictions

From this graph, we can see that the most accurate predictions were made for Classical music, and the least accurate for Alternative music.


...

Naive Bayes

 About Naive Bayes : Native Bayes is a machine learning model for classification that uses the Bayes Theorem( img )
- P(H) is the probability of hypothesis H being true. This is known as the prior probability.
- P(E) is the probability of the evidence(regardless of the hypothesis).
- P(E|H) is the probability of the evidence given that hypothesis is true.
- P(H|E) is the probability of the hypothesis given that the evidence is there.


...

Classification Trees

 About Classification Trees:
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.

Pros
- Simple to understand and to interpret. Trees can be visualized.
- Requires little data preparation.
- Able to handle both numerical and categorical data.
- Possible to validate a model using statistical tests.

Cons
- Decision\-tree learners can create over complex trees that do not generalize the data well.
- Decision trees can be unstable because small variations in the data might result in a completely different tree being generated
- Predictions of decision trees are neither smooth nor continuous, but piecewise constant approximations as seen in the above figure.
- Decision tree learners create biased trees if some classes dominate.


...

Random Forest

 
Random forest is a commonly-used machine learning algorithm trademarked by Leo Breiman and Adele Cutler, which combines the output of multiple decision trees to reach a single result. Its ease of use and flexibility have fueled its adoption, as it handles both classification and regression problems.

Benefits
- Reduced risk of overfitting: Decision trees run the risk of overfitting as they tend to tightly fit all the samples within training data. However, when there’s a robust number of decision trees in a random forest, the classifier won’t overfit the model since the averaging of uncorrelated trees lowers the overall variance and prediction error.
- Provides flexibility: Since random forest can handle both regression and classification tasks with a high degree of accuracy, it is a popular method among data scientists. Feature bagging also makes the random forest classifier an effective tool for estimating missing values as it maintains accuracy when a portion of the data is missing.
- Easy to determine feature importance: Random forest makes it easy to evaluate variable importance, or contribution, to the model. There are a few ways to evaluate feature importance. Gini importance and mean decrease in impurity (MDI) are usually used to measure how much the model’s accuracy decreases when a given variable is excluded. However, permutation importance, also known as mean decrease accuracy (MDA), is another importance measure. MDA identifies the average decrease in accuracy by randomly permutating the feature values in oob samples.

Challenges
- Time-consuming process: Since random forest algorithms can handle large data sets, they can be provide more accurate predictions, but can be slow to process data as they are computing data for each individual decision tree.
- Requires more resources: Since random forests process larger data sets, they’ll require more resources to store that data.
- More complex: The prediction of a single decision tree is easier to interpret when compared to a forest of them.


...

XGBoost

 
XGBoost is a popular supervised machine learning algorithm that tends to outperform many other algorithms.

Benefits
- XGBoost can be used for both regression and classification models and for large datasets.
- It provides highly accurate results by using more accurate approximations in order to find the best tree models.

Challenges
- XGBoost often takes much longer to run than other algorithms because of how system-heavy it is.
- Requires more resources: Like random forests, XGBoost can process larger data sets, so it will require more resources to store that data.

Our Data
- It is clear that even XGBoost has a difficult time determining the genre of music because of the many overlapping parameters. For example, the model classifies Hip-Hop music as Rap more than it classifies Hip-Hop correctly, as seen in the bottom row of the heat map.
- Using XGBoost with our data and messing with the model's parameters through trial and error, we were able to get a final accuracy of 66.6%.


/>

Accuracy

  Each model we used has a different percentage accuracy. Some are more accurate in predicting outcomes than others.
This bar graph shows the difference in accuracy in each of the models that we used. The decimal shown can be translated to the percent of accuracy. (ex 0.5 = 50%)



Overveiw
Model Accuracy (in %)
Logistic Regression 53
Naive Bayes 42
Classification Trees 44
Random forest 50
Logistic Regression 53
XGBoost 66