Genres
Starting off our project, we tried to extract as much information from the dataset as possible. To make this easier, we used several graphs that show correlations between the dataset's columns. In this first graph, we can see the relationship between the tempo most commonly used in each genre and the popularity of those songs.
This scatterplot helps to show the massive overlap between certain features of songs, in this case tempo and loudness. While there are some clear clusters of genres at certain loudness ranges, many genres overlap throughout these clusters, which makes it especially difficult for a computer to differentiate between genres.
This 3D scatterplot plots song duration, loudness, and popularity on the x, y, and z axes, with a song's energy determining the size of each point. Although there is still overlap, this helps to show that as more variables are taken into account, each song appears more unique, allowing more patterns to be seen across genres.
Here is a heat map that shows the correlation between different variables in each song. Some variables show greater correlation than others, such as energy and loudness, while other variables are less correlated. Greater correlation would make it easier to find patterns across genres, so because there are many variables with correlation close to zero, the computer will have a difficult time comparing many of the song parameters.
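As a rough sketch of how such a correlation heat map can be produced, assuming the dataset is loaded as a pandas DataFrame (the file name and the exact feature column names here are illustrative assumptions, not necessarily our dataset's):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical file name; the actual dataset path may differ.
songs = pd.read_csv("spotify_songs.csv")

# Correlate only the numeric audio features (column names assumed).
features = ["tempo", "loudness", "energy", "duration_ms", "popularity"]
corr = songs[features].corr()

# Draw the heat map; values near zero indicate weakly related features.
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation between song features")
plt.tight_layout()
plt.show()
```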
Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.).
Multinomial Logistic Regression is a type of Logistic Regression that handles cases where the target (dependent) variable has three or more possible values. In our case, there are 10 genres.
Pros
- Logistic regression is easier to implement and interpret, and very efficient to train.
- It achieves good accuracy on many simple data sets and performs well when the data is linearly separable.
- It extends easily to multiple classes (multinomial regression) and provides a natural probabilistic view of class predictions.
Cons
- It can only be used to predict discrete classes.
- It constructs linear decision boundaries (as opposed to curvilinear ones).
- It is tough to capture complex relationships using logistic regression.
In the confusion matrix on the right, you can see the accuracy of the Logistic Regression model that we used on the data set. The y-axis shows the actual genres, and the x-axis shows the genres predicted by the LR model. Thus, the higher the numbers along the diagonal, the more accurate the predictions.
From this graph, we can see that the most accurate predictions were made for Classical music, and the least accurate for Alternative music.
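A minimal sketch of how such a model and confusion matrix can be produced with scikit-learn, assuming X holds the numeric song features and y the genre labels (our exact preprocessing and parameters are not shown here):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

# X: numeric song features, y: one of the 10 genre labels per song.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scaling helps logistic regression converge when features have very
# different ranges (e.g., tempo vs. loudness).
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# multi_class="multinomial" fits one softmax model over all 10 genres.
model = LogisticRegression(multi_class="multinomial", max_iter=1000)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))  # rows: actual, columns: predicted
```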
About Naive Bayes: Naive Bayes is a machine learning model for classification that uses Bayes' Theorem, P(H|E) = P(E|H) P(H) / P(E), where:
- P(H) is the probability of hypothesis H being true. This is known as the prior probability.
- P(E) is the probability of the evidence (regardless of the hypothesis).
- P(E|H) is the probability of the evidence given that the hypothesis is true.
- P(H|E) is the probability of the hypothesis given that the evidence is there.
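Since the audio features are continuous, Gaussian Naive Bayes is the natural scikit-learn variant. A minimal sketch, reusing the train/test split from the logistic regression sketch above (the variant choice is our assumption):

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# GaussianNB models P(E|H) for each feature as a normal distribution
# per genre, then applies Bayes' theorem to estimate P(H|E).
nb = GaussianNB()
nb.fit(X_train, y_train)

pred = nb.predict(X_test)
print("Naive Bayes accuracy:", accuracy_score(y_test, pred))
```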
About Classification Trees:
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.
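As a sketch, fitting and visualizing a shallow tree with scikit-learn (the depth limit here is illustrative, not our actual setting):

```python
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Limiting depth keeps the tree interpretable and less prone to the
# overfitting described in the cons below.
tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_train, y_train)

# Trees can be visualized directly, one of their main advantages.
plot_tree(tree, filled=True, max_depth=2)
plt.show()
```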
Pros
- Simple to understand and to interpret. Trees can be visualized.
- Requires little data preparation.
- Able to handle both numerical and categorical data.
- Possible to validate a model using statistical tests.
Cons
- Decision-tree learners can create overly complex trees that do not generalize the data well.
- Decision trees can be unstable because small variations in the data might result in a completely different tree being generated.
- Predictions of decision trees are neither smooth nor continuous, but piecewise constant approximations, as seen in the above figure.
- Decision tree learners create biased trees if some classes dominate.
Random forest is a commonly used machine learning algorithm, trademarked by Leo Breiman and Adele Cutler, which combines the output of multiple decision trees to reach a single result. Its ease of use and flexibility have fueled its adoption, as it handles both classification and regression problems.
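A minimal sketch with scikit-learn, again reusing the earlier split (the number of trees shown is illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# n_estimators controls how many decision trees are averaged;
# more trees lower variance at the cost of training time.
forest = RandomForestClassifier(n_estimators=300, random_state=42)
forest.fit(X_train, y_train)

pred = forest.predict(X_test)
print("Random forest accuracy:", accuracy_score(y_test, pred))
```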
Benefits
- Reduced risk of overfitting: Decision trees run the risk of overfitting as they tend to tightly fit all the samples within the training data. However, when there's a robust number of decision trees in a random forest, the classifier won't overfit the model, since the averaging of uncorrelated trees lowers the overall variance and prediction error.
- Provides flexibility: Since random forest can handle both regression and classification tasks with a high degree of accuracy, it is a popular method among data scientists. Feature bagging also makes the random forest classifier an effective tool for estimating missing values, as it maintains accuracy when a portion of the data is missing.
- Easy to determine feature importance: Random forest makes it easy to evaluate variable importance, or contribution, to the model. There are a few ways to evaluate feature importance. Gini importance and mean decrease in impurity (MDI) are usually used to measure how much the model's accuracy decreases when a given variable is excluded. Permutation importance, also known as mean decrease in accuracy (MDA), is another importance measure: it identifies the average decrease in accuracy from randomly permuting the feature values in out-of-bag (OOB) samples (see the sketch after this list).
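A sketch of the permutation-importance (MDA) idea using scikit-learn's built-in helper, which permutes features on a held-out set rather than on OOB samples; `forest`, `X_test`, `y_test`, and the `features` name list come from the earlier sketches:

```python
from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and measure the drop in accuracy;
# a large drop means the model relied heavily on that feature.
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=42)

for name, mean_drop in zip(features, result.importances_mean):
    print(f"{name}: {mean_drop:.4f}")
```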
Challenges
- Time-consuming process: Since random forest algorithms can handle large data sets, they can provide more accurate predictions, but they can be slow to process data because they are computing results for each individual decision tree.
- Requires more resources: Since random forests process larger data sets, they'll require more resources to store that data.
- More complex: The prediction of a single decision tree is easier to interpret than that of a whole forest of them.
XGBoost is a popular supervised machine learning algorithm that tends to outperform many other algorithms.
Benefits
- XGBoost can be used for both regression and classification models and for large datasets.
- It provides highly accurate results by using more accurate approximations in order to find the best tree models.
Challenges
- XGBoost often takes much longer to run than other algorithms because of how computationally heavy it is.
- Requires more resources: Like random forests, XGBoost can process larger data sets, so it will require more resources to store that data.
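A minimal sketch using the xgboost package's scikit-learn interface (the hyperparameters shown are illustrative, not our final tuned values):

```python
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

# XGBClassifier expects integer class labels, so encode the genre strings.
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)

xgb = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
xgb.fit(X_train, y_train_enc)

pred = xgb.predict(X_test)
print("XGBoost accuracy:", accuracy_score(y_test_enc, pred))
```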
Our Data
- It is clear that even XGBoost has a difficult time determining the genre of music because of the many overlapping parameters. For example, the model classifies Hip-Hop music as Rap more than it classifies Hip-Hop correctly, as seen in the bottom row of the heat map.
- Using XGBoost with our data and tuning the model's parameters through trial and error (see the sketch below), we were able to reach a final accuracy of 66.6%.
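One systematic alternative to pure trial and error is a small grid search; a sketch under the same setup as the XGBoost example above (the parameter grid is hypothetical, not the grid we actually searched):

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Candidate values to try for each parameter (illustrative only).
param_grid = {
    "max_depth": [4, 6, 8],
    "learning_rate": [0.05, 0.1, 0.2],
    "n_estimators": [200, 400],
}

# 3-fold cross-validation over every combination in the grid;
# y_train_enc is the encoded label array from the XGBoost sketch.
search = GridSearchCV(XGBClassifier(), param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train_enc)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```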
Each model we used has a different percentage accuracy. Some are more accurate in predicting outcomes than others.
This bar graph shows the difference in accuracy between the models that we used. Each decimal can be read as a percentage (e.g., 0.5 = 50%).
Overview

| Model | Accuracy (in %) |
|---|---|
| Logistic Regression | 53 |
| Naive Bayes | 42 |
| Classification Trees | 44 |
| Random Forest | 50 |
| XGBoost | 66 |
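A sketch of how a comparison bar graph like the one described above can be drawn from the accuracies in the table:

```python
import matplotlib.pyplot as plt

# Accuracies from the Overview table, as fractions (0.5 = 50%).
models = ["Logistic Regression", "Naive Bayes", "Classification Trees",
          "Random Forest", "XGBoost"]
accuracy = [0.53, 0.42, 0.44, 0.50, 0.66]

plt.bar(models, accuracy)
plt.ylabel("Accuracy (fraction)")
plt.title("Model accuracy comparison")
plt.xticks(rotation=30, ha="right")
plt.tight_layout()
plt.show()
```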