Learned from StatQuest with Josh Starmer. He makes machine learning concepts easy for us to understand~ Hooray~~
A Gentle Introduction to Machine Learning
the goal of Machine Learning is to make
predictions
classifications
two main ideas about ML:
we use testing data to evaluate Machine Learning methods
don’t be fooled by how well a Machine Learning method fits the Training Data
Fitting the Training Data well but making poor predictions is called the Bias-Variance Tradeoff
There are tons of fancy Machine Learning methods, but the most important thing to know about them is not what makes them so fancy; it is that we decide which method fits our needs best by using Testing Data
Cross Validation
Cross Validation allows us to compare different machine learning methods and get a sense of how well they will work in practice
Four-Fold Cross Validation: divide the data into 4 blocks (the number of blocks can vary). In practice, it is very common to divide the data into 10 blocks, which is called Ten-Fold Cross Validation
Leave-One-Out Cross Validation: treat each individual patient (or sample) as its own block
Using machine learning lingo, we need the data to
train the machine learning methods
test the machine learning methods
Reusing the same data for both training and testing is a bad idea because we need to know how the method will work on data it was not trained on.
A slightly better idea would be to use the first 75% of the data for training and the last 25% of the data for testing.
Rather than worry too much about which block would be best for testing, cross validation uses them all, one at a time, and summarizes the results at the end.
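As a sketch of the idea above, here is how k-fold cross validation could divide data into blocks and use each block for testing exactly once (the toy samples and the helper function are made up for illustration; with k equal to the number of samples this becomes Leave-One-Out Cross Validation):

```python
# Sketch of k-fold cross validation block rotation (toy data, assumed setup).
def k_fold_splits(data, k):
    """Divide data into k blocks; yield (training, testing) pairs,
    using each block as the testing set exactly once."""
    block_size = len(data) // k
    blocks = [data[i * block_size:(i + 1) * block_size] for i in range(k)]
    # Put any leftover samples in the last block.
    blocks[-1].extend(data[k * block_size:])
    for i in range(k):
        testing = blocks[i]
        training = [x for j, b in enumerate(blocks) if j != i for x in b]
        yield training, testing

samples = list(range(8))  # 8 toy samples
for fold, (train, test) in enumerate(k_fold_splits(samples, 4)):
    print(f"fold {fold}: train={train} test={test}")
```

Each sample ends up in the testing set exactly once, so no single "which block is best for testing" decision is ever needed.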
Tuning parameter: a parameter that is not estimated from the data, but just sort of guessed. We can then use 10-fold cross validation to help find the best value for that tuning parameter.
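A minimal sketch of using cross validation to pick a tuning parameter: here the "tuning parameter" is an invented decision threshold for a toy classifier, and we keep the candidate with the best average accuracy across the testing blocks (all data and candidate values are made up; real tuning would also fit the model on each training block):

```python
# Hypothetical example: picking a tuning parameter (a decision threshold)
# by averaging accuracy over cross-validation testing blocks.
def accuracy(threshold, samples):
    """Classify value >= threshold as positive (label 1); return fraction correct."""
    return sum((value >= threshold) == bool(label)
               for value, label in samples) / len(samples)

def cross_validate(threshold, samples, k=4):
    """Average accuracy over k testing blocks (simplified: this toy
    'model' needs no training step, so we only score each block)."""
    block = len(samples) // k
    scores = [accuracy(threshold, samples[i * block:(i + 1) * block])
              for i in range(k)]
    return sum(scores) / k

data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
        (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]  # (value, true label)
candidates = [0.25, 0.5, 0.75]  # guessed values for the tuning parameter
best = max(candidates, key=lambda t: cross_validate(t, data))
print(best)  # 0.5
```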
The Confusion Matrix
steps:
we start by dividing the data into Training and Testing sets
Then we train all of the methods we are interested in with the Training data
And then test each method on the Testing set
summarize how each method performed on the Testing data
create a Confusion Matrix for each method
Confusion Matrix
rows: what the machine learning algorithm predicted
columns: the known truth
each count is labeled Positive or Negative (the predicted class) and True or False (whether that prediction was right or wrong)
diagonal
the numbers along the diagonal (top-left to bottom-right) tell us how many times the samples were correctly classified
the numbers not on the diagonal are samples the algorithm messed up
the size of the confusion matrix is determined by the number of things we want to predict
a Confusion Matrix tells you what your machine learning algorithm did right and what it did wrong
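The steps above can be sketched in a few lines: given the known truth and an algorithm's predictions, count each (predicted, true) pair into a grid whose rows are predictions and whose columns are the truth (the labels and predictions below are invented toy values):

```python
# Sketch: building a confusion matrix from predicted and true labels.
from collections import Counter

def confusion_matrix(truth, predicted, labels):
    """rows = what the algorithm predicted, columns = the known truth."""
    counts = Counter(zip(predicted, truth))
    return [[counts[(p, t)] for t in labels] for p in labels]

truth     = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]
matrix = confusion_matrix(truth, predicted, labels=["yes", "no"])
# matrix[0][0] and matrix[1][1] are the diagonal: correctly classified samples.
for row in matrix:
    print(row)
```

The matrix grows with the number of categories: passing more labels produces a larger grid, with the diagonal still holding the correct classifications.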
Sensitivity and Specificity
2 rows and 2 columns
Sensitivity: tells us what percentage of actual positives were correctly identified
Specificity: tells us what percentage of actual negatives were correctly identified
the big difference when calculating Sensitivity and Specificity for larger confusion matrices is that there are no single values that work for the entire matrix; instead, we calculate a separate Sensitivity and Specificity for each category
we can use Sensitivity and Specificity to help us decide which machine learning method would be best for our data
If correctly identifying positives is the most important thing to do with the data, we should choose a method with higher sensitivity
If correctly identifying negatives is more important, then we should put more emphasis on specificity
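Putting the definitions together, here is a sketch of computing Sensitivity and Specificity from a 2x2 confusion matrix whose rows are predictions and whose columns are the known truth (the counts below are invented):

```python
# Sketch: sensitivity and specificity from a 2x2 confusion matrix
# (rows = predicted, columns = known truth; toy counts).
def sensitivity_specificity(matrix):
    (tp, fp), (fn, tn) = matrix
    sensitivity = tp / (tp + fn)  # fraction of actual positives identified
    specificity = tn / (tn + fp)  # fraction of actual negatives identified
    return sensitivity, specificity

matrix = [[90, 20],   # predicted positive: 90 true pos, 20 false pos
          [10, 80]]   # predicted negative: 10 false neg, 80 true neg
sens, spec = sensitivity_specificity(matrix)
print(sens, spec)  # 0.9 0.8
```

For larger matrices, the same idea is applied one category at a time: treat that category as "positive" and everything else as "negative", giving a separate Sensitivity and Specificity per category.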