Machine Learning Foundation Concept

A Gentle Introduction to Machine Learning

  • the goal of Machine Learning is to make
    • predictions
    • classifications
  • two main ideas about ML:
    • we use testing data to evaluate Machine Learning methods
    • don’t be fooled by how well a Machine Learning method fits the Training Data
  • a method that fits the Training Data well but makes poor predictions is overfitting; the tension between fitting the Training Data and making good predictions on new data is called the Bias-Variance Tradeoff
  • There are tons of fancy Machine Learning methods, but the most important thing to know about them is not what makes them so fancy; it is that we decide which method fits our needs best by using Testing Data

Cross Validation

  • Cross Validation allows us to compare different machine learning methods and get a sense of how well they will work in practice
  • Four-Fold Cross Validation: divide the data into 4 blocks (the number of blocks can vary). In practice, it is very common to divide the data into 10 blocks, called Ten-Fold Cross Validation
  • Leave-One-Out Cross Validation: treat each individual patient (or sample) as its own block
  • Using machine learning lingo, we need the data to
    • train the machine learning methods
    • test the machine learning methods
  • Reusing the same data for both training and testing is a bad idea because we need to know how the method will work on data it was not trained on.
  • A slightly better idea would be to use the first 75% of the data for training and the last 25% of the data for testing.
  • Rather than worry too much about which block would be best for testing, cross validation uses them all, one at a time, and summarizes the results at the end.
  • Tuning parameter: a parameter that is not estimated from the data, but just sort of guessed. We can then use Ten-Fold Cross Validation to help find the best value for that tuning parameter.
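
The loop described above can be sketched in plain Python. Everything here is illustrative (the function names and the trivial majority-vote model in the usage example are hypothetical, not from any particular library):

```python
def k_fold_indices(n_samples, k):
    """Split sample indices into k roughly equal blocks."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, labels, k, train_fn, score_fn):
    """Use each block once as the Testing set, train on the rest,
    and summarize (average) the scores at the end."""
    scores = []
    for test_idx in k_fold_indices(len(data), k):
        test_set = set(test_idx)
        train_idx = [i for i in range(len(data)) if i not in test_set]
        model = train_fn([data[i] for i in train_idx], [labels[i] for i in train_idx])
        scores.append(score_fn(model,
                               [data[i] for i in test_idx],
                               [labels[i] for i in test_idx]))
    return sum(scores) / len(scores)

# Hypothetical usage: a "model" that just predicts the most common training label.
data = list(range(8))
labels = [0] * 8
train_fn = lambda X, y: max(set(y), key=y.count)          # majority vote
score_fn = lambda m, X, y: sum(1 for t in y if t == m) / len(y)  # accuracy
avg_accuracy = cross_validate(data, labels, 4, train_fn, score_fn)
```

Ten-Fold Cross Validation is the same call with `k=10`, and Leave-One-Out is `k=len(data)`, so each sample becomes its own block.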

The Confusion Matrix

  • steps:
    • we start by dividing the data into Training and Testing set
    • Then we train all of the methods we are interested in with the Training data
    • And then test each method on the Testing set
    • summarize how each method performed on the Testing data
      • create a Confusion Matrix for each method
  • Confusion Matrix
    • rows: what the machine learning algorithm predicted
    • columns: the known truth
    • each cell is labeled (True/False) + (Positive/Negative): Positive/Negative is what the algorithm predicted, and True/False says whether that prediction matched the known truth
  • diagonal
    • the numbers along the diagonal (top-left to bottom-right) tell us how many times the samples were correctly classified
    • the numbers not on the diagonal are samples the algorithm misclassified
  • the size of the confusion matrix is determined by the number of things we want to predict
  • a Confusion Matrix tells you what your machine learning algorithm did right and what it did wrong
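
The layout described above (rows = predictions, columns = the known truth) can be built with a few lines of Python; the function name and argument order here are illustrative, not a standard API:

```python
def confusion_matrix(predicted, actual, classes):
    """Rows = what the machine learning algorithm predicted,
    columns = the known truth. `classes` fixes the row/column order,
    so the matrix size is the number of things we want to predict."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for p, a in zip(predicted, actual):
        matrix[index[p]][index[a]] += 1
    return matrix

# Hypothetical usage with two categories:
m = confusion_matrix(predicted=["yes", "yes", "no", "no"],
                     actual=["yes", "no", "no", "yes"],
                     classes=["yes", "no"])
# The diagonal entries m[0][0] and m[1][1] count the correct classifications;
# the off-diagonal entries count the mistakes.
```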

Sensitivity and Specificity

  • for a confusion matrix with 2 rows and 2 columns
    • Sensitivity: tells us what percentage of samples with the condition (the actual positives) were correctly identified
      • True Positives / (True Positives + False Negatives)
    • Specificity: tells us what percentage of samples without the condition (the actual negatives) were correctly identified
      • True Negatives / (True Negatives + False Positives)
  • larger confusion matrices
    • the big difference when calculating Sensitivity and Specificity for larger confusion matrices is that there is no single pair of values that works for the entire matrix; instead, we calculate a separate Sensitivity and Specificity for each category
  • we can use Sensitivity and Specificity to help us decide which machine learning method would be best for our data
    • If correctly identifying positives is the most important thing to do with the data, we should choose a method with higher Sensitivity
    • If correctly identifying negatives is more important, then we should put more emphasis on Specificity
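
Both formulas, and the per-category version for larger matrices, can be computed directly from a confusion matrix whose rows are predictions and columns are the known truth. This is a sketch with a hypothetical function name:

```python
def sensitivity_specificity(matrix, cls):
    """Sensitivity and Specificity for one category of a confusion matrix
    (rows = predictions, columns = truth). `cls` is the row/column index
    of the category treated as 'positive'."""
    n = len(matrix)
    tp = matrix[cls][cls]
    fn = sum(matrix[r][cls] for r in range(n) if r != cls)  # truly cls, predicted otherwise
    fp = sum(matrix[cls][c] for c in range(n) if c != cls)  # predicted cls, truly otherwise
    tn = sum(matrix[r][c] for r in range(n)
             for c in range(n) if r != cls and c != cls)
    sensitivity = tp / (tp + fn)  # True Positives / (True Positives + False Negatives)
    specificity = tn / (tn + fp)  # True Negatives / (True Negatives + False Positives)
    return sensitivity, specificity
```

For a 2x2 matrix there is one pair of values (calling it with `cls=0` or `cls=1` just swaps the roles); for a larger matrix, calling it once per class index gives the per-category values described above.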

Machine Learning Models