Logistic Regression

Terminology Review (KNN)

Let's take a moment to review some classification evaluation metrics:

Precision = \frac{\text{Number of True Positives}}{\text{Number of Predicted Positives}}

Recall = \frac{\text{Number of True Positives}}{\text{Number of Actual Total Positives}}

Accuracy = \frac{\text{Number of True Positives} + \text{Number of True Negatives}}{\text{Total Observations}}
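As a quick illustration of these formulas, here is a minimal sketch using scikit-learn's metric functions; the small set of labels below is hypothetical and only for demonstration:

```python
from sklearn.metrics import precision_score, recall_score, accuracy_score

# Hypothetical ground-truth labels and model predictions (1 = positive class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Precision = true positives / predicted positives
print("Precision:", precision_score(y_true, y_pred))
# Recall = true positives / actual positives
print("Recall:   ", recall_score(y_true, y_pred))
# Accuracy = (true positives + true negatives) / total observations
print("Accuracy: ", accuracy_score(y_true, y_pred))
```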

Confusion Matrices
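A confusion matrix lays out the same four counts those formulas rely on. A minimal sketch with scikit-learn, reusing the hypothetical labels from the previous example:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

# The metrics above fall straight out of these four counts
print("Precision:", tp / (tp + fp))
print("Recall:   ", tp / (tp + fn))
print("Accuracy: ", (tp + tn) / (tn + fp + fn + tp))
```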

ROC Curve

The Receiver Operating Characteristic (ROC) curve illustrates the true positive rate against the false positive rate of our classifier across decision thresholds. When training a classifier, we are hoping the ROC curve will hug the upper left corner of the graph. A classifier whose curve follows the diagonal is deemed 'worthless'; it is no better than random guessing, as in the case of a coin flip.
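A minimal sketch of plotting an ROC curve with scikit-learn; the synthetic dataset, logistic regression model, and variable names below are assumptions for illustration only:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Hypothetical synthetic dataset for illustration
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
y_scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# ROC curve: true positive rate vs. false positive rate across thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
plt.plot(fpr, tpr, label="Logistic regression")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random guessing")  # the 'worthless' diagonal
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```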

AUC

AUC (Area Under the Curve) summarizes the ROC curve in a single number and is an alternative comprehensive metric to the confusion matrices we previously examined. Together with the ROC graph, it allows us to determine the optimal trade-off between true positives and false positives (and hence between precision and recall) for the specific problem we are looking to solve.
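Continuing the sketch above (and reusing its hypothetical y_test and y_scores), AUC can be computed directly from the predicted probabilities:

```python
from sklearn.metrics import roc_auc_score

# 1.0 = perfect separation of the classes, ~0.5 = no better than random guessing
print("AUC:", roc_auc_score(y_test, y_scores))
```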

Class Imbalance

Class Weight

Example:
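A minimal sketch, assuming scikit-learn's LogisticRegression on a hypothetical imbalanced dataset, of how class_weight="balanced" penalizes mistakes on the minority class more heavily:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: roughly 5% positive class
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 'balanced' weights each class inversely proportional to its frequency,
# so errors on the rare positive class cost more during training
weighted = LogisticRegression(class_weight="balanced").fit(X_train, y_train)
unweighted = LogisticRegression().fit(X_train, y_train)

print("Recall (weighted):  ", recall_score(y_test, weighted.predict(X_test)))
print("Recall (unweighted):", recall_score(y_test, unweighted.predict(X_test)))
```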

Oversampling/Undersampling

SMOTE (Synthetic Minority Oversampling Technique):

The ROC curve here is misleading because the test set was also manipulated using SMOTE. This produces results that will not be comparable to future cases, since we have synthetically created test cases. SMOTE should only be applied to the training set; from there, an accurate gauge of the model's performance can be made using a raw test sample that has not been oversampled or undersampled.
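A minimal sketch of that workflow, assuming the imbalanced-learn package (imblearn) is installed and a hypothetical imbalanced dataset: SMOTE resamples only the training split, and evaluation uses the untouched test split.

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: roughly 10% positive class
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Oversample the minority class in the TRAINING data only
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

model = LogisticRegression().fit(X_train_res, y_train_res)

# Evaluate on the raw, untouched test set for an honest performance estimate
y_scores = model.predict_proba(X_test)[:, 1]
print("AUC on raw test set:", roc_auc_score(y_test, y_scores))
```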
