Logistic Regression

Terminology Review (KNN)

Let's take a moment and review some classification evaluation metrics:

$$Precision = \frac{\text{Number of True Positives}}{\text{Number of Predicted Positives}}$$

$$Recall = \frac{\text{Number of True Positives}}{\text{Number of Actual Total Positives}}$$

$$Accuracy = \frac{\text{Number of True Positives + True Negatives}}{\text{Total Observations}}$$

Confusion Matrices
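
Each of the metrics above can be read off a confusion matrix of actual versus predicted labels. Here's a minimal sketch using scikit-learn; the y_true and y_pred arrays are made-up placeholder labels for illustration:

from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score

# Hypothetical actual and predicted labels, for illustration only
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))

print('Precision: {}'.format(precision_score(y_true, y_pred)))  # TP / predicted positives
print('Recall: {}'.format(recall_score(y_true, y_pred)))        # TP / actual positives
print('Accuracy: {}'.format(accuracy_score(y_true, y_pred)))    # (TP + TN) / total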

ROC Curve

The Receiver Operating Characteristic curve (ROC curve) plots the true positive rate of our classifier against its false positive rate across decision thresholds. When training a classifier, we hope the ROC curve will hug the upper left corner of the graph. A classifier with 50-50 accuracy is deemed 'worthless'; its curve sits on the diagonal, no better than random guessing, as in the case of a coin flip.

AUC

AUC (Area Under [the] Curve) is a comprehensive single-number alternative to the confusion matrices we previously examined, and ROC graphs allow us to determine the optimal precision-recall tradeoff for the specific problem we are looking to solve.

from sklearn.metrics import roc_curve, auc

# scikit-learn's built-in roc_curve method returns the fpr, tpr, and thresholds
# for various decision boundaries given the case member probabilities

# First, calculate the probability scores for each of the datapoints:
y_score = logreg.fit(X_train, y_train).decision_function(X_test)

fpr, tpr, thresholds = roc_curve(y_test, y_score)

# From there we can easily calculate the AUC
print('AUC: {}'.format(auc(fpr, tpr)))
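
To see how closely the curve hugs the upper left corner, we can plot fpr against tpr. A minimal matplotlib sketch, reusing the arrays computed above:

import matplotlib.pyplot as plt

plt.plot(fpr, tpr, label='ROC curve (AUC = {:.3f})'.format(auc(fpr, tpr)))
plt.plot([0, 1], [0, 1], linestyle='--', label='Coin flip')  # the 'worthless' diagonal
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()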

Class Imbalance

Class Weight

class_weight : dict or 'balanced', default: None
    Weights associated with classes in the form 
    ``{class_label: weight}``.
    If not given, all classes are supposed to have weight one.

    The "balanced" mode uses the values of y to automatically 
    adjust weights inversely proportional to class frequencies 
    in the input data as 
    ``n_samples / (n_classes * np.bincount(y))``.

    Note that these weights will be multiplied with 
    sample_weight (passed through the fit method) if 
    sample_weight is specified.

    .. versionadded:: 0.17
       *class_weight='balanced'*
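
To make the "balanced" formula concrete, here is a quick sketch with a hypothetical 90/10 class split:

import numpy as np

y = np.array([0] * 90 + [1] * 10)  # hypothetical imbalanced labels

# n_samples / (n_classes * np.bincount(y))
weights = len(y) / (2 * np.bincount(y))
print(weights)  # [0.5556 5.] -- the minority class gets 9x the weight of the majority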

Example:

weights = [None, 'balanced', {1:2, 0:1}, {1:10, 0:1}, 
           {1:100, 0:1}, {1:1000, 0:1}]
for n, weight in enumerate(weights):
    logreg = LogisticRegression(fit_intercept=False,
                                C=1e12,
                                class_weight=weight)
    ...
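
The loop body is elided above; one plausible completion (a sketch only, assuming the train/test splits and the roc_curve imports from earlier) fits each weighting and compares AUCs:

for n, weight in enumerate(weights):
    logreg = LogisticRegression(fit_intercept=False,
                                C=1e12,
                                class_weight=weight)
    # Hypothetical evaluation step: score the held-out test set
    y_score = logreg.fit(X_train, y_train).decision_function(X_test)
    fpr, tpr, _ = roc_curve(y_test, y_score)
    print('class_weight={}: AUC = {:.3f}'.format(weight, auc(fpr, tpr)))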

Oversampling/Undersampling

SMOTE (Synthetic Minority Oversampling Technique):

import pandas as pd
from imblearn.over_sampling import SMOTE, ADASYN

print(y.value_counts())  # Original class distribution
X_resampled, y_resampled = SMOTE().fit_resample(X, y)
print(pd.Series(y_resampled).value_counts())  # Preview the synthetic sample class distribution

0    99773
1      227
Name: is_attributed, dtype: int64

1    99773
0    99773
dtype: int64

This ROC curve is misleading because the test set was also manipulated using SMOTE, producing results that will not be comparable to future cases, since we have synthetically created test cases. SMOTE should only be applied to the training set; from there, an accurate gauge of the model's performance can be made using a raw test sample that has not been oversampled or undersampled, as in the sketch below.
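
A minimal sketch of that corrected workflow, assuming the X and y from above (the train_test_split parameters are illustrative):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc
from imblearn.over_sampling import SMOTE

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Oversample the training data only; the test set stays untouched
X_train_resampled, y_train_resampled = SMOTE().fit_resample(X_train, y_train)

logreg = LogisticRegression()
logreg.fit(X_train_resampled, y_train_resampled)

# Evaluate on the raw, unresampled test set
y_score = logreg.decision_function(X_test)
fpr, tpr, _ = roc_curve(y_test, y_score)
print('AUC on untouched test set: {}'.format(auc(fpr, tpr)))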
