Let's take a moment and review some classification evaluation metrics:
Precision = Number of True Positives / Number of Predicted Positives
Recall = Number of True Positives / Number of Actual Total Positives
Accuracy = (Number of True Positives + Number of True Negatives) / Total Observations
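As a quick sanity check, all three of these can be computed with scikit-learn's built-in scorers. A minimal sketch, assuming hypothetical y_test (true labels) and y_pred (model predictions) arrays:

from sklearn.metrics import precision_score, recall_score, accuracy_score

# y_test and y_pred are placeholder names for the true and predicted labels
print('Precision: {}'.format(precision_score(y_test, y_pred)))
print('Recall: {}'.format(recall_score(y_test, y_pred)))
print('Accuracy: {}'.format(accuracy_score(y_test, y_pred)))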
Confusion Matrices
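A confusion matrix tabulates how many observations of each actual class were assigned to each predicted class. A minimal sketch with scikit-learn, again assuming the hypothetical y_test and y_pred from above:

from sklearn.metrics import confusion_matrix

# For a binary problem with labels 0 and 1, rows are actual classes and
# columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_test, y_pred))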
ROC Curve
The Receiver Operating Characteristic curve (ROC curve) illustrates the true positive rate of our classifier plotted against its false positive rate across decision thresholds. When training a classifier, we are hoping the ROC curve will hug the upper left corner of our graph. A classifier whose curve sits on the diagonal (50-50) is deemed 'worthless'; this is no better than random guessing, as in the case of a coin flip.
AUC
AUC (Area Under the Curve) summarizes the ROC curve in a single number and is an alternative comprehensive metric to the confusion matrices we previously examined. The ROC graph itself also lets us pick the balance between true positive and false positive rates that best fits the specific problem we are looking to solve.
from sklearn.metrics import roc_curve, auc

# scikit-learn's built-in roc_curve method returns the fpr, tpr and thresholds
# for various decision boundaries given the case member probabilities

# First calculate the probability scores of each of the data points:
y_score = logreg.fit(X_train, y_train).decision_function(X_test)

fpr, tpr, thresholds = roc_curve(y_test, y_score)

# From there we can easily calculate the AUC
print('AUC: {}'.format(auc(fpr, tpr)))
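To visualize the "hug the upper left corner" behavior described above, the fpr and tpr arrays from the snippet can be plotted directly. A minimal matplotlib sketch (the plotting style here is illustrative, not from the original):

import matplotlib.pyplot as plt

# Plot the ROC curve along with the diagonal that represents random guessing
plt.plot(fpr, tpr, label='ROC curve (AUC = {:.3f})'.format(auc(fpr, tpr)))
plt.plot([0, 1], [0, 1], linestyle='--', label='Random guessing (AUC = 0.5)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()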
Class Imbalance
Class Weight
class_weight : dict or 'balanced', default: None
Weights associated with classes in the form
``{class_label: weight}``.
If not given, all classes are supposed to have weight one.
The "balanced" mode uses the values of y to automatically
adjust weights inversely proportional to class frequencies
in the input data as
``n_samples / (n_classes * np.bincount(y))``.
Note that these weights will be multiplied with
sample_weight (passed through the fit method) if
sample_weight is specified.
.. versionadded:: 0.17
*class_weight='balanced'*
Example:
weights = [None, 'balanced', {1: 2, 0: 1}, {1: 10, 0: 1}, {1: 100, 0: 1}, {1: 1000, 0: 1}]
for n, weight in enumerate(weights):
    logreg = LogisticRegression(fit_intercept=False, C=1e12, class_weight=weight)
    ...
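For intuition about what the 'balanced' mode actually computes, the docstring formula above can be reproduced with scikit-learn's compute_class_weight helper. A small sketch, assuming y holds the 0/1 integer class labels used throughout:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 'balanced' weights are n_samples / (n_classes * np.bincount(y))
balanced = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), balanced)))

# Equivalent manual calculation for a binary target
print(len(y) / (2 * np.bincount(y)))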
Oversampling/Undersampling
SMOTE (Synthetic Minority Oversampling Technique):
import pandas as pd
from imblearn.over_sampling import SMOTE, ADASYN

print(y.value_counts())  # Preview original class distribution
# Note: fit_sample was renamed fit_resample in newer versions of imbalanced-learn
X_resampled, y_resampled = SMOTE().fit_sample(X, y)
print(pd.Series(y_resampled).value_counts())  # Preview synthetic sample class distribution

0    99773
1      227
Name: is_attributed, dtype: int64
1    99773
0    99773
dtype: int64
The ROC curve produced this way is misleading because the test set was also manipulated using SMOTE. This produces results that will not be comparable to future cases, since we have synthetically created test cases. SMOTE should only be applied to the training set; an accurate gauge of the model's performance can then be made using a raw test sample that has not been oversampled or undersampled, as in the sketch below.
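A minimal sketch of the corrected workflow, reusing the same X, y and LogisticRegression settings from above (fit_resample is the current imbalanced-learn name for the older fit_sample):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from imblearn.over_sampling import SMOTE

# Split first so the test set is never touched by SMOTE
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Oversample the training data only
X_train_resampled, y_train_resampled = SMOTE().fit_resample(X_train, y_train)

# Fit on the resampled training data, then evaluate on the raw test set
logreg = LogisticRegression(fit_intercept=False, C=1e12)
y_score = logreg.fit(X_train_resampled, y_train_resampled).decision_function(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_score)
print('AUC: {}'.format(auc(fpr, tpr)))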