Distance Metrics

Best Practices

  • What decisions do I need to make regarding my data? How might these decisions affect overall performance?

  • Which predictors do I need? How can I confirm that I have the right predictors?

  • What parameter values (if any) should I choose for my model? How can I find the optimal value for a given parameter?

  • What metrics will I use to evaluate the performance of my model? Why?

  • How do I know if there's room left for improvement with my model? Are the potential performance gains worth the time needed to reach them?

Workflow

  • Setup

    • import standard libraries

    • import and read dataset

  • Preprocessing Data

    • Remove unnecessary columns

    • Convert feature(s) to binary encoding

    • Detect and deal with null values

    • One-Hot Encode categorical columns

    • Store target column in a separate variable and remove it from DataFrame
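The preprocessing steps above could look roughly like the sketch below. The DataFrame and its column names (id, sex, age, port, survived) are invented for illustration, and median imputation is only one of several reasonable ways to handle nulls.

```python
import pandas as pd

# Hypothetical toy dataset; every column name here is illustrative only.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "sex": ["male", "female", "female", "male", "male", "female"],
    "age": [22.0, 38.0, None, 35.0, 54.0, 27.0],
    "port": ["S", "C", "S", "Q", "S", "C"],
    "survived": [0, 1, 1, 0, 0, 1],
})

# Remove a column that carries no predictive signal.
df = df.drop(columns=["id"])

# Convert a two-valued feature to a binary encoding.
df["sex"] = df["sex"].map({"female": 0, "male": 1})

# Detect null values, then fill them (assumption: median imputation).
print(df.isna().sum())
df["age"] = df["age"].fillna(df["age"].median())

# One-hot encode the remaining categorical column.
df = pd.get_dummies(df, columns=["port"])

# Store the target in its own variable and drop it from the features.
y = df["survived"]
X = df.drop(columns=["survived"])
```

After this runs, X holds only numeric feature columns and y holds the labels.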

  • Normalize Data

    • StandardScaler

    • .fit_transform()

    • Creating Training and Testing Sets (train_test_split)
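A minimal sketch of the normalization and splitting steps, in the order listed above. The synthetic data, the 25% test size, and the random_state values are assumptions made only so the example runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative data only).
rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, 200.0, -5.0], scale=[1.0, 50.0, 0.1], size=(100, 3))
y = (X[:, 0] > 10.0).astype(int)

# Rescale each feature to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Hold out a quarter of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.25, random_state=42
)
```

Scaling matters for KNN because distances are dominated by whichever feature has the largest range. A common alternative is to split first and call fit_transform on the training set only, then transform on the test set, so that no test-set statistics reach the scaler.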

  • Creating and Fitting KNN Model

    • KNeighborsClassifier

    • Fit the classifier to training data/labels (labels = target)

    • Use the classifier to generate predictions
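The three steps above might be sketched as follows. The data comes from make_classification purely as a stand-in for a real dataset, and the variable names clf and test_preds are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data; replace with the scaled features and target from earlier steps.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Create the classifier (n_neighbors defaults to 5).
clf = KNeighborsClassifier()

# Fit to the training data and labels.
clf.fit(X_train, y_train)

# Generate predictions for the held-out rows.
test_preds = clf.predict(X_test)
```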

  • Precision, Recall, Accuracy and F1-Score

    • from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
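As a small worked example of these four metrics, the sketch below scores a hand-made set of labels. The y_true and y_pred lists are invented so the counts are easy to verify by hand: 3 true positives, 2 false positives, 1 false negative, and 2 true negatives.

```python
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

# Hand-made labels (illustrative only): TP=3, FP=2, FN=1, TN=2.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 5 / 8
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"Accuracy:  {accuracy:.3f}")
print(f"F1-score:  {f1:.3f}")
```

Each function takes the true labels first and the predictions second.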

  • Improving Model Performance

    • Write a function that takes six parameters:

      • X_train, y_train, X_test, and y_test

      • min_k and max_k, which default to 1 and 25

    • Create two variables, best_k and best_score.

    • Iterate through every odd value of k from min_k to max_k, inclusive. For each iteration:

      • Create a new KNN classifier with n_neighbors set to the current value of k.

      • Fit this classifier to the training data.

      • Generate predictions for X_test using the fitted classifier.

      • Calculate the F1-score for these predictions.

      • Compare this F1-score to best_score. If it is higher, update best_score and best_k.

    • Once every value of k has been checked, print the best value of k and the F1-score it achieved.
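One way to write the search described above is sketched below. The function name find_best_k is a hypothetical choice, the return value is an addition for convenience (the steps above only ask for printing), and the make_classification data is a stand-in so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def find_best_k(X_train, y_train, X_test, y_test, min_k=1, max_k=25):
    """Try odd k values from min_k to max_k and report the best F1-score.

    Assumes min_k is odd; a step of 2 from an even start would visit
    even values instead.
    """
    best_k = None
    best_score = -1.0  # below any valid F1-score, so the first k always wins

    for k in range(min_k, max_k + 1, 2):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        preds = knn.predict(X_test)
        score = f1_score(y_test, preds)
        if score > best_score:
            best_k = k
            best_score = score

    print(f"Best value for k: {best_k}")
    print(f"F1-score: {best_score:.4f}")
    # Returning the pair is an addition beyond the steps listed above.
    return best_k, best_score


# Stand-in data; replace with the real training and testing sets.
X, y = make_classification(n_samples=300, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)
best_k, best_score = find_best_k(X_train, y_train, X_test, y_test)
```

Odd values of k avoid tied votes in binary classification. Note that max_k must not exceed the number of training rows, since a classifier cannot use more neighbors than it has samples.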
