Distance Metrics

What decisions do I need to make regarding my data? How might these decisions affect overall performance?
Which predictors do I need? How can I confirm that I have the right predictors?
What parameter values (if any) should I choose for my model? How can I find the optimal value for a given parameter?
What metrics will I use to evaluate the performance of my model? Why?
How do I know if there's room left for improvement with my model? Are the potential performance gains worth the time needed to reach them?

First Steps
- Import standard libraries
- Import and read the dataset
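
A minimal sketch of these first steps. The notes do not name a dataset, so the column names and the inline CSV text below are invented stand-ins; with a real file you would pass its path to `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical data standing in for a CSV file on disk.
csv_text = """id,gender,age,plan,churned
1,male,22,basic,0
2,female,38,premium,1
3,female,,basic,1
4,male,35,standard,0
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows, to check that parsing worked
print(df.shape)    # (rows, columns)
```
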
Preprocessing Data
- Remove unnecessary columns
- Convert feature(s) to binary encoding
- Detect and deal with null values
- One-Hot Encode categorical columns
- Store the target column in a separate variable and remove it from the DataFrame
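
The preprocessing steps above might look like the following sketch. The DataFrame, its column names, and the choice of median imputation are assumptions made for illustration; the right handling of nulls depends on the actual data.

```python
import pandas as pd

# Hypothetical dataset; column names are invented for this example.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "gender": ["male", "female", "female", "male"],
    "age": [22.0, 38.0, None, 35.0],
    "plan": ["basic", "premium", "basic", "standard"],
    "churned": [0, 1, 1, 0],
})

# Remove columns that carry no predictive information.
df = df.drop(columns=["id"])

# Convert a two-valued feature to binary encoding.
df["gender"] = df["gender"].map({"female": 0, "male": 1})

# Detect null values, then fill them (here with the column median).
print(df.isna().sum())
df["age"] = df["age"].fillna(df["age"].median())

# One-hot encode the remaining categorical column.
df = pd.get_dummies(df, columns=["plan"])

# Store the target separately and remove it from the features.
target = df["churned"]
features = df.drop(columns=["churned"])
```
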
Normalize Data
- Create a StandardScaler
- Scale the features with .fit_transform()
- Create training and testing sets (train_test_split)
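
A short sketch of scaling followed by splitting, in the order listed above. The random feature matrix, the 25% test size, and the `random_state` value are assumptions chosen only to make the example reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features and binary target.
rng = np.random.default_rng(0)
features = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
target = rng.integers(0, 2, size=100)

# Rescale every column to zero mean and unit variance.
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

# Hold out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    scaled_features, target, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```

Scaling matters for KNN because the model compares points by distance, so a feature with a large numeric range would otherwise dominate. A common alternative is to fit the scaler on the training split only and call `.transform()` on the test split, which keeps test-set statistics out of the scaling step.
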
Creating and Fitting KNN Model
- KNeighborsClassifier
- Fit the classifier to training data/labels (labels = target)
- Use the classifier to generate predictions
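
As a rough illustration of these three steps, the sketch below fits a classifier with scikit-learn's default settings. The synthetic data from `make_classification` is an assumption standing in for the preprocessed dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed features and target.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Create the classifier (n_neighbors defaults to 5).
clf = KNeighborsClassifier()

# Fit it to the training data and labels.
clf.fit(X_train, y_train)

# Generate predictions for the held-out rows.
test_preds = clf.predict(X_test)
print(test_preds[:10])
```
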
Precision, Recall, Accuracy and F1-Score
- from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
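
To see how the four metrics differ, here is a small worked example on hand-made labels (the label lists are invented). The predictions contain 3 true positives, 2 false positives, 1 false negative, and 2 true negatives.

```python
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

# Hypothetical true labels and predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
accuracy = accuracy_score(y_true, y_pred)    # correct / total = 5 / 8
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall = 2 / 3

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"Accuracy:  {accuracy:.3f}")
print(f"F1-score:  {f1:.3f}")
```
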
Improving Model Performance
- Write a function that takes in six parameters:
  - X_train, y_train, X_test, and y_test
  - min_k and max_k, set to 1 and 25 by default
- Create two variables, best_k and best_score.
- Iterate through every odd number from min_k up to and including max_k.
- For each iteration:
  - Create a new KNN classifier, and set the n_neighbors parameter to the current value of k, as determined by the loop.
  - Fit this classifier to the training data.
  - Generate predictions for X_test using the fitted classifier.
  - Calculate the F1-score for these predictions.
  - Compare this F1-score to best_score. If it is better, update best_score and best_k.
- Once every value of k has been checked, print the best value for k and the F1-score it achieved.
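
The steps above can be sketched as a single function. The name `find_best_k` is hypothetical, the synthetic data is a stand-in for the real dataset, and returning the result in addition to printing it is a convenience added here, not something the notes specify.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def find_best_k(X_train, y_train, X_test, y_test, min_k=1, max_k=25):
    """Try odd values of k and report the one with the highest F1-score."""
    best_k = 0
    best_score = 0.0
    # Step by 2 so only odd values are tried (assuming min_k is odd).
    for k in range(min_k, max_k + 1, 2):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        preds = knn.predict(X_test)
        score = f1_score(y_test, preds)
        # Keep the first k that achieves the best score seen so far.
        if score > best_score:
            best_k = k
            best_score = score
    print(f"Best value for k: {best_k}")
    print(f"F1-score: {best_score}")
    return best_k, best_score


# Synthetic stand-in for the scaled features and target.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

best_k, best_score = find_best_k(X_train, y_train, X_test, y_test)
```

Odd values of k avoid tied votes in binary classification. Note that choosing k by its score on the test set means the reported F1-score is an optimistic estimate; a separate validation split or cross-validation is a common way to address that.
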
