Distance Metrics
Best Practices
What decisions do I need to make regarding my data? How might these decisions affect overall performance?
Which predictors do I need? How can I confirm that I have the right predictors?
What parameter values (if any) should I choose for my model? How can I find the optimal value for a given parameter?
What metrics will I use to evaluate the performance of my model? Why?
How do I know if there's room left for improvement with my model? Are the potential performance gains worth the time needed to reach them?
Workflow
First
Import standard libraries
Import and read dataset
Preprocessing Data
Remove unnecessary columns
Convert feature(s) to binary encoding
Detect and deal with null values
One-Hot Encode categorical columns
Store target column in a separate variable and remove it from DataFrame
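The preprocessing steps above can be sketched roughly as follows. The toy DataFrame and its column names (`id`, `sex`, `age`, `city`, `target`) are hypothetical stand-ins for the real dataset, and median imputation is only one of several reasonable ways to deal with null values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "sex": ["male", "female", "female", "male", "male", "female"],
    "age": [22.0, 38.0, np.nan, 35.0, np.nan, 27.0],
    "city": ["A", "B", "A", "C", "A", "B"],
    "target": [0, 1, 1, 0, 0, 1],
})

# Remove unnecessary columns (here, a row identifier with no predictive value).
df = df.drop(columns=["id"])

# Convert a two-valued feature to a binary encoding.
df["sex"] = df["sex"].map({"female": 0, "male": 1})

# Detect null values, then fill them with the column median (an assumed choice).
null_counts = df.isna().sum()
df["age"] = df["age"].fillna(df["age"].median())

# One-hot encode the remaining categorical column.
one_hot_df = pd.get_dummies(df, columns=["city"])

# Store the target column separately and remove it from the features.
labels = one_hot_df["target"]
features = one_hot_df.drop(columns=["target"])
```

Whether to drop, fill, or flag missing values depends on the data; it is one of the decisions raised under Best Practices.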
Normalize Data
StandardScaler
.fit_transform()
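KNN relies on distances between points, so features on larger scales would otherwise dominate the result. A minimal sketch of the scaling step, using a small made-up array in place of the real features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales (illustrative values only).
features = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
    [4.0, 400.0],
])

# fit_transform() learns each column's mean and standard deviation,
# then rescales the column to mean 0 and standard deviation 1.
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
```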
Creating Training and Testing Sets (train_test_split)
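A minimal sketch of the split, using placeholder arrays. The `test_size` and `random_state` values are assumptions chosen for illustration, not values prescribed by these notes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and labels: 20 samples, 2 features each.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Hold out 25% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```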
Creating and Fitting KNN Model
KNeighborsClassifier
Fit the classifier to training data/labels (labels = target)
Use the classifier to generate predictions
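The three steps above might look like the following on a tiny, hand-made dataset with two well-separated clusters. The data and the choice of `n_neighbors=3` are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated clusters, one per class (made-up points).
X_train = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1],
])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[0.05, 0.05], [5.05, 5.05]])

# Create the classifier, fit it to the training data and labels,
# then generate predictions for the test points.
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
test_preds = clf.predict(X_test)
```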
Precision, Recall, Accuracy and F1-Score
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
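As a small worked example, the four metrics can be computed on made-up labels and predictions. With 3 true positives, 2 false positives, 1 false negative, and 2 true negatives, the scores differ from one another, which shows why the choice of metric matters.

```python
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

# Made-up true labels and predictions (TP=3, FP=2, FN=1, TN=2).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3/4
accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 5/8
f1 = f1_score(y_true, y_pred)                # 2*TP / (2*TP + FP + FN) = 6/9

print(f"Precision: {precision}, Recall: {recall}, Accuracy: {accuracy}, F1: {f1}")
```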
Improving Model Performance
Define a function that takes in six parameters: X_train, y_train, X_test, y_test, min_k, and max_k. Set min_k and max_k to 1 and 25, respectively, by default.
Create two variables, best_k and best_score.
Iterate through every odd number between min_k and max_k + 1. For each iteration:
Create a new KNN classifier, and set the n_neighbors parameter to the current value for k, as determined by the loop.
Fit this classifier to the training data.
Generate predictions for X_test using the fitted classifier.
Calculate the F1-score for these predictions.
Compare this F1-score to best_score. If it is better, update best_score and best_k.
Once the function has checked every value for k, print out the best value for k and the F1-score it achieved.
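One way to sketch the search described above is shown below. The function name `find_best_k`, the returned tuple, and the synthetic data from `make_classification` are assumptions added for illustration; the notes only ask for the result to be printed. Stepping by 2 visits odd numbers only when `min_k` is itself odd, as it is with the default of 1.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def find_best_k(X_train, y_train, X_test, y_test, min_k=1, max_k=25):
    """Try odd values of k and report the one with the highest F1-score."""
    best_k = 0
    best_score = 0.0
    # Step by 2 so that, starting from an odd min_k, only odd k are tried.
    for k in range(min_k, max_k + 1, 2):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        preds = knn.predict(X_test)
        score = f1_score(y_test, preds)
        if score > best_score:
            best_k = k
            best_score = score
    print(f"Best value for k: {best_k}")
    print(f"F1-score: {best_score}")
    # Returning the values is an addition that makes the function reusable.
    return best_k, best_score


# Synthetic binary-classification data, used only to exercise the function.
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
best_k, best_score = find_best_k(X_train, y_train, X_test, y_test)
```

Because the search scores every k against the test set, the reported F1-score is an optimistic estimate. A separate validation split or cross-validation would give a fairer figure.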