Machine Learning
Here is a cheat sheet showing which algorithms tend to perform best at which tasks; a short code sketch that tries a few of them on the same dataset follows the table.
| Algorithm | Pros | Cons | Good at |
| --- | --- | --- | --- |
| Linear regression | Very fast; easy to understand the model; less prone to overfitting | Unable to model complex relationships; unable to capture nonlinear relationships without first transforming the inputs | A first look at a dataset; numerical data with lots of features |
| Decision trees | Fast; robust to noise and missing values; accurate | Complex trees are hard to interpret; duplication within the same sub-tree is possible | Star classification; medical diagnosis; credit risk analysis |
| Neural networks | Extremely powerful; can model even very complex relationships; no need to understand the underlying data (they almost work by “magic”) | Prone to overfitting; long training time; require significant computing power for large datasets; the model is essentially unreadable | Images; video; “human-intelligence” tasks like driving or flying; robotics |
| Support vector machines | Can model complex, nonlinear relationships; robust to noise (because they maximize margins) | Need to select a good kernel function; model parameters are difficult to interpret; occasional numerical stability problems; require significant memory and processing power | Classifying proteins; text classification; image classification; handwriting recognition |
| K-nearest neighbors | Simple; powerful; no training involved (“lazy”); naturally handles multiclass classification and regression | Expensive and slow to predict new instances; must define a meaningful distance function; performs poorly on high-dimensional datasets | Low-dimensional datasets; computer security (intrusion detection); fault detection in semiconductor manufacturing; video content retrieval; gene expression; protein-protein interaction |
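The table compresses a lot, and the fastest way to build intuition is to try several of these algorithms on the same data. Below is a minimal sketch, assuming scikit-learn and one of its bundled classification datasets; logistic regression stands in for linear regression because the task is classification, and the model settings are illustrative rather than tuned.

```python
# A minimal sketch (not a benchmark): trying a few of the algorithms above on a
# small built-in dataset with scikit-learn. Dataset and settings are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    # Logistic regression stands in for linear regression on a classification task.
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC()),
    "k-nearest neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation accuracy
    print(f"{name:22s} {scores.mean():.3f} +/- {scores.std():.3f}")
```

On a small, well-behaved dataset like this the scores tend to be close together; the differences in the table matter more as data size, dimensionality, and noise grow.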
“Overfitting” is traditionally defined as training some flexible representation so that it memorizes the data but fails to predict well in the future. For this post, I will define overfitting more generally as over-representing the performance of systems. There are two styles of general overfitting: over-representing performance on particular datasets and (implicitly) over-representing performance of a method on future datasets.
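To make the traditional definition concrete, here is a minimal sketch, assuming scikit-learn and a synthetic one-dimensional regression problem: a high-degree polynomial fit on a handful of points nearly memorizes them, while its error on fresh data is far worse than a simple line's. The data-generating function, sample sizes, and polynomial degrees are illustrative choices, not from the original text.

```python
# A small sketch of the traditional definition: a very flexible model fit on few
# examples memorizes them (tiny training error) yet predicts poorly on fresh data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(3 * x).ravel() + rng.normal(0, 0.1, size=n)
    return x, y

x_train, y_train = sample(15)   # deliberately too few examples
x_test, y_test = sample(500)    # stands in for "the future"

for degree in (1, 12):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(x_train))
    test_err = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree {degree:2d}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")
```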
We should all be aware of these methods (catalogued in the table below), avoid them where possible, and take them into account otherwise. I have used “reprobleming” and “old datasets”, and may have participated in “overfitting by review”; some of these are very difficult to avoid.
| Name | Method | Explanation | Remedy |
| --- | --- | --- | --- |
| Traditional overfitting | Train a complex predictor on too few examples. | | Hold out pristine examples for testing; use a simpler predictor; get more training examples. |
| Parameter tweak overfitting | Use a learning algorithm with many parameters. Choose the parameters based on test set performance. | For example, choosing the features so as to optimize test set performance can achieve this. A sketch after the table shows how this inflates the reported score. | Same as above. |
| Brittle measure | Use a measure of performance which is especially brittle to overfitting. | “Entropy”, “mutual information”, and leave-one-out cross-validation are all surprisingly brittle. This is particularly severe when used in conjunction with another approach. | Prefer less brittle measures of performance. |
| Bad statistics | Misuse statistics to overstate confidences. | One common example is pretending that cross-validation performance is drawn from an i.i.d. Gaussian and then using standard confidence intervals; cross-validation errors are not independent. Another standard method is to make known-false assumptions about some system and then derive excessive confidence. | Don’t do this. Reject papers which do this. |
| Choice of measure | Choose the best of accuracy, error rate, (A)ROC, F1, percent improvement on the previous best, percent improvement of error rate, etc. for your method. For bonus points, use ambiguous graphs. | This is fairly common and tempting. | Use canonical performance measures, for example the measure directly motivated by the problem. |
| Incomplete prediction | Instead of (say) making a multiclass prediction, make a set of binary predictions, then compute the optimal multiclass prediction. | Sometimes it’s tempting to leave a gap filled in by a human when you don’t otherwise succeed. | Reject papers which do this. |
| Human-loop overfitting | Use a human as part of a learning algorithm and don’t take into account overfitting by the entire human/computer interaction. | This is subtle and comes in many forms. One example is a human using a clustering algorithm (on training and test examples) to guide learning algorithm choice. | Make sure test examples are not available to the human. |
| Data set selection | Choose to report results on some subset of datasets where your algorithm performs well. | The reason we test on natural datasets is that we believe there is some structure captured by past problems that helps on future problems. Data set selection subverts this and is very difficult to detect. | Use comparisons on standard datasets. Select datasets without using the test set. Good contest performance can’t be faked this way. |
| Reprobleming | Alter the problem so that your performance improves. | For example, take a time series dataset and use ordinary cross validation (a sketch after the table shows the effect), or ignore asymmetric false positive/false negative costs. This can be completely unintentional, for example when someone uses an ill-specified UCI dataset. | Discount papers which do this. Make sure problem specifications are clear. |
| Old datasets | Create an algorithm for the purpose of improving performance on old datasets. | After a dataset has been released, algorithms can be made to perform well on it through feedback design, indicating better performance than we might expect in the future. Some conferences have canonical datasets that have been used for a decade… | Prefer simplicity in algorithm design. Weight newer datasets higher in consideration. Keeping test examples private slows the feedback design process but does not eliminate it. |
| Overfitting by review | 10 people submit a paper to a conference. The one with the best result is accepted. | This is a systemic problem which is very difficult to detect or eliminate. We want to prefer presentation of good results, but doing so can result in overfitting. | Be more skeptical of the best published results at venues with high rejection rates. |
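The “parameter tweak” row is the easiest one to fall into in code. Here is a minimal sketch, assuming scikit-learn and a synthetic dataset (neither is from the original post): one number comes from picking an SVM's gamma directly on the test set and reporting that same test score, the other from tuning on a validation split and touching the test set once.

```python
# A sketch of parameter tweak overfitting: selecting a hyperparameter by test-set
# accuracy inflates the reported score relative to tuning on a validation split.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=50, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, test_size=0.5, random_state=0)

gammas = np.logspace(-4, 1, 20)

# Wrong: choose gamma by test-set accuracy, then report that same accuracy.
tweaked = max(SVC(gamma=g).fit(X_train, y_train).score(X_test, y_test) for g in gammas)

# Better: choose gamma on a validation split, then evaluate once on the test set.
best_gamma = max(gammas, key=lambda g: SVC(gamma=g).fit(X_fit, y_fit).score(X_val, y_val))
honest = SVC(gamma=best_gamma).fit(X_train, y_train).score(X_test, y_test)

print(f"test accuracy when tuned on the test set:       {tweaked:.3f}")
print(f"test accuracy when tuned on a validation split: {honest:.3f}")
```

By construction the test-set-tuned number can never be lower than the honest one here, since it is the maximum over the same candidate models; the gap is the optimism bought by peeking.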
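The time-series example under “reprobleming” can also be made concrete. Below is a minimal sketch, assuming scikit-learn and a synthetic random-walk series: shuffled K-fold cross validation lets a nearest-neighbor regressor interpolate between past and future points, while a forward-in-time split forces it to actually predict ahead. The data and model are illustrative assumptions, not part of the original text.

```python
# A sketch of one reprobleming failure mode: ordinary (shuffled) cross validation
# on a time series lets the model peek at the future, flattering the score.
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
n = 500
y = np.cumsum(rng.normal(size=n))     # a random-walk series to "forecast"
X = np.arange(n).reshape(-1, 1)       # time index as the only feature

model = KNeighborsRegressor(n_neighbors=5)

shuffled = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0), scoring="r2")
forward = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5), scoring="r2")

print(f"shuffled K-fold R^2:  {shuffled.mean():.3f}")
print(f"forward-in-time R^2:  {forward.mean():.3f}")
```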