Google ML Crash Course
If you recall from the Feature Crosses unit, the following classification problem is nonlinear:
Figure 1. Nonlinear classification problem.
"Nonlinear" means that you can't accurately predict a label with a model of the form b+w1x1+w2x2
In other words, the "decision surface" is not a line. Previously, we looked at feature crosses as one possible approach to modeling nonlinear problems.
Now consider the following data set:
Figure 2. A more difficult nonlinear classification problem.
The data set shown in Figure 2 can't be solved with a linear model.
To see how neural networks might help with nonlinear problems, let's start by representing a linear model as a graph:
Figure 3. Linear model as graph.
Each blue circle represents an input feature, and the green circle represents the weighted sum of the inputs.
How can we alter this model to improve its ability to deal with nonlinear problems?
In the model represented by the following graph, we've added a "hidden layer" of intermediary values. Each yellow node in the hidden layer is a weighted sum of the blue input node values. The output is a weighted sum of the yellow nodes.
Figure 4. Graph of two-layer model.
Is this model linear? Yes—its output is still a linear combination of its inputs.
In the model represented by the following graph, we've added a second hidden layer of weighted sums.
Figure 5. Graph of three-layer model.
Is this model still linear? Yes, it is. When you express the output as a function of the input and simplify, you get just another weighted sum of the inputs. This sum won't effectively model the nonlinear problem in Figure 2.
To model a nonlinear problem, we can directly introduce a nonlinearity. We can pipe each hidden layer node through a nonlinear function.
In the model represented by the following graph, the value of each node in Hidden Layer 1 is transformed by a nonlinear function before being passed on to the weighted sums of the next layer. This nonlinear function is called the activation function.
Figure 6. Graph of three-layer model with activation function.
Now that we've added an activation function, adding layers has more impact. Stacking nonlinearities on nonlinearities lets us model very complicated relationships between the inputs and the predicted outputs. In brief, each layer is effectively learning a more complex, higher-level function over the raw inputs. If you'd like to develop more intuition on how this works, see Chris Olah's excellent blog post.
The following sigmoid activation function converts the weighted sum to a value between 0 and 1:

$$F(x) = \frac{1}{1 + e^{-x}}$$
Here's a plot:
Figure 7. Sigmoid activation function.
The following rectified linear unit activation function (or ReLU, for short) often works a little better than a smooth function like the sigmoid, while also being significantly easier to compute:

$$F(x) = \max(0, x)$$
The superiority of ReLU is based on empirical findings, probably driven by ReLU having a more useful range of responsiveness. A sigmoid's responsiveness falls off relatively quickly on both sides.
Figure 8. ReLU activation function.
In fact, any mathematical function can serve as an activation function. Suppose that $\sigma$ represents our activation function (ReLU, sigmoid, or whatever). Consequently, the value of a node in the network is given by the following formula:

$$\sigma(\boldsymbol{w} \cdot \boldsymbol{x} + b)$$
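To make this concrete, here's a small NumPy sketch (the inputs, weights, and bias are made up for illustration) of how a single node's value is computed by piping a weighted sum through an activation function:

```python
import numpy as np

def sigmoid(z):
    # F(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # F(z) = max(0, z)
    return np.maximum(0.0, z)

# Hypothetical input features, weights, and bias for one hidden node.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.2])
b = 0.15

weighted_sum = np.dot(w, x) + b   # w · x + b
print(sigmoid(weighted_sum))      # node value with a sigmoid activation
print(relu(weighted_sum))         # node value with a ReLU activation
```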
TensorFlow provides out-of-the-box support for a wide variety of activation functions. That said, we still recommend starting with ReLU.
Now our model has all the standard components of what people usually mean when they say "neural network":
A set of nodes, analogous to neurons, organized in layers.
A set of weights representing the connections between each neural network layer and the layer beneath it. The layer beneath may be another neural network layer, or some other kind of layer.
A set of biases, one for each node.
An activation function that transforms the output of each node in a layer. Different layers may have different activation functions.
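As a rough sketch only (this code isn't part of the course; the input width and layer sizes are assumptions), here's how those components map onto a small Keras model: layers of nodes, weights and biases per layer, and an activation function on each hidden layer:

```python
import tensorflow as tf

# A small network with two hidden layers, assuming three input features.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(3,)),
    tf.keras.layers.Dense(4, activation="relu"),    # Hidden Layer 1: weights, biases, ReLU
    tf.keras.layers.Dense(4, activation="relu"),    # Hidden Layer 2: weights, biases, ReLU
    tf.keras.layers.Dense(1, activation="sigmoid"), # output node for a binary label
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()  # lists the trainable weights and biases layer by layer
```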
This section explains back-propagation's failure cases and the most common way to regularize a neural network.
There are a number of common ways for back-propagation to go wrong.
The gradients for the lower layers (closer to the input) can become very small. In deep networks, computing these gradients can involve taking the product of many small terms.
When the gradients vanish toward 0 for the lower layers, these layers train very slowly, or not at all.
The ReLU activation function can help prevent vanishing gradients.
If the weights in a network are very large, then the gradients for the lower layers involve products of many large terms. In this case you can have exploding gradients: gradients that get too large to converge.
Batch normalization can help prevent exploding gradients, as can lowering the learning rate.
Once the weighted sum for a ReLU unit falls below 0, the ReLU unit can get stuck. It outputs 0 activation, contributing nothing to the network's output, and gradients can no longer flow through it during back-propagation. With a source of gradients cut off, the input to the ReLU may never change enough to bring the weighted sum back above 0.
Lowering the learning rate can help keep ReLU units from dying.
Yet another form of regularization, called Dropout, is useful for neural networks. It works by randomly "dropping out" unit activations in a network for a single gradient step. The more you drop out, the stronger the regularization:
0.0 = No dropout regularization.
1.0 = Drop out everything. The model learns nothing.
Values between 0.0 and 1.0 = More useful.
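In Keras, for instance, dropout is usually expressed as its own layer between two dense layers; the sketch below (an illustrative assumption, not course code) drops 20% of the preceding layer's activations on each training step:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(rate=0.2),  # randomly zero 20% of activations during training
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```

Dropout is only applied during training; at prediction time all activations are kept.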
One vs. all provides a way to leverage binary classification. Given a classification problem with N possible solutions, a one-vs.-all solution consists of N separate binary classifiers—one binary classifier for each possible outcome. During training, the model runs through a sequence of binary classifiers, training each to answer a separate classification question. For example, given a picture of a dog, five different recognizers might be trained, four seeing the image as a negative example (not a dog) and one seeing the image as a positive example (a dog). That is:
Is this image an apple? No.
Is this image a bear? No.
Is this image candy? No.
Is this image a dog? Yes.
Is this image an egg? No.
This approach is fairly reasonable when the total number of classes is small, but becomes increasingly inefficient as the number of classes rises.
We can create a significantly more efficient one-vs.-all model with a deep neural network in which each output node represents a different class. The following figure suggests this approach:
Figure 1. A one-vs.-all neural network.
Recall that logistic regression produces a decimal between 0 and 1.0. For example, a logistic regression output of 0.8 from an email classifier suggests an 80% chance of an email being spam and a 20% chance of it being not spam. Clearly, the sum of the probabilities of an email being either spam or not spam is 1.0.
Softmax extends this idea into a multi-class world. That is, Softmax assigns decimal probabilities to each class in a multi-class problem. Those decimal probabilities must add up to 1.0. This additional constraint helps training converge more quickly than it otherwise would.
For example, returning to the image analysis we saw in Figure 1, Softmax might produce the following likelihoods of an image belonging to a particular class:
| Class | Probability |
|---|---|
| apple | 0.001 |
| bear | 0.04 |
| candy | 0.008 |
| dog | 0.95 |
| egg | 0.001 |
Softmax is implemented through a neural network layer just before the output layer. The Softmax layer must have the same number of nodes as the output layer.
Figure 2. A Softmax layer within a neural network.
The Softmax equation is as follows:

$$p(y = j \mid \mathbf{x}) = \frac{e^{\mathbf{w}_j^\top \mathbf{x} + b_j}}{\sum_{k \in K} e^{\mathbf{w}_k^\top \mathbf{x} + b_k}}$$
Note that this formula basically extends the formula for logistic regression into multiple classes.
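As a quick sanity check, here's a small NumPy sketch (the logits are made up) that computes Softmax over a vector of output-layer values and confirms the probabilities sum to 1.0:

```python
import numpy as np

def softmax(logits):
    # Subtracting the max logit improves numerical stability without changing the result.
    exps = np.exp(logits - np.max(logits))
    return exps / np.sum(exps)

logits = np.array([1.2, 4.5, 0.3, 7.0, 1.1])  # hypothetical scores for five classes
probs = softmax(logits)
print(probs)        # one decimal probability per class
print(probs.sum())  # 1.0
```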
Consider the following variants of Softmax:
Full Softmax is the Softmax we've been discussing; that is, Softmax calculates a probability for every possible class.
Candidate sampling means that Softmax calculates a probability for all the positive labels but only for a random sample of negative labels. For example, if we are interested in determining whether an input image is a beagle or a bloodhound, we don't have to provide probabilities for every non-doggy example.
Full Softmax is fairly cheap when the number of classes is small but becomes prohibitively expensive when the number of classes climbs. Candidate sampling can improve efficiency in problems having a large number of classes.
Softmax assumes that each example is a member of exactly one class. Some examples, however, can simultaneously be a member of multiple classes. For such examples:
You may not use Softmax.
You must rely on multiple logistic regressions.
For example, suppose your examples are images containing exactly one item—a piece of fruit. Softmax can determine the likelihood of that one item being a pear, an orange, an apple, and so on. If your examples are images containing all sorts of things—bowls of different kinds of fruit—then you'll have to use multiple logistic regressions instead.
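The distinction shows up directly in the output layer. Here's a hedged Keras sketch (the input width, layer sizes, and class count are assumptions): a Softmax output for the exactly-one-class case, and independent sigmoid outputs (one logistic regression per class) for the many-labels case:

```python
import tensorflow as tf

NUM_CLASSES = 5  # hypothetical number of classes

# Exactly one class per example: Softmax output, categorical cross-entropy loss.
single_label_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
single_label_model.compile(optimizer="adam", loss="categorical_crossentropy")

# Possibly several classes per example: one sigmoid per class, binary cross-entropy loss.
multi_label_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="sigmoid"),
])
multi_label_model.compile(optimizer="adam", loss="binary_crossentropy")
```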
An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. An embedding can be learned and reused across models.
Collaborative filtering is the task of making predictions about the interests of a user based on interests of many other users. As an example, let's look at the task of movie recommendation. Suppose we have 1,000,000 users, and a list of the movies each user has watched (from a catalog of 500,000 movies). Our goal is to recommend movies to users.
To solve this problem some method is needed to determine which movies are similar to each other. We can achieve this goal by embedding the movies into a low-dimensional space created such that similar movies are nearby.
Before describing how we can learn the embedding, we first explore the type of qualities we want the embedding to have, and how we will represent the training data for learning the embedding.
To help develop intuition about embeddings, on a piece of paper, try to arrange the following movies on a one-dimensional number line so that the movies nearest each other are the most closely related:
| Movie | Rating | Description |
|---|---|---|
| Bleu | R | A French widow grieves the loss of her husband and daughter after they perish in a car accident. |
| The Dark Knight Rises | PG-13 | Batman endeavors to save Gotham City from nuclear annihilation in this sequel to The Dark Knight, set in the DC Comics universe. |
| Harry Potter and the Sorcerer's Stone | PG | An orphaned boy discovers he is a wizard and enrolls in Hogwarts School of Witchcraft and Wizardry, where he wages his first battle against the evil Lord Voldemort. |
| The Incredibles | PG | A family of superheroes forced to live as civilians in suburbia comes out of retirement to save the superhero race from Syndrome and his killer robot. |
| Shrek | PG | A lovable ogre and his donkey sidekick set off on a mission to rescue Princess Fiona, who is imprisoned in her castle by a dragon. |
| Star Wars | PG | Luke Skywalker and Han Solo team up with two androids to rescue Princess Leia and save the galaxy. |
| The Triplets of Belleville | PG-13 | When professional cyclist Champion is kidnapped during the Tour de France, his grandmother and overweight dog journey overseas to rescue him, with the help of a trio of elderly jazz singers. |
| Memento | R | An amnesiac desperately seeks to solve his wife's murder by tattooing clues onto his body. |
Figure 1. A possible one-dimensional arrangement
While this embedding does help capture how much the movie is geared towards children versus adults, there are many more aspects of a movie that one would want to capture when making recommendations. Let's take this example one step further, adding a second embedding dimension.
Try the same exercise as before, but this time arrange the same movies in a two-dimensional space.
Figure 2. A possible two-dimensional arrangement
With this two-dimensional embedding we define a distance between movies such that movies are nearby (and thus inferred to be similar) if they are both alike in the extent to which they are geared towards children versus adults, as well as the extent to which they are blockbuster movies versus arthouse movies. These, of course, are just two of many characteristics of movies that might be important.
More generally, what we've done is map these movies into an embedding space, where each movie is described by a two-dimensional set of coordinates. For example, in this space, "Shrek" maps to (-1.0, 0.95) and "Bleu" maps to (0.65, -0.2). In general, when learning a d-dimensional embedding, each movie is represented by d real-valued numbers, each one giving the coordinate in one dimension.
In this example, we have given a name to each dimension. When learning embeddings, the individual dimensions are not learned with names. Sometimes, we can look at the embeddings and assign semantic meanings to the dimensions, and other times we cannot. Often, each such dimension is called a latent dimension, as it represents a feature that is not explicit in the data but rather inferred from it.
Ultimately, it is the distances between movies in the embedding space that are meaningful, rather than a single movie's values along any given dimension.
Categorical data refers to input features that represent one or more discrete items from a finite set of choices. For example, it can be the set of movies a user has watched, the set of words in a document, or the occupation of a person.
Categorical data is most efficiently represented via sparse tensors, which are tensors with very few non-zero elements. For example, if we're building a movie recommendation model, we can assign a unique ID to each possible movie, and then represent each user by a sparse tensor of the movies they have watched, as shown in Figure 3.
Figure 3. Data for our movie recommendation problem.
Each row of the matrix in Figure 3 is an example capturing a user's movie-viewing history, and is represented as a sparse tensor because each user only watches a small fraction of all possible movies. The last row corresponds to the sparse tensor [1, 3, 999999], using the vocabulary indices shown above the movie icons.
Likewise one can represent words, sentences, and documents as sparse vectors where each word in the vocabulary plays a role similar to the movies in our recommendation example.
In order to use such representations within a machine learning system, we need a way to represent each sparse vector as a vector of numbers so that semantically similar items (movies or words) have similar distances in the vector space. But how do you represent a word as a vector of numbers?
The simplest way is to define a giant input layer with a node for every word in your vocabulary, or at least a node for every word that appears in your data. If 500,000 unique words appear in your data, you could represent a word with a length 500,000 vector and assign each word to a slot in the vector.
If you assign "horse" to index 1247, then to feed "horse" into your network you might copy a 1 into the 1247th input node and 0s into all the rest. This sort of representation is called a one-hot encoding, because only one index has a non-zero value.
More typically your vector might contain counts of the words in a larger chunk of text. This is known as a "bag of words" representation. In a bag-of-words vector, several of the 500,000 nodes would have non-zero value.
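Here's a small NumPy sketch of both representations (the vocabulary size matches the example above; the word indices are the hypothetical ones used later in this section):

```python
import numpy as np

VOCAB_SIZE = 500_000

# One-hot encoding: a single word ("horse" at a hypothetical index 1247).
one_hot = np.zeros(VOCAB_SIZE)
one_hot[1247] = 1.0

# Bag of words: counts of each word appearing in a chunk of text.
bag_of_words = np.zeros(VOCAB_SIZE)
for word_index in [1247, 1247, 50430, 238]:  # "horse" twice, "antelope", "television"
    bag_of_words[word_index] += 1.0

print(int(one_hot.sum()))       # 1  -> exactly one non-zero entry
print(int(bag_of_words[1247]))  # 2  -> counts, with several non-zero entries overall
```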
But however you determine the non-zero values, one-node-per-word gives you very sparse input vectors—very large vectors with relatively few non-zero values. Sparse representations have a couple of problems that can make it hard for a model to learn effectively.
Huge input vectors mean a super-huge number of weights for a neural network. If there are M words in your vocabulary and N nodes in the first layer of the network above the input, you have M × N weights to train for that layer. A large number of weights causes further problems:
Amount of data. The more weights in your model, the more data you need to train effectively.
Amount of computation. The more weights, the more computation required to train and use the model. It's easy to exceed the capabilities of your hardware.
If you feed the pixel values of RGB channels into an image classifier, it makes sense to talk about "close" values. Reddish blue is close to pure blue, both semantically and in terms of the geometric distance between vectors. But a vector with a 1 at index 1247 for "horse" is not any closer to a vector with a 1 at index 50,430 for "antelope" than it is to a vector with a 1 at index 238 for "television".
The solution to these problems is to use embeddings, which translate large sparse vectors into a lower-dimensional space that preserves semantic relationships. We'll explore embeddings intuitively, conceptually, and programmatically in the following sections of this module.
You can solve the core problems of sparse input data by mapping your high-dimensional data into a lower-dimensional space.
As you can see from the paper exercises, even a small multi-dimensional space provides the freedom to group semantically similar items together and keep dissimilar items far apart. Position (distance and direction) in the vector space can encode semantics in a good embedding. For example, the following visualizations of real embeddings show geometrical relationships that capture semantic relations like the relation between a country and its capital:
Figure 4. Embeddings can produce remarkable analogies.
This sort of meaningful space gives your machine learning system opportunities to detect patterns that may help with the learning task.
While we want enough dimensions to encode rich semantic relations, we also want an embedding space that is small enough to allow us to train our system more quickly. A useful embedding may be on the order of hundreds of dimensions. This is likely several orders of magnitude smaller than the size of your vocabulary for a natural language task.
An embedding is a matrix in which each row is the vector that corresponds to an item in your vocabulary. To get the dense vector for a single vocabulary item, you retrieve the row corresponding to that item.
But how would you translate a sparse bag of words vector? To get the dense vector for a sparse vector representing multiple vocabulary items (all the words in a sentence or paragraph, for example), you could retrieve the embedding for each individual item and then add them together.
If the sparse vector contains counts of the vocabulary items, you could multiply each embedding by the count of its corresponding item before adding it to the sum.
These operations may look familiar.
The lookup, multiplication, and addition procedure we've just described is equivalent to matrix multiplication. Given a 1 × N sparse representation S and an N × M embedding table E, the matrix multiplication S × E gives you the 1 × M dense vector.
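A tiny NumPy sketch (with a made-up vocabulary of N = 6 items and M = 3 embedding dimensions) confirms that the lookup-multiply-add procedure and the matrix product S × E give the same dense vector:

```python
import numpy as np

N, M = 6, 3
E = np.arange(N * M, dtype=float).reshape(N, M)  # N x M embedding table with made-up values

# 1 x N sparse bag-of-words row: item 1 appears twice, item 4 appears once.
S = np.array([[0.0, 2.0, 0.0, 0.0, 1.0, 0.0]])

# Look up each item's embedding, scale by its count, and add.
dense_by_lookup = 2.0 * E[1] + 1.0 * E[4]

# Equivalent matrix multiplication: (1 x N) @ (N x M) -> 1 x M.
dense_by_matmul = S @ E

print(dense_by_lookup)      # same values...
print(dense_by_matmul[0])   # ...as the matrix-multiplication result
```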
But how do you get E in the first place? We'll take a look at how to obtain embeddings in the next section.
There are a number of ways to get an embedding, including a state-of-the-art algorithm created at Google.
There are many existing mathematical techniques for capturing the important structure of a high-dimensional space in a low dimensional space. In theory, any of these techniques could be used to create an embedding for a machine learning system.
For example, principal component analysis (PCA) has been used to create word embeddings. Given a set of instances like bag of words vectors, PCA tries to find highly correlated dimensions that can be collapsed into a single dimension.
Word2vec is an algorithm invented at Google for training word embeddings. Word2vec relies on the distributional hypothesis to map semantically similar words to geometrically close embedding vectors.
The distributional hypothesis states that words which often have the same neighboring words tend to be semantically similar. Both "dog" and "cat" frequently appear close to the word "vet", and this fact reflects their semantic similarity. As the linguist John Firth put it in 1957, "You shall know a word by the company it keeps".
Word2Vec exploits contextual information like this by training a neural net to distinguish actually co-occurring groups of words from randomly grouped words. The input layer takes a sparse representation of a target word together with one or more context words. This input connects to a single, smaller hidden layer.
In one version of the algorithm, the system makes a negative example by substituting a random noise word for the target word. Given the positive example "the plane flies", the system might swap in "jogging" to create the contrasting negative example "the jogging flies".
The other version of the algorithm creates negative examples by pairing the true target word with randomly chosen context words. So it might take the positive examples (the, plane), (flies, plane) and the negative examples (compiled, plane), (who, plane) and learn to identify which pairs actually appeared together in text.
The classifier is not the real goal for either version of the system, however. After the model has been trained, you have an embedding. You can use the weights connecting the input layer with the hidden layer to map sparse representations of words to smaller vectors. This embedding can be reused in other classifiers.
For more information about word2vec, see the tutorial on tensorflow.org.
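If you'd rather experiment outside TensorFlow, libraries such as gensim provide a word2vec implementation with skip-gram and negative sampling. The sketch below is illustrative only; the toy corpus and hyperparameters are made up:

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens. A real corpus would be far larger.
sentences = [
    ["the", "dog", "went", "to", "the", "vet"],
    ["the", "cat", "went", "to", "the", "vet"],
    ["the", "plane", "flies", "over", "the", "city"],
]

# sg=1 selects skip-gram; negative=5 draws five random negative words per positive example.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, negative=5)

print(model.wv["dog"])               # the learned 50-dimensional vector for "dog"
print(model.wv.most_similar("dog"))  # nearest words in the embedding space
```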
You can also learn an embedding as part of the neural network for your target task. This approach gets you an embedding well customized for your particular system, but may take longer than training the embedding separately.
In general, when you have sparse data (or dense data that you'd like to embed), you can create an embedding unit that is just a special type of hidden unit of size d. This embedding layer can be combined with any other features and hidden layers. As in any DNN, the final layer will be the loss that is being optimized. For example, let's say we're performing collaborative filtering, where the goal is to predict a user's interests from the interests of other users. We can model this as a supervised learning problem by randomly setting aside (or holding out) a small number of the movies that the user has watched as the positive labels, and then optimize a softmax loss.
Figure 5. A sample DNN architecture for learning movie embeddings from collaborative filtering data.
As another example, if you want to create an embedding layer for the words in a real-estate ad as part of a DNN to predict housing prices, then you'd optimize an L2 loss using the known sale price of homes in your training data as the label.
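A hedged Keras sketch of that second example (the vocabulary size, embedding dimensions, and ad length are all assumptions): the ad's word indices pass through an embedding layer learned jointly with the rest of the network, get pooled into a single vector, and feed a regression head trained with L2 (mean squared error) loss:

```python
import tensorflow as tf

VOCAB_SIZE = 10_000  # hypothetical vocabulary of words appearing in ads
EMBED_DIM = 16       # d, the number of embedding dimensions
AD_LENGTH = 50       # hypothetical fixed number of word indices per ad

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(AD_LENGTH,)),
    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_DIM),
    tf.keras.layers.GlobalAveragePooling1D(),  # collapse the ad's word vectors into one
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),                  # predicted sale price
])
model.compile(optimizer="adam", loss="mse")    # L2 loss against the known sale prices
```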
When learning a d-dimensional embedding, each item is mapped to a point in a d-dimensional space so that similar items are nearby in this space. Figure 6 helps to illustrate the relationship between the weights learned in the embedding layer and the geometric view. The edge weights between an input node and the nodes in the d-dimensional embedding layer correspond to the coordinate values for each of the d axes.
Figure 6. A geometric view of the embedding layer weights.
The behavior of an ML system is dependent on the behavior and qualities of its input features. As the input data for those features changes, so too will your model. Sometimes that change is desirable, but sometimes it is not.
In traditional software development, you focus more on code than on data. In machine learning development, although coding is still part of the job, your focus must widen to include data. For example, on traditional software development projects, it is a best practice to write unit tests to validate your code. On ML projects, you must also continuously test, verify, and monitor your input data.
For example, you should continuously monitor your model to remove unused (or little used) features. Imagine a certain feature that has been contributing little or nothing to the model. If the input data for that feature abruptly changes, your model's behavior might also abruptly change in undesirable ways.
Some questions to ask about the reliability of your input data:
Is the signal always going to be available or is it coming from an unreliable source? For example:
Is the signal coming from a server that crashes under heavy load?
Is the signal coming from humans that go on vacation every August?
Some questions to ask about versioning:
Does the system that computes this data ever change? If so:
How often?
How will you know when that system changes?
Sometimes, data comes from an upstream process. If that process changes abruptly, your model can suffer.
Consider creating your own copy of the data you receive from the upstream process. Then, only advance to the next version of the upstream data when you are certain that it is safe to do so.
The following question might remind you of regularization:
Does the usefulness of the feature justify the cost of including it?
It is always tempting to add more features to the model. For example, suppose you find a new feature whose addition makes your model slightly more accurate. More accuracy certainly sounds better than less accuracy. However, now you've just added to your maintenance burden. That additional feature could degrade unexpectedly, so you've got to monitor it. Think carefully before adding features that lead to minor short-term wins.
Some features correlate (positively or negatively) with other features. Ask yourself the following question:
Are any features so tied together that you need additional strategies to tease them apart?
Sometimes a model can affect its own training data. For example, the results from some models, in turn, are directly or indirectly input features to that same model.
Sometimes a model can affect another model. For example, consider two models for predicting stock prices:
Model A, which is a bad predictive model.
Model B.
Since Model A is buggy, it mistakenly decides to buy shares of Stock X. Those purchases drive up the price of Stock X. Model B uses the price of Stock X as an input feature, so it can easily come to some false conclusions about the value of Stock X. Model B could, therefore, buy or sell shares of Stock X based on the buggy behavior of Model A. Model B's behavior, in turn, can affect Model A, possibly triggering a tulip mania or a slide in Stock X's price.
Machine learning models are not inherently objective. Engineers train models by feeding them a data set of training examples, and human involvement in the provision and curation of this data can make a model's predictions susceptible to bias.
When building models, it's important to be aware of common human biases that can manifest in your data, so you can take proactive steps to mitigate their effects.
WARNING: The following inventory of biases provides just a small selection of biases that are often uncovered in machine learning data sets; this list is not intended to be exhaustive. Wikipedia's catalog of cognitive biases enumerates over 100 different types of human bias that can affect our judgment. When auditing your data, you should be on the lookout for any and all potential sources of bias that might skew your model's predictions.
Reporting bias occurs when the frequency of events, properties, and/or outcomes captured in a data set does not accurately reflect their real-world frequency. This bias can arise because people tend to focus on documenting circumstances that are unusual or especially memorable, assuming that the ordinary can "go without saying."
EXAMPLE: A sentiment-analysis model is trained to predict whether book reviews are positive or negative based on a corpus of user submissions to a popular website. The majority of reviews in the training data set reflect extreme opinions (reviewers who either loved or hated a book), because people were less likely to submit a review of a book if they did not respond to it strongly. As a result, the model is less able to correctly predict sentiment of reviews that use more subtle language to describe a book.
Automation bias is a tendency to favor results generated by automated systems over those generated by non-automated systems, irrespective of the error rates of each.
EXAMPLE: Software engineers working for a sprocket manufacturer were eager to deploy the new "groundbreaking" model they trained to identify tooth defects, until the factory supervisor pointed out that the model's precision and recall rates were both 15% lower than those of human inspectors.
Selection bias occurs if a data set's examples are chosen in a way that is not reflective of their real-world distribution. Selection bias can take many different forms:
Coverage bias: Data is not selected in a representative fashion.
EXAMPLE: A model is trained to predict future sales of a new product based on phone surveys conducted with a sample of consumers who bought the product. Consumers who instead opted to buy a competing product were not surveyed, and as a result, this group of people was not represented in the training data.
Non-response bias (or participation bias): Data ends up being unrepresentative due to participation gaps in the data-collection process.
EXAMPLE: A model is trained to predict future sales of a new product based on phone surveys conducted with a sample of consumers who bought the product and with a sample of consumers who bought a competing product. Consumers who bought the competing product were 80% more likely to refuse to complete the survey, and their data was underrepresented in the sample.
Sampling bias: Proper randomization is not used during data collection.
EXAMPLE: A model is trained to predict future sales of a new product based on phone surveys conducted with a sample of consumers who bought the product and with a sample of consumers who bought a competing product. Instead of randomly targeting consumers, the surveyor chose the first 200 consumers who responded to an email, who might have been more enthusiastic about the product than average purchasers.
Group attribution bias is a tendency to generalize what is true of individuals to an entire group to which they belong. Two key manifestations of this bias are:
In-group bias: A preference for members of a group to which you also belong, or for characteristics that you also share.
EXAMPLE: Two engineers training a résumé-screening model for software developers are predisposed to believe that applicants who attended the same computer-science academy as they both did are more qualified for the role.
Out-group homogeneity bias: A tendency to stereotype individual members of a group to which you do not belong, or to see their characteristics as more uniform.
EXAMPLE: Two engineers training a résumé-screening model for software developers are predisposed to believe that all applicants who did not attend a computer-science academy do not have sufficient expertise for the role.
Implicit bias occurs when assumptions are made based on one's own mental models and personal experiences that do not necessarily apply more generally.
EXAMPLE: An engineer training a gesture-recognition model uses a head shake as a feature to indicate a person is communicating the word "no." However, in some regions of the world, a head shake actually signifies "yes."
A common form of implicit bias is confirmation bias, where model builders unconsciously process data in ways that affirm preexisting beliefs and hypotheses. In some cases, a model builder may actually keep training a model until it produces a result that aligns with their original hypothesis; this is called experimenter's bias.
EXAMPLE: An engineer is building a model that predicts aggressiveness in dogs based on a variety of features (height, weight, breed, environment). The engineer had an unpleasant encounter with a hyperactive toy poodle as a child, and ever since has associated the breed with aggression. When the trained model predicted most toy poodles to be relatively docile, the engineer retrained the model several more times until it produced a result showing smaller poodles to be more violent.
As you explore your data to determine how best to represent it in your model, it's important to also keep issues of fairness in mind and proactively audit for potential sources of bias.
Where might bias lurk? Here are three red flags to look out for in your data set.
If your data set has one or more features that have missing values for a large number of examples, that could be an indicator that certain key characteristics of your data set are under-represented.
For example, the table below shows a summary of key stats for a subset of features in the California Housing dataset, stored in a pandas `DataFrame` and generated via `DataFrame.describe`. Note that all features have a `count` of 17000, indicating there are no missing values:
|  | longitude | latitude | total_rooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|
| count | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 |
| mean | -119.6 | 35.6 | 2643.7 | 1429.6 | 501.2 | 3.9 | 207.3 |
| std | 2.0 | 2.1 | 2179.9 | 1147.9 | 384.5 | 1.9 | 116.0 |
| min | -124.3 | 32.5 | 2.0 | 3.0 | 1.0 | 0.5 | 15.0 |
| 25% | -121.8 | 33.9 | 1462.0 | 790.0 | 282.0 | 2.6 | 119.4 |
| 50% | -118.5 | 34.2 | 2127.0 | 1167.0 | 409.0 | 3.5 | 180.4 |
| 75% | -118.0 | 37.7 | 3151.2 | 1721.0 | 605.2 | 4.8 | 265.0 |
| max | -114.3 | 42.0 | 37937.0 | 35682.0 | 6082.0 | 15.0 | 500.0 |
Suppose instead that three features (`population`, `households`, and `median_income`) only had a `count` of 3000 (in other words, that there were 14,000 missing values for each feature):
|  | longitude | latitude | total_rooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|
| count | 17000.0 | 17000.0 | 17000.0 | 3000.0 | 3000.0 | 3000.0 | 17000.0 |
| mean | -119.6 | 35.6 | 2643.7 | 1429.6 | 501.2 | 3.9 | 207.3 |
| std | 2.0 | 2.1 | 2179.9 | 1147.9 | 384.5 | 1.9 | 116.0 |
| min | -124.3 | 32.5 | 2.0 | 3.0 | 1.0 | 0.5 | 15.0 |
| 25% | -121.8 | 33.9 | 1462.0 | 790.0 | 282.0 | 2.6 | 119.4 |
| 50% | -118.5 | 34.2 | 2127.0 | 1167.0 | 409.0 | 3.5 | 180.4 |
| 75% | -118.0 | 37.7 | 3151.2 | 1721.0 | 605.2 | 4.8 | 265.0 |
| max | -114.3 | 42.0 | 37937.0 | 35682.0 | 6082.0 | 15.0 | 500.0 |
These 14,000 missing values would make it much more difficult to accurately correlate average income of households with median house prices. Before training a model on this data, it would be prudent to investigate the cause of these missing values to ensure that there are no latent biases responsible for missing income and population data.
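Generating this kind of summary takes a single call; here's a minimal pandas sketch (the CSV filename is a placeholder for wherever you keep your copy of the data set):

```python
import pandas as pd

# Placeholder path: substitute the location of your California Housing CSV.
df = pd.read_csv("california_housing_train.csv")

# count, mean, std, min, quartiles, and max for every numeric column.
print(df.describe())

# The count row (or an explicit null check) is the quickest way to spot missing values.
print(df.isnull().sum())
```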
When exploring data, you should also look for examples that contain feature values that stand out as especially uncharacteristic or unusual. These unexpected feature values could indicate problems that occurred during data collection or other inaccuracies that could introduce bias.
For example, take a look at the following excerpted examples from the California housing data set:
|  | longitude | latitude | total_rooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|
| 1 | -121.7 | 38.0 | 7105.0 | 3523.0 | 1088.0 | 5.0 | 0.2 |
| 2 | -122.4 | 37.8 | 2479.0 | 1816.0 | 496.0 | 3.1 | 0.3 |
| 3 | -122.0 | 37.0 | 2813.0 | 1337.0 | 477.0 | 3.7 | 0.3 |
| 4 | -103.5 | 43.8 | 2212.0 | 803.0 | 144.0 | 5.3 | 0.2 |
| 5 | -117.1 | 32.8 | 2963.0 | 1162.0 | 556.0 | 3.6 | 0.2 |
| 6 | -118.0 | 33.7 | 3396.0 | 1542.0 | 472.0 | 7.4 | 0.4 |
Can you pinpoint any unexpected feature values?
The longitude and latitude coordinates in example 4 (-103.5 and 43.8, respectively) do not fall within the U.S. state of California. In fact, they are the approximate coordinates of Mount Rushmore National Memorial in the state of South Dakota. This is a bogus example that we inserted into the data set.
Any sort of skew in your data, where certain groups or characteristics may be under- or over-represented relative to their real-world prevalence, can introduce bias into your model.
If you completed the Validation programming exercise, you may recall discovering how a failure to randomize the California housing data set prior to splitting it into training and validation sets resulted in a pronounced data skew. Figure 1 visualizes a subset of data drawn from the full data set that exclusively represents the northwest region of California.
Figure 1. California state map overlaid with data from the California Housing data set. Each dot represents a housing block, with colors ranging from blue to red corresponding to median house price ranging from low to high, respectively.
If this unrepresentative sample were used to train a model to predict California housing prices statewide, the lack of housing data from southern portions of California would be problematic. The geographical bias encoded in the model might adversely affect homebuyers in unrepresented communities.
When evaluating a model, metrics calculated against an entire test or validation set don't always give an accurate picture of how fair the model is.
Consider a new model developed to predict the presence of tumors that is evaluated against a validation set of 1,000 patients' medical records. 500 records are from female patients, and 500 records are from male patients. The following confusion matrix summarizes the results for all 1,000 examples:

|  | Tumor (predicted) | No tumor (predicted) |
|---|---|---|
| Tumor (actual) | True positives: 16 | False negatives: 6 |
| No tumor (actual) | False positives: 4 | True negatives: 974 |
These results look promising: precision of 80% and recall of 72.7%. But what happens if we calculate the result separately for each set of patients? Let's break out the results into two separate confusion matrices: one for female patients and one for male patients.
When we calculate metrics separately for female and male patients, we see stark differences in model performance for each group.
Female patients:
Of the 11 female patients who actually have tumors, the model correctly predicts positive for 10 patients (recall rate: 90.9%). In other words, the model misses a tumor diagnosis in 9.1% of female cases.
Similarly, when the model returns positive for tumor in female patients, it is correct in 10 out of 11 cases (precision rate: 90.9%); in other words, the model incorrectly predicts tumor in 9.1% of female cases.
Male patients:
However, of the 11 male patients who actually have tumors, the model correctly predicts positive for only 6 patients (recall rate: 54.5%). That means the model misses a tumor diagnosis in 45.5% of male cases.
And when the model returns positive for tumor in male patients, it is correct in only 6 out of 9 cases (precision rate: 66.7%); in other words, the model incorrectly predicts tumor in 33.3% of male cases.
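This kind of sliced evaluation is easy to script. The sketch below (the label, prediction, and group arrays are placeholders; in practice they come from your validation set and model) uses scikit-learn to compute precision and recall per subgroup:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Placeholder data: true labels, model predictions, and a group attribute per example.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
group  = np.array(["female", "female", "female", "male",
                   "male", "male", "female", "male"])

# Aggregate metrics can hide large differences between subgroups, so slice by group.
for g in ["female", "male"]:
    mask = group == g
    p = precision_score(y_true[mask], y_pred[mask])
    r = recall_score(y_true[mask], y_pred[mask])
    print(f"{g}: precision={p:.2f} recall={r:.2f}")
```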
We now have a much better understanding of the biases inherent in the model's predictions, as well as the risks to each subgroup if the model were to be released for medical use in the general population.
Fairness is a relatively new subfield within the discipline of machine learning. To learn more about research and initiatives devoted to developing new tools and techniques for identifying and mitigating bias in machine learning models, check out Google's Machine Learning Fairness resources page.
Keep your first model simple.
Focus on ensuring data pipeline correctness.
Use a simple, observable metric for training & evaluation.
Own and monitor your input features.
Treat your model configuration as code: review it, check it in.
Write down the results of all experiments, especially "failures."