Google ML Crash Course
https://developers.google.com/machine-learning/crash-course/
True, the line doesn't pass through every dot, but the line does clearly show the relationship between chirps and temperature. Using the equation for a line, you could write down this relationship as follows:

$$y = mx + b$$

where:

$y$ is the temperature in Celsius—the value we're trying to predict.
$m$ is the slope of the line.
$x$ is the number of chirps per minute—the value of our input feature.
$b$ is the $y$-intercept.
By convention in machine learning, you'll write the equation for a model slightly differently:

$$y' = b + w_1x_1$$

where:

$y'$ is the predicted label (a desired output).
$b$ is the bias (the y-intercept), sometimes referred to as $w_0$.
$w_1$ is the weight of feature 1. Weight is the same concept as the "slope" $m$ in the traditional equation of a line.
$x_1$ is a feature (a known input).

To infer (predict) the temperature $y'$ for a new chirps-per-minute value $x_1$, just substitute the $x_1$ value into this model.
Although this model uses only one feature, a more sophisticated model might rely on multiple features, each having a separate weight ($w_1$, $w_2$, etc.). For example, a model that relies on three features might look as follows:

$$y' = b + w_1x_1 + w_2x_2 + w_3x_3$$
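To make this concrete, here is a minimal sketch in NumPy of a three-feature prediction; the weights, bias, and feature values are made up for illustration and are not from the course:

```python
import numpy as np

# Illustrative bias and weights for a three-feature linear model:
# y' = b + w1*x1 + w2*x2 + w3*x3
b = 2.0                          # bias (y-intercept)
w = np.array([0.3, -1.5, 4.0])   # one weight per feature
x = np.array([7.0, 0.2, 1.1])    # one example's feature values

y_pred = b + np.dot(w, x)        # substitute the features into the model
print(y_pred)                    # 2.0 + 2.1 - 0.3 + 4.4 ≈ 8.2
```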
The linear regression models we'll examine here use a loss function called squared loss (also known as L2 loss). The squared loss for a single example is as follows:

$$\text{squared loss} = (\text{observation} - \text{prediction}(x))^2 = (y - y')^2$$
Mean square error (MSE) is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples:

$$MSE = \frac{1}{N} \sum_{(x,y) \in D} (y - \text{prediction}(x))^2$$

where:

$(x, y)$ is an example in which
$x$ is the set of features (for example, chirps/minute, age, gender) that the model uses to make predictions.
$y$ is the example's label (for example, temperature).
$\text{prediction}(x)$ is a function of the weights and bias in combination with the set of features $x$.
$D$ is a data set containing many labeled examples, which are $(x, y)$ pairs.
$N$ is the number of examples in $D$.
Although MSE is commonly used in machine learning, it is neither the only practical loss function nor the best loss function for all circumstances.
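As a quick check of the formula, here is a small NumPy sketch that computes the per-example squared loss and the MSE over a tiny made-up dataset:

```python
import numpy as np

# Made-up labels and predictions for N = 4 examples.
y_true = np.array([10.0, 20.0, 30.0, 40.0])   # labels
y_pred = np.array([12.0, 18.0, 33.0, 39.0])   # model predictions

squared_loss = (y_true - y_pred) ** 2          # per-example L2 loss
mse = squared_loss.mean()                      # average over the dataset

print(squared_loss)   # [4. 4. 9. 1.]
print(mse)            # (4 + 4 + 9 + 1) / 4 = 4.5
```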
For convex problems, weights can start anywhere (say, all 0s)
Convex: think of a bowl shape
Just one minimum
Foreshadowing: not true for neural nets
Non-convex: think of an egg crate
More than one minimum
Strong dependency on initial values
Could compute gradient over entire data set on each step, but this turns out to be unnecessary
Computing gradient on small data samples works well
On every step, get a new random sample
Stochastic Gradient Descent: one example at a time
Mini-Batch Gradient Descent: batches of 10-1000
Loss & gradients are averaged over the batch (a sketch of mini-batch gradient descent follows below)
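Below is a minimal sketch of mini-batch gradient descent for a one-feature linear model, using plain NumPy, synthetic data, and the hand-derived MSE gradient; the batch size, learning rate, and step count are illustrative choices, not prescriptions from the course:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data roughly following y = 3x + 5, plus noise.
x = rng.uniform(0, 10, size=1000)
y = 3.0 * x + 5.0 + rng.normal(0, 1, size=1000)

w, b = 0.0, 0.0          # convex problem: starting at 0 is fine
learning_rate = 0.01
batch_size = 32          # "mini-batch": somewhere between 10 and 1000

for step in range(2000):
    # On every step, draw a new random sample of examples.
    idx = rng.integers(0, len(x), size=batch_size)
    xb, yb = x[idx], y[idx]

    y_pred = w * xb + b
    error = y_pred - yb

    # Gradients of the MSE loss with respect to w and b, averaged over the batch.
    grad_w = 2 * np.mean(error * xb)
    grad_b = 2 * np.mean(error)

    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)   # should land near 3 and 5
```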
Note that TensorFlow handles all the gradient computations for you, so you don't actually have to understand the calculus provided here.
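For example, TensorFlow's tf.GradientTape computes gradients automatically; a small sketch (the function being differentiated is just an illustration):

```python
import tensorflow as tf

x = tf.Variable(3.0)

with tf.GradientTape() as tape:
    y = x ** 2            # y = x^2

dy_dx = tape.gradient(y, x)
print(dy_dx.numpy())      # 6.0, i.e. dy/dx = 2x evaluated at x = 3
```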
A multivariable function is a function with more than one argument, such as:

$$f(x, y) = e^{2y} \sin(x)$$

The partial derivative of $f$ with respect to $x$, denoted as follows:

$$\frac{\partial f}{\partial x}$$

is the derivative of $f$ considered as a function of $x$ alone. To find the following:

$$\frac{\partial f}{\partial x}$$

you must hold $y$ constant (so $f$ is now a function of one variable $x$), and take the regular derivative of $f$ with respect to $x$. For example, when $y$ is fixed at 1, the preceding function becomes:

$$f(x) = e^2 \sin(x)$$

This is just a function of one variable $x$, whose derivative is:

$$e^2 \cos(x)$$
In general, thinking of $y$ as fixed, the partial derivative of $f$ with respect to $x$ is calculated as follows:

$$\frac{\partial f}{\partial x}(x, y) = e^{2y} \cos(x)$$

Similarly, if we hold $x$ fixed instead, the partial derivative of $f$ with respect to $y$ is:

$$\frac{\partial f}{\partial y}(x, y) = 2e^{2y} \sin(x)$$
Intuitively, a partial derivative tells you how much the function changes when you perturb one variable a bit. In the preceding example:

$$\frac{\partial f}{\partial x}(0, 1) = e^2 \approx 7.4$$

So when you start at $(0, 1)$, hold $y$ constant, and move $x$ a little, $f$ changes by about 7.4 times the amount that you changed $x$.
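A quick numerical sanity check of that figure, using a finite-difference approximation in plain Python (the step size h is an arbitrary small value):

```python
import math

def f(x, y):
    return math.exp(2 * y) * math.sin(x)

h = 1e-6   # small perturbation of x, with y held constant at 1
df_dx = (f(0 + h, 1) - f(0, 1)) / h

print(df_dx)          # ~7.389, i.e. approximately e^2
print(math.exp(2))    # 7.389056...
```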
In machine learning, partial derivatives are mostly used in conjunction with the gradient of a function.
The gradient of a function, denoted $\nabla f$, is the vector of partial derivatives with respect to all of the independent variables:

$$\nabla f = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right)$$

For instance, if:

$$f(x, y) = e^{2y} \sin(x)$$

then:

$$\nabla f(x, y) = \left(\frac{\partial f}{\partial x}(x, y), \frac{\partial f}{\partial y}(x, y)\right) = \left(e^{2y} \cos(x),\; 2e^{2y} \sin(x)\right)$$
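The same partial derivatives can be derived symbolically; here is a small sketch assuming SymPy is available (SymPy is not part of the course material):

```python
import sympy as sp

x, y = sp.symbols('x y')
f = sp.exp(2 * y) * sp.sin(x)

# Gradient = vector of partial derivatives with respect to x and y.
grad_f = (sp.diff(f, x), sp.diff(f, y))
print(grad_f)   # (exp(2*y)*cos(x), 2*exp(2*y)*sin(x))
```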
Note the following:
$\nabla f$ points in the direction of greatest increase of the function.
$-\nabla f$ points in the direction of greatest decrease of the function.
The number of dimensions in the vector is equal to the number of variables in the formula for $f$; in other words, the vector falls within the domain space of the function. For instance, the graph of the following function $f(x, y)$:

$$f(x, y) = 4 + (x - 2)^2 + 2y^2$$

when viewed in three dimensions with $z = f(x, y)$, looks like a valley with a minimum at $(x, y) = (2, 0)$.
The gradient of $f(x, y)$ is a two-dimensional vector that tells you in which $(x, y)$ direction to move for the maximum increase in height. Thus, the negative of the gradient moves you in the direction of maximum decrease in height. In other words, the negative of the gradient vector points into the valley.
In machine learning, gradients are used in gradient descent. We often have a loss function of many variables that we are trying to minimize, and we try to do this by following the negative of the gradient of the function.
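To tie these ideas together, here is a minimal sketch of gradient descent on the valley function above, following the negative of its hand-derived gradient; the starting point, learning rate, and step count are arbitrary choices:

```python
# f(x, y) = 4 + (x - 2)^2 + 2*y^2, with its minimum at (x, y) = (2, 0)

def grad_f(x, y):
    # Partial derivatives of f: (2*(x - 2), 4*y)
    return 2 * (x - 2), 4 * y

x, y = -1.0, 3.0          # arbitrary starting point
learning_rate = 0.1

for step in range(100):
    gx, gy = grad_f(x, y)
    x -= learning_rate * gx   # step in the direction of the negative gradient
    y -= learning_rate * gy

print(x, y)   # close to (2, 0), the bottom of the valley
```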