Google ML Crash Course
https://developers.google.com/machine-learning/crash-course/
Linear Regression
True, the line doesn't pass through every dot, but the line does clearly show the relationship between chirps and temperature. Using the equation for a line, you could write down this relationship as follows:

y = mx + b

where:

- y is the temperature in Celsius (the value we're trying to predict)
- m is the slope of the line
- x is the number of chirps per minute (the value of our input feature)
- b is the y-intercept
By convention in machine learning, you'll write the equation for a model slightly differently:

y' = b + w1x1

where:

- y' is the predicted label (a desired output)
- b is the bias (the y-intercept), sometimes referred to as w0
- w1 is the weight of feature 1 (the same concept as the slope m in the traditional equation of a line)
- x1 is a feature (a known input)
Although this model uses only one feature, a more sophisticated model might rely on multiple features, each having a separate weight (w1, w2, etc.). For example, a model that relies on three features might look as follows:

y' = b + w1x1 + w2x2 + w3x3
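The three-feature model above can be sketched in a few lines of code. The weight, bias, and feature values below are made-up numbers for illustration:

```python
# A minimal sketch of a linear model with three features:
# y' = b + w1*x1 + w2*x2 + w3*x3

def predict(features, weights, bias):
    """Return the model's predicted label y' for one example."""
    return bias + sum(w * x for w, x in zip(weights, features))

weights = [0.5, -1.2, 3.0]   # w1, w2, w3 (hypothetical values)
bias = 2.0                   # b
features = [1.0, 2.0, 0.5]   # x1, x2, x3

print(predict(features, weights, bias))  # 2.0 + 0.5 - 2.4 + 1.5 = 1.6
```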
Squared loss: a popular loss function
The linear regression models we'll examine here use a loss function called squared loss (also known as L2 loss). The squared loss for a single example is the square of the difference between the label and the prediction:

(observation - prediction(x))^2 = (y - y')^2
Mean squared error (MSE) is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples:

MSE = (1/N) Σ (y - prediction(x))^2

where:

- (x, y) is an example in which x is the set of features the model uses to make predictions and y is the example's label
- prediction(x) is a function of the weights and bias in combination with the set of features x
- N is the number of examples in the dataset
Although MSE is commonly used in machine learning, it is neither the only practical loss function nor the best loss function for all circumstances.
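The MSE formula above can be sketched directly. This assumes a one-feature linear model and a small made-up dataset of (feature, label) pairs:

```python
# A minimal sketch of squared loss averaged into MSE for a
# one-feature linear model y' = wx + b.

def predict(x, w, b):
    return b + w * x

def mse(examples, w, b):
    """Mean squared error over the dataset: average of (y - y')^2."""
    total = sum((y - predict(x, w, b)) ** 2 for x, y in examples)
    return total / len(examples)

examples = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # labels follow y = 2x + 1
print(mse(examples, w=2.0, b=1.0))  # 0.0: every prediction matches its label
print(mse(examples, w=2.0, b=0.0))  # 1.0: every prediction is off by 1
```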
Reducing Loss
Weight Initialization
For convex problems, weights can start anywhere (say, all 0s)
Convex: think of a bowl shape
Just one minimum
Foreshadowing: not true for neural nets
Non-convex: think of an egg crate
More than one minimum
Strong dependency on initial values
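The dependency on initial values can be demonstrated with a toy one-dimensional example. The function f(w) = cos(w) is non-convex (an "egg crate" in one dimension), and gradient descent lands in a different minimum depending on where it starts; the starting points and learning rate below are made up for illustration:

```python
import math

# A minimal sketch of initial-value dependence on a non-convex surface.
# f(w) = cos(w) has infinitely many minima; gradient descent converges
# to whichever one is nearest the starting point.

def descend(w, lr=0.1, steps=1000):
    for _ in range(steps):
        w -= lr * (-math.sin(w))  # f'(w) = -sin(w)
    return w

print(descend(2.0))   # converges near pi   (~3.14), one minimum
print(descend(8.0))   # converges near 3*pi (~9.42), a different minimum
```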
SGD & Mini-Batch Gradient Descent
Could compute gradient over entire data set on each step, but this turns out to be unnecessary
Computing gradient on small data samples works well
On every step, get a new random sample
Stochastic Gradient Descent: one example at a time
Mini-Batch Gradient Descent: batches of 10-1000
Loss & gradients are averaged over the batch
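The mini-batch procedure above can be sketched for one-feature linear regression with squared loss. The dataset, learning rate, and batch size are made-up values for illustration:

```python
import random

# A minimal sketch of mini-batch gradient descent. On every step we
# draw a new random sample, average the gradient of squared loss over
# the batch, and step the weight and bias downhill.

def grad(batch, w, b):
    """Average gradient of (y - (wx + b))^2 over the batch."""
    n = len(batch)
    dw = sum(-2 * x * (y - (w * x + b)) for x, y in batch) / n
    db = sum(-2 * (y - (w * x + b)) for x, y in batch) / n
    return dw, db

random.seed(0)
data = [(x, 2.0 * x + 1.0) for x in range(10)]  # true line: y = 2x + 1
w, b, lr, batch_size = 0.0, 0.0, 0.01, 4

for step in range(2000):
    batch = random.sample(data, batch_size)  # new random sample each step
    dw, db = grad(batch, w, b)
    w -= lr * dw
    b -= lr * db

print(round(w, 2), round(b, 2))  # approaches w=2.0, b=1.0
```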
Math
Note that TensorFlow handles all the gradient computations for you, so you don't actually have to understand the calculus provided here.
Partial derivatives
A multivariable function is a function with more than one argument, such as:

f(x, y) = e^(2y) sin(x)

The partial derivative of f with respect to x, denoted ∂f/∂x, is the derivative of f considered as a function of x alone, with y held constant:

∂f/∂x = e^(2y) cos(x)

Intuitively, a partial derivative tells you how much the function changes when you perturb one variable a bit while holding the others constant. In the preceding example, ∂f/∂x(0, 1) = e^2 cos(0) = e^2, so if you start at (0, 1), hold y constant, and move x a little, f changes by about e^2 ≈ 7.4 times the amount you moved.
In machine learning, partial derivatives are mostly used in conjunction with the gradient of a function.
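A partial derivative can be checked numerically with a finite difference: perturb one variable a tiny amount while holding the other constant. The example function here is f(x, y) = e^(2y) sin(x), whose partial derivative with respect to x is e^(2y) cos(x); the evaluation point is arbitrary:

```python
import math

# A minimal sketch: verify an analytic partial derivative with a
# central finite-difference approximation.

def f(x, y):
    return math.exp(2 * y) * math.sin(x)

def df_dx(x, y):
    return math.exp(2 * y) * math.cos(x)

h = 1e-6
x, y = 0.5, 1.0
# Perturb x a bit in both directions; y stays constant.
numeric = (f(x + h, y) - f(x - h, y)) / (2 * h)

print(df_dx(x, y))  # analytic value
print(numeric)      # should agree to several decimal places
```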
Gradients
The gradient of a function, denoted ∇f, is the vector of partial derivatives with respect to all of the independent variables:

∇f = (∂f/∂x, ∂f/∂y)

For instance, if:

f(x, y) = e^(2y) sin(x)

then:

∇f(x, y) = (∂f/∂x, ∂f/∂y) = (e^(2y) cos(x), 2e^(2y) sin(x))

Note the following:

- ∇f points in the direction of greatest increase of the function.
- -∇f points in the direction of greatest decrease of the function.
- The number of dimensions in ∇f is equal to the number of variables in the formula for f.
In machine learning, gradients are used in gradient descent: we often have a loss function of many variables that we are trying to minimize, and we minimize it by repeatedly stepping in the direction of the negative of the gradient.
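The descent procedure can be sketched on a simple convex loss. The loss f(x, y) = x^2 + y^2 is a made-up example with gradient (2x, 2y) and its minimum at (0, 0); the starting point and learning rate are arbitrary:

```python
# A minimal sketch of gradient descent: repeatedly step in the
# direction of the negative gradient of the loss.

def gradient(x, y):
    """Gradient of f(x, y) = x^2 + y^2."""
    return 2 * x, 2 * y

x, y, lr = 3.0, -4.0, 0.1
for _ in range(100):
    gx, gy = gradient(x, y)
    x -= lr * gx  # follow the negative of the gradient
    y -= lr * gy

print(x, y)  # both coordinates approach the minimum at (0, 0)
```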
TensorFlow API Hierarchy