Google ML Crash Course

https://developers.google.com/machine-learning/crash-course/

Linear Regression

True, the line doesn't pass through every dot, but the line does clearly show the relationship between chirps and temperature. Using the equation for a line, you could write down this relationship as follows:

y = mx + b

where:

  • y is the temperature in Celsius, the value we're trying to predict.

  • m is the slope of the line.

  • x is the number of chirps per minute, the value of our input feature.

  • b is the y-intercept.

By convention in machine learning, you'll write the equation for a model slightly differently:

y' = b + w_1x_1

where:

  • y′ is the predicted label (a desired output).

  • b is the bias (the y-intercept), sometimes referred to as w_0.

  • w_1 is the weight of feature 1. Weight is the same concept as the "slope" m in the traditional equation of a line.

  • x_1 is a feature (a known input).

To infer (predict) the temperature y′ for a new chirps-per-minute value x_1, just substitute the x_1 value into this model.
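
As a concrete illustration, here is a minimal Python sketch of that substitution; the bias and weight values are made up for the example, not fitted to real data:

  # Hypothetical learned parameters (illustrative values only, not fitted to data).
  b = 3.0     # bias (y-intercept)
  w1 = 0.2    # weight for feature 1 (chirps per minute)

  def predict(x1):
      """Return the predicted temperature y' for a chirps-per-minute value x1."""
      return b + w1 * x1

  print(predict(80))   # 3.0 + 0.2 * 80 = 19.0 degrees Celsius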

Although this model uses only one feature, a more sophisticated model might rely on multiple features, each having a separate weight (w_1, w_2, etc.). For example, a model that relies on three features might look as follows:

y' = b + w_1x_1 + w_2x_2 + w_3x_3
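
With multiple features, the substitution becomes a dot product between the weight vector and the feature vector. A minimal NumPy sketch, again with made-up weights:

  import numpy as np

  # Hypothetical bias and per-feature weights (illustrative values only).
  b = 3.0
  w = np.array([0.2, -0.05, 0.01])    # w_1, w_2, w_3

  def predict(x):
      """y' = b + w_1*x_1 + w_2*x_2 + w_3*x_3 for a feature vector x."""
      return b + np.dot(w, x)

  print(predict(np.array([80.0, 10.0, 5.0])))   # ≈ 18.55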

The linear regression models we'll examine here use a loss function called squared loss (also known as L2 loss). The squared loss for a single example is as follows:

  = the square of the difference between the label and the prediction
  = (observation - prediction(x))^2
  = (y - y')^2
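
As a small Python helper (with made-up numbers), the squared loss for a single example is simply:

  def squared_loss(y, y_prime):
      """L2 loss for one example: the square of (label - prediction)."""
      return (y - y_prime) ** 2

  print(squared_loss(20.0, 19.0))   # (20 - 19)^2 = 1.0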

Mean square error (MSE) is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples:

MSE = \frac{1}{N} \sum_{(x,y) \in D} (y - prediction(x))^2

where:

  • (x, y) is an example in which

    • x is the set of features (for example, chirps/minute, age, gender) that the model uses to make predictions.

    • y is the example's label (for example, temperature).

  • prediction(x) is a function of the weights and bias in combination with the set of features x.

  • D is a data set containing many labeled examples, which are (x, y) pairs.

  • N is the number of examples in D.
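
Putting these pieces together, here is a minimal NumPy sketch that computes MSE for a toy dataset; the data and parameters are made-up values for illustration:

  import numpy as np

  # Toy labeled dataset D: features x (chirps per minute) and labels y (temperature).
  x = np.array([60.0, 70.0, 80.0, 90.0])     # illustrative values only
  y = np.array([15.5, 17.0, 18.5, 21.0])

  # Hypothetical model parameters.
  b, w1 = 3.0, 0.2

  def mse(x, y, b, w1):
      """Average squared loss over the N examples: (1/N) * sum((y - prediction(x))^2)."""
      predictions = b + w1 * x
      return np.mean((y - predictions) ** 2)

  print(mse(x, y, b, w1))   # -> 0.125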

Although MSE is commonly used in machine learning, it is neither the only practical loss function nor the best loss function for all circumstances.

Reducing Loss

Weight Initialization

  • For convex problems, weights can start anywhere (say, all 0s)

    • Convex: think of a bowl shape

    • Just one minimum

  • Foreshadowing: not true for neural nets

    • Non-convex: think of an egg crate

    • More than one minimum

    • Strong dependency on initial values

SGD & Mini-Batch Gradient Descent

  • Could compute gradient over entire data set on each step, but this turns out to be unnecessary

  • Computing gradient on small data samples works well

    • On every step, get a new random sample

  • Stochastic Gradient Descent: one example at a time

  • Mini-Batch Gradient Descent: batches of 10 to 1,000 examples

    • Loss & gradients are averaged over the batch (see the sketch after this list)
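
To make the procedure concrete, here is a minimal NumPy sketch of mini-batch gradient descent for the one-feature linear model above. The synthetic data, learning rate, batch size, and step count are arbitrary choices for this sketch, not values from the course; setting batch_size to 1 would give stochastic gradient descent.

  import numpy as np

  rng = np.random.default_rng(0)

  # Synthetic data drawn around the line y = 3 + 0.2*x (illustrative values only).
  x = rng.uniform(0.0, 10.0, size=1000)
  y = 3.0 + 0.2 * x + rng.normal(0.0, 0.2, size=1000)

  b, w1 = 0.0, 0.0          # convex problem, so starting at zero is fine
  learning_rate = 0.01      # arbitrary choice for this sketch
  batch_size = 32

  for step in range(5000):
      # On every step, draw a new random mini-batch.
      idx = rng.integers(0, len(x), size=batch_size)
      xb, yb = x[idx], y[idx]

      # Error of the current model on the batch.
      error = yb - (b + w1 * xb)

      # Gradients of the mean squared loss, averaged over the batch.
      grad_b = -2.0 * np.mean(error)
      grad_w1 = -2.0 * np.mean(error * xb)

      # Move in the direction of the negative gradient.
      b -= learning_rate * grad_b
      w1 -= learning_rate * grad_w1

  print(b, w1)   # should end up close to 3 and 0.2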

Math

Note that TensorFlow handles all the gradient computations for you, so you don't actually have to understand the calculus provided here.

Partial derivatives

A multivariable function is a function with more than one argument, such as:

f(x,y) = e^{2y} \sin(x)

The partial derivative of f with respect to x, denoted as follows:

\frac{∂f}{∂x}

is the derivative of f considered as a function of x alone. To find the following:

\frac{∂f}{∂x}

you must hold y constant (so f is now a function of one variable x), and take the regular derivative of f with respect to x. For example, when y is fixed at 1, the preceding function becomes:

f(x) = e^2 \sin(x)

This is just a function of one variable x, whose derivative is:

e^2 \cos(x)

In general, thinking of y as fixed, the partial derivative of f with respect to x is calculated as follows:

\frac{∂f}{∂x}(x,y) = e^{2y} \cos(x)

Similarly, if we hold x fixed instead, the partial derivative of f with respect to y is:

\frac{∂f}{∂y}(x,y) = 2e^{2y} \sin(x)

Intuitively, a partial derivative tells you how much the function changes when you perturb one variable a bit. In the preceding example:

\frac{∂f}{∂x}(0,1) = e^2 ≈ 7.4

So when you start at (0,1), hold y constant, and move x a little, f changes by about 7.4 times the amount that you changed x.
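
As a quick numerical sanity check of that value, here is a finite-difference approximation in plain Python (the step size h is an arbitrary small number):

  import math

  def f(x, y):
      return math.exp(2 * y) * math.sin(x)

  h = 1e-6
  # Finite-difference approximation of ∂f/∂x at (0, 1): hold y fixed, nudge x.
  approx = (f(0 + h, 1) - f(0, 1)) / h
  print(approx)         # ≈ 7.389, i.e. about e^2
  print(math.exp(2))    # exact value of e^2 for comparison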

In machine learning, partial derivatives are mostly used in conjunction with the gradient of a function.

Gradients

The gradient of a function, denoted as follows, is the vector of partial derivatives with respect to all of the independent variables:

∇f

For instance, if:

f(x,y) = e^{2y} \sin(x)

then:

∇f(x,y) = (\frac{∂f}{∂x}(x,y), \frac{∂f}{∂y}(x,y)) = (e^{2y} \cos(x), 2e^{2y} \sin(x))

Note the following:

  • ∇f points in the direction of greatest increase of the function.

  • −∇f points in the direction of greatest decrease of the function.

The number of dimensions in the vector is equal to the number of variables in the formula for f; in other words, the vector falls within the domain space of the function. For instance, the graph of the following function f(x,y):

f(x,y) = 4 + (x-2)^2 + 2y^2

when viewed in three dimensions with z = f(x,y) looks like a valley with a minimum at (2, 0, 4).

The gradient of f(x,y) is a two-dimensional vector that tells you in which (x,y) direction to move for the maximum increase in height. Thus, the negative of the gradient moves you in the direction of maximum decrease in height. In other words, the negative of the gradient vector points into the valley.

In machine learning, gradients are used in gradient descent. We often have a loss function of many variables that we are trying to minimize, and we try to do this by following the negative of the gradient of the function.
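
Tying these ideas together, here is a small Python sketch that follows the negative gradient of the valley function above; the starting point and learning rate are arbitrary choices:

  def f(x, y):
      return 4 + (x - 2) ** 2 + 2 * y ** 2

  def grad_f(x, y):
      # ∇f(x,y) = (2(x - 2), 4y)
      return (2 * (x - 2), 4 * y)

  x, y = 0.0, 1.0            # arbitrary starting point
  learning_rate = 0.1        # arbitrary step size

  for step in range(100):
      gx, gy = grad_f(x, y)
      # Move against the gradient, i.e. toward the bottom of the valley.
      x -= learning_rate * gx
      y -= learning_rate * gy

  print(x, y, f(x, y))       # approaches the minimum at (2, 0), where f = 4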

TensorFlow API Hierarchy
