Google ML Crash Course

https://developers.google.com/machine-learning/crash-course/

Linear Regression

True, the line doesn't pass through every dot, but the line does clearly show the relationship between chirps and temperature. Using the equation for a line, you could write down this relationship as follows:

  y = mx + b

where:

  • y is the temperature in Celsius (the value we're trying to predict)
  • m is the slope of the line
  • x is the number of chirps per minute (the value of our input feature)
  • b is the y-intercept

By convention in machine learning, you'll write the equation for a model slightly differently:

  y' = b + w1x1

where:

  • y' is the predicted label (a desired output)
  • b is the bias (the y-intercept), sometimes referred to as w0
  • w1 is the weight of feature 1 (weight is the same concept as the slope m in the traditional equation of a line)
  • x1 is a feature (a known input)

Although this model uses only one feature, a more sophisticated model might rely on multiple features, each having a separate weight (w1, w2, etc.). For example, a model that relies on three features might look as follows:

  y' = b + w1x1 + w2x2 + w3x3
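
For instance, here is a minimal sketch in plain Python (the bias, weights, and feature values are made up for illustration) of how such a model turns three features into a prediction:

  # Three-feature linear model: y' = b + w1*x1 + w2*x2 + w3*x3.
  # The bias, weights, and feature values below are invented for illustration.
  def predict(features, weights, bias):
      return bias + sum(w * x for w, x in zip(weights, features))

  bias = 0.5                      # b
  weights = [0.1, -0.3, 2.0]      # w1, w2, w3
  features = [4.0, 1.5, 0.2]      # x1, x2, x3
  print(predict(features, weights, bias))   # 0.5 + 0.4 - 0.45 + 0.4 = 0.85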

The linear regression models we'll examine here use a loss function called squared loss (also known as L2 loss). The squared loss for a single example is as follows:

  = the square of the difference between the label and the prediction
  = (observation - prediction(x))^2
  = (y - y')^2

Mean square error (MSE) is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples:

  MSE = (1/N) * sum over (x, y) in D of (y - prediction(x))^2

where:

  • (x, y) is an example in which x is the set of features the model uses to make predictions and y is the example's label
  • prediction(x) is a function of the weights and bias in combination with the set of features x
  • D is a dataset containing many labeled examples ((x, y) pairs)
  • N is the number of examples in D

Although MSE is commonly used in machine learning, it is neither the only practical loss function nor the best loss function for all circumstances.
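
For concreteness, a tiny sketch (labels and predictions invented for illustration) that computes the squared loss for each example and the MSE over the set:

  # Squared loss per example and mean squared error over a small dataset.
  # The labels and predictions below are made up for illustration.
  labels      = [3.0, -1.0, 2.5, 0.0]   # y
  predictions = [2.5, -0.5, 2.5, 1.0]   # y'

  squared_losses = [(y - y_pred) ** 2 for y, y_pred in zip(labels, predictions)]
  mse = sum(squared_losses) / len(squared_losses)
  print(squared_losses)   # [0.25, 0.25, 0.0, 1.0]
  print(mse)              # 0.375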

Reducing Loss

Weight Initialization

  • For convex problems, weights can start anywhere (say, all 0s); see the sketch after this list

    • Convex: think of a bowl shape

    • Just one minimum

  • Foreshadowing: not true for neural nets

    • Non-convex: think of an egg crate

    • More than one minimum

    • Strong dependency on initial values
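
Below is a minimal sketch of the convex point above (a toy one-weight quadratic loss; the function, learning rate, and starting points are my own choices): gradient descent lands on the same minimum no matter where the weight starts.

  # Toy convex loss (w - 5)^2 with a single minimum at w = 5.
  def loss_gradient(w):
      return 2 * (w - 5)             # d/dw of (w - 5)^2

  def gradient_descent(w, learning_rate=0.1, steps=200):
      for _ in range(steps):
          w -= learning_rate * loss_gradient(w)
      return w

  # Different starting weights all converge to the same minimum (~5.0).
  for start in (0.0, -50.0, 1000.0):
      print(start, "->", round(gradient_descent(start), 4))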

SGD & Mini-Batch Gradient Descent

  • Could compute gradient over entire data set on each step, but this turns out to be unnecessary

  • Computing gradient on small data samples works well (see the sketch after this list)

    • On every step, get a new random sample

  • Stochastic Gradient Descent: one example at a time

  • Mini-Batch Gradient Descent: batches of 10-1000

    • Loss & gradients are averaged over the batch
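
Below is a small sketch of why this works (synthetic one-feature data, invented for illustration): the gradient averaged over a random mini-batch is a noisy but cheap estimate of the gradient averaged over the full dataset.

  import random

  # Synthetic data for a one-feature linear model y' = b + w*x (values are arbitrary).
  random.seed(0)
  data = [(x, 2.0 * x + 1.0 + random.gauss(0.0, 0.5))
          for x in (random.uniform(0.0, 10.0) for _ in range(1000))]

  def averaged_gradient(batch, w, b):
      # Gradient of the squared loss with respect to (w, b), averaged over the batch.
      n = len(batch)
      dw = sum(-2 * x * (y - (b + w * x)) for x, y in batch) / n
      db = sum(-2 * (y - (b + w * x)) for x, y in batch) / n
      return dw, db

  w, b = 0.0, 0.0
  print(averaged_gradient(data, w, b))                       # gradient over the entire dataset
  print(averaged_gradient(random.sample(data, 100), w, b))   # fresh random sample: noisy but close, far cheaper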

Math

Note that TensorFlow handles all the gradient computations for you, so you don't actually have to understand the calculus provided here.
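
For instance, in TensorFlow 2 (a minimal sketch assuming a TF 2.x install; the toy loss is my own choice) you can ask the framework for a gradient directly:

  import tensorflow as tf

  w = tf.Variable(3.0)
  with tf.GradientTape() as tape:
      loss = (w - 5.0) ** 2          # toy loss with its minimum at w = 5
  grad = tape.gradient(loss, w)      # TensorFlow does the calculus
  print(grad.numpy())                # -4.0, i.e. 2 * (3 - 5)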

Partial derivatives

A multivariable function is a function with more than one argument, such as:

  f(x, y) = x^2 * y

The partial derivative of f with respect to x, denoted ∂f/∂x, is the derivative of f considered as a function of x alone, holding y constant:

  ∂f/∂x = 2xy

Intuitively, a partial derivative tells you how much the function changes when you perturb one variable a bit. In the preceding example, ∂f/∂x at the point (1, 3) equals 6, so nudging x a little while holding y fixed changes f by about 6 times the size of the nudge.

In machine learning, partial derivatives are mostly used in conjunction with the gradient of a function.
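
As a numerical sanity check of that intuition, here is a minimal sketch in plain Python (the function and the point (1, 3) are arbitrary choices) that nudges x while holding y fixed:

  # Approximate the partial derivative of f(x, y) = x^2 * y with respect to x
  # by perturbing x a little and holding y constant.
  def f(x, y):
      return x ** 2 * y

  h = 1e-6
  x, y = 1.0, 3.0
  df_dx = (f(x + h, y) - f(x, y)) / h
  print(df_dx)   # close to the exact value 2*x*y = 6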

Gradients

The gradient of a function f, denoted ∇f, is the vector of partial derivatives of f with respect to all of its independent variables.

For instance, if:

  f(x, y) = x^2 * y

then:

  ∇f(x, y) = (∂f/∂x, ∂f/∂y) = (2xy, x^2)

Note the following:

  • ∇f points in the direction of greatest increase of the function
  • -∇f points in the direction of greatest decrease of the function
  • ∇f has as many components as f has variables

In machine learning, gradients are used in gradient descent. We often have a loss function of many variables that we are trying to minimize, and we try to do this by following the negative of the gradient of the function.
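
To tie this back to linear regression, here is a minimal sketch (toy data lying on the line y = 2x + 1, invented for illustration) that minimizes MSE, a loss function of the two variables w and b, by repeatedly stepping along the negative gradient:

  # Gradient descent on MSE(w, b) for a one-feature linear model y' = b + w*x.
  xs = [1.0, 2.0, 3.0, 4.0]
  ys = [3.0, 5.0, 7.0, 9.0]          # toy data on the line y = 2x + 1

  def mse_gradient(w, b):
      # Partial derivatives of the mean squared error with respect to w and b.
      n = len(xs)
      dw = sum(-2 * x * (y - (b + w * x)) for x, y in zip(xs, ys)) / n
      db = sum(-2 * (y - (b + w * x)) for x, y in zip(xs, ys)) / n
      return dw, db

  w, b = 0.0, 0.0                    # initial values
  learning_rate = 0.05
  for _ in range(500):
      dw, db = mse_gradient(w, b)
      w, b = w - learning_rate * dw, b - learning_rate * db   # step against the gradient
  print(round(w, 3), round(b, 3))    # approaches w = 2, b = 1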

TensorFlow API Hierarchy
