Google ML Crash Course

https://developers.google.com/machine-learning/crash-course/

Linear Regression

True, the line doesn't pass through every dot, but the line does clearly show the relationship between chirps and temperature. Using the equation for a line, you could write down this relationship as follows:

y = mx + b

where:

  • y is the temperature in Celsius, the value we're trying to predict.

  • m is the slope of the line.

  • x is the number of chirps per minute, the value of our input feature.

  • b is the y-intercept.

By convention in machine learning, you'll write the equation for a model slightly differently:

y' = b + w_1x_1

where:

  • y′ is the predicted label (a desired output).

  • b is the bias (the y-intercept), sometimes referred to as w_0.

  • w_1 is the weight of feature 1. Weight is the same concept as the "slope" m in the traditional equation of a line.

  • x_1 is a feature (a known input).

To infer (predict) the temperature y′ for a new chirps-per-minute value x_1, just substitute the x_1 value into this model.
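
As a concrete illustration, here is a minimal Python sketch of that substitution; the bias and weight values are made up for the example, not fitted to real data:

  # Hypothetical learned parameters (illustrative values only, not fitted to data).
  b = 3.0     # bias (y-intercept)
  w1 = 0.2    # weight for feature 1 (chirps per minute)

  def predict(x1):
      """Return the predicted temperature y' for a chirps-per-minute value x1."""
      return b + w1 * x1

  print(predict(80))   # 3.0 + 0.2 * 80 = 19.0 degrees Celsius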

Although this model uses only one feature, a more sophisticated model might rely on multiple features, each having a separate weight (w_1, w_2, etc.). For example, a model that relies on three features might look as follows:

y' = b + w_1x_1 + w_2x_2 + w_3x_3
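
With multiple features, the substitution becomes a dot product between the weight vector and the feature vector. A minimal NumPy sketch, again with made-up weights:

  import numpy as np

  # Hypothetical bias and per-feature weights (illustrative values only).
  b = 3.0
  w = np.array([0.2, -0.05, 0.01])    # w_1, w_2, w_3

  def predict(x):
      """y' = b + w_1*x_1 + w_2*x_2 + w_3*x_3 for a feature vector x."""
      return b + np.dot(w, x)

  print(predict(np.array([80.0, 10.0, 5.0])))   # ≈ 18.55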

The linear regression models we'll examine here use a loss function called squared loss (also known as L2 loss). The squared loss for a single example is as follows:

  = the square of the difference between the label and the prediction
  = (observation - prediction(x))^2
  = (y - y')^2
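
As a small Python helper (with made-up numbers), the squared loss for a single example is simply:

  def squared_loss(y, y_prime):
      """L2 loss for one example: the square of (label - prediction)."""
      return (y - y_prime) ** 2

  print(squared_loss(20.0, 19.0))   # (20 - 19)^2 = 1.0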

Mean square error (MSE) is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples:

MSE = \frac{1}{N} \sum_{(x,y) \in D} (y - prediction(x))^2

where:

  • (x, y) is an example in which

    • x is the set of features (for example, chirps/minute, age, gender) that the model uses to make predictions.

    • y is the example's label (for example, temperature).

  • prediction(x) is a function of the weights and bias in combination with the set of features x.

  • D is a data set containing many labeled examples, which are (x, y) pairs.

  • N is the number of examples in D.
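
Putting these pieces together, here is a minimal NumPy sketch that computes MSE for a toy dataset; the data and parameters are made-up values for illustration:

  import numpy as np

  # Toy labeled dataset D: features x (chirps per minute) and labels y (temperature).
  x = np.array([60.0, 70.0, 80.0, 90.0])     # illustrative values only
  y = np.array([15.5, 17.0, 18.5, 21.0])

  # Hypothetical model parameters.
  b, w1 = 3.0, 0.2

  def mse(x, y, b, w1):
      """Average squared loss over the N examples: (1/N) * sum((y - prediction(x))^2)."""
      predictions = b + w1 * x
      return np.mean((y - predictions) ** 2)

  print(mse(x, y, b, w1))   # -> 0.125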

Although MSE is commonly used in machine learning, it is neither the only practical loss function nor the best loss function for all circumstances.

Reducing Loss

Weight Initialization

  • For convex problems, weights can start anywhere (say, all 0s)

    • Convex: think of a bowl shape

    • Just one minimum

  • Foreshadowing: not true for neural nets

    • Non-convex: think of an egg crate

    • More than one minimum

    • Strong dependency on initial values

SGD & Mini-Batch Gradient Descent

  • Could compute gradient over entire data set on each step, but this turns out to be unnecessary

  • Computing gradient on small data samples works well

    • On every step, get a new random sample

  • Stochastic Gradient Descent: one example at a time

  • Mini-Batch Gradient Descent: batches of 10 to 1,000 examples

    • Loss & gradients are averaged over the batch (see the sketch after this list)
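
To make the procedure concrete, here is a minimal NumPy sketch of mini-batch gradient descent for the one-feature linear model above. The synthetic data, learning rate, batch size, and step count are arbitrary choices for this sketch, not values from the course; setting batch_size to 1 would give stochastic gradient descent.

  import numpy as np

  rng = np.random.default_rng(0)

  # Synthetic data drawn around the line y = 3 + 0.2*x (illustrative values only).
  x = rng.uniform(0.0, 10.0, size=1000)
  y = 3.0 + 0.2 * x + rng.normal(0.0, 0.2, size=1000)

  b, w1 = 0.0, 0.0          # convex problem, so starting at zero is fine
  learning_rate = 0.01      # arbitrary choice for this sketch
  batch_size = 32

  for step in range(5000):
      # On every step, draw a new random mini-batch.
      idx = rng.integers(0, len(x), size=batch_size)
      xb, yb = x[idx], y[idx]

      # Error of the current model on the batch.
      error = yb - (b + w1 * xb)

      # Gradients of the mean squared loss, averaged over the batch.
      grad_b = -2.0 * np.mean(error)
      grad_w1 = -2.0 * np.mean(error * xb)

      # Move in the direction of the negative gradient.
      b -= learning_rate * grad_b
      w1 -= learning_rate * grad_w1

  print(b, w1)   # should end up close to 3 and 0.2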

Math

Note that TensorFlow handles all the gradient computations for you, so you don't actually have to understand the calculus provided here.

Partial derivatives

A multivariable function is a function with more than one argument, such as:

f(x,y) = e^{2y} \sin(x)

The partial derivative of f with respect to x, denoted as follows:

\frac{∂f}{∂x}

is the derivative of f considered as a function of x alone. To find the following:

\frac{∂f}{∂x}

you must hold y constant (so f is now a function of one variable x), and take the regular derivative of f with respect to x. For example, when y is fixed at 1, the preceding function becomes:

f(x) = e^2 \sin(x)

This is just a function of one variable x, whose derivative is:

e^2 \cos(x)

In general, thinking of y as fixed, the partial derivative of f with respect to x is calculated as follows:

\frac{∂f}{∂x}(x,y) = e^{2y} \cos(x)

Similarly, if we hold x fixed instead, the partial derivative of f with respect to y is:

\frac{∂f}{∂y}(x,y) = 2e^{2y} \sin(x)

Intuitively, a partial derivative tells you how much the function changes when you perturb one variable a bit. In the preceding example:

\frac{∂f}{∂x}(0,1) = e^2 ≈ 7.4

So when you start at (0,1), hold y constant, and move x a little, f changes by about 7.4 times the amount that you changed x.
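
As a quick numerical sanity check of that value, here is a finite-difference approximation in plain Python (the step size h is an arbitrary small number):

  import math

  def f(x, y):
      return math.exp(2 * y) * math.sin(x)

  h = 1e-6
  # Finite-difference approximation of ∂f/∂x at (0, 1): hold y fixed, nudge x.
  approx = (f(0 + h, 1) - f(0, 1)) / h
  print(approx)         # ≈ 7.389, i.e. about e^2
  print(math.exp(2))    # exact value of e^2 for comparison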

In machine learning, partial derivatives are mostly used in conjunction with the gradient of a function.

Gradients

The gradient of a function, denoted as follows, is the vector of partial derivatives with respect to all of the independent variables:

∇f

For instance, if:

f(x,y) = e^{2y} \sin(x)

then:

∇f(x,y) = (\frac{∂f}{∂x}(x,y), \frac{∂f}{∂y}(x,y)) = (e^{2y} \cos(x), 2e^{2y} \sin(x))

Note the following:

  • ∇f points in the direction of greatest increase of the function.

  • −∇f points in the direction of greatest decrease of the function.

The number of dimensions in the vector is equal to the number of variables in the formula for f; in other words, the vector falls within the domain space of the function. For instance, the graph of the following function f(x,y):

f(x,y) = 4 + (x-2)^2 + 2y^2

when viewed in three dimensions with z = f(x,y) looks like a valley with a minimum at (2, 0, 4).

The gradient of f(x,y) is a two-dimensional vector that tells you in which (x,y) direction to move for the maximum increase in height. Thus, the negative of the gradient moves you in the direction of maximum decrease in height. In other words, the negative of the gradient vector points into the valley.

In machine learning, gradients are used in gradient descent. We often have a loss function of many variables that we are trying to minimize, and we try to do this by following the negative of the gradient of the function.
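
Tying these ideas together, here is a small Python sketch that follows the negative gradient of the valley function above; the starting point and learning rate are arbitrary choices:

  def f(x, y):
      return 4 + (x - 2) ** 2 + 2 * y ** 2

  def grad_f(x, y):
      # ∇f(x,y) = (2(x - 2), 4y)
      return (2 * (x - 2), 4 * y)

  x, y = 0.0, 1.0            # arbitrary starting point
  learning_rate = 0.1        # arbitrary step size

  for step in range(100):
      gx, gy = grad_f(x, y)
      # Move against the gradient, i.e. toward the bottom of the valley.
      x -= learning_rate * gx
      y -= learning_rate * gy

  print(x, y, f(x, y))       # approaches the minimum at (2, 0), where f = 4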

TensorFlow API Hierarchy
