Neural Networks and Deep Learning

deeplearning.ai @ coursera

Deep learning is taking off now because of scale: the scale of data, of computation, and of algorithmic improvements.

Binary Classification

In Python: X.shape == (n_x, m) and Y.shape == (1, m), with the m training examples stacked as columns.
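
A minimal sketch (with made-up numbers) of how the training set is laid out, each example stored as one column of X:

import numpy as np

# m = 4 training examples, each a column of n_x = 3 features,
# with one binary label per example
n_x, m = 3, 4
X = np.random.randn(n_x, m)
Y = np.array([[0, 1, 1, 0]])

print(X.shape)  # (3, 4) -> (n_x, m)
print(Y.shape)  # (1, 4) -> (1, m)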

Logistic Regression

Gradient Descent

The negative sign should apply to the entire cost function (both terms in the summation).

$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)},y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left( y^{(i)}\log \hat{y}^{(i)} + (1-y^{(i)})\log(1 - \hat{y}^{(i)})\right)$$
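
As a hedged numpy sketch, assuming A holds the predictions y-hat with shape (1, m) and Y the labels with shape (1, m), this cost can be computed as:

import numpy as np

def logistic_cost(A, Y):
    """Cross-entropy cost; A = predictions y-hat, Y = labels, both shape (1, m)."""
    m = Y.shape[1]
    # the leading minus sign applies to both terms inside the summation
    return -(1 / m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))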

Derivatives

Derivative => slope of the line at a given point.

Computation Graph

Logistic Regression Gradient Descent

When you're implementing deep learning algorithms, you find that having explicit for-loops in your code makes your algorithm run less efficiently. In the deep learning era we move to bigger and bigger datasets, so being able to implement your algorithms without explicit for-loops is really important and will help you scale to much bigger datasets. It turns out that there is a set of techniques, called vectorization, that lets you get rid of these explicit for-loops in your code.

Derivation of $\frac{dL}{dz}$

If you're curious, here is the derivation for $\frac{dL}{dz} = a - y$.

Note that in this part of the course, Andrew refers to $\frac{dL}{dz}$ as $dz$.

By the chain rule: $\frac{dL}{dz} = \frac{dL}{da} \times \frac{da}{dz}$

We'll do the following: 1. solve for $\frac{dL}{da}$, then 2. solve for $\frac{da}{dz}$, and finally 3. multiply the two together.

Step 1: $\frac{dL}{da}$

$$L = -(y \times \log(a) + (1-y) \times \log(1-a))$$

$$\frac{dL}{da} = -y \times \frac{1}{a} - (1-y) \times \frac{1}{1-a} \times -1$$

We're taking the derivative with respect to a.

Remember that there is an additional $-1$ in the last term when we take the derivative of $(1-a)$ with respect to $a$.

$$\frac{dL}{da} = \frac{-y}{a} + \frac{1-y}{1-a}$$

We'll give both terms the same denominator:

$$\frac{dL}{da} = \frac{-y \times (1-a)}{a\times(1-a)} + \frac{a \times (1-y)}{a\times(1-a)}$$

Clean up the terms:

$$\frac{dL}{da} = \frac{-y + ay + a - ay}{a(1-a)}$$

So now we have:

$$\frac{dL}{da} = \frac{a - y}{a(1-a)}$$

Step 2: $\frac{da}{dz}$

$$\frac{da}{dz} = \frac{d}{dz} \sigma(z)$$

The derivative of a sigmoid has the form:

$$\frac{d}{dz}\sigma(z) = \sigma(z) \times (1 - \sigma(z))$$

You can look up why the derivative of a sigmoid has this form: for example, google "derivative of a sigmoid" and you can find the derivation worked out in detail.

Recall that $\sigma(z) = a$, because we defined "a", the activation, as the output of the sigmoid activation function.

So we can substitute into the formula to get:

$$\frac{da}{dz} = a (1 - a)$$


Step 3: $\frac{dL}{dz}$

We'll multiply step 1 and step 2 to get the result.

$$\frac{dL}{dz} = \frac{dL}{da} \times \frac{da}{dz}$$

From step 1: $\frac{dL}{da} = \frac{a - y}{a(1-a)}$

From step 2: $\frac{da}{dz} = a (1 - a)$

$$\frac{dL}{dz} = \frac{a - y}{a(1-a)} \times a (1 - a)$$

Notice that we can cancel factors to get this:

$$\frac{dL}{dz} = a - y$$

In Andrew's notation, he refers to $\frac{dL}{dz}$ as $dz$.

So in the videos:

$$dz = a - y$$
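
A quick numerical sanity check of this result (the function names here are mine): compare a central-difference estimate of dL/dz with a - y at an arbitrary point.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(z, y):
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y, eps = 0.7, 1.0, 1e-6
numerical = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)  # finite-difference estimate
analytical = sigmoid(z) - y                                    # dz = a - y
print(numerical, analytical)  # the two values agree to several decimal places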

Vectorization

import time
import numpy as np

a = np.random.rand(1000000)   # one million random numbers each
b = np.random.rand(1000000)

tic = time.time()
c = np.dot(a, b)              # vectorized dot product
toc = time.time()

print(c)
print("Vectorized Version: " + str(1000*(toc-tic)) + " ms")
# 250286.989866
# ~1.5 ms

tic = time.time()
c = 0
for i in range(1000000):      # explicit for-loop over every element
    c += a[i] * b[i]
toc = time.time()

print(c)
print("for-loop Version: " + str(1000*(toc-tic)) + " ms")
# 250286.989866
# ~475 ms

Vectorizing Logistic Regression

Vectorizing Logistic Regression's Gradient Output
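
Putting both videos together, here is a sketch of one fully vectorized gradient-descent step under the shape conventions above (w is (n_x, 1), b is a scalar); the function name and learning rate are illustrative, not the course's code.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_step(w, b, X, Y, learning_rate=0.01):
    m = X.shape[1]
    Z = np.dot(w.T, X) + b          # forward pass for all m examples at once
    A = sigmoid(Z)
    dZ = A - Y                      # uses dz = a - y from the derivation above
    dw = (1 / m) * np.dot(X, dZ.T)
    db = (1 / m) * np.sum(dZ)
    w = w - learning_rate * dw      # gradient-descent update
    b = b - learning_rate * db
    return w, b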

Broadcasting in Python

import numpy as np

# Calories from carbs, protein, and fat in four foods (the lecture's example matrix)
A = np.array([[56.0,   0.0,  4.4, 68.0],
              [ 1.2, 104.0, 52.0,  8.0],
              [ 1.8, 135.0, 99.0,  0.9]])

print(A)

cal = A.sum(axis=0)                        # sum down the columns -> shape (4,)
percentage = 100 * A / cal.reshape(1, 4)   # (3, 4) / (1, 4): the row broadcasts down all 3 rows
print(percentage)

Note on Python/Numpy Vectors

import numpy as np

# Not recommended
a = np.random.randn(5)   # rank-1 array
print(a)
print(a.shape)           # (5,)
print(a.T)               # looks identical to a: transposing a rank-1 array does nothing
print(np.dot(a, a.T))    # a single number (inner product), not an outer product

# Recommended
a = np.random.randn(5, 1)   # column vector, shape (5, 1) -- note: not randn((5, 1))
print(a)
print(a.T)                  # row vector, shape (1, 5)

# If you have a rank 1 array, reshape it into an explicit column vector:
a = a.reshape((5, 1))

Logistic regression cost function

What is a Neural Network?

Neural Network Representation

Computing a Neural Network's Output

Vectorizing across multiple examples

Explanation for Vectorized Implementation
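
A sketch of the vectorized forward pass for a network with one tanh hidden layer and a sigmoid output unit, processing all m examples at once (the function name is mine):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    Z1 = np.dot(W1, X) + b1     # (n_h, m); b1 is (n_h, 1) and broadcasts across columns
    A1 = np.tanh(Z1)            # hidden-layer activations
    Z2 = np.dot(W2, A1) + b2    # (1, m)
    A2 = sigmoid(Z2)            # output-layer predictions y-hat
    return Z1, A1, Z2, A2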

Activation Functions

The tanh function is almost always strictly superior (to sigmoid for hidden units). The one exception is the output layer: if y is either 0 or 1, then it makes sense for y hat to be a number between 0 and 1 rather than between -1 and 1. So the one place where I would use the sigmoid activation function is when you are doing binary classification, in which case you might use sigmoid for the output layer.
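
For reference, a sketch of the activation functions covered in this lecture (the 0.01 slope for leaky ReLU is just a common choice):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))      # output in (0, 1)

def tanh(z):
    return np.tanh(z)                # output in (-1, 1), zero-centred

def relu(z):
    return np.maximum(0, z)          # 0 for z < 0, z otherwise

def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha * z, z)  # small slope instead of 0 for z < 0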

Why do you need non-linear activation functions?

There is just one place where you might use a linear activation function, g(z) = z: when you are doing machine learning on a regression problem, so y is a real number. For example, if you're trying to predict housing prices, y is not 0 or 1 but a real number, anywhere from $0 up to however expensive houses get.
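
A small numerical illustration (the matrices are made up) of why hidden layers need a non-linearity: with the identity activation, two stacked linear layers collapse into a single linear function.

import numpy as np

# W2 @ (W1 @ x + b1) + b2 == (W2 @ W1) @ x + (W2 @ b1 + b2)
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))

two_linear_layers = W2 @ (W1 @ x + b1) + b2
one_linear_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_linear_layers, one_linear_layer))  # True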

Derivatives of activation functions
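
A sketch of the derivatives used in backpropagation, written in terms of the activation value a where convenient:

import numpy as np

def sigmoid_derivative(z):
    a = 1 / (1 + np.exp(-z))
    return a * (1 - a)            # g'(z) = a(1 - a)

def tanh_derivative(z):
    a = np.tanh(z)
    return 1 - a ** 2             # g'(z) = 1 - a^2

def relu_derivative(z):
    return (z > 0).astype(float)  # 0 for z < 0, 1 for z > 0 (pick 0 or 1 at z = 0)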

Gradient Descent for Neural Network
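
A sketch of the vectorized backward pass and update for the one-hidden-layer network above (tanh hidden layer, sigmoid output); the function name and learning rate are illustrative.

import numpy as np

def backward_and_update(X, Y, A1, A2, W1, b1, W2, b2, learning_rate=0.01):
    m = X.shape[1]
    dZ2 = A2 - Y                                  # same dz = a - y result as for logistic regression
    dW2 = (1 / m) * np.dot(dZ2, A1.T)
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)       # tanh'(Z1) = 1 - A1^2
    dW1 = (1 / m) * np.dot(dZ1, X.T)
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)
    W1 = W1 - learning_rate * dW1
    b1 = b1 - learning_rate * db1
    W2 = W2 - learning_rate * dW2
    b2 = b2 - learning_rate * db2
    return W1, b1, W2, b2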

Back-propagation intuition

Random Initialization
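
A sketch of the initialization described in the lecture: small random weights to break symmetry between hidden units, zero biases (0.01 is the scale Andrew uses for a shallow network).

import numpy as np

def initialize_parameters(n_x, n_h, n_y=1):
    W1 = np.random.randn(n_h, n_x) * 0.01   # small values keep tanh/sigmoid away from the flat regions
    b1 = np.zeros((n_h, 1))                 # biases can safely start at zero
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
    return W1, b1, W2, b2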
