Back Propagation
Parameters
Suppose we have an activation $a = f(W X + b)$ with $X$ the input of shape $(n \times M)$, where $n$ and $m$ are the input dimension and the hidden layer dimension, and $M$ is the batch size. The parameter matrices are $W$ of shape $(m \times n)$ and $b$ of shape $(m \times 1)$, so that $z = W X + b$ and $a$ have shape $(m \times M)$.
The loss function for the batch is $J(a) \in \mathbb{R}$ and is typically the sum of the per-example losses over the batch.
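To make these shapes concrete, here is a minimal NumPy sketch of the forward pass; the sigmoid activation and the dimensions $n = 4$, $m = 3$, $M = 5$ are arbitrary choices for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, m, M = 4, 3, 5                 # input dim, hidden dim, batch size (illustrative)
rng = np.random.default_rng(0)
X = rng.standard_normal((n, M))   # input batch, one column per example
W = rng.standard_normal((m, n))   # weight matrix
b = np.zeros((m, 1))              # bias, broadcast over the batch

z = np.dot(W, X) + b              # pre-activation, shape (m, M)
a = sigmoid(z)                    # activation, shape (m, M)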
For back propagation, suppose we are given $da = \frac{\partial{J}}{\partial{a}}$; we want to compute $dX = \frac{\partial{J}}{\partial{X}}$, $dW = \frac{\partial{J}}{\partial{W}}$, and $db = \frac{\partial{J}}{\partial{b}}$. We define $dz = \frac{\partial{J}}{\partial{z}}$, of shape $(m \times M)$, and $df(z)$ the pointwise derivative of $f$ at $z$, which has the same shape.
Formula
- For $a = f(z)$ a pointwise function ($a$ and $z$ of shape $m \times M$), since the derivative is taken pointwise we have $dz = da \odot df(z)$, where $\odot$ is the elementwise product.
- For $z = W X + b$ we have $dX = W^T dz$, $dW = dz\, X^T$, and $db = \sum_{k=1}^{M} dz_{:,k}$ (sum over the batch dimension); an index-notation check follows the list.
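These matrix forms can be checked in index notation; writing $z_{ik} = \sum_j W_{ij} X_{jk} + b_i$ (with $i$ over hidden units, $j$ over inputs, $k$ over the batch) gives

$$
\begin{aligned}
dW_{ij} &= \sum_{k=1}^{M} \frac{\partial J}{\partial z_{ik}}\,\frac{\partial z_{ik}}{\partial W_{ij}} = \sum_{k=1}^{M} dz_{ik}\, X_{jk} = (dz\, X^T)_{ij},\\
dX_{jk} &= \sum_{i=1}^{m} dz_{ik}\, W_{ij} = (W^T dz)_{jk},\\
db_{i}  &= \sum_{k=1}^{M} dz_{ik} \quad\text{(each column of $z$ receives the same $b$)}.
\end{aligned}
$$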
Numpy
import numpy as np

def backprop(da, cache):
    # values computed during the forward pass
    X, W, z, a = cache
    # pointwise derivative of f at z; for example:
    #   f = sigmoid ==> dfz = sigmoid(z)*(1-sigmoid(z)) = a*(1-a)
    #   f = tanh    ==> dfz = 1 - tanh(z)**2            = 1 - a**2
    dfz = a * (1 - a)                       # sigmoid case, per the example above
    # backprop
    dz = da * dfz                           # shape (m, M)
    dX = np.dot(W.T, dz)                    # shape (n, M)
    dW = np.dot(dz, X.T)                    # shape (m, n)
    db = np.sum(dz, axis=1, keepdims=True)  # shape (m, 1)
    return dX, dW, db
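As a sanity check, the sketch below reuses the backprop above (sigmoid case) with a hypothetical forward helper, the illustrative loss $J = \frac{1}{2}\sum a^2$ (so $da = a$), and arbitrary dimensions, and compares one entry of $dW$ against a central-difference estimate.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W, b):
    z = np.dot(W, X) + b
    a = sigmoid(z)
    return a, (X, W, z, a)                 # cache layout expected by backprop above

# Toy dimensions: n = 4 inputs, m = 3 hidden units, M = 5 examples.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5))
W = rng.standard_normal((3, 4))
b = rng.standard_normal((3, 1))

# Illustrative loss J = 0.5 * sum(a**2), so da = dJ/da = a.
a, cache = forward(X, W, b)
dX, dW, db = backprop(a, cache)

# Central-difference check of one weight gradient.
eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
Jp = 0.5 * np.sum(forward(X, Wp, b)[0] ** 2)
Jm = 0.5 * np.sum(forward(X, Wm, b)[0] ** 2)
print(dW[0, 0], (Jp - Jm) / (2 * eps))     # the two values should match closely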