Parameters

Suppose we have an activation $a = f(W \cdot X + b)$ with:

  - $X$ the input batch of shape $(n \times M)$, one example per column,
  - $z = W \cdot X + b$ and $a$ of shape $(m \times M)$,

where $n$ and $m$ are the input dimension and hidden layer dimension, and $M$ is the batch size. The parameter matrices are:

  - $W$ of shape $(m \times n)$,
  - $b$ of shape $(m \times 1)$, broadcast over the batch.

The loss function for the batch is $J(a) \in \mathbb{R}$, normally the sum of the losses over each example in the batch.
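As a quick illustration of these shapes, here is a minimal forward-pass sketch in NumPy; the concrete dimensions and the choice of a sigmoid for $f$ are arbitrary examples, and the cache tuple simply collects the values reused by the backward pass below.

import numpy as np

def sigmoid(x):
  return 1.0 / (1.0 + np.exp(-x))

n, m, M = 4, 3, 5                    # input dim, hidden dim, batch size (example values)
X = np.random.randn(n, M)            # one example per column
W = np.random.randn(m, n)
b = np.zeros((m, 1))                 # broadcast over the batch
z = np.dot(W, X) + b                 # shape (m, M)
a = sigmoid(z)                       # shape (m, M)
cache = (X, W, b, z, a)              # saved for the backward pass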

For backpropagation, suppose we are given $da = \frac{\partial{J}}{\partial{a}}$; we want to compute $dX = \frac{\partial{J}}{\partial{X}}$, $db = \frac{\partial{J}}{\partial{b}}$ and $dW = \frac{\partial{J}}{\partial{W}}$. We define $dz = \frac{\partial{J}}{\partial{z}}$ of shape $(m \times M)$, and $df(z)$ the pointwise derivative of $f$ at $z$, which has the same shape.

Formula

  1. For $a = f(z)$ a pointwise function ($a$ and $z$ of shape $m \times M$), the derivative is also taken pointwise: $dz = da \odot df(z)$, where $\odot$ denotes the elementwise product.
  2. For $z = W \cdot X + b$ we have: $dX = W^T \cdot dz$, $dW = dz \cdot X^T$, and $db = \sum_{i=1}^{M} dz_{:,i}$, where the sum for $db$ runs over the batch dimension (the columns of $dz$).
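One way to see where the second set of formulas comes from is to apply the chain rule entry-wise, using $z_{ik} = \sum_j W_{ij} X_{jk} + b_i$:

$$
dW_{ij} = \frac{\partial J}{\partial W_{ij}} = \sum_{k=1}^{M} \frac{\partial J}{\partial z_{ik}}\,\frac{\partial z_{ik}}{\partial W_{ij}} = \sum_{k=1}^{M} dz_{ik}\, X_{jk} = (dz \cdot X^T)_{ij},
$$

$$
dX_{jk} = \sum_{i=1}^{m} dz_{ik}\, W_{ij} = (W^T \cdot dz)_{jk},
\qquad
db_i = \sum_{k=1}^{M} dz_{ik}.
$$

The sum over $k$ appears because every example in the batch contributes to $J$ through the same parameters $W$ and $b$, while each column of $X$ only affects its own column of $z$.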

NumPy

def backprop(da, cache):
  # values computed during the forward pass
  X, W, b, z, a = cache

  # pointwise derivative of f at z
  # example f = sigmoid ==> dfz = sigmoid(z)*(1-sigmoid(z)) = a*(1-a)
  # example f = tanh    ==> dfz = 1 - tanh(z)**2 = 1 - a**2
  dfz = a * (1 - a)                        # sigmoid assumed here; swap in the formula for your f

  # backprop
  dz = da * dfz                            # dz = da * df(z) elementwise, shape (m, M)
  dX = np.dot(W.T, dz)                     # shape (n, M)
  dW = np.dot(dz, X.T)                     # shape (m, n)
  db = np.sum(dz, axis=1, keepdims=True)   # shape (m, 1)
  return dX, dW, db
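As a usage example, a finite-difference check can confirm the formulas. The sketch below reuses the sigmoid forward pass and the backprop function from above, with a stand-in loss $J = \frac{1}{2}\sum a^2$ (so $da = a$) and arbitrary dimensions.

def forward(X, W, b):
  z = np.dot(W, X) + b
  a = sigmoid(z)
  return a, (X, W, b, z, a)

np.random.seed(0)
n, m, M = 4, 3, 5
X, W, b = np.random.randn(n, M), np.random.randn(m, n), np.random.randn(m, 1)

a, cache = forward(X, W, b)
da = a                                   # dJ/da for J = 0.5*sum(a**2)
dX, dW, db = backprop(da, cache)

# central finite differences on each entry of W
eps = 1e-6
dW_num = np.zeros_like(W)
for i in range(m):
  for j in range(n):
    Wp, Wm = W.copy(), W.copy()
    Wp[i, j] += eps
    Wm[i, j] -= eps
    Jp = 0.5 * np.sum(forward(X, Wp, b)[0] ** 2)
    Jm = 0.5 * np.sum(forward(X, Wm, b)[0] ** 2)
    dW_num[i, j] = (Jp - Jm) / (2 * eps)

print(np.max(np.abs(dW - dW_num)))       # should be tiny, on the order of 1e-9 or below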