Deep Neural Network
TODO: from drawio
Each layer can be described mathematically as below; the process of computing through the layers up to the final layer is what we call forward propagation:
\begin{align}
\label{eq:3}
z^{(l)} &= W^{(l)}a^{(l-1)} + b^{(l)} && (\text{if } l > 1) \\
a^{(l)} &= \begin{cases}
\sigma(z^{(l)}) & \text{if } l > 1 \\
x & \text{if } l = 1
\end{cases}
\end{align}
Each layer is composed of a linear part \(z^{(l)}\) and an activation part \(a^{(l)}\). If we use the sigmoid as the activation function: \(\sigma(z^{(l)}) = \frac{1}{1 + e^{-z^{(l)}}} \).
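As a minimal illustration of this per-layer computation, here is a NumPy sketch (the function and variable names are my own choices for the example, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z}), applied elementwise
    return 1.0 / (1.0 + np.exp(-z))

def forward_layer(W, b, a_prev):
    # linear part: z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}
    z = W @ a_prev + b
    # activation part: a^{(l)} = sigma(z^{(l)})
    return z, sigmoid(z)
```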
For example, the computation of layer 2 can be described as:
\begin{align}
\label{eq:4}
z^{(2)} &= W^{(2)}x + b^{(2)} \\
&= \begin{bmatrix}
w_{1,1}^{(2)} & w_{1,2}^{(2)} \\
w_{2,1}^{(2)} & w_{2,2}^{(2)}
\end{bmatrix}
\begin{bmatrix}
x_1 \\
x_2
\end{bmatrix}
+
\begin{bmatrix}
b_1^{(2)} \\
b_2^{(2)}
\end{bmatrix} \\
\begin{bmatrix}
z_{1}^{(2)} \\
z_{2}^{(2)}
\end{bmatrix}
&= \begin{bmatrix}
w_{1,1}^{(2)} x_1 + w_{1,2}^{(2)} x_2 + b_1^{(2)} \\
w_{2,1}^{(2)} x_1 + w_{2,2}^{(2)} x_2 + b_2^{(2)}
\end{bmatrix} \\
a^{(2)} &= \sigma(z^{(2)}) \\
&= \begin{bmatrix}
\sigma(z_{1}^{(2)}) \\
\sigma(z_{2}^{(2)}) \\
\end{bmatrix} \\
&= \begin{bmatrix}
\frac{1}{1 + e^{-z_{1}^{(2)}}} \\
\frac{1}{1 + e^{-z_{2}^{(2)}}} \\
\end{bmatrix}
\end{align}
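To check the layer-2 computation numerically, here is a quick sketch with made-up weight and bias values (the numbers are purely illustrative):

```python
import numpy as np

W2 = np.array([[0.1, 0.2],
               [0.3, 0.4]])       # hypothetical w^{(2)}_{i,j}
b2 = np.array([0.5, -0.5])        # hypothetical b^{(2)}
x  = np.array([1.0, 2.0])         # input, a^{(1)} = x

z2 = W2 @ x + b2                  # [0.1*1 + 0.2*2 + 0.5, 0.3*1 + 0.4*2 - 0.5] = [1.0, 0.6]
a2 = 1.0 / (1.0 + np.exp(-z2))    # sigma applied elementwise
print(z2, a2)
```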
loss (cost) function
Choose a loss (cost) function, for example MSE (mean squared error):
\begin{align}
\label{eq:5}
C &= \frac{1}{2} \| a^{(L)} - y \|_{2}^2
\end{align}
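In code, this per-example cost is just (a minimal sketch, assuming `a_L` and `y` are NumPy vectors):

```python
import numpy as np

def mse_cost(a_L, y):
    # C = 1/2 * || a^{(L)} - y ||_2^2
    return 0.5 * np.sum((a_L - y) ** 2)
```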
gradient descent
We use gradient descent to find the optimal parameters \(\theta\) (the weights \(w\) and biases \(b\) are both part of \(\theta\)) so that the loss is minimized.
\[
\theta^{*} = \arg \min_{\theta} C(\theta)
\]
To make \(C\) decrease, note that for a small parameter change \(\Delta \theta\), the change in the loss in one iteration is approximately:
\[
\Delta C \approx \nabla C \cdot \Delta \theta
\]
If we choose \(\Delta \theta = -\eta \nabla C\), i.e. update \(\theta\) in the opposite direction of \(\nabla C\) (the gradient of \(C\), the direction in which \(C\) increases fastest), scaled by a learning rate \(\eta\), then \(\Delta C \approx -\eta \|\nabla C\|^{2} \le 0\), so the loss decreases:
\begin{align}
\label{eq:10}
\theta &\leftarrow \theta + \Delta \theta \\
&= \theta - \eta \nabla C
\end{align}
Computing \(\nabla C\) is all we need, and we use backward propagation to compute it.
Forward Propagation
Backward Propagation
We need the gradient of the cost with respect to all our parameters: \(\frac{\partial C}{\partial W^{(l)}}\) and \(\frac{\partial C}{\partial b^{(l)}}\). Define the error of layer \(l\) as \(\delta^{(l)}\):
\begin{align}
\label{eq:8}
\delta^{(l)} &= \frac{\partial C}{\partial z^{(l)}} \\
&= \frac{\partial C}{\partial a^{(l)}} \odot \frac{\partial a^{(l)}}{\partial z^{(l)}} \\
&= \begin{cases}
(a^{(L)} - y) \odot \sigma'(z^{(L)}) & \text{if}\ l = L \\
\left(\left(\frac{\partial z^{(l+1)}}{\partial a^{(l)}}\right)^{T} \frac{\partial C}{\partial z^{(l+1)}}\right) \odot \sigma'(z^{(l)}) & \text{if}\ l<L \\
\end{cases} \\
&= \begin{cases}
(a^{(L)} - y) \odot \sigma'(z^{(L)}) & \text{if } l = L \\
((W^{(l+1)})^{T}\delta^{(l+1)}) \odot \sigma'(z^{(l)}) & \text{if } l<L \\
\end{cases} \\
\frac{\partial C}{\partial W^{(l)}} &= \frac{\partial C}{\partial z^{(l)}} \frac{\partial z^{(l)}}{\partial W^{(l)}} \\
&= \delta^{(l)}(a^{(l-1)})^{T} \\
\frac{\partial C}{\partial b^{(l)}} &= \frac{\partial C}{\partial z^{(l)}} \frac{\partial z^{(l)}}{\partial b^{(l)}} \\
&= \delta^{(l)}
\end{align}
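These formulas translate almost line by line into code. Below is a minimal NumPy sketch for a fully connected network with sigmoid activations and the MSE cost above; the list layout (`weights[0]` is \(W^{(2)}\), `weights[-1]` is \(W^{(L)}\)) is just one possible choice made for this example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(weights, biases, x):
    # a^{(1)} = x, then z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, a^{(l)} = sigma(z^{(l)})
    activations, zs = [x], []
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)
    return zs, activations

def backward(weights, biases, x, y):
    zs, activations = forward(weights, biases, x)
    grads_W = [None] * len(weights)
    grads_b = [None] * len(biases)
    # output layer: delta^{(L)} = (a^{(L)} - y) * sigma'(z^{(L)})
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grads_W[-1] = np.outer(delta, activations[-2])    # delta^{(L)} (a^{(L-1)})^T
    grads_b[-1] = delta
    # hidden layers: delta^{(l)} = ((W^{(l+1)})^T delta^{(l+1)}) * sigma'(z^{(l)})
    for l in range(len(weights) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigmoid_prime(zs[l])
        grads_W[l] = np.outer(delta, activations[l])  # delta^{(l)} (a^{(l-1)})^T
        grads_b[l] = delta
    return grads_W, grads_b
```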
Take a neural network with \(L = 3\) (3 inputs, 3 hidden units in layer 2, and 2 output units in layer 3) as an example:
\begin{align}
\label{eq:11}
\delta^{(l)} &= \begin{cases}
(a^{(L)} - y) \odot \sigma'(z^{(L)}) & \text{if } l = L \\
((W^{(l+1)})^{T}\delta^{(l+1)}) \odot \sigma'(z^{(l)}) & \text{if } l < L
\end{cases} \\
&= \begin{cases}
\begin{bmatrix}
a_{1}^{(3)} - y_{1} \\
a_{2}^{(3)} - y_{2}
\end{bmatrix} \odot \begin{bmatrix}
\sigma'(z_{1}^{(3)}) \\
\sigma'(z_{2}^{(3)})
\end{bmatrix} & \text{if } l = L = 3 \\
\left(\begin{bmatrix}
w_{1,1}^{(3)} & w_{2,1}^{(3)} \\
w_{1,2}^{(3)} & w_{2,2}^{(3)} \\
w_{1,3}^{(3)} & w_{2,3}^{(3)}
\end{bmatrix}
\begin{bmatrix}
\delta_{1}^{(3)} \\
\delta_{2}^{(3)}
\end{bmatrix}\right) \odot \begin{bmatrix}
\sigma'(z_{1}^{(2)}) \\
\sigma'(z_{2}^{(2)}) \\
\sigma'(z_{3}^{(2)})
\end{bmatrix} & \text{if } l = 2
\end{cases} \\
\frac{\partial C}{\partial W^{(l)}} &= \delta^{(l)} (a^{(l-1)})^{T} \\
&= \begin{cases}
\begin{bmatrix}
\delta_{1}^{(3)} \\
\delta_{2}^{(3)}
\end{bmatrix} \begin{bmatrix}
a_{1}^{(2)} & a_{2}^{(2)} & a_{3}^{(2)}
\end{bmatrix} & \text{if } l = 3 \\
\begin{bmatrix}
\delta_{1}^{(2)} \\
\delta_{2}^{(2)} \\
\delta_{3}^{(2)}
\end{bmatrix} \begin{bmatrix}
x_{1} & x_{2} &x_{3}
\end{bmatrix} & \text{if } l = 2
\end{cases} \\
\frac{\partial C}{\partial b^{(l)}} &= \delta^{(l)} \\
&= \begin{cases}
\begin{bmatrix}
\delta_{1}^{(3)} \\
\delta_{2}^{(3)}
\end{bmatrix} & \text{if } l = 3 \\
\begin{bmatrix}
\delta_{1}^{(2)} \\
\delta_{2}^{(2)} \\
\delta_{3}^{(2)}
\end{bmatrix} & \text{if } l = 2
\end{cases}
\end{align}
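One way to sanity-check these formulas is to compare the backpropagated gradients with numerical (finite-difference) gradients on a small network of the same shape as this example (3 inputs, 3 hidden units, 2 outputs). The sketch below reuses the `forward` / `backward` functions from the previous code block; all concrete numbers are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# a 3-3-2 network like the L = 3 example above
weights = [rng.normal(size=(3, 3)), rng.normal(size=(2, 3))]   # W^{(2)}, W^{(3)}
biases  = [rng.normal(size=3), rng.normal(size=2)]             # b^{(2)}, b^{(3)}
x, y = rng.normal(size=3), rng.normal(size=2)

def cost(ws, bs):
    _, activations = forward(ws, bs, x)
    return 0.5 * np.sum((activations[-1] - y) ** 2)

grads_W, _ = backward(weights, biases, x, y)

# finite-difference check on one entry of W^{(2)}
eps = 1e-6
W_plus = [w.copy() for w in weights]
W_plus[0][0, 1] += eps
numeric = (cost(W_plus, biases) - cost(weights, biases)) / eps
print(numeric, grads_W[0][0, 1])   # the two values should be close
```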