Deep Learning Lecture notes

Grading

Training a 3-node neural network is NP-complete.

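Gradient descent: start with an initial guess $X^0$,
iteratively refine $X$ until $\nabla_X f(X) = 0$ is reached (in practice, until the gradient norm is close to zero),
moving against the gradient direction at each step, since the negative gradient points downhill.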
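
The procedure above can be sketched in a few lines. This is a minimal illustration, not the lecture's code: the function name `gradient_descent`, the step size `lr`, the tolerance `tol`, and the quadratic test objective are all assumptions chosen for the example.

```python
def gradient_descent(grad, x0, lr=0.1, tol=1e-8, max_iter=10_000):
    """Minimize a function given its gradient `grad`, starting from guess `x0`."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        # Stop once the gradient is numerically zero.
        if sum(gi * gi for gi in g) ** 0.5 < tol:
            break
        # Step against the gradient direction.
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


# Assumed example objective: f(x, y) = (x - 3)^2 + (y + 1)^2, minimized at (3, -1).
def grad_f(x):
    return [2 * (x[0] - 3), 2 * (x[1] + 1)]


x_min = gradient_descent(grad_f, [0.0, 0.0])
```

For a convex objective like this one the iterates approach the unique minimum; for a neural network loss the same loop only finds a stationary point.
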
Multi-class
- softmax function
- $f_c(x) = \frac{\exp(z_c)}{\sum_j \exp(z_j)}$
- $z_i$ is often called a logit (an unnormalized score, before the softmax is applied)
- error function: NLL loss (negative log-likelihood loss)
  
  $\mathrm{err}(f(x^i; w), y^i) = -\log f_{y^i}(x^i; w)$
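
A small sketch of the softmax and NLL loss defined above, using only the standard library. Subtracting `max(z)` before exponentiating is a common numerical-stability trick and is an addition here, not something stated in the notes; the example logits are hypothetical.

```python
import math


def softmax(z):
    """Softmax over a list of logits, shifted by max(z) for numerical stability."""
    m = max(z)
    exps = [math.exp(zj - m) for zj in z]
    total = sum(exps)
    return [e / total for e in exps]


def nll_loss(z, y):
    """Negative log-likelihood of the true class index y given logits z."""
    return -math.log(softmax(z)[y])


logits = [2.0, 1.0, 0.1]      # hypothetical logits for 3 classes
probs = softmax(logits)       # non-negative, sums to 1
loss = nll_loss(logits, 0)    # small when class 0 gets high probability
```

The shift by `max(z)` does not change the result, because the common factor cancels between numerator and denominator.
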
Issues with sigmoid function
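- Output is always positive, so activations are not zero-centered
  - Alternative: $\tanh$
- Vanishing gradients: the derivative is at most $1/4$ and approaches 0 as $|z|$ grows
  - Alternative: $\mathrm{ReLU}$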
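
Both issues can be checked numerically. The sketch below is illustrative only; the helper names and the depth of 10 layers are assumptions made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # derivative of the sigmoid; largest at z = 0


def relu_grad(z):
    return 1.0 if z > 0 else 0.0   # gradient is exactly 1 on the active side


# Sigmoid outputs are positive for every input, while tanh is zero-centered.
outputs_sigmoid = [sigmoid(z) for z in (-5.0, 0.0, 5.0)]
outputs_tanh = [math.tanh(z) for z in (-5.0, 0.0, 5.0)]

# Backpropagating through `depth` sigmoids multiplies the gradient by at most 0.25 each time.
depth = 10
sigmoid_bound = 0.25 ** depth
```

With ReLU the per-layer factor is 1 wherever the unit is active, so the product does not shrink with depth in the same way.
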
Regularization
- Penalize the squared L2 norm of all the weights
- $L(w) = \mathrm{Loss}(w) + \alpha \|w\|^2$
- Keeps $w$ from growing too large, which helps prevent overfitting
- This is also called weight decay
  - $\nabla_w L = \nabla_w \mathrm{Loss}(w) + 2\alpha w$
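
To see why this is called weight decay, here is a minimal sketch of one gradient step on the regularized objective. It assumes the penalty is $\alpha \|w\|^2$, whose gradient is $2\alpha w$; the function names, the learning rate `lr`, and the zero data gradient are hypothetical choices for the example.

```python
def regularized_grad(loss_grad, w, alpha):
    """Gradient of Loss(w) + alpha * ||w||^2, given a function for the gradient of Loss."""
    return [g + 2.0 * alpha * wi for g, wi in zip(loss_grad(w), w)]


def sgd_step(loss_grad, w, alpha, lr):
    """One gradient-descent step on the regularized objective."""
    g = regularized_grad(loss_grad, w, alpha)
    return [wi - lr * gi for wi, gi in zip(w, g)]


# With a zero data gradient, each step only shrinks w by the factor (1 - 2 * lr * alpha).
def zero_grad(w):
    return [0.0 for _ in w]


w = [1.0, -2.0]
w_next = sgd_step(zero_grad, w, alpha=0.1, lr=0.5)   # shrink factor is 0.9
```

Every step multiplies the weights by a constant slightly below 1 before the data gradient is applied, so weights that the data does not support decay toward zero.
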

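Scan the whole input (an image, audio, etc.) and feed every local patch into the same MLP.
This amounts to recognizing a local pattern regardless of where it appears.
Effective in any situation where the data are expected to be composed of similar structures at different locations
- E.g. speech recognition, image recognition
The loss is the sum of the losses over all regions.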
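
A 1-D sketch of the scanning idea: the same small MLP is applied to every window, and the per-window losses are summed. The network shape, the weights, the binary cross-entropy per window, and the input signal are all assumptions made so the example runs; the notes do not specify them.

```python
import math


def mlp(window, w_hidden, w_out):
    """A tiny MLP: one tanh hidden layer and a sigmoid output (biases omitted)."""
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(row, window)))
              for row in w_hidden]
    score = sum(wo * h for wo, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-score))


def scan(signal, width, w_hidden, w_out):
    """Apply the same MLP (shared weights) to every window of the signal."""
    return [mlp(signal[i:i + width], w_hidden, w_out)
            for i in range(len(signal) - width + 1)]


def total_loss(outputs, targets):
    """Sum over windows of an assumed per-window binary cross-entropy loss."""
    return sum(-(t * math.log(p) + (1 - t) * math.log(1 - p))
               for p, t in zip(outputs, targets))


# Hypothetical weights and data.
w_hidden = [[1.0, -1.0, 0.5], [0.5, 0.5, -1.0]]
w_out = [1.0, -1.0]
signal = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
outputs = scan(signal, 3, w_hidden, w_out)
loss = total_loss(outputs, [1, 0, 0, 1])
```

The first and last windows contain the same values, so they get the same output: the shared weights respond to the pattern, not to its position.
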
CNN
- Terminology:
  - Filters: scans for a pattern on the map from the previous layer
  - Receptive Fields: corresponding patch in the input image
  - Strides: the scanning ‘hops’ for each filter
  - Padding: extra values added around the border of the map
    - Zero padding means padding the border with zeros; it does not mean "no padding"
- Fully convolutional network:
  - Downsample with convolutions instead of pooling layers
  - Use convolution with stride > 1 to reduce the size
  - Equivalent to learning the pooling operator
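
The terms above (filter, stride, zero padding, strided downsampling) can be illustrated with a 1-D convolution. This is a sketch under stated assumptions: `conv1d` and `output_size` are hypothetical helpers written for this example, and the kernels are hand-picked rather than learned.

```python
def conv1d(x, kernel, stride=1, padding=0):
    """1-D convolution (cross-correlation, as in most DL libraries) with zero padding."""
    padded = [0.0] * padding + list(x) + [0.0] * padding
    k = len(kernel)
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(0, len(padded) - k + 1, stride)]


def output_size(n, k, stride=1, padding=0):
    """Number of output positions for input length n and filter width k."""
    return (n + 2 * padding - k) // stride + 1


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Stride 2 halves the length; with these fixed weights it acts like average pooling.
mean_kernel = [0.5, 0.5]
pooled = conv1d(x, mean_kernel, stride=2)

# Zero padding of 1 with a width-3 filter keeps the output the same length as the input.
same = conv1d(x, [1.0, 0.0, -1.0], stride=1, padding=1)
```

If the kernel weights were trained instead of fixed, the strided convolution would learn its own downsampling rule, which is the sense in which it replaces pooling.
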

https://cloud.tsinghua.edu.cn/f/4eb80bdf09cc42c19ffd/