Deep Learning Lecture notes

Grading

  • HW 15%
  • Coding 40%
  • Project 25%
  • Final 20%
  • Note 8%

Overview

McCulloch-Pitts Neuron

  • $g(x) = \sum_i x_i$
  • $y = f(g(x)) = \mathbb{I}[g(x) > \theta]$
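
A minimal sketch of the neuron defined above, assuming binary inputs. The function name `mcculloch_pitts` and the choice of thresholds are illustrative, not from the lecture.

```python
def mcculloch_pitts(inputs, theta):
    """Return 1 when the sum of the binary inputs exceeds theta, else 0."""
    g = sum(inputs)                # g(x) = sum_i x_i
    return 1 if g > theta else 0   # y = I[g(x) > theta]


# With two inputs, theta = 1 behaves like AND and theta = 0 like OR.
pairs = [(a, b) for a in (0, 1) for b in (0, 1)]
and_outputs = [mcculloch_pitts([a, b], theta=1) for a, b in pairs]
or_outputs = [mcculloch_pitts([a, b], theta=0) for a, b in pairs]
```

Only the threshold changes between the two gates, since the unit has no weights.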

Hebbian Learning

  • $w_i$ is the weight between $x_i$ and $y$
  • $w_i = w_i + \eta x_i y$
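
The update rule above can be sketched as a single function. The name `hebbian_update` and the sample values are assumptions made for illustration.

```python
def hebbian_update(weights, x, y, eta=0.1):
    """One Hebbian step: w_i <- w_i + eta * x_i * y for every weight."""
    return [w + eta * xi * y for w, xi in zip(weights, x)]


w = [0.0, 0.0, 0.0]
# Only weights whose input is active together with the output grow.
w = hebbian_update(w, x=[1, 0, 1], y=1, eta=0.5)
```

The rule contains no error term, so repeated co-activation keeps increasing the weights without bound.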

The Perceptron

  • If $\sum_i w_i x_i - T > 0$ then $y = 1$ else $y = 0$
  • $w = w + \eta(d(x) - y(x))x$
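
A hedged sketch of the perceptron rule above, trained on the AND function. The notes give an update only for $w$. Learning the threshold $T$ too, by treating it as a weight on a constant input of $-1$, is an assumption added here, as are the function names.

```python
def predict(w, T, x):
    """y = 1 if sum_i w_i x_i - T > 0 else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) - T > 0 else 0


def train_perceptron(data, eta=1.0, epochs=20):
    """data is a list of (x, d) pairs with desired output d in {0, 1}."""
    w, T = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, d in data:
            err = d - predict(w, T, x)                     # d(x) - y(x)
            w = [wi + eta * err * xi for wi, xi in zip(w, x)]
            T -= eta * err   # assumption: T acts as a weight on input -1
    return w, T


and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, T = train_perceptron(and_data)
predictions = [predict(w, T, x) for x, _ in and_data]
```

AND is linearly separable, so the updates stop once every example is classified correctly.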

Multi-layer Perceptron

  • can compose arbitrarily complicated Boolean functions
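
As one example of composing Boolean functions, XOR can be built from three threshold units in two layers. The weights and thresholds below are hand-picked for illustration and are one of many valid choices.

```python
def unit(weights, T, x):
    """Threshold unit: 1 if sum_i w_i x_i - T > 0 else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) - T > 0 else 0


def xor(a, b):
    h_or = unit([1, 1], 0.5, [a, b])           # a OR b
    h_nand = unit([-1, -1], -1.5, [a, b])      # NOT (a AND b)
    return unit([1, 1], 1.5, [h_or, h_nand])   # AND of the two hidden units


xor_table = [xor(a, b) for a in (0, 1) for b in (0, 1)]
```

No single threshold unit can produce this table, which is why the hidden layer is needed.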

Training a 3-node neural network is NP-complete.

The first AI Winter

  • The single-layer perceptron cannot represent XOR.
  • Training even a small network is NP-complete.
  • Funding cuts

Differentiable Functions

  • $z = \sum_i w_i x_i$
  • $\sigma(z) = \frac{1}{1 + e^{-z}}$
  • The output is now differentiable in the weights, so gradients exist and the weights can be updated by back-propagation
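
A small sketch of why the sigmoid makes gradient-based updates possible. It uses the standard identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. The helper names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def neuron_grad(w, x):
    """Gradient of sigma(sum_i w_i x_i) with respect to each weight w_i."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return [sigmoid_grad(z) * xi for xi in x]   # chain rule: dz/dw_i = x_i
```

The derivative peaks at 0.25 for $z = 0$ and shrinks toward zero as $|z|$ grows.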

Supervised Learning (1)

Analytical solution

  • Consider $f(X) = f(x_1, x_2, \cdots, x_n)$
  • Solve $\nabla f(X) = 0$
  • Calculate $\nabla^2 f(X)$ and verify whether it is positive definite (which confirms a local minimum)
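
The three steps above can be sketched on a hypothetical quadratic, $f(x_1, x_2) = x_1^2 + x_1 x_2 + x_2^2 - 3x_1$, chosen only for illustration. Its gradient is linear, so $\nabla f = 0$ reduces to a 2x2 linear system, and its Hessian is constant.

```python
def solve_2x2(A, b):
    """Solve A x = b for a 2x2 matrix A using Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    x1 = (b[0] * A[1][1] - A[0][1] * b[1]) / det
    x2 = (A[0][0] * b[1] - b[0] * A[1][0]) / det
    return [x1, x2]


def is_positive_definite_2x2(H):
    """Sylvester's criterion for a symmetric 2x2 matrix."""
    return H[0][0] > 0 and H[0][0] * H[1][1] - H[0][1] * H[1][0] > 0


# grad f = (2*x1 + x2 - 3, x1 + 2*x2), so grad f = 0 is H x = (3, 0)
H = [[2.0, 1.0], [1.0, 2.0]]          # Hessian of f
stationary = solve_2x2(H, [3.0, 0.0])
is_minimum = is_positive_definite_2x2(H)
```

Here the stationary point is $(2, -1)$, and the positive definite Hessian confirms it is a minimum.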

Iterative solution

  • Start with a guess $X^0$

  • Iteratively refine $X$ until $\nabla_X f(X) = 0$ is reached

  • To minimize, step in the negative gradient direction

  • Multi-class

    • softmax function

    • $f_c(x) = \frac{\exp(z_c)}{\sum_j \exp(z_j)}$

    • $z_i$ is often called a logit (it refers to an unscaled value)

    • error function: NLL loss (negative log-likelihood loss)

      $\mathrm{err}(f(x^i; w), y^i) = -\log f_{y^i}(x^i; w)$
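
A minimal sketch of the softmax and NLL loss above for a single example. Subtracting the largest logit before exponentiating is a common numerical-stability step that is not in the notes; it leaves the result unchanged.

```python
import math


def softmax(logits):
    """f_c = exp(z_c) / sum_j exp(z_j), computed stably."""
    m = max(logits)                        # shift cancels in the ratio
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll_loss(logits, label):
    """Negative log-likelihood of the true class index `label`."""
    return -math.log(softmax(logits)[label])
```

For example, `nll_loss([0.0, 0.0, 0.0], 1)` equals $\log 3$, the loss of a uniform guess over three classes.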

  • Issues with sigmoid function

    • Output is always non-negative (not zero-centered)
      • Alternative: $\tanh$
    • Vanishing gradient
      • Alternative: $\mathrm{ReLU}$

  • Regularization

    • L2 norm on all the weights
    • $L(w) = \mathrm{Loss}(w) + \alpha \|w\|^2$
    • Keeps $w$ from growing too large, which helps prevent overfitting
    • This is also called weight decay
      • $\nabla_w L = \nabla_w \mathrm{Loss}(w) + 2\alpha w$ (the factor 2 is often absorbed into $\alpha$)
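
The iterative solution and the L2 penalty can be combined in one sketch: gradient descent on $\mathrm{Loss}(w) + \alpha \|w\|^2$. The loss, learning rate, and step count are hypothetical. The code uses the exact gradient $2\alpha w$ of the penalty.

```python
def gradient_descent(grad_loss, w0, lr=0.1, alpha=0.0, steps=200):
    """Minimize Loss(w) + alpha * ||w||^2 by stepping against the gradient."""
    w = list(w0)
    for _ in range(steps):
        g = grad_loss(w)
        # The penalty adds 2 * alpha * w_i, pulling each weight toward zero.
        w = [wi - lr * (gi + 2.0 * alpha * wi) for wi, gi in zip(w, g)]
    return w


# Hypothetical loss: Loss(w) = (w_0 - 3)^2 + (w_1 + 1)^2
def grad_loss(w):
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]


w_plain = gradient_descent(grad_loss, [0.0, 0.0])
w_decay = gradient_descent(grad_loss, [0.0, 0.0], alpha=1.0)
```

Without the penalty the iterate approaches $(3, -1)$. With $\alpha = 1$ it settles near $(1.5, -0.5)$, showing how the penalty shrinks the weights.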

Scanning MLP

  • Scan the whole input (an image, audio, etc.), feeding every local patch into the same MLP

  • This amounts to recognizing a local pattern regardless of where it appears

  • Effective in any situation where the data are expected to be composed of similar structures at different locations

    • E.g., speech recognition, image recognition
  • The loss is the sum of the losses over all regions

  • CNN

    • Terminology:
      • Filters: scans for a pattern on the map from the previous layer
      • Receptive Fields: corresponding patch in the input image
      • Strides: the scanning ‘hops’ for each filter
      • Padding
      • Zero padding means padding zeros around the border; it does not mean "no padding"
    • Fully convolutional network:
      • Downsample with convolution instead of pooling
      • Convolution with stride > 1 to reduce the size
      • Equivalent to learning a pooling operator
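
A 1D sketch of the terminology above: one shared filter scans the input, with zero padding and a configurable stride. The function `conv1d` and the sample signal are illustrative. The operation is cross-correlation, which is what deep learning libraries usually call convolution.

```python
def conv1d(signal, kernel, stride=1, padding=0):
    """Slide one shared filter over a zero-padded 1D signal."""
    padded = [0.0] * padding + list(signal) + [0.0] * padding
    k = len(kernel)
    out = []
    for start in range(0, len(padded) - k + 1, stride):
        patch = padded[start:start + k]    # receptive field of this output
        out.append(sum(p * w for p, w in zip(patch, kernel)))
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
kernel = [1.0, 0.0, -1.0]
same = conv1d(signal, kernel, stride=1, padding=1)   # keeps length 6
down = conv1d(signal, kernel, stride=2, padding=1)   # downsamples to 3
```

The output length is $\lfloor (n + 2p - k) / s \rfloor + 1$, so stride 2 roughly halves the size with no separate pooling layer.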

Supervised Learning (2)

https://cloud.tsinghua.edu.cn/f/4eb80bdf09cc42c19ffd/