Machine Learning Notes

Lecture 2·

  • bottleneck: labeling

overfit·

  • 应对 overfit: restrict the representation power -> “regularization”
  • modern view: overfit 并不是问题在 SGD 中天生的就会有 implicit regularization 来减少 overfit 的可能

Unsupervised Learning·

  • clustering
  • PCA
  • generative model
  • anomaly detection
  • dimension reduction (PCA application)

Semi-supervised Learning·

image-20250302110210897|650

Lecture 3·

Optimization·

  • zero-order method
    • only knows f(x)f(x)
    • hyperparameter tuning
  • first-order method
    • knows f(x),f(x)f(x), f'(x)
  • second-order method
    • knows f(x),f(x),f(x)f(x), f'(x), f''(x)
    • Hessian matrix is of size O(d2)O(d^2) , dd denotes the number of parameters
    • 太费时间

Gradient descent·

wt+1=wtηL(wt)w_{t+1}=w_t-\eta\nabla L(w_t)

  • Smoothness assumption: f(w)L\|f''(w)\| \leq L

什么叫 smoothness: 梯度函数是 L-Lipschitz 的

f(x)f(y)Lxy.\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|.

梯度不会剧烈变化就说明每一步的梯度下降的步长都比较稳定

  • Lemma: f(w)f'(w) 是 L-Lipschitz 当且仅当 f(w)L,wRn\|f''(w)\| \leq L, \forall w \in R^n

Proof

f(w)L\| f''(w) \| \leq L

f(y)f(x)=(01f(x+τ(yx))dτ)(yx)Lyx\begin{align*} \|f'(y) - f'(x)\| &= \left\|\left(\int _ 0 ^ 1f''(x + \tau (y - x)) d\tau\right) \cdot (y - x) \right \| \\ &\leq L \| y - x \| \end{align*}

f(y)f(x)Lyx\| f'(y) -f '(x)\| \leq L \|y - x\|

(0αf(x+τs)dτ)s=f(x+αs)f(x)αLs.\left\| \left(\int_{0} ^ {\alpha}f^{\prime\prime}(x+\tau s)d\tau\right)\cdot s\right\| =\parallel f^{\prime}(x+\alpha s)-f^{\prime}(x)\parallel\leq\alpha L\parallel s\parallel.

两边除掉 α\alpha 并令 α0\alpha \rightarrow 0 就有 f(x)L\|f''(x)\| \leq L.

就是对于每个 xx 考虑 xx 周围 δ\delta 的邻域

  • Lemma: f(w)f'(w) 是 L-Lipschitz 时

    f(y)f(x)f(x),yxL2yx2.\mid f(y)-f(x)-\langle f^{\prime}(x),y-x\rangle\mid\leq\frac{L}{2}\parallel y-x\parallel^2.

image-20250302192013008|800

  • 当满足梯度光滑条件时考虑学习率 η\eta 的合适取值范围

我们有 w=wηf(w)w' = w - \eta f'(w)

f(w)f(w)f(w),ww+L2ww2=ηf(w)2+η2L2f(w)2=η(1ηL2)f(w)2\begin{align*} f(w') - f(w) &\leq \langle f'(w), w'- w\rangle + \frac L 2 \| w'-w\| ^ 2 \\ &= - \eta \| f'(w)\|^2 + \frac {\eta ^ 2L} 2 \|f'(w)\|^2 \\ &= -\eta \left(1 - \frac {\eta L} 2\right) \|f'(w)\|^2 \end{align*}

我们希望每一次更新都有 f(w)f(w)<0f(w') - f(w) < 0这需要 η<2L\eta < \frac 2 L.

这意味着如果函数 ff 的梯度平滑那么梯度下降的效果有很好的保证