2025-01-13

Machine Learning Notes

Lecture 2·

bottleneck: labeling

overfit·

Countering overfitting: restrict the representation power -> “regularization”
modern view: overfitting is not really a problem; SGD inherently provides implicit regularization that reduces the chance of overfitting
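
As one concrete illustration of explicit regularization, here is a minimal sketch of an L2 (ridge) penalty on a single-weight least-squares fit. The choice of ridge, the function name `ridge_1d`, and the toy data are assumptions for illustration; the lecture notes do not name a specific regularizer.

```python
def ridge_1d(xs, ys, lam):
    """Minimize sum((w*x - y)^2) + lam * w^2 over a single weight w.

    Setting the derivative to zero gives w = sum(x*y) / (sum(x^2) + lam).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Toy data (hypothetical), roughly y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

# A larger penalty lam shrinks the weight toward zero,
# i.e. restricts what the model can represent.
weights = [ridge_1d(xs, ys, lam) for lam in (0.0, 1.0, 10.0)]
print(weights)
```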

Unsupervised Learning·

clustering
PCA
generative model
anomaly detection
dimension reduction (PCA application)

Semi-supervised Learning·


Lecture 3·

Optimization·

zero-order method
- only knows $f(x)$
- hyperparameter tuning
first-order method
- knows $f(x), f'(x)$
second-order method
- knows $f(x), f'(x), f''(x)$
- the Hessian matrix has $O(d^2)$ entries, where $d$ denotes the number of parameters
- too expensive to compute in practice
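
A minimal sketch of a zero-order method, assuming plain random search as the strategy (the notes only say that zero-order methods see $f(x)$ and are used for hyperparameter tuning). The names `random_search` and `toy_loss` are hypothetical.

```python
import random

def random_search(f, low, high, n_trials=200, seed=0):
    """Zero-order minimization: only evaluates f, never its derivatives."""
    rng = random.Random(seed)
    best_x, best_val = None, float("inf")
    for _ in range(n_trials):
        x = rng.uniform(low, high)
        val = f(x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

def toy_loss(x):
    # Hypothetical validation loss as a function of one hyperparameter;
    # its true minimizer is x = 0.3.
    return (x - 0.3) ** 2 + 1.0

best_x, best_val = random_search(toy_loss, 0.0, 1.0)
print(best_x, best_val)
```

Each trial costs one function evaluation, which is why this style of method suits settings such as hyperparameter tuning, where gradients are unavailable.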

Gradient descent·

$w_{t+1}=w_t-\eta\nabla L(w_t)$
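
The update rule can be sketched in a few lines. This is a minimal sketch on an assumed toy loss $L(w) = \frac12 (w_0^2 + 4 w_1^2)$; the function names, starting point, and step size are illustrative choices, not from the lecture.

```python
def gradient_descent(grad, w0, eta, n_steps):
    """Iterate w_{t+1} = w_t - eta * grad(w_t) for n_steps steps."""
    w = list(w0)
    for _ in range(n_steps):
        g = grad(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

def quad_grad(w):
    # Gradient of the toy loss L(w) = 0.5 * (w[0]**2 + 4 * w[1]**2).
    return [1.0 * w[0], 4.0 * w[1]]

w_final = gradient_descent(quad_grad, [3.0, -2.0], eta=0.1, n_steps=200)
print(w_final)  # both coordinates end up very close to the minimizer (0, 0)
```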

Smoothness assumption: $\|f''(w)\| \leq L$

What smoothness means: the gradient is L-Lipschitz, i.e.,

$\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|.$

The gradient does not change abruptly, so the size of each gradient descent step stays fairly stable.

Lemma: for twice-differentiable $f$, $f'(w)$ is L-Lipschitz if and only if $\|f''(w)\| \leq L, \forall w \in \mathbb{R}^n$

Proof

If $\| f''(w) \| \leq L$ for all $w$, then

$\begin{align*} \|f'(y) - f'(x)\| &= \left\|\left(\int _ 0 ^ 1f''(x + \tau (y - x)) d\tau\right) \cdot (y - x) \right \| \\ &\leq L \| y - x \| \end{align*}$

Conversely, if $\| f'(y) - f'(x)\| \leq L \|y - x\|$ for all $x, y$, then for any direction $s$ and any $\alpha > 0$,

$\left\| \left(\int_{0}^{\alpha} f''(x+\tau s)\, d\tau\right)\cdot s\right\| = \| f'(x+\alpha s)-f'(x)\| \leq \alpha L \| s\|.$

Dividing both sides by $\alpha$ and letting $\alpha \rightarrow 0$ gives $\|f''(x)\, s\| \leq L \|s\|$ for every $s$, i.e., $\|f''(x)\| \leq L$.

In other words, for each $x$ we only need to look at a small $\delta$-neighborhood around $x$.
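
A quick numerical sanity check of the lemma, sketched on an assumed example: the softplus function $f(x) = \log(1 + e^x)$, whose derivative is the sigmoid and whose second derivative $\sigma(x)(1 - \sigma(x))$ is bounded by $L = \frac14$. The function names and the sampling range are illustrative choices.

```python
import math
import random

def sigmoid(x):
    # f'(x) for f(x) = log(1 + exp(x)); f''(x) = sigmoid(x) * (1 - sigmoid(x)) <= 1/4.
    return 1.0 / (1.0 + math.exp(-x))

def max_gradient_ratio(n_pairs=1000, seed=0):
    """Largest observed |f'(x) - f'(y)| / |x - y| over random pairs."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(n_pairs):
        x, y = rng.uniform(-10, 10), rng.uniform(-10, 10)
        if x == y:
            continue
        ratio = abs(sigmoid(x) - sigmoid(y)) / abs(x - y)
        worst = max(worst, ratio)
    return worst

worst_ratio = max_gradient_ratio()
print(worst_ratio)  # never exceeds the bound L = 0.25 on f''
```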

Lemma: if $f'(w)$ is L-Lipschitz, then
$\left| f(y)-f(x)-\langle f'(x),y-x\rangle \right| \leq \frac{L}{2}\| y-x\|^2.$
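
A short derivation of this bound, sketched under the assumption that $f$ is differentiable with an L-Lipschitz gradient. It applies the fundamental theorem of calculus along the segment from $x$ to $y$, then Cauchy-Schwarz, then the Lipschitz condition:

$\begin{align*} f(y) - f(x) - \langle f'(x), y - x\rangle &= \int_0^1 \langle f'(x + \tau(y - x)) - f'(x),\, y - x\rangle \, d\tau \\ \left| f(y) - f(x) - \langle f'(x), y - x\rangle \right| &\leq \int_0^1 \| f'(x + \tau(y - x)) - f'(x)\| \, \|y - x\| \, d\tau \\ &\leq \int_0^1 L\tau \|y - x\|^2 \, d\tau = \frac{L}{2}\|y - x\|^2 \end{align*}$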

When the gradient smoothness condition holds, consider which range of the learning rate $\eta$ is suitable.

With the update $w' = w - \eta f'(w)$, we have

$\begin{align*} f(w') - f(w) &\leq \langle f'(w), w'- w\rangle + \frac L 2 \| w'-w\| ^ 2 \\ &= - \eta \| f'(w)\|^2 + \frac {\eta ^ 2L} 2 \|f'(w)\|^2 \\ &= -\eta \left(1 - \frac {\eta L} 2\right) \|f'(w)\|^2 \end{align*}$

We want every update to satisfy $f(w') - f(w) < 0$. For $f'(w) \neq 0$, the bound above guarantees this when $0 < \eta < \frac 2 L$.

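This means that if the gradient of $f$ is smooth, gradient descent comes with a good guarantee: a suitably small step size decreases the objective at every step.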
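
The threshold $\eta < \frac 2 L$ can be seen on a toy example. This is a minimal sketch assuming the one-dimensional quadratic $f(w) = \frac L 2 w^2$ with $L = 4$, for which $f''(w) = L$ exactly; the function name `run_gd` and the specific step sizes are illustrative.

```python
def run_gd(eta, L=4.0, w0=1.0, n_steps=50):
    """Run gradient descent on f(w) = (L/2) * w^2 and return the loss history."""
    def f(w):
        return 0.5 * L * w * w

    w = w0
    losses = [f(w)]
    for _ in range(n_steps):
        w = w - eta * L * w  # f'(w) = L * w
        losses.append(f(w))
    return losses

small = run_gd(eta=0.4)  # eta < 2/L = 0.5: the loss decreases at every step
large = run_gd(eta=0.6)  # eta > 2/L: the loss grows at every step

print(small[-1], large[-1])
```

On this quadratic each step multiplies $w$ by $1 - \eta L$, so the loss shrinks exactly when $|1 - \eta L| < 1$, which is the same condition $0 < \eta < \frac 2 L$.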