Machine Learning

Matrix derivative formulas

\begin{gathered} \frac{\partial\left(a^{T} x\right)}{\partial x}=a, \quad \frac{\partial\left(x^{T} A x\right)}{\partial x}=\left(A+A^{T}\right) x \\ d(f(x))=\operatorname{tr}\left(\left(\frac{\partial f(x)}{\partial x}\right)^{T} d x\right) \end{gathered}

Column vectors are used throughout.

K-means

Optimization objective:

\sum_{n=1}^{N} \sum_{k=1}^{K} r_{n k}\left\|x_{n}-\mu_{k}\right\|^{2}

First initialize the mean points (centroids), assign each point to its nearest centroid, then recompute each centroid from the points assigned to it.

Possible assignments: K^N / K!
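A minimal NumPy sketch of the assign-then-update loop described above (function and variable names are my own, not from the notes):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means: assign each point to the nearest mean, then recompute means."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]        # initialize means from the data
    for _ in range(n_iters):
        # Assignment step: r[n] = index of the closest mean to x_n.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (N, K) squared distances
        r = d2.argmin(axis=1)
        # Update step: each mean becomes the average of its assigned points.
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, r
```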

Limitations:

  • Local optima; sensitive to initialization (initial centroids may end up too far from the data points)

  • Decision boundaries are spherical, a limitation of Euclidean distance

  • The choice of K

Hierarchical clustering

Start with each point as its own cluster, then repeatedly merge the two closest clusters. This requires a definition of the distance between clusters.

Distance definitions (single, complete, and average linkage):

\begin{gathered} \min _{a \in A, b \in B} d(a, b) \\ \max _{a \in A, b \in B} d(a, b) \\ \frac{\sum_{a \in A, b \in B} d(a, b)}{|A|\,|B|} \end{gathered}
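A small sketch of the three cluster distances above for two point sets A and B (array names are illustrative):

```python
import numpy as np

def linkage_distances(A, B):
    """Single, complete, and average linkage between point sets A and B."""
    # Pairwise Euclidean distances d(a, b) for all a in A, b in B.
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    single = D.min()       # min_{a, b} d(a, b)
    complete = D.max()     # max_{a, b} d(a, b)
    average = D.mean()     # sum d(a, b) / (|A| |B|)
    return single, complete, average
```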

Adaptive learning

Points come one by one

Competitive learning (CL)

First initialize the mean points. For each incoming point, compute the distance from every mean to that point, then pull the winner (the nearest mean) toward the point, i.e. reward the winner.

\begin{aligned} & d_{k}=\left\|x-m_{k}\right\|^{2}, \quad k=1, \ldots, K \\ & p_{k}:= \begin{cases}1 & \text{if } k=\arg \min _{j} d_{j} \\ 0 & \text{otherwise}\end{cases} \end{aligned}

m_{k}:=m_{k}+\eta\, p_{k}\left(x-m_{k}\right)
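A minimal sketch of one CL update for a stream of points, assuming means m of shape (K, D) and a fixed learning rate eta (names are mine):

```python
import numpy as np

def cl_update(m, x, eta=0.05):
    """One competitive-learning step: the nearest mean (winner) moves toward x."""
    d = ((m - x) ** 2).sum(axis=1)   # squared distances to all K means
    k = d.argmin()                    # winner takes p_k = 1, all others 0
    m[k] += eta * (x - m[k])          # reward the winner: pull it toward x
    return m
```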

The equivalence between CL and K-means holds when η = 1/N.

Frequency sensitive competitive learning (FSCL)

The difference from CL is that means which win too often are penalized. A new problem: if K is not set appropriately, the K means end up carving up the sample points among themselves.

p_{k}:= \begin{cases}1 & \text{if } k=c=\arg \min _{j} d_{j} \text{ (winner)} \\ -\gamma & \text{if } k=\arg \min _{j \neq c} d_{j} \text{ (rival)} \\ 0 & \text{otherwise}\end{cases}

Rival penalized competitive learning (RPCL)

Reward the winner, penalize the rival (the runner-up mean).
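A sketch of one RPCL step under the same setup, where the rival is pushed away with a small de-learning rate gamma (both rates are illustrative):

```python
import numpy as np

def rpcl_update(m, x, eta=0.05, gamma=0.005):
    """Rival penalized competitive learning: reward the winner, penalize the rival."""
    d = ((m - x) ** 2).sum(axis=1)
    winner, rival = np.argsort(d)[:2]    # closest and second-closest means
    m[winner] += eta * (x - m[winner])   # pull the winner toward x
    m[rival] -= gamma * (x - m[rival])   # push the rival away from x
    return m
```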

From distance to distribution

GMM

\begin{gathered} d_{M}(x, \mu)=\sqrt{(x-\mu)^{T} \Sigma^{-1}(x-\mu)} \\ N(x \mid \mu, \Sigma)=\frac{1}{\sqrt{(2 \pi)^{k}|\Sigma|}} \exp \left(-\frac{1}{2}(x-\mu)^{T} \Sigma^{-1}(x-\mu)\right) \end{gathered}

Generative process: go from z to x, where z is a latent variable.

\begin{gathered} p\left(z_{k}=1\right)=\pi_{k}, \quad \sum_{k=1}^{K} \pi_{k}=1 \\ p(x)=\sum_{z} p(z) p(x \mid z)=\sum_{k=1}^{K} \pi_{k} N\left(x \mid \mu_{k}, \Sigma_{k}\right) \end{gathered}

Maximum likelihood estimation: take the log of the GMM likelihood and set the partial derivatives with respect to μ and Σ to zero.

EM for GMM

  • Initialize μ, Σ, π

  • E-step: compute the responsibility of each component for each point

  • M-step: re-estimate μ, Σ, π

  • Evaluate the likelihood and check for convergence (a sketch of the full loop follows this list)
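A compact sketch of the EM loop for a GMM, assuming SciPy is available; names, the number of iterations, and the small diagonal jitter are my own choices, not from the notes:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, seed=0):
    """Illustrative EM for a Gaussian mixture (no convergence test or safeguards)."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                        # mixing weights
    mu = X[rng.choice(N, K, replace=False)]         # means
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(n_iters):
        # E-step: responsibilities gamma[n, k] = p(z_k = 1 | x_n).
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                         for k in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi, mu, Sigma from the responsibilities.
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma
```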

Relation to K-means

Compared with K-means, GMM additionally accounts for mixing weights, covariance matrices, and soft assignments.

Like K-means, it can get stuck in local optima, and it does not by itself give the best K.

Algorithms between K-means and GMM, and their pros and cons

Reducing GMM (EM) down to K-means:

  • soft -> hard: set the component with the largest probability to 1 (others to 0); π_k = 1/K

  • Σ = I

General EM

  • Initialize θ

  • Using θ_old, compute the posterior probability

p\left(z \mid x, \theta^{\text{old}}\right)=\frac{p(z) p(x \mid z)}{p\left(x \mid \theta^{\text{old}}\right)}

  • Holding p(z | x, θ_old) fixed, update θ

\begin{aligned} \theta^{\text{new}} & =\arg \max _{\theta} Q\left(\theta, \theta^{\text{old}}\right) \\ Q\left(\theta, \theta^{\text{old}}\right) & =\sum_{z} p\left(z \mid x, \theta^{\text{old}}\right) \ln p(x, z \mid \theta) \end{aligned}

BMM proof

EM proof

Direction 1:

\begin{aligned} \log q(x \mid \theta) & =\log \left(\sum_{y} p(y \mid x) \frac{q(x, y \mid \theta)}{p(y \mid x)}\right) \\ & \geq \sum_{y} p(y \mid x) \log \left(\frac{q(x, y \mid \theta)}{p(y \mid x)}\right)=F(p(y \mid x), \theta) \end{aligned}

\begin{aligned} F(p(y \mid x), \theta) & =\sum_{y} p(y \mid x) \log \left(\frac{q(y \mid x, \theta)\, q(x \mid \theta)}{p(y \mid x)}\right) \\ & =\log q(x \mid \theta)+\sum_{y} p(y \mid x) \log \left(\frac{q(y \mid x, \theta)}{p(y \mid x)}\right) \\ & =\log q(x \mid \theta)-K L(p(y \mid x) \| q(y \mid x, \theta)) \end{aligned}

We want to maximize the lower bound F(p(y|x), θ), which means driving the KL divergence to zero: p(y|x) = q(y|x, θ) = q(x, y|θ) / q(x|θ).

  • E-step: fix θ_old, maximize F(p(y|x), θ_old) with respect to p(y|x)

  • M-step: fix p(y|x), maximize Σ_y p(y|x) log q(x, y|θ) with respect to θ

Direction 2:

Make the two distributions as close as possible:

\begin{gathered} p(x, y)=p(y \mid x) p(x) \\ q(x, y \mid \theta)=q(y \mid \theta) q(x \mid y, \theta) \end{gathered}

K L(p(y \mid x) p(x) \| q(y \mid \theta) q(x \mid y, \theta))=\int p(y \mid x) p(x) \log \frac{p(y \mid x) p(x)}{q(y \mid \theta) q(x \mid y, \theta)}

\begin{aligned} & =\int p(y \mid x) p(x) \log \frac{p(y \mid x) p(x)}{q(x \mid \theta) q(y \mid x, \theta)} \\ & =\int p(x)\, K L(p(y \mid x) \| q(y \mid x, \theta))-\int p(x) \log q(x \mid \theta)+\text{const} \end{aligned}

When p(y|x) is hard to compute in the E-step:

  • Sampling

  • Assume p(y | x) = G(y | μ, Σ)

When the derivative of the Q objective in the M-step cannot simply be set to zero (no closed form):

  • Gradient descent

Maximum Likelihood (ML) learning

Maximize the likelihood by setting its partial derivatives to zero.

\hat{\theta}_{\mathrm{MLE}}=\arg \max _{\theta} P(D \mid \theta)

Bayesian learning, Maximum A Posteriori (MAP)

p(\theta \mid x)=\frac{p(x \mid \theta) p(\theta)}{p(x)}

Maximize the posterior probability:

\begin{aligned} \hat{\theta}_{\mathrm{MAP}} & =\arg \max _{\theta} P(\theta \mid D) \\ & =\arg \max _{\theta} P(D \mid \theta) P(\theta) \end{aligned}

When P(θ) degenerates to a uniform distribution, MAP reduces to maximum likelihood.

Assume p(x|θ) = G(x|μ, Σ) and p(θ) = G(μ|μ_0, σ_0² I).

\begin{aligned} \log p(\mu \mid x) & =\sum_{t=1}^{N}\left(-\frac{k}{2} \log 2 \pi-\frac{1}{2} \log |\Sigma|-\frac{1}{2}\left(x_{t}-\mu\right)^{T} \Sigma^{-1}\left(x_{t}-\mu\right)\right) \\ & \quad-\frac{k}{2} \log 2 \pi-\frac{1}{2} \log \left|\sigma_{0}^{2} I\right|-\frac{1}{2}\left(\mu-\mu_{0}\right)^{T} \sigma_{0}^{-2} I\left(\mu-\mu_{0}\right) \\ \frac{\partial \log p(\mu \mid x)}{\partial \mu} & =\sum_{t=1}^{N} \Sigma^{-1}\left(x_{t}-\mu\right)-\sigma_{0}^{-2}\left(\mu-\mu_{0}\right) \\ & =\Sigma^{-1}(N \bar{x}-N \mu)-\sigma_{0}^{-2}\left(\mu-\mu_{0}\right)=0 \\ \mu & =\left(N \Sigma^{-1}+\sigma_{0}^{-2} I\right)^{-1}\left(N \Sigma^{-1} \bar{x}+\sigma_{0}^{-2} \mu_{0}\right) \end{aligned}

As N → ∞, μ approaches x̄: with enough data, the prior matters less and less. As σ_0 → 0, μ approaches μ_0.

\begin{aligned} \frac{1}{N} \log p(\theta \mid x) & \propto \frac{1}{N} \sum_{t=1}^{N} \log p\left(x_{t} \mid \theta\right)+\frac{1}{N} \log p(\theta) \\ & \rightarrow \mathbb{E}[\log p(x \mid \theta)]+0 \quad \text{as } N \rightarrow \infty \end{aligned}
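A quick numerical check of the MAP mean above and its two limits (function name and test values are mine):

```python
import numpy as np

def map_mean(X, Sigma, mu0, sigma0_sq):
    """MAP estimate of a Gaussian mean under the prior N(mu0, sigma0^2 I)."""
    N, D = X.shape
    Sinv = np.linalg.inv(Sigma)
    A = N * Sinv + np.eye(D) / sigma0_sq
    b = N * Sinv @ X.mean(axis=0) + mu0 / sigma0_sq
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, size=(10_000, 2))
mu0 = np.zeros(2)
# Large N: the data dominate and the estimate approaches the sample mean (~[2, 2]).
print(map_mean(X, np.eye(2), mu0, sigma0_sq=1.0))
# Very confident prior (sigma0 -> 0): the estimate collapses to mu0 (~[0, 0]).
print(map_mean(X[:5], np.eye(2), mu0, sigma0_sq=1e-8))
```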

Model selection

For a mixture of K Gaussians, AIC: log p(X_N | θ̂_K) − d_k, where d_k is the number of parameters.

BIC: log p(X_N | θ̂_K) − ½ d_k ln N, where N is the number of samples.
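A small sketch of the two scores above for a full-covariance GMM (the parameter count is the standard one for weights, means, and covariances; the helpers are my own names, and the maximized log-likelihood is assumed given):

```python
import numpy as np

def gmm_num_params(K, D):
    """Free parameters d_k of a K-component full-covariance GMM in D dimensions."""
    return (K - 1) + K * D + K * D * (D + 1) // 2   # weights + means + covariances

def aic(loglik, K, D):
    return loglik - gmm_num_params(K, D)            # log p(X_N | theta_K) - d_k

def bic(loglik, K, D, N):
    return loglik - 0.5 * gmm_num_params(K, D) * np.log(N)

# Model selection: fit each K, then pick the K with the largest score.
```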

\begin{gathered} p\left(x \mid \theta_{K}, K\right) \\ \widehat{K}=\arg \max _{K} p(x \mid K) \\ p(x \mid K)=\int p(x \mid \theta, K) p(\theta \mid K) d \theta=\int \exp (\log p(x \mid \theta, K)+\log p(\theta \mid K)) d \theta \end{gathered}

VBEM: drives some of the π_k toward zero.

Principal Component Analysis (PCA)

\begin{aligned} \frac{1}{N} \sum_{t=1}^{N}\left\|x_{t}-\left(x_{t}^{T} w\right) w\right\|^{2} & =\frac{1}{N} \sum_{t=1}^{N}\left(x_{t}^{T} x_{t}-w^{T}\left(x_{t} x_{t}^{T}\right) w\right) \\ & =\operatorname{tr}\left(\Sigma_{x}\right)-w^{T} \Sigma_{x} w \end{aligned}

Differentiate tr(Σ_x) − wᵀΣ_x w − λ(wᵀw − 1) with respect to w and set it to zero:

\Sigma_{x} w=\lambda w

From a matrix factorization perspective

Eigen-decomposition

\Sigma_{x}=U A U^{T}

SVD

X X^{T}=U D V^{T} V D U^{T}=U D^{2} U^{T}
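A brief NumPy check that the two routes above give the same principal direction on centered data (here rows of X are samples, so the SVD is taken of the data matrix itself; variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)                    # center the data

# Route 1: eigen-decomposition of the sample covariance, Sigma_x = U A U^T.
Sigma = Xc.T @ Xc / len(Xc)
eigvals, U = np.linalg.eigh(Sigma)         # eigenvalues in ascending order
w_eig = U[:, -1]                           # top principal direction

# Route 2: SVD of the centered data matrix; right singular vectors span the same space.
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)
w_svd = Vt[0]

print(np.abs(w_eig @ w_svd))               # ~1.0: same direction up to sign
```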

EM algorithm for FA

Each x corresponds to a latent variable y; x is then reconstructed (estimated) from y.

Randomly sample y and randomly generate noise e to produce x; y and e are independent.

\begin{gathered} y \sim G\left(y \mid 0, \Sigma_{y}\right) \\ e \sim G\left(e \mid 0, \sigma^{2} I\right) \\ x=A y+\mu+e \\ q(x \mid y)=G\left(x \mid A y+\mu, \sigma^{2} I\right) \\ q(x \mid \theta)=\int q(y) q(x \mid y) d y=G\left(x \mid \mu_{x}, \Sigma_{x}\right) \\ \mu_{x}=E[A y+\mu+e]=\mu \\ \Sigma_{x}=E\left[(A y+e)(A y+e)^{T}\right]=A \Sigma_{y} A^{T}+\sigma^{2} I \end{gathered}

EM algorithm

\begin{gathered} q(y)=G(y \mid 0, I) \\ q(x \mid y)=G\left(x \mid A y+\mu, \Sigma_{e}\right) \end{gathered}

  • E-step: first compute the posterior

\begin{aligned} q(y \mid x) & =q(y \mid x, \theta)=\frac{q(y)\, q(x \mid y)}{q(x \mid \theta)} \\ & =\frac{1}{\sqrt{(2 \pi)^{m+n}\left|\Sigma_{e}\right|}\, q(x \mid \theta)} \exp \left(-\frac{1}{2}\left(y^{T} y+(x-A y-\mu)^{T} \Sigma_{e}^{-1}(x-A y-\mu)\right)\right) \\ & \propto \exp \left(-\frac{1}{2}\left(y^{T}\left(A^{T} \Sigma_{e}^{-1} A+I\right) y-2 y^{T} A^{T} \Sigma_{e}^{-1}(x-\mu)\right)\right) \\ & =\exp \left(-\frac{1}{2}\left(y-\mu_{y \mid x}\right)^{T} \Sigma_{y \mid x}^{-1}\left(y-\mu_{y \mid x}\right)\right) \times \text{const} \\ \Sigma_{y \mid x}^{-1} & =A^{T} \Sigma_{e}^{-1} A+I \\ \Sigma_{y \mid x}^{-1} \mu_{y \mid x} & =A^{T} \Sigma_{e}^{-1}(x-\mu) \\ q(x \mid \theta) & =G\left(x \mid \mu, A A^{T}+\Sigma_{e}\right) \\ \Sigma_{x y} & =E\left[\left(x-\mu_{x}\right)\left(y-\mu_{y}\right)^{T}\right]=A\, E\left[y y^{T}\right]=A, \quad \Sigma_{y x}=A^{T} \end{aligned}

  • M-step: maximize Q(q_old(y|x), θ) and update A, Σ_e

yy 积分只用考虑含 yy 的项

\begin{aligned} Q\left(q^{\text{old}}(y \mid x), \theta\right) & =\int q^{\text{old}}(y \mid x) \ln \left[G(y \mid 0, I)\, G\left(x \mid A y+\mu, \Sigma_{e}\right)\right] d y \\ & =\int q^{\text{old}}(y \mid x)\left(\ln \frac{1}{\sqrt{(2 \pi)^{m+n}\left|\Sigma_{e}\right|}}-\frac{1}{2}\left(y^{T} y+(x-A y-\mu)^{T} \Sigma_{e}^{-1}(x-A y-\mu)\right)\right) d y \\ E_{y \mid x}\left[y^{T} A^{T} \Sigma_{e}^{-1} A y\right] & =\operatorname{tr}\left(E\left[y y^{T}\right] A^{T} \Sigma_{e}^{-1} A\right) \\ E_{y \mid x}\left[y y^{T}\right] & =\Sigma_{y \mid x}+\mu_{y \mid x} \mu_{y \mid x}^{T}, \quad \text{since } E_{y \mid x}\left[\left(y-\mu_{y \mid x}\right)\left(y-\mu_{y \mid x}\right)^{T}\right]=\Sigma_{y \mid x} \end{aligned}

AA 求偏导

\begin{aligned} & d\left(\operatorname{tr}\left[y^{T} A^{T} \Sigma_{e}^{-1} A y-2(x-\mu)^{T} \Sigma_{e}^{-1} A y\right]\right) \\ & =\operatorname{tr}\left(y^{T}\, d A^{T}\, \Sigma_{e}^{-1} A y+y^{T} A^{T} \Sigma_{e}^{-1}\, d A\, y-2(x-\mu)^{T} \Sigma_{e}^{-1}\, d A\, y\right) \\ & =\operatorname{tr}\left(2 y^{T} A^{T} \Sigma_{e}^{-1}\, d A\, y-2(x-\mu)^{T} \Sigma_{e}^{-1}\, d A\, y\right) \\ & =\operatorname{tr}\left(2 E\left[y y^{T}\right] A^{T} \Sigma_{e}^{-1}\, d A-2 E[y](x-\mu)^{T} \Sigma_{e}^{-1}\, d A\right) \end{aligned}

The resulting gradient is

2\left(\Sigma_{e}^{-1}\right)^{T} A\, E\left[y y^{T}\right]^{T}-2\left(\Sigma_{e}^{-1}\right)^{T}(x-\mu)\, E[y]^{T}

Maximum likelihood FA implements PCA

Maximize the likelihood:

\max _{\theta} \log \prod_{t=1}^{N} p\left(x_{t} \mid \theta\right)

Independent Component Analysis (ICA)

E\left[s_{i}^{2}\right]=1

x=A s=\sum_{i=1}^{N} a_{i} s_{i}=\sum_{i=1}^{N}\left(a_{i} \lambda_{i}^{-1}\right)\left(\lambda_{i} s_{i}\right)

Only by moving toward non-Gaussianity can the components become independent.

y=w^{T} x=w^{T} A s=\left(w^{T} A\right) s=z^{T} s

The Gaussian distribution is the "most random" one: it has the maximum entropy (for a given variance).

PCA vs ICA

Both find linear projections so that the projected signals are independent across directions; PCA uses second-order statistics, ICA uses higher-order statistics. PCA requires orthogonal directions, ICA does not.

A GMM can approximate any non-Gaussian distribution.

Least Squares Estimation

\begin{gathered} Y=X \beta+\varepsilon \\ \min _{\beta} \varepsilon^{T} \varepsilon=(Y-X \beta)^{T}(Y-X \beta)=\beta^{T} X^{T} X \beta-2 Y^{T} X \beta+\ldots \end{gathered}

β\beta 求偏导

\beta=\left(X^{T} X\right)^{-1} X^{T} Y
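A quick check of the normal equations on synthetic data (data and names are mine; solving the linear system is preferred to explicitly inverting XᵀX):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 3))]   # design matrix with an intercept column
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
Y = X @ beta_true + 0.1 * rng.normal(size=100)

# Closed-form least squares: beta = (X^T X)^{-1} X^T Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)                                      # close to beta_true
```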

The difference from PCA: here the error is measured only along the y axis, whereas the PCA reconstruction error involves every dimension.

Logistic Regression

Maps the linear regression output into the interval [0, 1].

\begin{gathered} p(x)=p(Y=1 \mid x)=\frac{\exp \left(\beta_{0}+\beta_{1} x\right)}{1+\exp \left(\beta_{0}+\beta_{1} x\right)} \\ \log \frac{p(x)}{1-p(x)}=\beta_{0}+\beta_{1} x \end{gathered}

If β_1 = 0, Y is independent of X; if β_1 > 0, the probability of Y = 1 increases with X.

p(x)=p(Y=k \mid x)=\frac{\exp \left(\beta_{0 k}+\beta_{1 k} x_{1}+\ldots+\beta_{p k} x_{p}\right)}{1+\sum_{i=1}^{n} \exp \left(\beta_{0 i}+\beta_{1 i} x_{1}+\ldots+\beta_{p i} x_{p}\right)}
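A minimal sketch of fitting the binary case by gradient ascent on the log-likelihood (learning rate, iteration count, and data are my own choices):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, lr=0.1, n_iters=2000):
    """Binary logistic regression by gradient ascent on the log-likelihood."""
    X1 = np.c_[np.ones(len(X)), X]            # prepend a column of ones for beta_0
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X1 @ beta)                # p(Y = 1 | x)
        beta += lr * X1.T @ (y - p) / len(X)  # gradient of the average log-likelihood
    return beta

# Example: y depends on the first feature only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (rng.uniform(size=500) < sigmoid(2.0 * X[:, 0])).astype(float)
print(fit_logistic(X, y))                     # roughly [0, 2, 0]
```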

Regression Regularization

L1 norm and L2 norm penalties, and combinations in between the two (elastic net).

J(\theta)=\frac{1}{2 m}\left[\sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}+\lambda \sum_{j=1}^{n} \theta_{j}^{2}\right]

Drives the θ_j toward zero, so that some features x_j have no effect.

If λ is zero (no regularization), the model can overfit; if λ is too large, it underfits.

M-P neuron model

y=f\left(\sum_{i=1}^{n} w_{i} x_{i}-\theta\right)

The output is 1 if the argument is greater than zero, and 0 otherwise.

From discrete to continuous: use the sigmoid function.

Equivalent to a linear classifier, but a single neuron cannot implement XOR.

Perceptron learning

\begin{gathered} w_{i}+\Delta w_{i} \rightarrow w_{i} \\ \Delta w_{i}=\eta(y-\hat{y}) x_{i} \end{gathered}
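A tiny sketch of one update of this rule for a threshold unit with 0/1 targets (names are mine):

```python
import numpy as np

def perceptron_step(w, x, y, eta=0.1):
    """One perceptron update: w_i <- w_i + eta * (y - y_hat) * x_i."""
    y_hat = 1.0 if w @ x > 0 else 0.0   # threshold unit output
    return w + eta * (y - y_hat) * x
```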

  • Regression problem: output directly (linear output)

  • Multiple binary classifications: sigmoid outputs

  • Multi-class: softmax output

Any continuous function can be approximated by a two-layer network (with one hidden layer), given enough hidden units.

With a linear activation function, the network can only realize linear functions in the end.

The benefit of depth: no theoretical proof, but it helps extract more complex features.

Error propagation algorithm

First, forward propagation:

a_{k}=\sum_{j} w_{k j} z_{j}=\sum_{j} w_{k j} h\left(a_{j}\right)

Compute the output error δ_k = y_k − t_k.

Backward propagation:

\delta_{j}=h^{\prime}\left(a_{j}\right) \sum_{k} w_{k j} \delta_{k}

Compute the gradients ∂E_n/∂w_{ji} = δ_j z_i.
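A compact sketch of the three steps above for a one-hidden-layer network with sigmoid hidden units, linear outputs, and squared error (shapes and names are my own assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W1, W2, eta=0.1):
    """Forward pass, output delta, backpropagated delta, then one gradient step."""
    # Forward: a_j = W1 x, z_j = h(a_j), y = W2 z (linear output units).
    a1 = W1 @ x
    z = sigmoid(a1)
    y = W2 @ z
    # Output error delta_k = y_k - t_k.
    delta2 = y - t
    # Backpropagate: delta_j = h'(a_j) * sum_k w_kj delta_k.
    delta1 = z * (1 - z) * (W2.T @ delta2)
    # Gradients dE/dw_ji = delta_j * z_i; gradient-descent update.
    W2 -= eta * np.outer(delta2, z)
    W1 -= eta * np.outer(delta1, x)
    return W1, W2
```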

Add a regularization term: Ẽ(w) = E(w) + ½ wᵀw.

Feature Detection Theory

Simple features are gradually combined into complex features; deep neural networks built this way are less prone to overfitting.

CNN

Convolution: multiply and accumulate. An n×n kernel slides over the image, computing as it moves, with strides s_1, s_2; padding adds a border of p_1, p_2 around the image.
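The resulting output size per axis follows the standard formula, which can be checked with a short helper (names are mine):

```python
def conv_output_size(n, k, s, p):
    """Output length along one axis for input n, kernel k, stride s, padding p."""
    return (n + 2 * p - k) // s + 1

# A 32x32 image with a 3x3 kernel, stride 1, padding 1 keeps the spatial size.
print(conv_output_size(32, 3, 1, 1))   # 32
```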

An RGB image has three matrices (channels), and each channel gets its own kernel.

Pooling: average and max pooling.

Activation layer: ReLU and LeakyReLU help prevent vanishing gradients.

Loss layer: L1 / L2 loss, cross-entropy.

Regularization: L1, L2, dropout (ignore activations with some probability to prevent overfitting), batch norm. A drawback: CNNs may fail to recognize an image correctly from its overall outline.

ResNet architecture

As depth is stacked up, the gradients used to update parameters can vanish; skip connections add the input back, F(x) + x, so gradients also flow through the identity path.
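A toy sketch of one residual block of this form (a two-layer F with ReLU is my own choice; W2 must map back to the dimension of x):

```python
import numpy as np

def residual_block(x, W1, W2):
    """Skip connection: the block outputs F(x) + x."""
    h = np.maximum(0.0, W1 @ x)   # F's hidden layer: ReLU(W1 x)
    return W2 @ h + x             # F(x) + x
```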

LSTM: selective memory; it chooses between the current input and the history via gates (history/forget, input, output).

Autoencoder (AE)

encode

F(X) \rightarrow Z

decode

G(Z) \rightarrow X^{\prime}

Can it reconstruct the input?

p(Z) \rightarrow P(X) \rightarrow P(Z)

But p(Z|X) is hard to compute, so we approximate it with Q(Z): Q(Z|X) = N(Z|μ_X, Σ_X).

\int Q(z \mid x) \log \frac{p(x \mid z) p(z)}{p(x)}

X → encoder → μ_x, Σ_x → KL(N(μ_x, Σ_x) ‖ N(0, I)); sample z ~ N(z|μ_x, Σ_x) → decoder → ‖X − f(z)‖². Training the encoder is difficult because z is sampled, so the AE is modified (reparameterization):

\text{sample } \varepsilon \rightarrow z=\mu_{x}+\sqrt{\Sigma_{x}} \times \varepsilon, \quad \varepsilon \sim N(\varepsilon \mid 0, I)
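A small sketch of this reparameterization with a diagonal Σ_x, which moves the sampling outside the path that needs gradients (names are mine):

```python
import numpy as np

def reparameterize(mu_x, sigma_x_diag, rng):
    """Sample z = mu_x + sqrt(Sigma_x) * eps with eps ~ N(0, I) (diagonal covariance)."""
    eps = rng.normal(size=mu_x.shape)           # sampling is independent of the encoder outputs
    return mu_x + np.sqrt(sigma_x_diag) * eps   # deterministic in mu_x, Sigma_x, so gradients flow
```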

At test time, use:

\text{sample } z \sim N(z \mid 0, I) \rightarrow \text{decoder}

Conditional VAE

Y + \text{sample } z \sim N(z \mid 0, I) \rightarrow \text{decoder}

Generative Adversarial Networks (GAN)

z \rightarrow \text{Generator} \rightarrow G(z), \quad x \rightarrow \text{Discriminator} \rightarrow D(G(z)), D(x)
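The standard minimax objective coupling the two networks (the textbook formulation; the notes only show the data flow) would be:

\min_{G} \max_{D}\; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)]+\mathbb{E}_{z \sim p(z)}[\log (1-D(G(z)))]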

VAE vs GAN

VAE: easy to find z, interpretable P(X), but tends to generate blurry images.

GAN: hard to find z, P(X) is hard to interpret, but generates sharp images.

Diffusion model: the model structure is flexible and the computation is simple.

At each of the T steps, Gaussian noise is added to the sample.

x_{t}=\sqrt{1-\beta_{t}}\, x_{t-1}+\sqrt{\beta_{t}}\, \varepsilon_{t}, \quad \varepsilon_{t} \sim N(0, I)
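A one-line sketch of this forward step, assuming the standard DDPM noising process (names are mine):

```python
import numpy as np

def diffusion_forward_step(x_prev, beta_t, rng):
    """One forward diffusion step: x_t = sqrt(1 - beta_t) x_{t-1} + sqrt(beta_t) * eps."""
    eps = rng.normal(size=x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * eps
```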

GNN: graph neural networks; each node's features are computed by aggregating those of its neighboring nodes.

SVM

The margin between the two hyperplanes wᵀx + b = 1 and wᵀx + b = −1 is m = 2/‖w‖; the smaller ‖w‖, the larger m, and a larger margin is better.

  • Hard margin:

\begin{gathered} \min _{w, b} \frac{1}{2}\|w\|^{2} \\ \text{s.t. } y_{i}\left(w^{T} x_{i}+b\right) \geq 1 \end{gathered}

  • Soft margin:

\begin{gathered} \min _{w, b} \frac{1}{2}\|w\|^{2}+C \sum_{i} \xi_{i} \\ \text{s.t. } y_{i}\left(w^{T} x_{i}+b\right) \geq 1-\xi_{i}, \quad \xi_{i} \geq 0 \end{gathered}

The soft margin introduces slack variables ξ to relax the constraints.

Lagrangian form: introduce multipliers α and β and multiply them with the constraint terms.

Dual form: take derivatives with respect to w, b, ξ, which gives w = Σ_i α_i y_i φ(x_i); solve with a QP solver. The kernel trick can only be applied in the dual form.

In practice, b can be absorbed into w as an extra dimension, wᵀφ(x) = w̄ᵀφ(x̄) + b, which drops one constraint term and simplifies the problem.
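A quick numerical check of w = Σ_i α_i y_i x_i for a linear kernel, assuming scikit-learn is available (its SVC solves the soft-margin dual internally; the data are synthetic and my own):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.r_[rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))]
y = np.r_[-np.ones(50), np.ones(50)]

clf = SVC(kernel="linear", C=1.0).fit(X, y)       # soft-margin SVM, solved in the dual
# dual_coef_ holds alpha_i * y_i for the support vectors, so this recovers w.
w = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w, clf.coef_))                  # True
```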

Association vs. Causation

Propensity score: score each sample based on its characteristics, then select (match) samples across groups according to these scores.

Inverse probability weighting: reweight samples to even out differences in the distributions of their characteristics.