[Study Notes] - Differentiation of Vectors and Matrices

Derivatives involving matrices and vectors fall into five categories:

  • scalar with respect to matrix
  • scalar with respect to vector
  • vector with respect to scalar
  • vector with respect to vector
  • matrix with respect to scalar

Layouts

There are two layout conventions for matrix derivatives:

  • Numerator layout

$$\frac{\partial \mathbf{y}}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x} \\ \frac{\partial y_2}{\partial x} \\ \vdots \\ \frac{\partial y_n}{\partial x} \end{bmatrix}$$

  • Denominator layout

$$\frac{\partial \mathbf{y}}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x} & \frac{\partial y_2}{\partial x} & \cdots & \frac{\partial y_n}{\partial x} \end{bmatrix}$$

The two layouts can be converted into each other at any time: the result in one layout is simply the transpose of the result in the other.

Scalar / Vector

  • $f \in \mathbb{R}$
  • $\mathbf{x} \in \mathbb{R}^n$

$$\frac{\partial f}{\partial \mathbf{x}} = \begin{pmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \cdots & \frac{\partial f}{\partial x_n} \end{pmatrix}^T \in \mathbb{R}^n$$

$$\frac{\partial^2 f}{\partial \mathbf{x}\partial \mathbf{x}^T} = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1 \partial x_1} & \frac{\partial^2 f}{\partial x_1\partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1\partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2\partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_2\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n\partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n\partial x_n} \end{pmatrix} \in \mathbb{R}^{n \times n}$$

The derivative with respect to a vector collects the partial derivatives of the function with respect to each element of the vector, so the result is a vector with the same dimension as $\mathbf{x}$.
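These definitions can be sanity-checked numerically with finite differences. The sketch below is a minimal example assuming NumPy; the test function $f(\mathbf{x}) = e^{\mathbf{a}^T\mathbf{x}}$, the seed, and all variable names are illustrative choices rather than anything from these notes.

```python
import numpy as np

# Hypothetical test function (not from the notes): f(x) = exp(a^T x)
# Closed forms: df/dx = exp(a^T x) * a,  d^2f/(dx dx^T) = exp(a^T x) * a a^T
rng = np.random.default_rng(0)
n = 4
a = rng.standard_normal(n)
f = lambda x: np.exp(a @ x)

x0 = 0.1 * rng.standard_normal(n)
h = 1e-4
E = np.eye(n)

# Gradient by central differences
grad_num = np.array([(f(x0 + h * E[i]) - f(x0 - h * E[i])) / (2 * h) for i in range(n)])

# Hessian by second-order central differences
hess_num = np.array([[(f(x0 + h * E[i] + h * E[j]) - f(x0 + h * E[i] - h * E[j])
                       - f(x0 - h * E[i] + h * E[j]) + f(x0 - h * E[i] - h * E[j])) / (4 * h * h)
                      for j in range(n)] for i in range(n)])

print(np.allclose(grad_num, np.exp(a @ x0) * a, atol=1e-5))               # True
print(np.allclose(hess_num, np.exp(a @ x0) * np.outer(a, a), atol=1e-4))  # True
```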

Scalar / Vector formulas

| Expression | Result |
| --- | --- |
| $\frac{\partial a}{\partial \mathbf{x}}$ | $\mathbf{0}$ |
| $\frac{\partial au(\mathbf{x})}{\partial \mathbf{x}}$ | $a\frac{\partial u(\mathbf{x})}{\partial \mathbf{x}}$ |
| $\frac{\partial(u + v)}{\partial \mathbf{x}}$ | $\frac{\partial u}{\partial \mathbf{x}} + \frac{\partial v}{\partial \mathbf{x}}$ |
| $\frac{\partial uv}{\partial \mathbf{x}}$ | $v\frac{\partial u}{\partial \mathbf{x}} + u\frac{\partial v}{\partial \mathbf{x}}$ |
| $\frac{\partial g(u(\mathbf{x}))}{\partial \mathbf{x}}$ | $\frac{\partial g(u)}{\partial u}\frac{\partial u}{\partial \mathbf{x}}$ |

| Expression | Result |
| --- | --- |
| $\frac{\partial \mathbf{a}^T\mathbf{x}}{\partial \mathbf{x}}$ | $\mathbf{a}$ |
| $\frac{\partial \mathbf{x}^T\mathbf{a}}{\partial \mathbf{x}}$ | $\mathbf{a}$ |
| $\frac{\partial \mathbf{x}^T\mathbf{x}}{\partial \mathbf{x}}$ | $2\mathbf{x}$ |
| $\frac{\partial \mathbf{x}^TA\mathbf{x}}{\partial \mathbf{x}}$ | $(A + A^T)\mathbf{x}$ |
| $\frac{\partial (\mathbf{x} - \mathbf{a})^T(\mathbf{x} - \mathbf{a})}{\partial \mathbf{x}}$ | $2(\mathbf{x} - \mathbf{a})$ |
| $\frac{\partial (A\mathbf{x} - \mathbf{a})^T(A\mathbf{x} - \mathbf{a})}{\partial \mathbf{x}}$ | $2A^T(A\mathbf{x} - \mathbf{a})$ |
| $\frac{\partial (A\mathbf{x} - \mathbf{a})^TC(A\mathbf{x} - \mathbf{a})}{\partial \mathbf{x}}$ | $A^T(C + C^T)(A\mathbf{x} - \mathbf{a})$ |
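The identities above can be verified the same way. A minimal sketch (NumPy assumed, test matrices chosen arbitrarily) for the $\mathbf{x}^TA\mathbf{x}$ row and the last row:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 3
x  = rng.standard_normal(n)
A1 = rng.standard_normal((n, n))          # for x^T A x
A  = rng.standard_normal((m, n))          # for (Ax - a)^T C (Ax - a)
a  = rng.standard_normal(m)
C  = rng.standard_normal((m, m))

def num_grad(f, x, h=1e-6):
    """Central-difference gradient of a scalar function of a vector."""
    return np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(len(x))])

# d(x^T A x)/dx = (A + A^T) x
print(np.allclose(num_grad(lambda v: v @ A1 @ v, x), (A1 + A1.T) @ x, atol=1e-6))   # True

# d((Ax - a)^T C (Ax - a))/dx = A^T (C + C^T)(Ax - a)
f = lambda v: (A @ v - a) @ C @ (A @ v - a)
print(np.allclose(num_grad(f, x), A.T @ (C + C.T) @ (A @ x - a), atol=1e-5))        # True
```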

Scalar / Matrix

  • $f \in \mathbb{R}$
  • $A \in \mathbb{R}^{n \times m}$

$$\frac{\partial f}{\partial A} = \begin{pmatrix} \frac{\partial f}{\partial A_{11}} & \frac{\partial f}{\partial A_{12}} & \cdots & \frac{\partial f}{\partial A_{1m}} \\ \frac{\partial f}{\partial A_{21}} & \frac{\partial f}{\partial A_{22}} & \cdots & \frac{\partial f}{\partial A_{2m}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{n1}} & \frac{\partial f}{\partial A_{n2}} & \cdots & \frac{\partial f}{\partial A_{nm}} \end{pmatrix}$$

Scalar / Matrix formulas

| Expression | Result |
| --- | --- |
| $\frac{\partial tr(A)}{\partial A}$ | $\mathbf{I}$ |
| $\frac{\partial tr(AB)}{\partial A}$ | $B^T$ |
| $\frac{\partial tr(BA)}{\partial A}$ | $B^T$ |
| $\frac{\partial tr(ABA^T)}{\partial A}$ | $A(B+B^T)$ |
| $\frac{\partial tr(f(U))}{\partial A},\ U = g(A)$ | $\frac{\partial tr(f(U))}{\partial U} \frac{\partial g(A)}{\partial A}$ |
| $\frac{\partial f(A)}{\partial A^T}$ | $\left(\frac{\partial f(A)}{\partial A}\right)^T$ |
| $\frac{\partial \vert A\vert}{\partial A}$ | $\vert A\vert(A^{-1})^T$ |
| $\frac{\partial \ln\vert A\vert}{\partial A}$ | $(A^{-1})^T$ |
| $\frac{\partial \parallel AX - B\parallel^2_F}{\partial X}$ | $2A^T(AX - B)$ |
| $\frac{\partial \parallel XA - B\parallel^2_F}{\partial X}$ | $2(XA - B)A^T$ |
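Entry-wise finite differences also work for derivatives with respect to a matrix. The sketch below (NumPy assumed, matrices chosen arbitrarily) checks the determinant row and the first Frobenius-norm row:

```python
import numpy as np

rng = np.random.default_rng(2)

def num_grad_matrix(f, M, h=1e-6):
    """Central-difference derivative of a scalar function w.r.t. every entry of M."""
    G = np.zeros_like(M)
    for i in range(M.shape[0]):
        for j in range(M.shape[1]):
            E = np.zeros_like(M)
            E[i, j] = h
            G[i, j] = (f(M + E) - f(M - E)) / (2 * h)
    return G

# d|A|/dA = |A| (A^{-1})^T  (holds for any invertible A)
A = rng.standard_normal((4, 4))
print(np.allclose(num_grad_matrix(np.linalg.det, A),
                  np.linalg.det(A) * np.linalg.inv(A).T, atol=1e-5))            # True

# d||AX - B||_F^2 / dX = 2 A^T (AX - B)
A2 = rng.standard_normal((3, 2))
X  = rng.standard_normal((2, 4))
B  = rng.standard_normal((3, 4))
f  = lambda M: np.sum((A2 @ M - B) ** 2)
print(np.allclose(num_grad_matrix(f, X), 2 * A2.T @ (A2 @ X - B), atol=1e-5))   # True
```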

Vector / Scalar

  • $\mathbf{y} \in \mathbb{R}^n$
  • $x \in \mathbb{R}$

$$\frac{\partial \mathbf{y}}{\partial x} = \begin{pmatrix} \frac{\partial y_1}{\partial x} & \frac{\partial y_2}{\partial x} & \cdots & \frac{\partial y_n}{\partial x} \end{pmatrix}^T \in \mathbb{R}^n$$

Vector / Scalar formulas

| Expression | Result |
| --- | --- |
| $\frac{\partial \mathbf{a}}{\partial x}$ | $\mathbf{0}$ |
| $\frac{\partial a\mathbf{u}(x)}{\partial x}$ | $a\frac{\partial \mathbf{u}}{\partial x}$ |
| $\frac{\partial A\mathbf{u}}{\partial x}$ | $A\frac{\partial \mathbf{u}}{\partial x}$ |
| $\frac{\partial \mathbf{u}^T}{\partial x}$ | $\left(\frac{\partial \mathbf{u}}{\partial x}\right)^T$ |
| $\frac{\partial (\mathbf{u} + \mathbf{v})}{\partial x}$ | $\frac{\partial \mathbf{u}}{\partial x} + \frac{\partial \mathbf{v}}{\partial x}$ |
| $\frac{\partial \mathbf{g}(\mathbf{u})}{\partial x}$ | $\frac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{u}} \frac{\partial \mathbf{u}}{\partial x}$ |
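A quick numerical check of the $A\mathbf{u}$ rule; the concrete $\mathbf{u}(x) = (\sin x, \cos x, x^2)^T$ is an arbitrary illustrative choice (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((2, 3))
x0, h = 0.7, 1e-6

u  = lambda t: np.array([np.sin(t), np.cos(t), t**2])        # u(x), an arbitrary example
du = lambda t: np.array([np.cos(t), -np.sin(t), 2 * t])      # its derivative du/dx

# d(Au)/dx = A du/dx, checked by a central difference
lhs = (A @ u(x0 + h) - A @ u(x0 - h)) / (2 * h)
print(np.allclose(lhs, A @ du(x0), atol=1e-6))   # True
```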

Vector / Vector

  • $\mathbf{y} \in \mathbb{R}^m$
  • $\mathbf{x} \in \mathbb{R}^n$

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_1} \\ \frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_1}{\partial x_n} & \frac{\partial y_2}{\partial x_n} & \cdots & \frac{\partial y_m}{\partial x_n} \end{pmatrix} \in \mathbb{R}^{n \times m}$$

Vector / Vector formulas

| Expression | Result |
| --- | --- |
| $\frac{\partial \mathbf{a}}{\partial \mathbf{x}}$ | $\mathbf{0}$ |
| $\frac{\partial \mathbf{x}}{\partial \mathbf{x}}$ | $\mathbf{I}$ |
| $\frac{\partial A\mathbf{x}}{\partial \mathbf{x}}$ | $A^T$ |
| $\frac{\partial f(\mathbf{u})}{\partial \mathbf{x}},\ \mathbf{u} = g(\mathbf{x})$ | $\frac{\partial g(\mathbf{x})}{\partial \mathbf{x}} \frac{\partial f(\mathbf{u})}{\partial \mathbf{u}}$ |
| $\frac{\partial (\mathbf{x} \odot \mathbf{y})}{\partial \mathbf{x}}$ (element-wise product) | $diag(y_1, y_2, \cdots, y_n)$ |
| $\frac{\partial (f(x_1), f(x_2), \cdots, f(x_n))}{\partial \mathbf{x}}$ | $diag(f'(x_1), f'(x_2), \cdots, f'(x_n))$ |
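The Jacobian identities can be checked numerically in the denominator layout defined above. In this sketch (NumPy assumed) the helper `num_jacobian` and the choice of $\tanh$ as the element-wise function are illustrative, not part of the notes:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 4, 3
x = rng.standard_normal(n)
A = rng.standard_normal((m, n))

def num_jacobian(f, x, h=1e-6):
    """Denominator-layout Jacobian: entry (i, j) = dy_j / dx_i."""
    rows = [(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(len(x))]
    return np.stack(rows, axis=0)   # shape: len(x) x len(f(x))

# d(Ax)/dx = A^T in the denominator layout
print(np.allclose(num_jacobian(lambda v: A @ v, x), A.T, atol=1e-6))                    # True

# Element-wise function: d(f(x_1), ..., f(x_n))/dx = diag(f'(x_i)), here with f = tanh
print(np.allclose(num_jacobian(np.tanh, x), np.diag(1 - np.tanh(x) ** 2), atol=1e-6))   # True
```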

Matrix / Scalar

  • $A \in \mathbb{R}^{m\times n}$
  • $x \in \mathbb{R}$

$$\frac{\partial A}{\partial x} = \begin{pmatrix} \frac{\partial A_{11}}{\partial x} & \frac{\partial A_{21}}{\partial x} & \cdots & \frac{\partial A_{m1}}{\partial x} \\ \frac{\partial A_{12}}{\partial x} & \frac{\partial A_{22}}{\partial x} & \cdots & \frac{\partial A_{m2}}{\partial x} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial A_{1n}}{\partial x} & \frac{\partial A_{2n}}{\partial x} & \cdots & \frac{\partial A_{mn}}{\partial x} \end{pmatrix} \in \mathbb{R}^{n \times m}$$

Multivariate random variables

The $n$ random variables $X_1, X_2, \cdots, X_n$ form the following $n$-dimensional column vector:

$$\mathbf{X} = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix} = \begin{pmatrix} X_1 & X_2 & \cdots & X_n \end{pmatrix}^T$$

Expectation of a multivariate random variable

$$E[\mathbf{X}] = \mathbf{u} = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix} = \begin{pmatrix} E[X_1] \\ E[X_2] \\ \vdots \\ E[X_n] \end{pmatrix} = \begin{pmatrix} E[X_1] & E[X_2] & \cdots & E[X_n] \end{pmatrix}^T$$

  • $E[a\mathbf{X} + b\mathbf{Y}] = aE[\mathbf{X}] + bE[\mathbf{Y}]$
  • $E[A\mathbf{X} + \mathbf{b}] = AE[\mathbf{X}] + \mathbf{b}$ (checked numerically in the sketch below)
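A small Monte Carlo sanity check of the second identity, as mentioned above (NumPy assumed; the mean vector, $A$, and $\mathbf{b}$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, N = 3, 2, 200_000

# X: n-dimensional random vector with a known mean (arbitrary choice for illustration)
mu = np.array([1.0, -2.0, 0.5])
X = rng.normal(loc=mu, scale=1.0, size=(N, n))     # N samples, one draw per row

A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

lhs = (X @ A.T + b).mean(axis=0)   # sample mean of AX + b
rhs = A @ mu + b                   # A E[X] + b
print(np.max(np.abs(lhs - rhs)))   # small (Monte Carlo error, shrinks like 1/sqrt(N))
```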

Variance of a multivariate random variable

$$Var(\mathbf{a}^T\mathbf{X}) = \mathbf{a}^T Var[\mathbf{X}]\mathbf{a}$$

Covariance matrix

Given an $n$-dimensional random vector $\mathbf{X}$ and an $m$-dimensional random vector $\mathbf{Y}$, their covariance matrix is

$$\begin{aligned} Cov(\mathbf{X}, \mathbf{Y}) = \Sigma &= \begin{bmatrix} Cov(x_1,y_1) & Cov(x_1,y_2) & \cdots & Cov(x_1,y_m) \\ Cov(x_2,y_1) & Cov(x_2,y_2) & \cdots & Cov(x_2,y_m) \\ \vdots & \vdots & \ddots & \vdots \\ Cov(x_n,y_1) & Cov(x_n,y_2) & \cdots & Cov(x_n,y_m) \end{bmatrix} \\ &= E(\mathbf{X}\mathbf{Y}^T) - E(\mathbf{X})E(\mathbf{Y})^T \end{aligned}$$

In particular, $Cov(\mathbf{X}, \mathbf{X})$, the covariance matrix of $\mathbf{X}$ with itself, can be written as

$$Cov(\mathbf{X}, \mathbf{X}) = E((\mathbf{X} - \mathbf{u})(\mathbf{X} - \mathbf{u})^T) = E(\mathbf{X}\mathbf{X}^T) - \mathbf{u}\mathbf{u}^T$$

  • $Cov(A\mathbf{X}, B\mathbf{Y}) = A\ Cov(\mathbf{X}, \mathbf{Y})\ B^T$
  • $Cov(A\mathbf{X}, A\mathbf{X}) = A\ Cov(\mathbf{X}, \mathbf{X})\ A^T$
  • $E(\mathbf{X}^TA\mathbf{X}) = \text{tr}(A\Sigma) + \mathbf{u}^TA\mathbf{u}$, where $\Sigma = Cov(\mathbf{X}, \mathbf{X})$ and $\mathbf{u} = E[\mathbf{X}]$ (see the sketch after this list)
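The same kind of simulation check works for the covariance identities (NumPy assumed; the Gaussian distribution, $\Sigma$, and the matrices below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n, N = 3, 500_000

# X ~ N(u, Sigma) with an arbitrary mean and covariance (illustrative choice)
u = np.array([1.0, 0.0, -1.0])
L = rng.standard_normal((n, n))
Sigma = L @ L.T + np.eye(n)                       # a valid (SPD) covariance matrix
X = rng.multivariate_normal(u, Sigma, size=N)     # N samples, rows are draws

A = rng.standard_normal((2, n))
B = rng.standard_normal((n, n))

# Cov(AX, AX) = A Cov(X, X) A^T
emp = np.cov((X @ A.T).T)
print(np.max(np.abs(emp - A @ Sigma @ A.T)))       # small (Monte Carlo error)

# E[X^T B X] = tr(B Sigma) + u^T B u
emp_quad = np.einsum('ni,ij,nj->n', X, B, X).mean()
print(abs(emp_quad - (np.trace(B @ Sigma) + u @ B @ u)))   # small
```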

The Gauss-Markov Theorem

Consider the following linear regression model:

$$y = X\beta + \varepsilon$$

We assume:

  • $E(\varepsilon\ |\ X) = 0$
    • i.e., the errors have zero conditional mean (exogeneity)
  • $var(\varepsilon\ |\ X) = E(\varepsilon\varepsilon'\ |\ X) = \sigma^2_\varepsilon \mathbf{I}_N$
    • i.e., the errors are homoscedastic and uncorrelated

Then the ordinary least squares (OLS) estimator

$$\hat{\beta} = (X'X)^{-1}X'y$$

is the best (minimum-variance) linear unbiased estimator of $\beta$.
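As a minimal illustration (not part of the original notes), the estimator can be computed directly on synthetic data; `beta_true`, the design matrix, and the noise level are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(6)
N, k = 200, 3

# Synthetic data for illustration: y = X beta + eps with a known beta
beta_true = np.array([2.0, -1.0, 0.5])
X = np.column_stack([np.ones(N), rng.standard_normal((N, k - 1))])  # includes an intercept
eps = rng.normal(scale=0.1, size=N)
y = X @ beta_true + eps

# OLS estimator: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                            # close to beta_true

# Equivalent, numerically preferable form
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))   # True
```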

Proof of the Gauss-Markov Theorem

Within this model, the best estimator is the one with the smallest variance. Since the data we observe is $y$, we restrict attention to estimators that are linear functions of $y$, i.e.

$$\tilde{\beta} = m + My$$

  • $\beta$ and $m$ are $k \times 1$
  • $M$ is $k \times n$
  • $y$ is $n \times 1$

At the same time, since we require an unbiased estimator, we need

$$E(\tilde{\beta}) = \beta$$

β~=m+My\tilde{\beta} = m + My,我们可以得到

$$\begin{aligned} E(\tilde{\beta}) &= E(m + My) \\ &= E(m) + E(My) \\ &= E(m) + ME(y|X) \\ &= m + ME(X\beta + \varepsilon|X) \\ &= m + MX\beta + ME(\varepsilon|X) \\ &= m + MX\beta \end{aligned}$$

Since this must equal $\beta$ for every possible $\beta$, we obtain

$$\begin{cases} m = 0 \\ MX = \mathbf{I}_k \end{cases}$$

Notice that the least-squares estimator $\hat{\beta} = (X'X)^{-1}X'y$ is of exactly this form, with

$$M = (X'X)^{-1}X'$$

$$MX = (X'X)^{-1}X'X = \mathbf{I}_k$$

Thus we need to search among linear unbiased estimators of the form

$$\tilde{\beta} = My, \quad MX = \mathbf{I}_k$$

Without loss of generality, we rewrite $M$ in the following form:

$$\begin{aligned} M &= (X'X)^{-1}X' + C \\ MX &= \mathbf{I}_k \\ &\Rightarrow ((X'X)^{-1}X' + C)X = \mathbf{I}_k \\ &\Rightarrow \mathbf{I}_k + CX = \mathbf{I}_k \\ &\Rightarrow CX = 0 \end{aligned}$$

Then

$$\begin{aligned} \tilde{\beta} &= My \\ &= M(X\beta + \varepsilon) \\ &= \beta + M\varepsilon \\ \tilde{\beta} - \beta &= M\varepsilon \\ E(\tilde{\beta} - \beta|X) &= 0 \end{aligned}$$

The covariance matrix of this unbiased estimator is therefore

$$\begin{aligned} E((\tilde{\beta} - \beta)(\tilde{\beta} - \beta)'|X) &= E(M\varepsilon(M\varepsilon)'|X) \\ &= E(M\varepsilon\varepsilon'M'|X) \\ &= M E(\varepsilon\varepsilon'|X) M' \\ &= M\sigma^2_\varepsilon\mathbf{I}_n M' \\ &= \sigma^2_\varepsilon MM' \end{aligned}$$

where

$$\begin{aligned} MM' &= ((X'X)^{-1}X' + C)((X'X)^{-1}X' + C)' \\ &= (X'X)^{-1}X'X(X'X)^{-1} + (X'X)^{-1}X'C' + CX(X'X)^{-1} + CC' \end{aligned}$$

Because $CX = 0$, and therefore also $X'C' = (CX)' = 0$, the two cross terms vanish, so

$$MM' = (X'X)^{-1} + CC'$$

CC=0CC' = 0时,我们得到最好的估计量,可以表现为

$$var(\mathbf{x}^T\tilde{\beta}) \geq var(\mathbf{x}^T\hat{\beta})$$
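This conclusion can be illustrated by simulation: build an alternative linear unbiased estimator $M = (X'X)^{-1}X' + C$ with $CX = 0$ and compare the sampling variance of $\mathbf{x}^T\tilde{\beta}$ with that of $\mathbf{x}^T\hat{\beta}$. The sketch below (NumPy assumed; the design matrix, $C$, and all constants are arbitrary illustrative choices) follows that recipe:

```python
import numpy as np

rng = np.random.default_rng(7)
N, k, reps = 50, 3, 20_000
sigma = 1.0

# Fixed design matrix and true coefficients (illustrative choices)
X = np.column_stack([np.ones(N), rng.standard_normal((N, k - 1))])
beta = np.array([1.0, 2.0, -0.5])

# OLS "hat" matrix M0 = (X'X)^{-1} X'  and an alternative M = M0 + C with CX = 0
M0 = np.linalg.solve(X.T @ X, X.T)
P  = X @ M0                                   # projection onto the column space of X
D  = 0.05 * rng.standard_normal((k, N))
C  = D @ (np.eye(N) - P)                      # rows orthogonal to the columns of X -> CX = 0
M  = M0 + C
print(np.allclose(M @ X, np.eye(k)))          # True: both estimators satisfy MX = I_k

xq = rng.standard_normal(k)                   # a fixed linear combination x^T beta
ols_vals, alt_vals = [], []
for _ in range(reps):
    y = X @ beta + rng.normal(scale=sigma, size=N)
    ols_vals.append(xq @ (M0 @ y))            # x^T beta_hat
    alt_vals.append(xq @ (M @ y))             # x^T beta_tilde

print(np.var(ols_vals), np.var(alt_vals))     # var(x^T beta_hat) <= var(x^T beta_tilde)
```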