What is the intuitive meaning of Fisher information?

It comes up in MLAPP's discussion of Jeffreys priors, but I don't understand the intuitive meaning of Fisher information: what does its mathematical definition tell us, and why is it useful?

9 Answers

First, let's look at the definition of Fisher information:
Suppose you observe i.i.d. data X_1, X_2, \ldots, X_n from a probability distribution f(X;\theta), where \theta is the parameter of interest (for simplicity, \theta is a scalar here and we ignore nuisance parameters). The likelihood function is then:
L(\bold{X};\theta) = \prod_{i=1}^n f(X_i;\theta)
To obtain the maximum likelihood estimate (MLE), we set the first derivative of the log likelihood to zero and solve the resulting equation for \hat{\theta}_{MLE}.
This first derivative of the log likelihood is also called the score function:
S(\bold{X};\theta) = \sum_{i=1}^n \frac{\partial \log f(X_i;\theta)}{\partial \theta}
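
For concreteness, here is a minimal Python sketch of the score function, assuming Bernoulli(\theta) data; the distribution, seed, and names are chosen purely for illustration:

```python
import numpy as np

# Minimal sketch: score function for i.i.d. Bernoulli(theta) data.
# Per-observation log density: x*log(theta) + (1 - x)*log(1 - theta),
# so the score is the sum of per-observation derivatives.
def score(x, theta):
    x = np.asarray(x, dtype=float)
    return np.sum(x / theta - (1 - x) / (1 - theta))

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=100)

# The MLE solves score = 0; for Bernoulli this is just the sample mean.
print(score(x, 0.3))        # near zero at the true value (up to sampling noise)
print(score(x, x.mean()))   # essentially zero at the MLE
```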

Fisher information, denoted I(\theta), is then defined as the second moment of the score function: I(\theta) = E[S(\bold{X};\theta)^2].
Under standard regularity conditions, it is easy to show that E[S(\bold{X};\theta)] = 0, so that:
I(\theta) = E[S(\bold{X};\theta)^2] - E[S(\bold{X};\theta)]^2 = Var[S(\bold{X};\theta)]
This gives the first mathematical meaning of Fisher information: it is the variance of the score function, the very equation we solve to find the MLE. The intuition is that, because the score is a sum of independent terms, this variance grows as more data are collected, which reflects the fact that we are gaining more and more information.
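
A quick Monte Carlo check of this identity, continuing the Bernoulli sketch above; for Bernoulli(\theta) the per-observation Fisher information is 1/(\theta(1-\theta)), so with n observations I(\theta) = n/(\theta(1-\theta)). The numbers and seed are arbitrary:

```python
import numpy as np

# Sketch: check E[S] = 0 and Var[S] = I(theta) by simulation for Bernoulli data.
# Per-observation Fisher information is 1/(theta*(1-theta)),
# so with n i.i.d. observations I(theta) = n/(theta*(1-theta)).

rng = np.random.default_rng(1)
theta, n, reps = 0.3, 100, 20_000

samples = rng.binomial(1, theta, size=(reps, n))
successes = samples.sum(axis=1)
scores = successes / theta - (n - successes) / (1 - theta)   # score at the true theta

print(scores.mean())               # ~ 0
print(scores.var())                # ~ n/(theta*(1-theta))
print(n / (theta * (1 - theta)))   # = 476.19...
```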

Moreover, if the log likelihood is twice differentiable, then under the same regularity conditions it is easy to show that:
E[S(\bold{X};\theta)^2] = -E\left[\frac{\partial^2}{\partial \theta^2}\log L(\bold{X};\theta)\right]
This gives the second mathematical meaning of Fisher information: the expected negative second derivative of the log likelihood at the true parameter value. This may sound abstract, but it is actually very easy to grasp.
First, consider the shape of a normalized Bernoulli log likelihood.
For such a log likelihood function, the flatter and wider it is, the worse our ability to estimate the parameter; the taller and narrower it is, the better our ability to estimate the parameter, i.e. the more information we have. The negative second derivative of the log likelihood at the true parameter value measures how sharply the curve bends at its peak: the greater this curvature, the taller and narrower the log likelihood, and hence the more information we possess.
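
To make the curvature picture concrete, here is a small sketch, again using the Bernoulli example purely as an assumed illustration: the negative second derivative of the log likelihood at the true \theta grows roughly linearly in n, matching n/(\theta(1-\theta)) on average, which is exactly the "more data, more sharply peaked" intuition.

```python
import numpy as np

# Sketch: -l''(theta) for the Bernoulli log likelihood at the true theta grows
# with n, matching I(theta) = n/(theta*(1-theta)) on average, i.e. the
# log likelihood becomes more sharply curved (more peaked) as data accumulate.

def neg_second_derivative(x, theta):
    # l''(theta) = -sum_i [x_i/theta^2 + (1 - x_i)/(1 - theta)^2]
    x = np.asarray(x, dtype=float)
    return np.sum(x / theta**2 + (1 - x) / (1 - theta)**2)

rng = np.random.default_rng(2)
theta = 0.3
for n in (10, 100, 1000):
    x = rng.binomial(1, theta, size=n)
    print(n, neg_second_derivative(x, theta), n / (theta * (1 - theta)))
```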

Finally, under the same regularity conditions, a Taylor expansion of the score function around the true value, combined with the central limit theorem, the weak law of large numbers, uniform convergence in probability, and Slutsky's theorem, shows that the asymptotic variance of the MLE is I^{-1}(\theta), i.e. Var(\hat{\theta}_{MLE}) = I^{-1}(\theta). This is the third mathematical meaning of Fisher information. Strictly speaking, the precise statement is \sqrt{n}(\hat{\theta}_{MLE}-\theta) \xrightarrow{D} N(0, I^*(\theta)^{-1}), where I^*(\theta) is the Fisher information from a single observation X; with n i.i.d. observations, I^*(\theta) = I(\theta)/n. The intuition here is that Fisher information reflects how precisely we can estimate the parameter: the larger it is, the more accurate the estimate, i.e. the more information we have.
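
As a rough sanity check of this third meaning, one can simulate the sampling distribution of the Bernoulli MLE (the sample mean) and compare its variance to I^{-1}(\theta) = \theta(1-\theta)/n; this is only a sketch with arbitrary numbers, not part of the derivation above.

```python
import numpy as np

# Sketch: for Bernoulli(theta) the MLE is the sample mean, and its variance
# across repeated samples should be close to I(theta)^{-1} = theta*(1-theta)/n.

rng = np.random.default_rng(3)
theta, n, reps = 0.3, 200, 50_000

mles = rng.binomial(1, theta, size=(reps, n)).mean(axis=1)
print(mles.var())                # empirical Var(theta_hat_MLE)
print(theta * (1 - theta) / n)   # I(theta)^{-1} = 0.00105
```
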
What is an intuitive explanation of Fisher information?
Let's consider the one-dimensional case with a log-likelihood function l(\theta), where \theta is the parameter of interest. The observed Fisher information is the curvature at the peak of this function, that is, -l''(\theta_{MLE}), which intuitively tells us how peaked the likelihood function is, or how "well" we know the parameter after the data have been collected. A log-likelihood that is not terribly peaked is somewhat spread out, so we don't have much confidence in what \theta is after having collected the data; conversely, a very peaked likelihood implies we have a great deal of confidence in the precise value of \theta.

The expected Fisher information applies the same concept, except that we average over the data and treat \theta as a constant: it's -E[l''(\theta)]. So it tells us, for a prescribed value of \theta, how curved or peaked the likelihood function will be on average once the data have been collected.
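
A small sketch contrasting the two, again for Bernoulli data (an assumed example): observed information is -l''(\theta_{MLE}), a function of the particular data set, while expected information is -E[l''(\theta)] evaluated at a fixed \theta.

```python
import numpy as np

# Sketch: observed vs. expected Fisher information for Bernoulli(theta).
# For Bernoulli, -l''(t) = sum(x)/t^2 + (n - sum(x))/(1 - t)^2, which at the
# MLE t = mean(x) simplifies to n/(mean(x)*(1 - mean(x))).

rng = np.random.default_rng(4)
theta, n = 0.3, 500
x = rng.binomial(1, theta, size=n)
theta_hat = x.mean()

observed = n / (theta_hat * (1 - theta_hat))   # -l''(theta_MLE): depends on the data
expected = n / (theta * (1 - theta))           # -E[l''(theta)]: fixed, given theta
print(observed, expected)
```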

In the multi-dimensional setting, we simply take the Hessian as opposed to the second derivative to measure curvature.
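
For instance, here is a sketch for a two-parameter case, Normal(\mu, \sigma^2) with both parameters unknown (chosen here only as an example): the negative expected Hessian per observation is diag(1/\sigma^2, 1/(2\sigma^4)), which we can approximate by averaging analytic second derivatives over simulated data.

```python
import numpy as np

# Sketch for Normal(mu, sigma2) with both parameters unknown.
# Second derivatives of log f(x; mu, sigma2):
#   d2/dmu2         = -1/sigma2                      (constant in x)
#   d2/dmu dsigma2  = -(x - mu)/sigma2^2
#   d2/d(sigma2)^2  =  1/(2*sigma2^2) - (x - mu)^2/sigma2^3
# Averaging their negatives over data approximates the per-observation
# Fisher information matrix, which is diag(1/sigma2, 1/(2*sigma2^2)).

rng = np.random.default_rng(5)
mu, sigma2, n = 1.0, 2.0, 200_000
x = rng.normal(mu, np.sqrt(sigma2), size=n)

d2_mu_mu = -1.0 / sigma2
d2_mu_s2 = -np.mean((x - mu) / sigma2**2)
d2_s2_s2 = np.mean(0.5 / sigma2**2 - (x - mu)**2 / sigma2**3)

info = -np.array([[d2_mu_mu, d2_mu_s2],
                  [d2_mu_s2, d2_s2_s2]])
print(info)                                        # ~ [[0.5, 0], [0, 0.125]]
print(np.diag([1 / sigma2, 1 / (2 * sigma2**2)]))  # exact values
```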

Conceptually, I find the idea of treating functionals of the likelihood as statistics in their own right quite fun to wrap my head around: instead of a single number, we have an entire (random) data-dependent function that encapsulates something about the parameter of interest.