A question and a thought about the Kaiming initialization paper

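I've recently been reading two initialization papers closely, Xavier and Kaiming. While reading the Kaiming paper I ran into something that puzzled me, but two days after the question came up, it suddenly clicked one evening while I was cooking. I'm writing this post to record the question and my explanation. Anyone interested is welcome to read along, and if I've got something wrong, please point it out.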

Papers:

Xavier initialization paper: Understanding the difficulty of training deep feedforward neural networks
Kaiming initialization paper: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

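Broadly speaking, Kaiming and Xavier initialization follow the same idea: keep the signal stable in both forward and backward propagation, so that it is neither amplified nor shrunk from layer to layer. Both do this by keeping the variance consistent across layers in the forward and backward passes. They differ only in the activation function they assume.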
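To make the "stable variance" idea concrete, here is a minimal NumPy sketch, not taken from either paper. It pushes random inputs through a stack of fully connected ReLU layers with zero bias and records the variance of each layer's pre-activations under the two scaling rules. The function names, depth, width, and batch size are my own illustrative choices.

```python
import numpy as np

def forward_variances(std_fn, depth=20, width=512, batch=256, seed=0):
    """Variance of the pre-activations at each layer of a ReLU MLP (zero bias)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, width))
    variances = []
    for _ in range(depth):
        # Zero-mean Gaussian weights with the standard deviation under test.
        W = rng.standard_normal((width, width)) * std_fn(width, width)
        y = x @ W                      # pre-activation
        variances.append(float(y.var()))
        x = np.maximum(y, 0.0)         # ReLU
    return variances

def xavier_std(fan_in, fan_out):
    # Xavier/Glorot: Var[w] = 2 / (fan_in + fan_out)
    return np.sqrt(2.0 / (fan_in + fan_out))

def kaiming_std(fan_in, fan_out):
    # Kaiming/He (forward case): Var[w] = 2 / fan_in
    return np.sqrt(2.0 / fan_in)

xavier_vars = forward_variances(xavier_std)
kaiming_vars = forward_variances(kaiming_std)
print("Xavier  first/last layer variance:", xavier_vars[0], xavier_vars[-1])
print("Kaiming first/last layer variance:", kaiming_vars[0], kaiming_vars[-1])
```

Under these assumptions the Kaiming variances should stay at roughly the same scale across layers, while the Xavier variances should shrink by about half per layer, because ReLU zeroes out about half of the signal and the Xavier scaling does not compensate for that.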

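Xavier initialization was proposed in 2010. Its derivation assumes a symmetric activation, and the experiments in the paper include the hyperbolic tangent (tanh). Kaiming initialization was proposed in 2015 and targets the rectifier family (Rectified Linear Unit, ReLU). On top of that, the same paper introduces PReLU, whose negative-side slope is learned freely by the network itself.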
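For reference, PReLU can be sketched in a few lines of NumPy. This is only an illustration of the activation itself, f(y) = max(0, y) + a * min(0, y); in a real network the coefficient `a` is a learned parameter, and the value 0.25 below is just an example I picked.

```python
import numpy as np

def prelu(y, a):
    """PReLU: identity for y > 0, slope `a` for y <= 0."""
    return np.maximum(y, 0.0) + a * np.minimum(y, 0.0)

y = np.linspace(-2.0, 2.0, 9)
relu_like = prelu(y, 0.0)      # a = 0 recovers plain ReLU
linear_like = prelu(y, 1.0)    # a = 1 makes the unit purely linear
in_between = prelu(y, 0.25)    # illustrative coefficient between the two
print(in_between)
```

The closer `a` is to 1, the more linear the unit is; the closer it is to 0, the more it behaves like ReLU.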

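That is the basic background on the two papers. My question is about the "Backward Propagation Case" in the right-hand column of page 4 of the Kaiming paper, that is, the derivation of Kaiming initialization for the backward pass. It contains this passage: "In back-propagation we also have Δy_l = f'(y_l) Δx_{l+1} where f' is the derivative of f. For the ReLU case, f'(y_l) is zero or one, and their probabilities are equal." My question is about the last sentence: why are their probabilities equal?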
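For context, this is where the "equal probabilities" claim gets used, written out as I recall it from the paper and in its notation. Here \hat{n}_l is the backward fan-in, and the paper additionally assumes that f'(y_l) and Δx_{l+1} are independent. Please treat it as a sketch of the argument and check it against the paper itself.

```latex
\begin{align}
\Delta y_l &= f'(y_l)\,\Delta x_{l+1},
  \qquad P\big(f'(y_l)=0\big) = P\big(f'(y_l)=1\big) = \tfrac{1}{2} \\
\mathbb{E}\big[(\Delta y_l)^2\big] &= \operatorname{Var}[\Delta y_l]
  = \tfrac{1}{2}\operatorname{Var}[\Delta x_{l+1}] \\
\operatorname{Var}[\Delta x_l] &= \hat{n}_l \operatorname{Var}[w_l]\operatorname{Var}[\Delta y_l]
  = \tfrac{1}{2}\,\hat{n}_l \operatorname{Var}[w_l]\operatorname{Var}[\Delta x_{l+1}] \\
\tfrac{1}{2}\,\hat{n}_l \operatorname{Var}[w_l] &= 1 \quad \text{for every layer } l
\end{align}
```

So the factor 1/2 in the backward condition comes directly from the 50/50 split, which is why the claim matters for the final initialization formula.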

This is based on the two interesting phenomena described in the last paragraph of Section 2.1 of the paper:

Table 1 also shows the learned coefficients of PReLUs for each layer. There are two interesting phenomena in Table 1. First, the first conv layer (conv1) has coefficients (0.681 and 0.596) significantly greater than 0. As the filters of conv1 are mostly Gabor-like filters such as edge or texture detectors, the learned results show that both positive and negative responses of the filters are respected. We believe that this is a more economical way of exploiting low-level information, given the limited number of filters (e.g., 64). Second, for the channel-wise version, the deeper conv layers in general have smaller coefficients. This implies that the activations gradually become “more nonlinear” at increasing depths. In other words, the learned model tends to keep more information in earlier stages and becomes more discriminative in deeper stages.

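Here is how I personally read this passage. When a deep convolutional network uses PReLU, the authors find that the negative slope is larger in the early layers and becomes smaller as depth increases, so the deeper the layer, the more nonlinear it is. The early layers keep more information, with features still mixed together; as depth increases the network becomes more discriminative and the features are gradually purified. Could this be taken to mean that the features extracted in deeper layers become sparse? If so, the data entering the rectifier would not be split 50/50 around 0, which brings me back to my question above: their probabilities should not be equal.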
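One way to probe this numerically is to measure, layer by layer, what fraction of pre-activations is positive, since that is exactly the fraction of units with f'(y) = 1. The sketch below is my own toy setup, not an experiment from the paper: a fully connected ReLU network with zero-mean Gaussian weights, compared in two states. The first has zero bias. The second has a constant negative bias, which is only a hypothetical stand-in for a network whose pre-activations are no longer centred on 0.

```python
import numpy as np

def active_fraction(bias=0.0, depth=10, width=512, batch=256, seed=0):
    """Fraction of units with y > 0 (i.e. f'(y) = 1) at each layer."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, width))
    fractions = []
    for _ in range(depth):
        # Weights symmetric around zero, scaled as in the Kaiming forward case.
        W = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
        y = x @ W + bias               # pre-activation
        fractions.append(float((y > 0).mean()))
        x = np.maximum(y, 0.0)         # ReLU
    return fractions

at_init = active_fraction(bias=0.0)    # symmetric weights, zero bias
shifted = active_fraction(bias=-1.0)   # hypothetical shifted state
print("zero bias    :", [round(f, 3) for f in at_init])
print("negative bias:", [round(f, 3) for f in shifted])
```

With symmetric weights and zero bias the fraction should sit close to one half at every layer, whereas the shifted version should be clearly below one half. So whether the split is 50/50 depends on which state of the network is being described.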

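At the time I had nobody to discuss these messy thoughts with, so I could only set them aside. Two evenings after the question came up, though, while I was stir-frying diced bamboo shoots with pork, it suddenly clicked. My thinking is this: the two interesting phenomena in the Kaiming paper are conclusions drawn from Table 1, and Table 1 shows the final state after training. Initialization only needs to consider the initial state. At that point the parameters are usually drawn from a distribution that is symmetric around zero, so the split is naturally 50/50. As for what the split becomes later, that is left for the network itself to adjust the bia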