#Forward propagation means computing and storing the intermediate results of each layer, in order from the input layer to the output layer.

#Backward propagation computes the derivatives of the loss with respect to each weight matrix (W1, W2). Forward propagation is just the
#forward calculation, with no derivatives involved: compute XW, then ReLU(XW), then pass the result to the next layer to get the output.
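
#A minimal sketch of the forward pass described above, assuming an MLP with one hidden layer and no biases. The tensor sizes and the 0.1 weight scale are made up for illustration.

```python
import torch

torch.manual_seed(0)  # fixed seed so the run is repeatable

X = torch.randn(5, 4)          # 5 examples, 4 input features (hypothetical sizes)
W1 = torch.randn(4, 3) * 0.1   # input -> hidden weights
W2 = torch.randn(3, 2) * 0.1   # hidden -> output weights

Z = X @ W1           # linear step of the hidden layer: XW1
H = torch.relu(Z)    # activation: ReLU(XW1)
O = H @ W2           # output layer

# Z and H are the stored intermediate results of the forward pass
print(Z.shape, H.shape, O.shape)
```

#Keeping Z and H around is what lets backward propagation compute the gradients of W1 and W2 without redoing the forward calculation.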

4.8 numerical stability and model initialization


#Previously we initialized parameters such as W and b by sampling from a preset distribution, but the choice of initialization scheme matters a great deal.
#It plays a key role in keeping training numerically stable. A poor choice can cause exploding or vanishing gradients.

4.8.1 vanishing and exploding gradients


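#Only a brief sketch here: when we backpropagate through a multilayer perceptron, the chain rule produces one Jacobian matrix per layer, and the gradient is the product of these matrices.
#These matrices can have a wide range of eigenvalues, so their product may blow up or shrink to almost nothing.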
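
#To make the chain-rule product concrete, here is a small sketch that checks it numerically. The two tanh layers and the 3x3 weight sizes are assumptions chosen for illustration, not part of the notes above.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)

W1 = torch.randn(3, 3)  # hypothetical weights of layer 1
W2 = torch.randn(3, 3)  # hypothetical weights of layer 2

def layer1(h):
    return torch.tanh(W1 @ h)

def layer2(h):
    return torch.tanh(W2 @ h)

x = torch.randn(3)
h1 = layer1(x)

J1 = jacobian(layer1, x)    # d h1 / d x
J2 = jacobian(layer2, h1)   # d h2 / d h1
J_full = jacobian(lambda v: layer2(layer1(v)), x)  # d h2 / d x, computed directly

# chain rule: the Jacobian of the composition is the product of the layer Jacobians
print(torch.allclose(J_full, J2 @ J1, atol=1e-5))
```

#With many layers the same product contains one factor per layer, which is where the growth or decay comes from.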

1. vanishing gradients

%matplotlib inline
import torch
from d2l import torch as d2l

x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
y = torch.sigmoid(x)  # recall that sigmoid(x) = 1 / (1 + exp(-x))
y.backward(torch.ones_like(x))  # y is a vector, so pass ones: this backpropagates sum(y) and leaves dy_i/dx_i in x.grad


d2l.plot(x.detach().numpy(), [y.detach().numpy(), x.grad.numpy()], legend=['sigmoid', 'gradient'], figsize=(4.5, 2.5))
#The plot shows the sigmoid function together with its gradient.

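#In other words, when the input to sigmoid is very large or very small, its gradient vanishes. This is why the more stable ReLU function has become the default choice.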
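
#A quick numerical comparison of the two activations, as a sketch. The four sample points are picked arbitrarily for illustration.

```python
import torch

x = torch.tensor([-8.0, -1.0, 1.0, 8.0], requires_grad=True)

# gradient of sigmoid at each point
torch.sigmoid(x).sum().backward()
sig_grad = x.grad.clone()

# gradient of ReLU at the same points
x.grad.zero_()
torch.relu(x).sum().backward()
relu_grad = x.grad.clone()

print('sigmoid gradient:', sig_grad)
print('ReLU gradient:   ', relu_grad)
```

#The sigmoid gradient is tiny at both ends and never exceeds 0.25, while the ReLU gradient is exactly 1 for every positive input, however large. For negative inputs the ReLU gradient is 0.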


2. exploding gradients

#Generate 100 Gaussian random matrices and multiply them with an initial matrix.
M = torch.normal(0, 1, size=(4, 4))  # the variance (1) is large enough that the product grows as we keep multiplying
print('a single matrix \n', M)
for i in range(100):
    M = torch.mm(M, torch.normal(0, 1, size=(4, 4)))  # multiply by a fresh random matrix, 100 times in total

print('after multiplying 100 matrices: \n', M)

a single matrix 
 tensor([[-0.7713,  1.7917,  0.9610, -0.7195],
        [ 0.7852,  0.4398, -2.8173,  1.7556],
        [ 0.6476, -1.2246,  0.9080,  0.0451],
        [-0.4005, -1.1396,  0.4452,  1.2052]])
after multiplying 100 matrices: 
 tensor([[ 1.7021e+27,  2.2672e+27, -7.8770e+26,  1.1673e+27],
        [-3.1653e+27, -4.2160e+27,  1.4648e+27, -2.1708e+27],
        [-5.8994e+25, -7.8576e+25,  2.7301e+25, -4.0459e+25],
        [-1.4618e+27, -1.9471e+27,  6.7649e+26, -1.0025e+27]])

3. breaking the symmetry


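#Suppose we have an MLP with one hidden layer and two hidden units. If we permute the weights W1 of the first layer and permute the output-layer weights in the same way, we obtain the same function.
#This is just matrix multiplication with rows and columns reordered: as long as each row still meets its matching column, the result is unchanged. So nothing distinguishes the first hidden unit from the second, that is,
#the hidden units have permutation symmetry. If both units also start from identical weights, they receive identical gradients and never become different, so the symmetry has to be broken, for example with random initialization or dropout regularization.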
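
#Both points can be checked in a few lines. This is a sketch under assumed sizes (3 input features, 2 hidden units, 1 output, no biases); the constant 0.5 is an arbitrary stand-in for "identical initial weights".

```python
import torch

torch.manual_seed(0)

X = torch.randn(6, 3)    # 6 examples, 3 features (hypothetical sizes)
W1 = torch.randn(3, 2)   # one hidden layer with two hidden units
W2 = torch.randn(2, 1)   # output layer

def mlp(X, W1, W2):
    return torch.relu(X @ W1) @ W2

# 1) permutation symmetry: swap the two hidden units
perm = [1, 0]
W1_swapped = W1[:, perm]   # reorder the columns of W1 ...
W2_swapped = W2[perm, :]   # ... and the matching rows of W2
same_function = torch.allclose(mlp(X, W1, W2), mlp(X, W1_swapped, W2_swapped))

# 2) identical initial weights give both hidden units the same gradient
W1_const = torch.full((3, 2), 0.5, requires_grad=True)
W2_const = torch.full((2, 1), 0.5, requires_grad=True)
mlp(X, W1_const, W2_const).sum().backward()
same_gradient = torch.allclose(W1_const.grad[:, 0], W1_const.grad[:, 1])

print(same_function, same_gradient)
```

#Since the two columns of the gradient match, a gradient step keeps the two hidden units identical, so the network behaves as if it had a single hidden unit.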

4.8.2 parameter initialization

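#In short: for particular problems we can put constraints on the variance when choosing the initialization, for example by requiring sigma to satisfy some mathematical condition.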
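
#The notes above do not say which condition is meant. One well-known instance is Xavier (Glorot) initialization, which sets sigma = sqrt(2 / (fan_in + fan_out)). The sketch below uses it purely as an assumed example, with made-up layer sizes.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)

fan_in, fan_out = 256, 128          # hypothetical layer sizes
layer = nn.Linear(fan_in, fan_out)

# Xavier normal init: weights ~ N(0, sigma^2) with sigma = sqrt(2 / (fan_in + fan_out))
nn.init.xavier_normal_(layer.weight)
nn.init.zeros_(layer.bias)

target_sigma = math.sqrt(2.0 / (fan_in + fan_out))
empirical_std = layer.weight.std().item()

print('target sigma: ', target_sigma)
print('empirical std:', empirical_std)
```

#The empirical standard deviation of the sampled weights should land close to the target sigma, since the layer has tens of thousands of weights.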