Nonlinearity Generator

This layer can be used as a replacement for, or in combination with, batch normalization.

Forward

  • Let $f(s,b)=\left\{\begin{matrix}
    s & ;s\ge b\\
    b & ;else
    \end{matrix}\right.$
  • $z^t_i=S^{t-1}_i$
  • $S^t_i=f(z^t_i,b_i)$

For $b=0$ this is the ReLU. The trick is that training starts with a nearly linear network (for example, when the thresholds $b_i$ are initialized to large negative values, almost every input is passed through unchanged), and then, as the parameters $b$ are updated, the capacity of the network increases during training.
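
A minimal NumPy sketch of the forward pass, with per-unit thresholds `b` (all names here are illustrative, not taken from the original paper):

```python
import numpy as np

def nonlinearity_generator_forward(z, b):
    """Elementwise thresholding f(z, b) = z if z >= b else b.

    z : (batch, units) pre-activations S^{t-1}
    b : (units,) learned thresholds; b = 0 gives the ReLU,
        large negative b makes the layer almost the identity.
    """
    return np.maximum(z, b)

z = np.array([[-2.0, -0.5, 0.3, 4.0]])
print(nonlinearity_generator_forward(z, np.zeros(4)))       # [[0.  0.  0.3 4. ]]  (ReLU)
print(nonlinearity_generator_forward(z, -10 * np.ones(4)))  # [[-2. -0.5 0.3 4. ]] (near-identity)
```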

In the original paper a special training rule similar to momentum is used. How about Adadelta?
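
For comparison, a sketch of a plain Adadelta step (Zeiler's standard rule, not the update rule from the original paper) applied to the threshold vector `b`:

```python
import numpy as np

def adadelta_update(b, grad_b, state, rho=0.95, eps=1e-6):
    """One standard Adadelta step for the thresholds b (not the paper's rule)."""
    eg2, ed2 = state                                   # running averages of grad^2 and update^2
    eg2 = rho * eg2 + (1 - rho) * grad_b ** 2
    delta = -np.sqrt(ed2 + eps) / np.sqrt(eg2 + eps) * grad_b
    ed2 = rho * ed2 + (1 - rho) * delta ** 2
    return b + delta, (eg2, ed2)

b = np.zeros(4)
state = (np.zeros(4), np.zeros(4))
grad_b = np.array([0.1, -0.2, 0.0, 0.05])              # dE/db from the backward pass below
b, state = adadelta_update(b, grad_b, state)
```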

Backward

  • ${\delta E\over\delta z^t_i}
    ={\delta S^t_i\over\delta z^t_i}{\delta E\over\delta S^t_i}
    =\left\{\begin{matrix}
    {\delta E\over\delta S^t_i} & ;S^{t-1}_i\ge b_i\\
    0 & ;else
    \end{matrix}\right.
    $
  • ${\delta E\over\delta S^{t-1}_i}
    ={\delta z^t_i\over\delta S^{t-1}_i}{\delta E\over\delta z^t_i}
    ={\delta E\over\delta z^t_i}
    $
  • ${\delta E\over\delta b_i}
    ={\delta S^t_i\over\delta b_i}{\delta E\over\delta S^t_i}
    =\left\{\begin{matrix}
    0 & ;S^{t-1}_i\ge b_i\\
    {\delta E\over\delta S^t_i} & ;else
    \end{matrix}\right.
    $ (see the sketch below)
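
A matching NumPy sketch of these gradients (again with illustrative names; the subgradient at $z^t_i=b_i$ is assigned to the pass-through branch here, which is an arbitrary choice):

```python
import numpy as np

def nonlinearity_generator_backward(z, b, grad_out):
    """Gradients of the thresholding layer f(z, b) = max(z, b).

    z        : (batch, units) inputs S^{t-1}
    b        : (units,) thresholds
    grad_out : (batch, units) dE/dS^t flowing in from above
    Returns (dE/dS^{t-1}, dE/db).
    """
    pass_through = z >= b                                        # where f(z, b) = z
    grad_prev = np.where(pass_through, grad_out, 0.0)            # dE/dz = dE/dS^{t-1}
    grad_b = np.where(pass_through, 0.0, grad_out).sum(axis=0)   # dE/db, summed over the batch
    return grad_prev, grad_b
```

Summing over the batch dimension for $b$ reflects that each threshold $b_i$ is shared by all samples in a mini-batch.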

Remarks

How about the following function:

$$
f(x,\alpha)=\alpha{1\over 1+e^{-x}} + (1-\alpha)(x+0.5)
$$

siglingen

For $\alpha=0$ this is the linearity plus $0.5$. For $\alpha=1$ this is the sigmoid. Can this be used to improve RNNs? Would it be better to use $\tanh$, so that with $\alpha=0$ this is exactly the linearity?
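
A small sketch of this blended activation and a $\tanh$ variant, purely to illustrate the question above (the function names are made up):

```python
import numpy as np

def siglingen(x, alpha):
    """Blend between a shifted identity (alpha = 0) and the sigmoid (alpha = 1)."""
    return alpha / (1.0 + np.exp(-x)) + (1.0 - alpha) * (x + 0.5)

def tanhlingen(x, alpha):
    """tanh variant: alpha = 0 is exactly the identity, alpha = 1 is tanh."""
    return alpha * np.tanh(x) + (1.0 - alpha) * x

x = np.linspace(-3, 3, 7)
print(siglingen(x, 0.0))    # x + 0.5
print(siglingen(x, 1.0))    # sigmoid(x)
print(tanhlingen(x, 0.0))   # x
```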

References

Batchnorm_Replacement