Nonlinearity Generator

This layer can be used as a replacement for batch normalization or in combination with it.


  • Let $f(s,b)=\begin{cases} s & ;s\ge b\\ b & ;\text{else}\end{cases}$
  • $z^t_i=S^{t-1}_i$
  • $S^t_i=f(z^t_i,b_i)$

For $b=0$ this is the ReLU. The trick is that training starts with a nearly linear network; then, as the parameters $b$ are updated, the capacity of the network increases during training.
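A minimal NumPy sketch of this forward pass (function and variable names are my own):

```python
import numpy as np

def ng_forward(S_prev, b):
    """Nonlinearity-generator forward pass: f(s, b) = max(s, b),
    applied element-wise with one trainable threshold b_i per unit."""
    z = S_prev               # z^t_i = S^{t-1}_i
    return np.maximum(z, b)  # S^t_i = f(z^t_i, b_i)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(ng_forward(x, np.zeros_like(x)))   # b = 0: identical to np.maximum(x, 0), i.e. ReLU
print(ng_forward(x, np.full_like(x, -10.0)))  # very negative b: nearly the identity
```

With a sufficiently negative initial $b$, the layer passes everything through unchanged, which is exactly the "start nearly linear" behaviour described above.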

In the original paper a special training rule, similar to momentum, is used. How about Adadelta?
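For reference, a sketch of the standard Adadelta update (Zeiler's rule, not the paper's special rule), which could be applied to the thresholds $b$; the helper name and state layout are my own:

```python
import numpy as np

def adadelta_step(param, grad, state, rho=0.95, eps=1e-6):
    """One Adadelta update: no global learning rate; the step is scaled by
    the running RMS of past updates divided by the RMS of past gradients."""
    state["Eg2"] = rho * state["Eg2"] + (1 - rho) * grad**2
    delta = -np.sqrt(state["Edx2"] + eps) / np.sqrt(state["Eg2"] + eps) * grad
    state["Edx2"] = rho * state["Edx2"] + (1 - rho) * delta**2
    return param + delta

# Hypothetical usage on the per-unit thresholds b:
b = np.zeros(4)
state = {"Eg2": np.zeros_like(b), "Edx2": np.zeros_like(b)}
grad_b = np.array([0.1, -0.2, 0.0, 0.3])
b = adadelta_step(b, grad_b, state)
```

Since Adadelta adapts the step size per parameter, each $b_i$ would grow its own effective learning rate, which may or may not reproduce the slow capacity increase the original rule was designed for.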


  • ${\delta E\over\delta z^t_i}={\delta S^t_i\over\delta z^t_i}{\delta E\over\delta S^t_i}=\begin{cases}{\delta E\over\delta S^t_i} & ;S_i^{(t-1)}\ge b_i\\ 0 & ;\text{else}\end{cases}$
  • ${\delta E\over\delta S^{(t-1)}_i}={\delta z^t_i\over\delta S^{(t-1)}_i}{\delta E\over\delta z^t_i}={\delta E\over\delta z^t_i}$
  • ${\delta E\over\delta b_i}={\delta S^t_i\over\delta b_i}{\delta E\over\delta S^t_i}=\begin{cases}0 & ;S_i^{(t-1)}\ge b_i\\ {\delta E\over\delta S^t_i} & ;\text{else}\end{cases}$
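The three derivatives above can be sketched as a single backward function (a NumPy sketch with my own naming; the incoming gradient is $\delta E/\delta S^t$):

```python
import numpy as np

def ng_backward(S_prev, b, dE_dS):
    """Backward pass for S^t = max(S^{t-1}, b): the gradient is routed to
    S^{t-1} where S^{t-1} >= b, and to the threshold b otherwise."""
    active = S_prev >= b                    # S^{t-1}_i >= b_i
    dE_dz = np.where(active, dE_dS, 0.0)    # dE/dz^t: passes through on active units
    dE_dS_prev = dE_dz                      # z^t = S^{t-1} is the identity
    dE_db = np.where(active, 0.0, dE_dS)    # b only gets gradient where it wins the max
    return dE_dS_prev, dE_db
```

Note that every incoming gradient goes to exactly one of the two parameters: the input on active units, the threshold on clipped ones.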


How about the following function:

$f(x,\alpha)=\alpha{1\over 1+e^{-x}} + (1-\alpha)(x-0.5)$


For $\alpha=0$ this is the identity shifted by $-0.5$; for $\alpha=1$ it is the sigmoid. Can this be used to improve RNNs? It may be better to use tanh, so that with $\alpha=0$ the function is exactly the identity.
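Both variants in a short sketch (function names are my own):

```python
import numpy as np

def blend_sigmoid(x, alpha):
    """alpha = 1: sigmoid; alpha = 0: identity shifted by -0.5,
    matching the formula above."""
    return alpha / (1.0 + np.exp(-x)) + (1.0 - alpha) * (x - 0.5)

def blend_tanh(x, alpha):
    """tanh variant: alpha = 1 is tanh, alpha = 0 is exactly the identity,
    since tanh(0) = 0 needs no offset."""
    return alpha * np.tanh(x) + (1.0 - alpha) * x
```

The tanh variant avoids the constant offset entirely, which is why it is the cleaner choice for the $\alpha=0$ limit.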