# Nonlinearity Generator

This layer can be used as a replacement for, or in combination with, batch normalization.

## Forward

• Let $f(s,b)=\begin{cases} s & \text{if } s\ge b\\ b & \text{else} \end{cases}$
• $z^t_i=S^{t-1}_i$
• $S^t_i=f(z^t_i,b_i)$

For $b=0$ this is the ReLU. The trick is that training starts with a nearly linear network (e.g., with $b$ initialized to a sufficiently negative value, $f$ acts as the identity almost everywhere), and then, through the parameter updates of $b$, the capacity of the network is increased during training.
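A minimal NumPy sketch of the forward pass under these definitions (the initialization value of $b$ and the shapes are illustrative assumptions, not taken from the text):

```python
import numpy as np

def forward(S_prev, b):
    """Forward pass: z^t = S^{t-1}, S^t = f(z^t, b) = max(z^t, b)."""
    z = S_prev                 # z^t_i = S^{t-1}_i
    S = np.maximum(z, b)       # S^t_i = f(z^t_i, b_i)
    return S

# Starting nearly linear: a strongly negative b makes f the identity
# almost everywhere (assumed initialization, not given in the text).
units = 4
b = np.full(units, -10.0)
S_prev = np.random.randn(2, units)             # (batch, units)
print(np.allclose(forward(S_prev, b), S_prev))  # True at this b
```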

## Backward

• ${\partial E\over\partial z^t_i}={\partial S^t_i\over\partial z^t_i}{\partial E\over\partial S^t_i} =\begin{cases} {\partial E\over\partial S^t_i} & \text{if } S^{t-1}_i\ge b_i\\ 0 & \text{else} \end{cases}$
• ${\partial E\over\partial S^{t-1}_i} ={\partial z^t_i\over\partial S^{t-1}_i}{\partial E\over\partial z^t_i} ={\partial E\over\partial z^t_i}$
• ${\partial E\over\partial b_i} ={\partial S^t_i\over\partial b_i}{\partial E\over\partial S^t_i} =\begin{cases} 0 & \text{if } S^{t-1}_i\ge b_i\\ {\partial E\over\partial S^t_i} & \text{else} \end{cases}$
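A sketch of the corresponding backward pass, following the three derivatives above. Summing the $b$ gradient over the batch is an assumption (the formulas are written per unit); variable names are illustrative:

```python
import numpy as np

def backward(S_prev, b, dE_dS):
    """Backward pass for S^t = max(S^{t-1}, b).

    dE_dS is the upstream gradient dE/dS^t with shape (batch, units);
    b is shared across the batch, so its gradient is summed over axis 0
    (an assumption -- the formulas above are written per unit).
    """
    pass_through = S_prev >= b                    # input wins the max
    dE_dz = np.where(pass_through, dE_dS, 0.0)    # dE/dz^t
    dE_dS_prev = dE_dz                            # z^t = S^{t-1}, so identity
    dE_db = np.where(pass_through, 0.0, dE_dS).sum(axis=0)
    return dE_dS_prev, dE_db

S_prev = np.array([[2.0, -3.0], [0.5, 1.5]])
b = np.array([0.0, 1.0])
grads = backward(S_prev, b, np.ones_like(S_prev))
print(grads)  # gradient flows to the input where S_prev >= b, else to b
```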

## Remarks

$$f(x,\alpha)=\alpha\,{1\over 1+e^{-x}} + (1-\alpha)(x+0.5)$$
For $\alpha=0$ this is the identity plus $0.5$ (the offset makes the linear part match the sigmoid at $x=0$, where both equal $0.5$). For $\alpha=1$ this is the sigmoid. Can this be used to improve RNNs? It may be better to use tanh: since $\tanh(0)=0$, no offset is needed, and for $\alpha=0$ the blend is exactly the identity.
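A small sketch of this interpolation and the tanh variant suggested above (the function names are illustrative):

```python
import numpy as np

def sigmoid_blend(x, alpha):
    # alpha = 1: sigmoid; alpha = 0: identity shifted by +0.5 so that
    # both components agree at x = 0, where the sigmoid equals 0.5.
    return alpha / (1.0 + np.exp(-x)) + (1.0 - alpha) * (x + 0.5)

def tanh_blend(x, alpha):
    # alpha = 1: tanh; alpha = 0: exactly the identity -- no offset is
    # needed because tanh(0) = 0.
    return alpha * np.tanh(x) + (1.0 - alpha) * x

x = np.linspace(-2.0, 2.0, 5)
for a in (0.0, 0.5, 1.0):
    print(a, sigmoid_blend(x, a), tanh_blend(x, a))
```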