Sigmoid Generator

Training deep networks with logistic units is hard. The idea is to linearize the network at the beginning of training and let the network learn its degree of nonlinearity (e.g. capacity) by itself. Forward: Let $f(x,\alpha)=\alpha{1\over 1+e^{-x}} + (1-\alpha)(x-0.5)$ $z_i=S^{t-1}_i$ $S^t_i=f(z_i, b_i)$ Backward: Let $g(x)={1\over 1+e^{-x}}$ ${\delta E\over\delta z_i} ={\delta S^t_i\over\delta z_i}{\delta E\over\delta S^t_i} =(b_i […]
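A minimal NumPy sketch of the forward function above. The two partial derivatives are worked out here directly from the definition of $f$ (they are not copied from the truncated backward formula), and the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_generator(x, alpha):
    """f(x, alpha) = alpha * sigmoid(x) + (1 - alpha) * (x - 0.5)."""
    return alpha * sigmoid(x) + (1.0 - alpha) * (x - 0.5)

def sigmoid_generator_grads(x, alpha):
    """Partial derivatives of f, derived from the definition (an assumption
    about what the elided backward pass looks like)."""
    g = sigmoid(x)
    df_dx = alpha * g * (1.0 - g) + (1.0 - alpha)
    df_dalpha = g - (x - 0.5)
    return df_dx, df_dalpha
```

With `alpha = 0` the unit is purely linear and with `alpha = 1` it is the plain logistic unit, which matches the motivation given in the text.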


Long Short Term Memory cells. Variables: $N$ number of hidden nodes; $I$ number of input nodes; $t=0\dots T$ time steps; $\theta^{xi(t)}\in R^{N\times I}$ input gate, input weights; $\theta^{hi(t)}\in R^{N\times N}$ input gate, hidden weights; $b^{i(t)}\in R^{N}$ input gate, bias. Forward pass: $z^{i(t)}=\theta^{xi(t)}x^{(t)} + \theta^{hi(t)}h^{(t-1)} + b^{i(t)}$ $i^{(t)}=Sig(z^{i(t)})$ $z^{f(t)}=\theta^{xf(t)}x^{(t)} + \theta^{hf(t)}h^{(t-1)} + b^{f(t)}$ $f^{(t)}=Sig(z^{f(t)})$ $z^{o(t)}=\theta^{xo(t)}x^{(t)} + \theta^{ho(t)}h^{(t-1)} + b^{o(t)}$ $o^{(t)}=Sig(z^{o(t)})$ $z^{g(t)}=\theta^{xg(t)}x^{(t)} + \theta^{hg(t)}h^{(t-1)} + b^{g(t)}$ […]
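The gate pre-activations above all share one pattern, which can be sketched as follows. This covers only the part of the forward pass that is visible in the excerpt; the `params` dictionary layout and the function name are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_gate_preactivations(x_t, h_prev, params):
    """Compute z^{i}, z^{f}, z^{o}, z^{g} for one time step.

    params maps a gate name to (theta_x, theta_h, b); this layout is hypothetical.
    """
    z = {}
    for gate in ("i", "f", "o", "g"):
        theta_x, theta_h, b = params[gate]
        z[gate] = theta_x @ x_t + theta_h @ h_prev + b
    return z

rng = np.random.default_rng(0)
N, I = 4, 3  # hidden nodes, input nodes
params = {k: (rng.normal(size=(N, I)), rng.normal(size=(N, N)), np.zeros(N))
          for k in "ifog"}
z = lstm_gate_preactivations(rng.normal(size=I), np.zeros(N), params)
# Input, forget and output gates use Sig, as in the formulas above.
gates = {k: sigmoid(z[k]) for k in ("i", "f", "o")}
```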


For simplicity, let the weight matrix be indexed from $-c$ to $c$, and let $s$ be the stride. Forward: $z^t(x,y)=\sum_{u=-c}^c\sum_{v=-c}^c w_{u,v} h^{t-1}(xs+u,ys+v)+b$ $h^t(x,y)=Sig(z^t(x,y))$ Derivatives: $\def\xo{\overline x}$ $\def\yo{\overline y}$ ${\delta E\over\delta z^t(x,y)}= {\delta h^t(x,y)\over\delta z^t(x,y)}{\delta E\over\delta h^t(x,y)} =h^t(x,y)(1-h^t(x,y)){\delta E\over\delta h^t(x,y)} $ $w_{u,v}$ appears in every $z^t(x,y)$. ${\delta E\over\delta w_{u,v}} ={\sum_x\sum_y{\delta z^t(x,y)\over\delta w_{u,v}}{\delta E\over\delta z^t(x,y)}} ={\sum_x\sum_y h^{t-1}(xs+u,ys+v) {\delta E\over\delta […]
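The forward pass and the weight gradient above can be sketched for a single channel as follows. Zero padding of width $c$ and the output size are assumptions (the text does not say how borders are handled), and the kernel entry $w_{u,v}$ is stored at `w[u + c, v + c]`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_forward(h_prev, w, b, s):
    """z(x,y) = sum_{u,v} w[u,v] * h_prev(x*s+u, y*s+v) + b, then Sig."""
    c = w.shape[0] // 2
    padded = np.pad(h_prev, c)  # assumed zero padding so x*s+u is always valid
    out_h = (h_prev.shape[0] - 1) // s + 1
    out_w = (h_prev.shape[1] - 1) // s + 1
    z = np.empty((out_h, out_w))
    for x in range(out_h):
        for y in range(out_w):
            patch = padded[x * s:x * s + 2 * c + 1, y * s:y * s + 2 * c + 1]
            z[x, y] = np.sum(w * patch) + b
    return z, sigmoid(z)

def conv_weight_grad(h_prev, h, dE_dh, w_shape, s):
    """dE/dz = h (1 - h) dE/dh; dE/dw[u,v] = sum_{x,y} h_prev(x*s+u, y*s+v) dE/dz(x,y)."""
    c = w_shape[0] // 2
    padded = np.pad(h_prev, c)
    dE_dz = h * (1.0 - h) * dE_dh
    dE_dw = np.zeros(w_shape)
    for x in range(dE_dz.shape[0]):
        for y in range(dE_dz.shape[1]):
            patch = padded[x * s:x * s + 2 * c + 1, y * s:y * s + 2 * c + 1]
            dE_dw += patch * dE_dz[x, y]
    return dE_dz, dE_dw
```

The explicit loops mirror the double sums in the formulas; a production implementation would vectorize them.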


Let $s$ be the stride. $\def\xo{\overline x}$ $\def\yo{\overline y}$ $\DeclareMathOperator*{\argmax}{arg\,max}$ Forward: $z^t(x,y)=\max_{u=0}^{s-1}\max_{v=0}^{s-1} h^{t-1}(xs+u,ys+v)$ $h^t(x,y)=z^t(x,y)$ Derivatives: ${\delta E\over\delta z^t(x,y)}= {\delta h^t(x,y)\over\delta z^t(x,y)}{\delta E\over\delta h^t(x,y)} ={\delta E\over\delta h^t(x,y)} $ $h^{t-1}(\xo,\yo)$ appears only in $z^{t}(x,y)$ with $x=\lfloor{\xo\over s}\rfloor$ and $y=\lfloor{\yo\over s}\rfloor$. ${\delta E\over\delta h^{t-1}(\xo,\yo)} ={\delta z^{t}(x,y)\over\delta h^{t-1}(\xo,\yo)} {\delta E\over\delta z^{t}(x,y)} ={\delta E\over\delta z^{t}(x,y)}\quad \text{if}\quad (\xo,\yo)=\argmax_{u,v}h^{t-1}(xs+u,ys+v) ;\quad 0\quad \text{otherwise} […]
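A small sketch of the max-pooling forward and backward pass described above: the gradient is routed only to the position that attained the maximum in each window. It assumes the input height and width are divisible by $s$; the function names are illustrative.

```python
import numpy as np

def maxpool_forward(h_prev, s):
    """Non-overlapping s x s max pooling; also records the argmax positions."""
    rows, cols = h_prev.shape[0] // s, h_prev.shape[1] // s
    out = np.empty((rows, cols))
    argmax = np.empty((rows, cols, 2), dtype=int)
    for x in range(rows):
        for y in range(cols):
            window = h_prev[x * s:(x + 1) * s, y * s:(y + 1) * s]
            u, v = np.unravel_index(np.argmax(window), window.shape)
            out[x, y] = window[u, v]
            argmax[x, y] = (x * s + u, y * s + v)
    return out, argmax

def maxpool_backward(dE_dh, argmax, input_shape):
    """dE/dh_prev is dE/dz at the argmax position of each window, 0 elsewhere."""
    dE_dh_prev = np.zeros(input_shape)
    for x in range(dE_dh.shape[0]):
        for y in range(dE_dh.shape[1]):
            xo, yo = argmax[x, y]
            dE_dh_prev[xo, yo] = dE_dh[x, y]
    return dE_dh_prev
```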


$\def\ho{\overline h}$ Normalization of a layer's inputs by mean and variance. Forward: Let $h^{t-1}_{ij}$ be the input of the $i$-th node for the $j$-th sample of a batch of size $m$. Let $\epsilon$ be a small value for numerical stability. $\mu^t_i={1\over m}\sum_{j=1}^m h^{t-1}_{ij}$ mean per node. ${\sigma^t_i}^2={1\over m}\sum_{j=1}^m (h^{t-1}_{ij}-\mu^t_i)^2$ variance per node. $\ho^t_{ij}={h^{t-1}_{ij}-\mu^t_i\over \sqrt{{\sigma^t_i}^2+\epsilon}}$ $h^t_{ij}=\gamma_i\ho^t_{ij}+\beta_i$ Derivatives
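The forward pass above translates almost line by line into NumPy. The sketch below stores the batch as an array of shape (nodes, samples) to match the indices $i$, $j$; the default value of `eps` is an assumption.

```python
import numpy as np

def batchnorm_forward(h_prev, gamma, beta, eps=1e-5):
    """h_prev has shape (n_nodes, m); gamma and beta have shape (n_nodes,)."""
    mu = h_prev.mean(axis=1, keepdims=True)                 # mean per node
    var = ((h_prev - mu) ** 2).mean(axis=1, keepdims=True)  # variance per node
    h_norm = (h_prev - mu) / np.sqrt(var + eps)
    return gamma[:, None] * h_norm + beta[:, None]
```

With `gamma = 1` and `beta = 0` every node's output has approximately zero mean and unit variance over the batch.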


2D recurrent layer implementation. Forward: Network parameters: $\bar S$ initial state parameters; $W$, $\bar W$ weights for the recurrent connections; $b$ bias. $S^{(0,u)}=S^{(t,0)}=\bar S$ For $1\le t\le T$, $1\le u\le U$: $z^{(t,u)}=W^IX^{(t,u)}+WS^{(t-1,u)}+ \bar WS^{(t,u-1)}+b$; with $X$ inputs. $S^{(t,u)}=g(z^{(t,u)})$ Backward: ${\delta E\over\delta z_i^{(t,u)}}={\delta S_i^{(t,u)}\over\delta z_i^{(t,u)}}{\delta E\over\delta S_i^{(t,u)}}=g'(z_i^{(t,u)}){\delta E\over\delta S_i^{(t,u)}}$ ${\delta E\over\delta S_i^{(t,u)}}=\sum_j{\delta z_j^{(t+1,u)}\over\delta S_i^{(t,u)}}{\delta E\over\delta z_j^{(t+1,u)}}+ \sum_j{\delta z_j^{(t,u+1)}\over\delta S_i^{(t,u)}}{\delta E\over\delta […]
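The forward recursion above fills a grid of states row by row. The sketch below assumes `tanh` for the activation $g$ (the text leaves $g$ open) and stores the inputs 0-based, so `X[t-1, u-1]` plays the role of $X^{(t,u)}$.

```python
import numpy as np

def recurrent_2d_forward(X, W_in, W, W_bar, b, S_init, g=np.tanh):
    """X has shape (T, U, n_in); returns states S of shape (T+1, U+1, n)."""
    T, U = X.shape[0], X.shape[1]
    n = b.shape[0]
    S = np.empty((T + 1, U + 1, n))
    S[0, :] = S_init  # S^{(0,u)} = S_bar
    S[:, 0] = S_init  # S^{(t,0)} = S_bar
    for t in range(1, T + 1):
        for u in range(1, U + 1):
            z = W_in @ X[t - 1, u - 1] + W @ S[t - 1, u] + W_bar @ S[t, u - 1] + b
            S[t, u] = g(z)
    return S
```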

Dice Coefficient

Error layer. Useful for image segmentation. See also Kaggle Ultrasound Nerve Segmentation. Forward: Let $G\in R^n$ be a suitable weight matrix with $G_i\in[0,1]$. Let $f(A)=\sum_i G_i A_i$ for $A\in R^n$. $S=\sum_i G_iY_iH_i^{(t-1)}$ $E=1-{E_1\over E_2}=1-{2S+\gamma\over f(Y)+f(H^{(t-1)})+\gamma}$ with $\gamma\in R$, $\gamma\gt 0$ for numerical stability. Derivatives: ${\delta E\over\delta H_i^{(t-1)}} =-{{\delta E_1\over\delta H_i^{(t-1)}}E_2 -E_1{\delta E_2\over\delta H_i^{(t-1)}} \over E_2^2} =-{2G_iY_iE_2 - […]
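A sketch of the Dice error and its gradient. The gradient applies the quotient rule shown above with ${\delta E_1\over\delta H_i}=2G_iY_i$ and ${\delta E_2\over\delta H_i}=G_i$; the second of these is worked out here from the definition of $f$, since the excerpt breaks off before that term. Function names and the default `gamma` are assumptions.

```python
import numpy as np

def dice_loss(H, Y, G, gamma=1.0):
    """E = 1 - (2 S + gamma) / (f(Y) + f(H) + gamma), with f(A) = sum_i G_i A_i."""
    S = np.sum(G * Y * H)
    E1 = 2.0 * S + gamma
    E2 = np.sum(G * Y) + np.sum(G * H) + gamma
    return 1.0 - E1 / E2, E1, E2

def dice_loss_grad(H, Y, G, gamma=1.0):
    """dE/dH_i via the quotient rule (completion derived here, see lead-in)."""
    _, E1, E2 = dice_loss(H, Y, G, gamma)
    return -(2.0 * G * Y * E2 - E1 * G) / E2 ** 2
```

For a binary target and a perfect prediction (`H == Y`) the error is zero, because then $2S = f(Y)+f(H)$.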

Focal Loss

Good for imbalanced classification tasks. Resources: Paper 1708.02002

Negative Log Probability

Error layer. Forward: $E(\theta)=\sum_t\sum_i(-y_i^{(t)}\ln(r_i^{(t)})-(1-y_i^{(t)})\ln(1-r_i^{(t)}))$
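The error above is a sum over all time steps and units, so it can be computed on arrays of any matching shape. Clipping the predictions away from 0 and 1 is an assumption added here to keep the logarithms finite; it is not part of the formula.

```python
import numpy as np

def negative_log_probability(y, r, eps=1e-12):
    """E = sum(-y ln r - (1 - y) ln(1 - r)) over all entries of y and r."""
    r = np.clip(r, eps, 1.0 - eps)  # assumed safeguard against log(0)
    return np.sum(-y * np.log(r) - (1.0 - y) * np.log(1.0 - r))
```

For example, two predictions of 0.5 give an error of $2\ln 2$ regardless of the targets.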


Forward: $n_I$ number of input units; $n$ number of hidden units. $W_{fx}, W_x\in R^{n \times n_I}$ $b\in R^n$ $W_{fS}, W_{Sf}\in R^{n \times n}$ $f^t= (W_{fx} x^t)\cdot (W_{fS}S^{t-1})$ $f^t_i=(\sum_k w^{fx}_{ik}x^t_k)(\sum_k w^{fS}_{ik}S^{t-1}_k) =\alpha^t_i\beta^t_i$ $z^t=W_x x^t+W_{Sf}f^t+b$ with $\gamma^t=W_{Sf}f^t$ $S^t=g(z^t)$ For several input layers: sum over $diag(W_{fx} x^t+W_{f\dot x} \dot x^t)$ in the diag part and sum $W_x x^t+W_{\dot x} \dot x^t$ […]
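The forward pass for a single input layer can be sketched as below. The product $\alpha^t_i\beta^t_i$ is taken elementwise, as the component formula indicates; `tanh` for $g$ and the function name are assumptions.

```python
import numpy as np

def multiplicative_forward(x, S_prev, W_fx, W_fS, W_x, W_Sf, b, g=np.tanh):
    """Returns the new state S^t and the factor vector f^t."""
    alpha = W_fx @ x       # alpha^t, shape (n,)
    beta = W_fS @ S_prev   # beta^t, shape (n,)
    f = alpha * beta       # elementwise product f^t_i = alpha^t_i * beta^t_i
    z = W_x @ x + W_Sf @ f + b
    return g(z), f
```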


Forward: $z^t=\sum_i h^{t-1,i}$ with $h^{t-1,i}\in R^{n\times m}$ the inputs from the previous layers. $h^t=g(z^t)$ with activation $g$. Backward: ${\delta E\over\delta z^t}={\delta h^t\over\delta z^t}{\delta E\over\delta h^t}=g'(z^t){\delta E\over\delta h^t}$ ${\delta E\over\delta h^{t-1,i}}={\delta z^t\over\delta h^{t-1,i}}{\delta E\over\delta z^t}={\delta E\over\delta z^t}$


The upscale layer is used in the U-Net. Forward: Let $s\in N^{>0}$ be the scaling factor. $h^t_i=h^{t-1}_{\lfloor i/s\rfloor}$
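For a one-dimensional input, the index rule above (output $i$ reads input $i/s$ with integer division) is nearest-neighbour upscaling, which `np.repeat` performs directly. The restriction to 1-D and the function name are assumptions.

```python
import numpy as np

def upscale_forward(h_prev, s):
    """Each input element is repeated s times: out[i] = h_prev[i // s]."""
    return np.repeat(h_prev, s)
```

For instance, upscaling `[1, 2, 3]` with `s = 2` yields `[1, 1, 2, 2, 3, 3]`.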

Nonlinearity Generator

This layer can be used as a replacement for, or in combination with, batchnorm. Forward: Let $f(s,b)=\left\{\begin{matrix} s & ;s\ge b\\ b & ;else \end{matrix}\right.$ $z^t_i=S^{t-1}_i$ $S^t_i=f(z^t_i,b_i)$ For $b=0$ this is the ReLU. The trick is that training starts with a nearly linear network, and then the capacity is increased during training through parameter updates of $b$. […]
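Since $f(s,b)$ equals $\max(s,b)$, the forward pass is a one-liner. The partial derivatives below are worked out from the piecewise definition rather than taken from the elided part of the text, and the function names are hypothetical.

```python
import numpy as np

def nonlinearity_generator(z, b):
    """f(z, b) = z where z >= b, else b."""
    return np.maximum(z, b)

def nonlinearity_generator_grads(z, b):
    """(df/dz, df/db): the input passes through where z >= b, the threshold elsewhere."""
    active = (z >= b).astype(float)
    return active, 1.0 - active
```

With a strongly negative `b` almost every input satisfies `z >= b`, so the layer acts as the identity; this is the nearly linear starting point the text describes.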

Average Pool in Time

Reduce the t-dimension in an RNN with average pooling. Forward: Let $s$ be the stride. $0\le t\le T$ $S^{(l,-1)}=S^{(l,T+1)}=0$ $S^{(l,\bar t)}=z^{(l,\bar t)}={1\over3}\sum_{i=-1}^1S^{(l-1,2\bar t+i)}$ Backward: ${\delta E\over\delta z^{(l,\bar t)}}={\delta S^{(l,\bar t)}\over\delta z^{(l,\bar t)}}{\delta E\over\delta S^{(l,\bar t)}} ={\delta E\over\delta S^{(l,\bar t)}}$ ${\delta E\over\delta S^{(l-1,t)}} =\left\{\begin{array}{c} \sum_{i=0}^1{\delta z^{(l-1,\left\lfloor{t\over 2}\right\rfloor+i)}\over\delta S^{(l-1,t)}}{\delta E\over\delta z^{(l-1,\left\lfloor{t\over 2}\right\rfloor+i)}}; \space t\space odd\\ {\delta z^{(l-1,\left\lfloor{t\over 2}\right\rfloor)}\over\delta S^{(l-1,t)}}{\delta […]
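A sketch of the forward pass above, with the window of three time steps and the step of two taken literally from the formula. The zero boundary states are realized by padding the input; the output range $\bar t=0,\dots,\lfloor T/2\rfloor$ is an assumption, since the text does not state it.

```python
import numpy as np

def average_pool_time(S_prev):
    """S_prev has shape (T+1, n) for t = 0..T; returns shape (T//2 + 1, n)."""
    T = S_prev.shape[0] - 1
    # Zero rows stand in for the boundary states at t = -1 and t = T + 1.
    padded = np.pad(S_prev, ((1, 1), (0, 0)))
    out_len = T // 2 + 1
    out = np.empty((out_len, S_prev.shape[1]))
    for tb in range(out_len):
        # Time step t sits at padded row t + 1, so 2*tb-1 .. 2*tb+1 is rows 2*tb .. 2*tb+2.
        out[tb] = padded[2 * tb:2 * tb + 3].sum(axis=0) / 3.0
    return out
```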

Fully Connected

Forward: $z_i^t=\sum_j w_{ij}S^{(t-1)}_j$ $S^t_i=g(z^t_i)$ Backward: We know ${\delta E\over\delta S^t_i}$ from the child layer, so we can compute: ${\delta E\over\delta z^t_i}={\delta S^t_i\over\delta z^t_i}{\delta E\over\delta S^t_i} =g'(z^t_i){\delta E\over\delta S^t_i} $ ${\delta E\over\delta S^{t-1}_i} =\sum_j{\delta z^t_j\over\delta S^{t-1}_i}{\delta E\over\delta z^{t}_j} =\sum_j{w_{ji}}{\delta E\over\delta z^{t}_j} $ ${\delta E\over\delta w_{ij}} ={\delta z^t_i\over \delta w_{ij}}{\delta E\over \delta z^t_i} ={S_j^{(t-1)}}{\delta E\over \delta z^t_i} $
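The three backward formulas above map onto one elementwise product, one transposed matrix product and one outer product. The sketch below assumes `tanh` as the activation $g$ for the usage example; the function names are illustrative.

```python
import numpy as np

def fc_forward(S_prev, W, g=np.tanh):
    """z_i = sum_j w_ij S_prev_j, S_i = g(z_i)."""
    z = W @ S_prev
    return z, g(z)

def fc_backward(S_prev, W, z, dE_dS, g_prime):
    """Returns dE/dz, dE/dS_prev and dE/dW given dE/dS from the child layer."""
    dE_dz = g_prime(z) * dE_dS        # g'(z_i) * dE/dS_i
    dE_dS_prev = W.T @ dE_dz          # sum_j w_ji * dE/dz_j
    dE_dW = np.outer(dE_dz, S_prev)   # S_prev_j * dE/dz_i
    return dE_dz, dE_dS_prev, dE_dW

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2
```

A call such as `fc_backward(S_prev, W, z, dE_dS, tanh_prime)` returns all three gradients for one layer.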