Reinforcement Learning

$\def\P{\mathcal P}$
$\def\E{\mathbb E}$
$\def\R{\mathbb R}$
$\def\Rw{\mathcal R}$

RL Process

(Figure: rl_process)

History:
$
H_t=A_1,O_1,R_1,…,A_t,O_t,R_t
$
The full history is too large to use directly, so it is compressed into an agent state $S^A_t=f(H_t)$.

Definition Markov State

State $S_t$ is Markov if and only if
$
\P(S_{t+1}|S_t)=\P(S_{t+1}|S_1,…,S_t)
$
Given the present state, the future is independent of the past.

Full Observability

$O_t=S_t^A=S_t^E$: the agent observes the environment state directly. (This is Markov.)

Partial Observability

$O_t=S_t^A\ne S_t^E$: the agent must construct its own state representation, for example with an RNN: $S_{t+1}=g(W_S S_t+W_O O_t)$.
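A minimal numpy sketch of this recurrent state update; the dimensions and the choice of $g=\tanh$ are illustrative assumptions, not fixed by the notes:

```python
import numpy as np

# Sketch of the recurrent agent-state update S_{t+1} = g(W_S S_t + W_O O_t).
# Dimensions and g = tanh are illustrative assumptions.
state_dim, obs_dim = 4, 3
rng = np.random.default_rng(0)
W_S = rng.normal(size=(state_dim, state_dim)) * 0.1   # recurrent weights
W_O = rng.normal(size=(state_dim, obs_dim)) * 0.1     # observation weights

def update_state(s, o):
    """One step of the agent-state recurrence."""
    return np.tanh(W_S @ s + W_O @ o)

s = np.zeros(state_dim)              # initial agent state
for t in range(5):
    o = rng.normal(size=obs_dim)     # stand-in for a partial observation O_t
    s = update_state(s, o)
print(s)
```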

RL Agent

Consists of three components:

  • Policy:
    Behaviour of the agent $A_t=\pi(S_t)$
  • Value Function:
    Expected discounted reward:
    $
    V_{\pi}(s)=\E_{\pi}\left[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+…|S_t=s\right]
    =\E_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}|S_t=s\right]
    $
    with $\gamma\in[0,1]$ (see the sketch after this list)
  • Model (optional):
    A model predicts what the environment will do next.
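To make the discounted sum inside the value-function expectation concrete, here is a small Python sketch; the reward sequence and $\gamma=0.9$ are made-up example values:

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted sum of a reward sequence: sum_k gamma^k * rewards[k]."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Made-up reward sequence; with gamma = 0.9 this gives 1 + 0 + 0.81*2 + 0.729*5 = 6.265
print(discounted_return([1.0, 0.0, 2.0, 5.0]))
```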

Transition Matrix

Defines the probabilities of moving from one state (row) to the next state (column).
$P_{ss'}=\P[S_{t+1}=s'|S_t=s]$

$
P=\begin{bmatrix}
P_{11} & … & P_{1n}\\
\vdots & & \vdots \\
P_{n1} & … & P_{nn}\\
\end{bmatrix}
$
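As a small illustration, a row-stochastic transition matrix for a hypothetical 3-state chain can be written and sampled like this (the numbers are made up):

```python
import numpy as np

# Hypothetical 3-state chain: row s holds P[S_{t+1} = s' | S_t = s].
P = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])
assert np.allclose(P.sum(axis=1), 1.0)   # each row is a probability distribution

rng = np.random.default_rng(0)
s = 0
for _ in range(5):
    s = rng.choice(len(P), p=P[s])       # sample the next state from row s
    print(s)
```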

Markov Process (MP)

A Markov Process consists of $\langle S,P \rangle$

  • $S$ finite set of states
  • $P$ Transition Matrix

Markov Reward Process (MRP)

A Markov Reward Process consists of $\langle S,P,\Rw,\gamma \rangle$

  • $S$ finite set of states
  • $P$ Transition Matrix
  • $\Rw$ reward function $\Rw_s=\E[R_{t+1}|S_t=s]$
  • $\gamma\in[0,1]$ discount factor

Sample MRP

(Figure: mp_study)

Goal

The return $G_t$ is the total discounted reward from time step $t$:
$G_t=\sum_{k=0}^\infty\gamma^k R_{t+k+1}$

State Value Function

$v(s)=\E[G_t|S_t=s]$
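A minimal Monte Carlo sketch of this definition: sample many episodes of a small, made-up MRP and average the discounted returns to estimate $v(s)$. The states, $P$, $\Rw$ and $\gamma$ below are illustrative assumptions, not the example from the figure:

```python
import numpy as np

# States 0, 1, 2 with state 2 terminal; P, R and gamma are illustrative assumptions.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
])
R = np.array([1.0, 2.0, 0.0])    # R_s = E[R_{t+1} | S_t = s]
gamma = 0.9
rng = np.random.default_rng(0)

def sample_return(s, max_steps=200):
    """Sample one episode from state s and accumulate the discounted return G_t."""
    g, discount = 0.0, 1.0
    for _ in range(max_steps):
        if s == 2:                        # terminal state
            break
        g += discount * R[s]
        discount *= gamma
        s = rng.choice(3, p=P[s])
    return g

returns = [sample_return(0) for _ in range(10_000)]
print(np.mean(returns))                  # Monte Carlo estimate of v(0)
```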

References

Stochastics

Expected Value

Let $X$ be a finite random variable with $n$ values $x_i$ and probabilities $P(X=x_i)$.
Its expected value is then defined by:
$$
E(X)=\sum_{i=1}^nP(X=x_i)x_i
$$
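As a quick check of the definition, here is a small Python sketch computing $E(X)$ for a fair six-sided die (the example values are an assumption, not from the notes):

```python
# E(X) = sum_i P(X = x_i) * x_i for a fair six-sided die.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6
print(sum(p * x for p, x in zip(probs, values)))   # 3.5
```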