## Overview

- What is RL?

- Why is RL Better?

- Elements of RL

- K-Armed bandits and maths
### What is RL
- RL doesn't really require a data set; it works on the reward hypothesis and tries to achieve the maximum reward instead of learning from previously played data.

- RL handles delayed feedback very well, which supervised learning methods can't.

- RL performs way better in controlled environments than traditional supervised learning.

- Read more about the AlphaZero vs. Stockfish matches.

- AlphaZero can play multiple games with very minimal code changes (probably just changing the rules).


### Fundamentals of RL

- Policy - defines the learning agent's way of behaving at a given time.

- Model - the agent's representation of the environment.

- History - the sequence of observations, actions and rewards.

- Reward signal - defines the goal of an RL problem.

- State - a function of history.

- Information State - contains all the useful information from history.

- State Value - the total reward an agent can expect to accumulate starting from that state.

- Action Value - the total reward an agent can expect to accumulate if it takes that action.

- Transitions - predict the next state. (A minimal agent-environment loop tying these elements together is sketched below.)
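
A minimal sketch of how these elements fit together in the agent-environment loop. The `env` object and its `reset`/`step` methods are hypothetical placeholders (loosely modelled on a Gymnasium-style interface), not from any specific library:

```python
def run_episode(env, policy, max_steps=100):
    """Run one episode and record the history of (state, action, reward)."""
    history = []                  # History: the sequence of observations, actions, rewards
    state = env.reset()           # initial observation / state
    total_reward = 0.0

    for _ in range(max_steps):
        action = policy(state)                        # the policy maps state -> action
        next_state, reward, done = env.step(action)   # the environment transitions
        history.append((state, action, reward))
        total_reward += reward                        # accumulate the reward signal
        state = next_state
        if done:
            break

    return history, total_reward
```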


### K-Armed Bandits & maths

- Exploitation - a greedy policy where we keep choosing the action with the current maximum estimated value.

- Exploration - a policy where you check every action every time, to gather information about their values.

- The middle ground between these two (greedy most of the time, but a random action with probability epsilon) is called epsilon-greedy exploration, sketched below.
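
A small sketch of epsilon-greedy action selection over the current action-value estimates, using NumPy; the names `Q`, `epsilon`, and `rng` are illustrative, not from the notes:

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    """With probability epsilon explore (random arm), otherwise exploit (greedy arm)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore: pick any arm uniformly at random
    return int(np.argmax(Q))               # exploit: pick the arm with the highest estimate

rng = np.random.default_rng(0)
Q = np.zeros(10)                           # value estimates for a 10-armed bandit
action = epsilon_greedy(Q, epsilon=0.1, rng=rng)
```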

$$
Q_t(a) = \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t}
= \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}
$$

$$
Q_{n+1} = \frac{1}{n} \sum_{i=1}^{n} R_i
$$

$$
= \frac{1}{n} \left( R_n + \sum_{i=1}^{n-1} R_i \right)
$$

$$
= \frac{1}{n} \left( R_n + (n-1) \frac{1}{n-1} \sum_{i=1}^{n-1} R_i \right)
$$

$$
= \frac{1}{n} \left( R_n + (n-1) Q_n \right)
$$

$$
= \frac{1}{n} \left( R_n + n Q_n - Q_n \right)
$$

$$
= Q_n + \frac{1}{n} \left( R_n - Q_n \right)
$$
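
The last line is the incremental form of the sample average: it needs only the previous estimate and a pull count, not the full list of past rewards. A sketch of this update (hypothetical names, continuing the NumPy example above):

```python
import numpy as np

Q = np.zeros(10)    # current value estimate Q_n for each arm
N = np.zeros(10)    # number of times each arm has been pulled (n)

def update(action, reward):
    """Incremental sample-average update: Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)."""
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]
```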


### Markov Property

- The future state is independent of the past and depends only on the present state.

- It could be said that a Markov process is a tuple containing a finite set of states and their transition probabilities.
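
Stated formally, a state $S_t$ is Markov if and only if:

$$
\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]
$$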

### State transition matrix
- Convert the transition probabilities into a matrix: one row per current state, one column per successor state (shown below).
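
With states numbered $1, \ldots, n$ and $P_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$, the transition probabilities form a matrix whose rows each sum to 1:

$$
P =
\begin{bmatrix}
P_{11} & \cdots & P_{1n} \\
\vdots & \ddots & \vdots \\
P_{n1} & \cdots & P_{nn}
\end{bmatrix}
$$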


### Markov Reward Process

- MRPs are basically Markov chains with values: each state carries a reward, and future rewards are discounted.

- Gamma is the discount factor; it ranges from 0 to 1.

- Gamma close to 0 gives a myopic evaluation of state and action.

- Gamma close to 1 gives a far-sighted evaluation of state and action (the return definition below shows why).
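
The effect of gamma is visible in the return $G_t$, the total discounted reward from time step $t$: a reward $k$ steps in the future is scaled by $\gamma^k$, so $\gamma \approx 0$ cares mostly about the immediate reward while $\gamma \approx 1$ weighs the long run:

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
$$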

### Value function and Bellman equation

- The value function gives the total expected reward when starting in that state.
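
For an MRP this is the expected return from the state, and the Bellman equation decomposes it into the immediate reward plus the discounted value of the successor state:

$$
v(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}[R_{t+1} + \gamma \, v(S_{t+1}) \mid S_t = s]
$$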

