From c89d39af8b1966d46c1e3c408a8dda9d26753bb0 Mon Sep 17 00:00:00 2001 From: Shri Vishakh Devanand <78346797+LOuLOu-THEKing@users.noreply.github.com> Date: Tue, 15 Oct 2024 22:49:36 +0530 Subject: [PATCH] Added RL Notes (#3) --- week1/week1.md | 104 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 104 insertions(+) diff --git a/week1/week1.md b/week1/week1.md index e69de29b..724a6c00 100644 --- a/week1/week1.md +++ b/week1/week1.md @@ -0,0 +1,104 @@ +## Overview + +- What is RL? + +- Why is RL Better? + +- Elements of RL + +- K-Armed bandits and maths +### What is RL + - RL doesn't really require a data set, it works on reward hypothesis and tries to achieve the maximum reward instead of learning it using previously played data. + - When there is a delayed feedback (which works very well for RL, since other types of supervised learning can't) + + - traditional RL performs way better in controlled environments than traditional RL. + + - Read more about AlphaZero and StockFish fights. + + - AlphaZero can play multiple games (using very minimal code changes,[probably just changing the rules]). + + +### Fundamentals of RL + + - policy - defines the learning agent's way of behaving at a given time. + + - model - the agent's representation of the environment. + + - History - the sequence of observations, actions and rewards. + + - Reward signal - defines the goal of a RL problem. + + - State - a function of history. + + - Information State - contains all the useful informatio from history. + + - State Value - total reward at a particular time. + + - Action Value - the total reward an agent can expect to accumulate if it takes that action. + + - Transitions - predicts the next state. + + +### K-Armed Bandits & maths + + - A policy where we keep exploitation the first state's maximum value. + + - A place where you check everything everytime is called as exploration. + + - the gradient of these two is called as epsilon greedy exploration. + +$$ +Q_t(a) = \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} += \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}} +$$ + +$$ +Q_{n+1} = \frac{1}{n} \sum_{i=1}^{n} R_i +$$ + +$$ += \frac{1}{n} \left( R_n + \sum_{i=1}^{n-1} R_i \right) +$$ + +$$ += \frac{1}{n} \left( R_n + (n-1) \frac{1}{n-1} \sum_{i=1}^{n-1} R_i \right) +$$ + +$$ += \frac{1}{n} \left( R_n + (n-1) Q_n \right) +$$ + +$$ += \frac{1}{n} \left( R_n + n Q_n - Q_n \right) +$$ + +$$ += Q_n + \frac{1}{n} \left( R_n - Q_n \right) +$$ + + +### Markov Property + + - The future state is independent of the past and depends only on the present state. + + - it could be said it's a tuple containing a finite set of states and their transition probabilities. + +### State transition matrix + - convert each part of the equation into a matrix. + + +### Markov Reward process + + - MRP are basically with values. + + - Gamma is the discount factor, it ranges from 0 to 1. + + - Gamma close to 0 gives a myopic evaluation of state and action. + + - Gamma close to 1 gives a far-sighted evaluation of state and action. + +### Value function and bellman equation + + - Gives the total expected reward if starting in that state. + +