Solving the example using Value Iteration
VI should make much more sense once we complete an example problem, so let's get back to our golf MDP. We have formalised this as an MDP, but at present the agent doesn't know the best strategy for playing golf, so let's solve the golf MDP using VI.
We'll start by defining our model parameters, using fairly standard values:
γ = 0.9  // discount factor
θ = 0.01 // convergence threshold
We will then initialise our value table to 0 for each state in S:
// value table
V(s0) = 0
V(s1) = 0
V(s2) = 0
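If you want to follow along in code, here is one way the golf MDP could be encoded in Python. The probabilities and rewards are exactly the ones used in the calculations below; the dictionary layout and names are just illustrative choices, not part of the formalisation itself.

```python
# One possible encoding of the golf MDP (layout and names are illustrative).
# MDP[state][action] = list of (probability, next_state, reward) outcomes,
# matching the T(s' | s, a) and R(s, a, s') values used below.
GAMMA = 0.9   # discount factor
THETA = 0.01  # convergence threshold

MDP = {
    "s0": {  # ball on fairway
        "hit to green": [(0.1, "s0", 0), (0.9, "s1", 0)],
    },
    "s1": {  # ball on green
        "hit to fairway": [(0.9, "s0", 0), (0.1, "s1", 0)],
        "hit in hole":    [(0.1, "s1", 0), (0.9, "s2", 10)],
    },
    "s2": {},  # in hole: terminal state, no actions
}

# value table, initialised to 0 for every state
V = {s: 0.0 for s in MDP}
```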
We can now start the outer loop:
Δ = 0
And three passes of the inner loop, one for each state in S:
// Bellman update rule
// V(s) ← maxₐ Σₛ′,ᵣ p(s′, r | s, a) * [r + γV(s′)]
//******************* state s0 *******************//
v = 0
// we have only looked at one action here, as only one is available from s0
// we know the others are not possible and would therefore sum to 0
V(s0) = max[T(s0 | s0, hit to green) * (R(s0, hit to green, s0) + γ * V(s0)) + T(s1 | s0, hit to green) * (R(s0, hit to green, s1) + γ * V(s1))]
V(s0) = max[0.1 * (0 + 0.9 * 0) + 0.9 * (0 + 0.9 * 0)]
V(s0) = max[0] = 0
// Delta update rule
// Δ ← max(Δ, |v – V(s)|)
Δ = max[Δ, |v – V(s0)|] = max[0, |0 – 0|] = 0
//******************* state s1 *******************//
v = 0
// there are 2 available actions here
V(s1) = max[T(s0 | s1, hit to fairway) * (R(s1, hit to fairway, s0) + γ * V(s0)) + T(s1 | s1, hit to fairway) * (R(s1, hit to fairway, s1) + γ * V(s1)), T(s1 | s1, hit in hole) * (R(s1, hit in hole, s1) + γ * V(s1)) + T(s2 | s1, hit in hole) * (R(s1, hit in hole, s2) + γ * V(s2))]
V(s1) = max[0.9 * (0 + 0.9 * 0) + 0.1 * (0 + 0.9 * 0), 0.1 * (0 + 0.9 * 0) + 0.9 * (10 + 0.9 * 0)]
V(s1) = max[0, 9] = 9
Δ = max[Δ, |v – V(s1)|] = max[0, |0 – 9|] = 9
//******************* state s2 *******************//
// terminal state with no actions
This gives us the following update to our value table:
V(s0) = 0
V(s1) = 9
V(s2) = 0
We don't need to worry about s2, as this is a terminal state, meaning no actions are possible here.
We now break out of the inner loop and continue the outer loop, performing a convergence check:
Δ < θ = 9 < 0.01 = False
Since we have not converged, we do a second iteration of the outer loop:
Δ = 0
And another three passes of the inner loop, using the updated value table:
//******************* state s0 *******************//
v = 0
V(s0) = max[T(s0 | s0, hit to green) * (R(s0, hit to green, s0) + γ * V(s0)) + T(s1 | s0, hit to green) * (R(s0, hit to green, s1) + γ * V(s1))]
V(s0) = max[0.1 * (0 + 0.9 * 0) + 0.9 * (0 + 0.9 * 9)]
V(s0) = max[7.29] = 7.29
Δ = max[Δ, |v – V(s0)|] = max[0, |0 – 7.29|] = 7.29
//******************* state s1 *******************//
v = 9
V(s1) = max[T(s0 | s1, hit to fairway) * (R(s1, hit to fairway, s0) + γ * V(s0)) + T(s1 | s1, hit to fairway) * (R(s1, hit to fairway, s1) + γ * V(s1)), T(s1 | s1, hit in hole) * (R(s1, hit in hole, s1) + γ * V(s1)) + T(s2 | s1, hit in hole) * (R(s1, hit in hole, s2) + γ * V(s2))]
V(s1) = max[0.9 * (0 + 0.9 * 7.29) + 0.1 * (0 + 0.9 * 9), 0.1 * (0 + 0.9 * 9) + 0.9 * (10 + 0.9 * 0)]
V(s1) = max[6.7149, 9.81] = 9.81
Δ = max[Δ, |v – V(s1)|] = max[7.29, |9 – 9.81|] = 7.29
//******************* state s2 *******************//
// terminal state with no actions
At the end of the second iteration, our values are:
V(s0) = 7.29
V(s1) = 9.81
V(s2) = 0
We check convergence once again:
Δ < θ → 7.29 < 0.01 → False
Still no convergence, so we continue the same process as above until Δ < θ. I won't show all the calculations; the two iterations above are enough to understand the process.
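If you would rather let a machine grind through the remaining arithmetic, here is a minimal sketch of the full VI loop in Python, assuming the MDP dictionary defined earlier. Note that it updates V(s) in place within each sweep, exactly as the hand calculations above did.

```python
def value_iteration(mdp, gamma=0.9, theta=0.01):
    """Minimal in-place VI sweep over the golf MDP defined above."""
    V = {s: 0.0 for s in mdp}
    iteration = 0
    while True:
        iteration += 1
        delta = 0.0
        for s, actions in mdp.items():
            if not actions:  # terminal state: value stays at 0
                continue
            v = V[s]
            # Bellman update: V(s) <- max_a sum_{s',r} p(s',r|s,a) * (r + gamma*V(s'))
            V[s] = max(
                sum(p * (r + gamma * V[s_next]) for p, s_next, r in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(v - V[s]))
        print(iteration, V, delta, delta < theta)
        if delta < theta:
            return V
```

Running value_iteration(MDP) prints one row per sweep and should reproduce the table below.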
After 6 iterations, our values converge. Here are the values and Δ as they change over each iteration:
Iteration   V(s0)          V(s1)          V(s2)   Δ              Converged
1           0              9              0       9              False
2           7.29           9.81           0       7.29           False
3           8.6022         9.8829         0       1.3122         False
4           8.779447       9.889461       0       0.177247       False
5           8.80061364     9.89005149     0       0.02116664     False
6           8.8029969345   9.8901046341   0       0.0023832945   True
Now we can extract our policy:
// Policy extraction rule
// π(s) = argmaxₐ Σₛ′,ᵣ p(s′, r | s, a) * [r + γV(s′)]
//******************* state s0 *******************//
// we know there is only one possible action from s0, but let's do it anyway
π(s0) = argmax[T(s0 | s0, hit to green) * (R(s0, hit to green, s0) + γ * V(s0)) + T(s1 | s0, hit to green) * (R(s0, hit to green, s1) + γ * V(s1))]
π(s0) = argmax[0.1 * (0 + 0.9 * 8.8029969345) + 0.9 * (0 + 0.9 * 9.8901046341)]
π(s0) = argmax[8.80325447773]
π(s0) = hit to green
//******************* state s1 *******************//
π(s1) = argmax[T(s0 | s1, hit to fairway) * (R(s1, hit to fairway, s0) + γ * V(s0)) + T(s1 | s1, hit to fairway) * (R(s1, hit to fairway, s1) + γ * V(s1)), T(s1 | s1, hit in hole) * (R(s1, hit in hole, s1) + γ * V(s1)) + T(s2 | s1, hit in hole) * (R(s1, hit in hole, s2) + γ * V(s2))]
π(s1) = argmax[0.9 * (0 + 0.9 * 8.8029969345) + 0.1 * (0 + 0.9 * 9.8901046341), 0.1 * (0 + 0.9 * 9.8901046341) + 0.9 * (10 + 0.9 * 0)]
π(s1) = argmax[8.02053693401, 9.89010941707]
π(s1) = hit in hole
Our final policy is:
π(s0) = hit to green
π(s1) = hit in hole
π(s2) = terminal state (no action)
So, when our agent is in the "ball on fairway" state (s0), the best action is to hit to green. This seems fairly obvious, since it is the only available action. In s1, however, where there are two possible actions, our policy has learned to hit in hole. We can now give this learned policy to other agents who want to play golf!
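For completeness, here is how the extraction step could be sketched in Python, again assuming the MDP dictionary from earlier and the values returned by the VI sketch above:

```python
def extract_policy(mdp, V, gamma=0.9):
    """Greedy policy: for each state, pick the action with the highest
    expected one-step return under the converged value table V."""
    policy = {}
    for s, actions in mdp.items():
        if not actions:  # terminal state: no action to take
            policy[s] = None
            continue
        policy[s] = max(
            actions,
            key=lambda a: sum(
                p * (r + gamma * V[s_next]) for p, s_next, r in actions[a]
            ),
        )
    return policy

# extract_policy(MDP, value_iteration(MDP))
# should return {'s0': 'hit to green', 's1': 'hit in hole', 's2': None}
```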
And there you have it! We have just solved a very simple RL problem using Value Iteration.