
Tuesday, February 12, 2013

Lecture 8

You will find that the policy improvement algorithm is useful in answering Questions 3 and 4 on Examples Sheet 2. In Question 3, you will find λ and ϕ for a certain policy ("On seeing a filling station, stop and fill the tank"), and then look for a condition under which the policy improvement algorithm cannot improve it (or, equivalently, under which this λ and ϕ satisfy the optimality equation).

In Question 4 you will use the policy improvement idea in the context of a discounted-cost problem. You find F(x) for a simple policy (in which the engineer allocates his time randomly), and then improve it by a step of the policy improvement algorithm. This leads to a policy in which the engineer puts all his maintenance effort into the machine with the greatest value of $c_i(x_i+q_i)/(1-\beta b_i)$. This policy is better, but may not yet be optimal.
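As an aside, here is a minimal sketch of a single policy-improvement step for a generic finite discounted-cost MDP. This is not the examples-sheet problem itself; the arrays P and c and the function name are my own notation. The idea is exactly the one used in Question 4: evaluate the current policy by solving the linear equations F = c_π + βP_πF, then choose in each state the action that minimises the right-hand side of the optimality equation.

```python
import numpy as np

def policy_improvement_step(P, c, beta, policy):
    """One policy-improvement step for a finite discounted-cost MDP.

    P[u] is the transition matrix under action u (shape A x n x n),
    c[u] is the one-step cost vector under action u (shape A x n),
    beta is the discount factor, and policy[x] is the current action in state x.
    """
    n = P.shape[1]
    # Policy evaluation: solve F = c_pi + beta * P_pi * F as a linear system.
    P_pi = np.array([P[policy[x], x, :] for x in range(n)])
    c_pi = np.array([c[policy[x], x] for x in range(n)])
    F = np.linalg.solve(np.eye(n) - beta * P_pi, c_pi)
    # Policy improvement: in each state choose the action minimising
    # c(x, u) + beta * E[F(x_1) | x_0 = x, u_0 = u].
    Q = c + beta * P @ F
    return Q.argmin(axis=0), F
```

Repeating evaluation and improvement until the policy stops changing is exactly policy iteration; in Question 4 a single improvement step can be done by hand.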

In Question 1 you will want to mimic the calculation done in the proof of Bruss's odds algorithm. You cannot solve this by simply applying the algorithm directly.
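For reference, here is a minimal sketch of the odds algorithm itself (the function name and 0-based indexing are mine): given independent indicators with success probabilities p_1, …, p_n, sum the odds r_i = p_i/(1−p_i) backwards from the end until the running total first reaches 1, and then stop at the first success from that index onwards. In Question 1, though, the point is to rework the calculation in the proof, not to call the algorithm as a black box.

```python
def odds_algorithm(p):
    """Bruss's odds algorithm: stop on the last success among independent
    indicator events with success probabilities p[0], ..., p[n-1].

    Returns (s, win_prob): stop at the first success at index s or later
    (0-based); win_prob is the probability this rule stops on the last success.
    """
    q = [1.0 - pi for pi in p]
    r = [pi / qi for pi, qi in zip(p, q)]     # odds, assuming every p_i < 1
    total, s = 0.0, 0
    for i in range(len(p) - 1, -1, -1):       # sum the odds from the end
        total += r[i]
        if total >= 1.0:
            s = i
            break
    prod_q = 1.0
    for qi in q[s:]:
        prod_q *= qi
    win_prob = prod_q * sum(r[s:])
    return s, win_prob
```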

There is another interesting way to motivate the optimality equation in the average-cost case. This can be made rigorous and helps us understand the relative value function ϕ(x).

Let F(x) be the infinite-horizon value function when the discount factor is β. Then we know this satisfies the optimality equation
$$F(x) = \min_u\{\, c(x,u) + \beta E[\,F(x_1) \mid x_0 = x,\ u_0 = u\,]\,\}.$$
Pick some state, say state 0. By subtracting βF(0) from both sides of the above, we obtain
$$F(x) - F(0) + (1-\beta)F(0) = \min_u\{\, c(x,u) + \beta E[\,F(x_1) - F(0) \mid x_0 = x,\ u_0 = u\,]\,\}.$$
One can show that as $\beta \to 1$ we have $F(x) - F(0) \to \phi(x)$ and $(1-\beta)F(0) \to \lambda$ (the average-cost). Thus we obtain
$$\phi(x) + \lambda = \min_u\{\, c(x,u) + E[\,\phi(x_1) \mid x_0 = x,\ u_0 = u\,]\,\}$$
and this is our average-cost optimality equation. If you would like to understand why $(1-\beta)F(0) \to \lambda$, see this small note about the connection between the average-cost problem and the discounted-cost problem with β near 1.
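One can also see this limit numerically. The sketch below uses a made-up three-state, two-action MDP (none of the data comes from the course): for such a small problem the discounted value function F can be computed exactly by evaluating every deterministic stationary policy, and as β → 1 the printed quantities (1−β)F(0) and F(x)−F(0) settle towards λ and ϕ(x).

```python
import itertools
import numpy as np

# A made-up 3-state, 2-action discounted-cost MDP, purely for illustration.
rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)        # make each row a probability vector
c = rng.random((n_actions, n_states))    # one-step costs, c[u, x] = c(x, u)

def optimal_discounted_F(beta):
    """Exact optimal discounted value function, found by evaluating every
    deterministic stationary policy and taking the pointwise minimum."""
    best = None
    for policy in itertools.product(range(n_actions), repeat=n_states):
        P_pi = np.array([P[policy[x], x] for x in range(n_states)])
        c_pi = np.array([c[policy[x], x] for x in range(n_states)])
        F = np.linalg.solve(np.eye(n_states) - beta * P_pi, c_pi)
        best = F if best is None else np.minimum(best, F)
    return best

for beta in [0.9, 0.99, 0.999, 0.9999]:
    F = optimal_discounted_F(beta)
    # (1 - beta) F(0) should tend to lambda, and F(x) - F(0) to phi(x).
    print(beta, (1 - beta) * F[0], F - F[0])
```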


It is also interesting to think about the following (which I mentioned briefly in lectures today). Suppose we have a deterministic stationary Markov policy, say π, with u = f(x). Suppose we have λ and ϕ such that
$$\phi(x) + \lambda = c(x, f(x)) + \sum_y \phi(y)\, P(y \mid x) \qquad (7.6)$$
where P(y | x) is the transition matrix under π.
Suppose π induces an ergodic Markov chain (i.e. a Markov chain that is irreducible and positive recurrent) and this has invariant distribution μ. We know that
$$\mu_y = \sum_x \mu_x\, P(y \mid x) \qquad (7.7)$$
Multiplying (7.6) through by $\mu_x$ and then summing over x, we get
$$\sum_x \mu_x \phi(x) + \lambda \sum_x \mu_x = \sum_x \mu_x\, c(x, f(x)) + \sum_y \sum_x \mu_x\, \phi(y)\, P(y \mid x),$$
which, using (7.7), gives
$$\sum_x \mu_x \phi(x) + \lambda \sum_x \mu_x = \sum_x \mu_x\, c(x, f(x)) + \sum_y \mu_y \phi(y).$$
Then using $\sum_x \mu_x = 1$, and cancelling the terms that are equal on both sides, we have
$$\lambda = \sum_x \mu_x\, c(x, f(x))$$
and so we see that λ is the average-cost of policy π.
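This identity is easy to check numerically on any small example. Below is a minimal sketch (the function name and the two-state chain are invented for illustration): it solves (7.6) for ϕ and λ with the normalisation ϕ(0) = 0, computes the stationary distribution μ as the left eigenvector of the transition matrix for eigenvalue 1, and confirms that λ agrees with $\sum_x \mu_x\, c(x, f(x))$.

```python
import numpy as np

def check_average_cost(P_pi, c_pi):
    """Solve phi(x) + lambda = c(x, f(x)) + sum_y P(y | x) phi(y), taking
    phi(0) = 0, and compare lambda with the stationary average cost."""
    n = len(c_pi)
    # Unknowns: phi(1), ..., phi(n-1) and lambda (phi(0) is fixed at 0).
    A = np.zeros((n, n))
    A[:, :n - 1] = (np.eye(n) - P_pi)[:, 1:]   # coefficients of phi(y), y >= 1
    A[:, n - 1] = 1.0                          # coefficient of lambda
    sol = np.linalg.solve(A, c_pi)
    lam = sol[-1]
    # Stationary distribution mu: left eigenvector of P_pi with eigenvalue 1.
    w, v = np.linalg.eig(P_pi.T)
    mu = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    mu /= mu.sum()
    return lam, mu @ c_pi                      # the two numbers should agree

# A two-state example: both returned numbers come out as 11/7.
P_pi = np.array([[0.6, 0.4], [0.3, 0.7]])
c_pi = np.array([1.0, 2.0])
print(check_average_cost(P_pi, c_pi))
```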