Tuesday, February 12, 2013

Lecture 8

You will find that the policy improvement algorithm is useful in answering Questions 3 and 4 on Examples Sheet 2. In Question 3, you will find $\lambda$ and $\phi$ for a certain policy ("On seeing a filling station, stop and fill the tank"), and then look for a condition under which the policy improvement algorithm does not improve it (or, equivalently, under which these $\lambda$ and $\phi$ satisfy the optimality equation).

In Question 4 you will use the policy improvement idea in the context of a discounted-cost problem. You find $F(x)$ for a simple policy (in which the engineer allocates his time randomly), and then improve it with one step of the policy improvement algorithm. This leads to a policy in which the engineer puts all his maintenance effort into the machine with the greatest value of $c_i(x_i+q_i)/(1-\beta b_i)$. This policy is better, but may not yet be optimal.
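As a generic illustration (not the examples-sheet problem itself), here is a minimal sketch of one step of the policy improvement algorithm for a discounted-cost MDP: evaluate the current policy by solving the linear equations $F = c_\pi + \beta P_\pi F$, then choose at each state the action that minimises the right-hand side of the optimality equation. The states, costs and transition matrices below are invented purely for illustration.

```python
import numpy as np

# One policy-improvement step for a discounted-cost MDP.
# The data here are made up; they are not the examples-sheet problem.

beta = 0.9                      # discount factor
n_states, n_actions = 3, 2

# c[x, u] = one-step cost; P[u][x, y] = transition probability under action u
c = np.array([[1.0, 2.0],
              [0.5, 1.5],
              [2.0, 0.2]])
P = [np.array([[0.5, 0.5, 0.0],
               [0.1, 0.8, 0.1],
               [0.0, 0.3, 0.7]]),
     np.array([[0.9, 0.1, 0.0],
               [0.2, 0.2, 0.6],
               [0.3, 0.3, 0.4]])]

policy = np.zeros(n_states, dtype=int)   # start with an arbitrary policy

# Policy evaluation: solve F = c_pi + beta * P_pi F as a linear system
P_pi = np.array([P[policy[x]][x] for x in range(n_states)])
c_pi = np.array([c[x, policy[x]] for x in range(n_states)])
F = np.linalg.solve(np.eye(n_states) - beta * P_pi, c_pi)

# Policy improvement: take the minimizing action in the optimality equation
Q = np.array([[c[x, u] + beta * P[u][x] @ F for u in range(n_actions)]
              for x in range(n_states)])
new_policy = Q.argmin(axis=1)
print("F for current policy:", F)
print("improved policy:", new_policy)
```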

In Question 1 you will want to mimic the calculation done in the proof of Bruss's odds algorithm. You cannot solve this by simply applying the algorithm directly.

There is another interesting way to motivate the optimality equation in the average-cost case. This can be made rigorous and helps us understand the relative value function $\phi(x)$.

Let $F(x)$ be the infinite-horizon value function when the discount factor is $\beta$. Then we know this satisfies the optimality equation
$F(x) = \min_u \{ c(x,u) +\beta E[ F(x_1) \mid x_0=x,u_0=u ] \}$ .
Pick some state, say state 0. By subtracting $\beta F(0)$ from both sides of the above, we obtain
$F(x) - F(0) + (1-\beta)F(0) = \min_u \{ c(x,u) + \beta E[ F(x_1) - F(0) \mid x_0=x, u_0=u ] \}$.
One can show that as $\beta\to 1$ we have $F(x) - F(0) \to \phi(x)$ and $(1-\beta)F(0) \to \lambda$ (the average-cost). Thus we obtain
$\phi(x) + \lambda = \min_u \{ c(x,u) + E[ \phi(x_1) \mid x_0=x,u_0=u ] \}$ 
and this is our average-cost optimality equation. If you would like to understand why $(1-\beta)F(0) \to \lambda$, see this small note about the connection between the average-cost problem and the discounted-cost problem with $\beta$ near 1.
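A quick way to convince yourself of these limits is numerically: for a small MDP, compute the optimal discounted value function $F_\beta$ by value iteration for $\beta$ close to 1 and watch $(1-\beta)F_\beta(0)$ and $F_\beta(x)-F_\beta(0)$ settle down. The sketch below uses made-up data, chosen only to illustrate the convergence.

```python
import numpy as np

# Numerical illustration (toy data) of the limits
#   (1 - beta) F_beta(0) -> lambda   and   F_beta(x) - F_beta(0) -> phi(x)
# as beta -> 1, where F_beta is the optimal discounted-cost value function.

n_states, n_actions = 3, 2
c = np.array([[1.0, 2.0],
              [0.5, 1.5],
              [2.0, 0.2]])
P = [np.array([[0.5, 0.5, 0.0],
               [0.1, 0.8, 0.1],
               [0.0, 0.3, 0.7]]),
     np.array([[0.9, 0.1, 0.0],
               [0.2, 0.2, 0.6],
               [0.3, 0.3, 0.4]])]

def discounted_F(beta, iters=20000):
    """Value iteration for F(x) = min_u { c(x,u) + beta E[F(x_1)] }."""
    F = np.zeros(n_states)
    for _ in range(iters):
        F = np.min(c + beta * np.stack([P[u] @ F for u in range(n_actions)],
                                       axis=1), axis=1)
    return F

for beta in [0.9, 0.99, 0.999]:
    F = discounted_F(beta)
    print(f"beta={beta}: (1-beta)F(0) = {(1-beta)*F[0]:.4f}, "
          f"F(x)-F(0) = {np.round(F - F[0], 4)}")
```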


It is also interesting to think about the following (which I mentioned briefly in lectures today). Suppose we have a deterministic stationary Markov policy, say $\pi$, with $u=f(x)$. Suppose we have $\lambda$ and $\phi$ such that
$\phi(x) + \lambda = c(x,f(x)) + \sum_y \phi(y) P(y \mid x)$     (7.6)
where $P(y \mid x)$ is the transition matrix under $\pi$.
Suppose $\pi$ induces an ergodic Markov chain (i.e. a Markov chain that is irreducible and positive recurrent) and this has invariant distribution $\mu$. We know that
$\mu_y = \sum_x \mu_x P(y \mid x)$     (7.7)
Multiplying (7.6) through by $\mu_x$ and then summing over $x$, we get
$\sum_x \mu_x \phi(x) + \lambda \sum_x \mu_x = \sum_x \mu_x c(x,f(x)) + \sum_y \sum_x \mu_x \phi(y) P(y \mid x)$,
which, using (7.7), gives
$\sum_x \mu_x \phi(x) + \lambda \sum_x \mu_x = \sum_x \mu_x c(x,f(x)) + \sum_y \mu_y \phi(y)$.
Then using $\sum_x \mu_x = 1$, and cancelling the terms that are equal on both sides, we have
$\lambda = \sum_x \mu_x c(x,f(x))$,
and so we see that $\lambda$ is the average-cost of policy $\pi$.
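Here is a small numerical check of this argument (the chain and costs below are made up): solve (7.6) for $\phi$ and $\lambda$ under the fixed policy, normalising $\phi(0)=0$, compute the invariant distribution $\mu$, and confirm that $\lambda = \sum_x \mu_x c(x,f(x))$.

```python
import numpy as np

# Sketch: the lambda solving (7.6) equals the stationary average cost of the
# policy. The transition matrix and costs are invented for illustration.

P_pi = np.array([[0.5, 0.5, 0.0],        # transition matrix under the policy
                 [0.1, 0.8, 0.1],
                 [0.0, 0.3, 0.7]])
c_pi = np.array([1.0, 0.5, 2.0])          # c(x, f(x))
n = len(c_pi)

# Solve phi(x) + lambda = c(x,f(x)) + sum_y P(y|x) phi(y), with phi(0) = 0.
# Unknowns: (phi(1), ..., phi(n-1), lambda).
A = np.zeros((n, n))
A[:, :n-1] = np.eye(n)[:, 1:] - P_pi[:, 1:]   # coefficients of phi(1..n-1)
A[:, n-1] = 1.0                               # coefficient of lambda
sol = np.linalg.solve(A, c_pi)
phi = np.concatenate(([0.0], sol[:n-1]))
lam = sol[n-1]

# Invariant distribution mu: mu P = mu, sum(mu) = 1.
evals, evecs = np.linalg.eig(P_pi.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
mu = mu / mu.sum()

print("lambda from (7.6):          ", lam)
print("average cost sum mu_x c(x): ", mu @ c_pi)   # should agree
```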