Tuesday, February 2, 2016

Burglar problem

I think it will be helpful if I say something about Examples Sheet 1, #8.

Since this is a case of positive programming and $F(x)=x$ satisfies the optimality equation in the region $x\geq x^*=(1-q)/(q\lambda)$, it follows from Theorem 4.2 that it is optimal for the burglar to retire if he has accumulated loot of at least $x^*$.
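
To see the verification explicitly: for $x\geq x^*$, every $y\geq 0$ gives $x+y\geq x^*$, so $F(x+y)=x+y$ and the continuation value in the optimality equation is
\[
(1-q)\int_0^\infty (x+y)\,\lambda e^{-\lambda y}\,dy \;=\; (1-q)\left(x+\tfrac{1}{\lambda}\right),
\]which is $\leq x$ exactly when $qx\geq (1-q)/\lambda$, i.e. $x\geq x^*$. So the maximum in the optimality equation is attained by $x$, and $F(x)=x$ does solve the equation in this region.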

The difficulty in this problem lies in whether an optimal policy exists when $x<x^*$. We cannot use Theorem 5.2 to claim that an optimal policy exists, because we are not in a D or N programming case.

Let $x$ be his accumulated loot so far. We know that finite-horizon problems are easier to analyse, so we might look at a problem in which the burglar may burgle for at most $s$ further days, with optimality equation
\[
F_s(x) = \max\left[\,x,\ (1-q)\int_0^\infty F_{s-1}(x+y)\,\lambda e^{-\lambda y}\,dy\right],
\]with $F_0(x)=x$. For this problem an OSLA rule is optimal. It is easily proved by induction that it is optimal to follow the policy $\pi^*$, defined as "stop iff $x\geq x^*$".
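
Here is a minimal numerical sketch of this recursion (my own illustration, not part of the problem), assuming the illustrative values $q=0.05$ and $\lambda=1$, so that $x^*=19$. It iterates $F_s$ on a grid, estimating the integral by Monte Carlo, and checks that the smallest $x$ at which stopping is optimal sits near $x^*$.

```python
import numpy as np

# Illustrative parameters (my choice, not from the problem).
q, lam = 0.05, 1.0
x_star = (1 - q) / (q * lam)              # threshold (1-q)/(q*lambda) = 19

rng = np.random.default_rng(0)
xs = np.linspace(0.0, 1.5 * x_star, 301)  # grid of loot levels x
F = xs.copy()                             # F_0(x) = x
Y = rng.exponential(1.0 / lam, 20_000)    # samples of the daily increment
Z = xs[:, None] + Y[None, :]              # x + Y for every grid point

for s in range(1, 61):
    # Continuation value (1-q) E[F_{s-1}(x+Y)], estimated by Monte Carlo.
    # Off the grid we use F_{s-1}(z) = z, valid since z > 1.5 x* > x* there.
    vals = np.where(Z <= xs[-1],
                    np.interp(Z.ravel(), xs, F).reshape(Z.shape),
                    Z)
    cont = (1 - q) * vals.mean(axis=1)
    F = np.maximum(xs, cont)

# Smallest grid point at which stopping is (weakly) optimal.
print(x_star, xs[np.argmax(xs >= cont)])  # both should be about 19
```

The same check works at every horizon $s$, which is just the OSLA/induction statement above in numerical form.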

One way to proceed would be to calculate $F(\pi^*,x)$ explicitly for $x<x^*$ and show that it satisfies the dynamic programming equation. This is possible, but difficult.

An alternative is to show that $F(x)\leq F(\pi^*,x)$. To do that, let's consider a problem with artificially greater rewards. Suppose that $x_0=x<x^*$, and that on the $(s+1)$th day of burglary the burglar receives (if he is not caught) his usual exponentially distributed increment, plus an additional increment of $x^*$. Having received all of this, he will certainly want to stop, since he now has more loot than $x^*$. But in this problem with greater rewards, the maximal reward the burglar can obtain is certainly bounded above by
\[
F_s(x)+(1-q)^s(x+(s+1)\tfrac{1}{\lambda} + x^*),
\]which is the maximum amount he could obtain by stopping after no more than $s$ days, plus an upper bound on the expected reward he will obtain if he stops after the $(s+1)$th day. The second term accounts for the possibility that he is caught before day $s+1$ (through the multiplying factor of $(1-q)^s$), but ignores the fact that he cannot actually collect both this and the reward from stopping somewhere amongst the first $s$ days. The important thing is that the second term tends to $0$ as $s\to\infty$, since the geometric factor $(1-q)^s$ decays faster than the bracketed term grows. This implies that the maximal amount he could get in the original problem, $F(x)$, satisfies $F(x)\leq \lim_{s\to\infty}F_s(x)=\lim_{s\to\infty}F_s(\pi^*,x)=F(\pi^*,x)$. So in fact $\pi^*$ is an optimal policy.
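
Written out, the inequality behind that last step is
\[
F(x) \;\leq\; F_s(\pi^*,x) + (1-q)^s\left(x+(s+1)\tfrac{1}{\lambda}+x^*\right),
\]where $F_s(x)=F_s(\pi^*,x)$ because $\pi^*$ is optimal in every finite-horizon problem; letting $s\to\infty$, the second term vanishes, leaving $F(x)\leq F(\pi^*,x)$.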