I think it will be helpful if I say something about Examples Sheet 1, #8.
Since this is a case of positive programming and $F(x)=x$ satisfies the optimality equation in the region $x\ge x^*=(1-q)/(q\lambda)$, it follows from Theorem 4.2 that it is optimal for the burglar to retire if he has accumulated loot of at least $x^*$.
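To see why $F(x)=x$ satisfies the optimality equation in that region, a one-line check (using only the mean $1/\lambda$ of the exponential increment) shows that stopping beats one further burglary exactly when $x\ge x^*$:
\[
(1-q)\int_0^\infty (x+y)\,\lambda e^{-\lambda y}\,dy=(1-q)\Bigl(x+\tfrac{1}{\lambda}\Bigr)\le x
\iff qx\ge \frac{1-q}{\lambda}
\iff x\ge \frac{1-q}{q\lambda}=x^*.
\]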
The difficulty in this problem lies in whether or not an optimal policy exists when $x<x^*$. We cannot use Theorem 5.2 to claim that an optimal policy exists, because we are not in a D or N programming case.
Let $x$ be his accumulated loot so far. We know that finite-horizon problems are easier to analyse, so we might look at a problem in which the burglar may burgle on at most $s$ further days, with optimality equation
\[
F_s(x)=\max\Bigl[\,x,\;(1-q)\int_0^\infty F_{s-1}(x+y)\,\lambda e^{-\lambda y}\,dy\Bigr],
\qquad F_0(x)=x.
\]
For this problem an OSLA rule is optimal. It is easily proved by induction that it is optimal to follow the policy $\pi^*$, defined as "stop iff $x\ge x^*=(1-q)/(q\lambda)$".
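As a quick numerical sanity check (not part of the examples-sheet solution), here is a minimal sketch of this finite-horizon value iteration in Python; the parameters $q$ and $\lambda$ are illustrative assumptions. It iterates the recursion on a grid over $[0,x^*]$, using the fact that $F_{s-1}(z)=z$ for $z\ge x^*$ to evaluate the tail of the integral in closed form, and confirms that continuing beats stopping precisely when $x<x^*$:

```python
import numpy as np

q, lam = 0.2, 1.0                    # illustrative parameters (assumed)
x_star = (1 - q) / (q * lam)         # retirement threshold x* = (1-q)/(q*lambda)

xs = np.linspace(0.0, x_star, 2001)  # grid on [0, x*]; we know F_s(x) = x for x >= x*
F = xs.copy()                        # F_0(x) = x

def continuation(F):
    """Evaluate (1-q) * E[F_{s-1}(x + Y)], Y ~ Exp(lam), at each grid point x."""
    out = np.empty_like(xs)
    for i, x in enumerate(xs):
        a = x_star - x
        if a > 0:
            # Head of the integral (x + y <= x*): interpolate F, trapezoid rule.
            ys = np.linspace(0.0, a, 400)
            vals = np.interp(x + ys, xs, F) * lam * np.exp(-lam * ys)
            head = np.sum(vals[:-1] + vals[1:]) * 0.5 * (ys[1] - ys[0])
        else:
            head = 0.0
        # Tail (x + y > x*), where F_{s-1}(z) = z: closed form e^{-lam*a} (x* + 1/lam).
        tail = np.exp(-lam * a) * (x_star + 1.0 / lam)
        out[i] = (1 - q) * (head + tail)
    return out

for s in range(50):
    c = continuation(F)
    F = np.maximum(xs, c)            # F_s(x) = max(x, continuation value)

# Continuation strictly exceeds x at every grid point below x*: the OSLA threshold is x*.
print(np.all(c[:-1] > xs[:-1]), x_star)   # expect: True 4.0
```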
One way we could proceed would be to calculate $F(\pi^*,x)$ explicitly for $x<x^*$ and show that it satisfies the DP equation. This is possible, but difficult.
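One can at least estimate $F(\pi^*,x)$ by simulation, which is no substitute for that calculation but gives a useful cross-check against the value iteration above (again with assumed illustrative parameters):

```python
import numpy as np

def F_pi_star(x, q=0.2, lam=1.0, n=100_000, seed=0):
    """Monte Carlo estimate of F(pi*, x): keep burgling until loot >= x* or caught."""
    rng = np.random.default_rng(seed)
    x_star = (1 - q) / (q * lam)
    total = 0.0
    for _ in range(n):
        loot = x
        while loot < x_star:
            if rng.random() < q:                 # caught: all loot is lost
                loot = 0.0
                break
            loot += rng.exponential(1.0 / lam)   # a successful day's haul
        total += loot
    return total / n

print(F_pi_star(1.0))   # compare with F(1.0) from the value iteration above
```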
An alternative is to show that $F(x)\le F(\pi^*,x)$. To do that, let's consider a problem with artificially greater rewards. Suppose that $x_0=x<x^*$, and that on the $(s+1)$th day of burglary the burglar receives (if he is not caught) his usual exponentially distributed increment, plus an additional increment of $x^*$. Having received all of this, he will certainly want to stop, since he will now have more loot than $x^*$. But in this problem, with its greater rewards, the maximal reward the burglar can obtain is certainly bounded above by
\[
F_s(x)+(1-q)^s\Bigl(x+(s+1)\tfrac{1}{\lambda}+x^*\Bigr),
\]
which is the maximum amount he could obtain by stopping after no more than $s$ days, plus the expected reward he obtains if he stops after the $(s+1)$th day. The second term accounts for the possibility that he is caught before day $s+1$ (through the multiplying factor of $(1-q)^s$), but ignores the fact that he cannot actually collect both this and the reward from stopping somewhere amongst the first $s$ days. The important thing is that the second term tends to $0$ as $s\to\infty$. This implies that the maximal amount he could get in the original problem, $F(x)$, satisfies
\[
F(x)\le \lim_{s\to\infty}F_s(x)=\lim_{s\to\infty}F_s(\pi^*,x)=F(\pi^*,x),
\]
where the first equality holds because $\pi^*$ is optimal for each finite-horizon problem. So in fact $\pi^*$ is an optimal policy.
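That the second term vanishes is just geometric decay beating linear growth: for fixed $q\in(0,1)$,
\[
(1-q)^s\Bigl(x+(s+1)\tfrac{1}{\lambda}+x^*\Bigr)\longrightarrow 0 \quad\text{as } s\to\infty,
\]
since $(1-q)^s\,s\to 0$.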