Processing math: 100%

Tuesday, February 2, 2016

Burglar problem

I think it will be helpful if I say something about Examples Sheet 1, #8.

Since this is a case of positive programming and F(x)=x satisfies the optimality equation in the region xx=(1q)/qλ, it  follows from Theorem 4.2 that it is optimal for the burglar to retire if he has accumulated loot of at least x.

The difficulty in this problem lies in the question of whether or not  an optimal policy exists when x<x. We cannot use Theorem 5.2 to  claim that an optimal policy exists because we are not in a D or N programming case.

Let x be his accumulated loot so far. We know finite horizon problems are easier to analyse, so we might look at a problem in which the burglar may burgle at most s further days, with optimality equation
Fs(x)=max[x,(1q)0Fs1(x+y)λeλydy],with F0(x)=x. For this problem a OSLA rule is optimal.  It is easily proved by induction that it is optimal to follow policy π, defined as "stop iff xx=(1q)/qλ".

One way we could proceed would be by actually calculating F(π,x) for x<x and showing it satisfies the DP equation. This is possible but difficult.

An alternative is to show that F(x)F(π,x).  To do that, let's consider a problem with artificially greater rewards. Suppose that x0=x<x, and on the (s+1)th day of burglary the burglar receives (if he is not caught) his usual exponentially distributed increment, plus an additional increment of x. Having received all of this, he will certainly want to stop since he will now have more loot than x. But in this problem, with greater rewards, the maximal reward the burglar can obtain is certainly bounded above by
Fs(x)+(1q)s(x+(s+1)1λ+x),which is the maximum amount he could obtain by stopping after no more than s days, plus the expected reward he will obtain if he stops after the (s+1)th day. The second term accounts for the possibility he is caught before s+1 (through the multiplying factor of (1q)s), but ignores the fact that he cannot actually get both this and the reward from stopping somewhere amongst the first s days.  The important thing is that the second term tends to 0 as s. This implies that the maximal amount he could get in the original problem, F(x), satisfies F(x)limsFs(x)=limsFs(π,x)=F(π,x).  So in fact π is an optimal policy.