Thursday, February 28, 2013

Sustainable fishing

At the end of Lecture 13 I introduced the Hamilton-Jacobi-Bellman equation, which is a continuous time version of the optimality equation. In Section 13.3 (which I did not discuss in lectures) there is a presentation of the HJB equation for the continuous-time version of our well-studied LQ regulation problem. This is essentially Examples Sheet 3, Question 3. So you can review that section when you do the examples sheet. It is straightforward.

I have just now added to the notes a Section 13.4. This section is not examinable, but I thought you might find it interesting. The HJB equation is used in this example to deduce whether fishermen will fish a potentially sustainable fish population to extinction, or not. The population follows the dynamics $\dot x=a(x)-u$ when $x>0$, where $u$ is the rate at which fish are extracted. The fishermen are trying to maximize
$$\int_0^\infty u(t)  e^{-\alpha t}dt.$$ Whether the fish population is completely wiped out, or sustained, depends on whether the discounting rate $\alpha$ is greater or less than the rate at which the fish population can grow when it is small, i.e. $a'(0)$. If $\alpha< a'(0)$ then the population will converge to a sustainable positive level $\bar x$, at which the optimal fishing rate is $\bar u$ and $\dot x=a(\bar x)-\bar u=0$.
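If you would like to see the convergence for yourself, here is a minimal simulation sketch (not from the notes): it assumes logistic growth $a(x)=rx(1-x/K)$, a crude bang-bang fishing policy, and, purely for illustration, a target level $\bar x$ at which $a'(\bar x)=\alpha$. The parameters r, K, alpha and u_max are made up.

import numpy as np

# Hypothetical parameters, not taken from the notes.
r, K = 1.0, 1.0            # logistic growth a(x) = r*x*(1 - x/K), so a'(0) = r
alpha = 0.3                # discount rate; here alpha < a'(0)
u_max = 0.5                # maximum fishing rate
dt, T = 0.01, 60.0         # Euler step and horizon

def a(x):
    return r * x * (1 - x / K)

x_bar = 0.5 * K * (1 - alpha / r)   # illustrative target: a'(x_bar) = alpha for logistic growth

x, discounted_catch = 0.9, 0.0
for k in range(int(T / dt)):
    u = u_max if x > x_bar else 0.0           # crude bang-bang fishing policy
    discounted_catch += u * np.exp(-alpha * k * dt) * dt
    x = max(x + (a(x) - u) * dt, 0.0)         # Euler step for xdot = a(x) - u

print(f"final stock {x:.3f} (target {x_bar:.3f}), discounted catch {discounted_catch:.3f}")

With alpha > a'(0) the same policy (the target level being non-positive) fishes the stock down to zero, mirroring the dichotomy described above.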

We are not going to spend much time thinking about how to solve HJB equations directly, because the theory of Pontryagin's Maximum Principle that we will meet in Lecture 14 is more powerful. However, Questions 4 and 5 are about finding a solution to the HJB equation, and you begin these questions by writing down the infinitesimal form of the optimality equation.

Lecture 13

The name "Kalman filter" refers to the estimation equation (13.1) and takes its name from Rudolf Kalman (1930 –), who developed it in the years 1958-64. He also coined the terms controllable and observable, and gave the criteria that we have seen in previous lectures. The fact that a system is controllable iff the matrix $[B\ AB\ \cdots\ A^{n-1}B]$ is of full rank is sometimes called Kalman's criteria. In the IEEE biography of Kalman it is stated
The Kalman filter, and its later extensions to nonlinear problems, represents perhaps the most widely applied by-product of modern control theory. It has been used in space vehicle navigation and control (e.g. the Apollo vehicle), radar tracking algorithms for ABM applications, process control, and socioeconomic systems.
The theory in this lecture is admittedly quite tricky - partly because of the notation. As a test of memory, can you say what roles in the theory are taken by each of these?

 $x_t$,  $u_t$, $A$, $B$, $\epsilon_t$, $y_t$, $C$, $\eta_t$, $\hat x_t$, $\Delta_t$, $\xi_t$, $\zeta_t$, $R$, $S$, $Q$, $K_t$, $\Pi_t$, $N$, $L$, $M$, $H_t$,  $V_t$. 

 You will understand the ideas better once you have worked through the details of a scalar example (in which $n=m=p=1$). You do this in Example Sheet 3 Question 2. I also think that Question 1 is helpful in gaining an appreciation of the duality between control and estimation problems.

You will not be asked to reproduce the statement or proofs of Theorem 13.1 or 13.2 in examinations. You should simply know that $\hat{x}_t$ is computed from $\hat{x}_{t-1}$ and $y_t$ in the linear manner specified by (13.1), and that the covariance matrix $V_t$ satisfies a Riccati equation. You are not expected to memorize Riccati equations.

Notice that the Riccati equation for $V_t$, i.e. $V_t = g\, V_{t-1}$, runs in the opposite time direction to the one we had for $\Pi_t$ in Lecture 10, where $\Pi_{t-1} = f\, \Pi_t$. We are given $V_0$ and $\Pi_h$.
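To make the roles of these symbols concrete, here is a minimal sketch of the scalar case ($n=m=p=1$, with no control input), written with the usual textbook timing convention for the observation; the timing in (13.1) of the notes may differ slightly, and the numbers are made up.

import numpy as np

rng = np.random.default_rng(0)

# Scalar model: x_t = a*x_{t-1} + eps_t with eps_t ~ N(0, N),
#               y_t = c*x_t + eta_t with eta_t ~ N(0, M).
a, c, N, M = 0.9, 1.0, 0.5, 1.0
T = 50

x = 0.0
x_hat, V = 0.0, 1.0                            # prior mean and variance for the initial state
for t in range(1, T + 1):
    x = a * x + rng.normal(0, np.sqrt(N))      # true (hidden) state
    y = c * x + rng.normal(0, np.sqrt(M))      # noisy observation
    x_pred, V_pred = a * x_hat, a * a * V + N  # predict
    K = V_pred * c / (c * c * V_pred + M)      # Kalman gain
    x_hat = x_pred + K * (y - c * x_pred)      # new estimate is linear in old estimate and y_t
    V = (1 - K * c) * V_pred                   # Riccati recursion for V_t, running forward from V_0

print(f"final estimate {x_hat:.3f}, true state {x:.3f}, error variance {V:.3f}")

Notice that the recursion for V runs forward from V_0, whereas the Pi recursion of Lecture 10 runs backward from Pi_h, exactly as remarked above.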

Coda on the puzzle

Here is an amusing sporting fact that resonates with our puzzle in which a decision that is the same under conditions of A true and A false, may not be the right decision when the truth status of A is unknown.

John Howard (London School of Economics) tells me that in cricket it can happen that a batsman attempts to drive the ball but appears to miss (or he may have snicked it). The ball then hits his pad, bounces off, and is caught by a slip fielder. There is an appeal for lbw (leg before wicket). The umpire would have given him out "lbw" if he was sure he had not hit the ball. He would be out "caught" if he had hit the ball. But a batsman cannot be given out unless a definite reason is given, and if the umpire is not sure which of these two it was, then the batsman is not out.

I am told that a similar incident happened in the recently concluded India-Australia Test, where the bowler appealed for lbw or caught, the umpire upheld the appeal for lbw, but the batsman was upset thinking he was given caught out as he had not edged the ball. 

You can read more about this in the article, Ricky Ponting and the Judges, by Ian Rumfitt, who writes:

"In the first innings of the final Ashes Test of 2009, Ricky Ponting faced a ball which was deflected, off something, into the wicket-keeper’s hands. The English XI appealed, and in the agonizingly long time that it took Asad Rauf to decide, Jonathan Agnew (commentating on Test Match Special) reasoned as follows: ‘Either the ball hit Ponting’s bat or it hit his pads. If it hit his bat, he is out caught behind. If it hit his pads, he is out lbw. So, either way, he is out’. Rauf, however, appeared to disagree with Agnew’s reasoning, and Ponting stayed at the wicket."

Tuesday, February 26, 2013

Proof by Genie

Sometimes arguments like the one appearing in the puzzle can be useful in proofs. In their paper, An optimal strategy for a conflict resolution problem (Venkat Anantharam and Pravin Varaiya, Systems and Control Letters, 1986, pp. 329–332), the authors consider a problem in which each of $n$ people has a number in $[0,1]$. The aim is to find out who has the greatest number, using the minimum expected number of yes/no questions, which are asked one at a time, each put simultaneously to all $n$ people.

Questions can be of the form "Whose number is in the subset $A$ of $[0,1]$?", for any subset $A$. They prove that it is optimal to restrict questions to the simplest form, "Whose number is bigger than $x$?" To prove this they use an argument like this:

We postulate the following genie:

The genie tells us which of the sets $A$ and $A^c$ ...

I omit the detail (see their paper if you like). I focus here on the idea that the genie is providing extra information that can make the optimal policy easier to figure out. The authors continue:

By postulating a genie we mean that we permit ourselves to use different strategies on events on which the genie gives us different answers. Clearly we can do no better without the genie than we can with it.

They then show that given the information provided by the genie, the optimal policy is one that could have been implemented without having that information, and which takes a special form. So they can conclude the optimal policy (without the genie) must also take this special form (i.e. that it suffices to ask questions of the simple form "Whose number is bigger than $x$?").

We see, however, that one has to be careful with this type of proof. In the puzzle we are drawn into thinking about a genie who tells us whether A is true or false. At issue is the fact that the policy which is optimal, given what the genie tells us about A, may have identical $u_0$, whatever the genie says, but this is only because we can act differently for $u_1, u_2, \dotsc$.

Answer to the puzzle

There have been 10 votes for "yes" and 5 votes for "no". The intuitive answer to the puzzle is yes ($u_0=1$ is optimal).  That might make you suspicious that I am going to argue for the non-intuitive answer, no ($u_0=1$ is not necessarily optimal). I wonder if you will like my answer, or think I am cheating?

I said nothing about $u_0$ being the only decision. I make no apology, because this lecture course has been all about taking a sequence of decisions (or controls) $u_0,u_1,u_2,\dotsc$ over time.  That possibility should have been in your mind, and I was giving you a hint by denoting the decision as $u_0$ (the first of a sequence of decisions).

Suppose we are faced with a stopping problem in which $u_0=1$ means "continue" and $u_0=0$ means "stop". We should consider the possibility that by knowing that A is true, or by knowing A is false, it is optimal to continue. But if we do not know whether A is true or false then it is optimal to stop.

It is not completely obvious that this can happen. So let's see how to construct an example in which it does. Consider a version of the secretary problem in which we will see 3 candidates. They are "best", "middle" and "worst" (B,M,W). Of course we can only compare them with one another, so if they are presented in the order M,B,W, we would see this as $(x_1,x_2,x_3)=(1, 1, 0)$, where $x_i=1$ if the $i$th candidate is the best so far. Suppose that the candidates will be presented in one of the following four orders, with the consequent sequences of 0s and 1s, and with the probabilities given:
  1. B,M,W (1,0,0) with probability 0.2
  2. M,B,W (1,1,0) with probability 0.3
  3. B,W,M (1,0,0) with probability 0.2
  4. W,M,B (1,1,1) with probability 0.3
It is a bit artificial that all 6 possible orders are not equally likely, but for the purpose of this discussion we are not trying to think about a really practical problem.

We wish to maximize the probability of stopping on the last 1.

Now suppose the unknown "A" is whether "the worst candidate will be seen last".

If "A" is true it will be 1 or 2, but if A is false it will be 3 or 4.

You can now verify that given A is true it is optimal not to stop on the first 1 ($u_0=1=$ "continue"). The same decision is optimal if A is false. In both cases we win with probability 0.6.

However, if we do not know whether A is true or false, then it is optimal to stop on the first 1 ($u_0=0$), and we win with probability 0.4. If we do not stop, then with probability 0.6 we reach 1, 1, and from there we can win with probability only 0.5. So our win probability is only 0.3.
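You can check all of these probabilities by brute force. Here is a short sketch that enumerates the four orders above; the only policies that matter here are "stop at the $k$-th 1 that you see".

# The four possible sequences of "best so far" indicators, their probabilities,
# and whether A ("the worst candidate is seen last") is true.
scenarios = [((1, 0, 0), 0.2, True),    # B,M,W
             ((1, 1, 0), 0.3, True),    # M,B,W
             ((1, 0, 0), 0.2, False),   # B,W,M
             ((1, 1, 1), 0.3, False)]   # W,M,B

def win_prob(stop_at, condition=None):
    # P(stop on the last 1) for the policy "stop at the stop_at-th 1 seen",
    # optionally conditioned on the truth value of A.
    total = prob = 0.0
    for seq, p, a_true in scenarios:
        if condition is not None and a_true != condition:
            continue
        total += p
        ones = [i for i, v in enumerate(seq) if v == 1]
        if len(ones) >= stop_at and ones[stop_at - 1] == ones[-1]:
            prob += p
    return prob / total

print("A true :", [round(win_prob(k, True), 2) for k in (1, 2, 3)])    # [0.4, 0.6, 0.0]
print("A false:", [round(win_prob(k, False), 2) for k in (1, 2, 3)])   # [0.4, 0.0, 0.6]
print("unknown:", [round(win_prob(k), 2) for k in (1, 2, 3)])          # [0.4, 0.3, 0.3]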

Do you think this example is a cheat (in that it does not really match up with the way I posed the original puzzle), or have you learned something interesting?

I was thinking about this recently in connection with a similar question. I wonder if any of you can invent a neater example to illustrate this point, or one with a better "back-story".

The question I have been thinking about concerns an application of Bruss's odds theorem in circumstances in which we do not know the odds. Recall that if $X_1,\dotsc,X_n$ are independent, with $X_i=1$ or $0$ with probabilities $p_i$ and $q_i=1-p_i$, then we maximize the probability of stopping on the last 1 if we stop at the first 1 that we find amongst $X_s,X_{s+1},\dotsc,X_n$, where $s$ is the greatest integer $i$ such that

$\frac{p_i}{q_i}+\cdots+\frac{p_n}{q_n}\geq 1$.

Now suppose we do not know the actual values of $p_1,\dotsc,p_n$. But we have seen $X_{i-1}=1$ and somehow know that $p_i/q_i+\cdots+p_n/q_n=1$. Can we say that it is optimal to stop? Yes. Since sum-of-odds $=1$ is the borderline case in the odds theorem, is it also optimal not to stop? Yes.

But suppose we know only that $E[p_i/q_i+\cdots+p_n/q_n] > 1$. This is insufficient information to determine whether or not it is optimal to stop.
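For reference, the odds algorithm itself is only a few lines of computation. Here is a sketch; the final line applies it to the classical secretary problem, where $p_i=1/i$.

import math

def odds_algorithm(p):
    # Given P(X_i = 1) = p_i for independent indicators, return the index s from
    # which to start stopping at the first 1, and the resulting success probability.
    n = len(p)
    q = [1 - pi for pi in p]
    r = [math.inf if qi == 0 else pi / qi for pi, qi in zip(p, q)]   # odds r_i = p_i/q_i
    s, tail = 1, 0.0
    for i in range(n, 0, -1):            # find the greatest i with r_i + ... + r_n >= 1
        tail += r[i - 1]
        if tail >= 1:
            s = i
            break
    win = math.prod(q[s - 1:]) * tail    # q_s ... q_n (r_s + ... + r_n)
    return s, win

print(odds_algorithm([1 / i for i in range(1, 11)]))   # n = 10: s = 4, success probability ~0.399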

Lecture 12

I wish only to say that controllability and observability stand in a dual relationship to one another. This is clear in the necessary and sufficient conditions: that $[B\ AB\ \cdots\ A^{n-1}B]$ must be of rank $n$ (for controllability), and 

$\begin{bmatrix}C\\ CA\\ \vdots\\ CA^{n-1}\end{bmatrix}$ must be of rank $n$ (for observability).

This duality is also nicely exhibited in Example Sheet 3 Question 1. You should now be able to do this question.

Notice that a system can be stabilizable without being controllable. E.g. the scalar system with $x_{t+1}=(1/2)x_t$ is trivially stabilizable, but it is not controllable.

Monday, February 25, 2013

Who solved the Secretary Problem?

I thought you might enjoy this.

In his paper (Who Solved the Secretary Problem? Statist. Sci., Volume 4, Number 3 (1989), 282-289) Thomas Ferguson tells the following funny story.

"When the celebrated German astronomer, Johannes Kepler (1571-1630), lost his first wife to cholera in 1611, he set about finding a new wife using the same methodical thoroughness and careful consideration of the data that he used in finding the orbit of Mars to be an ellipse. His first, not altogether happy, marriage had been arranged for him, and this time he was determined to make his own decision. In a long letter to a Baron Strahlendorf on October 23, 1613, written after he had made his selection, he describes in great detail the problems he faced and the reasons behind each of the decisions he made. He arranged to interview and to choose from among no fewer than eleven candidates for his hand. The process consumed much of his attention and energy for nearly 2 years, what with the investigations into the virtues and drawbacks of each candidate, her dowry, negotiations with her parents, natural hesitations, the advice of friends, etc. The book of Arthur Koestler (1960) contains an entertaining and insightful exposition of the process. The book of Carola Baumgardt (1951) contains much supplementary information.

Suffice it to say that of the eleven candidates interviewed, Kepler eventually decided on the fifth. It may be noted that when $n = 11$, the function $\phi_n(r)$ of (2.1) takes on its maximum value when $r = 5$. Perhaps, if Kepler had been aware of the theory of the secretary problem, he could have saved himself a lot of time and trouble." Fortunately,
"His new wife, whose education, as he says in his letter, must take the place of a dowry, bore him seven children, ran his household efficiently, and seems to have provided the necessary tranquil homelife for his erratic genius to flourish."
Ferguson has many other interesting things in his paper. It created some controversy and amusement when published because his answer to the question "Who Solved the Secretary Problem?" is "Nobody"! He then goes on to "solve" it for the first time! (at least a particular version he likes). He also says:
"As historians, we should take as the secretary problem, the problem as it first appeared in print, in Martin Gardner's February 1960 column in Scientific American, where it was called the game of googol and described as follows.
Ask someone to take as many slips of paper as he pleases, and on each slip write a different positive number. The numbers may range from small fractions of 1 to a number the size of a googol (1 followed by a hundred 0's) or even larger. These slips are turned face down and shuffled over the top of a table. One at a time you turn the slips face up. The aim is to stop turning when you come to the number that you guess to be the largest of the series. You cannot go back and pick up a previously turned slip. If you turn over all slips, then of course you must pick the last one turned.
The astute reader may notice that this is not the simple form of the secretary problem described in Section 2. The actual values of the numbers are revealed to the decision maker in violation of condition 4. Also, there is this "someone" who chooses the numbers, presumably to make your problem of selecting the largest as difficult as possible. The game of googol is really a two-person game."

Friday, February 22, 2013

A Puzzle

Here is a simple puzzle to develop your understanding and intuition.

You are faced with a problem in which you need to make a decision. For example, you might be choosing $u_0$. You can choose $u_0=0$ or $u_0=1$. One of these is optimal.

There is something "A" which you do not know. It could be either true or false.

If A is true then $u_0=1$ is definitely optimal (i.e. better than $u_0=0$).
If A is false then $u_0=1$ is also definitely optimal.

Can you therefore conclude that $u_0=1$ is optimal?


Thursday, February 21, 2013

Lecture 11

The idea of controllability is straightforward and we can understand it using ideas of linear algebra.
Consider again the broom balancing problem. It is not possible to stabilize two brooms at equilibrium simultaneously if they are of the same length, but it is possible if the lengths are different. This is because of the forms:
$A=\begin{pmatrix}0 & 1 & 0 & 0\\ \alpha & 0 & 0& 0\\ 0 & 0 & 0& 1\\ 0 & 0& \beta & 0\end{pmatrix}\quad B=\begin{pmatrix}0 \\-\alpha\\0\\-\beta\end{pmatrix}\quad M_4=\begin{pmatrix}0 & -\alpha & 0 & -\alpha^2\\ -\alpha & 0 & -\alpha^2& 0\\ 0 & -\beta & 0& -\beta^2\\ -\beta & 0& -\beta^2 & 0\end{pmatrix}$

where $\alpha = g/L_1$, $\beta=g/L_2$.

So $M_4$ is of rank 4 iff $\alpha$ and $\beta$ are different. However, this assumes that as you move your hand you must keep it at a constant height from the ground. What do you think might happen if you can move your hand up and down also?
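You can check the rank condition numerically. Here is a small sketch using the matrices above; the broom lengths and g are made-up numbers.

import numpy as np

def ctrb_rank(A, B):
    # rank of the controllability matrix [B, AB, ..., A^{n-1} B]
    n = A.shape[0]
    M = np.hstack([np.linalg.matrix_power(A, k) @ B for k in range(n)])
    return np.linalg.matrix_rank(M)

def broom_system(L1, L2, g=9.8):
    a, b = g / L1, g / L2                 # alpha = g/L1, beta = g/L2
    A = np.array([[0, 1, 0, 0],
                  [a, 0, 0, 0],
                  [0, 0, 0, 1],
                  [0, 0, b, 0]], dtype=float)
    B = np.array([[0.0], [-a], [0.0], [-b]])
    return A, B

print(ctrb_rank(*broom_system(1.0, 1.0)))   # equal lengths: rank 2, not controllable
print(ctrb_rank(*broom_system(1.0, 0.5)))   # different lengths: rank 4, controllable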

In 1996, paper 2, question 10 there was this cute tripos question:
A fantasy kingdom has populations of x vampires and y humans. At the start of each of the four yearly seasons the king takes a census and then intervenes to admit or expel some humans. Suppose that the population dynamics of the kingdom are governed by the plant equations:
$x_{t+1} = x_t + (y_t - Y)$,
$y_{t+1} = y_t - (x_t - X) + u_t$,
where $x_t$ and $y_t$ are integers representing the respective populations of vampires and humans in the $t$-th season, $t=0,1,2,3,\dotsc$; $X$ and $Y$ represent equilibrium values in the population without regal intervention; ...
[ some more description]
Can the king regain equilibrium by only expelling humans?
Suppose, the census is taken during the day, so the king can only count humans, and thus $u_t$ is allowed only to depend on $y_t$. Show that ...

The last part of the question is about observability (see Lecture 12). The full question is in this file of past questions.

Tuesday, February 19, 2013

Lecture 10

The linear-quadratic regulator (LQR) that we met today is one part of the solution to the linear-quadratic-Gaussian control problem (LQG). This problem is the subject of Lectures 10–13. The LQG problem is perhaps the most important problem in control theory. It has a full solution: given in terms of the Kalman filter (a linear-quadratic estimator) and the linear-quadratic regulator.

Following today's lecture you can do Questions 6–10 on Example Sheet 2. Here is an important hint. Do not try to solve these problems by plugging values into the general solution of the Riccati equation. It is always better to figure out the solution from scratch, by the method of backwards induction. Conjecture for yourself that the optimal value function, $F(x,t)$, is of a form that is quadratic in the state variable, plus some $\gamma_t$ (if it is a problem with white noise). Then find the explicit form of the recurrences, working backwards inductively from the terminal time $h$. Notice how this problem-solving strategy works in Question 9, even when the noise is rather different.

In general, solutions to a Riccati equation can be computed numerically, but not algebraically. However, one can find a full solution in the special case that $x_t$ and $u_t$ are both one-dimensional (Question 6). There is also a fully-solvable special case in which $x_t$ is two-dimensional and $u_t$ is one-dimensional (Question 10). A useful trick in some problems is to observe that $\Pi_t^{-1}$ satisfies a recurrence relation (Questions 8 and 10). 
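To illustrate the "from scratch" approach, here is a sketch of backwards induction for a scalar problem with made-up coefficients. The indexing convention may differ slightly from the notes, but the pattern (conjecture $F(x,t)=\Pi_t x^2$ and recurse backwards from the horizon) is the one to copy in Question 6.

# Minimize sum_{t=0}^{h-1} (R x_t^2 + Q u_t^2) + Pi_h x_h^2 subject to x_{t+1} = A x_t + B u_t.
A, B, R, Q, Pi_h, h = 1.2, 1.0, 1.0, 1.0, 1.0, 20

Pi = Pi_h
for t in range(h - 1, -1, -1):              # work backwards from the terminal time
    K = -Pi * A * B / (Q + Pi * B ** 2)     # optimal feedback u_t = K_t x_t (uses Pi_{t+1})
    Pi = R + Pi * A ** 2 - (Pi * A * B) ** 2 / (Q + Pi * B ** 2)   # Riccati recursion for Pi_t
    if t in (h - 1, 0):
        print(f"t = {t}: gain K = {K:.4f}, Pi = {Pi:.4f}")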

I mentioned in the lecture that I would not expect you to memorize the Riccati equation. The important thing is to remember enough of what it is for, how it looks, and how it is derived, so that you could reconstruct its derivation in the context of a particular problem, as you do in Question 6.

Thursday, February 14, 2013

Lecture 9

Since the lecture today I have made some small changes to the proof of Theorem 9.1, as I realized I could give it an even cleaner exposition. (It has been some years since I have thought hard about this.)

I have a personal fondness for some things I talked about today. Theorem 9.1 (and extensions of it to other distributions beyond exponential) was one of the things that I proved in my Ph.D. dissertation in 1980. The Lady's Nylon Stocking Problem was a fairly famous unsolved problem at that time. It gave me something to talk about with non-mathematically inclined colleagues when I first became a research fellow at Queens' College. The paper is Scheduling jobs with stochastic processing requirements on parallel machines to minimize makespan or flow time, J. Appl. Prob. 19, 167–182, 1982. Looking at it again today, I think that I could write it better.

The method of uniformization that I described in this lecture seems to have been introduced by Jensen (1953), but it was first used extensively in solving queueing problems in the late 1970s (by Grassmann (1977) and Keilson (1979)). Even so, it was not well known; I recall discovering it for myself. Nowadays it is standard. Here is a quote from a recent 2011 paper by Rhonda Righter,
Stochastic sequential assignment problem with arrivals, Probability in the Engineering and Informational Sciences, 25, 2011, 477–485:

Proof. We consider the equivalent discrete-time problem by uniformizing with uniformization rate $\lambda + \gamma + \alpha = 1$, and we use value iteration (i.e., induction on a finite time horizon, $n$).

The Wikipedia article has more about it and summarises the method in these words:
The method involves the construction of an analogous discrete time Markov chain, where transitions occur according to an exponential distribution with the same parameter in every state.
I have used this idea many dozens of times in my own research. It is usually the best first step to make when  tackling a continuous-time Markov decision problem.
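Concretely, uniformization takes the generator matrix $Q$ of a continuous-time chain, picks a rate $\Lambda$ at least as large as every exit rate, and works with the discrete-time transition matrix $P = I + Q/\Lambda$, whose transitions occur at the points of a Poisson process of rate $\Lambda$. A minimal sketch, with a made-up generator:

import numpy as np

# Generator of a small continuous-time Markov chain (rows sum to zero); numbers are illustrative.
Q = np.array([[-2.0,  2.0,  0.0],
              [ 1.0, -3.0,  2.0],
              [ 0.0,  4.0, -4.0]])

Lam = np.max(-np.diag(Q))             # uniformization rate: at least the largest exit rate
P = np.eye(Q.shape[0]) + Q / Lam      # transition matrix of the uniformized discrete-time chain

print(P)                              # rows sum to 1; states with exit rate < Lam get self-loops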

Wednesday, February 13, 2013

ABCs of the Bomber Problem

I briefly spoke about the infamous Bomber Problem. If you would like to read more, see ABCs of the Bomber Problem and its Relatives, Weber, 2011. There is also a talk about this here. This problem has been open for over 35 years, and has been said, by researchers who know it, to be highly addictive. It is very infuriating that for a problem with such a very simple dynamic programming equation an "obvious" fact cannot be proved or disproved.

$F(n,t)$ is the probability that a bomber plane, which is potentially attacked by enemy fighters $t$ times, can successfully defend itself against all such attacks, when starting with a supply of $n$ air-to-air missiles. The probability of surviving an attack by using $k$ missiles is $(1-\theta^k)$, where $0<\theta<1$. An attack occurs in each of the $t$ time periods with probability $p$, and we write $q=1-p$.

$F(n,t) = q F(n,t-1) + p \max_{k\in\{1,\ldots,n\}}(1-\theta^k)F(n-k,t-1)$

$F(n,0)=1$

Conjecture B is that the maximizing $k$ is nondecreasing in $n$.

Many problems very similar to this are easy to solve.
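Here is a small numerical sketch of the dynamic programming equation, with made-up values of $\theta$ and $p$, which checks Conjecture B on a modest grid. No counterexample appears, as expected; this is of course no substitute for a proof.

import numpy as np

theta, p = 0.5, 0.7                  # illustrative parameters
q = 1 - p
N, T = 30, 30

F = np.ones((N + 1, T + 1))          # F(n, 0) = 1
k_opt = np.zeros((N + 1, T + 1), dtype=int)
for t in range(1, T + 1):
    F[0, t] = q * F[0, t - 1]        # with no missiles, survive only if not attacked
    for n in range(1, N + 1):
        vals = [(1 - theta ** k) * F[n - k, t - 1] for k in range(1, n + 1)]
        k_opt[n, t] = 1 + int(np.argmax(vals))
        F[n, t] = q * F[n, t - 1] + p * max(vals)

ok = all(np.all(np.diff(k_opt[1:, t]) >= 0) for t in range(1, T + 1))
print("maximizing k nondecreasing in n on this grid:", ok)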

Klinger and Brown's 1968 paper is here.

Tuesday, February 12, 2013

Lecture 8

You will find that the policy improvement algorithm is useful in answering Questions 3 and 4 on Examples Sheet 2. In Question 3, you will find $\lambda$ and $\phi$ for a certain policy ("On seeing a filling station, stop and fill the tank"), and then look for a condition under which the policy improvement algorithm will not improve it (or, equivalently, a condition under which this $\lambda$ and $\phi$ satisfy the optimality equation).

In Question 4 you will use the policy improvement idea in the context of a discounted-cost problem. You find $F(x)$ for a simple policy (in which the engineer allocates his time randomly), and then improve it by a step of the policy improvement algorithm. This leads to a policy in which the engineer puts all his maintenance effort into the machine with the greatest value of $c_i(x_i+q_i)/(1-\beta b_i)$. This policy is better, but may not yet be optimal.

In Question 1 you will want to mimic the calculation done in the proof of Bruss's odds algorithm. You cannot solve this by simply applying the algorithm directly.

There is another interesting way to motivate the optimality equation in the average-cost case. This can be made rigorous and helps us understand the relative value function $\phi(x)$.

Let $F(x)$ be the infinite-horizon value function when the discount factor is $\beta$. Then we know this satisfies the optimality equation
$F(x) = \min_u \{ c(x,u) +\beta E[ F(x_1) \mid x_0=x,u_0=u ] \}$ .
Pick some state, say state 0. By subtracting $\beta F(0)$ from both sides of the above, we obtain
$F(x) - F(0) + (1-\beta)F(0) = \min_u \{ c(x,u) + \beta E[ F(x_1) - F(0) \mid x_0=x,u_0=u ] \}$ .
One can show that as $\beta\to 1$ we have $F(x) - F(0) \to \phi(x)$ and $(1-\beta)F(0) \to\lambda$ (the average cost). Thus we obtain
$\phi(x) + \lambda = \min_u \{ c(x,u) + E[ \phi(x_1) \mid x_0=x,u_0=u ] \}$ 
and this is our average-cost optimality equation. If you would like to understand why $(1-\beta)F(0) \to\lambda$, see this small note about the connection between the average-cost problem and the discounted-cost problem with $\beta$ near 1.
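The vanishing-discount limit is easy to see numerically. Here is a sketch for a made-up 2-state, 2-action MDP: value iteration computes the discounted value function $F$, and as $\beta\to 1$ the quantities $(1-\beta)F(0)$ and $F(x)-F(0)$ settle down to $\lambda$ and $\phi(x)$.

import numpy as np

# Made-up costs c[x, u] and transition probabilities P[x, u, y].
c = np.array([[1.0, 3.0],
              [2.0, 0.5]])
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.6, 0.4], [0.1, 0.9]]])

def discounted_F(beta, iters=20000):
    F = np.zeros(2)
    for _ in range(iters):
        F = np.min(c + beta * P @ F, axis=1)   # value iteration for the discounted problem
    return F

for beta in (0.9, 0.99, 0.999):
    F = discounted_F(beta)
    print(f"beta={beta}: (1-beta)F(0) = {(1-beta)*F[0]:.4f},  F(1)-F(0) = {F[1]-F[0]:.4f}")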


It is also interesting to think about the following (which I mentioned briefly in lectures today). Suppose we have a deterministic stationary Markov policy, say $\pi$, with $u=f(x)$. Suppose we have $\lambda$ and $\phi$ such that
$\phi(x) + \lambda = c(x,f(x)) + \sum_y \phi(y) P(y \mid x)$     (7.6)
where $P(y \mid x)$ is the transition matrix under $\pi$.
Suppose $\pi$ induces an ergodic Markov chain (i.e. a Markov chain that is irreducible and positive recurrent) and this has invariant distribution $\mu$. We know that
$\mu_y = \sum_x \mu_x P(y \mid x)$     (7.7)
Multiplying (7.6) through by $\mu_x$ and then summing on $x$, we get
$\sum_x \mu_x \phi(x) + \lambda \sum_x \mu_x = \sum_x \mu_x c(x,f(x)) + \sum_y \sum_x \mu_x \phi(y) P(y \mid x)$
which, using (7.7), gives
$\sum_x \mu_x \phi(x) + \lambda \sum_x \mu_x = \sum_x \mu_x c(x,f(x)) + \sum_y \mu_y \phi(y)$.
Then using $\sum_x \mu_x = 1$, and cancelling the terms that are equal on both sides, we have
$\lambda = \sum_x \mu_x c(x,f(x))$
and so we see that $\lambda$ is the average cost of policy $\pi$.
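Here is a quick numerical check of this argument for a made-up 3-state chain under a fixed policy: solving (7.6) with the normalization $\phi(0)=0$ gives the same $\lambda$ as the stationary average cost $\sum_x \mu_x c(x,f(x))$.

import numpy as np

# Illustrative fixed policy: P[x, y] = P(y | x) and c[x] = c(x, f(x)).
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.0, 0.6]])
c = np.array([1.0, 4.0, 2.0])

# Stationary distribution mu: solve mu P = mu with sum(mu) = 1.
A = np.vstack([P.T - np.eye(3), np.ones(3)])
mu = np.linalg.lstsq(A, np.r_[np.zeros(3), 1.0], rcond=None)[0]

# Solve (7.6) for (phi(1), phi(2), lambda), taking phi(0) = 0 as the reference state.
M = np.zeros((3, 3))
for x in range(3):
    for y in (1, 2):
        M[x, y - 1] = (1.0 if x == y else 0.0) - P[x, y]
    M[x, 2] = 1.0
phi1, phi2, lam = np.linalg.solve(M, c)

print("lambda from (7.6):    ", round(lam, 4))
print("sum_x mu_x c(x, f(x)):", round(float(mu @ c), 4))   # the two agree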

Thursday, February 7, 2013

Lecture 7

In today's lecture I used a portion of my slides from a talk on the Gittins Index. You may enjoy looking through the entire talk. It will look better if you download the .pdf file to your computer and read the presentation with a .pdf viewer such as Acrobat, rather than trying to read it within your browser.

The proof of the Gittins index theorem is actually easy, but at the same time deep. It is non-examinable. I would expect you only to know what we mean by a SFABP, the statement of the Gittins index theorem, and how to calculate the indices in simple examples. However, I thought you would enjoy seeing this beautiful result and how it can be proved. Lectures 1-6 have covered everything you need to know in order to understand the Gittins index. Today's lecture has also been an opportunity to revise ideas that we already met in problems on job scheduling and pharmaceutical trials. 

Weitzman's Pandora's boxes problem in 7.5 is cute and something I am talking about for the first time. I may have rushed over it a bit, so refer to the notes. I have there an example in which the prize in box $i$ is $0$ or $r_i$, with probabilities $1-p_i$ and $p_i$. Here's another example. Suppose the prize is uniformly distributed over $[0,r_i]$ and $0<c_i<r_i/2$. Then the Gittins index in the undiscounted case is the solution to

$g_i= -c_i +\int_0^{r_i}\max(g_i,u)(1/r_i)\,du= -c_i +\big(g_i^2/(2r_i)+r_i/2\big)$.

The idea here is that Pandora is indifferent between taking home $g_i$, or opening box $i$, at cost $c_i$, and then taking home the best of either $g_i$ or the prize she finds in the box.

This gives $g_i= r_i-\sqrt{2c_ir_i}$, with $0<g_i<r_i$.
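A quick numerical check of this indifference, with made-up numbers:

import math

r, c = 1.0, 0.2                         # illustrative r_i and c_i with c < r/2
g = r - math.sqrt(2 * c * r)            # the closed form above
# Indifference check: g should equal -c + E[max(g, U[0, r])] = -c + g^2/(2r) + r/2.
print(round(g, 4), round(-c + g * g / (2 * r) + r / 2, 4))   # both ~0.3675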

The Gittins index theorem is one of the most beautiful results in the field of Markov decision processes. Its discovery and proof in 1974 are due to John Gittins. The proof I have given in today's lecture is very different to Gittins's original proof. It is the simplest way to prove the theorem and was first presented in On the Gittins index for multiarmed bandits, Ann. Appl. Prob. 2, 1024–33, 1992.

Tuesday, February 5, 2013

Lecture 6

Today we looked at some stopping problems. They crop up all over the place. They are present in financial mathematics in the theory of option pricing (Black-Scholes). The holder of an American option is allowed to exercise the right to buy (or sell) the underlying asset at a predetermined price at any time before or at the expiry date. The valuation of American options is essentially an optimal stopping problem.

You can read more about Bruss's odds algorithm in his paper: "Sum the odds to one and stop", Annals of Probability, Vol. 28, 1384–1391, 2000. I showed you how to use this to address a Secretary Problem in which the candidates arrive in groups, of sizes $n_1,n_2,\dotsc,n_h$. I wonder what is the best order in which to interview the groups? Should we see large groups first, or small groups? I forgot to mention that the optimal success probability when using the odds algorithm is $q_{s^*}\cdots q_n(r_{s^*}+\cdots+r_n)$ and that this is always at least $1/e$, provided $r_1+\cdots+r_n\geq 1$.

Let me finish with some hints about questions on Examples Sheets 1 and 2. Question 6 on Sheet 1 can be solved by realizing that there are only two stationary Markov policies. You simply need to decide which is best. Question 7 is a simple stopping problem that can be solved using an OSLA rule. Question 8 is tricky. You should use Theorem 4.2. 

Question 9 on Sheet 1 is important because it should convince you that it is possible to solve stochastic dynamic programming equations using linear programming (at least if both the state space and the action space are finite). If there are $n$ states and $m$ actions (controls) available in each state then we will have an LP in $n$ variables, with $nm$ linear inequality constraints. 
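Here is a sketch of one standard LP formulation, for a made-up discounted-cost problem with 2 states and 2 actions (Question 9 itself may be set up differently): the value function is the largest $F$ satisfying $F(x) \le c(x,u) + \beta\sum_y P(y\mid x,u)F(y)$ for every $(x,u)$, so we maximize $\sum_x F(x)$ subject to these $nm$ constraints.

import numpy as np
from scipy.optimize import linprog

beta = 0.9
cost = np.array([[1.0, 3.0],                    # made-up costs cost[x, u]
                 [2.0, 0.5]])
P = np.array([[[0.8, 0.2], [0.3, 0.7]],         # made-up transitions P[x, u, y]
              [[0.6, 0.4], [0.1, 0.9]]])
n, m = 2, 2

A_ub, b_ub = [], []
for x in range(n):
    for u in range(m):
        row = -beta * P[x, u].copy()
        row[x] += 1.0                           # constraint: F(x) - beta * sum_y P F(y) <= c(x, u)
        A_ub.append(row)
        b_ub.append(cost[x, u])

# linprog minimizes, so use objective -sum_x F(x) in order to maximize sum_x F(x).
res = linprog(c=-np.ones(n), A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=[(None, None)] * n)
print("F =", np.round(res.x, 4))                # agrees with value iteration on the same data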

When you do Question 2 on Sheet 2, you might think about using Bruss's odds algorithm. Although you cannot apply the algorithm directly, you should be able to solve this problem using the same scheme of proof as we used to prove the optimality of the odds algorithm in this lecture.