Thursday, January 31, 2013

Lecture 5

I have decided to rewrite some parts of the course. The notes will from this point on start to differ from those of last year. Please look for new notes appearing on the web site. Examples Sheets 1 and 2 are now as I intend, but I will be revising Sheet 3.

I started today by telling you about a famous problem concerned with searching for a randomly moving hidden object. This is an example of a partially observed Markov decision process. This problem has not yet been solved in full generality. However, the problem has been solved in continuous time. See R. R. Weber. Optimal search for a randomly moving object. J. Appl. Prob. 23:708-717, 1986. The reason the problem is easier in continuous time is that the state $x_1$ (the probability that the object is in location 1) changes continuously, rather than in jumps.
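To make the belief-state idea concrete, here is a minimal sketch of the discrete-time Bayes update for a two-location version of the problem, which shows the jumps referred to above. The movement matrix, detection probabilities, prior and the "search the more likely location" rule are all illustrative assumptions, not the model or the optimal policy of the paper.

```python
# Sketch: Bayes update of the belief state in a two-location, discrete-time
# version of the search problem.  All numbers are illustrative assumptions.

P = [[0.8, 0.2],     # P[i][j]: prob. the object moves from location i to location j
     [0.3, 0.7]]
alpha = [0.6, 0.4]   # prob. of finding the object when searching location i, given it is there

def update(x, loc):
    """New belief after an unsuccessful search of location `loc` and one move of the object."""
    miss = 1 - alpha[loc] * x[loc]                  # prob. that the search fails
    post = list(x)
    post[loc] = x[loc] * (1 - alpha[loc]) / miss    # Bayes' rule at the searched location
    post[1 - loc] = x[1 - loc] / miss
    return [post[0] * P[0][j] + post[1] * P[1][j] for j in (0, 1)]

x = [0.5, 0.5]        # x[0] plays the role of the x_1 in the paragraph above
for t in range(5):
    loc = 0 if x[0] >= x[1] else 1                  # e.g. always search the more likely location
    x = update(x, loc)
    print(t, [round(p, 3) for p in x])
```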

Let me emphasize that in Lecture 4 we used $F_s(x)$ to mean the value function of an MDP over $s$ steps, with $F_0(x)=0$. In the case of $c(x,u)\geq 0$, $F_s(x)$ is clearly nondecreasing in $s$. By contrast, in this lecture $F_s(x)$ was used to mean the minimum cost in a stopping problem in which we must stop within $s$ steps, with $F_0(x)=k(x)$. Now $F_s(x)$ is clearly nonincreasing in $s$, because as $s$ increases we are given more flexibility as to when we must stop. In both cases $F_s(x)$ satisfies an optimality equation and converges to a limit, say $F_\infty(x)$, but from different directions.
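Schematically (ignoring any discounting, and writing $x_1$ for the next state), the two recursions are

$$
F_s(x)=\min_{u}\Bigl\{c(x,u)+E\bigl[F_{s-1}(x_1)\mid x_0=x,\,u_0=u\bigr]\Bigr\},\qquad F_0(x)=0,
$$

for the $s$-step MDP, so that $F_s(x)\uparrow F_\infty(x)$ when $c\geq 0$, and

$$
F_s(x)=\min\Bigl\{k(x),\;c(x)+E\bigl[F_{s-1}(x_1)\mid x_0=x\bigr]\Bigr\},\qquad F_0(x)=k(x),
$$

for the stopping problem with continuation cost $c(x)$ and stopping cost $k(x)$, so that $F_s(x)\downarrow F_\infty(x)$.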

Having completed these lectures you can now attempt all of Examples Sheet 1. As mentioned above, I will still be revising Sheet 3.

Questions 7 and 8 use the idea of a one-step look-ahead (OSLA) rule that we met in today's lecture. Question 8 is subtle, because although it is easy to guess the answer by thinking about an OSLA rule, how do you prove this answer is correct? Hint: use an appropriate theorem about positive programming. Be careful when writing down the dynamic programming equation for Question 8 that you put the expected value (or the integral against the density of an exponential random variable) in the right place relative to where you place your $\max\{\,,\,\}$; see the schematic forms below.
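Schematically, the distinction is between

$$
F(x)=\max\Bigl\{r(x),\;E\bigl[F(x_1)\mid x\bigr]\Bigr\}
\qquad\text{and}\qquad
F(x)=E\Bigl[\max\bigl\{r(x_1),\,F(x_1)\bigr\}\;\Big|\;x\Bigr],
$$

where $r$ is the reward from stopping (this is a generic stopping equation, not the specific one for Question 8). The first form is right when you must commit to stop or continue before the next random quantity is revealed; the second when you observe it first and then decide. Think carefully about which situation Question 8 describes.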

Questions 9–11 look deeper, but that is mostly because of their lengthy descriptions; the solutions are not actually hard. In Q9 you'll want to use the idea of substituting an equation into itself. One of the reasons I include these questions is to show you what qualifies as a research-level problem in this field. I have recently written a paper about Question 11: ABCs of the bomber problem and its relatives, Annals of Operations Research, 2011.

Tuesday, January 29, 2013

Lecture 4

One of the CATAM projects this year is Policy Improvement for a Markov Decision Process. The next few lectures (especially 7) are relevant to answering the theoretical parts of this project. Doing this project will help you better understand this course. So I recommend this CATAM project to you.

In Section 4.3 (optimal gambling) we saw that timid play is optimal in the gambling problem when the game is favorable to the gambler ($p \geq 0.5$).

Perhaps you remember the question on the IB Markov Chains examples sheet 1 that begins "A gambler has £2 and needs to increase it to £10 in a hurry. The gambler decides to use a bold strategy in which he stakes all his money if he has £5 or less, and otherwise stakes just enough to increase his capital, if he wins, to £10."

In the case $p \leq 0.5$ bold play maximizes the probability that the gambler reaches £N. However, bold play may not be uniquely optimal. An example that exhibits this non-uniqueness of optimal strategy is contained in question 13 of the IB Markov Chains examples sheet 1.
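Here is a minimal sketch (with made-up parameters) that evaluates timid and bold play by iterating their DP equations, and compares them with value iteration over all stakes. The restriction to whole-pound stakes between 1 and the current capital is an assumption made purely for illustration.

```python
# Sketch: a gambler with i pounds wants to reach N before going broke.
# p, N and the restriction to whole-pound stakes are illustrative assumptions.

def evaluate(stake, p, N, sweeps=5000):
    """P(reach N) from each capital level under a fixed staking rule."""
    F = [0.0] * (N + 1)
    F[N] = 1.0
    for _ in range(sweeps):
        for i in range(1, N):
            a = stake(i)
            F[i] = p * F[min(i + a, N)] + (1 - p) * F[i - a]
    return F

def optimal(p, N, sweeps=5000):
    """Value iteration over all stakes a in {1, ..., i}."""
    F = [0.0] * (N + 1)
    F[N] = 1.0
    for _ in range(sweeps):
        for i in range(1, N):
            F[i] = max(p * F[min(i + a, N)] + (1 - p) * F[i - a]
                       for a in range(1, i + 1))
    return F

p, N = 0.4, 10
timid = evaluate(lambda i: 1, p, N)               # always stake 1
bold  = evaluate(lambda i: min(i, N - i), p, N)   # stake all you have, or just enough to reach N
print(timid[2], bold[2], optimal(p, N)[2])        # with p <= 0.5 the bold value matches the optimum
```

With $p\le 0.5$ the bold value should agree with the optimal value to within iteration error, while timid play does noticeably worse.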

When it was first discovered that bold play is not uniquely optimal, it was contrary to the accepted wisdom at that time (1955). It impressed Leonard Savage when it was presented to him by a Ph.D. student named Lester Dubins. They developed a collaboration culminating in the famous monograph How to Gamble if You Must (Inequalities for Stochastic Processes). See the fourth paragraph of this obituary of Lester Dubins.

If $p=0.5$ all strategies are optimal. How could we prove that? Easy. Simply show that given any policy $\pi$ the value function is $F(\pi,i)=i/N$ and that this satisfies the DP equation. Then apply Theorem 4.1. You can read more about these so-called red-and-black games at the Virtual Laboratories in Probability and Statistics.
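The verification is one line: with $p=q=\tfrac12$ and any stake $a\le i$,

$$
\tfrac12\cdot\frac{i+a}{N}+\tfrac12\cdot\frac{i-a}{N}=\frac{i}{N},
$$

so $F(\pi,i)=i/N$ satisfies the DP equation whatever stakes $\pi$ chooses, together with the boundary conditions $F(\pi,0)=0$ and $F(\pi,N)=1$.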

In Section 4.5 (pharmaceutical trials) I introduced an important class of very practical problems. One obvious generalization is to a problem with $k$ drugs, each of which has a different unknown probability of success, and about which we learn as we make trials of the drugs. This is called a multi-armed bandit problem. The name comes from thinking about a gaming machine (or fruit machine) having $k$ arms that you might pull, one at a time. The arms have different unknown probabilities of generating payouts. In today's lecture we considered a special case of the two-armed bandit problem in which one of the arms has a known success probability, $p$, and the second arm has an unknown success probability, $\theta$. I will say more about these problems in Lecture 6. The table on page 16 was computed by value iteration. This table is from the book Multi-armed Bandit Allocation Indices (of which I am one of the authors). The amazon.co.uk page will let you look inside this book.
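If you would like to compute something like that table yourself, here is a sketch of value iteration for the version discussed in lectures: one drug with known success probability $p$, the other with unknown success probability $\theta$ given a uniform prior. The particular values of $p$, the discount factor and the horizon below are illustrative assumptions, not the ones used for the table.

```python
# Sketch: one drug with known success probability p, another with unknown
# success probability theta ~ U[0,1].  Maximise the expected discounted number
# of successes over k remaining trials.  p, beta and the horizon are
# illustrative assumptions.
from functools import lru_cache

p, beta = 0.6, 0.9

@lru_cache(maxsize=None)
def F(s, f, k):
    """Optimal value with k trials to go, after s successes and f failures of
    the unknown drug (whose posterior mean is then (s+1)/(s+f+2))."""
    if k == 0:
        return 0.0
    known = p + beta * F(s, f, k - 1)        # use the known drug on this trial
    m = (s + 1) / (s + f + 2)                # posterior mean of theta
    unknown = (m * (1 + beta * F(s + 1, f, k - 1))
               + (1 - m) * beta * F(s, f + 1, k - 1))
    return max(known, unknown)

print(round(F(0, 0, 30), 3))                 # value of a 30-trial problem from the uniform prior
```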

Thursday, January 24, 2013

Lecture 3

In this lecture we had Theorem 3.1. It had an easy but nontrivial proof. It is important because it states that $F(x)$ satisfies a DP equation (3.7). It holds under the assumption of D (discounted), N (negative) or P (positive) programming. I explained what these three assumptions mean.

I gave an example, $c(x,u) = x$, $a(x,u) = -x$, $\beta = 1$, for which $F(x)$ does not exist. Note that a problem with this data fails to satisfy D, N or P.
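To see why $F(x)$ fails to exist, note that whatever controls are chosen the state simply alternates in sign, $x_t=(-1)^t x_0$, so the partial sums of cost oscillate:

$$
\sum_{t=0}^{s-1} x_t \;=\; x_0 - x_0 + x_0 - \cdots \;=\;
\begin{cases} x_0, & s\ \text{odd},\\ 0, & s\ \text{even},\end{cases}
$$

and (unless $x_0=0$) this has no limit as $s\to\infty$, so the infinite-horizon cost is undefined.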

Notice that the asset selling example in Section 3.5 (selling a tulip bulb collection) is very much like the secretary problem in Section 2.3. The difference is that now (i) we observe values (not just relative ranks), (ii) we wish to maximize the expected value of the selected candidate (rather than the probability of choosing the best), and (iii) the number of offers is infinite, but with a discount factor $\beta$.

The tulip bulb example illustrates one way in which discounting by a factor $\beta$ can naturally arise: a catastrophe may occur, with probability $1-\beta$ in each period, bringing the problem to an end.
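One way to compute the optimal (no-recall) policy of Section 3.5 numerically is via the reservation price $x^*$, which satisfies the fixed-point equation $x^*=\beta\,E[\max(X,x^*)]$: accept the first offer exceeding $x^*$. Here is a minimal sketch, assuming (purely for illustration) offers uniform on $[0,1]$ and $\beta=0.9$.

```python
# Sketch: reservation price for selling an asset to the best of a sequence of
# i.i.d. offers, with discount factor beta and no recall of past offers.
# Uniform[0,1] offers and beta = 0.9 are illustrative assumptions.

beta = 0.9

def E_max_uniform(a):
    """E[max(X, a)] for X ~ U[0,1] and 0 <= a <= 1."""
    return (1 + a * a) / 2

x_star = 0.0
for _ in range(200):                      # iterate x* <- beta * E[max(X, x*)]
    x_star = beta * E_max_uniform(x_star)

print(round(x_star, 4))                   # accept the first offer exceeding this threshold
```

For $\beta=0.9$ this converges quickly to about $0.627$; in this uniform case the fixed point can also be found in closed form as $x^*=(1-\sqrt{1-\beta^2})/\beta$.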

Can you figure out the answer to the asset selling problem if past offers for the tulip bulbs remain open (so long as the market has not collapsed)? The state $x_t$ would now be the best of the first $t$ offers.

Monday, January 21, 2013

Lecture 2

The highlight of today's lecture was the Secretary Problem. This is the most famous problem in the field of optimal stopping. It is credited to Merrill M. Flood in 1949, who called it the fiancée problem. It gained wider publicity when it appeared in Martin Gardner's column in Scientific American in 1960. There is an interesting Wikipedia article about it. One of the interesting things said in this article is that in behavioural studies people tend to stop too soon (i.e. marry too soon, make a purchase too soon). See The devil is in the details: Incorrect intuitions in optimal search.

The secretary problem can also be solved by applying the odds algorithm, due to Thomas Bruss. There is a nice tripos question based on Bruss's odds algorithm in 2011, Paper 2, 29K. The solution to the secretary problem is contained in the last line of this question. You should be able to do it once we have completed Lecture 5.
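For the curious, here is a sketch of the odds algorithm itself, applied to the secretary problem: the indicator that candidate $k$ is the best seen so far has success probability $1/k$, and these indicators are independent, so choosing the best candidate is exactly stopping on the last success.

```python
import math

def odds_algorithm(p):
    """Bruss's odds algorithm: maximise the probability of stopping on the last
    success among independent indicators with success probabilities p[0..n-1]."""
    n = len(p)
    total, s = 0.0, 0
    for k in range(n - 1, -1, -1):           # sum the odds p/(1-p) backwards from the end
        total += p[k] / (1 - p[k])
        if total >= 1.0:
            s = k
            break
    win_prob = math.prod(1 - p[k] for k in range(s, n)) * total
    return s, win_prob                       # optimal rule: stop at the first success from index s on

# Secretary problem with h candidates: candidate k is best-so-far with probability 1/k,
# and these indicator events are independent.
h = 100
s, w = odds_algorithm([1 / k for k in range(1, h + 1)])
print(s + 1, round(w, 4))                    # accept the first best-so-far from position s+1; w is about 1/e
```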

[Photo: RW with J. Michael Steele and Thomas Bruss at the 9th German Open Conference on Probability and Statistics, 2010, in Leipzig]

Variations of the secretary problem can be created by changing the assumptions about whether or not, in addition to relative ranks, values $X_1,\dotsc, X_h$ can be observed, and whether the objective is to maximize the probability of choosing the best candidate, to minimize the expected rank, or to maximize the expected value of the candidate selected.

A variation that has never been completely solved is the so-called Robbins' problem. In this problem we do observe values of the candidates, say $X_1,\dotsc, X_h$, and these are assumed to be independent, identically distributed uniform$[0,1]$ random variables. The objective is to minimize the expected rank of the candidate that is selected (best = rank 1, second-best = rank 2, etc.). It is known only that, as $h$ goes to infinity, the expected rank that can be achieved under an optimal policy lies between 1.908 and 2.329. This problem is much more difficult than the usual secretary problem because the decision as to whether or not to hire candidate $t$ must depend upon all the values of $X_1,\dotsc, X_t$, not just upon how $X_t$ ranks amongst them.

Following this lecture you can do questions 1–4 and 10 on Examples Sheet 1. Question 2 is quite like the secretary problem (and also has a surprising answer). The tricks that have been explained in today's lecture are useful in solving these questions (working in terms of time to go, backwards induction, the fact that a bang-bang control arises when the objective is linear in $u_t$, and looking at the cross-over between increasing and decreasing terms within a $\max\{\,,\,\}$, as we did in the secretary problem with $\max\{t/h, F(t)\}$).
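If you would like to see that cross-over numerically, here is a sketch of backward induction for the secretary problem, taking $F(t)$ to be the optimal success probability given that the first $t$ candidates have been passed over (the choice of $h$ below is arbitrary).

```python
def secretary(h):
    """Backward induction for the classical secretary problem with h candidates.
    F[t] = optimal probability of choosing the best, given that the first t
    candidates have been interviewed and passed over."""
    F = [0.0] * (h + 1)                      # F[h] = 0: no candidates left
    for t in range(h - 1, -1, -1):
        # Candidate t+1 is best-so-far with probability 1/(t+1); if so, we take the
        # better of stopping (win prob (t+1)/h) and continuing (win prob F[t+1]).
        F[t] = (t * F[t + 1] + max((t + 1) / h, F[t + 1])) / (t + 1)
    # Accept a best-so-far candidate at position s as soon as s/h >= F[s].
    s_star = next(s for s in range(1, h + 1) if s / h >= F[s])
    return F[0], s_star

print(secretary(10))   # roughly (0.399, 4): pass over three candidates, then take the next best-so-far
```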

Thursday, January 17, 2013

Lecture 1

In the first lecture I set up definitions and notation for state, control, history, value function, etc., and developed dynamic programming equations for a very general case, a state-structured case, and the Markov decision process case. Please be patient with the notation. It is not as complex as it may first appear. Things like $a(x,u,t)$, $F(x_t,t)$, $u_t$, and $U_t$ will begin to seem like old friends once you have used them a few times. By the way, in case it was not obvious, $h$ is used for the terminal time because it is the time horizon.
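For reference, in the state-structured deterministic case the equation we will use over and over is, schematically,

$$
F(x,t)=\inf_{u}\bigl\{c(x,u,t)+F\bigl(a(x,u,t),\,t+1\bigr)\bigr\},\qquad t=h-1,\dots,0,
$$

with $F(x,h)$ given by whatever terminal cost the problem specifies, the infimum taken over the controls admissible at time $t$, and the equation solved backwards from the horizon $h$.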

One person asked me about the terminology "plant equation" for $x_{t+1}=a(x_t,u_t,t)$. This derives from the fact that early optimal control theory was developed with applications to industrial processes in mind, especially chemical plants. You could also just call it the dynamics equation.

From this first lecture you should be taking away the key idea of dynamic programming, and the fact that problems in stages (with separable cost and a finite time horizon) can often be solved by working backwards from the final stage. The minimum-length path (stagecoach) problem is trivial, but should have made these ideas very intuitive, as an example of a problem in which the optimal policy is found by working backwards from the end. You might like to read the Wikipedia entry for Richard Bellman, who is credited with the invention of dynamic programming.
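Here is a tiny sketch of that backwards computation on a made-up stagecoach network; the stage structure and travel costs are invented purely for illustration.

```python
# Sketch: minimum-cost path through a staged network by backward induction.
# Nodes are grouped into stages; you can only move to a node in the next stage.
# The network and costs below are made up for illustration.

stages = [["A"], ["B1", "B2"], ["C1", "C2"], ["D"]]
cost = {("A", "B1"): 2, ("A", "B2"): 4,
        ("B1", "C1"): 7, ("B1", "C2"): 3,
        ("B2", "C1"): 1, ("B2", "C2"): 5,
        ("C1", "D"): 6, ("C2", "D"): 4}

F = {"D": 0.0}                           # value (cost to go) at the final node
policy = {}
for stage, nxt in zip(reversed(stages[:-1]), reversed(stages[1:])):
    for i in stage:                      # work backwards, one stage at a time
        best = min(nxt, key=lambda j: cost[i, j] + F[j])
        policy[i] = best
        F[i] = cost[i, best] + F[best]

print(F["A"], policy)                    # minimum total cost from A, and each node's optimal successor
```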

I have just now added some hyperlinks to the recommended booklist. In particular, Dimitri Bertsekas has a web page for his book and slides from his lectures. You might find these interesting to browse through at some later stage.

I mentioned that I had once appeared on ITV's Who Wants to be a Millionaire (October 2003). There is a nice Part II exam question on this, including the model solution. You might like to look at this now – simply to see the sort of mathematical problem that this course will enable you to solve. You can also view the overhead slides for a little presentation called A Mathematician Plays "Who Wants to Be a Millionaire?" which I once gave to a Royal Institution workshop for school children. You might like to see how I made best use of my "Ask the audience" lifeline by employing an idea from statistics. Basically, I asked members of the audience not to vote if they felt at all unsure of the right answer.

There is space below for you to write comments or ask questions about things that I said in this lecture. Don't be shy, if you are puzzled about something you are unlikely to be the only one, and we can all learn from some discussion. You can post with your name, or anonymously. If you find a mistake in the notes or have a suggestion for their improvement, please write to me at rrw1@cam.ac.uk

Thursday, January 10, 2013

Course starts January 17, 2013

This is the blog page for the course of 16 lectures starting 17 January 2013 (Tue/Thu @ 9am in CMS meeting room 3).

I will use this space to talk about some extra things. Sometimes upon leaving a lecture I think, "I wish I had said ...". This blog gives me a place to say it. Or I can use this space to talk about a question that a student has asked during a lecture. Or I might comment on an examples sheet question.

There is a space after each post for you to write comments or ask questions using the IntenseDebate commenting system. Please do not be shy to do so. Other students will probably benefit.