Tuesday, December 2, 2014

Lecture 16

Today's lecture was about solving some problems using Pontryagin's Maximum Principle.

We touched only briefly on controlled diffusion processes. The continuous time model for noise is Brownian motion, $B(t)$. The effect is that the Hamilton-Jacobi-Bellman equation gains another term, in $F_{xx}$.

I mentioned today in connection with section 16.1 (insects as optimizers) that zoologists have observed that in real-world choices of $t_{\text{switch}}$ some insect populations seem to do a good job of predicting $T$ (the time of the serious frost that ends the lives of the workers). This example is from Luenberger, Introduction to Dynamic Systems, 1979, page 403.

What makes for a beautiful problem in science?

In his paper with the title "What makes for a beautiful problem in Science?" (Journal of Political Economy, Vol. 78, No. 6, 1970), Paul Samuelson writes:

What is the shortest distance between two points? A straight line. Can you prove it?

What is the longest distance between two points? Surely, no answer is possible. But consider two points a mile apart on a horizontal line in the plane. Suppose that the admissible paths cannot have an absolute slope exceeding one million. Then it is not hard to envisage that the longest distance is not (sic) a bit over a million miles. But it takes the methods of Pontryagin and not those of the classical theory to handle such a problem in which the control variables are constrained within a closed set.

You might like to see if you can use Pontryagin's maximum principle to prove what Samuelson claims. You must start by deciding what he means by an admissible path. I expect he is thinking about a curve between $(0,0)$ and $(1,0)$ that is a graph $y(x)$ determined by $y'(x)=u(x)$, $0\leq x \leq 1$. I have added (sic) (for sic erat scriptum) because I think the word "not" should be deleted. He is trying to say that the maximum distance is a bit over a million miles, in fact $\sqrt{10^{12}+1}$; a sketch of the argument is at the end of this post. Samuelson goes on:

A scientific problem gains in aesthetic interest if you can explain it to your neighbor at a dinner party. 

That no map requires more than four colors is such a problem. Can the turnpikes of Professor Chakravarty pass this test? In a system in which everything produces everything, a critical set of proportions at which balanced growth is most rapid seems reasonable. So the general notion of the von Neumann turnpike is conveyed. But suppose we begin away from this highway, and also are to end up away from it. Still, if the journey is to be a sufficiently long one, as seen we shall do well to move in the beginning toward that fast highway; catch a fast ride along it, staying so long there as to leave behind in all directions any rival who sticks to any slower path; then, toward the end of our journey, we can swing off the turnpike to our final rendezvous.

We see this in today's lecture (the turnpike of economic growth, Section 15.6).

I cannot resist one more excerpt from Samuelson's paper, in which he muses on the way in which an optimal path may sometimes seem strange.

Begin with durable hammers and robots that can make new hammers or robots with the same symmetrical input requirements for the one as for the other. Each input is subject to familiar diminishing returns; but increasing both inputs proportionally leads to constant returns to scale. I begin with 10 robots and 5 hammers and must end up in minimum time with (at least) 1,000 robots and 5,000 hammers. How should I proceed? Surely, I begin by producing only hammers, since the redundant robots are subject to diminishing returns. When I reach 10 of each input, I shift gears to produce equal amounts of both, proceeding along what is known as a von Neumann-DOSSO turnpike. You might think I go along it until 1,000 of each are attained, thereafter producing only the 4,000 more hammers. And you might be right. But you might be wrong. Depending upon the exact technology, it might pay to produce more of both along the turnpike, ceasing to produce more robots only after they are somewhere between 1,000 and 5,000 in number. We end up with more than we needed of robots. Irrational, since we don't want them and they cost something? Not necessarily, since they may have been worthwhile for their help in producing the extra hammers we do want.
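Returning to the longest-distance problem above, here is a sketch of how PMP gives the claimed answer, under my reading of the problem (maximize the path length, treating the slope as a control constrained to a closed set; I have not checked every technical condition). The problem is
$$\text{maximize}\ \int_0^1\sqrt{1+u(x)^2}\,dx\quad\text{subject to}\quad y'(x)=u(x),\quad y(0)=y(1)=0,\quad |u(x)|\leq M=10^6.$$
The Hamiltonian is $H=\sqrt{1+u^2}+\lambda u$ and the adjoint equation is $\lambda'=-\partial H/\partial y=0$, so $\lambda$ is constant. Since $H$ is convex in $u$, its maximum over the closed set $|u|\leq M$ is attained at an endpoint, so the maximizing control is bang-bang: $u(x)\in\{-M,+M\}$. The endpoint condition $y(1)=0$ forces $u=+M$ and $u=-M$ each on sets of total length $1/2$, so the path length is $\int_0^1\sqrt{1+M^2}\,dx=\sqrt{1+10^{12}}$, a bit over a million miles.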

Thursday, November 27, 2014

Lecture 15

The maximum principle is due to Lev Pontryagin. He was blind and yet one of the greatest mathematicians of his generation.

A key thing to grasp is that the PMP provides necessary conditions. We use the fact that an adjoint trajectory $\lambda$ must exist to deduce properties of, or completely determine, the optimal control and optimally controlled trajectory. To my thinking, the PMP is notoriously badly explained in most books. I hope I have been able to make it seem more intuitive. Lagrange multipliers give a helpful interpretation, as does differentiation of the infinitesimal version of the optimality equation. We may also compare the PMP to the Lagrangian necessity theorem, which says that if $\phi(b)=f(x^*(b))=\max\{f(x):g(x)=b,x\in X\}$ is concave in $b$, then there exists a Lagrange multiplier $\lambda$ such that $f(x)+\lambda^T(b-g(x))$ is maximized with respect to $x\in X$ by $x^*(b)$.

For a crash course in Lagrangian optimization (a reminder of IB) you might like to look at pages 2-3 of these notes by Mike Tehranchi.

The rocket car example is a celebrated problem that was first solved by D.W. Bushaw, Differential Equations with a Discontinuous Forcing Term, PhD Thesis, Princeton, 1952.

In the obituary of Donald W. Bushaw (1926-2012) it is stated that "Don’s PhD thesis is recognized as the beginning of modern optimal control theory."

There is a nice interactive demo of the solution to the rocket car parking problem that you can try.

Some personal experience of the power of PMP came in solving the problem in the paper

R. R. Weber. Optimal search for a randomly moving object. J. Appl. Prob. 23:708-717, 1986.

Here I used PMP to solve in continuous time the problem of searching for a moving object (Section 5.1). This is still an open problem in discrete time.

Wednesday, November 26, 2014

Give your feedback

This year's course finishes on Tuesday, December 2 2014. 

Your initial state of knowledge was $W_0$, and will have been brought to $W_{16}$. It is now your job to control the trajectory onwards. Information about certain unobservables will be revealed by variables $y_1,y_2,y_3$ in the Easter Term. These observations are subject to noise.

As you study and revise from the notes I would appreciate an email from you if you see any typos or grammatical mistakes, or find something that you recommend could be explained better. I try to tread a Goldilocks path in these notes of being neither too verbose nor too succinct. 

You can give the standard feedback on-line here. This is identical to the Faculty's paper form that I will hand out on Thursday. It will be sent to my email anonymously. After reading the responses, I will forward them to the Faculty Office. The online form is easy to use and is particularly useful for anyone who is following the course from my notes and is not attending lectures.

Tuesday, November 25, 2014

Lecture 14

On the basis of today's lecture you should be able to do Examples Sheet 3, #3, 4, 5.

I started the lecture with examples of two methods applied to the solution of the continuous-time LQR problem.

Method 1 is the solution of this problem using the Hamilton-Jacobi-Bellman equation.

Method 2 foreshadows the use of Pontryagin's Maximum Principle, which we shall discuss in lectures 15, 16. You will need this for #6, 7, 8, 9, which should all be possible after lecture 15.

Question #10 I will probably do in lecture 16.

In the fishing example we were essentially using Theorem 14.1 to show that an optimal policy is one for which,
$$
u=\begin{cases}
0 & x<\bar x\\
a(\bar x) & x=\bar x\\
u_{\max} & x>\bar x
\end{cases}
$$
and that a necessary condition is that $a'(\bar x)=\alpha$. At the end of page 59 of the notes, $F(x)$ is computed as a function of $T$, the time at which $x$ first becomes equal to $\bar x$. A simple calculation, which I did not do in lecture, is that $F_{xx}<0$, so $F$ is concave. All this information confirms that $F$ satisfies the HJB equation (14.2).

Notice that Figure 2 on page 59 shows $u_{\max}\geq a(x)$ for all $x$, so fishing at maximum rate does cause the population to decrease (unless initially $a(x)=u_{\max}$).

Monday, November 24, 2014

Kalman filter in economics

The Kalman filter was an important tool in space exploration, and is often mentioned in connection with the Apollo XI guidance system. Let me make some remarks about where Kalman filtering ideas are used, in areas adjacent to operations research, such as economics. 

I once asked Lord Eatwell, President of Queens', and a famous economist, "Is the Kalman filter much used in economics?" He immediately replied, "Yes, all the time".

Eatwell is one of the compilers of the Palgrave Dictionary of Economics. It is a good place to go if ever you need to find a short article on any economics topic. I searched in Palgrave for the Kalman filter, and read:

The Kalman filter

The Kalman filter deals with state-space representations where the transition and measurement equations are linear and where the shocks to the system are Gaussian. The procedure was developed by Kalman (1960) to transform (‘filter’) some original observables $y_t$ into Wold innovations $a_t$ and estimates of the state $x_t$. With the innovations, we can build the likelihood function of the dynamic model. With the estimates of the states, we can forecast and smooth the stochastic process.

The use of unobserved components opens up a new range of possibilities for economic modelling. Furthermore, it provides insights and a unified approach to many other problems. The examples below give a flavour.

The local linear trend model generalizes (1) by the introduction of a stochastic slope, $\beta_t$, which itself follows a random walk. Thus ...

Econometricians are often interested in building linear models in which some variables are explained by other variables (in some sort of regression model). 

As the values of variables become known over time one wants to update estimates of other variables. The machinery for doing this is provided by the Kalman filter. Notice that the Kalman filter does not have anything to do with the Q (quadratic cost) assumption of our LQG model. It is only the Linear and Gaussian parts that are relevant.

You might like to try a Google search for "Kalman Filter and finance". Many things turn up. For example, here is a talk, Kalman Filtering in Mathematical Finance.

Thursday, November 20, 2014

Lecture 13

The name "Kalman filter" refers to the estimation equation (13.1) and takes its name from Rudolf Kalman (1930 –), who developed it in the years 1958-64. He also coined the terms controllable and observable, and gave the criteria that we have seen in previous lectures. The fact that a system is controllable iff the matrix $[B\ AB\ \cdots\ A^{n-1}B]$ is of full rank is sometimes called Kalman's criteria. In the IEEE biography of Kalman it is stated
The Kalman filter, and its later extensions to nonlinear problems, represents perhaps the most widely applied by-product of modern control theory. It has been used in space vehicle navigation and control (e.g. the Apollo vehicle), radar tracking algorithms for ABM applications, process control, and socioeconomic systems.
The theory in this lecture is admittedly quite tricky - partly because of the notation. As a test of memory, can you say what roles in the theory are taken by each of these?

 $x_t$,  $u_t$, $A$, $B$, $\epsilon_t$, $y_t$, $C$, $\eta_t$, $\hat x_t$, $\Delta_t$, $\xi_t$, $\zeta_t$, $R$, $S$, $Q$, $K_t$, $\Pi_t$, $N$, $L$, $M$, $H_t$,  $V_t$. 

You will understand the ideas better once you have worked through the details of a scalar example (in which $n=m=p=1$). You do this in Examples Sheet 3, Question 2. When you do this question, start by supposing that $\hat x_t=\hat x_{t-1}+u_{t-1}+h_t(y_t-\hat x_{t-1})$, and then find the value of $h_t$ that minimizes the variance of $\hat x_t$. Subtract $x_t=x_{t-1}+u_{t-1}+3\epsilon_t$ and use $y_t=x_{t-1}+2\eta_t$. You get,

$\hat{x}_{t+1}-{x}_{t+1}=\Delta_{t+1}=\Delta_t+2\epsilon_t-h_t\Delta_t+h_t2\eta_t.$

Then square, take the expected value, and minimize with respect to $h_t$ to find a formula for $V_{t+1}$ in terms of $V_t$.
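For anyone who wants to see the recursion numerically, here is a minimal sketch (my own illustration, not the examples-sheet solution), assuming the scalar model quoted above, $x_t=x_{t-1}+u_{t-1}+a\epsilon_t$ and $y_t=x_{t-1}+b\eta_t$ with independent standard normal noise; the function name and the indexing of time are my own choices and may differ slightly from the sheet's conventions.

```python
import numpy as np

# Scalar Kalman filter sketch.  Assumed model (coefficients a = 3, b = 2 as in
# the text above):
#   x_t = x_{t-1} + u_{t-1} + a*eps_t,   y_t = x_{t-1} + b*eta_t,
# with eps, eta independent N(0,1).  Delta_t = xhat_t - x_t has variance V_t.

def kalman_step(xhat, V, u_prev, y, a=3.0, b=2.0):
    # Minimizing E[Delta_t^2] = (1-h)^2 V + a^2 + h^2 b^2 over h gives:
    h = V / (V + b**2)
    xhat_new = xhat + u_prev + h * (y - xhat)
    V_new = (1 - h)**2 * V + a**2 + h**2 * b**2   # Riccati recursion for V_t
    return xhat_new, V_new

# Simulate to check that the estimator tracks the true state.
rng = np.random.default_rng(0)
x, xhat, V = 0.0, 0.0, 1.0
for t in range(20):
    u = -0.5 * xhat                      # any control known to the observer
    y = x + 2.0 * rng.standard_normal()  # observe previous state plus noise
    x = x + u + 3.0 * rng.standard_normal()
    xhat, V = kalman_step(xhat, V, u, y)
print("V_t after 20 steps:", V)
```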

You will not be asked to reproduce the statement or proofs of Theorem 13.1 or 13.2 in examinations. You should simply know that $\hat{x}_t$ is computed from $\hat{x}_{t-1}$ and $y_t$ in the linear manner specified by (13.1), and that the covariance matrix $V_t$ satisfies a Riccati equation. You are not expected to memorize Riccati equations.

Notice that the Riccati equation for $V_t$, i.e. $V_t = g\, V_{t-1}$ runs in the opposite time direction to the one we had for $\Pi_t$ in lecture 10, where $\Pi_{t-1} = f\, \Pi_t$. We are given $V_0$ and $\Pi_h$.

Tuesday, November 18, 2014

Lecture 12

I wish only to say that controllability and observability stand in a dual relationship to one another. This is clear in the necessary and sufficient conditions: that $[B\ AB\ \cdots\ A^{n-1}B]$ must be of rank $n$ (for controllability), and

$\begin{bmatrix}C\\ CA\\ \vdots\\ CA^{n-1}\end{bmatrix}$ must be of rank $n$ (for observability).

This duality is also nicely exhibited in the starred question on Example Sheet 3, Question 11. You should now be able to do this question.

Notice that a system can be stabilizable without being controllable. E.g. the scalar system with $x_{t+1}=(1/2)x_t$ is trivially stabilizable, but it is not controllable.

Thursday, November 13, 2014

Lecture 11

I showed you a video of what optimal control can do when it is coupled to today's fast microprocessors and feedback controllers. The video is here.

Quadrocopter pole acrobatics

The idea of controllability is straightforward and we can understand it using ideas of linear algebra.

Consider again the broom balancing problem. I mentioned in lectures that it is not possible to stabilize two brooms at equilibrium simultaneously if they are of the same length, but it is possible if the lengths are different. This is because of the forms:

$A=\begin{pmatrix}0 & 1 & 0 & 0\\ \alpha & 0 & 0& 0\\ 0 & 0 & 0& 1\\ 0 & 0& \beta & 0\end{pmatrix}\quad B=\begin{pmatrix}0 \\-\alpha\\0\\-\beta\end{pmatrix}\quad M_4=\begin{pmatrix}0 & -\alpha & 0 & -\alpha^2\\ -\alpha & 0 & -\alpha^2& 0\\ 0 & -\beta & 0& -\beta^2\\ -\beta & 0& -\beta^2 & 0\end{pmatrix}$

where $\alpha = g/L_1$, $\beta=g/L_2$.

So $M_4$ is of rank 4 iff $\alpha$ and $\beta$ are different. However, this assumes that as you move your hand you must keep it at constant height from the ground. Things might be different if you can move your hand up and down as well.
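As a quick numerical check of this claim (a small sketch of my own, not part of the notes), one can compute the rank of $M_4=[B\ AB\ A^2B\ A^3B]$ for particular values of $\alpha$ and $\beta$:

```python
import numpy as np

def controllability_rank(alpha, beta):
    """Rank of M4 = [B, AB, A^2 B, A^3 B] for the two-broom system above."""
    A = np.array([[0.0, 1.0, 0.0, 0.0],
                  [alpha, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0],
                  [0.0, 0.0, beta, 0.0]])
    B = np.array([0.0, -alpha, 0.0, -beta])
    M4 = np.column_stack([np.linalg.matrix_power(A, k) @ B for k in range(4)])
    return np.linalg.matrix_rank(M4)

print(controllability_rank(9.8 / 1.0, 9.8 / 2.0))  # different lengths: rank 4
print(controllability_rank(9.8 / 1.0, 9.8 / 1.0))  # equal lengths: rank 2
```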

In 1996, paper 2, question 10 there was this tripos question about controllability:
A fantasy kingdom has populations of $x$ vampires and $y$ humans. At the start of each of the four yearly seasons the king takes a census and then intervenes to admit or expel some humans. Suppose that the population dynamics of the kingdom are governed by the plant equations:
$x_{t+1} = x_t + (y_t - Y)$,
$y_{t+1} = y_t - (x_t - X) + u_t$,
where $x_t$ and $y_t$ are integers representing the respective populations of vampires and humans in the $t$-th season, $t=0,1,2,3,\dotsc$; $X$ and $Y$ represent equilibrium values in the population without regal intervention; ...
[ some more description]
Can the king regain equilibrium by only expelling humans?
Suppose the census is taken during the day, so the king can only count humans, and thus $u_t$ is allowed only to depend on $y_t$. Show that ...

The last part of the question is about observability (see Lecture 12). The full question is in this file of past questions.

Tuesday, November 11, 2014

Lecture 10

The linear-quadratic regulator (LQR) that we met today is one part of the solution to the linear-quadratic-Gaussian control problem (LQG). This problem is the subject of Lectures 10–13. The LQG problem is perhaps the most important problem in control theory. It has a full solution: given in terms of the Kalman filter (a linear-quadratic estimator) and the linear-quadratic regulator.

Following today's lecture you can do Questions 6–10 on Example Sheet 2. I mostly did Question 10 in today's lecture. Here is an important hint. Do not try to solve these problems by plugging values into the general Riccati equation solution. It is always better to figure out the solution from scratch, by the method of backwards induction. Conjecture for yourself that the optimal value function, $F(x,t)$, is of a form that is quadratic in the state variable, plus some $\gamma_t$ (if it is a problem with white noise). Then find the explicit form of the recurrences, working backwards inductively from the terminal time $h$.

In general, solutions to a Riccati equation can be computed numerically, but not algebraically. However, one can find a full solution in the special case that $x_t$ and $u_t$ are both one-dimensional (Question 6). There is also a fully-solvable special case in which $x_t$ is two-dimensional and $u_t$ is one-dimensional (Question 10). A useful trick in some problems is to observe that $\Pi_t^{-1}$ satisfies a recurrence relation (Questions 8 and 10).
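To illustrate the backwards-induction approach in the simplest setting (a sketch with made-up scalar coefficients, not the solution to any particular sheet question): for the scalar plant $x_{t+1}=ax_t+bu_t$ with cost $\sum_t(rx_t^2+qu_t^2)$ and terminal cost $\Pi_hx_h^2$, conjecturing $F(x,t)=\Pi_tx^2$ and minimizing over $u$ gives a scalar Riccati recursion that is easy to iterate backwards.

```python
# Scalar LQR by backwards induction (illustrative coefficients only).
# Plant: x_{t+1} = a x_t + b u_t; cost sum_t (r x_t^2 + q u_t^2) + Pi_h x_h^2.
# Conjecturing F(x,t) = Pi_t x^2 and minimizing over u gives
#   u_t = -K_t x_t,  K_t = a b Pi_{t+1} / (q + b^2 Pi_{t+1}),
#   Pi_t = r + a^2 q Pi_{t+1} / (q + b^2 Pi_{t+1}).

def lqr_backwards(a, b, r, q, Pi_h, h):
    Pi = [0.0] * (h + 1)
    K = [0.0] * h
    Pi[h] = Pi_h
    for t in range(h - 1, -1, -1):
        K[t] = a * b * Pi[t + 1] / (q + b**2 * Pi[t + 1])
        Pi[t] = r + a**2 * q * Pi[t + 1] / (q + b**2 * Pi[t + 1])
    return Pi, K

Pi, K = lqr_backwards(a=1.0, b=1.0, r=1.0, q=1.0, Pi_h=0.0, h=20)
print(Pi[0], K[0])   # both converge as the horizon recedes
```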

I do not expect you to memorize the general form of the Riccati equation for the exams. The important thing is to remember enough of what it is for, how it looks, and how it is derived, so that you could reconstruct its derivation in the context of a particular problem, as you do in Question 6.

Monday, November 10, 2014

Examples sheet 2

Having completed a first supervision on this examples sheet, I have now made some corrections to the questions, and made some changes of wording that I think will help.

I have also taken a hard question from sheet 3 and moved it to the end, as question #11. It is certainly "starred", but might be interesting to discuss, as it points up the duality of the questions of controllability and observability.

Thursday, November 6, 2014

Lecture 9

The slides for the short talk on stochastic scheduling today are here.

The method of uniformization that I described in this lecture seems to have been introduced by Jensen (1953), but it was first used extensively in solving queueing problems in the late 1970s (by Grassman (1977) and Keilson (1979)). The Wikipedia article has more about it and summarises the method in these words:
The method involves the construction of an analogous discrete time Markov chain, where transitions occur according to an exponential distribution with the same parameter in every state.
I have used this idea many dozens of times in my own research. It is usually the best first step to make when tackling a continuous-time Markov decision problem.
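To make the idea concrete, here is a minimal sketch (my own illustration, not from the lecture): given the generator $Q$ of a continuous-time chain, choose a uniformization rate $\Lambda\geq\max_i|q_{ii}|$ and form the transition matrix $P=I+Q/\Lambda$ of the analogous discrete-time chain; fictitious self-loop transitions make the event rate equal to $\Lambda$ in every state.

```python
import numpy as np

def uniformize(Q, Lam=None):
    """Return (P, Lam), where P = I + Q/Lam is the transition matrix of the
    uniformized discrete-time chain."""
    Q = np.asarray(Q, dtype=float)
    if Lam is None:
        Lam = max(-Q.diagonal())        # any Lam >= max_i |q_ii| will do
    return np.eye(Q.shape[0]) + Q / Lam, Lam

# Example: two states, rate 1 for 0 -> 1 and rate 3 for 1 -> 0.
Q = [[-1.0, 1.0],
     [3.0, -3.0]]
P, Lam = uniformize(Q)
print(P)   # rows sum to 1; state 0 acquires a self-loop of probability 2/3
```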

Tuesday, November 4, 2014

Lecture 8

You will find that the policy improvement algorithm is useful in answering Questions 3 and 4 on Examples Sheet 2. In Question 3, you will find $\lambda$ and $\phi$ for a certain policy ("On seeing a filling station, stop and fill the tank"), and then look for a condition under which the policy improvement algorithm will not improve it (or, equivalently, under which this $\lambda$ and $\phi$ satisfy the optimality equation).

In Question 4 you will use the policy improvement idea in the context of a discounted-cost problem. You find $F(x)$ for a simple policy (in which the engineer allocates his time randomly), and then improve it by a step of the policy improvement algorithm. This leads to a policy in which the engineer puts all his maintenance effort into the machine with the greatest value of $c_i(x_i+q_i)/(1-\beta b_i)$. This policy is better, but may not yet be optimal.
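As a generic illustration of a single improvement step (a sketch with a made-up two-state MDP, not the examples-sheet model): evaluate $F$ for the current policy by solving the linear equations $F=c_f+\beta P_fF$, then choose in each state the action minimizing $c(x,u)+\beta\sum_yP(y\mid x,u)F(y)$.

```python
import numpy as np

def policy_improvement_step(P, c, beta, policy):
    """One policy improvement step for a finite discounted-cost MDP.
    P[u] is the transition matrix under action u; c[x, u] is the cost."""
    n, m = c.shape
    # Policy evaluation: solve F = c_f + beta * P_f F for the current policy.
    Pf = np.array([P[policy[x]][x] for x in range(n)])
    cf = np.array([c[x, policy[x]] for x in range(n)])
    F = np.linalg.solve(np.eye(n) - beta * Pf, cf)
    # Policy improvement: take the minimizing action in each state.
    Q = np.stack([c[:, u] + beta * P[u] @ F for u in range(m)], axis=1)
    return Q.argmin(axis=1), F

# Tiny made-up example: 2 states, 2 actions.
P = [np.array([[0.9, 0.1], [0.5, 0.5]]),   # transitions under action 0
     np.array([[0.2, 0.8], [0.1, 0.9]])]   # transitions under action 1
c = np.array([[1.0, 2.0], [4.0, 0.5]])
new_policy, F = policy_improvement_step(P, c, beta=0.9, policy=np.array([0, 0]))
print(F, new_policy)
```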

In Question 1 you will want to mimic the calculation done in the proof of Bruss's odds algorithm. You cannot solve this by simply applying the algorithm directly.


There is another interesting way to motivate the optimality equation in the average-cost case. This can be made rigorous and helps us understand the relative value function $\phi(x)$.



Let $F(x)$ be the infinite-horizon value function when the discount factor is $\beta$. Then we know this satisfies the optimality equation
$F(x) = \min_u \{ c(x,u) +\beta E[ F(x_1) \mid x_0=x,u_0=u ] \}$ .
Pick some state, say state 0. By subtracting $\beta F(0)$ from both sides of the above, we obtain
$F(x) - F(0) + (1-\beta)F(0) = \min_u \{ c(x,u) + \beta E[ F(x_1) - F(0) \mid x_0=x,u_0=u ] \}.$
One can show that as $\beta\to 1$ we have $F(x) - F(0) \to \phi(x)$ and $(1-\beta)F(0) \to\lambda$ (the average cost). Thus we obtain
$\phi(x) + \lambda = \min_u \{ c(x,u) + E[ \phi(x_1) \mid x_0=x,u_0=u ] \}$
and this is our average-cost optimality equation. If you would like to understand why $(1-\beta)F(0) \to\lambda$ see this small note about the connection between the average-cost problem and the discounted-cost problem with $\beta$ near 1.

It is also interesting to think about the following (which I mentioned briefly in lectures today). Suppose we have a deterministic stationary Markov policy, say $\pi$, with $u=f(x)$. Suppose we have $\lambda$ and $\phi$ such that
$$\phi(x) + \lambda = c(x,f(x)) + \sum_y \phi(y) P(y \mid x) \qquad (1)$$
where $P(y \mid x)$ is the transition matrix under $\pi$.
Suppose $\pi$ induces an ergodic Markov chain (i.e. a Markov chain that is irreducible and positive recurrent) and this has invariant distribution $\mu$. We know that
$$\mu_y = \sum_x \mu_x P(y \mid x). \qquad (2)$$
Multiplying (1) through by $\mu_x$ and then summing on $x$, we get
$$\sum_x \mu_x \phi(x) + \lambda \sum_x \mu_x = \sum_x \mu_x c(x,f(x)) + \sum_y \sum_x \mu_x \phi(y) P(y \mid x)$$
which, using (2), gives
$$\sum_x \mu_x \phi(x) + \lambda \sum_x \mu_x = \sum_x \mu_x c(x,f(x)) + \sum_y \mu_y \phi(y).$$
Then using $\sum_x \mu_x = 1$, and cancelling the terms that are equal on both sides, we have
$$\lambda = \sum_x \mu_x c(x,f(x))$$
and so we see that $\lambda$ is the average cost of policy $\pi$.

Thursday, October 30, 2014

Matrix chain multiplication

Following on from supervisions this week, I was wondering about Examples Sheet 1, #1 and the running time of dynamic programming. I found this wiki article about Matrix chain multiplication
https://en.wikipedia.org/wiki/Matrix_chain_multiplication

The article explains how a naive recursive solution (one that recomputes overlapping subproblems) has a run-time that is exponential in the number of matrices, $n$, which is very slow and impractical. That is what I asked about in the examples sheet. However, by remembering partial results, so that the same thing is not computed more than once, the run-time can be reduced to $O(n^3)$. By reducing the problem to one of polygon triangulation the running time can be made $O(n\log n)$.
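Here is a hedged sketch of the $O(n^3)$ memoized recursion the article describes (my own illustration, not the examples-sheet solution): $m(i,j)$ is the minimum number of scalar multiplications needed to compute the product of matrices $i$ through $j$, where matrix $k$ has dimensions $d_{k-1}\times d_k$.

```python
from functools import lru_cache

def matrix_chain_cost(dims):
    """Minimum scalar multiplications to compute A_1 A_2 ... A_n,
    where A_k has shape dims[k-1] x dims[k]."""
    n = len(dims) - 1

    @lru_cache(maxsize=None)
    def m(i, j):                      # cost of computing A_i ... A_j
        if i == j:
            return 0
        return min(m(i, k) + m(k + 1, j) + dims[i - 1] * dims[k] * dims[j]
                   for k in range(i, j))

    return m(1, n)

print(matrix_chain_cost([10, 30, 5, 60]))   # 4500: multiply (A1 A2) first
```

There are $O(n^2)$ subproblems $(i,j)$ and each takes $O(n)$ work, hence $O(n^3)$; without the memoization the same recursion revisits subproblems exponentially often.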

Lecture 7

The Gittins index theorem is one of the most beautiful results in the field of Markov decision processes. Its discovery and proof in 1974 are due to John Gittins. The proof I have given in today's lecture is very different to Gittins's original proof. It is the simplest way to prove the theorem and was first presented in On the Gittins index for multiarmed bandits, Ann. Appl. Prob. 2, 1024-33, 1992. The proof is on pages 1026-27.

For today's lecture I presented some slides on the Gittins Index. You may enjoy looking through the entire talk. It will look better if you download the .pdf file to your computer and read the presentation with a .pdf viewer such as Acrobat, rather than trying to read it within your browser.

The proof of the Gittins index theorem is entertaining, but non-examinable. I would expect you only to know what we mean by a SFABP, the statement of the Gittins index theorem, and how to calculate the indices in simple examples. I thought you would enjoy seeing this beautiful result and how it can be proved. Lectures 1-6 have covered everything you need to know in order to understand the Gittins index. Today's lecture has also been an opportunity to revise ideas that we already met in problems on job scheduling and pharmaceutical trials. 

The prospecting problem in Section 6.5 and Weitzman's Pandora's boxes problem in Section 7.5 are really the same problem, and a simple example to which the Gittins index theorem provides the answer.

Tuesday, October 28, 2014

Lecture 6

Today we looked at some stopping problems. Many practical problems are of this type. They arise in financial mathematics in the theory of option pricing (Black-Scholes). The holder of an American option is allowed to exercise the right to buy (or sell) the underlying asset at a predetermined price at any time before or at the expiry date. The valuation of American options is essentially an optimal stopping problem.

You can read more about Bruss's odds algorithm in his paper: "Sum the odds to one and stop", Annals of Probability, Vol. 28, 1384–1391, 2000. I showed you how to use this to address the Secretary Problem. A very interesting related problem is the "last arrivals problem". Consider a shopkeeper who would like to go home early just after he has served the last potential customer of the day.

I will say more about the mine prospecting problem of Section 6.5 next time.

Let me finish with some hints about questions on Examples Sheets 1 and 2. Question 6 on Sheet 1 can be solved by realizing that there are only two stationary Markov policies. You simply need to decide which is best. Question 7 is a simple stopping problem that can be solved using an OSLA rule. Question 8 is tricky. You should use Theorem 4.2.

Question 9 on Sheet 1 is important because it should convince you that it is possible to solve stochastic dynamic programming equations using linear programming (at least if both state space and action space are finite). If there are $n$ states and $m$ actions (controls) available in each state then we will have an LP in $n$ variables, with $nm$ linear inequality constraints.
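For instance (a sketch, assuming for illustration a discounted-cost formulation; Question 9 may ask for a different variant), one can take the values $F(x)$ as the LP variables:
$$\text{maximize}\ \sum_x F(x)\quad\text{subject to}\quad F(x)\ \leq\ c(x,u)+\beta\sum_yP(y\mid x,u)F(y)\quad\text{for all }x,u,$$
whose optimal solution is the optimal value function. There is one variable per state and one inequality per state-action pair, i.e. $n$ variables and $nm$ constraints.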

When you do Question 1 on Sheet 2 you might think about using Bruss's odds algorithm. Although you cannot apply the algorithm directly, you should be able to solve this problem using the same scheme of proof as we used to prove the optimality of the odds algorithm in this lecture.

Thursday, October 23, 2014

Lecture 5

I started today by describing a famous problem about searching for a randomly moving hidden object. This is an example of a partially observed Markov decision process. This problem has not yet been solved in full generality. However, the problem has been solved in continuous time. See R. R. Weber. Optimal search for a randomly moving object. J. Appl. Prob. 23:708-717, 1986. The reason the problem is easier in continuous time is that the state $x_1$ (the probability that the object is in location 1) changes continuously, rather than in jumps.

Let me emphasize that in Lecture 4, we used $F_s(x)$ to mean the value function of a MDP over $s$ steps, with $F_0(x)=0$. In the N case of $c(x,u)\geq 0$, $F_s(x)$ is clearly nondecreasing in $s$. By contrast, in this lecture, $F_s(x)$ was used to mean the minimum cost in a stopping problem in which we must stop within $s$ steps, and $F_0(x)=k(x)$. $F_s(x)$ is now clearly nonincreasing in $s$. That is because as $s$ increases we are being given more flexibility as to when we must stop. In both cases $F_s(x)$ satisfies an optimality equation and $F_s(x)\to$ a limit, say $F_\infty(x)$, but from a different direction.

Having completed these lectures you know all you need to do all questions on Examples Sheet 1.

Questions 7 and 8 use the idea of a one-step look-ahead (OSLA) rule that we have met in today's lecture. Question 8 is subtle, because although it is easy to guess the answer by thinking about an OSLA rule, how do you prove this answer is correct? Hint: use Theorem 4.2. Be careful when writing down the dynamic programming equation for Question 8 that you put the expected value (or integral and density function for an exponential random variable) in the right place relative to where you place your $\max\{\ ,\ \}$.

Questions 9-11 are deeper, but that is mostly because of their lengthy descriptions. The solutions are not actually hard. In Q9, you'll want to use the idea of substituting an equation into itself. One of the reasons I include these questions is so you can see some typical research-level problems in this field. I have written a research paper about Question 11, ABCs of the bomber problem and its relatives, Annals of Operations Research, 2011. The fundamental question about the optimal policy in this model is, amazingly, still unsolved.

Tuesday, October 21, 2014

Lecture 4

In Section 4.3 (optimal gambling) we saw that timid play is optimal in the gambling problem when the game is favourable to the gambler ($p \geq 0.5$).

This should remind you of the question on the IB Markov Chains examples sheet 1 that begins "A gambler has £2 and needs to increase it to £10 in a hurry. The gambler decides to use a bold strategy in which he stakes all his money if he has £5 or less, and otherwise stakes just enough to increase his capital, if he wins, to £10."

In the case $p \leq 0.5$ bold play maximizes the probability that the gambler reaches N pounds. However, bold play may not be uniquely optimal. An example that exhibits this non-uniqueness of optimal strategy is contained in question 13 of the IB Markov Chains example sheet 1 (2012 version). 

When it was first discovered that bold play is not uniquely optimal, it was contrary to the accepted wisdom at that time (1955). It impressed Leonard Savage when it was presented to him by a Ph.D. student named Lester Dubins. They developed a collaboration culminating in the famous monograph How to Gamble if You Must (Inequalities for Stochastic Processes). See the fourth paragraph of this obituary of Lester Dubins.

If $p=0.5$ all strategies are optimal. How could we prove that? Easy. Simply show that given any policy $\pi$ the value function is $F(\pi,i)=i/N$ and that this satisfies the dynamic programming equation. Then apply Theorem 4.2. You can read more about these so-called red and black games at the Virtual Laboratories in Probability and Statistics.
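To spell out that calculation (a sketch, assuming stakes are restricted so that the capital stays in $\{0,1,\dotsc,N\}$): with $F(i)=i/N$, $F(0)=0$, $F(N)=1$, and $p=0.5$, for any stake $s$
$$\tfrac12 F(i+s)+\tfrac12 F(i-s)=\tfrac12\frac{i+s}{N}+\tfrac12\frac{i-s}{N}=\frac{i}{N}=F(i),$$
so every stake achieves the maximum in the dynamic programming equation and $F(i)=i/N$ satisfies it.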

In Section 4.5 (pharmaceutical trials) I introduced an important class of very practical problems. One obvious generalization is to a problem with $k$ drugs, each of which has a different unknown probability of success, and about which we will learn as we make trials of the drugs. This is called a multi-armed bandit problem. The name comes from thinking about a gaming machine (or fruit machine) having $k$ arms that you might pull, one at a time. The arms have different unknown probabilities of generating payouts. In today's lecture we considered a special case of the two-armed bandit problem in which one of the arms has a known success probability, $p$, and the second arm has an unknown success probability, $\theta$. I will say more about these problems in Lecture 7. The table on page 16 was computed by value iteration. The table is from the book Multi-armed Bandit Allocation Indices.
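The following is a sketch of the sort of value-iteration calculation that produces such a table (my own illustration with made-up parameters; it is not the code behind the notes). The unknown arm is summarised by the Beta$(a,b)$ posterior for $\theta$; we compare pulling it against retiring to a known arm that pays $p$ per pull, and the indifference value of $p$ is the Gittins index.

```python
from functools import lru_cache

BETA = 0.9     # discount factor (illustrative)
DEPTH = 100    # truncation depth; deep enough that discounting washes out

def one_armed_value(p, a, b):
    """n-step value when the unknown arm has a Beta(a, b) posterior and the
    alternative is a known arm paying p per pull."""
    @lru_cache(maxsize=None)
    def F(a, b, n):
        if n == 0:
            return 0.0
        known = p * (1 - BETA**n) / (1 - BETA)      # pull the known arm forever
        mean = a / (a + b)                          # posterior mean of theta
        unknown = (mean * (1 + BETA * F(a + 1, b, n - 1))
                   + (1 - mean) * BETA * F(a, b + 1, n - 1))
        return max(known, unknown)
    return F(a, b, DEPTH)

def gittins_index(a, b, tol=1e-4):
    """The p at which we are indifferent between the two arms (by bisection)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        p = (lo + hi) / 2
        if one_armed_value(p, a, b) > p * (1 - BETA**DEPTH) / (1 - BETA) + 1e-12:
            lo = p      # unknown arm still strictly preferred, so index > p
        else:
            hi = p
    return (lo + hi) / 2

print(gittins_index(1, 1))   # index of an arm with a uniform prior on theta
```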

Thursday, October 16, 2014

Lecture 3

Theorem 3.1 is our first serious theorem. It had an easy but non-trivial proof. It is important because it allows us to know that $F(x)$ satisfies a DP equation (3.7). It holds under the cases of D (discounted), N (negative) or P (positive) programming.

The problem of selling a tulip bulb collection in Section 3.5 is very much like the secretary problem in Section 2.3. The differences are that now (i) we observe values (not just relative ranks), (ii) we wish to maximize the expected value of the selected candidate (rather than the probability of choosing the best), and (iii) the number of offers is infinite, but with a discount factor $\beta$. We see that one way in which discounting by a factor $\beta$ can naturally occur is via a catastrophe, with probability $1-\beta$, bringing the problem to an end.

I asked if you can figure out the answer to the asset selling problem if past offers for the tulip bulb collection remain open (so long as the market has not collapsed). The state $x$ is now the best offer so far received. The DP equation would be
$$
F(x) = \int_0^\infty\max\Bigl[x,y,\beta F(\max\{x,y\})\Bigr] g(y) dy
$$
The validity of this equation follows from Theorem 3.1 and the fact that this is a Positive case of dynamic programming.
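One could check such an equation numerically by value iteration (a sketch under the extra assumption that offers are uniform on a grid in $[0,1]$; the grid and all parameter values are my own choices):

```python
import numpy as np

beta = 0.9
offers = np.linspace(0.0, 1.0, 101)      # grid of offers; g(y) uniform on [0, 1]

F = np.zeros_like(offers)                # F(x) for x on the same grid
for _ in range(500):                     # iterate the DP operator to a fixed point
    F_new = np.empty_like(F)
    for i, x in enumerate(offers):
        best = np.maximum(x, offers)     # best offer so far if y arrives: max{x, y}
        idx = np.maximum(i, np.arange(offers.size))
        F_new[i] = np.mean(np.maximum(best, beta * F[idx]))
    F = F_new

# Accept as soon as the best offer so far reaches the level where x >= beta F(x).
print(offers[np.argmax(offers >= beta * F)])
```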

In the proof of Theorem 3.1 I used the fact that $\lim_{s\to\infty}E F_s(x_1)=E[\lim_{s\to\infty} F_s(x_1)]$ when $F_s$ is either monotone increasing or decreasing in $s$, as it indeed is in the N and P cases. This interchange is justified by the Lebesgue monotone convergence theorem. In the D case the interchange of $E$ and $\lim_{s\to\infty}$ is valid because $F_s(x)$ is close to its limit for large $s$, uniformly in $x$.

I remarked today that the D case can be recast as an N or P case. I have placed this remark in Section 4.5 of the notes.

Wednesday, October 15, 2014

Notation: less is more

A student has asked which of these is the correct way to write the DP equation for a deterministic plant equation $x_{t+1}=a(x_t,u_t,t)$:
\begin{align*}
F(x_t,t)& = \inf_{u_t} \Big[c(x_t,u_t) + F(a(x_t,u_t,t),t+1) \Big] \tag{1}\\
F(x,t)& = \inf_{u} \Big[c(x,u) + F(a(x,u,t),t+1) \Big]\tag{2}
\end{align*}
In fact, both are correct, since they express the same thing.

Recall that $F(x,t)$ is defined as $\inf_\pi C_t$, where $C_t$ is the total cost incurred over times $t,t+1,\dotsc,h$, given that we start in state $x$ at time $t$. Thus the two arguments of $F$ are the state and time. So (1) and (2) are saying the same thing. You may find it helpful to write $x_t$ as a reminder that the first argument is the state at time $t$, or you may prefer to just write $x$, since this is simpler. The choice is yours.

When I first encountered this subject the lecturer, Peter Whittle, was in the habit of aggressively omitting subscripts of $t$. I found that sometimes confusing, and I would go through my notes replacing $x$ by $x_t$. But actually, once one understands the meaning of (1) and (2) there is really no need to be confused. We can see (2) as more beautiful, and as an example of the dictum:
Less is more (architect Ludwig Mies van der Rohe speaking of minimalist design)

The typical stochastic DP equation is
\begin{equation}
F(x,t) = \inf_u \Big[ c(x,u,t) + E[F(x_{t+1},t+1) \mid x_t=x, u_t=u] \Big].\tag{3}
\end{equation}
Now it is helpful to show by $E[\ \cdot \ \mid x_t=x, u_t=u]$ that we are taking an expectation that is conditional on certain information, namely that $x_t=x$ and $u_t=u$. Remember that in (3) the random variable is $x_{t+1}$.

Tuesday, October 14, 2014

Lecture 2

The highlight of today's lecture was the Secretary Problem. This is the most famous of all problems in the field of optimal stopping. It is credited to Merrill M. Flood in 1949, who called it the fiancée problem. It gained wider publicity when it appeared in Martin Gardner's column of Scientific American in 1960. There is an interesting Wikipedia article about it. One of the interesting things said in this article is that in behavioural studies people tend to stop too soon (i.e. marry too soon, make a purchase too soon). See The devil is in the details: Incorrect intuitions in optimal search.

The story about Kepler's search for a wife is taken from the paper, Who Solved the Secretary Problem?, by Thomas S. Ferguson. He also discusses the related game of Googol.

A variation of the problem that has never been completely solved is the so-called Robbins' Problem. In this problem we do observe values of candidates, say $X_1,\dotsc, X_h$, and these are assumed to be independent, identically distributed uniform$[0,1]$ random variables. The objective is to minimize the expected rank of the candidate that is selected (best = rank 1, second-best = rank 2, etc). It is known only that, as $h$ goes to infinity, the expected rank that can be achieved under an optimal policy lies between 1.908 and 2.329. This problem is much more difficult than the usual secretary problem because the decision as to whether or not to hire candidate $t$ must depend upon all the values of $X_1,\dotsc, X_t$, not just upon how $X_t$ ranks amongst them.

Following this lecture you can do questions 1–4 and 10 on Example Sheet 1. Question 2 is quite like the secretary problem (and also has a surprising answer). The tricks that have been explained in today's lecture are useful in solving these questions (working in terms of time to go, backwards induction, that a bang-bang control arises when the objective is linear in $u_t$, looking at the cross-over between increasing and decreasing terms within a $\max\{ , \}$, as we did in the secretary problem with $\max\{t/h, F(t)\}$).
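As a small numerical illustration of these ideas (a sketch using the standard closed-form success probability of a threshold rule, rather than the exact recursion in the notes): the rule "reject the first $r$ candidates, then accept the next candidate better than all seen so far" succeeds with probability $(r/h)\sum_{j=r+1}^{h}1/(j-1)$, and maximizing over $r$ recovers the familiar threshold $r\approx h/e$.

```python
import math

def success_probability(r, h):
    """P(select the best of h) for: reject the first r, then take the first
    candidate better than everything seen so far."""
    if r == 0:
        return 1.0 / h
    return (r / h) * sum(1.0 / (j - 1) for j in range(r + 1, h + 1))

h = 100
best_r = max(range(h), key=lambda r: success_probability(r, h))
print(best_r, round(success_probability(best_r, h), 4), round(h / math.e, 1))
```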

A student asked in the lecture, in regard to Example 2.3, about the characterization of $a_s$ as the least $x$ such that $F_s(x)-x= -p$.

Certainly $a_s\geq p$, since $F_s(p)-p> -p$. It might be that $a_s=\infty$ if $F_s(x)-x> -p$ for all $x$. This could happen, for example, if $\epsilon_t=1$ for all $t$. In that case the share price is guaranteed to be increasing and so one would never wish to exercise the option before the first day. An optimal policy is described by $(a_0,a_1,a_2,\dotsc)=(p,\infty,\infty,\dotsc)$. I have made a small change on page 7 of the notes so this case is mentioned.

Thursday, October 9, 2014

What is a state?

From time to time I will add a post to this blog which I hope will add to your appreciation of the course. This post relates to today's lecture, in which we had the idea of a state-structured model. The notion of a state is intuitive, or is it?

"What is a state?"

Warren B. Powell (a professor at Princeton and researcher in optimization) has mused upon this question. You can read his blog about this here: http://www.castlelab.princeton.edu/cso.htm

I think he makes a nice definition of a state, in these words:

A state variable is a minimally dimensioned function of history that is necessary and sufficient to compute the decision function, cost/reward/contribution function, and the transition function.


By "decision function" he means the $F(x,t)$ that I was today calling the "value function". For the other terms he is thinking of $c(x,u,t)$ and $a(x,u,t)$.

He discusses the fact that "state" can be a physical state, an information state, or a belief state.

Please be aware that a state can be a vector of more than one component. For example, consider the uncontrolled autoregressive plant equation,

$x_t = a_1 x_{t-1} + a_2 x_{t-2}$.

Here we would take the state to be $y_t=(x_t,x_{t-1})$, because $x_t$ cannot be computed from $x_{t-1}$ alone: we need the whole of $y_{t-1}=(x_{t-1},x_{t-2})$ to compute $y_t$.
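In matrix form (just to spell out the point), with $y_t=(x_t,x_{t-1})^T$ the second-order recursion becomes a first-order plant equation,
$$y_t=\begin{pmatrix}a_1 & a_2\\ 1 & 0\end{pmatrix}y_{t-1},$$
so $y_t$ is a genuine (Markov) state even though $x_t$ on its own is not.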

Warren is particularly interested in reconciling the many communities of stochastic optimization, amongst which he includes
  • Dynamic programming (including Markov decision processes)
  • Stochastic programming (two-stage, multistage, chance constrained)
  • Stochastic search (including simulation optimization, infinitesimal perturbation analysis, and ranking and selection)
  • Optimal control (primarily engineering, but also economics and finance)
  • Approximate dynamic programming and reinforcement learning
  • ...
The first and fourth of these are the ones which our course addresses. You may find it interesting to read his essay on what is a state?

Lecture 1

Today we had definitions and notation for state, control, history, value function, etc, and have developed dynamic programming equations for a very general case and a state-structured case. Please be patient with the notation. It is not as complex as it may first appear. Things like $a(x,u,t)$, $F(x_t,t)$, $u_t$, and $U_t$ will begin to seem like old friends once you have used them a few times.

The terminology "plant equation" for $x_{t+1}=a(x_t,u_t,t)$ derives from the fact that early optimal control theory was developed with applications to industrial processes in mind, especially chemical plants. We also call it the dynamics equation.

From this first lecture you should be taking away the key idea of dynamic programming, and the fact that problems in stages (with separable cost and a finite time horizon) can often be solved by working backwards from the final stage. The minimum length path (stage coach) problem is trivial, but should have made these ideas very intuitive. You might like to read the Wikipedia entry for Richard Bellman, who is credited with the invention of dynamic programming in 1953.

The course page gives some hyperlinks to the recommended booklist. In particular, Dimitri Bertsekas has a web page for his book and slides from his lectures. You might find these interesting to browse through at some later stage.

I mentioned that I had once appeared on ITV's Who Wants to be a Millionaire (October 2003) and that playing it has aspects of dynamic programming. There is a nice Part II exam question on this, including the model solution. You might like to look at this now – simply to see the sort of mathematical problem that this course will enable you to solve. You can also view the overhead slides for a little presentation called A Mathematician Plays "Who Wants to Be a Millionaire?" which I once gave to a Royal Institution workshop for school children. You might like to see how I made best use of my "Ask the audience" lifeline by employing an idea from statistics. Basically, I asked members of the audience not to vote if they felt at all unsure of the right answer.

Examples Sheet 1. Today's lecture has provided all you need to know to do question #1 on Examples Sheet 1. In doing this rather strange question you should grasp the idea that dynamic programming applies to problems in which cost is incurred in stages. In many problems the stages are time points ($t=0,1,\dotsc$), but in others the stages can be different.

The remaining questions are on Markov decision problems, which we will be addressing in Lectures 2-6.

MathJax

In these blog posts I use MathJax, which claims to achieve "Beautiful math in all browsers". By using MathJax I can write normal LaTeX code within a post.

If MathJax does not look right, then you might need to refresh the page. Also, Javascript must be turned on. If you try to view using a browser that does not have Javascript then you will see the original LaTeX.

Wednesday, October 1, 2014

Course starts October 9, 2014

This is the blog page for the course of 16 lectures starting 9 October 2014 (Tue/Thu @ 12 in CMS meeting room 4). At the right is a link to my draft notes for all 16 lectures. 

My aim in these notes is to tread a Goldilocks path, by being neither too brief nor too verbose. I try to make each lecture a sort of self-contained seminar, with 4 pages of notes. I will slightly amend and change these notes as the course proceeds. In particular, I may be doing some things differently in the later lectures. Some students like to print the notes in advance of the lecture and then write things in the margins when hearing me talk about things that are not in the notes.

 I will use this space to talk about some extra things. Sometimes leaving a lecture I think, "I wish I had said ...". This blog gives me a place to say it. Or I can use this space to talk about a question that a student has asked. Or I might comment on an examples sheet question. There is a space after each post for you to write comments or ask questions using the IntenseDebate commenting system. 

 You can subscribe to be sent an email notification of new posts. 

Some of you may be interested to know that a Careers Open Day is being held by the Operational Research Society on Wednesday 19 November in Birmingham, www.theorsociety.com/careersopenday.