Tuesday, January 29, 2013

Lecture 4

One of the CATAM projects this year is Policy Improvement for a Markov Decision Process. The next few lectures (especially Lecture 7) are relevant to answering the theoretical parts of this project, and doing the project will help you better understand this course, so I recommend it to you.

In Section 4.3 (optimal gambling) we saw that timid play is optimal in the gambling problem when the game is favorable to the gambler ($p \geq 0.5$).

Perhaps you remember the question on the IB Markov Chains examples sheet 1 that begins "A gambler has £2 and needs to increase it to £10 in a hurry. The gambler decides to use a bold strategy in which he stakes all his money if he has £5 or less, and otherwise stakes just enough to increase his capital, if he wins, to £10."

In the case $p \leq 0.5$ bold play maximizes the probability that the gambler reaches £N. However, bold play need not be uniquely optimal. An example that exhibits this non-uniqueness of optimal strategy is contained in question 13 of the IB Markov Chains examples sheet 1.
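If you would like to see the numbers behind the quoted question, here is a minimal sketch (not part of the lecture) that computes the success probability of the bold strategy by iterating the hitting-probability recursion. The win probability $p=0.4$ is only an illustrative choice, since the question does not fix it here.

```python
# A minimal sketch (not from the notes) computing the probability that the
# bold strategy quoted above reaches N = 10 pounds. It iterates the hitting-
# probability recursion u(i) = p*u(i+stake) + (1-p)*u(i-stake) with u(0) = 0
# and u(N) = 1. The win probability p = 0.4 is an illustrative assumption.

def bold_success_probability(N=10, p=0.4, tol=1e-12):
    u = [0.0] * (N + 1)
    u[N] = 1.0
    while True:
        diff = 0.0
        for i in range(1, N):
            # Bold play: stake everything if capital <= N/2, otherwise stake
            # just enough to reach N on a win.
            stake = i if i <= N // 2 else N - i
            new = p * u[i + stake] + (1 - p) * u[i - stake]
            diff = max(diff, abs(new - u[i]))
            u[i] = new
        if diff < tol:
            return u

if __name__ == "__main__":
    u = bold_success_probability()
    print(f"P(reach 10 pounds starting from 2) = {u[2]:.4f}")
```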

When it was first discovered that bold play is not uniquely optimal, it was contrary to the accepted wisdom of that time (1955). It impressed Leonard Savage when it was presented to him by a Ph.D. student named Lester Dubins. They developed a collaboration culminating in the famous monograph How to Gamble if You Must (Inequalities for Stochastic Processes). See the fourth paragraph of this obituary of Lester Dubins.

If $p=0.5$ all strategies are optimal. How could we prove that? Easy. Simply show that for any policy $\pi$ the value function is $F(\pi,i)=i/N$ and that this satisfies the DP equation. Then apply Theorem 4.1. You can read more about these so-called red and black games at the Virtual Laboratories in Probability and Statistics.
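Here is the verification written out as a sketch, using the DP equation of Section 4.3 with boundary conditions $F(0)=0$ and $F(N)=1$. For any allowable stake $s$,
$$\tfrac{1}{2}\,\frac{i+s}{N} + \tfrac{1}{2}\,\frac{i-s}{N} = \frac{i}{N},$$
so when $p=0.5$ the function $F(i)=i/N$ satisfies $F(i)=\max_{1\leq s\leq \min(i,\,N-i)}\bigl\{p\,F(i+s)+(1-p)\,F(i-s)\bigr\}$, with every stake attaining the maximum. Hence every policy has value $i/N$, and by Theorem 4.1 every policy is optimal.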

In Section 4.5 (pharmaceutical trials) I introduced an important class of very practical problems. One obvious generalization is to a problem with $k$ drugs, each of which has a different unknown probability of success, and about which we learn as we make trials of the drugs. This is called a multi-armed bandit problem. The name comes from thinking about a gaming machine (or fruit machine) having $k$ arms that you might pull, one at a time. The arms have different unknown probabilities of generating payouts. In today's lecture we considered a special case of the two-armed bandit problem in which one of the arms has a known success probability, $p$, and the second arm has an unknown success probability, $\theta$. I will say more about these problems in Lecture 6. The table on page 16 was computed by value iteration. This table is from the book Multi-armed Bandit Allocation Indices (of which I am one of the authors). The amazon.co.uk page will let you look inside this book.
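For readers who want to experiment, here is a rough sketch of the sort of value iteration that produces such a table. It is not the exact computation behind page 16: the Beta(1,1) prior on $\theta$, the discount factor and the number of remaining trials are all illustrative assumptions, and the known-drug branch is collapsed to a geometric sum on the grounds that once the known drug is used it is never abandoned (no further information is gained).

```python
# A rough sketch of value iteration for the one-armed bandit of Section 4.5:
# one drug with known success probability p, one with unknown probability
# theta. Assumptions for illustration (not the exact computation in the
# notes): a Beta(1,1) prior on theta, discount factor beta, t trials
# remaining, reward 1 per success, and the structural fact that once the
# known drug is used it is used ever after, so that branch is a geometric sum.

from functools import lru_cache

def make_value_function(p=0.5, beta=0.9):
    @lru_cache(maxsize=None)
    def F(s, f, t):
        """Maximal expected discounted number of successes with t trials
        remaining, when the posterior on theta is Beta(s, f)."""
        if t == 0:
            return 0.0
        # Branch 1: switch to the known drug and stay with it.
        known = p * (1 - beta ** t) / (1 - beta)
        # Branch 2: try the unknown drug once more; m is the posterior mean.
        m = s / (s + f)
        unknown = (m * (1 + beta * F(s + 1, f, t - 1))
                   + (1 - m) * beta * F(s, f + 1, t - 1))
        return max(known, unknown)

    return F

if __name__ == "__main__":
    F = make_value_function(p=0.5, beta=0.9)
    # Value under the uniform prior Beta(1,1) with 50 trials to go; it is
    # optimal to try the unknown drug first iff its branch attains the max.
    print(F(1, 1, 50))
```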