Tuesday, January 26, 2016

Lecture 4

In Section 4.3 (optimal gambling) we saw that timid play is optimal in the gambling problem when the game is favourable or fair to the gambler ($p \geq 0.5$).

Similarly, a bold strategy is optimal in the case $p<0.5$. But this is harder to prove because it is not so easy to find an expression for the value function of the bold strategy. (This might remind you of the question on the IB Markov Chains examples sheet 1 that begins "A gambler has £2 and needs to increase it to £10 in a hurry. The gambler decides to use a bold strategy in which he stakes all his money if he has £5 or less, and otherwise stakes just enough to increase his capital, if he wins, to £10.")
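Even without a closed-form expression, the value function of bold play is easy to compute numerically, since it satisfies $F(i)=pF(i+s)+(1-p)F(i-s)$ with stake $s=\min(i,N-i)$, $F(0)=0$ and $F(N)=1$. Here is a minimal Python sketch by successive substitution (the function name and default parameters are my own, chosen to match the IB question):

```python
def bold_play_values(N=10, p=0.4, tol=1e-12, max_iter=100_000):
    # F[i] = probability of reaching N before 0 under bold play, starting at i
    F = [0.0] * (N + 1)
    F[N] = 1.0
    for _ in range(max_iter):
        delta = 0.0
        for i in range(1, N):
            s = min(i, N - i)                       # the bold stake
            new = p * F[i + s] + (1 - p) * F[i - s]
            delta = max(delta, abs(new - F[i]))
            F[i] = new
        if delta < tol:
            break
    return F

if __name__ == "__main__":
    # The IB question: start with 2, target 10
    print(bold_play_values(N=10, p=0.4)[2])
```

Successive substitution converges here because under bold play the gambler's fortune is absorbed at $0$ or $N$ with probability one, so the recursion has a unique bounded solution. Note that cycles such as $6 \to 2 \to 4 \to 8 \to 6$ are possible, which is why one iterates rather than back-substituting.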

If $p=0.5$, all strategies are optimal. How could we prove that? Easy. Simply show that, for any policy $\pi$, the value function is $F(\pi,i)=i/N$, and that this satisfies the dynamic programming equation. Then apply Theorem 4.2. You can read more about these so-called red and black games at the Virtual Laboratories in Probability and Statistics.
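The verification takes one line. Writing $a$ for the stake, with $0\le a\le \min(i,N-i)$, we have when $p=0.5$
$$\tfrac{1}{2}\,\frac{i+a}{N}+\tfrac{1}{2}\,\frac{i-a}{N}=\frac{i}{N},$$
so $F(i)=i/N$, together with $F(0)=0$ and $F(N)=1$, satisfies $F(i)=\max_a\bigl\{pF(i+a)+(1-p)F(i-a)\bigr\}$, and every permitted stake attains the maximum. Hence no strategy can do better, or worse, than any other.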

In Section 4.5 (pharmaceutical trials) I introduced an important class of very practical problems. One obvious generalization is to a problem with $k$ drugs, each of which has a different unknown probability of success, and about which we learn as we make trials of the drugs. This is called a multi-armed bandit problem. The name comes from thinking of a gaming machine (or fruit machine) having $k$ arms that you might pull, one at a time. The arms have different unknown probabilities of generating payouts. In today's lecture we considered a special case of the two-armed bandit problem in which one arm has a known success probability, $p$, and the other arm has an unknown success probability, $\theta$. I will say more about these problems in Lecture 7. The table on page 16 was computed by value iteration; it is taken from the book Multi-armed Bandit Allocation Indices.
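For a flavour of how such a computation goes, here is a minimal value-iteration sketch for the problem in today's lecture: one arm with known success probability $p$, one arm with unknown $\theta$. The names, the uniform (Beta(1,1)) prior on $\theta$, the finite horizon and the absence of discounting are my assumptions for illustration; they need not match the conventions used for the actual table.

```python
from functools import lru_cache

def best_value(p, n):
    """Maximal expected number of successes over n trials.
    (s, f) = successes/failures observed so far on the unknown arm,
    t = trials remaining."""

    @lru_cache(maxsize=None)
    def F(s, f, t):
        if t == 0:
            return 0.0
        m = (s + 1) / (s + f + 2)        # posterior mean of theta
        known = p + F(s, f, t - 1)       # pull the known arm
        unknown = m * (1 + F(s + 1, f, t - 1)) + (1 - m) * F(s, f + 1, t - 1)
        return max(known, unknown)

    return F(0, 0, n)

if __name__ == "__main__":
    print(best_value(p=0.6, n=20))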