\n", "\n", "**General concepts**\n", "\n", "Reference: McClelland, J. L., Rumelhart, D. E. (1989). Explorations in parallel distributed processing: A handbook of models, programs, and exercises. Chapter 9, MIT press. url: [web.stanford.edu/group/pdplab/pdphandbook/handbookch10.html](https://web.stanford.edu/group/pdplab/pdphandbook/handbookch10.html)\n", "\n", "

\n", "\n", "* Return $G_{t}$: future cumulative reward at time $t$:\n", "\\begin{equation}\n", "G_{t} = \\sum \\limits_{k = 0}^{\\infty} \\gamma^{k} r_{t+k+1}\n", "\\end{equation}\n", "where $r_{t}$ is the amount of reward received at time $t$, and $\\gamma \\in [0, 1]$ is a discount factor that specifies the relevance in the present of future rewards. \n", "Note that the return $G_{t}$ can be written in a recursive form:\n", "\\begin{equation}\n", "G_{t} = r_{t+1} + \\gamma G_{t+1}\n", "\\end{equation}\n", "\n", "* State $s$ describes the current state or situation, typically obtained from observations that the agent receives from the environment.\n", "\n", "* Policy $\\pi$ is a specification of how the agent acts. $\\pi(a|s)$ gives the probability of taking action $a$ when in state $s$.\n", "\n", "* The value function $V_{\\pi}(s_t=s)$ is defined as the expected return starting with state $s$ and successively following policy $\\pi$. Roughly speaking, the value function estimates \"how good\" it is to be in state $s$ when following policy $\\pi$.\n", "\\begin{align}\n", "V_{\\pi}(s_t=s) &= \\mathbb{E} [ G_{t}\\; | \\; s_t=s, a_{t:\\infty}\\sim\\pi] \\\\\n", "& = \\mathbb{E} [ r_{t+1} + \\gamma G_{t+1}\\; | \\; s_t=s, a_{t:\\infty}\\sim\\pi]\n", "\\end{align}\n", "\n", "* Combining the above, we have:\n", "\\begin{align}\n", "V_{\\pi}(s_t=s) &= \\mathbb{E} [ r_{t+1} + \\gamma V_{\\pi}(s_{t+1})\\; | \\; s_t=s, a_{t:\\infty}\\sim\\pi] \\\\\n", "&= \\sum_a \\pi(a|s) \\sum_{r, s'}p(s', r)(r + V_{\\pi}(s_{t+1}=s'))\n", "\\end{align}\n", "\n", "

\n", "\n", "**Temporal difference (TD) learning**\n", "\n", "* With a [Markovian assumption](https://en.wikipedia.org/wiki/Markov_property), we can use $V(s_{t+1})$ as a proxy for the true value of the return $G_{t+1}$. Thus, we obtain a generalized equation to calculate the TD-error:\n", "\\begin{align}\n", "\\delta_{t} = r_{t+1} + \\gamma V(s_{t+1}) - V(s_{t})\n", "\\end{align}\n", "\n", "* The TD-error measures the discrepancy between the values at time $t$ and $t+1$. Once the TD-error is calculated, we can perform a \"value update\" to to reduce the value discrepancy: \n", "\n", "\\begin{align}\n", "V(s_{t}) \\leftarrow V(s_{t}) + \\alpha \\delta_{t}\n", "\\end{align}\n", "\n", "* The speed by which the discrepancy is reduced is specified by a constant (aka hyperparameter) $\\alpha$, called learning rate. \n", "\n", "

\n", "\n", "**Definitions (tl;dr):**\n", "\n", "* Return:\n", "\\begin{equation}\n", "G_{t} = \\sum \\limits_{k = 0}^{\\infty} \\gamma^{k} r_{t+k+1} = r_{t+1} + \\gamma G_{t+1}\n", "\\end{equation}\n", "\n", "* TD-error:\n", "\\begin{equation}\n", "\\delta_{t} = r_{t+1} + \\gamma V(s_{t+1}) - V(s_{t})\n", "\\end{equation}\n", "\n", "* Value updates:\n", "\\begin{equation}\n", "V(s_{t}) \\leftarrow V(s_{t}) + \\alpha \\delta_{t}\n", "\\end{equation}" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "## Coding Exercise 1: TD-learning with guaranteed rewards\n", " \n", "In this exercise, you will implement TD-learning to estimate the state-value function in the classical conditioning paradigm. Rewards have fixed magnitude and are delivered at a fixed delay after the conditioned stimulus, CS. You should save the TD-errors over learning (i.e., over trials) so we can visualize them afterwards. \n", "\n", "In order to simulate the effect of the CS, you should update $V(s_{t})$ only during the delay period after CS. This period is indicated by the boolean variable `is_delay`. This can be implemented by multiplying the expression for updating the value function by `is_delay`.\n", "\n", "Use the provided code to estimate the value function. We will use helper class `ClassicalConditioning`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "def td_learner(env, n_trials, gamma=0.98, alpha=0.001):\n", " \"\"\" Temporal Difference learning\n", "\n", " Args:\n", " env (object): the environment to be learned\n", " n_trials (int): the number of trials to run\n", " gamma (float): temporal discount factor\n", " alpha (float): learning rate\n", "\n", " Returns:\n", " ndarray, ndarray: the value function and temporal difference error arrays\n", " \"\"\"\n", " V = np.zeros(env.n_steps) # Array to store values over states (time)\n", " TDE = np.zeros((env.n_steps, n_trials)) # Array to store TD errors\n", "\n", " for n in range(n_trials):\n", "\n", " state = 0 # Initial state\n", "\n", " for t in range(env.n_steps):\n", "\n", " # Get next state and next reward\n", " next_state, reward = env.get_outcome(state)\n", "\n", " # Is the current state in the delay period (after CS)?\n", " is_delay = env.state_dict[state][0]\n", "\n", " ########################################################################\n", " ## TODO for students: implement TD error and value function update\n", " # Fill out function and remove\n", " raise NotImplementedError(\"Student exercise: implement TD error and value function update\")\n", " #################################################################################\n", " # Write an expression to compute the TD-error\n", " TDE[state, n] = ...\n", "\n", " # Write an expression to update the value function\n", " V[state] += ...\n", "\n", " # Update state\n", " state = next_state\n", "\n", " return V, TDE\n", "\n", "\n", "# Initialize classical conditioning class\n", "env = ClassicalConditioning(n_steps=40, reward_magnitude=10, reward_time=10)\n", "\n", "# Perform temporal difference learning\n", "V, TDE = td_learner(env, n_trials=20000)\n", "\n", "# Visualize\n", "learning_summary_plot(V, TDE)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "execution": {} }, "source": [ "[*Click for solution*](https://github.com/NeuromatchAcademy/course-content/tree/main//tutorials/W3D4_ReinforcementLearning/solutions/W3D4_Tutorial1_Solution_adeb004b.py)\n", "\n", "*Example output:*\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "## Interactive Demo 1.1: US to CS Transfer \n", "\n", "During classical conditioning, the subject's behavioral response (e.g., salivating) transfers from the unconditioned stimulus (US; like the smell of tasty food) to the conditioned stimulus (CS; like Pavlov ringing his bell) that predicts it. Reward prediction errors play an important role in this process by adjusting the value of states according to their expected, discounted return.\n", "\n", "Recall that TD-errors are given by:\n", "\n", "\\begin{equation}\n", "\\delta_{t} = r_{t+1} + \\gamma V(s_{t+1}) - V(s_{t})\n", "\\end{equation}\n", "\n", "The delay period has zero reward, so throughout the learning phase, the TD-errors result from inconsistencies between $V(s_{t+1})$ and $V(s_{t})$ (note that the discount factor is set to zero in this example). The TD-errors for a given time point diminish once $V(s_{t})$ approaches $V(s_{t+1})$, but that causes the TD-error for the preceding time point to increase. Thus, throughout learning, the TD-errors will tend to move backwards in time.\n", "\n", "Use the widget below to examine how reward prediction errors change over time. \n", "\n", "Before training (orange line), only the reward state has high reward prediction error (blue line). As training progresses (slider), the reward prediction errors shift to the conditioned stimulus, where they end up when the trial is complete (green line). \n", "\n", "Dopamine neurons, which are thought to carry reward prediction errors _in vivo_, show exactly the same behavior!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Make sure you execute this cell to enable the widget!\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "#@title\n", "\n", "#@markdown Make sure you execute this cell to enable the widget!\n", "\n", "n_trials = 20000\n", "\n", "@widgets.interact\n", "def plot_tde_by_trial(trial = widgets.IntSlider(value=5000, min=0, max=n_trials-1 , step=1, description=\"Trial #\")):\n", " if 'TDE' not in globals():\n", " print(\"Complete Exercise 1 to enable this interactive demo!\")\n", " else:\n", "\n", " fig, ax = plt.subplots()\n", " ax.axhline(0, color='k') # Use this + basefmt=' ' to keep the legend clean.\n", " ax.stem(TDE[:, 0], linefmt='C1-', markerfmt='C1d', basefmt=' ',\n", " label=\"Before Learning (Trial 0)\",\n", " use_line_collection=True)\n", " ax.stem(TDE[:, -1], linefmt='C2-', markerfmt='C2s', basefmt=' ',\n", " label=\"After Learning (Trial $\\infty$)\",\n", " use_line_collection=True)\n", " ax.stem(TDE[:, trial], linefmt='C0-', markerfmt='C0o', basefmt=' ',\n", " label=f\"Trial {trial}\",\n", " use_line_collection=True)\n", "\n", " ax.set_xlabel(\"State in trial\")\n", " ax.set_ylabel(\"TD Error\")\n", " ax.set_title(\"Temporal Difference Error by Trial\")\n", " ax.legend()" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "## Interactive Demo 1.2: Learning Rates and Discount Factors\n", "\n", "Our TD-learning agent has two parameters that control how it learns: $\\alpha$, the learning rate, and $\\gamma$, the discount factor. In Exercise 1, we set these parameters to $\\alpha=0.001$ and $\\gamma=0.98$ for you. Here, you'll investigate how changing these parameters alters the model that TD-learning learns.\n", "\n", "Before enabling the interactive demo below, take a moment to think about the functions of these two parameters. $\\alpha$ controls the size of the Value function updates produced by each TD-error. In our simple, deterministic world, will this affect the final model we learn? Is a larger $\\alpha$ necessarily better in more complex, realistic environments?\n", "\n", "The discount rate $\\gamma$ applies an exponentially-decaying weight to returns occuring in the future, rather than the present timestep. How does this affect the model we learn? What happens when $\\gamma=0$ or $\\gamma \\geq 1$?\n", "\n", "Use the widget to test your hypotheses.\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Make sure you execute this cell to enable the widget!\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "#@title\n", "\n", "#@markdown Make sure you execute this cell to enable the widget!\n", "\n", "@widgets.interact\n", "def plot_summary_alpha_gamma(alpha = widgets.FloatSlider(value=0.0001, min=0.0001, max=0.1, step=0.0001, readout_format='.4f', description=\"alpha\"),\n", " gamma = widgets.FloatSlider(value=0.980, min=0, max=1.1, step=0.010, description=\"gamma\")):\n", " env = ClassicalConditioning(n_steps=40, reward_magnitude=10, reward_time=10)\n", " try:\n", " V_params, TDE_params = td_learner(env, n_trials=20000, gamma=gamma, alpha=alpha)\n", " except NotImplementedError:\n", " print(\"Finish Exercise 1 to enable this interactive demo\")\n", "\n", " learning_summary_plot(V_params,TDE_params)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "execution": {} }, "source": [ "[*Click for solution*](https://github.com/NeuromatchAcademy/course-content/tree/main//tutorials/W3D4_ReinforcementLearning/solutions/W3D4_Tutorial1_Solution_80376f94.py)\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "---\n", "# Section 2: TD-learning with varying reward magnitudes\n", "\n", "*Estimated timing to here from start of tutorial: 30 min*\n", "\n", "In the previous exercise, the environment was as simple as possible. On every trial, the CS predicted the same reward, at the same time, with 100% certainty. In the next few exercises, we will make the environment more progressively more complicated and examine the TD-learner's behavior. \n" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "## Interactive Demo 2: Match the Value Functions\n", "\n", "First, will replace the environment with one that dispenses one of several rewards, chosen at random. Shown below is the final value function $V$ for a TD learner that was trained in an environment where the CS predicted a reward of 6 or 14 units; both rewards were equally likely). \n", "\n", "Can you find another pair of rewards that cause the agent to learn the same value function? Assume each reward will be dispensed 50% of the time. \n", "\n", "Hints:\n", "* Carefully consider the definition of the value function $V$. This can be solved analytically.\n", "* There is no need to change $\\alpha$ or $\\gamma$. \n", "* Due to the randomness, there may be a small amount of variation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Make sure you execute this cell to enable the widget! Please allow some time for the new figure to load\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @markdown Make sure you execute this cell to enable the widget! Please allow some time for the new figure to load\n", "\n", "n_trials = 20000\n", "np.random.seed(2020)\n", "rng_state = np.random.get_state()\n", "env = MultiRewardCC(40, [6, 14], reward_time=10)\n", "V_multi, TDE_multi = td_learner(env, n_trials, gamma=0.98, alpha=0.001)\n", "\n", "@widgets.interact\n", "def reward_guesser_interaction(r1 = widgets.IntText(value=0, min=0, max=50, description=\"Reward 1\"),\n", " r2 = widgets.IntText(value=0, min=0, max=50, description=\"Reward 2\")):\n", " try:\n", " env2 = MultiRewardCC(40, [r1, r2], reward_time=10)\n", " V_guess, _ = td_learner(env2, n_trials, gamma=0.98, alpha=0.001)\n", " fig, ax = plt.subplots()\n", " m, l, _ = ax.stem(V_multi, linefmt='y-', markerfmt='yo', basefmt=' ', label=\"Target\",\n", " use_line_collection=True)\n", " m.set_markersize(15)\n", " m.set_markerfacecolor('none')\n", " l.set_linewidth(4)\n", " m, _, _ = ax.stem(V_guess, linefmt='r', markerfmt='rx', basefmt=' ', label=\"Guess\",\n", " use_line_collection=True)\n", " m.set_markersize(15)\n", "\n", " ax.set_xlabel(\"State\")\n", " ax.set_ylabel(\"Value\")\n", " ax.set_title(\"Guess V(s)\\n\" + reward_guesser_title_hint(r1, r2))\n", " ax.legend()\n", " except NotImplementedError:\n", " print(\"Please finish Exercise 1 first!\")" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "## Think! 2: Examining the TD Error\n", "\n", "Run the cell below to plot the TD errors from our multi-reward environment. A new feature appears in this plot? What is it? Why does it happen?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "plot_tde_trace(TDE_multi)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "execution": {} }, "source": [ "[*Click for solution*](https://github.com/NeuromatchAcademy/course-content/tree/main//tutorials/W3D4_ReinforcementLearning/solutions/W3D4_Tutorial1_Solution_9fd42e7d.py)\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "---\n", "# Section 3: TD-learning with probabilistic rewards\n", "*Estimated timing to here from start of tutorial: 40 min*" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "## Think! 3: Probabilistic rewards\n", "\n", "\n", "In this environment, we'll return to delivering a single reward of ten units. However, it will be delivered intermittently: on 20 percent of trials, the CS will be shown but the agent will not receive the usual reward; the remaining 80% will proceed as usual.\n", "\n", "Run the cell below to simulate. Recall that earlier in the notebook, we saw that changing $\\alpha$ had little effect on learning in a deterministic environment. In the simulation below, $\\alpha$ is set to 1. What happens when the learning rate is set to such a large value in a probability reward setting? Does it seem like it will _ever_ converge? \n", "\n", "With a high learning rate, the value function tracks each observed reward, changing quickly whenever there is a reward prediction error. In a probabilistic scenario case, this behavior results in the value function changing too quickly and never stabilizing (converging). Using a low learning rate can stabilize the value function by smoothing out any variation in the reward signal, leading the value function to converge to the average reward over time. However, using a low learning rate can result in slow learning.\n", "\n", "To get the best of all worls, it is often useful to use a high learning rate early on (producing fast learning), and to reduce the learning rate gradually throughout learning (so that the value function converges to the average reward). This is sometimes called \"learning rate schedule\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Execute this cell to visualize the value function and TD-errors when `alpha=1`\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @markdown Execute this cell to visualize the value function and TD-errors when `alpha=1`\n", "np.random.set_state(rng_state) # Resynchronize everyone's notebooks\n", "n_trials = 20000\n", "try:\n", " env = ProbabilisticCC(n_steps=40, reward_magnitude=10, reward_time=10,\n", " p_reward=0.8)\n", " V_stochastic, TDE_stochastic = td_learner(env, n_trials*2, alpha=1)\n", " learning_summary_plot(V_stochastic, TDE_stochastic)\n", "except NotImplementedError:\n", " print(\"Please finish Exercise 1 first\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Execute this cell to visualize the value function and TD-errors when `alpha=0.2`\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @markdown Execute this cell to visualize the value function and TD-errors when `alpha=0.2`\n", "np.random.set_state(rng_state) # Resynchronize everyone's notebooks\n", "n_trials = 20000\n", "try:\n", " env = ProbabilisticCC(n_steps=40, reward_magnitude=10, reward_time=10,\n", " p_reward=0.8)\n", " V_stochastic, TDE_stochastic = td_learner(env, n_trials*2, alpha=.2)\n", " learning_summary_plot(V_stochastic, TDE_stochastic)\n", "except NotImplementedError:\n", " print(\"Please finish Exercise 1 first\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "execution": {} }, "source": [ "[*Click for solution*](https://github.com/NeuromatchAcademy/course-content/tree/main//tutorials/W3D4_ReinforcementLearning/solutions/W3D4_Tutorial1_Solution_6e708fa1.py)\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "---\n", "# Summary\n", "\n", "*Estimated timing of tutorial: 50 min*\n", "\n", "In this notebook, we have developed a simple TD Learner and examined how its state representations and reward prediction errors evolve during training. By manipulating its environment and parameters ($\\alpha$, $\\gamma$), you developed an intuition for how it behaves. \n", "\n", "This simple model closely resembles the behavior of subjects undergoing classical conditioning tasks and the dopamine neurons that may underlie that behavior. You may have implemented TD-reset or used the model to recreate a common experimental error. The update rule used here has been extensively studied for [more than 70 years](https://www.pnas.org/content/108/Supplement_3/15647) as a possible explanation for artificial and biological learning. \n", "\n", "However, you may have noticed that something is missing from this notebook. We carefully calculated the value of each state, but did not use it to actually do anything. Using values to plan _**Actions**_ is coming up next!" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "---\n", "# Bonus" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "## Bonus Think! 1: Removing the CS\n", "\n", "In Coding Exercise 1, you (should have) included a term that depends on the conditioned stimulus. Remove it and see what happens. Do you understand why?\n", "This phenomena often fools people attempting to train animals--beware!" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "execution": {} }, "source": [ "[*Click for solution*](https://github.com/NeuromatchAcademy/course-content/tree/main//tutorials/W3D4_ReinforcementLearning/solutions/W3D4_Tutorial1_Solution_f4fa447c.py)\n", "\n" ] } ], "metadata": { "colab": { "collapsed_sections": [], "include_colab_link": true, "name": "W3D4_Tutorial1", "provenance": [], "toc_visible": true }, "kernel": { "display_name": "Python 3", "language": "python", "name": "python3" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.13" } }, "nbformat": 4, "nbformat_minor": 0 }