\n",
"## Click here for text recap of relevant part of video\n",
"\n",
"We will explore neural activity recorded in mice while they view oriented grating stimuli on a screen in front of them. The neural activity was recorded using a technique called two-photon calcium imaging, which allows us to record many thousands of neurons simultaneously: the neurons light up when they fire. We then convert this imaging data into a matrix of neural responses by presented stimulus. For the purposes of this tutorial, we bin the stimulus orientations into bins of 1 degree and compute each neuron’s tuning curve. We will use the responses of all neurons in a single bin to try to predict which stimulus was shown. So we are going to be using the responses of 24,000 neurons to try to predict 360 different possible stimulus conditions, one for each degree of orientation - which means we're in the regime of big data!\n",
"\n",
"\n",
"\n",
"In the next cell, we have provided code to load the data and plot the matrix of neural responses.\n",
"\n",
"Next to it, we plot the tuning curves of three randomly selected neurons. These tuning curves are the averaged response of each neuron to oriented stimuli within 1$^\\circ$, and since there are 360$^\\circ$ in total, we have 360 responses.\n",
"\n",
"In the recording, there were actually thousands of stimuli shown, but in practice we often create these tuning curves because we want to visualize averaged responses with respect to the variable we varied in the experiment, in this case stimulus orientation."
]
},
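The binning step described above can be sketched in a few lines. This is a rough illustration with fake data (the array names and sizes here are hypothetical, not the output of the tutorial's actual `load_data` helper):

```python
import numpy as np

# Hypothetical raw data: one orientation (in degrees) per trial, plus the
# responses of a handful of neurons on each trial. Not the tutorial's arrays.
rng = np.random.default_rng(0)
n_trials, n_neurons = 1000, 5
orientations = rng.uniform(0, 360, n_trials)
responses = rng.normal(size=(n_trials, n_neurons))

# Bin trials into 1-degree bins and average responses within each bin
bin_ids = np.floor(orientations).astype(int)        # bin 0 covers [0, 1), etc.
tuning = np.zeros((360, n_neurons))
for b in range(360):
    in_bin = bin_ids == b
    if in_bin.any():
        tuning[b] = responses[in_bin].mean(axis=0)  # tuning curve value at bin b

print(tuning.shape)  # one row per 1-degree orientation bin
```

Each column of `tuning` is then one neuron's tuning curve: its averaged response as a function of orientation.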
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Execute this cell to load and visualize data\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @markdown Execute this cell to load and visualize data\n",
"\n",
"# Load data\n",
"resp_all, stimuli_all = load_data(fname) # argument to this function specifies bin width\n",
"n_stimuli, n_neurons = resp_all.shape\n",
"\n",
"fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(2 * 6, 5))\n",
"\n",
"# Visualize data matrix\n",
"plot_data_matrix(resp_all[:, :100].T, ax1) # plot responses of first 100 neurons\n",
"ax1.set_xlabel('stimulus')\n",
"ax1.set_ylabel('neuron')\n",
"\n",
"# Plot tuning curves of three random neurons\n",
"ineurons = np.random.choice(n_neurons, 3, replace=False) # pick three random neurons\n",
"ax2.plot(stimuli_all, resp_all[:, ineurons])\n",
"ax2.set_xlabel('stimulus orientation ($^\\circ$)')\n",
"ax2.set_ylabel('neural response')\n",
"ax2.set_xticks(np.linspace(0, 360, 5))\n",
"\n",
"fig.suptitle(f'{n_neurons} neurons in response to {n_stimuli} stimuli')\n",
"fig.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"We will split our data into a training set and test set. In particular, we will have a training set of orientations (`stimuli_train`) and the corresponding responses (`resp_train`). Our testing set will have held-out orientations (`stimuli_test`) and the corresponding responses (`resp_test`)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Execute this cell to split into training and test sets\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @markdown Execute this cell to split into training and test sets\n",
"\n",
"# Set random seeds for reproducibility\n",
"np.random.seed(4)\n",
"torch.manual_seed(4)\n",
"\n",
"# Split data into training set and testing set\n",
"n_train = int(0.6 * n_stimuli) # use 60% of all data for training set\n",
"ishuffle = torch.randperm(n_stimuli)\n",
"itrain = ishuffle[:n_train] # indices of data samples to include in training set\n",
"itest = ishuffle[n_train:] # indices of data samples to include in testing set\n",
"stimuli_test = stimuli_all[itest]\n",
"resp_test = resp_all[itest]\n",
"stimuli_train = stimuli_all[itrain]\n",
"resp_train = resp_all[itrain]"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Section 2: Deep feed-forward networks in *pytorch*\n",
"\n",
"\n",
"\n",
"## Click here for text recap of relevant part of video\n",
"\n",
"We can build a linear network with no hidden layers, where the stimulus prediction $y$ is the product of a weight matrix $\\mathbf{W}^{out}$ and the neural responses $\\mathbf{r}$, plus a bias term $\\mathbf{b}$. When fitting a linear model like this, we minimize the squared error between the predicted stimulus $y$ and the true stimulus $\\tilde{y}$; this is the “loss function”.\n",
"\n",
"\\begin{align}\n",
"L &= (y - \\tilde{y})^2 \\\\\n",
"&= ((\\mathbf{W}^{out} \\mathbf{r} + \\mathbf{b}) - \\tilde{y})^2\n",
"\\end{align}\n",
"\n",
"The solution that minimizes this loss function in a linear model can be found in closed form; you learned how to solve this linear regression problem in the first week. If we use a simple linear model on this data, we are able to predict the stimulus to within 2-3 degrees. Let’s see if we can predict the stimulus even better with a deep network.\n",
"\n",
"Let’s add a hidden layer with $M$ units to this linear model, where now the output $y$ is as follows:\n",
"\\begin{align}\n",
"\\mathbf{h} &= \\mathbf{W}^{in} \\mathbf{r} + \\mathbf{b}^{in}, && [\\mathbf{W}^{in}: M \\times N \\, , \\, \\mathbf{b}^{in}: M \\times 1] \\, , \\\\\n",
"y &= \\mathbf{W}^{out} \\mathbf{h} + \\mathbf{b}^{out}, && [\\mathbf{W}^{out}: 1 \\times M\\, , \\, \\mathbf{b}^{out}: 1 \\times 1] \\, ,\n",
"\\end{align}\n",
"\n",
"\n",
"Note that this linear network with one hidden layer, where the number of hidden units $M$ is less than the number of inputs $N$, is equivalent to performing [reduced rank regression](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3444519/), a technique that is useful for regularizing your regression model.\n",
"\n",
"Adding this hidden layer means the model now has a depth of $1$. The number of units $M$ is termed the width of the network. Increasing the depth and the width of the network can increase the expressivity of the model -- in other words how well it can fit complex non-linear functions. Many state-of-the-art models now have close to 100 layers! But for now let’s start with a model with a depth of $1$ and see if we can improve our prediction of the stimulus. See [bonus section 1](#b1) for a deeper discussion of what this choice entails, and when one might want to use deeper/shallower and wider/narrower architectures.\n",
"\n",
"The $M$-dimensional vector $\\mathbf{h}$ denotes the activations of the **hidden layer** of the network. The **parameters** of the network, which we will later optimize with gradient descent, include all the weights and biases $\\mathbf{W}^{in}, \\mathbf{b}^{in}, \\mathbf{W}^{out}, \\mathbf{b}^{out}$. The **weights** are matrices of size (# of outputs, # of inputs) that are multiplied by the input of each layer, like the regression coefficients in linear regression. The **biases** are vectors of size (# of outputs, 1), like the intercept term in linear regression (see W1D3 for more details on multivariate linear regression).\n",
"\n",
"\n",
"\n",
"We'll now build a simple deep neural network that takes as input a vector of neural responses and outputs a single number representing the decoded stimulus orientation.\n",
"\n",
"Let $\\mathbf{r}^{(n)} = \\begin{bmatrix} r_1^{(n)} & r_2^{(n)} & \\ldots & r_N^{(n)} \\end{bmatrix}^\\top$ denote the vector of neural responses (of neurons $1, \\ldots, N$) to the $n$th stimulus. The network we will use is described by the following set of equations:\n",
"\n",
"\\begin{align}\n",
"\\mathbf{h}^{(n)} &= \\mathbf{W}^{in} \\mathbf{r}^{(n)} + \\mathbf{b}^{in}, && [\\mathbf{W}^{in}: M \\times N \\, , \\, \\mathbf{b}^{in}: M \\times 1] \\, , \\\\\n",
"y^{(n)} &= \\mathbf{W}^{out} \\mathbf{h}^{(n)} + \\mathbf{b}^{out}, && [\\mathbf{W}^{out}: 1 \\times M \\, , \\, \\mathbf{b}^{out}: 1 \\times 1] \\, ,\n",
"\\end{align}\n",
"\n",
"where $y^{(n)}$ denotes the scalar output of the network: the decoded orientation of the $n$-th stimulus."
]
},
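To make the matrix shapes in these equations concrete, here is a minimal numpy sketch with toy sizes ($N=4$ neurons, $M=3$ hidden units — illustrative values, not the dataset's real dimensions):

```python
import numpy as np

N, M = 4, 3                       # toy sizes: N input neurons, M hidden units
rng = np.random.default_rng(0)
W_in = rng.normal(size=(M, N))    # input-layer weights, M x N
b_in = rng.normal(size=(M, 1))    # input-layer bias, M x 1
W_out = rng.normal(size=(1, M))   # output-layer weights, 1 x M
b_out = rng.normal(size=(1, 1))   # output-layer bias, 1 x 1
r = rng.normal(size=(N, 1))       # one fake response vector

h = W_in @ r + b_in               # hidden layer: (M x N)(N x 1) + (M x 1) -> M x 1
y = W_out @ h + b_out             # output: (1 x M)(M x 1) + (1 x 1) -> 1 x 1
print(h.shape, y.shape)
```

The shapes chain exactly as the bracketed dimensions in the equations indicate.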
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"### Section 2.1: Introduction to PyTorch\n",
"\n",
"*Estimated timing to here from start of tutorial: 16 min*\n",
"\n",
"Here, we'll use the **PyTorch** package to build, run, and train deep networks of this form in Python. PyTorch uses a data type called a `torch.Tensor`. Tensors are effectively just like `numpy` arrays, except that they have some important attributes and methods needed for automatic differentiation (to be discussed below). They also come with infrastructure for easily storing and computing with them on GPUs, a capability we won't touch on here but which can be really useful in practice.\n",
"\n",
"\n",
"## Click here for text recap of relevant part of video\n",
"\n",
"First we import the PyTorch library, called `torch`, and its neural network module `nn`. Next we create a class for the deep network called DeepNet. A class has functions, which are called methods. A class in Python is initialized using a method called `__init__`. In this case, the init method takes two inputs (other than the `self` input, which represents the class itself): `n_inputs` and `n_hidden`. In our case, `n_inputs` is the number of neurons we are using to do the prediction, and `n_hidden` is the number of hidden units. We first call the super function to invoke `nn.Module`’s init function. Next we add the hidden layer `in_layer` as an attribute of the class. It is a linear layer (`nn.Linear`) of size `n_inputs` by `n_hidden`. Then we add a second linear layer `out_layer` of size `n_hidden` by `1`, because we are predicting one output - the orientation of the stimulus. PyTorch will initialize all weights and biases randomly.\n",
"\n",
"Note the number of hidden units `n_hidden` is a parameter that we are free to vary in deciding how to build our network. See [Bonus Section 1](#b1) for a discussion of how this architectural choice affects the computations the network can perform.\n",
"\n",
"Next we add another method to the class called `forward`. This is the method that runs when you call the class as a function. It takes as input `r` which is the neural responses. Then `r` is sent through the linear layers `in_layer` and `out_layer` and returns our prediction `y`. Let’s create an instantiation of this class called `net` with 200 hidden units with `net = DeepNet(n_neurons, 200)`. Now we can run the neural response through the network to predict the stimulus (`net(r)`); running the “net” this way calls the forward method.\n",
"\n",
"\n",
"\n",
"\n",
"The next cell contains code for building the deep network we defined above and in the video using the `nn.Module` base class for deep neural network models (documentation [here](https://pytorch.org/docs/stable/generated/torch.nn.Module.html?highlight=nn%20module#torch.nn.Module))."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "both",
"execution": {}
},
"outputs": [],
"source": [
"class DeepNet(nn.Module):\n",
" \"\"\"Deep Network with one hidden layer\n",
"\n",
" Args:\n",
" n_inputs (int): number of input units\n",
" n_hidden (int): number of units in hidden layer\n",
"\n",
" Attributes:\n",
" in_layer (nn.Linear): weights and biases of input layer\n",
" out_layer (nn.Linear): weights and biases of output layer\n",
"\n",
" \"\"\"\n",
"\n",
" def __init__(self, n_inputs, n_hidden):\n",
" super().__init__() # needed to invoke the properties of the parent class nn.Module\n",
" self.in_layer = nn.Linear(n_inputs, n_hidden) # neural activity --> hidden units\n",
" self.out_layer = nn.Linear(n_hidden, 1) # hidden units --> output\n",
"\n",
" def forward(self, r):\n",
" \"\"\"Decode stimulus orientation from neural responses\n",
"\n",
" Args:\n",
" r (torch.Tensor): vector of neural responses to decode, must be of\n",
" length n_inputs. Can also be a tensor of shape n_stimuli x n_inputs,\n",
" containing n_stimuli vectors of neural responses\n",
"\n",
" Returns:\n",
" torch.Tensor: network outputs for each input provided in r. If\n",
" r is a vector, then y is a 1D tensor of length 1. If r is a 2D\n",
" tensor then y is a 2D tensor of shape n_stimuli x 1.\n",
"\n",
" \"\"\"\n",
" h = self.in_layer(r) # hidden representation\n",
" y = self.out_layer(h)\n",
" return y"
]
},
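As a usage sketch of the class above (the class body is repeated here, with toy sizes and random inputs, so the snippet runs standalone; the real tutorial passes `n_neurons` inputs):

```python
import torch
from torch import nn

class DeepNet(nn.Module):
    """Same two-layer linear network as defined above, repeated for a standalone demo."""
    def __init__(self, n_inputs, n_hidden):
        super().__init__()
        self.in_layer = nn.Linear(n_inputs, n_hidden)   # neural activity --> hidden units
        self.out_layer = nn.Linear(n_hidden, 1)         # hidden units --> output

    def forward(self, r):
        return self.out_layer(self.in_layer(r))

net = DeepNet(n_inputs=50, n_hidden=10)   # toy sizes, not the dataset's
r = torch.randn(8, 50)                    # batch of 8 fake response vectors
y = net(r)                                # calling net(r) invokes net.forward(r)
print(y.shape)                            # torch.Size([8, 1])
```

Note that a batch dimension passes straight through: `nn.Linear` applies the same weights to every row of `r`.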
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Section 2.2: Activation functions\n",
"\n",
"*Estimated timing to here from start of tutorial: 25 min*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Video 2: Nonlinear activation functions\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"# @title Video 2: Nonlinear activation functions\n",
"from ipywidgets import widgets\n",
"from IPython.display import YouTubeVideo\n",
"from IPython.display import IFrame\n",
"from IPython.display import display\n",
"\n",
"\n",
"class PlayVideo(IFrame):\n",
" def __init__(self, id, source, page=1, width=400, height=300, **kwargs):\n",
" self.id = id\n",
" if source == 'Bilibili':\n",
" src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'\n",
" elif source == 'Osf':\n",
" src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'\n",
" super(PlayVideo, self).__init__(src, width, height, **kwargs)\n",
"\n",
"\n",
"def display_videos(video_ids, W=400, H=300, fs=1):\n",
" tab_contents = []\n",
" for i, video_id in enumerate(video_ids):\n",
" out = widgets.Output()\n",
" with out:\n",
" if video_ids[i][0] == 'Youtube':\n",
" video = YouTubeVideo(id=video_ids[i][1], width=W,\n",
" height=H, fs=fs, rel=0)\n",
" print(f'Video available at https://youtube.com/watch?v={video.id}')\n",
" else:\n",
" video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,\n",
" height=H, fs=fs, autoplay=False)\n",
" if video_ids[i][0] == 'Bilibili':\n",
" print(f'Video available at https://www.bilibili.com/video/{video.id}')\n",
" elif video_ids[i][0] == 'Osf':\n",
" print(f'Video available at https://osf.io/{video.id}')\n",
" display(video)\n",
" tab_contents.append(out)\n",
" return tab_contents\n",
"\n",
"\n",
"video_ids = [('Youtube', 'JAdukDCQALA'), ('Bilibili', 'BV1m5411h7V5')]\n",
"tab_contents = display_videos(video_ids, W=730, H=410)\n",
"tabs = widgets.Tab()\n",
"tabs.children = tab_contents\n",
"for i in range(len(tab_contents)):\n",
" tabs.set_title(i, video_ids[i][0])\n",
"display(tabs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Nonlinear_activation_functions_Video\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"This video covers adding a nonlinear activation function, specifically a Rectified Linear Unit (ReLU), to the linear network.\n",
"\n",
"\n",
"## Click here for text recap of video\n",
"\n",
"Note that the deep network we constructed above comprises solely **linear** operations on each layer: each layer is just a weighted sum of the elements in the previous layer. It turns out that linear hidden layers like this aren't particularly useful, since a sequence of linear transformations is equivalent to a single linear transformation. We can see this from the above equations by plugging the first one into the second one to obtain\n",
"\n",
"\\begin{equation}\n",
"y^{(n)} = \\mathbf{W}^{out} \\left( \\mathbf{W}^{in} \\mathbf{r}^{(n)} + \\mathbf{b}^{in} \\right) + \\mathbf{b}^{out} = \\mathbf{W}^{out}\\mathbf{W}^{in} \\mathbf{r}^{(n)} + \\left( \\mathbf{W}^{out}\\mathbf{b}^{in} + \\mathbf{b}^{out} \\right)\n",
"\\end{equation}\n",
"\n",
"In other words, the output is still just a weighted sum of elements in the input -- the hidden layer has done nothing to change this.\n",
"\n",
"To extend the set of computable input/output transformations to more than just weighted sums, we'll incorporate a **non-linear activation function** in the hidden units. This is done by simply modifying the equation for the hidden layer activations to be\n",
"\n",
"\\begin{equation}\n",
"\\mathbf{h}^{(n)} = \\phi(\\mathbf{W}^{in} \\mathbf{r}^{(n)} + \\mathbf{b}^{in})\n",
"\\end{equation}\n",
"\n",
"where $\\phi$ is referred to as the activation function. Using a non-linear activation function will ensure that the hidden layer performs a non-linear transformation of the input, which will make our network much more powerful (or *expressive*, see [Bonus Section 1](#b1)). In practice, deep networks *always* use non-linear activation functions.\n",
"\n",
"The most common non-linearity used is the rectified linear unit (ReLU), $\\phi(x) = \\max(0, x)$. Early in neural network development, researchers experimented with other non-linearities such as sigmoid and tanh functions, but in the end they found that ReLU activation functions worked best. ReLU works well because the gradient can back-propagate through the network whenever the input is positive: the gradient is 1 for all values of $x$ greater than 0. If you instead use a saturating non-linearity, the gradients will be very small in the saturating regimes, reducing the effective computing regime of the unit.\n",
"\n",
"\n"
]
},
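The claim that two stacked linear layers collapse into a single affine map can be checked numerically; a quick numpy sketch with toy sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 5, 3
W_in, b_in = rng.normal(size=(M, N)), rng.normal(size=(M, 1))
W_out, b_out = rng.normal(size=(1, M)), rng.normal(size=(1, 1))
r = rng.normal(size=(N, 1))

y_two_layers = W_out @ (W_in @ r + b_in) + b_out   # hidden layer, then output
W_eff = W_out @ W_in                               # single effective weight matrix
b_eff = W_out @ b_in + b_out                       # single effective bias
y_one_layer = W_eff @ r + b_eff

print(np.allclose(y_two_layers, y_one_layer))      # the two computations agree
```

This is exactly the identity in the equation above: the hidden layer adds no expressive power without a non-linearity.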
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"#### Coding Exercise 2.2: Nonlinear Activations\n",
"\n",
"Create a new class `DeepNetReLU` by modifying our above deep network model to add a **non-linear activation** function $\\phi$:\n",
"\\begin{equation}\n",
"\\mathbf{h}^{(n)} = \\phi(\\mathbf{W}^{in} \\mathbf{r}^{(n)} + \\mathbf{b}^{in})\n",
"\\end{equation}\n",
"\n",
"We'll use the linear rectification function:\n",
"\n",
"\\begin{equation}\n",
"\\phi(x) =\n",
"\\begin{cases}\n",
"x & \\text{if } x > 0 \\\\\n",
"0 & \\text{else}\n",
"\\end{cases}\n",
"\\end{equation}\n",
"\n",
"which can be implemented in PyTorch using `torch.relu()`. Hidden layers with this activation function are typically referred to as \"**Re**ctified **L**inear **U**nits\", or **ReLU**'s.\n",
"\n",
"Initialize this network with 200 hidden units and run on an example stimulus.\n",
"\n",
"**Hint**: you only need to modify the `forward()` method of the above `DeepNet()` class to include `torch.relu()`.\n",
"\n",
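As a quick sanity check of the rectification (positive values pass through unchanged, negative values are clipped to zero):

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
print(torch.relu(x))          # negative entries become 0, positive entries unchanged
print(torch.clamp(x, min=0))  # an equivalent way to write the same operation
```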
"\n",
"\n",
"We then initialize and run this network. We use it to decode stimulus orientation (true stimulus given by `ori`) from a vector of neural responses `r` to the very first stimulus. Note that when the initialized network class is called as a function on an input (e.g., `net(r)`), its `.forward()` method is called. This is a special property of the `nn.Module` class.\n",
"\n",
"Note that the decoded orientations at this point will be nonsense, since the network has been initialized with random weights. Below, we'll learn how to optimize these weights for good stimulus decoding."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"class DeepNetReLU(nn.Module):\n",
" \"\"\" network with a single hidden layer h with a ReLU \"\"\"\n",
"\n",
" def __init__(self, n_inputs, n_hidden):\n",
" super().__init__() # needed to invoke the properties of the parent class nn.Module\n",
" self.in_layer = nn.Linear(n_inputs, n_hidden) # neural activity --> hidden units\n",
" self.out_layer = nn.Linear(n_hidden, 1) # hidden units --> output\n",
"\n",
" def forward(self, r):\n",
"\n",
" ############################################################################\n",
" ## TO DO for students: write code for computing network output using a\n",
" ## rectified linear activation function for the hidden units\n",
" # Fill out function and remove\n",
" raise NotImplementedError(\"Student exercise: complete DeepNetReLU forward\")\n",
" ############################################################################\n",
"\n",
" h = ... # h is size (n_stimuli, n_hidden)\n",
" y = ... # y is size (n_stimuli, 1)\n",
"\n",
"\n",
" return y\n",
"\n",
"\n",
"# Set random seeds for reproducibility\n",
"np.random.seed(1)\n",
"torch.manual_seed(1)\n",
"\n",
"# Initialize a deep network with M=200 hidden units\n",
"net = DeepNetReLU(n_neurons, 200)\n",
"\n",
"# Get neural responses (r) and orientation (ori) for one stimulus in dataset\n",
"r, ori = get_data(1, resp_train, stimuli_train) # using helper function get_data\n",
"\n",
"# Decode orientation from these neural responses using initialized network\n",
"out = net(r) # compute output from network, equivalent to net.forward(r)\n",
"\n",
"print(f'decoded orientation: {out.item():.2f} degrees')\n",
"print(f'true orientation: {ori.item():.2f} degrees')"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"The decoded orientation is 0.17 $^{\\circ}$ while the true orientation is 139.00 $^{\\circ}$. You should see:\n",
"```\n",
"decoded orientation: 0.17 degrees\n",
"true orientation: 139.00 degrees\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"[*Click for solution*](https://github.com/NeuromatchAcademy/course-content/tree/main/tutorials/W1D5_DeepLearning/solutions/W1D5_Tutorial1_Solution_6bfd6123.py)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Nonlinear_activations_Exercise\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Section 3: Loss functions and gradient descent\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Video 3: Loss functions & gradient descent\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"# @title Video 3: Loss functions & gradient descent\n",
"from ipywidgets import widgets\n",
"from IPython.display import YouTubeVideo\n",
"from IPython.display import IFrame\n",
"from IPython.display import display\n",
"\n",
"\n",
"class PlayVideo(IFrame):\n",
" def __init__(self, id, source, page=1, width=400, height=300, **kwargs):\n",
" self.id = id\n",
" if source == 'Bilibili':\n",
" src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'\n",
" elif source == 'Osf':\n",
" src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'\n",
" super(PlayVideo, self).__init__(src, width, height, **kwargs)\n",
"\n",
"\n",
"def display_videos(video_ids, W=400, H=300, fs=1):\n",
" tab_contents = []\n",
" for i, video_id in enumerate(video_ids):\n",
" out = widgets.Output()\n",
" with out:\n",
" if video_ids[i][0] == 'Youtube':\n",
" video = YouTubeVideo(id=video_ids[i][1], width=W,\n",
" height=H, fs=fs, rel=0)\n",
" print(f'Video available at https://youtube.com/watch?v={video.id}')\n",
" else:\n",
" video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,\n",
" height=H, fs=fs, autoplay=False)\n",
" if video_ids[i][0] == 'Bilibili':\n",
" print(f'Video available at https://www.bilibili.com/video/{video.id}')\n",
" elif video_ids[i][0] == 'Osf':\n",
" print(f'Video available at https://osf.io/{video.id}')\n",
" display(video)\n",
" tab_contents.append(out)\n",
" return tab_contents\n",
"\n",
"\n",
"video_ids = [('Youtube', 'aEtKpzEuviw'), ('Bilibili', 'BV19k4y1271n')]\n",
"tab_contents = display_videos(video_ids, W=730, H=410)\n",
"tabs = widgets.Tab()\n",
"tabs.children = tab_contents\n",
"for i in range(len(tab_contents)):\n",
" tabs.set_title(i, video_ids[i][0])\n",
"display(tabs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Loss_functions_and_gradient_descent_Video\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"This video covers loss functions, gradient descent, and how to implement these in PyTorch.\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"### Section 3.1: Loss functions\n",
"\n",
"*Estimated timing to here from start of tutorial: 40 min*\n",
"\n",
"Because the weights of the network are currently randomly chosen, the outputs of the network are nonsense: the decoded stimulus orientation is nowhere close to the true stimulus orientation. We'll shortly write some code to change these weights so that the network does a better job of decoding.\n",
"\n",
"But to do so, we first need to define what we mean by \"better\". One simple way of defining this is to use the squared error\n",
"\n",
"\\begin{equation}\n",
"L = (y - \\tilde{y})^2\n",
"\\end{equation}\n",
"\n",
"where $y$ is the network output and $\\tilde{y}$ is the true stimulus orientation. When the decoded stimulus orientation is far from the true stimulus orientation, $L$ will be large. We thus refer to $L$ as the **loss function**, as it quantifies how *bad* the network is at decoding stimulus orientation.\n",
"\n",
"\n",
"## Click here for text recap of relevant part of video\n",
"\n",
"First we run the neural responses through the network `net` to get the output `out`. Then we declare our loss function; we will use the built-in `nn.MSELoss` function for this purpose: `loss_fn = nn.MSELoss()`. This loss function takes two inputs, the network output `out` and the true stimulus orientations `ori`, and finds the mean squared error: `loss = loss_fn(out, ori)`. Specifically, it takes as arguments a **batch** of network outputs $y_1, y_2, \\ldots, y_P$ and corresponding target outputs $\\tilde{y}_1, \\tilde{y}_2, \\ldots, \\tilde{y}_P$, and computes the **mean squared error (MSE)**\n",
"\n",
"\\begin{equation}\n",
"L = \\frac{1}{P}\\sum_{n=1}^P \\left(y^{(n)} - \\tilde{y}^{(n)}\\right)^2\n",
"\\end{equation}\n",
"\n",
"where $P$ is the number of different stimuli in a batch, called the *batch size*."
]
},
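The built-in loss can be sanity-checked against the MSE formula above with toy tensors (the numbers here are made up for illustration):

```python
import torch
from torch import nn

out = torch.tensor([[10.0], [20.0], [30.0]])   # toy network outputs, batch of P = 3
ori = torch.tensor([[12.0], [18.0], [33.0]])   # toy target orientations

loss_fn = nn.MSELoss()                         # default reduction is the mean
loss = loss_fn(out, ori)

manual = ((out - ori) ** 2).mean()             # (1/P) * sum of squared errors
print(loss.item(), manual.item())              # the two values agree
```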
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"**Computing MSE**\n",
"\n",
"\n",
"Evaluate the mean squared error for a deep network with $M=10$ rectified linear units, on the decoded orientations from neural responses to 20 random stimuli."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# Set random seeds for reproducibility\n",
"np.random.seed(1)\n",
"torch.manual_seed(1)\n",
"\n",
"# Initialize a deep network with M=10 hidden units\n",
"net = DeepNetReLU(n_neurons, 10)\n",
"\n",
"# Get neural responses to first 20 stimuli in the data set\n",
"r, ori = get_data(20, resp_train, stimuli_train)\n",
"\n",
"# Decode orientation from these neural responses\n",
"out = net(r)\n",
"\n",
"# Initialize PyTorch mean squared error loss function (Hint: look at nn.MSELoss)\n",
"loss_fn = nn.MSELoss()\n",
"\n",
"# Evaluate mean squared error\n",
"loss = loss_fn(out, ori)\n",
"\n",
"print(f'mean squared error: {loss:.2f}')"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"You should see a mean squared error of 42949.14."
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"### Section 3.2: Optimization with gradient descent\n",
"*Estimated timing to here from start of tutorial: 50 min*\n",
"\n",
"\n",
"## Click here for text recap of relevant part of video\n",
"\n",
"\n",
"Next we minimize this loss function using gradient descent. In **gradient descent** we compute the gradient of the loss function with respect to each parameter (all $W$’s and $b$’s). We then update the parameters by subtracting the learning rate times the gradient.\n",
"\n",
"Let’s visualize this loss function $L$ with respect to a single weight $w$. If the gradient is positive (the slope $\\frac{dL}{dw} > 0$), then we want to move $w$ in the opposite, negative direction. So we update $w$ in the negative direction on each iteration. Once the iterations complete, the weight will ideally sit at a value that minimizes the loss function.\n",
"\n",
"In reality these loss functions are not convex like this one, and they depend on hundreds of thousands of parameters. There are tricks to help navigate this rocky loss landscape, such as adding momentum or changing the optimizer, but we won’t have time to get into that today. There are also ways to change the architecture of the network to improve optimization, such as including skip connections. Skip connections are used in residual networks and allow for the optimization of networks with many layers.\n",
"\n",
"\n"
]
},
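The update rule described in the recap can be sketched for a one-parameter convex loss, $L(w) = (w - 3)^2$ (a toy example; the learning rate and iteration count are arbitrary illustrative choices):

```python
# Gradient descent on L(w) = (w - 3)^2, whose gradient is dL/dw = 2 * (w - 3)
w = 0.0            # initial weight
lr = 0.1           # learning rate
for _ in range(100):
    grad = 2 * (w - 3)   # slope of the loss at the current w
    w -= lr * grad       # step opposite the gradient
print(w)                 # approaches 3, the minimum of the loss
```

Each step shrinks the distance to the minimum by a constant factor here, which is why the iterates converge.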
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Execute this cell to view gradient descent gif\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @markdown Execute this cell to view gradient descent gif\n",
"\n",
"from IPython.display import Image\n",
"Image(url='https://github.com/NeuromatchAcademy/course-content/blob/main/tutorials/static/grad_descent.gif?raw=true')"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"We'll use the **gradient descent (GD)** algorithm to modify our weights to reduce the loss function, which consists of iterating three steps.\n",
"\n",
"1. **Evaluate the loss** on the training data,\n",
"```python\n",
"out = net(train_data)\n",
"loss = loss_fn(out, train_labels)\n",
"```\n",
"where `train_data` are the network inputs in the training data (in our case, neural responses), and `train_labels` are the target outputs for each input (in our case, true stimulus orientations).\n",
"2. **Compute the gradient of the loss** with respect to each of the network weights. In PyTorch, we can do this with the `.backward()` method of the loss `loss`. Note that the gradients of each parameter need to be cleared before calling `.backward()`, or else PyTorch will try to accumulate gradients across iterations. This can again be done using built-in optimizers via the method `.zero_grad()`. Putting these together we have\n",
"```python\n",
"optimizer.zero_grad()\n",
"loss.backward()\n",
"```\n",
"3. **Update the network weights** by descending the gradient. In Pytorch, we can do this using built-in optimizers. We'll use the `optim.SGD` optimizer (documentation [here](https://pytorch.org/docs/stable/optim.html#torch.optim.SGD)) which updates parameters along the negative gradient, scaled by a learning rate. To initialize this optimizer, we have to tell it\n",
" * which parameters to update, and\n",
" * what learning rate to use\n",
"\n",
" For example, to optimize *all* the parameters of a network `net` using a learning rate of .001, the optimizer would be initialized as follows\n",
" ```python\n",
" optimizer = optim.SGD(net.parameters(), lr=.001)\n",
" ```\n",
" where `.parameters()` is a method of the `nn.Module` class that returns a [Python generator object](https://wiki.python.org/moin/Generators) over all the parameters of that `nn.Module` class (in our case, $\\mathbf{W}^{in}, \\mathbf{b}^{in}, \\mathbf{W}^{out}, \\mathbf{b}^{out}$).\n",
"\n",
" After computing all the parameter gradients in step 2, we can then update each of these parameters using the `.step()` method of this optimizer,\n",
" ```python\n",
" optimizer.step()\n",
" ```\n",
"In the next exercise, we'll give you a code skeleton for implementing the GD algorithm. Your job will be to fill in the blanks.\n",
"\n",
"For the mathematical details of the GD algorithm, see [bonus section 2.1](#b21).\n",
"\n",
"In this case we are using gradient descent (not *stochastic* gradient descent) because we are computing the gradient over ALL training data at once. Normally there is too much training data to do this in practice, and for instance the neural responses may be divided into sets of 20 stimuli. An **epoch** in deep learning is defined as the forward and backward pass of all the training data through the network. We will run the forward and backward pass of the network here for 20 **epochs**, in practice training may require thousands of epochs.\n",
"\n",
"See [bonus section 2.2](#b22) for a more detailed discussion of stochastic gradient descent."
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"#### Coding Exercise 3.2: Gradient descent in PyTorch\n",
"\n",
"Complete the function `train()` that uses the gradient descent algorithm to optimize the weights of a given network. This function takes as input arguments\n",
"* `net`: the PyTorch network whose weights to optimize\n",
"* `loss_fn`: the PyTorch loss function to use to evaluate the loss\n",
"* `train_data`: the training data to evaluate the loss on (i.e., neural responses to decode)\n",
"* `train_labels`: the target outputs for each data point in `train_data` (i.e., true stimulus orientations)\n",
"\n",
"We will then train a neural network on our data and plot the loss (mean squared error) over time. When we run this function, behind the scenes PyTorch is actually changing the parameters inside this network to make the network better at decoding, so its weights will now be different than they were at initialization."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"def train(net, loss_fn, train_data, train_labels,\n",
" n_epochs=50, learning_rate=1e-4):\n",
" \"\"\"Run gradient descent to optimize parameters of a given network\n",
"\n",
" Args:\n",
" net (nn.Module): PyTorch network whose parameters to optimize\n",
" loss_fn: built-in PyTorch loss function to minimize\n",
" train_data (torch.Tensor): n_train x n_neurons tensor with neural\n",
" responses to train on\n",
" train_labels (torch.Tensor): n_train x 1 tensor with orientations of the\n",
" stimuli corresponding to each row of train_data\n",
" n_epochs (int, optional): number of epochs of gradient descent to run\n",
" learning_rate (float, optional): learning rate to use for gradient descent\n",
"\n",
" Returns:\n",
" (list): training loss over iterations\n",
"\n",
" \"\"\"\n",
"\n",
" # Initialize PyTorch SGD optimizer\n",
" optimizer = optim.SGD(net.parameters(), lr=learning_rate)\n",
"\n",
" # Placeholder to save the loss at each iteration\n",
" train_loss = []\n",
"\n",
" # Loop over epochs\n",
" for i in range(n_epochs):\n",
"\n",
" ######################################################################\n",
" ## TO DO for students: fill in missing code for GD iteration\n",
" raise NotImplementedError(\"Student exercise: write code for GD iterations\")\n",
" ######################################################################\n",
"\n",
" # compute network output from inputs in train_data\n",
" out = ... # compute network output from inputs in train_data\n",
"\n",
" # evaluate loss function\n",
" loss = loss_fn(out, train_labels)\n",
"\n",
" # Clear previous gradients\n",
" ...\n",
"\n",
" # Compute gradients\n",
" ...\n",
"\n",
" # Update weights\n",
" ...\n",
"\n",
" # Store current value of loss\n",
" train_loss.append(loss.item()) # .item() needed to transform the tensor output of loss_fn to a scalar\n",
"\n",
" # Track progress\n",
" if (i + 1) % (n_epochs // 5) == 0:\n",
" print(f'iteration {i + 1}/{n_epochs} | loss: {loss.item():.3f}')\n",
"\n",
" return train_loss\n",
"\n",
"\n",
"# Set random seeds for reproducibility\n",
"np.random.seed(1)\n",
"torch.manual_seed(1)\n",
"\n",
"# Initialize network with 10 hidden units\n",
"net = DeepNetReLU(n_neurons, 10)\n",
"\n",
"# Initialize built-in PyTorch MSE loss function\n",
"loss_fn = nn.MSELoss()\n",
"\n",
"# Run gradient descent on data\n",
"train_loss = train(net, loss_fn, resp_train, stimuli_train)\n",
"\n",
"# Plot the training loss over iterations of GD\n",
"plot_train_loss(train_loss)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"[*Click for solution*](https://github.com/NeuromatchAcademy/course-content/tree/main/tutorials/W1D5_DeepLearning/solutions/W1D5_Tutorial1_Solution_1ca16188.py)\n",
"\n",
"*Example output:*\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Gradient_descent_in_Pytorch_Exercise\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"We can further improve our model - please see the **Bonus** part of the Tutorial when you have time to dive deeper into this model by evaluating and improving its performance by visualizing the weights, looking at performance on test data, switching to a new loss function and adding regularization."
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Summary\n",
"\n",
"*Estimated timing of tutorial: 1 hour, 20 minutes*\n",
"\n",
"We have now covered a number of common and powerful techniques for applying deep learning to decoding from neural data, some of which are common to almost any machine learning problem:\n",
"* Building and training deep networks using the **PyTorch** `nn.Module` class and built-in **optimizers**\n",
"* Choosing **loss functions**\n",
"\n",
"An important aspect of this tutorial was the `train()` function we wrote in coding exercise 3.2. Note that it can be used to train *any* network to minimize *any* loss function on *any* training data. This is the power of using PyTorch to train neural networks and, for that matter, **any other model**! There is nothing in the `nn.Module` class that forces us to use `nn.Linear` layers that implement neural network operations. You can actually put anything you want inside the `.__init__()` and `.forward()` methods of this class. As long as its parameters and computations involve only `torch.Tensor`'s, and the model is differentiable, you'll then be able to optimize the parameters of this model in exactly the same way we optimized the deep networks here.\n",
"\n",
"What kinds of conclusions can we draw from these sorts of analyses? If we can decode the stimulus well from visual cortex activity, that means that there is information about this stimulus available in the visual cortex. Whether or not the animal uses that information to make decisions is not determined from an analysis like this. In fact mice perform poorly in orientation discrimination tasks compared to monkeys and humans, even though they have information about these stimuli in their visual cortex. Why do you think they perform poorly in orientation discrimination tasks?\n",
"\n",
"See [Stringer, _et al._, 2021](https://doi.org/10.1016/j.cell.2021.03.042) for some potential hypotheses, but this is totally an open question!"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Bonus"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"\n",
"## Bonus Section 1: Neural network *depth*, *width* and *expressivity*\n",
"\n",
"Two important architectural choices that always have to be made when constructing deep feed-forward networks like those used here are\n",
"* the number of hidden layers, or the network's *depth*\n",
"* the number of units in each layer, or the layer *widths*\n",
"\n",
"Here, we restricted ourselves to networks with a single hidden layer with a width of $M$ units, but it is easy to see how this code could be adapted to arbitrary depths. Adding another hidden layer simply requires adding another `nn.Linear` module to the `__init__()` method and incorporating it into the `.forward()` method.\n",
"\n",
"The depth and width of a network determine the set of input/output transformations that it can perform, often referred to as its *expressivity*. The deeper and wider the network, the more *expressive* it is; that is, the larger the class of input/output transformations it can compute. In fact, it turns out that an infinitely wide *or* infinitely deep networks can in principle [compute (almost) *any* input/output transformation](https://en.wikipedia.org/wiki/Universal_approximation_theorem).\n",
"\n",
"A classic mathematical demonstration of the power of depth is given by the so-called [XOR problem](https://medium.com/@jayeshbahire/the-xor-problem-in-neural-networks-50006411840b#:~:text=The%20XOr%2C%20or%20%E2%80%9Cexclusive%20or,value%20if%20they%20are%20equal.). This toy problem demonstrates how even a single hidden layer can drastically expand the set of input/output transformations a network can perform, relative to a shallow network with no hidden layers. The key intuition is that the hidden layer allows you to represent the input in a new format, which can then allow you to do almost anything you want with it. The *wider* this hidden layer, the more flexibility you have in this representation. In particular, if you have more hidden units than input units, then the hidden layer representation of the input is higher-dimensional than the raw data representation. This higher dimensionality effectively gives you more \"room\" to perform arbitrary computations in. It turns out that even with just this one hidden layer, if you make it wide enough you can actually approximate any input/output transformation you want. See [here](http://neuralnetworksanddeeplearning.com/chap4.html) for a neat visual demonstration of this.\n",
"\n",
"In practice, however, it turns out that increasing depth seems to grant more expressivity with fewer units than increasing width does (for reasons that are not well understood). It is for this reason that truly *deep* networks are almost always used in machine learning, which is why this set of techniques is often referred to as *deep* learning.\n",
"\n",
"That said, there is a cost to making networks deeper and wider. The bigger your network, the more parameters (i.e., weights and biases) it has, which need to be optimized! The extra expressivity afforded by higher width and/or depth thus carries with it (at least) two problems:\n",
"* optimizing more parameters usually requires more data\n",
"* a more highly parameterized network is more prone to overfit to the training data, so requires more sophisticated optimization algorithms to ensure generalization"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Bonus Section 2: Gradient descent"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"\n",
"### Bonus Section 2.1: Gradient descent equations\n",
"\n",
"Here we provide the equations for the three steps of the gradient descent algorithm, as applied to our decoding problem:\n",
"\n",
"1. **Evaluate the loss** on the training data. For a mean squared error loss, this is given by\n",
"\n",
"\\begin{equation}\n",
"L = \\frac{1}{P}\\sum_{n=1}^P \\left( y^{(n)} - \\tilde{y}^{(n)} \\right)^2\n",
"\\end{equation}\n",
"\n",
"where $y^{(n)}$ denotes the stimulus orientation decoded from the population response $\\mathbf{r}^{(n)}$ to the $n$th stimulus in the training data, and $\\tilde{y}^{(n)}$ is the true orientation of that stimulus. $P$ denotes the total number of data samples in the training set. In the syntax of our `train()` function above, $\\mathbf{r}^{(n)}$ is given by `train_data[n, :]` and $\\tilde{y}^{(n)}$ by `train_labels[n]`.\n",
"\n",
"2. **Compute the gradient of the loss** with respect to each of the network weights. In our case, this entails computing the quantities\n",
"\n",
"\\begin{equation}\n",
"\\frac{\\partial L}{\\partial \\mathbf{W}^{in}}, \\frac{\\partial L}{\\partial \\mathbf{b}^{in}}, \\frac{\\partial L}{\\partial \\mathbf{W}^{out}}, \\frac{\\partial L}{\\partial \\mathbf{b}^{out}}\n",
"\\end{equation}\n",
"\n",
"Usually, we would require lots of math in order to derive each of these gradients, and lots of code to compute them. But this is where PyTorch comes to the rescue! Using a cool technique called [automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation), PyTorch automatically calculates these gradients when the `.backward()` function is called.\n",
"\n",
"More specifically, when this function is called on a particular variable (e.g., `loss`, as above), PyTorch will compute the gradients with respect to each network parameter. These are computed and stored behind the scenes, and can be accessed through the `.grad` attribute of each of the network's parameters. As we saw above, however, we actually never need to look at or call these gradients when implementing gradient descent, as this can be taken care of by PyTorch's built-in optimizers, like `optim.SGD`.\n",
"\n",
"3. **Update the network weights** by descending the gradient:\n",
"\n",
"\\begin{align}\n",
"\\mathbf{W}^{in} &\\leftarrow \\mathbf{W}^{in} - \\alpha \\frac{\\partial L}{\\partial \\mathbf{W}^{in}} \\\\\n",
"\\mathbf{b}^{in} &\\leftarrow \\mathbf{b}^{in} - \\alpha \\frac{\\partial L}{\\partial \\mathbf{b}^{in}} \\\\\n",
"\\mathbf{W}^{out} &\\leftarrow \\mathbf{W}^{out} - \\alpha \\frac{\\partial L}{\\partial \\mathbf{W}^{out}} \\\\\n",
"\\mathbf{b}^{out} &\\leftarrow \\mathbf{b}^{out} - \\alpha \\frac{\\partial L}{\\partial \\mathbf{b}^{out}}\n",
"\\end{align}\n",
"\n",
"where $\\alpha$ is called the **learning rate**. This **hyperparameter** of the SGD algorithm controls how far we descend the gradient on each iteration. It should be as large as possible so that fewer iterations are needed, but not too large so as to avoid parameter updates from skipping over minima in the loss landscape.\n",
"\n",
"While the equations written down here are specific to the network and loss function considered in this tutorial, the code provided above for implementing these three steps is completely general: no matter what loss function or network you are using, exactly the same commands can be used to implement these three steps.\n",
"\n",
"The way that the gradients are calculated is called **backpropagation**. We have a loss function:\n",
"\n",
"\\begin{align}\n",
"L &= (y - \\tilde{y})^2 \\\\\n",
"&= (\\mathbf{W}^{out} \\mathbf{h} - \\tilde{y})^2\n",
"\\end{align}\n",
"\n",
"where $\\mathbf{h} = \\phi(\\mathbf{W}^{in} \\mathbf{r} + \\mathbf{b}^{in})$, and $\\phi(\\cdot)$ is the activation function, e.g., RELU.\n",
"You may see that $\\frac{\\partial L}{\\partial \\mathbf{W}^{out}}$ is simple to calculate as it is on the outside of the equation (it is also a vector in this case, not a matrix, so the derivative is standard):\n",
"\n",
"\\begin{equation}\n",
"\\frac{\\partial L}{\\partial \\mathbf{W}^{out}} = 2 (\\mathbf{W}^{out} \\mathbf{h} - \\tilde{y})\\mathbf{h}^\\top\n",
"\\end{equation}\n",
"\n",
"Now let's compute the derivative with respect to $\\mathbf{W}^{in}$ using the chain rule. Note it is only positive if the output is positive due to the RELU activation function $\\phi$. For the chain rule we need the derivative of the loss with respect to $\\mathbf{h}$:\n",
"\n",
"\\begin{equation}\n",
"\\frac{\\partial L}{\\partial \\mathbf{h}} = 2 \\mathbf{W}^{out \\top} (\\mathbf{W}^{out} \\mathbf{h} - \\tilde{y})\n",
"\\end{equation}\n",
"\n",
"Thus,\n",
"\n",
"\\begin{align}\n",
"\\frac{\\partial L}{\\partial \\mathbf{W}^{in}} &= \\begin{cases}\n",
"\\frac{\\partial L}{\\partial \\mathbf{h}} \\frac{\\partial \\mathbf{h}}{\\partial \\mathbf{W}^{in}} & \\text{if } \\mathbf{h} > 0 \\\\\n",
"0 & \\text{otherwise}\n",
"\\end{cases} \\\\\n",
"&= \\begin{cases}\n",
"2 \\mathbf{W}^{out \\top} (\\mathbf{W}^{out} \\mathbf{h} - \\tilde{y}) \\mathbf{r}^\\top & \\text{if } \\mathbf{h} > 0 \\\\\n",
"0 & \\text{otherwise}\n",
"\\end{cases}\n",
"\\end{align}\n",
"\n",
"Notice that:\n",
"\n",
"\\begin{equation}\n",
"\\frac{\\partial \\mathbf{h}}{\\partial \\mathbf{W}^{in}}=\\mathbf{r}^\\top \\odot \\phi^\\prime\n",
"\\end{equation}\n",
"\n",
"where $\\odot$ denotes the Hadamard product (i.e., elementwise multiplication) and $\\phi^\\prime$ is the derivative of the activation function. In case of RELU:\n",
"\n",
"\\begin{align}\n",
"\\phi^\\prime &= \\begin{cases}\n",
"1 & \\text{if } \\mathbf{h} > 0 \\\\\n",
"0 & \\text{otherwise}\n",
"\\end{cases}\n",
"\\end{align}\n",
"\n",
"\n",
"It is most efficient to compute the derivative once for the last layer, then once for the next layer and multiply by the previous layer's derivative and so on using the chain rule. Each of these operations is relatively fast, making training of deep networks feasible.\n",
"\n",
"The command `loss.backward()` computes these gradients for the defined `loss` with respect to each network parameter. The computation is done using [automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation), which implements backpropagation. Note that this works no matter how big/small the network is, allowing us to perform gradient descent for any deep network model built using PyTorch."
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"\n",
"### Bonus Section 2.2: *Stochastic* gradient descent (SGD) vs. gradient descent (GD)\n",
"\n",
"In this tutorial, we used the gradient descent algorithm, which differs in a subtle yet very important way from the more commonly used **stochastic gradient descent (SGD)** algorithm. The key difference is in the very first step of each iteration, where in the GD algorithm we evaluate the loss *at every data sample in the training set*. In SGD, on the other hand, we evaluate the loss only at a random subset of data samples from the full training set, called a **mini-batch**. At each iteration, we randomly sample a mini-batch to perform steps 1-3 on. All the above equations still hold, but now the $P$ data samples $\\mathbf{r}^{(n)}, \\tilde{y}^{(n)}$ denote a mini-batch of $P$ random samples from the training set, rather than the whole training set.\n",
"\n",
"There are several reasons why one might want to use SGD instead of GD. The first is that the training set might be too big, so that we actually can't actually evaluate the loss on every single data sample in it. In this case, GD is simply infeasible, so we have no choice but to turn to SGD, which bypasses the restrictive memory demands of GD by subsampling the training set into smaller mini-batches.\n",
"\n",
"But, even when GD is feasible, SGD turns out to often be better. The stochasticity induced by the extra random sampling step in SGD effectively adds some noise in the search for local minima of the loss function. This can be really useful for avoiding potential local minima, and enforce that whatever minimum is converged to is a good one. This is particularly important when networks are wider and/or deeper, in which case the large number of parameters can lead to overfitting.\n",
"\n",
"Here, we used only GD because (1) it is simpler, and (2) it suffices for the problem being considered here. Because we have so many neurons in our data set, decoding is not too challenging and doesn't require a particularly deep or wide network. The small number of parameters in our deep networks therefore can be optimized without a problem using GD."
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"include_colab_link": true,
"name": "W1D5_Tutorial1",
"provenance": [],
"toc_visible": true
},
"kernel": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.17"
}
},
"nbformat": 4,
"nbformat_minor": 0
}