YOSHUA BENGIO: [INAUDIBLE]. Thank you [INAUDIBLE]. So I'll talk about [INAUDIBLE]. I'll talk about representations and learning representations. And the word deep here, I'll explain what it means. So my goal is to contribute to building intelligent machines, also known as AI. And how do we get a machine to be smart-- to take good decisions? Well, it needs knowledge. [INAUDIBLE] [? researchers ?] from the early days-- '50s, '60s, '70s-- tried to give the knowledge to the machine-- the knowledge we have explicitly. And it didn't work quite as well as was hoped. One reason is that a lot of our knowledge is not something we can communicate verbally and write down in a program. So that knowledge has to come from somewhere else. And basically what we have found is you can get that knowledge through observing the world around us. That means learning. OK-- so we need learning for AI.

What is learning? What is machine learning? It's not about learning things by heart. That's just a fact. What it is about is generalizing from the examples you've seen to new examples. And what I like to tell my students is it's taking probability mass that is on the training examples and somehow guessing where it should go-- which new configurations of the things we see make sense or are plausible. This is what learning is about. It's guesswork. At first we can measure [INAUDIBLE] we can guess. And I'll mention something about dimensionality and geometry that comes up when we think about this [INAUDIBLE]. And one of the messages will be that we can maybe fight this dimensionality problem by allowing the machine to discover underlying causes-- the underlying factors that explain the data. And this is a little bit like [INAUDIBLE] is about.

So let's start from learning, an easy [INAUDIBLE] of learning. Let's say we observe (x, y) pairs where x is a number and y is a number.
And the stars here represent the examples we've seen of (x, y) configurations. So we want to generalize to new configurations. In other words, for example, in this problem, typically we want to predict a y given a new x. And there's an underlying relationship between y and x, meaning the expected value of y given x, which is shown by this purple curve. But we don't know it. That's the problem with machine learning. We're trying to discover something we don't know already. And we can guess some function. This is the predicted or learned function.

So how could we go about this? One of the most basic principles by which machine learning algorithms are able to do this is to assume something very simple about the world around us-- about the data we're getting or the function we're trying to discover. It's just assuming that the function we're trying to discover is smooth, meaning if I know the value of the function at some point x, and I want to know the value at some nearby point x prime, then it's reasonable to assume that the value at x prime of the function we want to learn is close to the value at x. That's it. I mean, you can formalize that and [INAUDIBLE] in many different ways and exploit it in many ways. And what it means here is if I ask you what y should be at this point-- what I'm going to do is look up the values of y that I observed at nearby points. And combining these-- make a reasonable guess like this one. And if I do that on problems like this, it's actually going to work quite well. And a large fraction of the applications that we're [? seeing ?] use this principle. And [INAUDIBLE] enough of just this principle. But if we only rely on this principle for generalization, we're going to be in trouble. That's one of the messages I want to explain here. So why are we going to be in trouble? Well, basically we're doing some kind of interpolation.
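As a rough illustration of the smoothness idea just described (look up the y values observed at nearby x and combine them), here is a minimal sketch using a k-nearest-neighbor average. This is one simple way to exploit smoothness, not necessarily the method on the speaker's slide; the function name `knn_predict`, the parameter `k`, and the toy data are assumptions made for this example.

```python
import math


def knn_predict(xs, ys, x_new, k=3):
    """Guess y at x_new by averaging the y values of the k nearest observed x.

    Hypothetical helper for illustration only: it assumes the target
    function is smooth, so nearby x values have similar y values.
    """
    if not xs or len(xs) != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    k = min(k, len(xs))
    # Indices of the k training points closest to x_new.
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x_new))[:k]
    return sum(ys[i] for i in nearest) / k


# Toy data (assumed): a slowly varying function sampled densely enough.
xs = [i * 0.5 for i in range(13)]      # 0.0, 0.5, ..., 6.0
ys = [math.sin(x) for x in xs]
print("smooth target :", knn_predict(xs, ys, 1.2, k=2), "vs", math.sin(1.2))

# Same sample locations, but a target with many more ups and downs than
# the samples can cover: the local average has little to go on.
ys_wiggly = [math.sin(20 * x) for x in xs]
print("wiggly target :", knn_predict(xs, ys_wiggly, 1.2, k=2), "vs", math.sin(24.0))
```

The second call uses the same interpolation rule on a function that oscillates faster than the samples can track, which is the situation the talk turns to next.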
So if I see enough examples-- the green stars here-- to cover the ups and downs of the function I'm trying to learn, then I'm going to be fine. But what if the function I want to learn has many more ups and downs than I can possibly observe through data? Because even Google has a finite number of examples. Even if you have millions or billions of examples, the functions we want to learn for AI are not like this one. The number of configurations of the [? variables ?] of interest may be exponentially large. So something maybe bigger than the number of [? atoms ?] in the universe. So there's no way we're going to have enough examples to cover all the configurations. For example, think of the number of different English sentences, which is something that Google is interested in.

And this problem is illustrated by the so-called curse of dimensionality where you consider what happens when you have not just one variable but many variables and all of their configurations. How many configurations of N variables do you have? Well, you have an exponential number of configurations. So if I wanted to learn about a single variable-- say it takes real values-- I can divide its range into intervals. And I count how many of my examples fall into each of those bins. I can estimate the probability of different intervals coming up. So that's easy because I only want to know about a small number of different configurations. But if I'm looking at two variables, then the number of configurations may be [INAUDIBLE] [? squared, ?] and with [? three-- ?] even more. But typically, I'm going to have hundreds-- if you're thinking about images, it's thousands-- tens of thousands-- hundreds of thousands. So it's crazy how many configurations there are. So how do we possibly generalize to new configurations? We cannot just break up this space into small cells and count how many things happen in each cell because the new examples that we want to [? query-- ?] new configurations that [INAUDIBLE] asked about might be in some region where we hadn't [INAUDIBLE]. So that's the problem of generalizing [INAUDIBLE].

So there's one thing that can help us, but it's not going to be sufficient. It's something that happens with [? AI problems. ?] It's very often [INAUDIBLE] vision, [INAUDIBLE] processing and understanding and many other problems where the set of configurations of variables that are plausible-- that can happen in the real world-- occupies a very small volume within the whole set of possible configurations. So let me give an example. In images, if I choose the pixels in an image randomly-- in other words, if I sample an image from a completely uniform distribution, I'm going to get things like this. Just [INAUDIBLE]. And I can repeat this for eons and eons. And I'm never going to [? sample ?] something that looks like a face. So what it means is that faces-- images of faces-- are very rare in the space of images. They occupy a very small volume, much less than what this picture would suggest. And so this is a very important hint. It means that actually the task is to find out where this distribution concentrates.

I have another example here. If you take the image of a four like this one and you do some geometric transformations to it like rotating it, scaling it, you get slightly different images. And if at each point, you allow yourself to make any of these transformations, you can create a so-called manifold-- so a surface of possible images. Each point here corresponds to a different image. And the number of different changes that you can make is basically the dimensionality of this manifold. So in this case, even though the data lives in a high-dimensional space, the actual variations we care about are of low dimensionality. And knowing that, we can maybe do better in terms of learning. One thing about the curse of dimensionality is I don't like the name curse of dimensionality because it's not really dimensionality.
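To put rough numbers on the cell-counting argument made earlier in the talk (cut each variable into intervals, then count examples per cell), here is a small sketch. The helper names `num_cells` and `max_fraction_covered`, the choice of 10 intervals per variable, and the figure of one billion examples are all assumptions picked for illustration.

```python
def num_cells(bins_per_variable, num_variables):
    """Number of cells when each variable is cut into the same number of intervals."""
    return bins_per_variable ** num_variables


def max_fraction_covered(num_examples, bins_per_variable, num_variables):
    """Upper bound on the fraction of cells that can contain at least one example.

    Each example lands in exactly one cell, so at most num_examples cells
    are occupied, no matter how the data is distributed.
    """
    return min(1.0, num_examples / num_cells(bins_per_variable, num_variables))


# Assumed setting: 10 intervals per variable and one billion examples.
examples = 10 ** 9
for d in (1, 2, 3, 10, 100):
    frac = max_fraction_covered(examples, 10, d)
    print(f"{d:>3} variables: {num_cells(10, d):.3e} cells, "
          f"at most {frac:.3e} of them observed")
```

The bound is loose (real data concentrates in far fewer cells), but it already shows that counting per cell cannot cover the space once there are more than a handful of variables.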
You can have many dimensions but have a very simple function. What really matters is how many variations does the function have-- how many ups and downs? So we actually had some fairly [? cool ?] results showing that the number of examples you would need, if you were only relying on this [INAUDIBLE] assumption, is essentially linear in the number of ups and downs of the function [INAUDIBLE].

So let's come back to this idea of learning where to put probability mass. So in machine learning, what we have is data. Each example is a configuration of variables. And we know that this configuration occurred in the real world. So we can say the probability for this configuration. So this is the space of configurations I'm showing in 2D. So we know that this configuration is plausible. [INAUDIBLE]. So we can just put a [? beacon ?] [INAUDIBLE] here. And we can put a [? beacon ?] at every example. The question is how do we take this probability mass and sort of give a little bit of that to other places. In particular, we'd like to put mass in between. If there really was a manifold that has some structure, and if we could discover that structure, it would be great.

So the classical machine learning way of doing things is to say that the distribution function-- the function that you're trying to learn in this case-- is smooth. So if it's very probable here, it must also be probable in the neighborhood. So we can just do some mathematical equation that will shift some mass from here to the [? different ?] neighbors. Then we get a distribution like this as our model. And that works reasonably well. But it's not the right thing to do. It's putting mass in many directions we don't care about. Instead, what we're going to do is to discover that there is something about this data. There is some structure. There is some abstraction that allows us to be very specific about where we're going to put probability mass.
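The "shift some mass to the neighbors" step in the classical approach can be sketched with an isotropic Gaussian kernel density estimate (a Parzen window). This is an assumed stand-in for whatever smoothing the slide shows; the name `parzen_density` and the `bandwidth` value are hypothetical choices for this example.

```python
import math


def parzen_density(data, x, bandwidth=0.5):
    """Isotropic Gaussian kernel density estimate at point x.

    data: list of training points (tuples of floats), all the same length as x.
    Every training point spreads its probability mass equally in all
    directions, which is the smoothness-only assumption from the talk.
    """
    d = len(x)
    norm = (2 * math.pi * bandwidth ** 2) ** (d / 2)
    total = 0.0
    for p in data:
        sq_dist = sum((a - b) ** 2 for a, b in zip(p, x))
        total += math.exp(-sq_dist / (2 * bandwidth ** 2)) / norm
    return total / len(data)


# One training example at the origin: points at the same distance get the
# same density regardless of direction.
single = [(0.0, 0.0)]
print(parzen_density(single, (0.3, 0.0)), parzen_density(single, (0.0, 0.3)))

# Assumed toy data lying on a 1-D manifold (the diagonal) inside 2-D space.
diagonal = [(float(i), float(i)) for i in range(5)]
print("on the diagonal :", parzen_density(diagonal, (1.5, 1.5)))
print("off the diagonal:", parzen_density(diagonal, (2.0, 1.0)))
```

Because the kernel is the same in every direction, each example leaks mass off the diagonal as readily as along it. A model that knew about the diagonal structure could keep the mass on it instead.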
And we might come up with something like this, which in 2D doesn't look like a big difference. But in high dimensions, the number of directions you're allowed to move here is very small compared to the number of dimensions here. And the volume grows exponentially with dimension. So you can have a huge gain by guessing properly in which directions things are allowed to move while keeping high probability.

So, now to the core of this presentation which is about representation learning. I've talked about learning in general and some of the issues-- some of the challenges with applying learning to AI. Now, when you look at how machine learning is applied in industry, what people do for 90% of the time-- what they do with the effort of engineers is not really improve machine learning. They use existing machine learning. But to make the machine learning [INAUDIBLE] work well, they do [INAUDIBLE] feature engineering. So that means taking the raw data and transforming it-- extracting some features-- deciding what matters-- throwing away