55 minutes 1 second
Speaker 1
00:00:08 - 00:00:28
So, you've dedicated your life to this extraordinary project of trying to organize electrons into patterns in which they can think useful thoughts. I mean, if your younger self could have a quick visit and an update from you today, how surprised would he be at what had actually happened?
Speaker 2
00:00:28 - 00:00:56
So, I think... You know, when you're young, you're extremely enthusiastic, and you don't realize how complex things are, right? And so you're a little naive about how easy it would be to build intelligent machines or discover new principles. And that's what makes you fearless. And in a way, that's why, you know, when you're young, you're a little more creative, right? Because you're not scared by the complexity of what you're imagining.
Speaker 2
00:00:56 - 00:01:14
And so very often when you're young, you know, you think you have an idea that, you know, is very different from what everybody else is doing, and then you push it, and then you realize someone else thought about it 20 years ago and couldn't make it work. Or you realize maybe people thought about it and couldn't make it work, but maybe it's still a good idea and you should push, and that's basically what happened to me.
Speaker 1
00:01:14 - 00:01:18
Maybe if you'd seen how hard it actually was, you'd have thought, I don't know, dentistry seems pretty appealing.
Speaker 2
00:01:19 - 00:01:19
Yeah.
Speaker 1
00:01:21 - 00:01:33
A couple of years ago, you won the ACM Turing Award with Geoffrey Hinton and Yoshua Bengio. That's the most prestigious prize in computer science. What was that awarded for?
Speaker 2
00:01:34 - 00:02:18
It was awarded to us for basically having promoted and also developed some of the early algorithms for what we now call deep learning. And so deep learning is this idea that you can train a machine end-to-end to do a particular task. So machine learning has been around for a long time, but deep learning is sort of an extreme form of it, if you want. And it's based on the idea that a learning machine can be built as a large network of very simple elements, which are somewhat analogous to the neurons in the brain, but not really the same thing. It's like they're as analogous to the neurons in the brain as the wings of an airplane are to the wings of a bird.
Speaker 2
00:02:18 - 00:02:41
So it's not that similar. And that idea is very old. The roots of it go back to the 1940s, and there was kind of a wave of interest for this in the 50s and 60s, and it died off. And then it came back to the fore in the mid-80s, late 80s. And Geoff had been working on it for a while.
Speaker 2
00:02:41 - 00:03:15
That's when I started my career, and Yoshua as well. And there was a wave of interest that kind of died off again in the mid-90s. But the three of us knew this was a good set of techniques and that one day the community would kind of get interested in it again. And so we basically started a conspiracy, you can think of it this way, or at least a deliberate attempt at demonstrating that those methods worked well, and that succeeded beyond our wildest dreams. So it sort of started a new wave of interest in those methods and basically started a new industry.
Speaker 1
00:03:15 - 00:03:45
So that's almost, as a sort of first-order approximation, a history of AI: it was interesting for a long time, it kind of didn't get very far, and then suddenly, about 10 years ago, certainly from a commercial point of view, it exploded. And the reason it exploded was because of deep learning, which you helped create. So talk to us about it. I mean, you're credited with creating these things called convolutional neural networks. What is that?
Speaker 2
00:03:45 - 00:04:14
Right. So a convolutional neural network is a particular, what we call, architecture of a neural net. So a neural net, as I said, is a network of simple, neuron-like elements. And a convolutional net is a particular way of connecting those neurons with each other so that the architecture is particularly well suited to deal with input data that comes to the system in the form of an array of numbers. So an image is basically an array of numbers. A speech signal can be represented as an array of numbers as well.
Speaker 2
00:04:15 - 00:04:38
And so there are a lot of signals, natural signals and less natural ones, that you can represent this way. And convolutional nets are well suited to these kinds of signals. Now, their architecture is inspired by what we know about the architecture of the visual cortex. So there's a lot of inspiration from neuroscience in this. And it's called a convolutional neural net because it's based on a mathematical operation called convolution.
Speaker 2
00:04:40 - 00:04:54
And it's one of those multi-layer neural nets that automatically learn representations of, say, images that are hierarchical. So the representations get more and more abstract as you go up in the layers.
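To make the convolution operation concrete, here is a minimal sketch in Python. It is not from the conversation: the random image, the hand-made edge filter, and the sizes are illustrative assumptions, and a real convolutional net would learn many such filters from data and run them with optimized library code.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide one small filter over a grayscale image (an array of numbers)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # weighted sum of one local patch of pixels
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(28, 28)             # toy image: a 28x28 array of numbers
edge_filter = np.array([[1.0, 0.0, -1.0],  # a hand-made vertical-edge detector
                        [1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0]])
feature_map = convolve2d(image, edge_filter)
print(feature_map.shape)                   # (26, 26): one feature map for one filter
```

Stacking layers of such filters, each one looking at the feature maps produced by the layer below, is what yields the increasingly abstract, hierarchical representations just described.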
Speaker 1
00:04:54 - 00:05:12
Right, so that hierarchy is a fundamental part of the structure. At one point you're seeing a colored pixel, and then you see that that's part of a little shape, and then that shape is part of a more complex object. How does that relate to this idea of backpropagation, which seems to be fundamental to getting deep learning to work?
Speaker 2
00:05:12 - 00:05:35
Yeah, so really, one of the things that Geoff, and to some extent I, are famous for is backpropagation. And backpropagation is a way... So I have to explain how machine learning works. So there is a very simple form of machine learning called supervised learning. And the way it works is that, say, you want to train a machine to distinguish images of cars from images of airplanes.
Speaker 2
00:05:35 - 00:05:57
You show an image of a car. You run it through the neural net and wait for the output to come out. If the output is car, you basically don't do anything. If the output is not car, then you adjust parameters inside of the neural net so that the output gets closer to the output you want, to car, right? And those parameters are the strengths of the connections between the neurons.
Speaker 2
00:05:57 - 00:06:06
So each connection between neurons, and there could be billions of them, has sort of a strength, an adjustable positive or negative number.
Speaker 1
00:06:06 - 00:06:26
And this is the model where the AI is being trained on human-tagged images. So here is an image, this is a car, this is not a car. And so the net is going, OK, I was successful, this is a car. And when it's successful, it strengthens the connections that took it to that point? Is that how to think about it?
Speaker 2
00:06:26 - 00:06:47
That's right, strengthened or not. But it figures out good configurations of all the connections that will produce the correct answer whenever you show it one of the training samples it's been trained on. The magic of this is: what is it going to do when you show it an image it's never seen before? Another image of a car it's never seen, for example, is it going to produce car? That's called the generalization ability.
Speaker 2
00:06:48 - 00:07:20
So here is the problem that backpropagation solves. What you have to figure out is in which direction and by how much you have to change a particular weight, among the billions that there are in the network, so that the output gets closer to the one you want. And you can do this by basically computing the equivalent of a derivative. So the function that the network computes produces a number at the output, or a series of numbers. And you have to figure out: if I change this weight a little bit in this direction, is this number going to go up or down?
Speaker 2
00:07:20 - 00:07:37
And that's a derivative, right? Now, you have one of those derivatives for every single one of the weights in the network, so you might have a giant list of a billion derivatives, and that's called a gradient. Backpropagation is a way of computing this gradient very efficiently, by essentially propagating signals backwards inside of the neural net.
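As a rough illustration of the procedure just described (a toy sketch, not anyone's production code), here is a two-layer network trained on a single made-up example: the forward pass produces an output, the error is measured against the target, and derivatives propagated backwards through the layers say in which direction, and by how much, to nudge each weight.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)              # a tiny "image": four input numbers
target = np.array([1.0])                # desired output, e.g. 1.0 meaning "car"

W1 = 0.5 * rng.standard_normal((3, 4))  # connection strengths: inputs -> 3 hidden neurons
W2 = 0.5 * rng.standard_normal((1, 3))  # connection strengths: hidden -> 1 output neuron

for step in range(100):
    # forward pass
    h = np.tanh(W1 @ x)                  # hidden activations
    y = 1.0 / (1.0 + np.exp(-(W2 @ h)))  # output between 0 and 1
    loss = 0.5 * np.sum((y - target) ** 2)

    # backward pass: the chain rule, applied from the output back towards the input
    dz2 = (y - target) * y * (1 - y)     # derivative of the loss w.r.t. the output sum
    dW2 = np.outer(dz2, h)
    dz1 = (W2.T @ dz2) * (1 - h ** 2)    # error signal propagated backwards
    dW1 = np.outer(dz1, x)

    # nudge every weight a little against its derivative
    W2 -= 0.5 * dW2
    W1 -= 0.5 * dW1

print(round(float(loss), 4))             # the output has moved close to "car"
```

The full list of those per-weight derivatives is the gradient; backpropagation computes the whole list in one backward sweep rather than perturbing each weight separately.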
Speaker 1
00:07:38 - 00:08:00
Is there an analog to how brains work there? Because we've heard from neuroscientists that you can think of the brain as a prediction machine where information flows up this hierarchy, but for the brain to interpret it, it's constantly back-propagating down the hierarchy to a certain set of expectations. Is there something similar going on in this deep learning process?
Speaker 2
00:08:00 - 00:08:30
Okay, so there's a terrible confession that all neuroscientists have to make, which is that we actually have no idea what learning algorithm the brain uses. I mean, we have some idea, of course; we can see the effect of it. We know that learning affects the efficacy of the synapses, which are the connections between neurons. And there are certain rules that we know those kinds of synaptic modifications obey. So for example, there's something called spike-timing-dependent plasticity.
Speaker 2
00:08:31 - 00:08:59
And it means that if a neuron that connects to another one is very often active, and the second neuron also becomes active as a consequence of the first one, then the connection strengthens. But that's probably just a side effect of something much more complicated that we don't understand. And so in my mind, the question is: is the brain doing something similar to a learning machine, which means optimizing some sort of error between the output it produces and the output it wants to produce?
Speaker 2
00:08:59 - 00:09:24
And then the question is, where is this output that it wants to produce? Where does it come from? So that's the first question. And then if it does do that, does it do it by what we call a gradient-based algorithm, which means a method that will evaluate those gradient derivatives that I was talking about earlier. And what is pretty clear is that if the brain does something like this, it's not using straight back propagation.
Speaker 2
00:09:24 - 00:09:33
It's using something a little different, which may have the same effect in the end, but it's not entirely clear what it is. And so there's a lot of hypotheses about this, but no real answer.
Speaker 1
00:09:33 - 00:09:55
Interesting. So often, the picture people have is that we're trying to reverse-engineer the brain, but actually that may be a harder job than just trying out different forms of building AI, rather than trying to copy what is still deeply mysterious?
Speaker 2
00:09:55 - 00:10:20
Yeah, I think the sort of bird analogy is very good there, because you can try to build airplanes by copying birds, but then birds have a lot of details to them, like feathers and muscles and things like this, that may be irrelevant to building airplanes, in fact, are irrelevant to building airplanes. And they have the advantage of having very fast control, with a brain and vision and stuff like that, which you cannot actually reproduce with machines, certainly not with 19th-century technology.
Speaker 1
00:10:20 - 00:10:27
Which makes them... It'd be pretty awesome to sit in the plane and look out the window and see the wings flapping like this. That would be kind of incredible.
Speaker 2
00:10:27 - 00:10:49
It's incredible, and you know, birds exploit all kinds of properties of aerodynamics. And so the underlying mechanism of flight, whether it's the flight of airplanes or birds, is aerodynamics. And so the question is, we need to understand aerodynamics to build airplanes. Even though they were inspired by birds initially, they're very different. But the underlying principle is the same.
Speaker 2
00:10:49 - 00:11:04
You generate lift by pushing yourself through the air. What are the underlying principles behind intelligence and learning? What is the equivalent of aerodynamics, if you want, for intelligence and learning? I mean, that's the quest of my entire life, OK? And we don't really have the answer yet.
Speaker 1
00:11:04 - 00:11:29
So you described a form of machine learning, supervised machine learning, which is, I guess, the form that's most understood by the lay public: these sorts of masses of labeled images or videos that a computer gradually gets to recognize. But the real power of machine learning can go far beyond that, to something that I guess you call self-supervised learning. Describe that.
Speaker 2
00:11:29 - 00:11:44
OK, well, there are really three forms of learning, or paradigms of learning, I should say, that people use today. One is supervised learning, which we just talked about. In supervised learning, you tell the machine what the correct answer is, right? And it adjusts itself to get closer to that. In reinforcement learning, you don't tell the machine what the correct answer is.
Speaker 2
00:11:45 - 00:12:00
You only tell it whether the answer it produced was good or bad. And so if there are lots and lots of possible answers, it's much less efficient. The system has to try many things before it figures out how to produce the right answer. It's very successful in games.
Speaker 2
00:12:01 - 00:12:28
So if you want to train a machine to play Go, play chess, play a game, something like this, you have several copies of this machine play against itself. And by this reinforcement system, it basically improves itself. And it doesn't need to be fed the correct answer by humans. So it's successful in a number of those situations. There are very few applications in real life where it's useful, because it's so inefficient at learning.
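To see why reward-only feedback is so much less efficient, here is a toy sketch, purely illustrative and nothing like a real game-playing system: the learner is never told which action is correct, only whether the action it tried paid off, so it needs thousands of trials before its value estimates settle.

```python
import numpy as np

rng = np.random.default_rng(0)
true_payoff = np.array([0.2, 0.5, 0.8])  # hidden probability of reward for each action
values = np.zeros(3)                     # learner's running estimate per action
counts = np.zeros(3)

for trial in range(5000):
    # mostly pick the action currently believed best, sometimes explore at random
    if rng.random() < 0.1:
        action = int(rng.integers(3))
    else:
        action = int(np.argmax(values))
    reward = float(rng.random() < true_payoff[action])  # feedback is only "good or bad"
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]  # running average

print(values.round(2))  # only after many trials do the estimates approach the hidden payoffs
```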
Speaker 2
00:12:28 - 00:12:42
But there are a few. And then the third form is self-supervised learning. And there are several names for it. People in the past used to call this unsupervised learning. And I don't like that name because it's a loaded term and it doesn't really reflect what's going on.
Speaker 2
00:12:42 - 00:13:10
So the idea of self-supervised learning is that it's the type of learning that we think we're observing in young animals and baby humans. It's the ability to learn how the world works by observation. Babies have very little ability to interact with the world. They can't act on it, but they observe all the time. And just by observation, in the space of just a few months, they can distinguish between animate and inanimate objects.
Speaker 2
00:13:10 - 00:13:30
They figure out that if an object is hidden by another one, it still exists. That's object permanence. And they figure out that if an object has a certain shape, it's not going to stay stable: if you put it on a table, it's going to fall. And that's how you can tell whether a baby has learned a particular concept: by what happens when that concept is violated by reality.
Speaker 2
00:13:31 - 00:14:02
So we learn models of the world. And those models of the world allow us to represent the world. And that allows us to learn any particular task very quickly. Because by representing the visual world, for example, we have a good idea of what objects are, that objects are in front of backgrounds, that the world is three-dimensional, that there are objects that move, that there are animals with four legs, and blah, blah, blah. And so now I show you an example of an elephant, and you know what an elephant is.
Speaker 2
00:14:02 - 00:14:04
I don't need to show you a million examples of elephants.
Speaker 1
00:14:05 - 00:14:39
So in terms of understanding the power of that, is this an example? So one obvious application for AI is self-driving cars. On the supervised learning model, you know, a car is driving, you're looking at lots of videos of, I don't know, children near a road or tree branches waving near a road. You know, just at the level of pixels, those may not be fundamentally different. The car can't really decide how to alter its behavior based on that.
Speaker 1
00:14:39 - 00:14:57
But if you start to know: branch, it doesn't matter if you hit it; child, oh my goodness. So actually learning what an object is and what category it belongs to, that feels like a hugely steep curve to climb, but that's what you've been working on?
Speaker 2
00:14:57 - 00:15:25
Yes. You know, learning the concept of objects and geometry, learning to represent the world, learning that certain objects behave in certain ways. So on the street, a car and a pedestrian will not behave the same way. And you can tell there's a car at a stop sign, and you're coming up to the car, and you know that the person is not looking at your car, so you know that maybe there is danger. You're kind of slowing down, because you can tell. So there's a lot of those things.
Speaker 2
00:15:25 - 00:15:59
Your model of the world basically allows you to learn things quickly and to do them safely. The current methods of reinforcement learning, for example, that we could use to train cars to drive themselves are so inefficient that you would have to have a car drive itself for millions of hours, and even then it may not be very reliable. And it's because it's starting from zero. So the idea of self-supervised learning is that you don't want the system to start from zero. You want it to learn as much as possible about how the world works beforehand.
Speaker 2
00:16:00 - 00:16:21
If you don't have a model of the world and you're learning to drive, and you're driving a car right next to a cliff, you have no idea that by turning the wheel to the right, the car is going to run off the cliff and you're going to get killed. Whereas if you do have a model of the world, and you don't need to have a very sophisticated one, you know that's going to happen, so you don't even try it. You know it's going to be a bad outcome.
Speaker 1
00:16:21 - 00:16:32
So how much progress are we making on that? To what extent are there computer programs now that have some kind of compelling model of the world that they can operate and navigate through?
Speaker 2
00:16:32 - 00:17:06
Okay, there are two reasons for self-supervised learning. One is learning models of the world that are predictive, so you can use them for planning in self-driving cars, robotics, et cetera, basically to allow a machine to predict in advance the consequences of its actions so that it can plan to reach a particular goal. That's one use. But the other one is just learning to represent the world: basically, learning as much as you can about how the world works so that this knowledge can be used for learning a particular task subsequently.
Speaker 2
00:17:07 - 00:17:30
And those are kind of two very similar things that self-supervised learning might help us solve. There's been a lot of success in the second one, learning representations. And there's been a lot of success in prediction, but only for text, basically, not for things like video or images or the real world. There's been some progress, but it's not there yet.
Speaker 2
00:17:30 - 00:17:31
Okay, so the...
Speaker 1
00:17:31 - 00:17:37
So can you give us an example of that for text? You're talking here basically about written language, translation, for example.
Speaker 2
00:17:37 - 00:18:13
That's right. So the way the best natural language understanding systems are trained today, for natural language understanding, translation, anything that deals with text, the best systems today are trained in self-supervised mode, or at least pre-trained in self-supervised mode. And the way they're trained is that you take a segment of a sequence of words from a text, from a large text corpus, and you remove some of the words, or you substitute them with others. And then you train a very large neural net, which may have billions of parameters, to predict the missing words or the words that have been changed. Tell me which word should be here.
Speaker 2
00:18:15 - 00:18:46
And just by doing this, the system has to learn a good representation of words that will allow it to actually solve that problem. So it needs to learn that when a sentence talks about a pet chasing another animal, it can be a cat chasing a mouse, or it could be a lion chasing an antelope in the savannah, if it's not in a house, depending on the context. And so, you know, it learns basically the structure of the world just by figuring out how to fill in the blanks in text.
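The pretext task itself is simple to write down. Here is a minimal sketch in Python of how a training example could be generated; the sentence, the mask token, and the masking fraction are illustrative assumptions, and real systems do this over billions of words and feed the corrupted sequence to a large network that must score every word in the vocabulary for each blank.

```python
import random

random.seed(0)

def make_masked_example(words, mask_fraction=0.2, mask_token="[MASK]"):
    """Blank out a fraction of the words; the blanked words become the targets."""
    n_mask = max(1, int(len(words) * mask_fraction))
    positions = random.sample(range(len(words)), n_mask)
    corrupted = list(words)
    targets = {}
    for pos in positions:
        targets[pos] = corrupted[pos]   # the word the network must recover
        corrupted[pos] = mask_token
    return corrupted, targets

sentence = "the cat chased the mouse across the kitchen floor".split()
corrupted, targets = make_masked_example(sentence)
print(" ".join(corrupted))  # the same sentence with some words blanked out
print(targets)              # the supervision comes from the text itself, no human labels
```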
Speaker 1
00:18:46 - 00:18:47
So that's...
Speaker 2
00:18:47 - 00:18:59
And the amazing thing is that those systems seem to acquire a surprisingly large amount of knowledge about how the world works without actually having any connection with any reality other than text representations of it.
Speaker 1
00:18:59 - 00:19:32
OK, so that is amazing. Let me play that back and see if I've got that right. So someone might think that to teach a computer how to understand language, you have to focus on the rules of grammar and all the rest of it. What you're saying is, no, you take a bunch of text, you just delete words from it, pretty much at random, almost, and let a computer loose on it, and basically all you have to do is say: try a bunch of things, have you been successful in identifying the missing words? And that can all be done by a computer.
Speaker 1
00:19:32 - 00:19:49
So a computer can run millions of different attempts, algorithmic attempts, at identifying the words. You have no idea what it's doing. But at the end of it, when it starts to find the missing words, amazingly, along with that comes a kind of understanding of language?
Speaker 2
00:19:49 - 00:20:13
Yes, so you collect an enormous amount of text data, you know, probably text with billions of words. You take a segment of about a thousand words, you remove maybe 20 percent of them, maybe a couple hundred words. And then you run this segment through this giant neural net that may have billions of weights in it. And it's got some sort of memory built in. Those are called transformer architectures.
Speaker 2
00:20:13 - 00:20:31
They appeared about two and a half years ago. And then you train the system to predict the missing words, right, or the substituted words. Now, the system cannot do a perfect job at it. It can only predict, you know: I think the word that should be here is probably a pet of some kind. I don't know if it's a cat or a dog, but it's something like that, right?
Speaker 2
00:20:31 - 00:21:05
So it gives you a big list of numbers, which are basically scores for each possible word in the dictionary. So a long list of 100,000 numbers, basically, telling you, for each word in the dictionary, how likely it is that the word appears here. And then what happens is the system learns to represent text by doing this. And then what you can do, when you're faced with a real problem like understanding text or wanting to translate it, is you give a sequence of words to the system. You run it through a subset of the layers of that network.
Speaker 2
00:21:05 - 00:21:20
You cut at some point pretty close to the output. And that's a representation of the meaning of the input sentence. And then using this representation, you train something to predict whatever it is that you want to predict. Is this... What topic is this talking about?
Speaker 2
00:21:20 - 00:21:35
Is this a news article? Is this positive or negative? Is this hate speech or not? Things like that. And the beauty of it is that now you can do this multilingually, so you can train those systems to represent language, regardless of which language it is.
Speaker 2
00:21:35 - 00:21:41
And it basically produces a representation of the meaning independently of language. That is super important and is completely revolutionary.
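A rough sketch of that second stage, under stated assumptions: `sentence_representation` below is only a stand-in for "run the text through the pretrained layers and cut near the output", and the data is a toy set. The point is that only a small classifier head is trained for the new task, on top of the fixed representation.

```python
import numpy as np

def sentence_representation(sentence, dim=16):
    """Stand-in for a pretrained network's representation of a sentence's meaning."""
    local = np.random.default_rng(abs(hash(sentence)) % (2 ** 32))
    return local.standard_normal(dim)

texts = ["great product, works really well", "terrible purchase, waste of money"] * 50
labels = np.array([1, 0] * 50)                    # 1 = positive, 0 = negative
X = np.stack([sentence_representation(t) for t in texts])

# a small logistic-regression head: the only part trained for this particular task
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))        # predicted probability of "positive"
    w -= 0.5 * (X.T @ (p - labels)) / len(labels)
    b -= 0.5 * np.mean(p - labels)

print(float((np.round(p) == labels).mean()))      # accuracy on the toy data
```

The same frozen representation could feed several such heads, one per question (topic, sentiment, hate speech, and so on), which is what makes the pretrained part so reusable.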
Speaker 1
00:21:41 - 00:21:56
And the program that results from that, the software that actually results from it: it's not like a human engineer can go in and look at the code line by line and say, oh, I get what it's doing there. The code itself is kind of impenetrable to us, right?
Speaker 2
00:21:56 - 00:22:17
Yeah, because this program is actually very simple. In a lot of those neural nets, like a typical convolutional net, the program could fit in basically a few lines of code. Basically, a program says, take a bunch of inputs, compute a weighted sum of those inputs where the weights are those parameters that are learned, and then compare this to a threshold. If it's above a threshold, turn on the output of the neuron. If it's not, turn it off.
Speaker 2
00:22:17 - 00:22:45
And you do this with a big loop that runs over millions of neurons, billions of connections, layer after layer. Those transformer models I was talking about for text are a little more complicated, but it's basically the same principle. So the program is very simple. The knowledge of the program resides in the values of those weights, but that's not part of the program. That's part of the data, if you want.
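Taken literally, the "few lines of code" being described could look like the following sketch. The sizes and random weights are placeholders, and real nets use smooth activation functions rather than a hard threshold so that derivatives can flow; the point is that all the knowledge sits in the weight arrays, which are data, not program.

```python
import numpy as np

def run_network(x, layers):
    """layers is a list of (weights, thresholds) pairs, one pair per layer."""
    for weights, thresholds in layers:
        weighted_sums = weights @ x                      # one weighted sum per neuron
        x = (weighted_sums > thresholds).astype(float)   # on if above threshold, else off
    return x

rng = np.random.default_rng(0)
layers = [
    (rng.standard_normal((5, 8)), np.zeros(5)),  # 8 inputs  -> 5 neurons
    (rng.standard_normal((2, 5)), np.zeros(2)),  # 5 neurons -> 2 output neurons
]
print(run_network(rng.standard_normal(8), layers))
```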
Speaker 1
00:22:45 - 00:23:05
Right. So those few lines of code can generate an algorithm that may be, what, thousands, millions of lines of code, effectively, and that works. And you can show that it works, but it's basically impenetrable to humans: you can't understand, on a line-by-line basis, exactly how that machine learning algorithm is working.
Speaker 2
00:23:05 - 00:23:16
Yeah, I mean, once we build, you know, reliable self-driving cars, the way they do it would be as impenetrable as the brain of your taxi driver.
Speaker 1
00:23:16 - 00:23:19
Right. We don't understand either, but they work.
Speaker 2
00:23:20 - 00:23:20
Right.
Speaker 1
00:23:21 - 00:23:28
Talk to me, Yann, about how Facebook is using AI right now. What are some of the main ways it's using it?
Speaker 2
00:23:28 - 00:24:07
Well, so AI, machine learning, deep learning more generally, has become an essential part of the functioning of Facebook. It's true also of Google and various other companies. You take deep learning out of Facebook now, and it basically crumbles; it's built around it. Everything from ranking, newsfeed ranking, for example, ad ranking, content filtering, which is an enormously important topic, recommendation on Instagram, for example, translation, of course, and subtitling of videos.
Speaker 2
00:24:07 - 00:24:11
All of this is AI. All of this uses deep learning in some form.
Speaker 1
00:24:12 - 00:24:30
And how do you feel about the effectiveness of that? There's obviously, Facebook has been, you know, criticized for how some content is handled. Do you think that AI has played a role in that? Could it play a much bigger role in solving some of those issues?
Speaker 2
00:24:31 - 00:25:00
Right. So, I mean, AI is playing a very important role there, for the better. Without AI there would not be any possibility of, for example, hate speech filtering, you know, detecting harassment, child exploitation, terrorist propaganda. I mean, there are all kinds of really bad things that people want to post on Facebook, and you have to take them down, not post hoc, after they've been posted and thousands of people have seen them and flagged them. You'd like to take them down beforehand.
Speaker 1
00:25:00 - 00:25:11
Give us a sense of the scale there. So several billion users posting... I mean, how many words a day are posted on Facebook, for example?
Speaker 2
00:25:11 - 00:25:42
I would have to look it up. It's an absolutely gigantic number. I mean, just the number of photos posted on Facebook, just Facebook, the blue site. I'm not talking about Instagram, which is enormous, just Facebook photos. And every single one of those photos goes through convolutional neural nets that recognize the content of the photo, and that's then used to decide whether to show it to your friends. Maybe it's a picture of your sailboat, and your friend is interested in sailing, so that photo is going to be shown to that friend. But maybe it's a picture of your cat, and your friend has no interest in cats.
Speaker 1
00:25:43 - 00:26:22
A lot of people's, I guess, naive or initial model is you write something on Facebook and then it goes to people in your feed or something like that. But what's actually happening is that everything is being processed as soon as it's submitted to this vast AI system that is looking for a bunch of things. It's looking for hate speech, it's looking for child abuse, a number of other sort of things that are perceived to be really harmful. And then presumably it's also looking for, is this likely to be of interest to a lot of people? Is it going to be sticky?
Speaker 1
00:26:22 - 00:26:26
Is it going to hold people's attention and addict people to Facebook?
Speaker 2
00:26:26 - 00:26:26
That's right.
Speaker 1
00:26:26 - 00:26:27
It's looking for that as well, right?
Speaker 2
00:26:27 - 00:26:53
It's also looking: is it from a reliable source? Is it from a fake account? Is it from... So there are all kinds of things that are signals, if you want, that are used to decide whether a piece of content will be distributed or not, or will be distributed widely or not, and to whom, because you have to match this with people's interests. That's what makes the value of the service: that you show people what they're likely to be interested in.
Speaker 1
00:26:54 - 00:27:03
And there's a bunch of content where the algorithm itself can't decide what to make of it, and so it will push it out to a group of human testers or assessors, for example.
Speaker 2
00:27:03 - 00:27:24
Yeah, that's right. So there is a large number of human moderators who are there to make decisions where it's too difficult for AI to do. I mean, AI technology is not where we want it to be. It's very far from perfect. It will never be perfect in a way, because, I mean, certainly taking down adversarial content is a cat-and-mouse game, so that will never be over.
Speaker 2
00:27:24 - 00:27:40
But then to do things like: what is hate speech, right? How do you define hate speech? That's... it's not a technological question. It's more of a product design, ethics question; it's a very, very difficult issue.
Speaker 1
00:27:40 - 00:28:01
So I'm interested in talking for a bit about ways in which AI can go wrong. I mean, you've been quite eloquent in arguing that the familiar apocalyptic scenarios, where AI turns bad and turns on humanity, are really unlikely to happen. Give us that argument briefly.
Speaker 2
00:28:01 - 00:28:33
Right. So there is a bit of a fantasy, which we are conditioned into by science fiction, by Hollywood really, of AI taking over the world, robots killing humans and things like this. And it's a projection of human nature onto robots, which really is not appropriate. So we have this idea somehow that because an entity is intelligent, it will want to take over the world. It will want to make decisions.
Speaker 2
00:28:34 - 00:28:52
It will be curious. It will want freedom. Those are human characteristics. And they're human characteristics that have been built into us by evolution. And there's absolutely no reason that intelligent machines will have those, unless we explicitly build those desires into them.
Speaker 2
00:28:54 - 00:29:18
Even within the context of humanity, it's not because a person has superior intelligence that that person wants to control everybody else. In fact, in my experience as a scientist, it's kind of the opposite. The most intelligent people are people who are just interested in doing science. They don't want to have anything to do with anybody else. So that's not correlated with intelligence.
Speaker 2
00:29:19 - 00:29:37
You know, I sometimes say that the ones among us who want to become the leaders are not necessarily the smartest, and we have very good examples on the international political scene at the moment.
Speaker 1
00:29:38 - 00:30:21
But isn't there a risk, though, that if you create something of immense power, even if it has no evolutionary instinct to be evil to humans, there's just huge room for unintended consequences? I mean, the famous one is the paperclip factory: you optimize an AI to make as many paperclips as efficiently as possible, and it decides it's going to turn the world into paperclips, which seems far-fetched. But a lot of people would say, let's take Facebook now, take the current moment, where, at a first-level approximation, it seems like the goal of the AI is attention. The business model is dependent on harvesting as much attention as possible.
Speaker 1
00:30:21 - 00:30:30
So you use all the tricks of AI to figure out which are the posts that will attract most clicks, most likes, most viewing, will hold people to the screen.
Speaker 2
00:30:31 - 00:30:32
Not anymore, actually.
Speaker 1
00:30:32 - 00:30:55
You discover a year later that outrage is what works, that clustering people into sort of isolated groups where they can reinforce each other's anger and outrage is an incredibly compelling attention-garnering machine, and it's causing damage. No one intended it, but it's causing damage. Isn't that a fair critique?
Speaker 2
00:30:55 - 00:31:18
Okay, so there is a very important question there, which is a real question, and it's the problem of value alignment. So there are two kinds of intelligent systems. There are systems that are just built to do a particular task, like drive your car, or, I don't know, recognize what's in an image. And those are not autonomous AI systems. They're systems that are trained for just one task, and there's no ambiguity as to what they need to do.
Speaker 2
00:31:18 - 00:31:47
And then there is intelligent systems that have some autonomy. So basically, they're designed to optimize a particular objective, but the way they optimize it is not determined a priori. So the machine can learn to satisfy that objective in basically any way at its disposal. And if you do not design this objective properly, the machine will find loopholes in it. So it will find ways to optimize the objective without actually solving the problem you're interested in.
Speaker 2
00:31:47 - 00:32:32
Or it will actually satisfy the objective, but then you will quickly realize that this objective is not the one you actually wanted, and that you need to put guardrails in place to prevent the system from going haywire, even though it technically satisfies the objective. So this is a real problem, which is called objective or value function design, and value alignment: how you align the objective of your system with the goal that you want your system to fulfill. So let's take a very immediate example, which is Facebook, let's say. Indeed, you could design a system like Facebook to just maximize how much time people spend on it. And in fact, that's probably what Facebook was doing until a few years ago.
Speaker 2
00:32:32 - 00:32:51
And you create an incentive for, for example, clickbait companies to just post clickbait so that they attract people and make money from clicks, right? So, you know, Facebook realized this at some point and then changed the criterion, changed the objective.
Speaker 1
00:32:53 - 00:32:55
Talk about those changes. How were those criteria changed?
Speaker 2
00:32:55 - 00:33:23
So at the end of 2017, early 2018, for example, there was a big change. I mean, there was this continuous change, but there was a big change that was announced on Facebook, which is to stop basically trying to maximize the time people spend on Facebook, mostly kind of passively consuming content, and then put more emphasis on sort of meaningful interactions. So content that is mostly posted by your friends and that you interact with. You post comments. You like.
Speaker 2
00:33:24 - 00:33:43
You kind of exchange. You have kind of a meaningful exchange. Because the research showed that people end up being more satisfied with their interaction. They spend less time on Facebook, but they are more satisfied by it. Whereas when they just consume passively, it sort of satisfies them in the moment, but then they have the feeling of having wasted their time.
Speaker 2
00:33:43 - 00:34:03
So that was completely changed. It took a while; the decision to change this was made a while back, but it took until about the end of 2017 for it to be implemented. There were a number of other changes that were made, for example, to remove the economic incentive for clickbait. So the way news is shared now is that publishers cannot promote their own articles.
Speaker 2
00:34:05 - 00:34:31
You see a piece of news, like an article from the New York Times or something, because one of your friends shared it or someone you trust shared it. And so that's more organic, and the result is that it basically kills the whole idea of clickbait. So it's evolving continuously, and a lot of the ideas that people have about Facebook are ideas from maybe 2014 or something that are not relevant anymore; it's completely different now.
Speaker 1
00:34:31 - 00:35:07
Aren't there still circumstances where there's at least the theoretical possibility of a conflict between the commercial goals of the company and what an AI that was programmed with purely human well-being in mind would do? Have you run into any of those conflicts? I'm almost curious how your own goals are defined in this AI role. Are you ever bumping into the fact, I would like to do this, but it's not being implemented because there is a commercial reason why it should not be.
Speaker 2
00:35:07 - 00:35:39
I mean, there are a lot of things that Facebook decides not to do because of, you know, negative consequences on society, for example. I mean, there are organizations at Facebook that are entirely devoted to basically envisioning the societal consequences of technology in general and AI in particular. And so, I mean, this is certainly a big question, but it's something that appeared relatively recently. So those organizations are being built. I mean, there are similar organizations at Google and Microsoft.
Speaker 2
00:35:40 - 00:36:07
So yeah, I mean, those questions, people who are deploying AI and designing products are asking themselves those questions all the time. I'm a little bit outside of my comfort zone here in the sense that I do fundamental research in AI. So those questions are a little more distant, but they are, of course, intellectually and philosophically very interesting, and we are sort of concerned by it, because we feed technology into this.
Speaker 1
00:36:07 - 00:36:32
I've met many of the people at Facebook who are seeking to tackle misinformation and so forth, and it's clear that there is a huge effort underway. But the company remains controversial. I mean, as someone focused on AI, what are you working on that you're most excited about, that will actually make Facebook, and I guess the Internet, a better place?
Speaker 2
00:36:32 - 00:36:53
Right. So there's a big question, a big scientific question. You know, I was talking about what's the equivalent of aerodynamics for intelligence, right? What are the principles underlying intelligence and learning? And the reason for looking for this would be to build machines that can learn a bit like animals and humans: acquire knowledge about the world by observation.
Speaker 2
00:36:54 - 00:37:28
And perhaps this is the basis of what we call common sense. The fact that if I tell you that there's a glass on the table next to me, and the glass just flipped over, you know that what's underneath is most likely wet. Because you know how the world works, you have this kind of intuitive physics model of the world. You have an intuitive model of me also, right? You can't exactly tell whether I'm going to move my hands in this particular way, or move my head to the left or to the right, or what exact word I'm going to say.
Speaker 2
00:37:28 - 00:38:04
But you know I'm not going to turn into a frog right now, even though I'm French. So there are things you know about how the world works that are the basis of common sense. And what I'd like to build, or discover, is the recipe, the underlying techniques, that would allow a machine to learn enough about the world so that some sort of common sense will emerge. And this will be the basis for completely new AI-based systems and services, things like intelligent virtual assistants that are not frustrating to talk to, that you can hold a conversation with, and that can help you in your daily lives. You can carry them with you everywhere you go if you want.
Speaker 2
00:38:05 - 00:38:32
And they can serve as your extra memory if you want, your exocortex if you want. So that's one application. There are, of course, tons of applications in virtual reality, augmented reality, and things of that type, and things like self-driving cars that are reliable, and accelerating the progress of science. There is a huge amount of application of AI in science. Science is progressing faster because of deep learning.
Speaker 2
00:38:32 - 00:38:35
Something not many people know.
Speaker 1
00:38:35 - 00:39:16
Let's come back to common sense for a minute, because that strikes me as the key project, if it could be solved. And in a way, if I understood how you were describing self-supervised machine learning, the logical consequence, once you've really got that going, is that the end game is to develop this sort of common-sense model of the world. I mean, how far are we along that journey? You know, I'm thinking about the movie Her, where the Scarlett Johansson virtual personal assistant says to the Joaquin Phoenix character: oh, by the way, I read all your email, and wow, you're a good writer, I'm submitting your book for publication, or words to that effect.
Speaker 1
00:39:16 - 00:39:25
I mean, how far are we away from that kind of relationship with a personal assistant?
Speaker 2
00:39:25 - 00:39:49
Very far. That's the bad news. Okay, so we don't have... It's not just a question of whether we have the machines, whether we have the technology; we don't even have the science and the mathematical principles to get machines to learn like humans and animals and to acquire common sense. So once we figure this out, it may take two years, five years, 10 years, 20 years, 50 years; we don't actually know, really.
Speaker 2
00:39:49 - 00:40:14
I have a good hope that this may happen in the next 5 years. But who knows? AI researchers have a long history of being overly optimistic about the progress. So it may be more difficult than we think. But if it does happen in the foreseeable future, then this will create a new revolution in AI where machines will be considerably more intelligent and have some level of common sense.
Speaker 2
00:40:14 - 00:40:56
Think about the level of common sense of a house cat. A house cat is way more intelligent than the most intelligent machines that we have. A house cat cannot talk, but when you think about the intuitive physics model of the world that a cat has, it's incredible, right? A cat can do all kinds of things, you know, jump and walk around; there are all those videos on YouTube you can watch of cats walking around very fragile things without touching a single one of them. So, you know, there is a form of physical intelligence in there that we're not able to reproduce with machines, probably because we don't have a good recipe for self-supervised learning and similar things.
Speaker 2
00:40:56 - 00:41:02
And a cat only has about 800 million neurons in the brain. It's, you know, a lot smaller than...
Speaker 1
00:41:02 - 00:41:09
So it's all in the architecture, it's all in the software. We've got plenty of hardware, we just need to organize it the right way.
Speaker 2
00:41:09 - 00:41:17
No, it's not just the hardware, it's not the details, there is an underlying principle that we don't understand that we need to discover.
Speaker 1
00:41:17 - 00:42:03
But if your work has shown anything, it's shown that you can have nothing happen for a long, long time, but if you then get the architecture right, suddenly, boom, amazing things can happen. Just from the basic structure of deep learning, you know, within a few years it's been applied to every industry in the most spectacular ways. And so I guess, given how much computing power there is, given how cheap it's getting, once something like self-supervised learning really takes off, isn't it possible that we go from nothing happening here to "oh my God, that is breathtaking" in a shockingly short period of time? That it's likely to be a curve that looks like that?
Speaker 2
00:42:03 - 00:42:26
Absolutely, but we need a conceptual breakthrough, and we have no idea when that conceptual breakthrough will occur. And I may be wrong about it. It may be that, you know, we have a brick wall in front of us, and we're trying to kind of pierce through it, and we don't know how many walls there are behind it. So maybe that first wall is not actually going to unlock all the things.
Speaker 2
00:42:26 - 00:42:58
But I'm pretty hopeful that it will. I mean, very often for a revolution like this to occur, a lot of different things have to happen at the same time. So neural nets have been around for a long time, since the 50s. Then multi-layer neural nets with back propagation have been around since the 80s. But then it required the internet, basically, so that we could collect lots of data and then GPUs, okay, so basically the gaming industry, so that we could have powerful, cheap computers that we could run neural nets on, large neural nets on.
Speaker 2
00:42:58 - 00:43:17
And then the combination of back propagation, a few new algorithms, big data sets, and GPUs is what basically started the new revolution of AI. You know, it's the same for aviation. People had the idea of airplanes for a very long time before it was technologically practical to actually build airplanes.
Speaker 1
00:43:17 - 00:43:25
Are there any surprises about to be unveiled on Facebook that are built on AI that we could look forward to?
Speaker 2
00:43:26 - 00:43:52
Oh, wow, OK. I mean, for people who are very connected to the field, it's probably not very surprising. But yes, I mean, the degree to which natural language understanding, multilingual natural language understanding, and translation are used at Facebook is gigantic. Image recognition and video interpretation are also used on a very, very large scale. There's also something called similarity search.
Speaker 2
00:43:52 - 00:44:14
So you want to be able to tell if a photo is similar to another one, or if a video is similar to another one. It's very important because it's good for recommendation. For example, you're on Instagram, you're browsing through pictures, and you want to be shown pictures that are similar somehow. But it's also very important for the integrity of the data as well.
Speaker 2
00:44:14 - 00:44:44
Someone posts a terrorist propaganda video, and then all kinds of groups can repost that video, but they modify it a little bit, because they know that it's easy for an online service to detect an exact copy of a video. So now you have to basically detect similar versions of the same video that have been transformed so that they are somewhat dissimilar. And you have to do this for billions and billions of items that are posted on Facebook every day.
Speaker 2
00:44:44 - 00:44:47
So it's very, very challenging technologically.
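A small sketch of the similarity-search idea, purely illustrative: the embeddings below are random stand-ins for what a neural net would produce for each photo or video, and at that scale the brute-force comparison would be replaced by an approximate nearest-neighbor index. The point is that a slightly edited repost still lands close to the original in the embedding space, so it can be found by comparing vectors rather than exact file copies.

```python
import numpy as np

rng = np.random.default_rng(0)
catalog = rng.standard_normal((10_000, 128))               # embeddings of known items
catalog /= np.linalg.norm(catalog, axis=1, keepdims=True)  # normalize for cosine similarity

flagged = catalog[42]                                      # embedding of a known bad video
repost = flagged + 0.02 * rng.standard_normal(128)         # slightly edited copy of it
repost /= np.linalg.norm(repost)

scores = catalog @ repost                                  # cosine similarity to every item
best = int(np.argmax(scores))
print(best, round(float(scores[best]), 3))                 # 42, with a score close to 1.0
```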
Speaker 1
00:44:49 - 00:45:07
Does Facebook have any vision of a world where you can befriend anyone in the world and basically speak to them in real time, in whatever language, and Facebook will look after the translation layer? Is that coming soon?
Speaker 2
00:45:08 - 00:45:35
Yeah, that's a vision for which, you know, the technology is almost available. And, I mean, certainly Facebook has services through Portal, for example, which is kind of a physical video conferencing system, and Messenger and WhatsApp. And eventually those systems will have real-time simultaneous translation. I don't think the translation is good enough now for it to be non-frustrating, but it's going to come pretty soon, I think.
Speaker 1
00:45:35 - 00:46:06
Is it realistic that any time in the reasonably near future, the AI analysis of written content will be good enough to know when something is being posted that is just misleading, or weaponized, or obnoxious, or dangerous in some way, so that it's possible to scale the kind of warnings that I guess most people would want to see on a platform as powerful as Facebook?
Speaker 2
00:46:06 - 00:46:35
Yeah, well, so, you know, it's a very complicated question, both from the technical point of view and from the sort of policy or product design point of view, content policy. So content policy is a very difficult question. Like, should you suppress speech of a particular kind? How do you define, where do you put, the boundary between speech that should be suppressed and speech that is fine? And at what point do you start basically practicing censorship?
Speaker 2
00:46:36 - 00:47:06
Is it the role of Facebook to be the arbiter of truth, for example? Facebook so far has been extremely careful not to get onto the slippery slope of establishing itself as the arbiter of truth. So misinformation that is directly dangerous to people certainly needs to be taken down. And that is done by Facebook. When there are public health issues, for example, an anti-vax page that has a lot of followers is dangerous to public health.
Speaker 2
00:47:06 - 00:47:23
So that is taken down. But then where do you put the limit? People tell all kinds of false stories on Facebook, and they should be able to do so. I mean, poetry is one of those, right? So it's more a question of policy, I think, than a question of technology, really.
Speaker 2
00:47:23 - 00:47:26
But technology certainly is not where we want it to be.
Speaker 1
00:47:26 - 00:48:10
The issue is often framed as what should be censored and what shouldn't, which is a polarizing way of framing it. Because the real question is not whether something should be censored, but whether it should be amplified; the real question is how many people get to see that dangerous information. So you could picture a scenario where it would be a very extreme case where you'd actually censor something, but as a matter of routine, you would not amplify the stuff that, even though it was attention-attracting, was dangerous or obnoxious or damaging to civil society. That, I think, is the kind of AI that people would love to see: sort of, you know what, there could be a problem here, let's not amplify it.
Speaker 2
00:48:10 - 00:48:39
Right, so there are content policies that are clear, where a piece of content that is clearly hate speech is just taken down. And then there are other policies that are, you know, is this source of information reliable, and things like this. And the content will be promoted more or less depending on things like the reliability of the source and the character of the message being transmitted. So the question is whether hurtful messages are amplified or not. I personally don't think they're amplified.
Speaker 2
00:48:39 - 00:49:16
There are certainly no systems in Facebook that attempt to amplify controversial statements. But what I'm seeing is that everyone on Facebook, on both sides of the political spectrum or of an issue, is absolutely convinced that Facebook favors the other side. And it's true in both directions. And so because Facebook could have the power to shut down your opponents, you might want to accuse Facebook of not helping you enough, right? But then it's a slippery slope.
Speaker 2
00:49:17 - 00:49:42
Is it a role that Facebook should play, knowing that Facebook has such a large footprint? Do you want a single large private American company to be the arbiter of truth? And my answer to this is no. Mark Zuckerberg's answer to this is no. And I think everybody's answer should be no, because you don't want a single large entity with a huge footprint to basically determine what is truth, in particular what is political truth.
Speaker 2
00:49:42 - 00:49:56
This would be the role of a highly diverse and independent press, and interest groups, advocacy groups, individuals; Facebook is basically a platform for them. So, um...
Speaker 1
00:49:56 - 00:50:37
But many people would argue that there might be a way for the company to, you know, use the power of crowd wisdom, perhaps, to more thoughtfully address issues where there is a clear reason why you'd think that certain content might be damaging. Like I say, it's not about censorship, it's about trying to use a broad... trying to tap into the power of crowd wisdom to dial stuff up or down. It's a much longer conversation, and I appreciate that it's a hard conversation to have, almost, in your role and in general. So I'm going to ask you one other question.
Speaker 2
00:50:37 - 00:50:43
Yeah, it's also a conversation for another person from Facebook, because I don't do policy at Facebook. You know, what I'm telling you now is my personal opinion.
Speaker 1
00:50:43 - 00:50:46
If we could persuade that person to come on to TED, we'd have it in our pocket.
Speaker 2
00:50:46 - 00:51:01
You know, it's my personal opinion, from the inside, from knowing how this functions. And people have a lot of wrong ideas about Facebook's motivations and how it operates. But, you know, it's a complicated story. Yeah.
Speaker 1
00:51:02 - 00:51:13
So let me ask this as a final question. Just tell us a story about the future, a possible future, that makes you feel hopeful, could make us feel hopeful.
Speaker 2
00:51:14 - 00:51:53
So, as I said, AI, deep learning, is used in science more and more, and it's going to accelerate the progress of science, which includes things like medicine, et cetera. So there are good systems now to predict the conformation of proteins, based on deep learning, systems that can predict whether a protein is going to stick to another one. Maybe that's going to be a way of designing monoclonal antibodies for things like COVID-19 or other diseases. Lots of new diagnosis systems based on AI. Medical imaging is a huge success for convolutional nets in particular, and deep learning more generally.
Speaker 2
00:51:53 - 00:52:37
So this is starting to be deployed everywhere. Facebook actually has worked on a project in collaboration with NYU on accelerating the data collection for MRIs to collect data for, I don't know, knee replacement surgery or something like that. And so that makes the whole thing faster, cheaper, which means it's going to be more available. And so I think in medicine, in safety, so driving assistance for example, we may not have autonomous cars yet, but we do have what's called EABS, emergency automatic braking systems. And those are systems that are now built into most new cars that come out.
Speaker 2
00:52:37 - 00:52:55
So I think pretty much every car in Europe has to have those, even low-end cars. And those systems, it's basically accomplished, you know, with a camera that looks out the window, and it's going to brake if a pedestrian happens to cross the street in front of you and you didn't pay attention. And those things reduce collisions by 40%.
Speaker 2
00:52:57 - 00:53:24
It's a very large number. And so AI is going to save lives, right? And then: are we going to have robots to take care of everything in our house? Are you going to have virtual assistants of the type that you see in the movie Her, Spike Jonze's movie, et cetera, that can help you in your daily lives? Possibly. That may take a long time, but possibly. Better telepresence systems, things like that.
Speaker 2
00:53:24 - 00:53:53
People are just going to be more connected. People are going to have better access to information. Basically, we'll have AI assistants to kind of manage the enormous flow of information that we are bombarded with every day. In some ways, the, you know, Google search engine and Facebook's ranking system are basically of that type. They select information that those systems think are relevant to you, and they're trained, they're learning systems, right?
Speaker 2
00:53:53 - 00:53:56
Those systems are trained by your tastes, essentially.
Speaker 1
00:53:57 - 00:54:31
Mm. Well, it's always encouraging to hear that someone who's working deeply in AI remains convinced that, net net, it's going to have a positive impact on the world. Yeah, there's this idea, Yann, of the adjacent possible, where some inventions expand the possibility of what humanity can achieve. And it really seems to me like your work has made a massive increase in the size of the adjacent possible. You know, however the future turns out, what you've done has been very consequential.
Speaker 1
00:54:31 - 00:54:38
So thank you for what you're doing and for sharing so openly with us how you're thinking. Thank you for this conversation.
Speaker 2
00:54:38 - 00:54:59
Well, thank you so much. I'm a believer in technology. Technology can be used for good and bad. I don't see myself as having the legitimacy to decide for society whether a particular piece of technology should be used for one thing or another. I think it's the role of the democratic process to do this for us.
Speaker 1
00:54:59 - 00:54:59
Thank you. Thank you, Yann.