Geoffrey Hinton is one of the creators of the concept of deep learning, a winner of the 2019 Turing Award, and an engineer at Google. Last week, during the I/O developer conference, Wired interviewed him and discussed his fascination with the brain and the possibility of modeling computers on the brain's neural structure. For a long time these ideas were considered foolish. A fascinating conversation about consciousness, Hinton's future plans, and whether computers can be taught to dream.
What will happen to neural networks?
Let's start with the time when you wrote your very first, very influential papers. Everyone said, "It's a clever idea, but we won't actually be able to design computers this way." Explain why you persisted and why you were so sure you had found something important.
It seemed to me that there was no other way the brain could work. It has to work by learning the strengths of connections. And if you want to make a device do something clever, you have two options: you either program it, or it learns. And nobody programmed people, so we had to learn. This had to be the right way.
Explain what neural networks are. Explain the original conception.
You take relatively simple processing elements that very loosely resemble neurons. They have incoming connections, each connection has a weight, and that weight can change during training. What a neuron does is take the activities on the connections multiplied by the weights, sum them up, and then decide whether to send an output. If the sum is large enough, it produces an output; if the sum is negative, it sends nothing. That's all. You just need to wire up a cloud of such neurons with weights and figure out how to change those weights, and then they'll do anything. The only question is how you change the weights.
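A minimal sketch of the processing element Hinton describes; the particular inputs, weights, and zero threshold here are illustrative assumptions, not anything from the interview:

```python
import numpy as np

def neuron(inputs, weights, threshold=0.0):
    """Multiply each incoming connection by its weight, sum, and decide whether to fire."""
    total = float(np.dot(inputs, weights))
    return 1 if total > threshold else 0

x = np.array([1.0, 0.0, 1.0])    # incoming activity
w = np.array([0.6, -0.4, 0.3])   # weights, which change during training
out = neuron(x, w)               # 0.6*1 + 0.3*1 = 0.9 > 0, so the neuron fires
```

Changing `w` is the entire learning problem the rest of the conversation is about.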
When did you realize that this is roughly how the brain works?
Oh, it was designed that way from the very start. It was designed to resemble the brain in how it works.
So at a certain point in your career you began to understand how the brain works. Maybe you were twelve, maybe twenty-five. When did you decide to try modeling computers on the brain?
Yes, right away. That was the whole point. The whole idea was to create a learning device that learns the way the brain does, according to people's ideas of how the brain learns, by changing the strengths of connections. And it wasn't my idea; Turing had the same idea. Although Turing invented a huge part of the foundations of standard computer science, he believed the brain was an unorganized device with random weights that used reinforcement learning to change its connections, and so it could learn anything. And he believed that this was the best path to intelligence.
And you followed Turing's idea that the best way to build a machine is to design it like the human brain: the human brain works this way, so let's build a similar machine.
Yes, it wasn't only Turing who thought so. Many people thought so.
When did the dark times come? When did the other people who had worked on this and believed Turing's idea was right start backing off, while you kept pressing your line?
There was always a handful of people who kept believing in spite of everything, especially in psychology. But among computer scientists, I think, what happened in the 90s was that the data sets were quite small and computers weren't that fast. And on small data sets other methods, in particular support vector machines, worked a little better. They weren't as troubled by noise. So it was all very sad, because in the 80s we had developed backpropagation [the method of propagating errors backwards, which is very important for neural networks]. We thought it would solve everything. And we were puzzled that it didn't solve anything. The question was really one of scale, but we didn't know that then.
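Backpropagation itself fits in a few lines. A toy sketch of the idea (the architecture, learning rate, and iteration count are arbitrary choices for illustration), training a one-hidden-layer network on XOR by passing the output error backwards through the layers:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# XOR cannot be solved without a hidden layer, so the error must be
# propagated back through that layer: exactly what backpropagation does.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)

def forward():
    h = sigmoid(X @ W1 + b1)
    return h, sigmoid(h @ W2 + b2)

h, p = forward()
loss_before = float(((p - y) ** 2).mean())

for _ in range(5000):
    h, p = forward()
    d_out = (p - y) * p * (1 - p)          # error at the output layer
    d_hid = (d_out @ W2.T) * h * (1 - h)   # error propagated back to the hidden layer
    W2 -= 0.5 * h.T @ d_out; b2 -= 0.5 * d_out.sum(0)
    W1 -= 0.5 * X.T @ d_hid; b1 -= 0.5 * d_hid.sum(0)

_, p = forward()
loss_after = float(((p - y) ** 2).mean())
```

On a problem this small the method works fine; the interview's point is that in the 90s it was unclear why it failed to deliver on larger problems.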
Why did you think it wasn't working?
We thought it wasn't working because we didn't have quite the right algorithms or quite the right objective functions. For a long time I thought it was because we were trying to do supervised learning, where you label the data, and we should have been doing unsupervised learning, where the learning happens on unlabeled data. It turned out the question was mostly one of scale.
That's interesting. So the problem was that you had too little data. You thought you had the right amount of data but were labeling it wrong. So you simply misdiagnosed the problem?
I thought the mistake was that we were using labels at all. Most of your learning happens without any labels; you're just trying to model the structure in the data. I actually still think so. I think that as computers get faster, for any data set of a given size, if the computer is fast enough it's better to do unsupervised learning. And once you've done the unsupervised learning, you can learn from fewer labels.
So in the 1990s you're continuing your research, you're in academia, you're still publishing, but you're not solving big problems. Was there ever a moment when you said, "You know, enough of this. I'll go try something else"? Or did you simply tell yourself you would keep doing deep learning [that is, training deep neural networks]?
Yes. Something like this had to work. I mean, the connections in the brain learn somehow; we just have to figure out exactly how. And perhaps there are many different ways of strengthening connections during learning; the brain uses one of them, and there may be others. But you definitely need something that can strengthen those connections during learning. I never doubted that.
You never doubted it. When did it start to feel like it was working?
One of the biggest disappointments of the 80s was that if we made networks with many hidden layers, we couldn't train them. That's not entirely true, because you could train relatively simple tasks like handwriting recognition. But we didn't know how to train most deep neural networks. And around 2005 I came up with a way to train deep networks without supervision. You take your input, say pixels, and train several feature detectors that are simply good at explaining why the pixels are the way they are. Then you treat those feature detectors as data and train another set of feature detectors on top, so you can explain why those feature detectors have the correlations they do. You keep training layer after layer. But the most interesting thing was that you could work through the math and prove that each time you train a new layer, you don't necessarily get a better model of the data, but you do get a bound on how good your model is, and that bound got better with every layer you added.
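Hinton's 2005 procedure stacked restricted Boltzmann machines; as a rough sketch of just the layer-by-layer recipe, here is the same idea with tiny autoencoders standing in for the feature-detector layers (the substitution, sizes, and rates are all assumptions for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_layer(data, n_hidden, lr=0.1, epochs=300):
    """Train one layer of feature detectors, unsupervised, to explain its input."""
    n_vis = data.shape[1]
    W_enc = rng.normal(0, 0.1, (n_vis, n_hidden))
    W_dec = rng.normal(0, 0.1, (n_hidden, n_vis))
    for _ in range(epochs):
        h = sigmoid(data @ W_enc)      # detect features
        recon = h @ W_dec              # try to explain the input with them
        err = (recon - data) / len(data)
        g_dec = h.T @ err
        g_enc = data.T @ ((err @ W_dec.T) * h * (1 - h))
        W_dec -= lr * g_dec
        W_enc -= lr * g_enc
    return W_enc

# The features learned by one layer become the "data" for the next layer.
X = rng.random((100, 16))
codes, stack = X, []
for width in (12, 8, 4):
    W = train_layer(codes, width)
    stack.append(W)
    codes = sigmoid(codes @ W)
```

Each call to `train_layer` sees only the layer below it; no labels are used anywhere in the stack.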
What do you mean by a bound on how good your model is?
Once you have a model, you can ask, "How surprising does this model find this data?" You show it the data and ask, "Is this the kind of thing you expected, or is it surprising?" And that can be measured. What you'd like is a model, a good model, that looks at the data and says, "Yeah, yeah, I knew it. No surprise." It's always very hard to compute exactly how surprising a model finds the data, but you can compute a bound on it. You can say the model finds this data less surprising than that. And it could be shown that as you add layers of feature detectors, you get a model, and with each added layer the bound on how surprising it finds the data gets better.
So around 2005 you made this mathematical breakthrough. When did you start getting the right answers? What data were you working with? Your first breakthrough was with speech data, right?
It was just handwritten digits. Very simple. And at about the same time the development of GPUs (graphics processors) got going, and people working on neural networks started using GPUs around 2007. I had a very good student who started using GPUs to find roads in aerial photographs. He wrote code that was then picked up by other students, who used GPUs to recognize phonemes in speech. They used this pretraining idea, and once the pretraining was done, they just stuck labels on top and used backpropagation. It turned out you could build a very deep network pretrained this way, and then backpropagation could be applied, and it actually worked. In speech recognition it worked brilliantly. At first, though, it wasn't much better.
Was it better than the best commercially available speech recognition? Did it beat the best academic work on speech recognition?
On a relatively small data set called TIMIT it was slightly better than the best academic work. IBM had also done a lot of work there.
Very quickly, people realized that this thing, since it was beating standard models that had taken 30 years to develop, would work really well with a bit more development. My graduate students went to Microsoft, IBM, and Google, and Google was the quickest to turn it into a production speech recognizer. By 2012 that work, done in 2009, was on Android, and Android suddenly became much better at recognizing speech.
Tell me about the moment when you, having held on to these ideas for 40 years and published on the subject for 20, suddenly surpass your colleagues. What does that feel like?
Well, at that point I had only been holding on to these ideas for 30 years!
It was a wonderful feeling that all of it had finally become real.
Do you remember the first time you saw data showing that?
Okay. So you can see that it works for speech recognition. When did you start applying neural networks to other problems?
At first we started applying them to all sorts of other problems. George Dahl, with whom we had originally worked on speech recognition, applied them to predicting whether a molecule could bind to something and become a good drug. And there was a competition. He simply applied our standard speech recognition technology to predicting drug activity and won the competition. That was a sign that we were doing something very universal. Then a student showed up who said, "You know, Geoff, this thing is going to work for image recognition, and Fei-Fei Li has created a suitable data set for it. There's a public competition; let's do something."
We got results that far surpassed standard computer vision. That was 2012.
So those are three areas where you succeeded: modeling chemicals, speech, vision. Where did you fail?
You understand that the failures are only temporary?
Well, what distinguishes the areas where it works fastest from the areas where more time is needed? It seems that visual processing, speech recognition, the sort of basic human things we do with our sensory perception, are considered the first barriers to clear, right?
Yes and no, because there are other things we do well, such as motor control. We are very good at controlling our motor skills; our brains are clearly adapted for it. And only now are neural networks starting to compete with the best other technologies there. They will win in the end, but right now they're only just starting to win.
I think reasoning, abstract reasoning, is the last thing we learn to do, and I think it will be among the last things these neural networks learn to do.
And yet you keep saying that neural networks will eventually win at everything.
Well, we are neural networks. Anything we can do, they can do.
True, but the human brain is far from the most efficient computing machine ever built.
Definitely not my human brain!

Couldn't there be a way to model machines that is much more efficient than the human brain?
Philosophically, I have no objection to the idea that there could be some completely different way of doing all this. It might be that if you start with logic, try to automate logic, build some fancy theorem prover, do reasoning, and then decide that reasoning is how you get to visual perception, that approach could win. But it hasn't yet. I have no philosophical objection to such a victory. We just know the brain can do it.
But there are things our brains don't do well. Does that mean neural networks won't be able to do them well either?
Quite possibly, yes.
And there's a separate problem, which is that we don't fully understand how neural networks work, right?
Yes, we really do not understand how they work.
We don't understand how neural networks work from the top down. That's a central element of how neural networks operate that we don't understand. Explain that, and then let me ask the obvious follow-up: if we don't know how these things work, how can they work at all?
If you look at modern computer vision systems, most of them are basically feed-forward; they don't use feedback connections. And there's something else about modern computer vision systems: they're very prone to adversarial errors. You can slightly change a few pixels, and something that was a picture of a panda, and still looks exactly like a panda to you, suddenly becomes an ostrich to the neural network. Obviously, the way the pixels are changed is cleverly designed to fool the network into thinking it's an ostrich. But the point is that to you it's still a panda.
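The effect can be illustrated in a toy setting with a linear "classifier" standing in for a deep network (the random weights and image are assumptions; "panda" and "ostrich" are just names for the two classes):

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

w = rng.normal(0, 1, 64)   # a toy linear classifier over 64 "pixels"
x = rng.random(64)         # the "panda" image
logit = float(w @ x)
panda = int(sigmoid(logit) > 0.5)

# Nudge every pixel by the same tiny amount, each in the direction
# that most hurts the current decision (the sign of the gradient).
eps = (abs(logit) + 0.1) / np.abs(w).sum()  # just enough to cross the boundary
x_adv = x - np.sign(logit) * eps * np.sign(w)
ostrich = int(sigmoid(float(w @ x_adv)) > 0.5)

per_pixel_change = float(np.abs(x_adv - x).max())  # a small fraction of the pixel range
```

Because the nudge is spread over every pixel, each individual change stays tiny, yet the decision flips; real attacks on deep networks exploit the same gradient information.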
At first we thought these things worked great. But then, confronted with the fact that they look at a panda and are confident it's an ostrich, we got worried. And I think part of the problem is that they aren't trying to reconstruct from the high-level representations. They're trained discriminatively: only the layers of feature detectors get trained, and the whole objective is to change the weights so they get better at finding the right answer. Recently in Toronto we discovered, or rather Nick Frosst discovered, that if you add reconstruction, robustness to adversarial error increases. So I think that in human vision reconstruction is used for learning, and because we learn so much by doing reconstruction, we're much more resistant to adversarial attacks.
You think that top-down communication in a neural network lets you check how something reconstructs. You check and make sure it's a panda, not an ostrich.
I think this is important, yes.
But brain scientists don't entirely agree with that?
Brain scientists all agree that if you have two areas of cortex in a perceptual pathway, there will always be backward connections. What they argue about is what they're for. They might be needed for attention, for learning, or for reconstruction. Or for all three.
So we don't know what the feedback is for. You're building your new neural networks on the assumption that... no, not even that: you're building in feedback because your networks need it for reconstruction, even though you don't actually understand how the brain works?
Isn't that cheating? I mean, if you're trying to do something like the brain, but you're not sure the brain actually does it?
Not really. I don't do computational neuroscience. I'm not trying to build a model of how the brain works. I look at the brain and say, "This thing works, and if we want to make something else that works, we should look to it for inspiration." So this is neuro-inspired, not a neural model. The whole model, the neurons we use, are inspired by the fact that neurons have lots of connections and that they change their strengths.
That's interesting. If I were a computer scientist working on neural networks and wanted to outdo Geoff Hinton, one option would be to build in top-down communication and base it on other models from brain science. Base it on learning, not on reconstruction.
If they turned out to be better models, you'd win. Yes.
That's very, very interesting. Let's turn to a more general topic. So neural networks can solve all kinds of problems. Are there any puzzles of the human brain that neural networks can't or won't capture? For example, emotions?
So love can be reconstructed by a neural network? Consciousness can be reconstructed?
Absolutely. Once you figure out what those things mean. We are neural networks, right? Consciousness is a particularly interesting topic for me. But... people don't really know what they mean by the word. There are lots of different definitions, and I think it's a rather pre-scientific term. If you'd asked people 100 years ago what life is, they'd have answered, "Well, living things have a life force, and when they die, the life force leaves them. That's the difference between being alive and being dead: either you have the life force or you don't." Now we don't have a life force; we think that concept is pre-scientific. Once you understand a bit of biochemistry and molecular biology, you no longer need a life force; you understand how it all actually works. And the same thing, I think, will happen with consciousness. I think consciousness is an attempt to explain mental phenomena by means of a special essence. And that essence isn't needed. Once you can really explain it, you'll explain how we do everything that makes people conscious beings, explain the different meanings of consciousness, without invoking any special essences.
So there's no emotion that couldn't be created? No thought that couldn't be created? There's nothing the human mind can do that couldn't, in theory, be recreated by a fully functioning neural network once we actually understand how the brain works?
John Lennon sang something similar in one of his songs.
Are you 100% sure of this?
No, I'm a Bayesian, so I'm 99.9% sure.
Well, what is that remaining 0.1% then?
Well, we could, for example, all be part of a big simulation.
Fair enough. So what are we learning about the brain from our work on computers?
Well, I think what we've learned in the last 10 years is this: if you take a system with billions of parameters and an objective function (say, filling a gap in a string of words), it works much better than it has any right to. It works much better than you would expect. You would have thought, and most people in traditional AI research would have thought, that if you take a system with a billion parameters, start it from random values, measure the gradient of the objective function, and then adjust the parameters to improve the objective, it would be a hopeless algorithm that inevitably gets stuck. But no, it turns out this is a really good algorithm, and the bigger the scale, the better it works. And this discovery was essentially empirical. There was some theory behind it, of course, but the discovery was empirical. And now that we've discovered this, it seems far more plausible that the brain computes the gradient of some objective function and updates its weights, the strengths of its synaptic connections, to follow that gradient. We just have to figure out what that objective function is and how the brain follows its gradient.
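The "random start plus gradient following" recipe can be seen even in a toy setting; here a random least-squares objective stands in for the real thing (the sizes, step size, and objective are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# Many parameters, a random starting point, one objective function:
# just measure the gradient and follow it downhill.
n_params = 10_000
A = rng.normal(0, 1, (200, n_params)) / np.sqrt(n_params)
target = rng.normal(0, 1, 200)
theta = rng.normal(0, 1, n_params)   # random initial values

def objective(t):
    r = A @ t - target
    return float(r @ r) / len(r)

history = [objective(theta)]
for _ in range(300):
    grad = 2 * A.T @ (A @ theta - target) / len(target)
    theta -= 20.0 * grad
    history.append(objective(theta))
# The objective keeps improving instead of getting hopelessly stuck.
```

Of course, a convex toy problem cannot settle the question for deep networks; the surprise Hinton describes is that the same recipe keeps working on wildly non-convex objectives at scale.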
But we didn't figure that out by looking at the brain? We didn't figure out the weight updates?
It was a theory. Long ago people thought it was a possibility. But in the background there were always computer scientists saying, "Yes, but the idea that everything is random and learning happens by gradient descent won't work with a billion parameters; you'd have to wire in a lot of knowledge." Now we know that's wrong. You can just put in random parameters and learn everything.
Let's dig in a little. As we learn more and more, we'll presumably keep learning more and more about how the human brain works, as we run massive tests of models based on our ideas of brain function. Once we understand it better, will there come a moment when we essentially rewire our brains to become much more efficient machines?
If we really understand what's going on, we'll be able to improve things like education. And I think we will. It would be very strange to finally understand what's happening in your brain and how it learns, and not adapt so as to learn better.
How do you think we'll use what we've learned about the brain and about how deep learning works to change education in a couple of years? How would you change the classroom?
I'm not sure we'll learn that much in a couple of years; I think changing education will take longer. But, that said, [digital] assistants are getting pretty smart. And once assistants can understand conversations, they can talk with children and educate them.
And in theory, if we understand the brain better, we could program assistants to have better conversations with children, based on what the children have already learned.
Yes, but I haven't thought much about it. It's not what I do. But it all seems quite plausible.
Can we understand how dreams work?
Yes, I'm very interested in dreams. So interested that I have at least four different theories of them.
Tell us about them: the first, the second, the third, the fourth.
A long time ago there were things called Hopfield networks, which stored memories as local attractors. Hopfield discovered that if you tried to put in too many memories, they got confused: the network would take two local attractors and merge them into a single attractor somewhere halfway between them.
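A minimal Hopfield network in code (the patterns and size are chosen purely for illustration): memories are stored by strengthening connections between co-active units, and a corrupted input settles back into the nearest stored attractor.

```python
import numpy as np

# Hebbian storage: each memory strengthens connections between co-active units.
patterns = np.array([
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1, 1, 1, 1, -1, -1, -1, -1],
])
n = patterns.shape[1]
W = np.zeros((n, n))
for p in patterns:
    W += np.outer(p, p)
np.fill_diagonal(W, 0)

def recall(state, steps=10):
    """Let the network settle into the nearest attractor."""
    s = state.copy()
    for _ in range(steps):
        for i in range(n):               # asynchronous unit updates
            s[i] = 1 if W[i] @ s >= 0 else -1
    return s

# A corrupted version of the first memory falls back into its attractor.
noisy = patterns[0].copy()
noisy[0] = -noisy[0]
restored = recall(noisy)
```

Pack in too many patterns relative to the number of units and these attractors start to merge, which is exactly the failure Hinton describes next.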
Then Francis Crick and Graeme Mitchison came along and said we could get rid of these spurious minima by unlearning (that is, forgetting what we've learned). We turn off the input, put the network into a random state, let it settle, say that state is bad, change the connections so it doesn't settle into that state, and in this way the network can store more memories.
Then Terry Sejnowski and I came along and said, "Look, if we have not only the neurons that store the memories but a bunch of other neurons too, can we find an algorithm that uses all those other neurons to help restore memories?" In the end we came up with the Boltzmann machine learning algorithm. And the Boltzmann machine learning algorithm had an extremely interesting property: I show it data, and it sort of rattles through the rest of the units until it settles into a very happy state, and then it increases the strengths of all the connections based on whether two units are active at the same time.
You also have to have a phase in which you turn off the input, let the algorithm "settle down" into a state it's happy with, so that it fantasizes, and once it has a fantasy, you say, "Take all the pairs of neurons that are active and decrease the strengths of their connections."
I'm explaining the algorithm to you as a procedure. But in reality the algorithm is the product of mathematics and of the question, "How should these chains of connections be changed so that this neural network, with all its hidden units, finds the data unsurprising?" And there has to be this other phase, which we call the negative phase, when the network runs with no input and unlearns whatever state it settles into.
We dream for many hours every night. And if you wake someone at a random moment, they can say what they were just dreaming, because the dream is held in short-term memory. So we know we dream for many hours, but in the morning, after waking, we can remember only the last dream, and we don't remember the others, which is very lucky, because we might mistake them for reality. So why don't we remember our dreams at all? According to Crick, this is the whole point of dreams: to unlearn those things. You learn in reverse, as it were.
Terry Sejnowski and I showed that this is actually the maximum-likelihood learning procedure for Boltzmann machines. So that's the first theory of dreams.
I want to get to your other theories. But first a question: have you actually managed to get any of your deep learning algorithms to dream?
Some of the first algorithms that could learn to work with hidden units were Boltzmann machines. They were extremely inefficient. But later I found a way of working with approximations that turned out to be efficient, and that is actually what triggered the revival of deep learning. These were things that trained one layer of feature detectors at a time, and that was the efficient form of the restricted Boltzmann machine. So it was doing this kind of unlearning too. But instead of going to sleep, it would just fantasize a little after each presentation of the data.
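A sketch of that idea: a restricted Boltzmann machine updated with a single reconstruction "fantasy" after each data presentation (one-step contrastive divergence). The toy data, sizes, learning rate, and the mean-field shortcut of using probabilities instead of stochastic samples are all simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Two prototype patterns the feature detectors should learn to model.
data = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 10, dtype=float)

n_vis, n_hid = 6, 3
W = rng.normal(0, 0.1, (n_vis, n_hid))

def recon_error():
    h = sigmoid(data @ W)
    return float(np.abs(sigmoid(h @ W.T) - data).mean())

err_before = recon_error()
for _ in range(200):
    ph = sigmoid(data @ W)    # positive phase: feature detectors on real data
    pv = sigmoid(ph @ W.T)    # the brief fantasy: one reconstruction step
    qh = sigmoid(pv @ W)
    # Strengthen correlations seen with the data, weaken those seen in the fantasy.
    W += 0.2 * (data.T @ ph - pv.T @ qh) / len(data)
err_after = recon_error()
```

The full Boltzmann machine would run its negative phase to equilibrium, the "sleep" in the analogy; the single reconstruction step is the cheap approximation Hinton is referring to.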
Well, so androids really do dream of electric sheep. Let's move on to theories two, three, and four.
Theory two is called the wake-sleep algorithm. You want to train a generative model. So you have the idea of building a model that can generate data, that has layers of feature detectors and activates higher and lower layers, and so on, down to activating the pixels, actually creating an image. But you'd also like to teach it the other direction: you'd like it to recognize data.
So you have an algorithm with two phases. In the wake phase, data comes in, the algorithm tries to recognize it, and instead of training the connections it uses for recognition, it trains the generative connections. Data comes in, I activate the hidden units, and then I teach those hidden units to reconstruct the data. It learns to reconstruct at every layer. But the question is, how do you learn the forward connections? The idea is that if you knew the forward connections, you could learn the backward connections, because you could learn to reconstruct.
And it also turns out that if you use the backward connections, you can learn the forward connections, because you can just start at the top and generate some data. And since you're generating the data, you know the states of all the hidden layers, so you can train the forward connections to recover those states. And that's what happens: if you start with random connections and alternate between the two phases, it works. To make it work well you have to try all sorts of variations, but it works.
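A structural sketch of the two phases with a single hidden layer (the sizes, rates, and toy random data are assumptions; real wake-sleep stacks many layers of stochastic units):

```python
import numpy as np

rng = np.random.default_rng(5)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

n_vis, n_hid, lr = 8, 4, 0.05
R = rng.normal(0, 0.1, (n_vis, n_hid))   # recognition (forward) connections
G = rng.normal(0, 0.1, (n_hid, n_vis))   # generative (backward) connections
g_bias = np.zeros(n_hid)                 # generative prior on the hidden units

data = sample(np.full((30, n_vis), 0.5)) # toy binary data

for _ in range(200):
    # Wake phase: recognize real data, then train the generative connections
    # to reconstruct the data from the recognized hidden state.
    h = sample(sigmoid(data @ R))
    G += lr * h.T @ (data - sigmoid(h @ G)) / len(data)
    g_bias += lr * (h - sigmoid(g_bias)).mean(axis=0)
    # Sleep phase: generate a fantasy from the top down, then train the
    # recognition connections to recover the fantasy's hidden causes.
    h_f = sample(np.tile(sigmoid(g_bias), (len(data), 1)))
    v_f = sample(sigmoid(h_f @ G))
    R += lr * v_f.T @ (h_f - sigmoid(v_f @ R)) / len(data)
```

Each set of connections is trained on targets produced by the other set, which is the bootstrapping trick Hinton describes.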
Well, what about the other two theories? We only have eight minutes left, so I don't think I'll have time to ask about everything.
Give me another hour, and I'll tell you about the other two.
Let's talk about what comes next. Where is your research going? What problem are you trying to solve now?
In the end you have to work on something you haven't finished yet. I think I may be working on something I'll never finish, called capsules, a theory of how visual perception is done using reconstruction and how information gets routed to the right places. One of the main motivating factors was that in standard neural networks the information, the activity in a layer, just automatically goes somewhere; you don't make decisions about where to send it. The idea of capsules was to make decisions about where to send information.
Then, after I started working on capsules, some very smart people at Google invented transformers, which do the same thing. They decide where to route the information, and that's a big win.
Next year we will be back to talk about theories of dreams number three and number four.
The article uses illustrations by Maria Menshikova.