By Xia Yi and QG
Source: QbitAI
We were wrong.
This morning, on the last day of Google I/O 2018, John Hennessy, the new chairman of Alphabet and winner of this year's Turing Award, stepped onto the stage.
We expected a routine speech: a few words about responsibility, doing good science, and so on. Unexpectedly, the former Stanford president did not stick to officialese. (He is far more interesting than Alphabet's previous chairman.)
After listening, we couldn't calm down for a long time.
Hennessy spent much of the opening carefully explaining the development, current state, and dilemmas of computing, as well as possible future breakthroughs. Every sentence carried substance.
And of course, there were moments of real emotion.
For example, the 65-year-old guru said: you may not believe it, but I built my first computer almost 50 years ago.
Hennessy said that over the past 50 years, he has watched the incredible IT industry launch wave after wave of revolutions. The Internet, chips, smartphones, and computers have each worked their magic. But there is one thing he believes will truly change our lives.
"This is the breakthrough in machine learning and artificial intelligence."
"People have invested in this field for 50 years. Finally, finally, we made a breakthrough. To achieve it, the basic computing power we needed was a million times what was previously envisioned. But in the end we did it."
He said: This is a revolution that will change the world.
Then came an even more dramatic scene.
Many people have probably already seen Google Duplex, the conversational AI that Google CEO Sundar Pichai demonstrated on the first day of Google I/O 2018, two days earlier.
It can call a hair salon or a restaurant in the real world to book appointments and reserve tables, conversing smoothly throughout and responding flawlessly to unsuspecting human operators. Watch the video below to experience Google's AI black tech for yourself.
When Google Duplex came out, everyone was blown away. The results were almost too good. Once the audience recovered, one question lingered: hasn't the AI Google demonstrated effectively passed the Turing test?
Yes. Today, Alphabet chairman John Hennessy finally confirmed it in person: "In the domain of making appointments, this AI has passed the Turing test."
"This is an extraordinary breakthrough."
He added that although this AI has not achieved such a breakthrough in every situation, it still points the way forward.
In 1950, Turing published an epoch-making paper proposing that humans could create machines with true intelligence. He also put forward the famous Turing test: if a machine can carry on a dialogue with humans (over a teletype) without being identified as a machine, then the machine is intelligent.
Passing the Turing test means that the machine can think.
△ Turing's epoch-making paper
Amazing. Once again, a round of applause for Google (and here's hoping an AI will answer the call).
In the speech, John Hennessy also quizzed the audience: there is one thing that is still growing at the exponential rate of Moore's Law. Can you guess what it is?
The answer is: the number of machine learning papers.
The audience burst out laughing. (Thanks to Jeff Dean, Dave Patterson, Cliff Young, and others for contributing the data.)
He also explained the internal structure of the TPU. A detailed account follows.
So today, with full respect and excitement, we publish this post. We think of it as: "The Alphabet Chairman's Speech at the Mountain View Technology Symposium."
A condensed version of Hennessy's speech follows; no transcript has been published anywhere else yet.
Today I want to talk about the biggest challenge we will face in computing over the next 40 years, which is also a great opportunity to rethink how we build computers.
It is now fashionable to talk about the end of Moore's Law. Gordon Moore himself once told me: every exponential comes to an end; it's only a matter of when. Moore's Law has now run into exactly that law of nature.
What does the end of Moore's Law mean?
Let's look at DRAM (dynamic random-access memory) first. For many years, DRAM capacity grew at about 50% per year, faster than Moore's Law.
But then it hit a plateau. What happened over the past 7 years? DRAM is a rather special technology: it requires deep-trench capacitors and therefore a specialized fabrication process.
So what happened with processors? The slowdown there looks similar to DRAM. The red line is the prediction of Moore's Law; the blue line is the transistor count of a typical Intel processor. The two lines diverge only slightly at first, but by 2015 and 2016 the gap had become very large.
And remember, there is a cost factor here. As fabrication lines get more expensive, chip costs are no longer falling as fast, so the cost per transistor is actually rising. When we consider architecture, we will see the impact of these issues.
Beyond the slowdown of Moore's Law, there is a bigger problem: the end of Dennard scaling.
Bob Dennard of IBM, inventor of the one-transistor DRAM cell, predicted years ago that the energy required per square millimeter of silicon would remain constant, because voltage levels and currents would keep shrinking.
What does this mean? If total energy stays constant while the number of transistors grows exponentially, then the energy per transistor is actually falling. In other words, measured in energy, computation keeps getting cheaper.
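The arithmetic behind Dennard scaling is easy to sketch. The numbers below are illustrative toys, not measurements: we hold power per unit area fixed and double the transistor count each generation, and watch the per-transistor energy budget fall.

```python
# Sketch of Dennard scaling: power per mm^2 held constant while
# transistor count doubles each generation (all numbers illustrative).
power_per_mm2 = 1.0          # watts per mm^2, constant under Dennard scaling
transistors = 1_000_000      # transistors per mm^2 in generation 0

for gen in range(4):
    energy_per_transistor = power_per_mm2 / transistors
    print(f"gen {gen}: {transistors:>9} transistors, "
          f"{energy_per_transistor:.2e} W each")
    transistors *= 2         # Moore's Law: count doubles each generation
```

Each doubling halves the power available per transistor, which is exactly why compute kept getting cheaper in energy terms — until the scaling stopped.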
What actually happened to Dennard scaling? Look at the chart above: the red line is the technology trend on the standard Moore's Law curve, and the blue line is the change in energy consumption per unit area.
Today's processors throttle their clocks and shut down cores, because otherwise they would burn up. I never thought I would see the day when a processor slows itself down to avoid overheating, but here we are.
In fact, Dennard scaling ended around 2007, triggering dramatic changes in the chip industry. Suddenly the key limiting factor was no longer transistor count but energy consumption, and that forces you to completely rethink architecture and how machines are built.
It means that inefficiency, whether in how transistors are used for computation or in the architecture itself, hurts more than ever.
The devices we use and carry every day run on batteries, so energy suddenly becomes a critical resource. Is there anything worse than a phone running out of battery? Then think of the coming IoT era, with always-on devices expected to run for 10 years on a single battery or on energy-harvesting technology.
We need more and more devices to stay on all the time. A device with Google Assistant, for example, may not need its screen on, but its CPU must keep working. So we must think about energy efficiency more and more.
To many people's surprise, energy efficiency is also a huge problem in large cloud computing data centers.
This is the typical cost structure of a Google data center. The red portion, energy plus cooling, costs almost as much as the servers themselves. So energy efficiency is a critical issue, and here too you can see the impact of the end of Dennard scaling: there is no free lunch.
The figure above shows processor performance over the past 40 years. In the early years we saw about 22% improvement annually; after the invention of RISC in the mid-1980s, annual gains reached roughly 50%; then, with the end of Dennard scaling, the whole chip community turned to multicore. What did multicore accomplish? It let hardware designers kick the energy-efficiency ball over to software designers. Now we have entered a plateau where performance grows only about 3% per year, taking some 20 years to double. This is the end of general-purpose processor performance growth.
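Those annual growth rates translate into doubling times by simple compound growth. A quick sketch — the 22% and ~50% rates come from the speech, and the ~3.5% figure (from Hennessy and Patterson's Turing lecture, slightly above the rounded "about 3%" here) is what yields the oft-quoted 20 years:

```python
import math

def doubling_time(annual_growth):
    """Years to double performance at a given annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)

# RISC era (~50%/yr), early era (22%/yr), today's plateau (~3.5%/yr)
for rate in (0.50, 0.22, 0.035):
    print(f"{rate:.1%} per year -> doubles in {doubling_time(rate):.1f} years")
```

At 50% per year performance doubles in well under two years; at 3.5% it takes about 20 years — the difference between a revolution per product cycle and stagnation.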
Why is this?
Executing large numbers of instructions in parallel has stopped paying off. In an Intel Core i7, for example, the results of 25% of executed instructions are thrown away, yet executing them still consumes energy. That is why the single-processor performance curve flattened out.
Multicore processors face a similar problem: any large, complex piece of software inevitably contains serial portions.
Suppose you use a 64-core processor, but 1% of the code it runs is serial. Its effective speed is then only that of about a 40-core machine, yet you still pay the energy bill for all 64 cores.
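The 64-core figure is just Amdahl's Law. A quick sketch:

```python
def amdahl_speedup(serial_fraction, cores):
    """Amdahl's Law: max speedup when serial_fraction of the work
    cannot be parallelized across the given number of cores."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

# 1% serial code on 64 cores: roughly a 40-core machine's worth of
# throughput, paid for with 64 cores' worth of energy.
print(f"{amdahl_speedup(0.01, 64):.1f}x")  # -> 39.3x
```

Even a tiny serial fraction caps the payoff, which is why simply adding cores could never substitute for per-core efficiency gains.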
We must overcome this obstacle in energy efficiency and we must rethink how we design machines.
What else can we do to make systems more efficient?
Is a software-centric approach feasible? Modern scripting languages are very productive for the programmers who use them, but very inefficient in execution.
A hardware-centric approach? Dave Patterson and I call it the "domain-specific architecture." Such architectures are not general-purpose, but they handle applications in their domains extremely well.
Based on the challenges mentioned above, let's look at the opportunities.
This table is based on a paper by Charles Leiserson and his MIT colleagues, "There's Plenty of Room at the Top." Using matrix multiplication as the example, they ran the algorithm on an Intel Core processor and optimized it step by step: rewriting it in C, parallelizing the loops, and optimizing for the memory hierarchy each brought further speedups. Finally, rewriting it with Intel AVX instructions made it more than 60,000 times faster than the original Python.
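The flavor of that experiment can be reproduced in miniature. Below is a toy sketch, not the paper's code: the textbook triple loop (the Python starting point) against a variant that transposes the second matrix and pushes the inner dot product into C-level builtins. The full sequence in the paper (C, parallel loops, cache blocking, AVX) takes the same idea vastly further.

```python
import time

def matmul_naive(A, B):
    """Textbook triple loop: the interpreted-Python starting point."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C

def matmul_rowwise(A, B):
    """Same result, but B is transposed so each dot product scans rows
    contiguously, and the inner loop runs inside C-level sum/zip."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt]
            for row in A]

n = 64
A = [[float(i + j) for j in range(n)] for i in range(n)]
B = [[float(i - j) for j in range(n)] for i in range(n)]

for f in (matmul_naive, matmul_rowwise):
    t0 = time.perf_counter()
    C = f(A, B)
    print(f"{f.__name__}: {time.perf_counter() - t0:.4f} s")
```

Even this single layout-plus-builtins change typically gives a noticeable speedup on CPython, hinting at how much headroom sits "at the top" of the stack.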
Although it is a very simple algorithm, it shows the potential of software optimization.
What about domain-specific architectures (DSAs)? What we really need is a breakthrough in hardware energy efficiency. "Domain-specific" means a processor designed for a range of applications in one domain; because it can exploit knowledge of that domain, it runs far more efficiently. An example is a processor designed to run neural networks and other machine-learning workloads.
DSAs are not magic. Restricting an architecture to one domain does not automatically make computation faster; we need architectural changes to gain efficiency. Several points matter:
First, we exploit parallelism more effectively. We move from the multiple-instruction, multiple-data (MIMD) model of multicore processors to single-instruction, multiple-data (SIMD): instead of each core fetching its own instruction stream from its own cache, a single instruction stream drives an array of functional units. This is a huge efficiency gain, at the cost of some flexibility. We also use VLIW-like approaches, letting the compiler decide whether a group of operations can be issued in parallel, shifting that work from run time to compile time.
We also move away from caches. Caches are a fine invention, but when spatial and temporal locality are low they are not just useless, they actually slow the program down. So we replace them with user-controlled local memories. The trade-off is that someone now has to map the application onto those user-controlled memory structures.
In addition, we remove unnecessary precision and turn to lower-precision arithmetic.
Alongside all this, we need domain-specific languages. From code written in Python or C, you cannot extract the information a domain-specific architecture needs for its mappings.
So we need to rethink how these machines are programmed. Domain-specific languages use high-level operations, such as vector-vector multiplication, vector-matrix multiplication, or sparse-matrix structures, and that high-level information can be compiled down to the architecture.
The key in designing a domain-specific language is to keep it sufficiently machine-independent: you should not have to rewrite a program when the machine changes. There should be a compiler that takes the domain-specific language and maps it to the architecture running in the cloud, or to the one running on a phone, and that is a challenge. TensorFlow, OpenGL, and others are moving in this direction, but this is virgin territory; we are only beginning to understand how to design for it.
So, what should a domain-specific architecture for deep neural networks look like?
This diagram shows the internal structure of the TPU. The point I want to make is that its silicon area is devoted not to control logic or caches, but directly to computation.
Because of this, the processor can perform a 256×256 block of operations per clock cycle, that is, 65,536 8-bit multiply-accumulates, so it handles inference tasks with ease. You cannot use it to run general-purpose C code; you use it to run neural-network inference.
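The arithmetic behind that throughput claim is simple to check. A sketch, assuming the 700 MHz clock published for the first-generation TPU and counting each multiply-accumulate as two operations:

```python
# Peak throughput of a 256x256 systolic multiply-accumulate array.
array_dim = 256
macs_per_cycle = array_dim * array_dim   # 65,536 8-bit MACs per clock
clock_hz = 700e6                         # published first-gen TPU clock
ops_per_cycle = 2 * macs_per_cycle       # one multiply + one add per MAC
peak_tops = ops_per_cycle * clock_hz / 1e12

print(f"{macs_per_cycle} MACs/cycle, ~{peak_tops:.0f} TOPS peak")
```

That works out to roughly 92 tera-operations per second of peak 8-bit throughput — orders of magnitude beyond what a general-purpose core spends on the same work.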
Now look at performance per watt. The TPU achieves more than 30 times that of a general-purpose processor, and it is far better than a GPU as well. This is the payoff of tailoring an architecture to a specific domain.
This is a new era, yet in a sense it is also a return to the past. In the early days of computing, application-domain experts, software-environment people, compiler writers, and architects worked together in vertical teams.
Now we need such integrated teams again: teams that understand how to go from applications to domain-specific languages, to domain-specific architectures, and to rethinking how machines are built.
For the industry, this is both a huge opportunity and a new challenge. I believe there are enough interesting applications that we can achieve great performance advantages by customizing machines for their domains.
And I think if we can make this kind of progress, we will also free up room to tackle information security, an important issue that deserves our attention.
One More Thing
After the speech, Hennessy stayed on stage and took questions for more than 20 minutes.
The Q&A covered quantum computing, neural networks, security, the evolution of computing architecture, the industry's future, education, and more. Hennessy answered sincerely and clearly.
Here are a few highlights.
One questioner said: the textbook for my processor architecture course was the one you wrote.
Hennessy shot back: "I hope it didn't hurt you."
The audience burst out laughing again.
Another question concerned cryptocurrencies such as Bitcoin.
Hennessy replied: "Indeed, I could build an architecture dedicated to Bitcoin mining." He said that cryptocurrency is important, but there are still problems that need to be solved.