Not long ago, at SIGGRAPH 2018, NVIDIA officially unveiled a new generation of GPU architecture: Turing. Jensen Huang called the Turing architecture the biggest leap since the invention of the CUDA GPU in 2006. Two key features of Turing are the integration of RT Cores for ray tracing and Tensor Cores for AI computing, making it the world's first GPU to support real-time ray tracing.
When it comes to AI computing, however, NVIDIA GPUs have long been the accelerator of choice. And it was not Turing that first put the Tensor Core on the GPU's business card, but its predecessor: Volta.
The Volta-based Titan V is the showpiece of NVIDIA's computing achievements. Deep learning and neural networks have become the driving force behind NVIDIA's GPUs, and as a state-of-the-art compute accelerator, the Titan V integrates built-in hardware and software acceleration for machine learning operations; deep learning capability is the calling card of both Titan V and Volta.
Titan V is a far cry from the original Kepler-based GeForce GTX Titan. The original Titan was positioned as a general-purpose card: both a flagship gaming graphics card for enthusiasts and a prosumer card with full double-precision floating-point (FP64) compute capability. Before Titan V, the Titan product line largely followed this design approach, and its huge GPU core was the best representative of NVIDIA's "big and complete" design philosophy.
With Titan V, NVIDIA once again pushed the limits of the big core. Volta's most notable feature is its new dedicated processing unit, the Tensor Core, which complements the rest of the Volta microarchitecture and integrates with software frameworks supporting deep learning and high-performance computing (HPC).
With a huge 815 mm² GV100 core, the Titan product line is closer to the workstation class than ever before. Titan V offers the world's strongest graphics rendering performance while greatly improving deep learning and high-performance computing performance; of course, its price has also reached a workstation-class $3,000.
Strengthen computing power and rebuild Titan
In addition to deep learning and general-purpose computing, other factors are at play in this iteration of the Titan brand. NVIDIA does not really need the Titan series to build its image: the GK110 GPU of the original GTX Titan already provided computing power to Oak Ridge National Laboratory's Titan supercomputer in the form of Tesla K20X accelerators. Moreover, the Titan series faces no particular competitive pressure on price or performance.
Although Titan V's non-ECC HBM2 memory and GeForce driver stack are more consumer-oriented, the card still benefits directly from the framework and API software support that is part of NVIDIA's deep learning development. Given that a single Titan V does not threaten the server-class Quadro GV100, NVIDIA only cut the NVLink high-speed bus used for multi-GPU server designs from Titan V, while the main compute capabilities (FP64/FP16/Tensor Core) are fully retained.
Unlike Pascal and earlier products, which used PC-oriented GPUs and conventional GDDR5(X) memory, NVIDIA chose a large, low-yield, server-scale chip and spent a large share of its transistors on non-graphics functions (Tensor Cores). This is NVIDIA's bet on the computing space: it is no longer satisfied with merely leading in traditional graphics cards and general-purpose computing.
Since this is our first deep dive into GPU deep learning performance, there is currently no standard set of benchmarks on the market, especially for Volta's unique Tensor Cores and mixed-precision capabilities. For Titan V, we will test with Baidu DeepBench, NVIDIA's Caffe2 Docker, Stanford DAWNBench, and the HPE Deep Learning Benchmark Suite (DLBS).
Before delving into the test data, however, Leifeng.com will first provide some background on deep learning, GPUs, the Volta microarchitecture, and deep learning performance benchmarks.
GPU and deep learning
The first thing to note is that although "machine learning" or the more general "AI" is sometimes used interchangeably with "deep learning", technically they refer to different things: machine learning is a subset of AI, and deep learning is a subset of machine learning.
Deep learning is named after "deep neural networks", which are ultimately designed to recognize patterns in data, generate relevant predictions, receive feedback on prediction accuracy, and then self-adjust based on that feedback. Computation takes place on "nodes", which are organized into "layers": raw input data is first processed by an "input layer", and an "output layer" presents the data predicted by the model. Any layer between the two is called a "hidden layer", and "deep" means that a deep neural network has many hidden layers.
These hidden layers operate at increasing levels of abstraction, allowing the network to extract and distinguish nonlinear features from complex input data. A standard example is image recognition: initial layers look for certain edges or shapes, which inform later layers looking for noses and eyes, and still later layers may look for whole faces. The final layers combine all of this data to make a classification.
As input data advances through the model, computations involving special internal parameters (weights) produce a loss function that represents the error between the model's prediction and the correct value. This error information is then used in a backward pass through the model to calculate weight adjustments that will improve the model's predictions; together, these steps constitute a single training iteration.
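The forward-pass/loss/backward-pass/update loop described above can be sketched in a few lines of NumPy. This is a deliberately minimal toy (a single linear layer with a squared-error loss), not any framework's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: inputs x and targets y generated by a known linear mapping
x = rng.standard_normal((64, 8)).astype(np.float32)
true_w = rng.standard_normal((8, 1)).astype(np.float32)
y = x @ true_w

w = np.zeros((8, 1), dtype=np.float32)  # model weights, to be learned
lr = 0.05                               # learning rate

for _ in range(200):                    # training iterations
    pred = x @ w                        # forward pass
    err = pred - y                      # prediction error
    loss = float((err ** 2).mean())     # scalar loss
    grad = 2.0 * x.T @ err / len(x)     # backward pass: dLoss/dw
    w -= lr * grad                      # weight update

print(loss)
```

After a few hundred iterations, the loss approaches zero and the learned weights approach the generating weights, which is the whole point of the feedback loop.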
For inference, this process naturally omits the backward pass and ultimately requires less computational intensity than training the model. In this sense, inference also does not need precision as high as FP32, and models can be appropriately pruned and optimized for deployment on specific devices. However, inference devices are more sensitive to latency, cost, and power consumption, especially in the context of edge computing.
Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are two important subtypes of deep neural networks. Convolution itself is an operation that combines input data with a convolution kernel to form a feature map, transforming or filtering the raw data to extract features.
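As an illustration of a convolution producing a feature map, here is a minimal "valid" 2D convolution (really cross-correlation, as deep learning frameworks compute it) in NumPy; the image, kernel, and sizes are arbitrary examples:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` ('valid' mode, no padding) and
    sum the elementwise products at each position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=image.dtype)
    for i in range(oh):
        for j in range(ow):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

# A vertical-edge detector applied to a tiny image with one edge
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=np.float32)
edge_kernel = np.array([[-1, 1],
                        [-1, 1]], dtype=np.float32)
feature_map = conv2d(image, edge_kernel)
print(feature_map)
```

The resulting feature map lights up only at the column where the dark-to-bright edge occurs, which is exactly the "filtering raw data to extract features" behavior described above.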
CNNs are usually "feed-forward", in that data flows through the layers without looping. In RNNs (and variants such as LSTM and GRU), each calculation has a separate weight that loops back to itself, giving the network a sense of "memory" and allowing it to make time-aware predictions, which is useful in scenarios such as text analysis.
Since deep learning math can be reduced to linear algebra, some operations can be rewritten as matrix-matrix multiplications that are more GPU-friendly. When NVIDIA first developed and announced cuDNN, one of the important implementations was lowering the algorithm to matrix multiplication to speed up convolutions. Over the years, cuDNN's development has included a "precomputed implicit GEMM" convolution algorithm, which happens to be the only algorithm that triggers Tensor Core convolution acceleration.
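The idea of lowering convolution to matrix multiplication (the explicit "im2col + GEMM" approach; implicit GEMM performs the same unrolling on the fly without materializing the matrix) can be sketched as follows. This is a simplified single-channel illustration, not cuDNN's actual code:

```python
import numpy as np

def im2col(image, kh, kw):
    """Unroll every kh x kw patch of `image` into one row of a matrix."""
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    cols = np.empty((oh * ow, kh * kw), dtype=image.dtype)
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = image[i:i + kh, j:j + kw].ravel()
    return cols

rng = np.random.default_rng(1)
image = rng.standard_normal((5, 5)).astype(np.float32)
kernel = rng.standard_normal((3, 3)).astype(np.float32)

# Convolution as a single matrix multiplication (GEMM)
patches = im2col(image, 3, 3)                 # 9 patches, 9 values each
gemm_result = (patches @ kernel.ravel()).reshape(3, 3)

# Reference: direct sliding-window convolution
direct = np.zeros((3, 3), dtype=np.float32)
for i in range(3):
    for j in range(3):
        direct[i, j] = (image[i:i + 3, j:j + 3] * kernel).sum()

print(np.allclose(gemm_result, direct, atol=1e-5))
```

The two results agree: the convolution has been turned into one large GEMM, which is exactly the shape of work that GPU hardware, and later Tensor Cores, accelerates best.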
The advantages of NVIDIA GPUs
For deep learning, GPUs have become the accelerator of choice. Most of the computation is essentially parallel floating-point calculation, namely a large number of matrix multiplications, whose optimal performance requires large amounts of memory bandwidth and capacity. These requirements closely match those of HPC, which GPUs serve with high-precision floating-point computation, ample VRAM, and parallel compute capability; NVIDIA's CUDA arrived at just the right time.
The development of CUDA and NVIDIA's computing business coincided with progress in machine learning research, which re-emerged as "deep learning" only around 2006. GPU-accelerated neural network models provided order-of-magnitude speedups over CPUs, which in turn propelled deep learning into today's buzzword. Meanwhile, NVIDIA's graphics rival ATI was acquired by AMD in 2006; OpenCL 1.0 was released in 2009, the same year AMD spun off its fabs as GlobalFoundries.
As DL researchers and scholars successfully used CUDA to train neural network models faster, NVIDIA released cuDNN, a library of optimized deep learning primitives, many of which followed the precedent of the HPC-centric BLAS (Basic Linear Algebra Subprograms) and the corresponding cuBLAS. cuDNN thus freed researchers from having to create and optimize CUDA code for DL performance themselves. As for AMD's counterpart MIOpen, it was released last year under the ROCm umbrella and is currently only integrated with Caffe.
So in this sense, although the underlying hardware of both NVIDIA and AMD is suitable for DL acceleration, NVIDIA GPUs have become the reference implementation for deep learning.
Anatomy of Tensor Core
One of the more puzzling aspects of Volta's mixed-precision Tensor Cores is their 4×4 matrix multiplication capability. The Tensor Core is a new processing core that performs specialized matrix math operations for deep learning and certain types of HPC. Tensor Cores perform a fused multiply-add, in which two 4×4 FP16 matrices are multiplied and the result is added to a 4×4 FP16 or FP32 matrix, ultimately outputting a new 4×4 FP16 or FP32 matrix.
NVIDIA calls this Tensor Core operation mixed-precision math, because the input matrices are half precision but the product can reach full precision. As it happens, operations of exactly this kind are common in deep learning training and inference.
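The D = A×B + C operation just described can be emulated numerically in NumPy to show what "mixed precision" means here: FP16 inputs, with products and accumulation carried in FP32. This is a software illustration of the arithmetic only, not of how the hardware is implemented:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two half-precision 4x4 input matrices, as fed to a Tensor Core
a = rng.standard_normal((4, 4)).astype(np.float16)
b = rng.standard_normal((4, 4)).astype(np.float16)
c = rng.standard_normal((4, 4)).astype(np.float32)  # FP32 accumulator

# Mixed precision: multiply FP16 inputs, accumulate products in FP32
d = a.astype(np.float32) @ b.astype(np.float32) + c

# For contrast, an all-FP16 pipeline rounds each partial sum to FP16,
# losing precision during accumulation
d_all_fp16 = (a @ b).astype(np.float32) + c

print(d.dtype, d.shape)
```

The key point mirrors the text: the inputs are half precision, but the result matrix is carried at full precision until it is written back.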
Although Tensor Cores are a new kind of compute unit in the GPU, they are not so different from a standard ALU pipeline, except that Tensor Cores handle large matrix operations rather than simple single-instruction, multiple-thread (SIMT) scalar operations. Tensor Cores trade flexibility for throughput: they perform poorly on scalar operations, but they can pack far more operations into the same chip area.
Although Tensor Cores have some programmability, they still remain at the level of 4×4 matrix multiply-accumulate, and it is not clear how or when the accumulation step occurs. Although described as performing 4×4 matrix math, Tensor Core operations always seem to work on 16×16 matrices, processed across two Tensor Cores at a time. This appears to be related to other changes in the Volta architecture, and more specifically to how the Tensor Cores are integrated into the SM.
In Volta, the SM is divided into four processing blocks, or sub-cores. For each sub-core, the scheduler issues one warp instruction per clock to the local branch unit (BRU), the Tensor Core array, the math dispatch unit, or the shared MIO unit, which from the outset prevents Tensor operations from issuing concurrently with other math operations. When using the two Tensor Cores, the warp scheduler issues matrix multiply operations directly, and after receiving the input matrices from the register file, performs 4×4×4 matrix multiplications. Once a matrix multiplication completes, the Tensor Cores write the resulting matrix back to the register file.
As for how Tensor Cores execute actual instructions, even at the compiler level there are only intrinsics for warp-level matrix operations in NVVM IR (LLVM); for CUDA C++ and the PTX ISA, the warp level remains the only level. Input matrices are loaded in the form of fragments, with each warp thread holding a fragment whose distribution and identity are unspecified. Broadly speaking, this follows the same pattern as thread-level tiling-based GEMM computation on standard CUDA cores.
In general, for an A*B+C Tensor Core operation, a fragment consists of 8 FP16×2 elements of A (i.e., 16 FP16 elements), another 8 FP16×2 elements of B, and either 4 FP16×2 elements for an FP16 accumulator or 8 FP32 elements for an FP32 accumulator.
After the matrix multiply-accumulate operation, the results are spread across the destination register fragments of each thread and must be unified across the whole warp; if one of the warp threads exits, these low-level operations will essentially fail.
Low-level microbenchmarks by the Citadel LLC team revealed many Volta microarchitecture details, including Tensor Core operations and the associated fragments, and how they are laid out in registers and identified relative to the input matrices. They observed that a sub-core computes matrix multiplications in a particular tiled pattern in which all 32 warp threads are running.
Conceptually, the Tensor Cores operate on 4×4 submatrices to compute a larger 16×16 matrix. The warp's threads are divided into 8 groups of 4 threads, and each thread group successively computes an 8×4 block; a total of 4 sets of computation are processed, with each thread group handling 1/8 of the target matrix.
Within a set, four HMMA steps can be done in parallel, each step applying to a 4×2 sub-block. The four threads are directly linked to those matrix values in the register file, so a single Step 0 HMMA instruction can be processed by the thread group to compute the sub-block in one go.
Since matrix multiplication mathematically requires the reuse of certain rows and columns to allow parallel execution across all 8×4 blocks, each 4×4 matrix is mapped to the registers of two threads. In computing the 4×4 submatrix operations of the 16×16 parent matrix, this involves adding successive sets of computations to form the corresponding 4×8-element blocks of the 16×16 matrix. Although Citadel did not test FP16, they found that the FP16 HMMA instruction produces only 2 steps instead of 4, which may be related to FP16 occupying less register space.
The basic 4×4×4 Tensor Core operation is thus converted into a semi-programmable 16×16×16 mixed-precision matrix multiply-accumulate, with independent thread scheduling and execution as well as warp synchronization and warp-wide result distribution. Although CUDA 9.1 also supports 32×8×16 and 8×32×16 matrices, the multiplied matrices must still have matching columns and rows, with the final matrix being 32×8 or 8×32.
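How a 16×16×16 warp-level multiply-accumulate can be composed from the basic 4×4×4 operation can be shown by tiling: each 4×4 output tile accumulates the products of four pairs of 4×4 sub-tiles along the shared dimension. The block below is an illustrative NumPy model of that decomposition, not the hardware's actual scheduling:

```python
import numpy as np

T = 4   # the Tensor Core's native tile size
N = 16  # the warp-level WMMA tile size

rng = np.random.default_rng(3)
A = rng.standard_normal((N, N)).astype(np.float16)
B = rng.standard_normal((N, N)).astype(np.float16)
C = np.zeros((N, N), dtype=np.float32)

# Build the 16x16x16 result out of 4x4x4 multiply-accumulate steps
D = C.copy()
for i in range(0, N, T):          # tile rows of the output
    for j in range(0, N, T):      # tile columns of the output
        for k in range(0, N, T):  # walk the shared (reduction) dimension
            a_tile = A[i:i + T, k:k + T].astype(np.float32)
            b_tile = B[k:k + T, j:j + T].astype(np.float32)
            D[i:i + T, j:j + T] += a_tile @ b_tile  # one 4x4x4 MAC

reference = A.astype(np.float32) @ B.astype(np.float32) + C
print(np.allclose(D, reference, atol=1e-3))
```

Each output tile needs four 4×4×4 steps (the `k` loop), which is consistent with the text's picture of the 16×16 result being assembled from many small matrix multiply-accumulates.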
The way Tensor Cores work appears to be a hardware-implemented step in NVIDIA's GEMM computation hierarchy, as seen in CUTLASS (NVIDIA's CUDA C++ template library for GEMM operations). On traditional CUDA cores, the last step is to decompose the warp tile structure into scalar and vector elements owned by individual threads. With the WMMA API (which now stands in for the Tensor Cores), all of this is abstracted away, leaving only the cooperative matrix fragment load/store and multiply-accumulate to deal with. The accumulation happens as an FMA-type operation.
At the register level, NVIDIA mentioned in their Hot Chips 2017 paper that "with three relatively small 4×4 matrices of multiply and accumulator data, 64 multiply-accumulate operations can be performed." The per-thread program counter of the enhanced Volta SIMT model (one of the features that enables Tensor Cores) typically requires two register slots per thread. The HMMA instructions themselves reuse registers as much as possible, so I cannot imagine registers not being a bottleneck in most cases.
In terms of registers, data paths, and scheduling, the Tensor Core arrays are not physically designed for independent 4×4 matrix multiply-accumulates; they can only be used for specific submatrix multiplications.
In any case, from NVIDIA's perspective Volta is not a dedicated deep learning ASIC; it still covers the GPGPU space, so keeping the Tensor Cores CUDA-programmable for GEMM/cuBLAS and HPC is only logical. This is especially true of CUTLASS for CUDA C++, whose WMMA API support is designed to enable Tensor Core GEMM operations for a wide range of applications. Fundamentally, the development of NVIDIA's deep learning hardware acceleration has a great deal to do with the development of cuDNN (and cuBLAS).
Make FP16 suitable for deep learning
Volta's deep learning capability is built on half-precision floating point (IEEE 754 FP16) rather than single-precision floating point (FP32).
This capability was first supported by cuDNN 3 and implemented in the Tegra X1's Maxwell architecture. Native half-precision computation was subsequently introduced in the Pascal architecture as so-called "pseudo-FP16", using FP32 ALUs to process pairs of FP16 instructions and thereby double FP16 throughput per clock. This approach is in fact reflected in how Tensor Cores process matrix fragments in registers: the two FP16 input matrices are gathered as 8 FP16×2 elements, i.e., 16 FP16 elements.
As far as FP32 versus FP16 is concerned, because single-precision floating point contains more data than half precision, it is more computationally intensive, requires more memory capacity and bandwidth to hold and move the data, and consumes more power. Successfully using lower-precision data in computation has therefore long been a holy grail, the target being workloads that do not require high-precision data.
Beyond API/compiler/framework support, deep learning has always had the problem of precision loss when using the FP16 data type, which can make training less accurate and prevent the model from converging.
According to Leifeng.com, NVIDIA introduced the concept of "mixed precision" in a similar situation before: for Pascal's fast FP16 (on GP100) and the DP4A/DP2A integer dot-product operations (on the GP102, GP104, and GP106 GPUs), similar claims were made.
At that time, the focus was on inference capability. Like Titan V's "deep learning TFLOPS", Titan X (Pascal) was launched with "44 TOPS (new deep learning inference instructions)". The new instructions perform integer dot products on 4-element 8-bit vectors or 2-element 8-bit/16-bit vectors, producing a 32-bit integer product that can be accumulated with other 32-bit integers.
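The DP4A behavior as described — a dot product of two 4-element 8-bit integer vectors accumulated into a 32-bit integer — can be modeled in a few lines. This Python sketch illustrates the arithmetic only (assuming signed 8-bit inputs and a signed 32-bit accumulator), not the actual instruction encoding:

```python
import numpy as np

def dp4a(a4, b4, acc):
    """Model of DP4A: dot product of two 4-element int8 vectors,
    accumulated into a 32-bit integer value."""
    a4 = np.asarray(a4, dtype=np.int8)
    b4 = np.asarray(b4, dtype=np.int8)
    # Widen to int32 before multiplying so the products cannot overflow
    return int(acc) + int(np.dot(a4.astype(np.int32), b4.astype(np.int32)))

# Example: one step of a longer quantized dot product
a = [10, -20, 30, -40]
b = [1, 2, 3, 4]
result = dp4a(a, b, acc=100)
print(result)
```

Chaining such steps, with the 32-bit accumulator carried between calls, is how a long quantized inner product is built from 8-bit data, which is why this instruction is attractive for low-precision inference.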
With Volta's mixed precision, important precision-sensitive data such as the master weights are still stored as FP32, while Tensor Core mixed-precision training multiplies two half-precision input matrices to obtain a full-precision product, which is then accumulated into a full-precision sum. NVIDIA says Tensor Cores convert the result back to half precision before writing it to memory, so that when the half-precision format is used, less register and memory space is needed, which helps with very large matrices.
FP16 does not cover the same data space as FP32, and a rescaling trick can handle values that fall outside FP16's representable range. For example, many activation gradient values fall outside FP16's range, but because these values are clustered together, multiplying the loss by a scaling factor moves most of them into FP16 range. Rescaling the gradients back to their original range before the final weight update preserves their original accuracy.
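The loss-scaling scheme just described can be demonstrated with a tiny NumPy example: gradient values far below FP16's smallest subnormal (about 6e-8) round to zero when cast to FP16, but scaling before the cast and unscaling in FP32 before the weight update preserves them. The specific values and scale factor here are arbitrary illustrations:

```python
import numpy as np

# Activation gradients too small for FP16: values well below the
# smallest FP16 subnormal (~6e-8) round to zero in half precision
grads_fp32 = np.array([1e-8, 2e-8, 2e-7], dtype=np.float32)

# Naive FP16 cast: the smallest gradients underflow and are lost
naive_fp16 = grads_fp32.astype(np.float16)

# Loss scaling: multiply by a scale factor (applied to the loss, which
# scales every gradient), cast to FP16, then unscale back in FP32
scale = 1024.0
scaled_fp16 = (grads_fp32 * scale).astype(np.float16)
recovered = scaled_fp16.astype(np.float32) / scale

print(naive_fp16)   # the two smallest entries have become zero
print(recovered)    # all three entries survive, close to the originals
```

This is the essence of the technique: because the gradients are clustered, one multiplicative shift moves the whole cluster into FP16's usable range without changing the mathematics of the update.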
However, not all math, neural networks, and layers are suited to FP16. Usually, FP16 and Tensor Core mixed precision works best for convolution- and RNN-heavy image processing; for network frameworks or types that are unsuitable, FP16 is disabled by default or not recommended.
Memory improvement, SM change
Using Tensor Cores to process mixed-precision data might seem to alleviate memory bandwidth problems, but it turns out that although Volta's memory subsystem has been enhanced in almost every respect, the gains are modest.
First, Volta has a 12 KB L0 instruction cache. Although Pascal and earlier GPU cores also had instruction caches, Volta's more efficient L0 is private to each sub-core SM partition, and therefore also private to its warp scheduler. This compensates for the larger instruction size of the Volta architecture and likely also helps feed the Tensor Core pipeline. At the same time, Volta's instruction latencies are lower than Pascal's; notably, core FMAs drop from 6 cycles to 4.
As the ratio of schedulers per SM increases, dropping the second dispatch port appears to be a trade-off for sub-cores with independent data paths and math dispatch units. Along with separate FP32/INT32 execution capability, this also opens the door to other low-precision/mixed-precision models. These sub-core enhancements are all aimed at optimizing the Tensor Core arrays.
Another major change is the merging of the L1 cache and shared memory into a single block, within which shared memory can be configured up to 96 KB per SM. The HBM2 controller has also been updated, improving efficiency by 10-15%.
Deep learning benchmark
From frameworks to models to APIs and libraries, many parts of deep learning AI hardware are highly customizable, and such a new field can sometimes be very difficult to make sense of.
As the saying goes, talk is cheap without practice, and practice is the sole criterion for testing truth. For computer hardware, no spec-sheet introduction beats hands-on testing, and nothing extracts a piece of hardware's actual performance like a benchmark.
With the impact of ImageNet and its derivative models (AlexNet, VGGNet, Inception, ResNet, and so on), training on the ILSVRC 2012 (ImageNet Large Scale Visual Recognition Challenge) image dataset has gradually become an industry-recognized standard. Basically all deep learning frameworks support CUDA and cuDNN, and for Volta, every framework that supports FP16 storage also supports Tensor Core acceleration; once FP16 storage is enabled, Tensor Core acceleration is enabled automatically.
In the next article, we will put the Volta architecture through a range of benchmarks, using various deep learning frameworks to understand Tensor Core performance.