
In-depth analysis of the Ice Lake architecture: Intel's goddess Athena

via: Expreview超能网     time: 2019/8/12 13:32:33


Intel's first volume-production 10nm product (not counting the lone 10nm i3 from Cannon Lake) has finally reached the market. To mark the occasion, the editor has compiled this analysis of the Ice Lake architecture to explore the improvements behind it.

It has been nearly five years since Intel last updated its desktop-class processor architecture. It has to be said that Skylake is a very successful generation: it is probably the longest-lived Intel processor architecture since P6, it has carried Intel to the present day, and it still has the upper hand in the mainstream server market.

First of all, we should clarify that Ice Lake is the code name of the entire processor architecture. An Intel processor today comprises the CPU cores, the GPU, and the other IO units in the uncore, so this article does not just analyze the CPU core microarchitecture but covers the architecture as a whole.

Note: If no source is stated, the images in this article are from WikiChip and AnandTech.

Ice Lake processor structure

Sunny Cove core microarchitecture: IPC increased by an average of 18%

Sunny Cove core structure diagram

Front-end buffers: bigger and bigger

An x86 processor core can be roughly divided into two parts, the front end and the back-end execution engine. The front end mainly handles instruction fetch and decode, while the back end contains the execution units that actually carry out the instructions. Between the two sits a buffer that holds the fused micro-ops. Intel introduced micro-op fusion into its cores very early to improve efficiency; the fused micro-ops enter this buffer and are then dispatched to the back end for execution. Intel currently believes that most bottlenecks in today's programs lie in memory access and in front-end instruction dispatch. Sunny Cove's improvements reflect this view, so the buffers have been enlarged considerably.

Buffer comparison

Architecture              Haswell    Skylake    Ice Lake
Reorder buffer entries    192        224        352
Load queue entries        72         72         128
Store queue entries       42         56         72


As the table shows, Intel has enlarged the reorder buffer (ROB, which puts micro-ops that were executed out of order back into original program order) to 352 entries, an increase of 128 entries or 57%, whereas Skylake added only 32 entries over Haswell. Memory access has also improved considerably: the load queue has grown by 56 entries and the store queue by 16, which taken together is significantly more than the change from Haswell to Skylake.
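The deltas quoted above can be recomputed directly from the table. The following is a minimal sketch; the dictionary layout and the function names `growth` and `report` are illustrative choices, and the numbers are simply those listed in the buffer comparison.

```python
from __future__ import annotations

# Buffer sizes in entries, copied from the comparison table above.
BUFFER_ENTRIES = {
    "reorder buffer": {"Haswell": 192, "Skylake": 224, "Ice Lake": 352},
    "load queue": {"Haswell": 72, "Skylake": 72, "Ice Lake": 128},
    "store queue": {"Haswell": 42, "Skylake": 56, "Ice Lake": 72},
}


def growth(old: int, new: int) -> tuple[int, float]:
    """Return (absolute increase, percentage increase) going from old to new."""
    delta = new - old
    return delta, delta / old * 100.0


def report() -> list[str]:
    """Build one line per buffer comparing both generational steps."""
    lines = []
    for name, sizes in BUFFER_ENTRIES.items():
        d1, p1 = growth(sizes["Haswell"], sizes["Skylake"])
        d2, p2 = growth(sizes["Skylake"], sizes["Ice Lake"])
        lines.append(
            f"{name}: Haswell->Skylake +{d1} ({p1:.0f}%), "
            f"Skylake->Ice Lake +{d2} ({p2:.0f}%)"
        )
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Keeping the three generations side by side makes it easy to see that the load queue did not grow at all from Haswell to Skylake, which is why the Ice Lake step stands out.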

Cache comparison

Architecture                     Haswell      Skylake      Ice Lake
L1 data cache per core           32KB         32KB         48KB
L1 instruction cache per core    32KB         32KB         32KB
L2 cache per core                256KB        256KB        512KB
Micro-op cache                   1.5K µops    1.5K µops    2.25K µops


Turning to the caches, the new core finally enlarges the L1 data cache, which had seemed frozen forever, from 32KB to 48KB. That is only 16KB more, but bear in mind that the design of a 32KB L1 instruction cache plus a 32KB L1 data cache has been in use ever since the Core microarchitecture, the first generation of the Core series. The bandwidth of the L1 data cache has also increased. The L2 cache attached to each core is doubled outright to 512KB, the largest change to the core caches since the Nehalem architecture moved the L2 cache into each core and set up a separate shared L3 cache.

Skylake vs. Sunny Cove core architecture: Skylake on the left, Sunny Cove on the right

The changes to the front end proper are small: mainly better prefetcher and branch predictor performance, plus a larger micro-op cache so that it can sustain the delivery of 5 (6) instructions per cycle.

Backend: wider

Top: Skylake; bottom: Ice Lake. Note the ports.

The back end has also changed considerably. Sunny Cove has two more execution ports than Skylake, for a total of 10. The port assignments are also more fine-grained: there are ports dedicated to load addresses and to store addresses, and there are now two ports dedicated to store data.

As for the execution units, Sunny Cove adds units that support AVX-512 instructions, which in fact were already present in Skylake-Server, and it brings in the iDIV hardware integer divider that first appeared in Cannon Lake. A new MulHi unit has also been added, dedicated to multiply instructions.

With the AVX-512 units in place, a Sunny Cove core can process one 512-bit instruction or two 256-bit instructions at a time.

As for the core interconnect, desktop Ice Lake will still use a ring bus design, while the server parts will continue with the mesh design of Skylake-Server.

Instruction set and AI acceleration

The instruction set has been extended along with the new units. New instructions have been added for encryption and decryption, AI acceleration, general-purpose computation, and special-purpose computation, most notably the AVX-512 instruction set.

For artificial intelligence, which has been so popular in recent years, Intel has added its own "Gaussian Network Accelerator" to the uncore, similar to the AI hardware acceleration circuits common on mobile phone SoCs, and it also uses the AVX512 VNNI instruction set to run AI-related computation on the AVX-512 units. Intel calls this acceleration "DL (Deep Learning) Boost". It is a clever approach: a dedicated compute unit guarantees a certain level of acceleration, while the new instructions make full use of the new CPU features.
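To give a feel for what the VNNI instructions compute, here is a small Python model of a single 32-bit lane of the `VPDPBUSD` instruction: four unsigned 8-bit values are multiplied by four signed 8-bit values, and the sum of the products is accumulated into a signed 32-bit integer. This is only an arithmetic sketch; the function names are made up for illustration, and real code would use compiler intrinsics or a library rather than Python.

```python
from __future__ import annotations


def to_int32(x: int) -> int:
    """Wrap an integer to signed 32-bit two's complement."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def vpdpbusd_lane(acc: int, a_u8: list[int], b_s8: list[int]) -> int:
    """Model one lane: acc + sum of four u8*s8 products (non-saturating form)."""
    assert len(a_u8) == 4 and len(b_s8) == 4
    assert all(0 <= a <= 255 for a in a_u8)      # unsigned 8-bit inputs
    assert all(-128 <= b <= 127 for b in b_s8)   # signed 8-bit inputs
    return to_int32(acc + sum(a * b for a, b in zip(a_u8, b_s8)))


def dot_u8_s8(a: list[int], b: list[int]) -> int:
    """Dot product built from 4-element groups, as a quantized layer would use."""
    assert len(a) == len(b) and len(a) % 4 == 0
    acc = 0
    for i in range(0, len(a), 4):
        acc = vpdpbusd_lane(acc, a[i:i + 4], b[i:i + 4])
    return acc


if __name__ == "__main__":
    print(vpdpbusd_lane(10, [1, 2, 3, 4], [1, -1, 2, -2]))
    print(dot_u8_s8([255] * 8, [127] * 8))
```

A 512-bit register holds 16 such lanes, so one instruction performs 64 multiply-accumulate operations, which is where the speedup for int8 inference comes from.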

The same goes for the encryption and decryption instructions, for example higher AES throughput and a new family of instructions for the SHA algorithms. In general, provided the compiler optimizes properly, Ice Lake's encryption and decryption performance is much stronger than Skylake's.

Summary

To briefly summarize the improvements of the Sunny Cove microarchitecture:

  • Improved prefetcher and branch predictor performance

  • L1 data cache 50% larger

  • L1 cache store bandwidth increased by 100%

  • L2 cache 100% larger

  • Micro-op cache 50% larger

  • 25% more micro-ops can enter the reorder buffer per cycle

  • Reorder buffer 57% larger

  • 25% more back-end execution ports

  • Support for new instruction sets such as AVX-512

Taken together, these improvements give Sunny Cove an average IPC gain of 18% over Skylake and 47% over Broadwell or Haswell. In tests optimized for AVX-512 the gain is at its highest, and it can be 2 to 2.5 times as fast as the previous generation of mobile low-voltage processors. At a time when Moore's Law is slowing down, these are very high numbers.
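The percentages in the summary list follow from the before and after figures given earlier. A minimal sketch of that arithmetic is below; the Skylake allocation width of 4 and the Skylake port count of 8 are not stated directly in the text and are inferred here from "5 per cycle, 25% more" and "two more ports, reaching 10", so treat them as assumptions.

```python
# (Skylake value, Sunny Cove value) pairs taken from the text above.
CHANGES = {
    "L1 data cache (KB)": (32, 48),
    "L2 cache (KB)": (256, 512),
    "micro-op cache (K uops)": (1.5, 2.25),
    # Assumption: "5 per cycle" and "25% more" imply a Skylake width of 4.
    "allocation width (uops/cycle)": (4, 5),
    # Assumption: "two more ports, reaching 10" implies 8 ports on Skylake.
    "execution ports": (8, 10),
}


def percent_gain(old: float, new: float) -> float:
    """Percentage increase going from old to new."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    for name, (old, new) in CHANGES.items():
        print(f"{name}: {old} -> {new} (+{percent_gain(old, new):.0f}%)")
```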

As an aside, many of these improvements were in fact already made in Cannon Lake, such as AVX-512, the related instruction set changes, and the increased cache bandwidth, and some changes come from the Skylake-Server architecture; the AI acceleration instructions, for instance, have already appeared on server processors. But because Cannon Lake was effectively abandoned by Intel, the Sunny Cove core, which inherits Cannon Lake's improvements, is compared against Skylake and so shows an average 18% IPC gain. Had everything gone to plan and Intel's 10nm not been delayed, Ice Lake would have been the generation after Cannon Lake, and the gain over that predecessor would not look nearly as large.

11th generation graphics architecture

Ice Lake's integrated graphics reaches 1 TFLOPS of compute performance for the first time and adds many features, a substantial improvement all round. Intel describes this generation of integrated graphics as its most powerful ever. How was that achieved?

Brute-force scaling on the 10nm process

Intel's 10nm process brings a very large increase in transistor density. In the 14nm era the integrated graphics had at most 24 EUs; on Ice Lake that figure grows 2.67-fold to as many as 64 EUs. The frequency is not low either: it tops out at 1100MHz, only 50MHz lower than before, and at that point the overall FP32 throughput of the integrated graphics reaches 1.15 TFLOPS. As a result, Intel officially claims an average frame rate about 1.8 times that of the Gen9 integrated graphics on 8th-generation Core processors.
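The scaling can be sanity-checked with a rough peak-throughput estimate. The sketch below assumes that each EU delivers 16 FP32 operations per clock (two SIMD-4 ALUs, each doing a fused multiply-add); that per-EU figure is not given in this article and is an assumption here. Under it, 64 EUs at 1100MHz come out slightly above 1.1 TFLOPS, in the same range as the figure quoted above.

```python
# Assumption (not from the article): 2 SIMD-4 FP32 ALUs per EU, and a fused
# multiply-add counts as 2 operations, giving 16 FP32 operations per clock.
FLOPS_PER_EU_PER_CLOCK = 16


def peak_fp32_tflops(eu_count: int, clock_mhz: float) -> float:
    """Theoretical peak FP32 throughput in TFLOPS."""
    return eu_count * FLOPS_PER_EU_PER_CLOCK * clock_mhz * 1e6 / 1e12


def eu_scaling(old_eus: int, new_eus: int) -> float:
    """How many times larger the new EU count is."""
    return new_eus / old_eus


if __name__ == "__main__":
    print(f"EU scaling: {eu_scaling(24, 64):.2f}x")
    print(f"Peak FP32: {peak_fp32_tflops(64, 1100):.2f} TFLOPS")
```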

You may well ask where Gen10 went. It is still on the dead Cannon Lake, whose only shipping model had its integrated graphics disabled.

On the mobile low-voltage Ice Lake processors currently available, Intel offers three graphics configurations, G1, G4 and G7, with 32, 48 and 64 EUs respectively. The low-end G1 still carries the "UHD" name, while G4 and G7 are both sold under the "Iris Plus" brand.

Besides piling on EUs thanks to the process advance, optimizing the internal architecture is just as important.

Internal architecture optimization

Comparison with the Gen9 integrated graphics. Source: Weekend talk, Icelake CPU assistant, Gen11 integrated graphics introduction

First, the scale is increased by putting more sub-slices into a single slice, which raises the amount of computation per cycle.

Second, work has gone into the cache system: the L3 cache has been enlarged. Intel states that the GPU's L3 cache is 3MB, and there is also 0.5MB of shared local memory. In addition, the upgraded memory controller in the processor lets the GPU use higher memory bandwidth.

New display interface versions and enhanced hardware encoding

One of the most annoying things that happened to the editor last month was buying a 1440p monitor with a 144Hz refresh rate, only to find that when it was connected to a notebook over HDMI the maximum output at 1440p was just 60Hz. The reason is that the old Gen9 integrated graphics supports HDMI only up to version 1.4, which tops out at 4K@30Hz or at 120Hz under 1080p, and the editor's notebook provides no USB-C or DP output.

Ice Lake finally solves this pain point by supporting HDMI 2.0b and DP 1.4 HBR3. Little more needs to be said about these two: the maximum resolution and refresh rate go up, and HDR is supported along the way.
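Why HDMI 1.4 cannot drive 1440p at 144Hz can be seen with a rough bandwidth estimate. The sketch below compares the data rate of the active pixels at 24 bits per pixel with the effective video data rate of each HDMI version (10.2 Gbit/s raw for HDMI 1.4 and 18 Gbit/s raw for HDMI 2.0b, minus 8b/10b encoding overhead). It deliberately ignores blanking intervals, so real requirements are somewhat higher than computed here; the function names are illustrative.

```python
# Effective video data rates in Gbit/s after 8b/10b TMDS encoding.
LINK_DATA_RATE_GBPS = {
    "HDMI 1.4": 8.16,    # 10.2 Gbit/s raw
    "HDMI 2.0b": 14.4,   # 18 Gbit/s raw
}


def active_pixel_rate_gbps(width: int, height: int, refresh_hz: int,
                           bits_per_pixel: int = 24) -> float:
    """Data rate of the visible pixels only (blanking is ignored)."""
    return width * height * refresh_hz * bits_per_pixel / 1e9


def fits(link: str, width: int, height: int, refresh_hz: int) -> bool:
    """True if the active pixel rate is within the link's data rate."""
    return active_pixel_rate_gbps(width, height, refresh_hz) <= LINK_DATA_RATE_GBPS[link]


if __name__ == "__main__":
    modes = [(2560, 1440, 60), (2560, 1440, 144), (3840, 2160, 30), (1920, 1080, 120)]
    for link in LINK_DATA_RATE_GBPS:
        for w, h, hz in modes:
            rate = active_pixel_rate_gbps(w, h, hz)
            verdict = "ok" if fits(link, w, h, hz) else "too much"
            print(f"{link}: {w}x{h}@{hz}Hz needs {rate:.2f} Gbit/s -> {verdict}")
```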

In addition, the video encoding hardware, the dedicated circuitry behind Intel's Quick Sync feature, has also been improved considerably in the new integrated graphics. It now supports encoding two HEVC 10-bit streams simultaneously: with YUV444 it handles up to two 4K 60fps video streams, or one 8K 30fps stream with YUV422.

Variable Rate Shading (VRS)

VRS, short for Variable Rate Shading, is a new technique that lets the GPU adjust shading precision according to how important each region of the image is. We have covered its effect in earlier news, and you can look at the image comparisons in that article: "To compare the performance gains of VRS variable rate shading, 3DMark will add a benchmark for the technology".

VRS saves GPU resources on unimportant parts of the image and lets those resources go toward rendering the more important parts, raising the overall frame rate. NVIDIA has already added support in its Turing GPUs. Intel has not fallen behind: it provides the feature in its Gen11 integrated graphics and has announced that it will work with Epic to bring it to Unreal Engine. Civilization VI already supports the technique, and according to Intel's data the frame rate improves by up to 30%.
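The saving can be illustrated with a toy model that counts pixel-shader invocations. The sketch below splits the screen into tiles and gives each tile a coarse-pixel size: (1, 1) shades every pixel, (2, 2) shades once per 2x2 block, and so on. The tile size, the grid and the function name are all made up for illustration; this is not how any particular graphics API exposes VRS.

```python
from __future__ import annotations

TILE_SIZE = 16  # pixels per tile edge; an illustrative choice


def shader_invocations(tile_rates: list[list[tuple[int, int]]]) -> int:
    """Count pixel-shader invocations for a grid of tiles.

    Each grid entry is a coarse-pixel size (w, h): one shader invocation
    covers a block of w x h pixels inside that tile.
    """
    total = 0
    for row in tile_rates:
        for w, h in row:
            total += (TILE_SIZE // w) * (TILE_SIZE // h)
    return total


# Full-rate shading everywhere versus reduced rates in less important tiles.
FULL_RATE = [[(1, 1), (1, 1)], [(1, 1), (1, 1)]]
MIXED_RATE = [[(1, 1), (2, 2)], [(2, 2), (4, 4)]]

if __name__ == "__main__":
    full = shader_invocations(FULL_RATE)
    mixed = shader_invocations(MIXED_RATE)
    print(f"full rate: {full}, mixed rate: {mixed}, saved: {1 - mixed / full:.0%}")
```

The shading work saved does not translate one-to-one into frame rate, since other stages of the pipeline are unaffected, which is consistent with the more modest gain quoted above.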

Summary

The GPU improvement comes mainly from the increase in scale. The architecture itself sees only minor changes, chiefly to the cache system, yet the progress of the Gen11 integrated graphics is quite evident.

It may no longer be a mere afterthought: at 1080p with low quality settings it should be able to run games at 30 frames per second.

Uncore section

The uncore refers to everything outside the CPU cores and the GPU, that is, the System Agent and related blocks. Since Intel moved the memory controller and the PCI-E controller into the CPU with Nehalem, this part has not changed much. This time, however, Intel has added something new to it and upgraded many of the existing blocks.

Thunderbolt 3

One thing that has held back the adoption of Thunderbolt (hereinafter TB) devices is the relatively high cost of using the interface. Once TB3 adopted the USB Type-C connector its usage did rise, but another barrier remained: TB required an additional controller chip on the motherboard, and that chip is not cheap. With Ice Lake, Intel has finally integrated the TB controller into the processor. It neither takes up the PCI-E lanes provided by the processor nor squeezes onto the already crowded DMI 3.0 link to the PCH; it has its own stop on the ring bus.

Intel has also been generous, providing four TB3 ports at once, each with a full PCI-E 3.0 x4. In other words, the Ice Lake processor effectively has a total of 32 PCI-E 3.0 lanes, half of which are provided in the form of TB3. These ports naturally support USB mode as well; when running in USB 2.0 mode, traffic is routed through the PCH.

Of course, not every vendor will provide four TB3 ports; the actual configuration is up to the OEM. After all, additional discrete chips such as USB PD controllers add cost, and a TB port also needs an extra retimer chip. Intel has, however, halved the number of retimers required: only two retimers are needed for two TB3 ports.

However, integrating the TB controller into the CPU also makes the IO portion of the System Agent more complicated. The figure above is a detailed schematic. Each Type-C IO router (labeled CIO Router in the figure) has two PCI-E 3.0 x4 links to the CPU. The internal display engine (Display Engine in the figure) is also connected to the Type-C IO router, which controls the state of the Type-C port and decides which signal is sent. At the same time, the USB xHCI controller must be connected to the Type-C IO router as well, and memory coherency across the whole system has to be managed......

The complicated structure increases overall latency. Intel attributes this to power control: with the original discrete chip, power states were easy to manage, but after integration each block has its own power state, which calls for a more fine-grained power management system, and that adds latency overall. More fine-grained power management still has a benefit, though: it improves energy efficiency. Intel says that a fully loaded TB3 port, controller plus link layer, draws 300mW, so all four add up to only 1.2W.
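The lane and power figures above are simple sums, sketched below. The constant names are illustrative, and the 16 PCH lanes are the ones counted in the article's total of 32.

```python
TB3_PORTS = 4
LANES_PER_TB3_PORT = 4      # each port is a full PCI-E 3.0 x4
PCH_PCIE_LANES = 16         # the remaining half of the 32 lanes, via the PCH
POWER_PER_PORT_MW = 300     # Intel's figure for a fully loaded port


def total_pcie_lanes() -> int:
    """PCI-E 3.0 lanes exposed through TB3 plus those from the PCH."""
    return TB3_PORTS * LANES_PER_TB3_PORT + PCH_PCIE_LANES


def tb3_power_mw(active_ports: int = TB3_PORTS) -> int:
    """Power drawn by the given number of fully loaded TB3 ports, in mW."""
    return active_ports * POWER_PER_PORT_MW


if __name__ == "__main__":
    print(f"Total PCI-E 3.0 lanes: {total_pcie_lanes()}")
    print(f"Four loaded TB3 ports: {tb3_power_mw() / 1000:.1f} W")
```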

It is worth mentioning that Intel has already built in compatibility with USB4, but since USB4 is still at the draft stage, it cannot be ruled out that future revisions will break that compatibility. Note also that this architecture analysis covers only the mobile version of Ice Lake; Intel may not keep the integrated TB controller on desktop Ice Lake.

As an aside, TB3 integration is said to have been planned for Cannon Lake, but that chip died.

Memory controller

The memory controller now natively supports DDR4-3200 and LPDDR4X-3733. The memory controller on Skylake supports DDR4-2666 at most, and that only as of the eighth-generation Coffee Lake. As DDR4 has matured, modules rated at 3000 and above have become common, so native DDR4-3200 support is welcome. And as processor core counts rise, memory bandwidth is gradually becoming a performance bottleneck. In our tests, the impact of memory bandwidth on game performance is quite noticeable.

Previously, Intel's mobile low-voltage platform could use only LPDDR3 as low-power memory. One advantage of supporting LPDDR4/X is more performance at lower power, which matters a great deal in practice for Ice Lake with its much-improved graphics, because memory bandwidth directly affects the real-world performance of the GPU.
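The effect of the faster memory on peak bandwidth is easy to estimate: transfers per second multiplied by the bus width. The sketch below assumes a 128-bit total memory bus (for example dual-channel DDR4 with two 64-bit channels); the bus width is an assumption here, not a figure from this article, and real sustained bandwidth is lower than the theoretical peak.

```python
# Assumption: 128-bit total memory bus, e.g. two 64-bit DDR4 channels.
BUS_WIDTH_BITS = 128

# Transfer rates in MT/s for the memory types mentioned in the text.
MEMORY_SPEEDS_MTS = {
    "DDR4-2666": 2666,
    "DDR4-3200": 3200,
    "LPDDR4X-3733": 3733,
}


def peak_bandwidth_gbs(transfer_rate_mts: int, bus_width_bits: int = BUS_WIDTH_BITS) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 10^9 bytes)."""
    return transfer_rate_mts * bus_width_bits / 8 / 1000


if __name__ == "__main__":
    base = peak_bandwidth_gbs(MEMORY_SPEEDS_MTS["DDR4-2666"])
    for name, mts in MEMORY_SPEEDS_MTS.items():
        bw = peak_bandwidth_gbs(mts)
        print(f"{name}: {bw:.1f} GB/s ({bw / base - 1:+.0%} vs DDR4-2666)")
```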

GNA

As mentioned in the AI acceleration discussion for the core, the uncore gains GNA, a hardware acceleration unit for AI. Not much is known about it yet; even its full name exists in two versions. On the introduction page of Windows Machine Learning it is called the Gaussian Network Accelerator, whereas many articles on the Ice Lake architecture call it the Gaussian Neural Accelerator.

What is known so far is that this unit draws very little power and keeps working even when the rest of the SoC is powered down, so as to provide steady AI acceleration; its application scenario is speech recognition.

Image processing unit

The Image Processing Unit (IPU) on Ice Lake has been upgraded to its fourth generation. You have probably never heard of an image processing unit on an Intel CPU, but it has existed since Skylake, though only on the mobile dual-core models. It belongs to the DSP (Digital Signal Processor) category and provides image processing for the device's camera.

The IPU on Ice Lake provides 4K@30fps video capture, better hardware noise reduction, and support for more cameras. It can also present two different cameras, for example one capturing IR information and one capturing RGB information, as a single device.

Intel says it is exposing more IPU registers to software, to make things easier for applications and to support machine learning. It is also worth mentioning that Intel has moved the MIPI interface, previously integrated in the PCH, into the CPU, where it could be used to connect AI acceleration devices in the future.

Summary

The uncore can be said to have changed dramatically; it is arguably the part of Ice Lake that has changed most relative to Skylake. The built-in TB3 controller will certainly make life much more convenient, and the editor personally likes this improvement. The rest can be classed as routine feature updates.

PCH improvements

On the current Ice Lake platform the PCH and the CPU are packaged on the same substrate, so a PCH upgrade is also an upgrade of the whole platform. As before, the Ice Lake CPU connects to the PCH over a DMI 3.0 x4 link, which provides the same bandwidth as PCI-E 3.0 x4.
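For reference, the bandwidth of a PCI-E 3.0 x4 link, and hence of DMI 3.0 x4, can be worked out from the 8 GT/s per-lane signaling rate and the 128b/130b line encoding of PCI-E 3.0. A minimal sketch, with an illustrative function name:

```python
PCIE3_GTS_PER_LANE = 8.0        # giga-transfers per second per lane
PCIE3_ENCODING = 128 / 130      # 128b/130b line encoding efficiency


def pcie3_bandwidth_gbs(lanes: int) -> float:
    """Usable bandwidth per direction in GB/s, before protocol overhead."""
    bits_per_second = PCIE3_GTS_PER_LANE * 1e9 * PCIE3_ENCODING * lanes
    return bits_per_second / 8 / 1e9


if __name__ == "__main__":
    print(f"x1: {pcie3_bandwidth_gbs(1):.2f} GB/s")
    print(f"x4: {pcie3_bandwidth_gbs(4):.2f} GB/s")
```

Everything attached to the PCH, including its PCI-E lanes, SATA and USB ports, shares this roughly 4 GB/s link to the CPU, which is why the text earlier describes the DMI link as crowded.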

Reintroducing FIVR

FIVR (the fully integrated voltage regulator) was in fact introduced with the Haswell architecture but removed in Skylake, because at the time it did not perform well and increased overall power consumption and heat. On Ice Lake it returns, inside both the CPU and the PCH. Intel says this brings savings for the platform as a whole and simplifies the OEM's power delivery design. The new FIVR has higher power efficiency and is closely tied to the platform's energy-saving features. It seems Intel has solved some of FIVR's problems and can once again integrate it into the CPU and PCH.

CNVi 2

In fact, Intel has included its CNVi Wi-Fi solution in the chipsets shipped over the past two years. This solution moves part of the Wi-Fi adapter into the chipset, leaving the radio as an external module. The external Wi-Fi module can therefore be made very small, for example an M.2 2230 card or a 1216 package soldered directly to the motherboard. The network controller inside the PCH connects to the external RF module through Intel's proprietary CNVi link.

On Ice Lake's PCH this proprietary CNVi link has been upgraded to its second version, CNVi 2.

Of course, the supported Wi-Fi standard is still determined by the external Wi-Fi module, which makes OEM customization easy. Intel's move lowers the barrier for people to upgrade their Wi-Fi (so how about pushing AX router prices down too). Intel currently has two wireless adapters that support the Wi-Fi 6 standard: the AX200 and AX201.

For the specific enhancements in Wi-Fi 6, please refer to our earlier article: Super Class (188): Why can WiFi 6 be so "six"?

IO

Here is a simple list of the figures.

  • 6 USB 3.1 (5Gbps)/10 USB 2.0

  • 16 PCI-E 3.0 lanes, of which 8 are typically used for two NVMe interfaces

  • 3 SATA 3.0

  • eMMC 5.1

Intel did not mention UFS support.

Summary

The PCH has not changed very much; the changes are mainly routine feature improvements.

Package, Turbo, and power consumption

Multiple power targets and different packaging methods

Ice Lake-U and Ice Lake-Y are currently two series with different target TDPs, designed for 15~28W and 7~12W respectively. Future mobile standard-voltage parts will have a TDP of about 45W, and the desktop TDP is currently unknown.

The 11 low-voltage and ultra-low-voltage models released first also come in two different packages. The U series has not changed much and remains as before, while the ultra-low-voltage parts differ from previous ones: Intel uses a more compact package, and the contacts on the bottom are packed more tightly as well.

Dynamic Tuning 2.0

The changes in the new Dynamic Tuning 2.0 technology can be seen in the figure. The gist is that an Ice Lake processor no longer drops straight back to base frequency after only 18 seconds at Turbo frequency but ramps down slowly, so the whole process lasts 8 seconds longer than before. The new technology also uses machine learning to predict what type of load the CPU is about to face and then adjusts the power budget intelligently to maximize the time spent in Turbo.

Conclusion

Overall, the Ice Lake architecture is a very large-scale change, both in the core and in the components around it. People say Intel has been "squeezing toothpaste", that is, releasing only minimal upgrades. The competitive situation is admittedly one reason for that, but the bigger reason is probably the process technology problems Intel has run into over the past few years. Under Intel's Tick-Tock strategy, Cannon Lake was to be the process shrink of Skylake. Because 10nm had such a difficult birth, however, Tick-Tock broke down completely and became the PAO (Process-Architecture-Optimization) strategy, in which Cannon Lake was to play the role of the first 10nm generation. In the end 10nm slipped even beyond the PAO schedule, while the competitor's Zen and Zen+ architectures began to put pressure on Intel. With no alternative, Intel added two cores to Skylake on 14nm++ and used that to hold the line. The stopgap has now lasted almost two years; Cannon Lake was abandoned entirely, and many of the optimizations described above were inherited by Ice Lake.

Looking at the architecture as a whole, Ice Lake continues to raise single-threaded performance, and test results confirm this: matching the previous generation's single-threaded scores with lower base and boost frequencies is already no small feat. For multi-core, the limit of the ring bus should be around ten cores. If it does not adopt a mesh architecture, future Ice Lake processors from Intel will still lose to AMD's Zen 2/3.
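The trade-off between IPC and clock speed can be put into numbers with the usual first-order model, in which single-threaded performance is proportional to IPC multiplied by frequency. This is a simplification that ignores memory effects and workload differences, and the function names are illustrative. With the 18% average IPC gain quoted earlier, the new core matches the old one while running at roughly 85% of its clock.

```python
def relative_performance(ipc_gain: float, frequency_ratio: float) -> float:
    """Performance relative to the old core, modeled as IPC x frequency.

    ipc_gain is the IPC of the new core relative to the old one (1.18 = +18%),
    and frequency_ratio is new clock divided by old clock.
    """
    return ipc_gain * frequency_ratio


def break_even_frequency_ratio(ipc_gain: float) -> float:
    """Clock ratio at which the new core exactly matches the old one."""
    return 1.0 / ipc_gain


if __name__ == "__main__":
    ipc = 1.18  # average IPC gain over Skylake quoted in the article
    print(f"Break-even clock ratio: {break_even_frequency_ratio(ipc):.3f}")
    for ratio in (0.80, 0.85, 0.90, 1.00):
        print(f"clock x{ratio:.2f} -> performance x{relative_performance(ipc, ratio):.3f}")
```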

In terms of expandability, Ice Lake is fairly generous. The integrated TB3 controller means USB and TB devices no longer have to compete for the already scarce PCI-E 3.0 lanes, and compatibility with the future USB4 has been reserved. We can expect to see official USB4 support on an optimized or upgraded version of Ice Lake.

Ice Lake will be Intel's main architecture for some time to come, though it will take a while to reach the desktop. Intel's current product line is also very confusing, and when we get the chance we will devote a separate article to it.
