Can deep learning acceleration technology cure AI's "severe myopia"?

Author: Beauty has their own wonderful    Date: 2022.08.24

【Introduction】

"What black technology is the fulfillment of AI, fast, accurate and saving wish?"

Is there anything sadder than going bald?

Yes. For example, an intelligent robot mistakes the back of a bald head for a face without a mask, and follows it all the way, reminding the person to put one on.

AI applications are everywhere today, and along the way they have produced plenty of "artificial stupidity" jokes. The one above is just one of them; besides outrageous image recognition results, there are also intelligent chatbots whose answers have nothing to do with the questions.

Since the third wave of AI development, represented by deep learning, artificial intelligence has been widely applied in scenarios such as object detection, image recognition, and natural language processing (NLP). From speech recognition and food-delivery robots to image-based monitoring on production lines, AI is everywhere.

Customers' business needs and innovative applications place ever stricter requirements on the efficiency and quality of AI inference and training, and the three pillars driving artificial intelligence forward, namely data, computing power, and algorithms, all need further tuning to work together efficiently.

Having the best of both worlds is rare enough in this world. Wanting all three in one thing is genuinely difficult, and frankly "greedy".

But "artificial mental retardation" really needs "greedy".

Data precision, storage space, and processing speed: AI development needs all three to advance together

There are some inherent tensions among data, computing power, and algorithms.

Generally speaking, the wider the data type, the greater the dynamic range and precision it can express.

Greater dynamic range and higher precision mean more storage space. For example, FP32 occupies twice the memory of FP16, which puts more pressure on memory bandwidth and, in turn, on computing power.
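As a quick illustration (a minimal sketch using NumPy, not tied to any particular AI framework), the memory footprint of the same tensor in different data formats can be compared directly:

```python
import numpy as np

# One million parameters stored in different data formats.
n = 1_000_000
fp32 = np.zeros(n, dtype=np.float32)   # 32-bit floating point
fp16 = np.zeros(n, dtype=np.float16)   # 16-bit floating point
int8 = np.zeros(n, dtype=np.int8)      # 8-bit integer

print(fp32.nbytes)  # 4000000 bytes (4 MB)
print(fp16.nbytes)  # 2000000 bytes -- half the memory of FP32
print(int8.nbytes)  # 1000000 bytes -- a quarter of FP32
```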

At the same time, although deep learning has been the hero of artificial intelligence (AI) in recent years, it is also a huge "black hole" that devours computing power.

Balancing these three remains difficult. At the level of data types, saving storage space requires some concession or sacrifice. For example, the bfloat16 (BF16) data type that Google introduced to accelerate AI deep learning uses the data width of FP16 to achieve roughly the same dynamic range as FP32, at the cost of reduced precision.
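The trade-off is easy to see from the bit layout: BF16 keeps the sign bit and the 8 exponent bits of FP32 (hence the similar dynamic range) but only 7 of its 23 mantissa bits. Below is a minimal sketch of that truncation using NumPy bit manipulation rather than any hardware BF16 support; real conversions also round rather than simply truncate:

```python
import numpy as np

def fp32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Truncate FP32 to BF16 by keeping only the upper 16 bits
    (sign + 8 exponent bits + 7 mantissa bits)."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & 0xFFFF0000).view(np.float32)  # lower 16 mantissa bits dropped

x = np.array([3.14159265, 1e30, 1e-30], dtype=np.float32)
print(fp32_to_bf16_bits(x))
# Very large and very small magnitudes survive (FP32-like dynamic range),
# but only about 2-3 significant decimal digits of precision remain.
```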

This is just one aspect. To be "accurate, economical, and fast" all at once, all three elements must be addressed together: simplify the data, strengthen the computing power, and optimize the algorithms.

Intel deep learning acceleration technology: accurate, economical, and fast, with low precision delivering high efficiency!

Innovation on the algorithm side comes first.

As mentioned above, most deep learning applications use 32-bit floating point (FP32) in their training and inference workloads. Although its precision is high, it occupies more memory, which hurts computational efficiency.

When the data format is converted from FP32 to 8-bit integer (INT8) or 16-bit floating point (BF16), the same memory can move more data, making far better use of computing resources.

Caption: The effect of different data formats on memory utilization

Will this reduction in precision affect the accuracy of the results?

The answer is: no, or only minimally.

In recent years, much research and practice have shown that performing deep learning training and inference in lower-precision data formats has little impact on the accuracy of the results; the loss of accuracy is minimal, and in some cases there is no loss at all.

The benefits of low-precision data formats go beyond memory utilization. In the multiply-accumulate operations that dominate deep learning, they also reduce the consumption of processor resources and deliver a higher rate of operations per second (OPS).
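A toy example of why the accuracy loss can be so small: post-training quantization typically maps FP32 values to INT8 with a per-tensor scale, and the round-trip error stays well below the noise level of most models. This is a minimal, framework-free sketch; the symmetric per-tensor scale scheme here is a simplification chosen purely for illustration:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: FP32 -> INT8 plus a scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(1000).astype(np.float32)   # pretend these are model weights
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(np.abs(w - w_hat).max())   # worst-case round-trip error, on the order of scale/2
```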

The algorithm upgrade takes care of "accurate" and "economical", but on "fast" it still falls slightly short.

To preserve accuracy during inference, when matrix operations run in the CPU's vector processing units, multiplying 8-bit values and accumulating into 32-bit results originally required three instructions, so peak compute performance improved by only about 33%.

Then the "speed -up" task is given to the computing power unit.

The essence of Intel® Deep Learning Boost (DL Boost) technology is to integrate operations on low-precision data formats into the AVX-512 instruction set, namely AVX-512_VNNI (Vector Neural Network Instructions) and AVX-512_BF16 (bfloat16), providing support for INT8 (mainly for inference) and BF16 (for both inference and training).
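Conceptually, what one AVX-512_VNNI instruction (VPDPBUSD) does for each output lane is multiply 8-bit values and accumulate the products into a 32-bit accumulator in a single step. A minimal NumPy sketch of that per-lane arithmetic (this models the math only, not the actual SIMD instruction or its register layout):

```python
import numpy as np

def vnni_lane(acc: np.int32, a: np.ndarray, b: np.ndarray) -> np.int32:
    """Model of one VPDPBUSD lane: four u8 x s8 products summed into an int32 accumulator."""
    assert a.dtype == np.uint8 and b.dtype == np.int8 and a.size == b.size == 4
    return acc + np.int32(np.sum(a.astype(np.int32) * b.astype(np.int32)))

a = np.array([255, 10, 3, 7], dtype=np.uint8)   # unsigned 8-bit activations
b = np.array([-128, 2, 5, -1], dtype=np.int8)   # signed 8-bit weights
print(vnni_lane(np.int32(0), a, b))             # -32612, accumulated without overflow
```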

Caption: Intel® Deep Learning Boost technology improves training and inference efficiency

With this in place, Intel® DL Boost technology lets artificial intelligence achieve all three goals at once, namely:

Economical: simplified data improves the utilization of memory capacity and bandwidth, relieving memory pressure;

Accurate: optimized algorithms ensure that quantizing a model to lower numerical precision still preserves the accuracy of the results, especially in inference applications;

Fast: strengthened computing power avoids extra instructions, ensuring that performance improves in step with memory utilization.

Inference and training: a two-pronged hardware acceleration approach

Think of a road: shrink the size of each vehicle (simplify the data) and more vehicles can get through. The two new AVX-512 instruction sets for AI applications let more vehicles (data) travel on the same road (registers), so the new instructions undoubtedly improve computing efficiency.

Click here to review the past and present of the Intel AVX instruction set: "This 15-year-old 'veteran' technology lets the CPU shine in the AI inference era"

The advantages of these two instruction sets also differ.

Starting with the second-generation Intel® Xeon® Scalable processors, code-named Cascade Lake, the AVX-512 instruction set gained VNNI, so a single FMA instruction can complete the 8-bit multiplication and accumulate the result into 32 bits.

The third-generation Intel® Xeon® Scalable processors, launched in 2020, integrate both AI acceleration instruction sets of Intel® Deep Learning Boost, and have been widely used in the training and inference of commercial deep learning applications.

With instruction-set support, the data can be simplified without incurring extra overhead, so performance improves in step with memory utilization. Using AVX-512_VNNI for INT8 inference can theoretically deliver 4x the performance of FP32 while cutting memory requirements to one quarter.

Lower memory traffic and higher throughput speed up low-precision operations, ultimately accelerating AI and deep learning inference, which suits image classification, speech recognition, speech translation, object detection, and similar workloads.

The advantage of AVX-512_BF16 is that it can be used for both inference and training, and can improve training performance by up to 1.93x. The third-generation Intel® Xeon® Scalable processors, code-named Cooper Lake, integrate this bfloat16 acceleration, improving performance and memory utilization at accuracy similar to FP32.

Software and hardware work together to build an "AI refit workshop"

Computing power, algorithms, and data now each have their own standards and solutions. When the three form a complete closed loop, how can the whole process become even more efficient?

There is still room for improvement, namely: when AI applications do not share a common native data format, how can the large number of existing AI models in the traditional FP32 format be efficiently converted to BF16 or INT8?

The OpenVINO™ toolkit launched by Intel provides model quantization features and offers a good recipe for exactly this problem.

It takes FP32 AI models built with different AI frameworks, such as TensorFlow, MXNet, and PyTorch, and converts them into INT8 or BF16 formats with very little loss of accuracy.
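A rough sketch of what this looks like in code, assuming a recent OpenVINO™ 2.0-style Python API and a model already converted to OpenVINO IR; the model file name is hypothetical, and exact property names and quantization workflows vary by version, so treat this as illustrative rather than canonical:

```python
from openvino.runtime import Core

core = Core()

# Load an FP32 model originally trained in TensorFlow, PyTorch, etc.
# and already converted to OpenVINO IR (hypothetical file name).
model = core.read_model("breast_cancer_detector.xml")

# Ask the CPU plugin to run inference in bf16 where the hardware supports it
# (e.g. 3rd-gen Xeon Scalable with AVX-512_BF16); the weights on disk stay FP32.
compiled = core.compile_model(model, "CPU", {"INFERENCE_PRECISION_HINT": "bf16"})

# Full INT8 quantization is a separate, calibration-based step
# (e.g. via NNCF / the Post-training Optimization Tool), not a compile flag.
```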

Beyond model quantization, for a range of AI application scenarios such as computer vision, automatic speech recognition, natural language processing, and recommendation systems, the OpenVINO™ toolkit also provides components that improve development and deployment efficiency, such as OpenVINO™ Model Server and OpenVINO™ Model Zoo. These can further optimize trained models built with frameworks such as TensorFlow, PyTorch, MXNet, and Keras, and simplify the process and reduce the time needed to deploy them.

There are many AI application scenarios. Which ones best highlight the advantages of deep learning acceleration technologies such as AVX-512_BF16?

For example, in medical imaging, where accuracy matters most, Huiyi Huiying adopted second-generation Intel® Xeon® Scalable processors together with the OpenVINO™ toolkit for its breast cancer image analysis scenario. After converting and optimizing the detection model to INT8, inference speed increased by up to 8.24x compared with the original solution, while the loss of accuracy was less than 0.17%.

When enterprises begin building AI applications, tearing everything down and starting over is not a cost-effective choice. A better path is to fully evaluate the existing data storage, processing, and analysis platforms and build AI applications on top of them.

What's more, the CPU's own AI capabilities continue to evolve. Intel's upcoming fourth-generation Xeon® Scalable processors, code-named Sapphire Rapids, add Advanced Matrix Extensions (AMX) technology.

AMX is a new x86 extension with its own tile registers and compute units, aimed primarily at the tiled matrix multiplications that are so important in the AI field. It is more complex than the two previous DL Boost implementations. How well does it work? Let's wait and see (microscope in hand)~
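For intuition only (this is not the actual AMX programming model, which operates on dedicated tile registers via instructions such as TDPBSSD), here is a minimal NumPy sketch of the tiled INT8 matrix multiplication pattern AMX is built around, with each tile product accumulated in INT32:

```python
import numpy as np

def tiled_int8_matmul(A: np.ndarray, B: np.ndarray, tile: int = 16) -> np.ndarray:
    """Multiply INT8 matrices tile by tile, accumulating each tile product in INT32."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and A.dtype == np.int8 and B.dtype == np.int8
    C = np.zeros((M, N), dtype=np.int32)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                a = A[i:i+tile, k:k+tile].astype(np.int32)
                b = B[k:k+tile, j:j+tile].astype(np.int32)
                C[i:i+tile, j:j+tile] += a @ b   # one "tile multiply-accumulate"
    return C

A = np.random.randint(-128, 128, (64, 64), dtype=np.int8)
B = np.random.randint(-128, 128, (64, 64), dtype=np.int8)
assert np.array_equal(tiled_int8_matmul(A, B),
                      A.astype(np.int32) @ B.astype(np.int32))
```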

- END -
