Qi Guihai: Four key issues in the development of DPU

Author: Chinese science and technology  Time: 2022.08.24

When the DPU concept first appeared, people argued over how to define it, but it soon became clear that a definition alone cannot explain what a DPU can do, what role it plays, or how it can best cooperate with existing systems. This article discusses four key issues in the development of DPUs: What is a DPU? Can DPUs be standardized? What are the challenges of DPU industrialization? And is there a "Chinese plan"? Some of these questions are still hard to answer precisely, but they are raised here to draw everyone's attention.

What is DPU?

The DPU is a newly developed kind of dedicated processor, but its name is not as self-explanatory as those of some earlier processors. Take the GPU: the name itself tells everyone what it does, and the same goes for the digital signal processor (DSP), the deep learning processor (NPU), and so on. In fact, the CPU is also a concept whose name explains little. Nobody argued much 50 years ago about what "central" really means, yet what the CPU needs to do and what role it plays in the system were clear, and that is actually the primary question. By comparison, the so-called "definition" matters much less. In short, what the DPU's reference architecture looks like, what types of workloads it can handle, and how it integrates into existing computing systems are the key problems that DPU research and development must solve.

A DPU is a data processing unit for the infrastructure layer. For this reason, Intel calls its DPU an "IPU" (Infrastructure Processing Unit). The infrastructure layer is distinct from the application layer: it is the logical layer that provides applications with physical or virtualized resources, and even with basic services. The concept is easy to grasp. Looking at the macro logical layering of a computing system, we conventionally divide it into the infrastructure layer (IaaS), the platform layer (PaaS), and the software layer (SaaS), with the application layer on top. Viewed this way, things become clearer. The infrastructure layer mainly consists of components that interact with hardware resources and abstract hardware functions, including networking, storage, servers, and so on. From the standpoint of optimization, the closer a component is to the bottom, the more its design is driven by performance first and the more "machine-dependent" it is. Upper layers, by contrast, rely on layer-by-layer encapsulation that hides the differences underneath and keeps them transparent to users.

[Figure: The DPU is a data processing unit for the infrastructure layer]

So, can't the GPUs, routers, and switches in existing data centers continue to serve as the "data processing units for the infrastructure layer"? Research on computing systems is largely research on "optimization". It is not that the existing infrastructure cannot do the job, but that it is not "optimized" enough. Without the invention and introduction of new technologies, the gap between demand and supply will only become more pronounced.

The first problem a DPU must solve is network packet processing. Traditionally, the network card handles layer-2 data frames, while the kernel protocol stack of the OS running on the CPU handles sending and receiving network packets. When network bandwidth is relatively low, this overhead is not a big problem, and even the cost of interrupts is acceptable. However, as core and aggregation networks move toward 100G and 200G and access networks reach 50G and 100G, the CPU can no longer provide enough computing power to process the packets. We have observed a phenomenon that we describe in terms of the "performance-to-bandwidth growth ratio". Put simply, CPU performance growth has slowed with the slowdown of Moore's law, while network bandwidth, driven by the expansion of data centers and the progress of digitalization, is growing even faster, which further aggravates the computing burden on the CPU of each server node.
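To give a rough sense of why the CPU runs out of headroom, the following back-of-the-envelope sketch estimates how many CPU cycles are available per packet at line rate. The numbers are assumptions made for illustration, not figures from this article: minimum-size 64-byte Ethernet frames, 20 bytes of per-frame wire overhead (preamble, start delimiter, and inter-frame gap), and a single 3 GHz core doing all the work.

```python
# Back-of-the-envelope estimate: CPU cycles available per packet at line rate.
# All constants below are illustrative assumptions, not measurements.

ETH_MIN_FRAME_BYTES = 64   # minimum Ethernet frame size
WIRE_OVERHEAD_BYTES = 20   # preamble + start delimiter + inter-frame gap


def packets_per_second(link_gbps, frame_bytes=ETH_MIN_FRAME_BYTES):
    """Maximum packet rate of a link carrying back-to-back frames."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_gbps * 1e9 / bits_on_wire


def cycles_per_packet(link_gbps, core_ghz=3.0, frame_bytes=ETH_MIN_FRAME_BYTES):
    """Cycles one core can spend on each packet if it must keep up with the link."""
    return core_ghz * 1e9 / packets_per_second(link_gbps, frame_bytes)


if __name__ == "__main__":
    for gbps in (10, 25, 50, 100, 200):
        pps = packets_per_second(gbps)
        budget = cycles_per_packet(gbps)
        print(f"{gbps:>4} Gbps: {pps / 1e6:8.1f} Mpps, {budget:7.1f} cycles/packet")
```

Under these assumptions, a 100G link leaves roughly 20 cycles per minimum-size packet on one core, compared with roughly 200 cycles at 10G. Real traffic uses larger packets and several cores, so this is only a worst-case illustration of the trend.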

Another example is a core application in cloud computing: forwarding data between virtual machines, that is, OVS (Open vSwitch). Generally speaking, the forwarding needed for 20 VMs, if handled by a multi-core Xeon CPU, takes roughly 5 cores' worth of computing power, which is indeed a sizable overhead.

In addition, today's system architecture was not designed for processing network data. It was designed to manage local resources efficiently and to support multiple users, multitasking, local security, and moderate concurrency. That is why it separates instruction execution and access rights into different privilege levels and adopts a complex interrupt mechanism. These mechanisms are inefficient for high-bandwidth networks, random access, and highly concurrent sending and receiving. Existing technology has therefore developed user-space access mechanisms that bypass the operating system kernel entirely and use polling in place of interrupts to handle IO operations. Such "patches" on top of the current system show, in essence, that classic techniques adapt poorly to new scenarios.

To better understand the role of the DPU in a system, we can use a classic computing system model that divides the system into three parts by logical function: 1) the data plane, usually defined as the data path for packet parsing and processing, which represents the compute-intensive and data-intensive part of the functionality; 2) the control plane, usually defined as the set of algorithms that decide how input and output data streams are handled, which represents the control-intensive part. In addition, the industry usually adds a third: 3) the management plane, which covers system monitoring, fault isolation, online repair, and other periodic or occasional tasks. This is in fact also the division used by software-defined networking (SDN).

If we compare SDN to a city's road network, the crisscrossing roads are its "data plane", whose density and width determine the upper limit on traffic. The traffic signals are its "control plane", whose control algorithms and placement determine the traffic capacity actually achieved. The speed-measurement points, flow monitoring, temporary traffic control, and the clearing of accidents and congestion are its "management plane". With this infrastructure in place, users can run all kinds of vehicles (the equivalent of user applications) to provide transport services.

The planes usually differ considerably in their requirements for parallelism, performance, flexibility, and reliability. For the data plane, the dominant requirement is performance: by exploiting data-level, thread-level, and task-level parallelism and using highly customized dedicated compute units, every design optimization is oriented toward performance. For the control plane, the main requirement is generality and flexibility, which makes it easy to hand control over the data plane to the user. For the management plane, the main requirements are security, reliability, and ease of use, so that system status can be monitored and maintained easily and automated operations and maintenance mechanisms can be implemented.

Why look at the role of the DPU through these three planes? Because they reflect what needs attention during DPU design. Some people simply see the DPU as something that "reduces the burden" on the CPU, treat it as a "variant" of the network card and a merely passive device, and regard it as nothing more than a hardware carrier for algorithms. That is a design that simply pursues a strong data plane and a weak control plane. Typical examples include data encryption, image transcoding, and AI accelerator cards. This is the "1.0 era" of heterogeneous computing.

If we re-examine how system functions are distributed, we see that the DPU looks less and less like a simple accelerator and more like a key component that works alongside the CPU. In the traditional classic computing system, which we call Type-I, the host carries all management, control, and data plane functions. When a device is added to accelerate compute-intensive algorithms, it mainly offloads data plane computation and rarely touches control and management; we call this Type-II. A typical sign is that such a device can only be discovered from the host side, and starting it, shutting it down, and assigning tasks to it are all inconvenient. With the emergence of smart NICs and similar forms, the device has a complete control plane in addition to its data plane strengths; we call this Type-III. For example, an on-board ARM controller runs a lightweight operating system to manage the resources on the board; this is also a fairly common type. Finally, in Type-IV the DPU takes on all data plane, control plane, and management plane functions, and the computing system is built around it. Not long ago, Alibaba Cloud announced its CIPU (Cloud Infrastructure Processing Unit), claiming that it replaces the CPU as the core hardware of the new generation of cloud computing.
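The four types can be summarized by asking which planes the device, rather than the host, is responsible for. The sketch below is only an illustration of that reading: the names `Plane` and `classify` are hypothetical, and treating each type as an exact set of offloaded planes is a simplification (the text says Type-II devices "rarely" touch control and management, not never).

```python
from enum import Flag, auto


class Plane(Flag):
    """Logical planes that a device may take over from the host."""
    NONE = 0
    DATA = auto()
    CONTROL = auto()
    MANAGEMENT = auto()


ALL_PLANES = Plane.DATA | Plane.CONTROL | Plane.MANAGEMENT


def classify(device_planes: Plane) -> str:
    """Map the planes handled on the device to Type-I .. Type-IV."""
    if device_planes == Plane.NONE:
        return "Type-I"      # host does everything
    if device_planes == Plane.DATA:
        return "Type-II"     # plain accelerator: data plane only
    if device_planes == (Plane.DATA | Plane.CONTROL):
        return "Type-III"    # smart-NIC style: data plane plus control plane
    if device_planes == ALL_PLANES:
        return "Type-IV"     # DPU carries data, control, and management planes
    return "unclassified"    # combinations the article does not describe


if __name__ == "__main__":
    for planes in (Plane.NONE, Plane.DATA, Plane.DATA | Plane.CONTROL, ALL_PLANES):
        print(planes, "->", classify(planes))
```

Combinations the article does not discuss, such as a device that handles only the control plane, are deliberately reported as "unclassified" rather than forced into one of the four types.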

Now let's look at what a DPU can do. We divide DPU application scenarios into four directions: networking, storage, computing, and security. These four directions actually depend on one another. In the figure, adjacent parts have some dependency between them. The computing part involves a good deal of PaaS content, the networking part sits more in the IaaS layer, and storage and security span both the IaaS and PaaS layers. Covering as many of these scenarios as possible is what DPU vendors are currently striving for.

[Figure: DPU application scenarios]

These functions are reflected in the product structure of the second-generation DPU architecture we developed. The architecture contains several innovative functional units: NOE, an upgraded version of the traditional TOE (TCP offload engine); DOE, dedicated to accelerating data queries; and DOMS, a structure for efficiently managing on-chip cache data. Other innovative structures include the Flashnoc on-chip interconnect technology and a variety of DMAs for specific IO.

Finally, the greatest driving force behind the DPU still comes from the demand side. Data centers have evolved from locally deployed clusters 20 years ago, to cloud-based resources ten years ago, to today's cloud-native stage. The infrastructure layer keeps getting thicker, and hardware resources are increasingly pooled. Systems such as K8s have become the new "operating system", service meshes have become the foundation of new network application development, and DevOps integrates development with operations. While "productivity" improves, all of this directly drives up the demand for computing power, especially at the IaaS and PaaS layers, and that is the main battlefield of the DPU.

Can DPU be standardized?

Before answering whether DPUs can be standardized, we need to be clear about what standardization means exactly and why we want it. In my view, DPU standardization has two aspects: whether the DPU's architecture can be standardized, which affects DPU research and development costs; and whether DPU applications can be standardized, which affects the DPU application ecosystem.

There is currently a common misconception: the DPU is a dedicated processor; since it is "dedicated", it must be "customized"; and since it is customized, standardization is out of the question. Hence an arbitrary conclusion: the DPU has no industrial value!

In fact, the three concepts of dedication, customization, and standardization do not have a direct causal relationship.

Dedication is about the application scenario, and whether something should be dedicated depends on how rigid the demand is. Customization is a choice of technical path and is often the "birthplace" of innovation and core technology. Standardization is about reducing marginal cost: it usually monetizes innovative technology by building or joining an industrial ecosystem and achieving economies of scale.

Take the GPU. It is undoubtedly a "dedicated" processor, because people's demand for interacting with information through graphics and images is absolutely rigid. The GPU is also customized: highly customized functional units such as the raster operations processor (ROP) and the texture processor (TPC), as well as synchronous parallel processing of very large data sets, are all techniques tailored to pixel-level massive data processing. Finally, the GPU was standardized through graphics APIs such as OpenGL and DirectX and through the CUDA general-purpose programming framework. Therefore, "dedicated" is not inferior to "general-purpose", and "customization" is even a technical choice that some applications must adopt.

Last year we published an article, "DPU: Data-centric Special Processor", in the Communications of the China Computer Federation. It includes a figure showing how the current types of processors are distributed by their characteristics: by function, compute-intensive vs. IO-intensive; by structural design, control-centric vs. data-centric. The figure shows that the region where the DPU sits is indeed still largely blank. A simple reading is that, since the other three regions each have a healthy industrial landscape, there is no reason for the DPU's region not to have one.

Our team has also made some contributions to DPU standardization. First, we organized the writing of the industry's first DPU technology white paper, which comprehensively described the DPU's feature set and application scenarios and gave a fairly general DPU reference design model. Building on that, this year we organized a second technical white paper, whose focus shifted from the DPU reference design to DPU performance evaluation methods, as a reference for designing benchmark programs for more specific application segments later on.

In my view, DPU standardization is a process, not a goal. The standardization process largely goes hand in hand with marketization. The purpose of standardization is therefore marketization, and progress in marketization in turn promotes standardization work.

DPU industrialization challenge

The DPU mainly operates at the infrastructure and platform layers, which means that at this stage DPU optimization is mainly performance-oriented. This is a particularly tough nut to crack. Some current DPU designs rely too heavily on general-purpose cores. Flexibility is guaranteed, but performance often falls short, and customers simply will not pay for it. With good performance and poor flexibility, customers will still give it a try; the other way round, there is no chance.

Here I will describe a challenge that everyone will relate to: product adaptation. A DPU has to be adapted to different CPU platforms and different operating systems. "Adaptation" is easy to say and hard to do, because the workload faces an "exponential explosion". For example, the NOE function in our DPU delivers the best low-latency performance in the DPU industry: the 1/2 RTT loopback latency of TCP and UDP on x86 can reach 1.2 us or even lower. Beyond hardware offloading, the Instanta™ NOE SDK in Yusur Hados also has to be deeply optimized for each CPU architecture. So when we adapted to the Kunpeng CPU + openEuler operating system, we had to resolve and optimize many differences between the ARM and x86 architectures, eventually reaching a leading low latency of 1.6 us for the 1/2 RTT of TCP and UDP. However, when we thought we could then easily adapt to "Kunpeng CPU + Kylin operating system", many new problems appeared, such as differences in Kylin's interrupt handling, which required a new round of performance optimization.

In view of this, we proposed an automated platform for compiling, releasing, and testing across multiple ecosystems (ADIP). It decomposes the adaptation work into two four-stage pipelines, one for software adaptation on the host side and one for software adaptation on the DPU side. This integrated development platform already supports adapting the DPU to multiple domestic CPUs and operating systems and is still improving rapidly. Although the automation of our ADIP process still needs work, the division into stages already guides the engineering team's collaboration very efficiently.
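The growth in adaptation work can be made concrete by enumerating the job matrix that such a platform has to cover. The sketch below is a hypothetical illustration, not ADIP itself: the function name `adaptation_jobs` and the stage names `stage-1` to `stage-4` are placeholders (the article does not name the four stages), and pairing every CPU with every OS is an assumption made only to show how the combinations multiply.

```python
from itertools import product

CPUS = ["x86", "Kunpeng"]        # CPU platforms mentioned in the article
OSES = ["openEuler", "Kylin"]    # operating systems mentioned in the article
SIDES = ["host", "dpu"]          # the two pipelines: host side and DPU side
STAGES = ["stage-1", "stage-2", "stage-3", "stage-4"]  # placeholder stage names


def adaptation_jobs(cpus, oses, sides=SIDES, stages=STAGES):
    """Return one job per (cpu, os, side, stage) combination."""
    return [
        {"cpu": cpu, "os": os_name, "side": side, "stage": stage}
        for cpu, os_name, side, stage in product(cpus, oses, sides, stages)
    ]


if __name__ == "__main__":
    jobs = adaptation_jobs(CPUS, OSES)
    print(len(jobs), "jobs for", len(CPUS), "CPUs and", len(OSES), "OSes")
    # Adding a single hypothetical CPU platform grows the whole matrix.
    more = adaptation_jobs(CPUS + ["another-cpu"], OSES)
    print(len(more), "jobs after adding one more CPU platform")
```

Because every new CPU or OS multiplies the number of jobs instead of adding to it, dividing the work into fixed pipeline stages lets each stage be automated and tracked per combination, which is the benefit the text attributes to the stage division.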

[Figure: Automated multi-ecosystem compilation, release, and test platform: Hados ADIP]

The above describes only one of the larger challenges we met in developing the DPU, along with the engineering solution we proposed for it. The DPU also faces other challenges. Some are common to the domestic integrated circuit design industry today, such as supply chain problems in chip manufacturing and the shortage of high-level R&D talent. Others are specific to the DPU: for example, demand is diverse and there is a mismatch between that diverse demand and DPU design functions, and the DPU software ecosystem is not yet mature enough.

Is there a "Chinese plan" for the development of DPU?

Is there a path of our own, a "Chinese plan", for the development of the DPU? This is a question we keep thinking about, without a conclusion so far. Although the DPU itself knows no national borders, the industrialization of the DPU may still find a path suited to China's national conditions.

In the history of computing systems, three factors determine whether a type of product or technology can succeed commercially. The first is "performance", which depends on the invention of innovative structures and algorithms and on the adoption of innovative technologies. The second is "productivity", which relates to development efficiency, system compatibility, and learning cost. The third is "cost", which involves economies of scale, the level of engineering, the supply chain, and service costs.

First, DPU performance is partly a design issue (whether the architecture is good, whether the functions are complete, and so on) and partly a chip manufacturing issue. Judging from the functions and specifications of our design, the DPU we developed in-house is not at a disadvantage compared with some DPUs that have been announced, and even leads on certain individual metrics such as latency. However, our advantages are local technical advantages. NVIDIA's and Marvell's products build on functional modules from previous generations of related products, have more mature architectures, and use more advanced processes (such as 7nm); in terms of overall product strength, they objectively still hold a certain advantage. The overall DPU landscape is therefore still a typical case of "strong in the West, weak in the East".

However, China's demand for computing power is currently the strongest in the world, and its server demand is growing faster than anywhere else. At the national level, there is the grand plan for "computing power infrastructure" within "new infrastructure", the construction of the "computing power network", and so on. This provides opportunities not only for the DPU but for information technology and computing technology as a whole. The Chinese are good at "crossing the river by feeling the stones". We firmly believe in this, and we look forward to working with colleagues in the industry to explore a "Chinese solution" that leads the development of new technologies such as the DPU.

- END -