An inexplicable failure ...
Author: Fresh Jujube Classroom  Date: 2022.07.15
Hello everyone, I am Xiaole, an ordinary network engineer.
A few days ago, I saw news reports that communication networks in Japan, Canada, and elsewhere had failed, causing large-scale outages. After the initial shock, I remembered that not long ago I too had run into a very strange network failure, one that nearly caused a major incident.
Thinking about that failure still gives me a scare.
Today, I will tell you my story:
I work for a large state-owned enterprise, where I am mainly responsible for network maintenance.
Our network carries many kinds of services, and some of them place high demands on real-time performance and reliability.
For historical reasons, most of our network equipment comes from a large foreign vendor (let's call it Company S, here and below).
Our network is extremely large, and because Company S's proprietary spanning tree protocol was entrenched from the start, it is difficult to replace the whole network with domestic equipment.
The fault occurred one day during this year's epidemic.
That day, staff were working in rotation, so fewer people were on site. Near the end of the workday, while I was carrying out routine inspection, the integrated monitoring system suddenly started sounding alarms, and alert dialogs popped up one after another without end.
Looking closer, many devices were alarming, and one alert read: the IP address of core network switch B (let's call the model the Type 9) is abnormal!
The situation was urgent, so several colleagues and I rushed downstairs and headed straight for the equipment room. In the panic, one colleague nearly lost a shoe.
When we reached the equipment room, the colleague on duty asked:
"It's almost quitting time. What are you all rushing around for?"
"Don't ask, we've got trouble!"
We got to the cabinet of core switch B and took a look. Wow: apart from the power light, not a single light on the device was on! What was going on?!
A colleague quickly brought a laptop, connected the console cable, and logged in. The screen showed nothing but a ">" prompt, with none of the familiar command-line interface!
The system runs as a redundant A/B pair. We quickly moved the console cable to switch A. Thank goodness, everything on switch A was normal.
Over the years, we have regularly run switchover drills on the core devices to verify that a single switch can carry the network on its own. It seems those drills were not in vain.
With switch A carrying the load, services were not interrupted, and we could finally breathe a sigh of relief.
Once we had calmed down, we quickly contacted the maintenance contractor. While waiting, we also tried a few recovery steps of our own in the equipment room.
Frankly, in more than ten years as a network engineer I have seen plenty of board failures, but this was the first time I had seen an entire device go down.
I first tried pulling out the engine and reseating it, but the device did not respond. So I resorted to the ultimate trick and powered off the whole device: I unplugged all four power cords, waited half a minute, and plugged them back in. We were in luck: the console began to show the self-test, and after ten-odd minutes the device booted and everything returned to normal! Sure enough, a reboot is still the best cure!
Although the fault had cleared, we still had to find the cause. So we collected a large amount of output with show tech and sent it to the maintenance contractor, who opened a "case" with Company S (reporting the problem and creating a trouble ticket).
Then, while we were still waiting for feedback, just a few days later, core switch A failed too!
The symptoms were identical: the status lights were completely out and the system did not respond.
With the previous experience behind us, this time we went straight to a power cycle. After ten-odd minutes, switch A was back to normal, and the whole thing was handled quickly and cleanly.
This was very puzzling: last time it was B, this time it was A. Could this fault be contagious, like COVID-19? Had switch A and switch B become brothers in misfortune? Was Company S's equipment really so unreliable now? It had been in service for only a little over three years. Why did it go down?
At one point, we even suspected the sun.
That was because another model from the same company had once had a service board failure, and the conclusion in that "case" was that solar activity had recently been frequent and sunspot eruptions had disturbed signals inside the device, causing the service board to restart (囧). Because of that, I had even bookmarked the website of the solar activity forecast center of the National Astronomical Observatories, Chinese Academy of Sciences.
While blaming the sun, we kept pressing Company S to follow up on the "case" as soon as possible!
When the "case" result came back, all of us were speechless.
The "case" said this was a known bug, and the problem lay in the solid-state drive.
It turned out that the engines of this Type 9 series of switches use a certain model of solid-state drive. After 28224 hours of use, the drive locks itself, which brings the engine down. Note that these are cumulative hours: shutting down and restarting does not reset the counter.
Do the math: 28224 hours is 1176 days, a little more than 3 years.
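For readers who want to check the arithmetic, here is a minimal Python sketch of the conversion. The only input taken from the story is the 28224-hour threshold; the 365-day year is a simplifying assumption.

```python
# Convert the SSD's cumulative power-on limit into days and years.
LOCK_HOURS = 28224  # threshold quoted in the vendor "case"

days = LOCK_HOURS / 24
years = days / 365  # assumption: 365-day years, leap days ignored

print(f"{LOCK_HOURS} hours = {days:.0f} days = about {years:.2f} years")
# prints: 28224 hours = 1176 days = about 3.22 years
```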
Both failed core switches had gone into service three years earlier. The gap of a few days between the two failures probably came from the difference in when each was first powered on after arriving in the equipment room.
In plain language: "This machine has a time bomb in it, and it will go off in a little over three years!" What on earth is that?!
Speechless as we were, we quickly checked all the equipment running on our network and found several more switches of the same series in service.
We used the command given in the "case" to check the cumulative hours. Good grief: one pair of switches supporting important services was only two days away from 28224 hours! Worse still, the two switches had exactly the same cumulative hours! In other words, two days later both were very likely to go down at the same time!
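The kind of check described above can be sketched in Python. This is only an illustration under stated assumptions: the story does not give the vendor command, so the sketch starts from hours that have already been read off each device, and the device names, readings, and 30-day warning window are all hypothetical.

```python
from collections import defaultdict

LOCK_HOURS = 28224           # threshold quoted in the vendor "case"
WARN_WINDOW_HOURS = 24 * 30  # assumption: flag anything within 30 days


def triage(power_on_hours):
    """Return (urgent, same_counter) for a {device: hours} mapping.

    urgent maps each device inside the warning window to its hours left;
    same_counter lists groups of devices whose counters are identical,
    since those would hit the limit at the same moment.
    """
    urgent = {}
    by_hours = defaultdict(list)
    for name, hours in power_on_hours.items():
        by_hours[hours].append(name)
        remaining = LOCK_HOURS - hours
        if remaining <= WARN_WINDOW_HOURS:
            urgent[name] = remaining
    same_counter = [sorted(names) for names in by_hours.values() if len(names) > 1]
    return urgent, same_counter


# Hypothetical readings, collected by hand from each switch.
inventory = {"core-C": 28176, "core-D": 28176, "agg-1": 20000}
urgent, same_counter = triage(inventory)

for name, remaining in sorted(urgent.items()):
    print(f"{name}: {remaining} h left (~{remaining / 24:.1f} days)")
print("identical counters:", same_counter)
```

With these made-up readings, both core switches show 48 hours left and are reported as sharing one counter value, which mirrors the situation in the story: a redundant pair offers no protection when both members expire together.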
That would have been the end of us, and a devastating disaster for our business operations.
We hurried to fix it. Company S offered two options:
1. Upgrade the NXOS system;
2. Upgrade only the SSD firmware.
Taking key switches out of service on such short notice was unrealistic, so we chose to upgrade the SSD firmware.
On the day the counter reached 28224 hours, everyone sat in the office as if awaiting a verdict. I couldn't sit still, so I ran to the equipment room and squatted in front of the cabinet, ready to pull the power cords.
Fortunately, at 28225 hours, the system was still running normally! The firmware upgrade had worked! My colleagues and I cheered at once!
That is the whole story of the fault. Looking back, my palms still sweat.
In fact, this hidden defect in Company S's product is extremely dangerous. The Type 9 series is positioned as a data-center-class core network switch, and major enterprises use it for very important services.
Moreover, core equipment is usually deployed as a pair that is powered on and tested together, and nobody proactively upgrades the software version within three years. So this defect is very likely to take both switches down at the same time, with unimaginable consequences!
What angers me most is not the product defect itself, because it is normal for products to have bugs.
What angers me is that Company S knew about this bug but did not tell its customers! They have sold so much equipment; have they kept no customer records? Is there no after-sales tracking? Small devices are one thing, but for large, critical equipment like this, do they simply not care?
As a normal company, after discovering a defect, shouldn't you check your product or sales records, proactively notify customers, and help them avoid or resolve the problem as soon as possible? Is sending a notice really that hard?
I personally think that communication network equipment should have a recall mechanism like the one in the automotive industry. When a major defect occurs, the manufacturer should file it with the relevant national authorities and then start a recall.
Today, communication network equipment is infrastructure as important as water and electricity, and it bears on national security, corporate security, and consumer safety. Manufacturers have an obligation to establish more thorough tracking and follow-up mechanisms and to monitor how their equipment is running, so as to ensure network security.
Well, that's all for today.
As a network engineer, I told this story mainly to share my experience and offer it as a cautionary tale.
I also hope the outside world will give us more understanding and support. These days there are many network products and no end of failure modes. Sometimes vendors also, intentionally or not, conceal product defects and dig pits for us to fall into.
We already have it hard enough. Please don't make us take the blame every time, okay?
Note: Xiaole is a pseudonym.
- END -