
Malformed data packets caused a 37-hour nationwide outage in the United States

via: 博客园     time: 2019/8/21 8:01:11

The US Federal Communications Commission (FCC) criticized CenturyLink over its December 2018 outage but did not punish the company.


According to a new FCC report, CenturyLink's 37-hour nationwide outage in December 2018 disrupted 911 service for millions of Americans and blocked at least 886 calls to 911.

Back in December 2018, FCC Chairman Ajit Pai called the outage on CenturyLink's fiber-optic network “completely unacceptable” and vowed to conduct a thorough investigation. The FCC today published the findings of that investigation, describing how CenturyLink failed to follow best practices that could have prevented the outage. However, Pai has not announced any penalties for CenturyLink.

The report said that the failure was widespread, affecting many other network operators connected to CenturyLink, including Comcast and Verizon. The report summary states:

The outage affected communications service providers, enterprise customers, and consumers who relied on CenturyLink's transport services, which route traffic from different providers to locations across the country. The outage caused extensive disruptions to telephone and broadband service, including 911 calls. Up to 22 million customers in 39 states were affected, including approximately 17 million customers in 29 states who did not have reliable access to 911. At least 886 calls to 911 were not delivered.

According to the FCC, the 37-hour outage began on December 27 and was caused by an equipment failure that was exacerbated by a network configuration error. The FCC said CenturyLink estimates that more than 12.1 million phone calls on its network were “blocked or degraded.”

In addition, approximately 1.1 million of CenturyLink's DSL customers were unable to use the service during those 37 hours. According to the FCC, another 2.6 million DSL customers “may have experienced degraded service.”

Pai again called the outage “completely unacceptable” today, adding, “It's important for communications providers to remember the lessons learned from this incident.”

However, the FCC did not announce penalties and did not even order CenturyLink to take specific steps to upgrade its network. Instead, the FCC says it “will engage in outreach with stakeholders to promote best practices and contact other major transport providers to discuss network practices” and will “provide assistance to smaller providers to help ensure that our nation's communications networks remain robust, reliable, and resilient.” The FCC said it will also issue a public notice “to remind companies to adopt industry-recognized best practices.”

Although the FCC deregulated broadband when it repealed the net neutrality rules, it still regulates the landline phone networks of carriers such as CenturyLink and retains Title II common-carrier authority over them.

FCC Commissioner Jessica Rosenworcel said the report should have been completed sooner and that “it should be accompanied by an action plan to avoid repeating the same mistakes. There is no such action plan for this big problem.”

Root cause

According to the FCC report, the problem began on the morning of December 27, when a switch module at the Denver, Colorado site spontaneously generated four malformed management packets.

CenturyLink and Infinera, the vendor that supplied the node, told the FCC that “they do not know how or why the malformed packets were generated.”

The FCC report explained that malformed packets “are usually discarded immediately because their characteristics indicate that the packets are invalid,” but that did not happen in this event:

In this event, the malformed packets included fragments of valid network management packets that are typically generated. Each malformed packet shared four attributes that contributed to the outage:

  • a broadcast destination address, meaning the packet was sent to all connected devices;
  • a valid header and a valid checksum;
  • no expiration time, meaning the packet would never be discarded for being too old;
  • a size greater than 64 bytes.

According to the FCC, the switch module sent these malformed packets “as network management commands to the line module,” and the packets were “transferred to all networked nodes.” Each node that received a packet then “relayed the packet to all networked nodes.”

The report continues:

Each networked node continued to relay the malformed packets across the proprietary management channel to every node connected to it, because the packets appeared valid and had no expiration time. This process repeated indefinitely.

The constant retransmission of the malformed packets created an endless feedback loop that consumed the processing power of the affected nodes, which in turn disrupted the nodes' ability to maintain internal synchronization. Specifically, instructions to the output line modules fell out of sync whenever an instruction was sent to a pair of line modules but only one of them actually received it. Without this internal synchronization, a node loses the ability to route data. As a result of these node failures, the CenturyLink network experienced multiple outages.
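The feedback loop the report describes can be illustrated with a small simulation. This is only a sketch under stated assumptions: the three-node mesh, the `flood` function, and the `hop_limit` parameter are hypothetical and do not reflect CenturyLink's actual topology or Infinera's equipment. It shows why a broadcast packet with no expiration circulates forever, while the same packet with a hop limit dies out.

```python
from collections import deque


def flood(adjacency, origin, hop_limit=None, max_events=1000):
    """Simulate nodes that relay every packet they receive to all neighbours.

    hop_limit=None models a packet with no expiration time; otherwise a
    packet is dropped once its hop budget is used up.
    Returns (relay_events_processed, packets_still_circulating).
    """
    queue = deque((neighbour, hop_limit) for neighbour in adjacency[origin])
    events = 0
    while queue and events < max_events:
        node, hops_left = queue.popleft()
        events += 1
        if hops_left is not None:
            hops_left -= 1
            if hops_left <= 0:
                continue  # packet expired, so it is not relayed again
        for neighbour in adjacency[node]:
            queue.append((neighbour, hops_left))
    return events, bool(queue)


# Hypothetical fully connected three-node network.
mesh = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}

print(flood(mesh, "A"))               # (1000, True): stopped only by the safety cap
print(flood(mesh, "A", hop_limit=3))  # (14, False): traffic dies out on its own
```

Without an expiration, the only thing that ends the simulation is the artificial `max_events` cap, which plays the role of the engineers' manual intervention.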

Recovery and future changes

CenturyLink became aware of the outage at 3:56 am and, around 10 am, sent network engineers to Omaha, Nebraska, and Kansas City, Missouri, to log in directly to the affected nodes. They later determined that the problem originated at the Denver node. At 9:02 pm, the company “discovered and removed the module that generated the malformed packets.”

However, the outage continued because the malformed packets kept replicating and traveling across the network, generating more packets as they spread from node to node, the FCC wrote. Just after midnight, CenturyLink engineers “began to instruct the nodes to no longer respond to the malformed packets.” They also “disabled the proprietary management channel to prevent further transmission of the malformed packets.”

By 5:07 am on December 28, “most of the network” was running normally, but not all nodes were back to normal until 11:36 pm that night.

Even after all the nodes had returned to normal, “some customers still experienced lingering effects of the outage, because CenturyLink continued to reset the affected line modules and replace the line modules that failed to reset successfully,” the FCC said. CenturyLink confirmed that the network had “stabilized” by 12:01 am on December 29.

Failure to follow best practices

The report said that following several best practices could have prevented the outage or reduced its impact. For example, the FCC says CenturyLink and other network operators should disable unused system features.

The FCC wrote: “In this case, the proprietary management channel was enabled by default so that it could be used if needed. Although CenturyLink did not plan to use this feature, it left the channel unconfigured and enabled. Leaving the management channel enabled created a vulnerability in the network that allowed the malformed packets to be continuously rebroadcast across the network, contributing to the outage.”

The report also said that CenturyLink could have used stronger filtering to prevent the malformed packets from propagating. CenturyLink used “filters designed to address specific risks.” Instead, CenturyLink could have adopted a “catch-all filter” that admits only expected traffic.
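The difference between the two filtering approaches can be sketched in a few lines. Everything here is an illustrative assumption: the `Packet` type, the traffic names, and both filter functions are hypothetical and are not taken from the FCC report or from any vendor's software. The point is only that a filter written for specific risks passes anything it was not written for, while a catch-all filter drops anything it does not expect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    kind: str
    destination: str


# Hypothetical traffic categories, chosen for illustration only.
KNOWN_RISKS = {"known_exploit"}
EXPECTED_TRAFFIC = {"routing_update", "keepalive"}


def specific_risk_filter(packet):
    """Block only traffic that matches a known risk; pass everything else."""
    return packet.kind not in KNOWN_RISKS


def catch_all_filter(packet):
    """Pass only traffic that is explicitly expected; block everything else."""
    return packet.kind in EXPECTED_TRAFFIC


unexpected = Packet(kind="unrecognized_management", destination="broadcast")

print(specific_risk_filter(unexpected))  # True: the unanticipated packet gets through
print(catch_all_filter(unexpected))      # False: the unanticipated packet is dropped
```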

According to the FCC, CenturyLink should have set up “memory and processor utilization alerts” in its network monitoring. Although the malformed packets “slowly overwhelmed the processing power of the nodes,” this “did not trigger any alerts in CenturyLink's systems.”
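A utilization alert of the kind the FCC recommends can be as simple as a threshold check over periodic samples. The sketch below is a minimal illustration, not CenturyLink's monitoring system: the function name, the 85% threshold, the three-sample window, and the sample values are all assumptions.

```python
def utilization_alerts(samples, threshold=0.85, sustained=3):
    """Return the sample indices at which an alert would fire.

    An alert fires once utilization has stayed at or above `threshold`
    for `sustained` consecutive samples, so a brief spike is ignored
    but a steady climb is caught.
    """
    alerts = []
    run = 0
    for index, value in enumerate(samples):
        run = run + 1 if value >= threshold else 0
        if run == sustained:
            alerts.append(index)
    return alerts


# Hypothetical processor utilization of a node being slowly overwhelmed.
cpu = [0.30, 0.45, 0.60, 0.80, 0.90, 0.95, 0.97, 0.99]

print(utilization_alerts(cpu))  # [6]
```

Requiring several consecutive high samples is one possible design choice for reducing false alarms; a real deployment would tune both values to its own equipment.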

After the incident, CenturyLink “replaced the failed switch module and sent it to Infinera for forensic analysis,” the FCC wrote. According to the FCC, Infinera's engineers have still been unable to reproduce the problem, but the companies involved have taken additional steps to prevent a recurrence.

Those additional measures include CenturyLink disabling the proprietary management channel. “Infinera has disabled the channel on new nodes for CenturyLink's network and updated the node's product manual to recommend disabling the channel when it is not in use,” the FCC said.

The report continues:

The service provider and vendor also developed a network monitoring plan for network management events to detect similar events faster. Currently, CenturyLink is updating the Ethernet policer on its nodes to reduce the chance of transmitting malformed packets in the future. The improved Ethernet policer quickly identifies and terminates invalid packets and prevents them from propagating into the network. This work is expected to be completed in the fall of 2019.

Today, CenturyLink stated that the outage was caused by a network management card that generated malformed packets and that, unfortunately, the malformed packets were broadcast across CenturyLink's transport network.

CenturyLink further stated that it has taken a number of steps to help prevent the problem from recurring, including disabling the communication channel that the malformed packets traversed during the event and enhancing network monitoring. “We value our customers and deeply regret any inconvenience this incident may have caused,” the company said.

Impact on operators such as Comcast and Verizon

According to the FCC, the outage had a “chain effect” on other providers that rely on CenturyLink's long-haul transport network.

According to the FCC, the outage “may have affected 3,552,495 of Comcast's VoIP users for up to 49 hours and 32 minutes,” and Comcast phone customers may have experienced “fast-busy signals or poor call quality if their calls were carried over the affected transport network.”

The failure also disrupted Comcast's ability to transfer 911 calls in Idaho.

Verizon uses CenturyLink's network to carry some of its wireless network traffic. “The outage affected Verizon Wireless's network in several western states, causing intermittent service issues in multiple locations,” the FCC said.

The FCC said that thousands of customers on Verizon's CDMA network could not dial 911 during the outage. 911 service on Verizon's LTE network was unaffected “because the LTE network does not use the affected CenturyLink network for transport.”

According to the FCC, “the CenturyLink outage also had a minor impact on other service providers.” However, these smaller impacts still affected millions of people.
