In any network, it is a given that there will be loss. We’ve been making computers talk to each other for decades now, and that whole time scientists and engineers have been searching for better ways to compensate for weird things the network does to data in flight.
DALL-E, prompted “a lost packet in a computer network. minimal illustration, contemporary style, no text.”
It’s been nearly 50 years since TCP was bestowed upon us, and for the most part it’s still the best we can do. Ignoring witchcraft like erasure codes, reliability is usually achieved through detection and retransmission. That is, if the sender thinks the receiver didn’t get a packet, it will just resend it. So, even when you roll your own reliability on top of UDP, it still usually ends up looking pretty similar to TCP. Even with all the improvements we’ve made, it’s still TCP.
Historically, this has worked pretty well. Network loss tends to be either sporadic or persistent. Once you’ve figured out collision and congestion, if you experience loss and a few retries don’t help then you’re probably out of luck.
But, what if you can’t retransmit that packet? What if that specific packet just can’t be delivered? The next packet arrives just fine. But retransmission isn’t helping the first one.
I was working on a project that involved processing high-bandwidth streaming data. Neither the data nor the processing matters here, because that day we were just playing some recorded data back to test network throughput. We had just upgraded the infrastructure, so we wanted to see what it could do.
A file was stored on one machine, and some software to eat it was running somewhere else. Because we wanted to avoid head-of-line blocking in this application, we had a custom reliability protocol over UDP. We could tolerate a loss if it meant avoiding a latency spike.
We had a huge library of test data, and had been playing most of it back without problems. Occasional loss was normal and expected. However, there was one dataset that we found would always drop the first packet. Normally, we wouldn’t even notice this. But, this dataset had a lower packet rate so the delayed start was perceptible.
The first few times we observed this, we thought nothing of it. The network did something weird. But, it was repeatable. And the reliability protocol had more than enough time to compensate for a random loss event. Why didn’t it in this case?
We were convinced we had discovered a bug. We started turning knobs to figure out under which conditions it could be reproduced. The first thing we tried tuning was the packet size. Originally, we were sending packets with a payload of something like 32 KiB. What happens if it’s only 8 KiB? No bug. 64 KiB? Still bugged.
I figured there must be something wrong with how we’re handling large buffers in either the sender or receiver. I started staring at the code. I analyzed the unit tests for a gap. But both programs were simple enough and tested comprehensively enough that I was at a loss.
We eventually took the new code and ran it on the old infrastructure. It worked there. This was truly an enigma.
Something inspired another engineer to start twiddling bits. Somehow, he figured out that changing either of two adjacent bytes at a certain offset in the file could bypass the bug. I knew this had been a crucial discovery.
I gave up on analyzing the code and convinced someone with more permissions than myself to collect a packet capture from both the source and destination hosts. I opened it in Wireshark and the first thing I saw was an IP packet that failed to be reassembled.
Sometimes, you start a packet capture at a bad time and miss something. That wasn’t the case here. On the sender side, I could see four fragments in the first datagram, sent about five seconds into the recording. On the receiver, I could see the first, third, and fourth fragments. The second fragment was nowhere to be seen.
Yes, the 32 KiB payload required fragmentation even with jumbo frames - I am not sure why that was the default size, and didn’t care to find out. But, I had to figure out why fragmentation was breaking the transmission of this file.
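To make the fragment count concrete, here is a rough sketch of the arithmetic. It assumes a 9000-byte jumbo MTU and a 20-byte IPv4 header with no options; neither number comes from our actual configuration, and `fragment_sizes` is just an illustrative helper.

```python
IP_HEADER = 20   # bytes; IPv4 header without options (assumption)
UDP_HEADER = 8   # bytes; fixed by the UDP specification

def fragment_sizes(udp_payload_len, mtu=9000):
    """Return the IP payload size carried by each packet for one UDP datagram."""
    total = UDP_HEADER + udp_payload_len      # bytes the IP layer has to carry
    if total <= mtu - IP_HEADER:
        return [total]                        # fits in one packet, no fragmentation
    # Every fragment except the last must carry a multiple of 8 bytes,
    # because the fragment offset field counts in 8-byte units.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    sizes = []
    while total > 0:
        chunk = min(per_fragment, total)
        sizes.append(chunk)
        total -= chunk
    return sizes

for kib in (8, 32):
    sizes = fragment_sizes(kib * 1024)
    print(f"{kib} KiB payload -> {len(sizes)} IP packet(s): {sizes}")
# 8 KiB payload -> 1 IP packet(s): [8200]
# 32 KiB payload -> 4 IP packet(s): [8976, 8976, 8976, 5848]
```

Under those assumptions, an 8 KiB payload goes out unfragmented while a 32 KiB payload splits into four fragments, which lines up with what the capture showed.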
I continued looking at the packet captures - I could see several attempts at retransmission of this problem datagram, all unsuccessful. The second fragment never made it to its destination.
Going back to the other engineer’s discovery, I tried to find those two bytes he had been able to manipulate to evade the bug. Looking back to the sender capture, I found they landed on the third and fourth bytes of the IP payload in the frame that was getting dropped.
As a 16-bit big-endian integer, that’s 3784. And it landed right where a UDP destination port number should go.
Port 3784 belongs to a protocol called Bidirectional Forwarding Detection, or BFD. In short, it allows network devices to send each other high-rate pings in order to detect when a link fails. That should be irrelevant, because our packets were not destined for port 3784. And yet, under fragmentation, that port number ended up in the spot that it would if it were.
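For reference, a UDP header starts with a 2-byte source port followed by a 2-byte destination port, so the destination port occupies the third and fourth bytes of the IP payload. The sketch below reads those two bytes from a made-up fragment payload; the byte values around the `0x0E 0xC8` pair are invented for illustration and are not from our file.

```python
import struct

def port_like_field(ip_payload: bytes) -> int:
    """Read bytes 2..3 of an IP payload as a big-endian unsigned 16-bit integer.

    In an unfragmented UDP packet this is the destination port. In a later
    fragment it is just whatever application data happens to sit there.
    """
    (value,) = struct.unpack_from("!H", ip_payload, 2)
    return value

# Hypothetical payload of a non-first fragment: plain file data, no UDP header.
fragment_payload = bytes([0x41, 0x42, 0x0E, 0xC8, 0x43, 0x44])

print(port_like_field(fragment_payload))  # 3784
```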
Thus, we finally had a hunch. Something on the network is eating this packet, because it erroneously thinks it’s a BFD message.
We tested some other scenarios. What if the destination port were indeed 3784? The failure was reproduced every time, even without fragmentation. What if we put those two bytes at the same offset in the third fragment instead? Same result. We took this as confirmation of our hypothesis.
Indeed, even without a valid UDP header, this IP fragment was being interpreted as BFD and taken up into the control plane, just on the basis of “yep, that’s where the port number should go.”
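I have no idea what the switch firmware actually looked like, so take the following as a guess at the class of mistake rather than Cisco's implementation. It contrasts a classifier that only looks at "where the port number should go" with one that first checks the IPv4 fragment offset. All function names, addresses, and packet contents are hypothetical.

```python
import struct

BFD_PORT = 3784   # single-hop BFD control port
UDP_PROTO = 17    # IPv4 protocol number for UDP

def make_ipv4_packet(payload, frag_offset_units=0, more_fragments=False):
    """Build a minimal IPv4 packet (checksum left at 0; fine for this sketch)."""
    flags_frag = (0x2000 if more_fragments else 0) | frag_offset_units
    header = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + len(payload), 1, flags_frag,
        64, UDP_PROTO, 0, bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]),
    )
    return header + payload

def naive_is_bfd(packet):
    """Trust bytes 2..3 of the IP payload to be a UDP destination port."""
    ihl = (packet[0] & 0x0F) * 4
    (dport,) = struct.unpack_from("!H", packet, ihl + 2)
    return packet[9] == UDP_PROTO and dport == BFD_PORT

def fragment_aware_is_bfd(packet):
    """Only parse a UDP header when the packet can actually contain one."""
    (flags_frag,) = struct.unpack_from("!H", packet, 6)
    if flags_frag & 0x1FFF:        # non-zero fragment offset: no UDP header here
        return False
    return naive_is_bfd(packet)

# A second fragment (offset 1122 * 8 = 8976 bytes) whose data contains 0x0EC8.
second_fragment = make_ipv4_packet(b"\x41\x42\x0e\xc8" + b"\x00" * 16,
                                   frag_offset_units=1122, more_fragments=True)

print(naive_is_bfd(second_fragment))           # True
print(fragment_aware_is_bfd(second_fragment))  # False
```

The naive check matches the fragment even though no UDP header is present, which is consistent with the behavior we saw on the wire.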
The new infrastructure’s switches supported BFD, while the old ones didn’t. We weren’t using the feature, and in fact it was turned off, but the problem occurred nonetheless.
We filed a support case with Cisco, who let us know that they were already aware of the bug. It was related to another issue reported to them by another customer. This was a bit anticlimactic. But, fortunately that meant it was only a couple of weeks before a firmware update was available to fix this.
Along with the firmware update, we were given a link to the related security advisory. Apparently there was actually a DoS vulnerability hiding somewhere behind this bug, now known as CVE-2022-20623.
I am not a security researcher, so I wasn’t concerned with finding a vulnerability but rather just fixing a bug. I wasn’t disappointed to not be first to discover it - I was just happy to have had an interesting problem to solve. Nonetheless, sometimes when I recount this story to friends I may take a bit more credit than I deserve, hinting toward Cisco’s attribution, “This vulnerability was found during the resolution of a Cisco TAC support case,” and taking it as a small claim to fame.