By MacConnect Wednesday, May 05, 2010.
Tags: Blogs, All Things infrastructure
An Internet outage. Nobody likes it. Everyone wants to avoid it. Companies spend millions upon millions of dollars to prevent it, yet as the old saying goes, "sh*t happens".
I will admit that from time to time (thankfully it's very rare) we will have a problem on our network that might interrupt the flow of traffic for some or all customers. It happens to everyone. The reasons can range from something as silly as an accidentally unplugged power cord (this actually happened to us at Telx a few years back) to a software bug that causes a cascading series of problems. Regardless of the cause, after an outage everyone wants to know what happened, and what steps are being taken to ensure that it never happens again. After all, we have books and shoes and cheese to sell online, and there are men and women to be matched and stocks to be traded, and none of it can be interrupted EVER for any reason, right?
This is the short story of one such outage in February of 2009. This did not impact Gotham to any significant degree, but it caused enough widespread panic on the Internet for a few hours to be worth passing along.
When one thinks of the backbone of the Internet, a few names come to mind. Names like Cisco and Juniper Networks pop up first, and rightly so, because the VAST majority of packets zooming around on the Internet pass through equipment made by one of those two companies. Not all networks connected to the Internet have the resources to purchase and maintain expensive Cisco or Juniper gear, however, so you'll also find an odd assortment of other hardware and software attached to the global network if you look hard enough. In the case of SuproNet, a small-ish service provider in the Czech Republic, we're talking about MikroTik software. You might ask yourself, "Who or what is MikroTik?" Exactly.
To make a long and mind-numbingly technical story short, SuproNet's MikroTik software decided that in order to help its owners manage use of primary and backup Internet connections, it would fiddle with something called autonomous system (AS) path prepending. If you really need to know what that is, let me know and I'll call you one night when you're having a hard time falling asleep. You'll be snoring in no time at all. In a nutshell, SuproNet's routers began telling their upstream neighbors that SuproNet could be found at the end of a REALLY long chain of AS numbers. Why is this of any consequence?
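For anyone still awake, here is a rough sketch of what prepending is meant to do when it is used sensibly. It is a toy model, not SuproNet's actual configuration: the AS numbers are made-up values from the range reserved for documentation, and the function names are invented for illustration. The idea it relies on is that, all else being equal, routers prefer the route with the shorter chain of AS numbers, so padding the chain on the backup link nudges traffic toward the primary one.

```python
# Toy model of AS path prepending. Every AS number below is a made-up value
# from the documentation range; none of them belong to SuproNet.

def prepend(as_path, own_asn, extra_copies):
    """Return a new path with `extra_copies` additional copies of own_asn in front."""
    return [own_asn] * extra_copies + list(as_path)


def preferred(path_a, path_b):
    """Return the shorter of two paths (path_a wins a draw)."""
    return path_a if len(path_a) <= len(path_b) else path_b


OWN_ASN = 64500           # hypothetical small provider
PRIMARY_UPSTREAM = 64501  # hypothetical primary connection
BACKUP_UPSTREAM = 64502   # hypothetical backup connection

# What the provider announces on each link: plain on primary, padded on backup.
to_primary = [OWN_ASN]
to_backup = prepend([OWN_ASN], OWN_ASN, 3)

# Each upstream adds its own number before passing the route along.
via_primary = [PRIMARY_UPSTREAM] + to_primary
via_backup = [BACKUP_UPSTREAM] + to_backup

print(len(via_primary), len(via_backup))                  # 2 5
print(preferred(via_primary, via_backup) is via_primary)  # True

# The same mechanism with an absurd repeat count (250 is an arbitrary example).
runaway = prepend([OWN_ASN], OWN_ASN, 250)
print(len(runaway))                                       # 251
```

A handful of extra copies is routine. The last three lines show how the very same knob, turned far too high, produces the kind of gargantuan chain this story is about.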
Imagine calling your best friend and informing him that you have a new mailing address. Seems pretty mundane, but now imagine that your new mailing address was so long that it would fill an entire page of The New York Times. Now imagine that in order for anyone else to know your new mailing address, your friend will have to start a "phone chain", relaying the new information to the next person, who sends it to the next, and so on. Here's the rub. Some of your friends look oddly at your new gargantuan mailing address, but pass it along anyway without incident. Some are so maddened by this new address that they refuse to talk any longer to the person that tried to give it to them. Suddenly your network of friends and family members is fragmented, some people are refusing to listen to others, and the normally friendly banter that passes among them comes grinding to a halt.
Well, this is exactly what happened when a comically long AS prepend from SuproNet caused a huge number of Internet routers to stop talking to each other. Since the Internet is basically routers talking to each other, this is a bad thing. For a period of approximately an hour, while network engineers around the world tried to figure out what was going on, things ground to a near standstill in many non-significant places. Network routing information was being changed thousands of times every second as routers began refusing to talk to one another, and we wound up with a noticeable period of time during which a large chunk of the Internet turned to mush.
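The phone chain can be sketched in a few lines of code. This is a deliberately simplified model and not how any real router is implemented: the threshold is a made-up number, and each router is reduced to a single yes/no flag saying whether it copes with an overlong address. Tolerant routers pass the announcement along; the first intolerant one hangs up, and nobody behind it hears anything.

```python
# Toy "phone chain": how far does an announcement travel before a router
# that cannot cope with a very long AS path drops the conversation?

HYPOTHETICAL_LIMIT = 100  # made-up threshold, chosen only for illustration


def routers_reached(path_length, chain, limit=HYPOTHETICAL_LIMIT):
    """Count the routers in `chain` that accept and relay the announcement.

    `chain` is a list of booleans, True meaning that router tolerates
    overlong paths. Relaying stops at the first router that does not.
    """
    reached = 0
    for tolerant in chain:
        if path_length > limit and not tolerant:
            break  # this router hangs up; nothing gets past it
        reached += 1
        path_length += 1  # each router adds its own AS number to the path
    return reached


chain = [True, True, False, True, True]

print(routers_reached(5, chain))    # 5: a normal path reaches everyone
print(routers_reached(250, chain))  # 2: the third router drops the session
```

In the real event the damage was worse than a single broken link, because a dropped session makes routers on both sides recalculate and re-announce their routes, which is where the thousands of changes per second came from.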
Picture yourself during this outage. Your website is down, your customers are screaming at you, you're screaming at your service provider, and you're vowing to move all your stuff to a bulletproof network that never goes down. We've all been there. Odds are we'll be there again. In this case you'd have learned that the root of the problem was in a small office halfway around the world, that a bit of $99 software exposed a previously unknown weakness present in a very large number of routers costing thousands of times more than that, and that even the largest, most expensive networks on the planet turned out not to be "bulletproof" at all.
The Internet has certainly come a long way since 1992, but it can still be unexpectedly fragile at times. It would serve all of us well to keep that in mind, especially when marketing our services and making promises that people we've never met (and never will) can trample all over without even realizing they're doing it.
Founded in 1996, MacConnect is the first and largest Mac-Centric ISP on the planet. Providing world class hosting solutions that are as easy as your Mac, MacConnect is the first choice for any Mac user in need of web, email and application hosting. Find us online at MacConnect.com