General Musing

blaze your trail

On the Border of the Internet #risk

with 3 comments

The age-old lie told by ISP support desks, “The Internet is down,” was briefly reality again yesterday.

For the past couple of days I’d been seeing and hearing comments that there was a disturbance in the force of the Internet. Initially a message was posted to NANOG about a general malaise or instability in the Internet; some humorous quips were posted in response, and the matter was soon forgotten.

A network operator, looking back with hindsight, said that they had seen a higher than normal number of BGP updates, which is normally an indicator of network instability being solved by rerouting around the problem. That is all part of the normal operation of the Internet. Then, sometime yesterday morning, as the east coast of the US was getting to work, the looming disaster struck.

Juniper network devices started core dumping and restarting due to a bug in the code that handles BGP UPDATE messages, just as another large update was arriving. The self-healing properties of the Internet broke, and the Internet went with them. The Great Juniper Outage of 2011 was born.

Avoidable?

Almost certainly. The reliance of large ISPs (the backbone carriers) on the hardware of one specific vendor creates a single point of failure, which is bad, mkay. A failover solution should always be in place, and not just at the ISPs. Companies that rely on the Internet for business should take this into account too. Recent outages at some of the companies I consulted for showed that by placing their faith in one specific vendor they had created a single point of failure, which had caused some high-profile repercussions.

Do you have a single point of failure?

Written by Daniël W. Crompton (webhat)

November 8, 2011 at 7:13 am

Posted in hardware, risk, technology


3 Responses


  1. System administration policies on “homogeneous” environments can also lead to such problems. While it is easier to maintain a bunch of similar systems, it also increases the chance of one fault bringing down all the systems. In one company we had redundant DNS servers, so if one failed the other automatically took over. They were identical hardware with identical OS/software. Some random defect in the OS/DNS server took server A down. Well, the other machine went down at the exact same time for the exact same reason.

    Lesson? Failover is important, but a failover system which is identical to the primary may not help at all.

    mortoray

    November 8, 2011 at 9:04 am

  2. I remember my/our first “the internet is down” from Bernard at UPC in the early 90s. He’d been told to use this as a stock answer, working on the principle that however little helpdesk workers knew about internet, the punters knew even less till I phoned >;)

    Credit where it’s due, UPC (Dutch) helpdesk is now manned/womaned by people who know what they’re talking about…

    bill crompton

    November 10, 2011 at 9:00 pm

