
The key bit:

> But, at 4:43 a.m. on July 8, a piece of code was introduced that deleted a routing filter. In telecom networks, packets of data are guided and directed by devices called routers, and filters prevent those routers from becoming overwhelmed, by limiting the number of possible routes that are presented to them.

> Deleting the filter caused all possible routes to the internet to pass through the routers, resulting in several of the devices exceeding their memory and processing capacities. This caused the core network to shut down.

Lesson no. 1: Do not design your system to have a single point of failure.
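To make the quoted mechanism concrete, here's a toy sketch in Python. It is purely illustrative, not Rogers' actual configuration: the limits, the ~900k figure for the global routing table, and the "crash" are all stand-in assumptions.

```python
# Toy model: a router accepting routes from a peer, with and without a
# route-limit filter in front of it. Numbers are illustrative only.

MEMORY_LIMIT = 500_000             # hypothetical: routes this router can hold
MAX_PREFIXES_FROM_PEER = 100_000   # hypothetical per-peer filter

def receive_routes(advertised_routes, prefix_limit=None):
    """Return the routes the router installs, or fail if it is overwhelmed."""
    if prefix_limit is not None:
        # The filter caps how many routes a peer may present to this router.
        advertised_routes = advertised_routes[:prefix_limit]
    if len(advertised_routes) > MEMORY_LIMIT:
        raise MemoryError("route table exceeds router capacity; router falls over")
    return advertised_routes

# Roughly the size of the full IPv4 internet routing table in 2022.
full_internet_table = list(range(900_000))

# With the filter in place, the router survives:
installed = receive_routes(full_internet_table, prefix_limit=MAX_PREFIXES_FROM_PEER)
print(len(installed))  # 100000

# Delete the filter and the full table floods in, exceeding capacity:
try:
    receive_routes(full_internet_table, prefix_limit=None)
except MemoryError as exc:
    print("without the filter:", exc)
```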



> But, in the early hours, the company’s technicians had not yet pinpointed the cause of the catastrophe. Rogers apparently considered the possibility that its networks had been attacked by cybercriminals.

I mean, if you just pushed a config change and the whole network goes kaput, take a look at the config change before you start suspecting hackers.


I heard that the teams were having trouble communicating with each other and so the ones who pushed the config might not have been the ones looking for hackers.

This is why some hospitals still use the old pager systems to contact people in the city. One hospital-owned antenna on a battery can coordinate a lot of people. I don’t know what the equivalent to that would be in this case though.


Ham radio.

It still works, you know?

Also, PagerDuty works over wifi...


Radiocommunications Regulations in Canada indicate that “radiocommunications in support of industrial, business or professional activities” cannot be done over ham radio (Section 47, Subsection C, Para iii).

I would imagine using it to run nurse-doctor comms would constitute professional use.


I have my ham license in America, so it may be different for our maple syrup friends up north. But one thing that is stressed repeatedly while preparing for the exam is that when an emergency happens, most of the restrictions go out the window, since the emphasis is on helping address the emergency. So I am assuming the Canadian equivalent of the FCC wouldn't be too peeved.

Of course, this does raise the question of whether this counts as an emergency or not.


A telecommunications outage, by itself, is not an immediate threat to life or limb, and you'd need to exhaust other reasonable options to communicate before that emergency provision comes into play.

(Many hospitals do also have ham radio operators as part of their disaster plans, and they operate during drills.)

IANAL; contact the ARRL for further questions.


I’m skeptical the FCC would be upset if the alternative is 911 service being unavailable for most of the country.


The alternative would have been to pull the SIM card to access 911 through any available cell network, but not many knew about this.


In this case, the option would be to have a whole bunch of care providers in the hospital trained and licensed for ham radio.

May whatever deity you worship help the instructor teaching atmospheric propagation to some of the staff I’ve worked with in the past.


Let's not make ham radio operation out to be some complex voodoo that requires months of training. You tune your radio to a given frequency and press the PTT button.

Yes, sure, there are repeaters and other details to know about, but TX/RX on a single frequency isn't exactly the rocket science that sad hams make it out to be.


>works over wifi

even when the internet is down?


A remote devops guy is enjoying some quiet time at home when...

https://youtu.be/-sQqc6xy_og?t=42


I don't think it immediately went 'kaput', and internal communications would also have been affected, which would have delayed reporting the failure. So it might have been hard to initially tie it to the config change.


> I mean, if you just pushed a config change and the whole network goes kaput, take a look at the config change before you start suspecting hackers.

They probably did it from home/overseas. Can't check what you did after you dropped the country itself offline.


Possibly more key:

"In the Rogers network, one IP routing manufacturer uses a design that limits the number of routes that are presented by the Distribution Routers to the core routers. The other IP routing vendor relies on controls at its core routers. The impact of these differences in equipment design and protocols are at the heart of the outage that Rogers experienced."

I read this as: we were halfway through a big project to replace 6509s with new Juniper switches, but that takes a few years, so in the meantime the same configuration change can behave differently depending on which vendor's router it lands on.


At least Junos has working commit confirmed with working rollback. Can't say the same thing for the alternative.


That really can't be the lesson. You can have redundancy, but there's still some central spot where that redundancy gets configured and controlled, and you can mess that up. So sure, maybe then you'd have multiple controllers and have them vote or something to see who wins. But then that configuration needs to be known to them, and you could mess that up. No matter how much you design, there's still a point at the top of the stack where it's a human being typing numbers into a text file or something equivalent.

And the Big Failures are always around this spot in the stack. Things like routing topology control (also top-level DNS configuration, another famous fat-finger lever) are "single points of failure" more or less by definition.


The router config is not the top of the stack though. There should be orchestration systems that push out changes to these devices and can roll them back when required.


In general, you can't do any kind of automated rollback when the network goes down (same deal for DNS, which is why I mentioned it -- similar datacenter-wide failures have been due to stuff like stampeding initialization herds, etc...). I get what you're saying, I'm just pointing out that it's a little naive. No matter how high you make your stack of abstractions, there's still a human being at the top of the stack with fat fingers.


Sure, but that's why I think the take-away here is to design the system such that you can do such automated rollbacks. For example, the devices should also be connected to some sort of isolated management network. Granted, pushing out a config to the routers could also break their connections to said management network, but that is less likely anyway.
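A minimal sketch of that idea, modeled on the "commit confirmed" behavior mentioned upthread: apply a candidate config, then automatically revert unless the device stays healthy and someone confirms within a window. The device, health-check, and confirm hooks here are hypothetical placeholders; a real system would run its checks over the isolated management network.

```python
import time

CONFIRM_WINDOW_SECONDS = 600  # hypothetical 10-minute confirmation window

def apply_with_auto_rollback(device, candidate_config,
                             is_healthy, confirm_requested):
    """Push candidate_config; keep it only if health checks pass and the
    change is explicitly confirmed before the window expires."""
    previous_config = device.get_running_config()
    device.push_config(candidate_config)

    deadline = time.monotonic() + CONFIRM_WINDOW_SECONDS
    while time.monotonic() < deadline:
        if not is_healthy(device):
            # Health check over the management network failed: revert now.
            device.push_config(previous_config)
            return "rolled back: health check failed"
        if confirm_requested():
            return "committed: operator confirmed"
        time.sleep(5)

    # Nobody confirmed in time (perhaps because the change cut them off),
    # so the safe default is to revert.
    device.push_config(previous_config)
    return "rolled back: confirmation window expired"
```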


I was thinking the config file was edited in vi, with no program to check it before it was pushed.
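Even a thin pre-push check would catch that failure mode. A sketch, assuming a hypothetical YAML-style config (PyYAML installed) and a made-up policy rule; a real pipeline would lean on the vendor's own commit-check or dry-run facility instead.

```python
import sys
import yaml  # assumes PyYAML (pip install pyyaml)

def validate(path):
    """Return a list of problems found in the candidate config file."""
    errors = []
    try:
        with open(path) as f:
            config = yaml.safe_load(f)
    except yaml.YAMLError as exc:
        return [f"syntax error: {exc}"]

    # Example policy check: never ship a peer with no route filter at all.
    for name, peer in (config.get("peers") or {}).items():
        if "prefix_limit" not in peer:
            errors.append(f"peer {name}: missing prefix_limit (route filter)")
    return errors

if __name__ == "__main__":
    problems = validate(sys.argv[1])
    if problems:
        print("\n".join(problems))
        sys.exit(1)  # block the push
    print("config looks sane; hand it to the orchestration system")
```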



