> But, at 4:43 a.m. on July 8, a piece of code was introduced that deleted a routing filter. In telecom networks, packets of data are guided and directed by devices called routers, and filters prevent those routers from becoming overwhelmed, by limiting the number of possible routes that are presented to them.
> Deleting the filter caused all possible routes to the internet to pass through the routers, resulting in several of the devices exceeding their memory and processing capacities. This caused the core network to shut down.
Lesson no. 1: Do not design your system to have a single point of failure.
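To make the failure mode concrete, here's a toy sketch. Everything in it is invented for illustration (the capacity number, the prefix test), not Rogers' actual configuration; real core routers speak BGP/IS-IS and their limits come from hardware memory, not a Python constant.

```python
# Toy model of the quoted mechanism: a route filter limits what reaches the core.
# All names and numbers are illustrative only.

CORE_CAPACITY = 500_000  # assumed number of routes a core router can hold


def present_routes(full_table, route_filter=None):
    """Return the routes actually handed to the core router.

    route_filter is a predicate keeping only the routes the core needs.
    With the filter deleted (None), the entire internet table gets through.
    """
    kept = [r for r in full_table if route_filter is None or route_filter(r)]
    if len(kept) > CORE_CAPACITY:
        raise MemoryError(f"{len(kept):,} routes exceed core capacity of {CORE_CAPACITY:,}")
    return kept


# With the filter in place, only a manageable subset reaches the core:
#   present_routes(full_internet_table, lambda r: r.startswith("10."))  -> fine
# Delete the filter and the full table floods in:
#   present_routes(full_internet_table)                                 -> MemoryError
```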
> But, in the early hours, the company’s technicians had not yet pinpointed the cause of the catastrophe. Rogers apparently considered the possibility that its networks had been attacked by cybercriminals.
I mean, if you just pushed a config change and the whole network goes kaput, take a look at the config change before you start suspecting hackers.
I heard that the teams were having trouble communicating with each other and so the ones who pushed the config might not have been the ones looking for hackers.
This is why some hospitals still use the old pager systems to contact people in the city. One hospital-owned antenna on a battery can coordinate a lot of people. I don’t know what the equivalent to that would be in this case though.
Radiocommunications Regulations in Canada indicate that “radiocommunications in support of industrial, business or professional activities” cannot be done over ham radio (Section 47, Subsection C, Para iii).
I would imagine using it to run nurse-doctor comms would constitute professional use.
I have my ham license in America, so it may be different for our maple syrup friends up north. But one thing that is stressed repeatedly while preparing for the exam is that when an emergency happens, most of the restrictions fly out the door, as the emphasis is on helping address the emergency. So I am assuming Canada's FCC equivalent wouldn't be too peeved.
Of course, I'm sure this runs up against the question of whether or not it counts as an emergency.
A telecommunications outage, by itself, is not an immediate threat to your life or limb, and you'd need to exhaust other reasonable options to communicate before that emergency provision comes into play.
(Many hospitals do also have ham radio operators as part of their disaster plans, and they operate during drills.)
Let's not make ham radio operation out to be some complex voodoo that requires months of training. You tune your radio to a given frequency and press the PTT (push-to-talk) button.
Yes, sure, there are repeaters and other details to know about, but TX/RX on a single frequency isn't exactly the rocket science that sad hams make it out to be.
I don't think it immediately went 'kaput' -- and internal communications would also have been affected, which would delay reporting the failure. So it might have been hard to tie it to the config change at first.
"In the Rogers network, one IP routing manufacturer uses a design that limits the number of routes that are presented by the Distribution Routers to the core routers. The other IP routing vendor relies on controls at its core routers. The impact of these differences in equipment design and protocols are at the heart of the outage that Rogers experienced."
I read this as: we were halfway through a big project to replace the 6509s with new Juniper switches, but that takes a few years, so in the meantime we make configuration changes that behave differently on each vendor's routers.
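Schematically, the difference the report is describing might look something like this. The layer names, the prefix test, and the limit are all my own invented shorthand, not either vendor's actual behavior:

```python
# Schematic contrast of the two designs described in the report (illustrative only).

CORE_LIMIT = 500_000


def distribution_layer(routes, trim=True):
    """Vendor-A style: distribution routers only pass on what the core needs."""
    return [r for r in routes if r.startswith("10.")] if trim else list(routes)


def core_layer(routes):
    """Vendor-B style: the core itself enforces a maximum-prefix style limit."""
    if len(routes) > CORE_LIMIT:
        raise MemoryError("core router over capacity")
    return list(routes)


# Deleting the filter is roughly trim=False above; whether the core then
# survives depends entirely on whether it enforces its own limit.
```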
That really can't be the lesson. You can have redundancy, but there's still some central spot where that redundancy gets configured and controlled, and you can mess that up. So sure, maybe then you'd have multiple controllers and have them vote or something to see who wins. But then that configuration needs to be known to them, and you could mess that up. No matter how much you design, there's still a point at the top of the stack where it's a human being typing numbers into a text file or something equivalent.
And the Big Failures are always around this spot in the stack. Things like routing topology control (also top-level DNS configuration, another famous fat-finger lever) are "single points of failure" more or less by definition.
The router config is not the top of the stack though. There should be orchestration systems that push out changes to these devices and can roll them back when required.
In general, you can't do any kind of automated rollback when the network itself goes down (same deal for DNS, which is why I mentioned it -- similar datacenter-wide failures have been caused by things like stampeding initialization herds, etc.). I get what you're saying, I'm just pointing out that it's a little naive. No matter how high you make your stack of abstractions, there's still a human being at the top of the stack with fat fingers.
Sure, but that's why I think the takeaway here is to design the system such that you can do such automated rollbacks. For example, the devices should also be connected to some sort of isolated management network. Granted, pushing out a config to the routers could also break their connections to said management network, but it is less likely anyway.
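Roughly what I have in mind, as a hypothetical sketch: the Device stub and function names below are made up for illustration (real deployments would use NETCONF/Ansible or a vendor controller), and the health check is assumed to run over the out-of-band management network.

```python
# Hypothetical push-verify-rollback orchestration sketch; not any vendor's real API.

import time
from dataclasses import dataclass, field


@dataclass
class Device:
    """Stand-in for a router reachable over an isolated management network."""
    name: str
    running_config: str = ""
    history: list = field(default_factory=list)

    def push(self, config: str) -> None:
        self.history.append(self.running_config)
        self.running_config = config


def apply_with_rollback(device, new_config, old_config, health_check,
                        timeout=300, poll=10):
    """Push new_config; restore old_config if the device never passes health checks.

    health_check should probe the device over the management network, so it
    keeps working even when the change has broken the production data plane.
    """
    device.push(new_config)
    deadline = time.time() + timeout
    while time.time() < deadline:
        if health_check(device):
            return True            # change verified healthy, keep it
        time.sleep(poll)
    device.push(old_config)        # automated rollback over the management path
    return False
```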