> But, at 4:43 a.m. on July 8, a piece of code was introduced that deleted a routing filter. In telecom networks, packets of data are guided and directed by devices called routers, and filters prevent those routers from becoming overwhelmed, by limiting the number of possible routes that are presented to them.
> Deleting the filter caused all possible routes to the internet to pass through the routers, resulting in several of the devices exceeding their memory and processing capacities. This caused the core network to shut down.
Lesson no. 1: Do not design your system to have a single point of failure.
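To make the failure mode concrete, here's a toy sketch. Everything in it is invented for illustration (the capacity number, the prefix test), not Rogers' actual configuration; real core routers speak BGP/IS-IS and their limits come from hardware memory, not a Python constant.

```python
# Toy model of the quoted mechanism: a route filter limits what reaches the core.
# All names and numbers are illustrative only.

CORE_CAPACITY = 500_000  # assumed number of routes a core router can hold


def present_routes(full_table, route_filter=None):
    """Return the routes actually handed to the core router.

    route_filter is a predicate keeping only the routes the core needs.
    With the filter deleted (None), the entire internet table gets through.
    """
    kept = [r for r in full_table if route_filter is None or route_filter(r)]
    if len(kept) > CORE_CAPACITY:
        raise MemoryError(f"{len(kept):,} routes exceed core capacity of {CORE_CAPACITY:,}")
    return kept


# With the filter in place, only a manageable subset reaches the core:
#   present_routes(full_internet_table, lambda r: r.startswith("10."))  -> fine
# Delete the filter and the full table floods in:
#   present_routes(full_internet_table)                                 -> MemoryError
```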
> But, in the early hours, the company’s technicians had not yet pinpointed the cause of the catastrophe. Rogers apparently considered the possibility that its networks had been attacked by cybercriminals.
I mean, if you just pushed a config change and the whole network goes kaput, take a look at the config change before you start suspecting hackers.
I heard that the teams were having trouble communicating with each other and so the ones who pushed the config might not have been the ones looking for hackers.
This is why some hospitals still use the old pager systems to contact people in the city. One hospital-owned antenna on a battery can coordinate a lot of people. I don’t know what the equivalent to that would be in this case though.
Radiocommunications Regulations in Canada indicate that “radiocommunications in support of industrial, business or professional activities” cannot be done over ham radio (Section 47, Subsection C, Para iii).
I would imagine using it to run nurse-doctor comms would constitute professional use.
I have my ham license in America, so it may be different for our maple syrup friends up north. But one thing that is stressed repeatedly while preparing for the exam is that when an emergency happens, most of the restrictions fly out the door, as the emphasis is on helping address the emergency. So I am assuming Canada's FCC equivalent wouldn't be too peeved.
Of course, I'm sure this runs up against the question of whether or not it counts as an emergency.
A telecommunications outage, by itself, is not an immediate threat to your life or limb, and you'd need to exhaust other reasonable options to communicate before that emergency provision comes into play.
(Many hospitals do also have ham radio operators as part of their disaster plans, and they operate during drills.)
Let's not make ham radio operation out to be some complex voodoo that requires months of training. You tune your radio to a given frequency and press the PTT (push-to-talk) button.
Yes, sure, there are repeaters and other details to know about, but TX/RX on a single frequency isn't exactly the rocket science that sad hams make it out to be.
I don't think it immediately went 'kaput' -- and internal communications would also have been affected, which would delay reporting the failure. So it might have been hard to tie it to the config change at first.
"In the Rogers network, one IP routing manufacturer uses a design that limits the number of routes that are presented by the Distribution Routers to the core routers. The other IP routing vendor relies on controls at its core routers. The impact of these differences in equipment design and protocols are at the heart of the outage that Rogers experienced."
I read this as: we were halfway through a big project to replace the 6509s with new Juniper switches, but that takes a few years, so in the meantime we make configuration changes that behave differently on each vendor's routers.
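Schematically, the difference the report is describing might look something like this. The layer names, the prefix test, and the limit are all my own invented shorthand, not either vendor's actual behavior:

```python
# Schematic contrast of the two designs described in the report (illustrative only).

CORE_LIMIT = 500_000


def distribution_layer(routes, trim=True):
    """Vendor-A style: distribution routers only pass on what the core needs."""
    return [r for r in routes if r.startswith("10.")] if trim else list(routes)


def core_layer(routes):
    """Vendor-B style: the core itself enforces a maximum-prefix style limit."""
    if len(routes) > CORE_LIMIT:
        raise MemoryError("core router over capacity")
    return list(routes)


# Deleting the filter is roughly trim=False above; whether the core then
# survives depends entirely on whether it enforces its own limit.
```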
That really can't be the lesson. You can have redundancy, but there's still some central spot where that redundancy gets configured and controlled, and you can mess that up. So sure, maybe then you'd have multiple controllers and have them vote or something to see who wins. But then that configuration needs to be known to them, and you could mess that up. No matter how much you design, there's still a point at the top of the stack where it's a human being typing numbers into a text file or something equivalent.
And the Big Failures are always around this spot in the stack. Things like routing topology control (also top-level DNS configuration, another famous fat-finger lever) are "single points of failure" more or less by definition.
The router config is not the top of the stack though. There should be orchestration systems that push out changes to these devices and can roll them back when required.
In general, you can't do any kind of automated rollback when the network itself goes down (same deal for DNS, which is why I mentioned it -- similar datacenter-wide failures have been caused by things like stampeding initialization herds, etc.). I get what you're saying, I'm just pointing out that it's a little naive. No matter how high you make your stack of abstractions, there's still a human being at the top of the stack with fat fingers.
Sure, but that's why I think the takeaway here is to design the system such that you can do such automated rollbacks. For example, the devices should also be connected to some sort of isolated management network. Granted, pushing out a config to the routers could also break their connections to said management network, but it is less likely anyway.
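Roughly what I have in mind, as a hypothetical sketch: the Device stub and function names below are made up for illustration (real deployments would use NETCONF/Ansible or a vendor controller), and the health check is assumed to run over the out-of-band management network.

```python
# Hypothetical push-verify-rollback orchestration sketch; not any vendor's real API.

import time
from dataclasses import dataclass, field


@dataclass
class Device:
    """Stand-in for a router reachable over an isolated management network."""
    name: str
    running_config: str = ""
    history: list = field(default_factory=list)

    def push(self, config: str) -> None:
        self.history.append(self.running_config)
        self.running_config = config


def apply_with_rollback(device, new_config, old_config, health_check,
                        timeout=300, poll=10):
    """Push new_config; restore old_config if the device never passes health checks.

    health_check should probe the device over the management network, so it
    keeps working even when the change has broken the production data plane.
    """
    device.push(new_config)
    deadline = time.time() + timeout
    while time.time() < deadline:
        if health_check(device):
            return True            # change verified healthy, keep it
        time.sleep(poll)
    device.push(old_config)        # automated rollback over the management path
    return False
```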