There's a Reddit thread going on discussing an interesting edge case regarding 911 and the outage.
In Canada, 911 service is accessible without a SIM card; if your phone doesn't have a SIM, any cell tower should in theory still accept and route the call as normal. However, in Rogers' case, because the cell towers and their authentication mechanisms were still operational, any 911 call from a phone with a Rogers SIM card would route through the Rogers network and only the Rogers network; the one that couldn't service any calls. In essence, your 911 service was completely cut off.
The workaround is to pull the SIM card prior to making the 911 call, but it leaves an interesting question about what you're supposed to do in an eSIM world where pulling the SIM is not possible but you're again in this kind of situation.
To be clear I don't know if this is a real problem or not, but it is an interesting thought either way.
I use one paperclip per week, and the only time I have one with me is when I'm doing that one thing on Monday morning. Other than that, there's one bowl full of them, which is actually in the same room as the pile of miscellaneous stuff that includes the real SIM ejector gadgets.
I carry at least 2 paperclips with me (1 live, 1 backup). They don't add much to the space or weight of my EDC, but they can come in handy in a time of need.
> The workaround is to pull the SIM card prior to making the 911 call, but
The actual “but” clause is that nobody is going to be able to diagnose the problem just from the terminal device. It’s like when our cable service goes down*, the kids shout out “WiFi is down” (while to me it’s working fine).
* this happens a lot because I live in Palo Alto which, despite having deployed fibre close to every address in the city back in the 1980s, was never able to actually, you know, deploy IP service.
Didn't actually put fiber close to every address. Maybe certain parts of the city but really not much.
I live in midtown where the city hasn't gotten around to undergrounding the wires. In theory that means it's easier to put in fiber. In reality it means no fiber and cable internet that gets flaky when it rains or if the wind blows.
I think it’s officially within half a mile, but closer everywhere except the hills.
I looked at using the fiber to connect my house directly into the PAIX (I co-founded an ISP years ago). PA was happy to rent me dark fibre which I could light up however I liked. They charged by the distance (I was about 1.5 km from the PAIX). IIRC it was going to be about $25K for the capex and then $2K/month to use it (plus my equipment). My wife sensibly said no.
The correct mitigation is to codify a REQUIREMENT (if this isn't already the case) that the calls go through. From a technical standpoint, the failure to route to a given provider to terminate the call should be treated as an overflow, and the call should proceed down alternative service options, even from other providers.
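For illustration, a minimal sketch of that overflow-and-retry behaviour, assuming a hypothetical handset-side API. The carrier names, functions, and fallback order below are all invented; this is not anything a real baseband actually exposes.

    # Hypothetical sketch only: if the home network cannot terminate the 911
    # call, fall through to the next reachable carrier instead of giving up.

    class CallSetupFailed(Exception):
        pass

    def attempt_call_via(network, number):
        # Stand-in for the radio/baseband layer. Here we simply pretend the
        # home carrier's core is down and the other carriers are healthy.
        if network == "Rogers":
            raise CallSetupFailed(f"{network}: core network unreachable")
        return f"{number} call connected via {network}"

    def place_emergency_call(home="Rogers", others=("Bell", "Telus")):
        for network in (home, *others):
            try:
                return attempt_call_via(network, number="911")
            except CallSetupFailed:
                continue  # treat the failure as overflow and try the next carrier
        raise RuntimeError("no reachable network could terminate the call")

    print(place_emergency_call())  # e.g. "911 call connected via Bell"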
>Staffieri says Rogers has made meaningful progress on a formal agreement between carriers to switch 911 calls to one another’s networks automatically, even in the event of an outage on any single carrier’s network. He says the company is physically separating its wireless and internet services to create an “always on” network that meets a higher standard of reliability.
It didn't work. I tried to manually connect to other service providers (roaming) and the connection was rejected. I'd like to believe the other service providers automatically blocked Rogers customers to keep their own service alive.
His point was that you could do that for 911, not for actual roaming service.
I'm no expert, but I feel like roaming still requires some sort of "authentication" against the original network, and obviously Rogers would have failed on that part.
I think that would work. Just tried it and while I got “zero bars”, going into field test mode showed I was sitting connected to another roaming network with limited service.
I would think the best workaround (until carriers/handsets fix this) is to change your cell network - which you can also do if you have poor signal in an area: On iPhone, go to Settings > Cellular > eSIM name > Network Selection and turn off “Automatic”. Then pick a different network.
Admittedly, this is neither fast nor easy/intuitive. As pointed out in a sibling comment, the fault lies with Apple and Google for not forcing the 911 call on a different network after it fails.
The fix carriers are discussing sounds like finding a way to re-route 911 calls over other networks on the same tower, or some other failover mechanism. It’s not clear if that would be simpler than having a handset try again on a different network.
A much bigger entity, but I know the European Bank has 2 datacenters (Amsterdam and Rome), connected by 2 separate backbones going through completely separate geographic areas, and they switch from one datacenter to the other every 6 months to test out their DR. In fact, they have been doing that for decades.
Why switch 100% instead of continuously running both with transparent failover in case of failure? It seems like you risk a major outage every 6 months with the current scheme unless I’m missing something.
I was told this years ago, so I'm not sure of the details. What they are probably doing is switching the traffic to the main DC, not switching one off completely.
The story I heard, which I'm going to irresponsibly repeat without knowing the truth of it, is that Interac did have a backup ISP, but it was one of the smaller ISPs that buys access to Rogers fibre, so their backup went down because Rogers was down.
This definitely showed me that a cashless society is a bad idea. Thankfully I had some cash on hand, otherwise I would have been stuck for food and gas. It also caused a lot of issues at work, as all my coworkers need to drive to do their job and some had no cash on hand. Of course, that same day the chip on my Visa stopped working and a new one was ordered in the mail. But if not for the cash on hand it would have been a bad day.
After a very frustrating experience with fraud prevention shutting off my CC in Paris, I called and very pointedly made sure they weren't going to fuck me again on my next trip. Day 2 in London, biggest Visa network outage in recent memory happened.
What was most surprising was how long it took some store clerks to flip their point of sale systems into queuing mode so that people could keep buying things. There was a period where we just went from shop to shop trying to buy the same thing until we found one that didn't decline the card.
Not only that. They are often pathologically incapable of giving straight answers (I am talking about business dealings). It was / still is a major pain in the ass for me, as I am a very straightforward guy and have no desire to decipher what their evasive answers really mean. There are some rare exceptions though. Dealing with XXX Motors Canada almost shocked me in comparison, because when their guy said they would do something it was set in stone. They went out of their way to meet my needs. Dealing with Americans was a pleasure for me as well. They say A and it means A and it will be done.
Disclaimer - I am Canadian but was born in USSR, came here some 30 years ago.
Who/what do you consider to be a "Canadian" in 2022?
In Toronto and the surrounding cities, for example, about half of the population are foreign-born.
A significant proportion of the population of many other Canadian urban areas are foreign-born.
About 20% of the overall Canadian population are foreign-born.
Many of these people are from cultures that are quite different to anything resembling "traditional" Canadian culture.
Acquiring Canadian citizenship later in life doesn't necessarily change a person's values, attitudes, and so on.
A lot of Canadian-born individuals have one or both parents who are foreign-born, which also can have an impact on one's values, attitudes, and behaviour.
Even among those with multi-generational ties to Canada, there were already significant cultural/values/attitude/behavioural differences among the various groupings.
Ultimately, in any given interaction with somebody in Canada today, there's a good chance you're dealing with somebody whose ties to "traditional" Canadian culture are limited, or even non-existent.
Maybe some kind of a relatively cohesive "Canadian" culture or identity existed at one point, several decades ago, but I don't think that's the case any longer, especially in the urban areas.
I am frequently that person giving evasive answers.
Why? To avoid blame, and I do not speak up when my manager leaves me in a bad position.
My priority in many meetings is merely to not get nailed down on something that I can be whacked with later. Better to avoid all accountability, as Canada doesn't really reward doing a good job.
No, it has to do with our relatively flat social structure. Nobody can win, so may as well focus on not losing. Canada is not a country that will let you win big for having big accomplishments.
I live in the east end of Regina and had an interesting experience with someone driving east and looking for a Royal Bank. The only Royal I could think of was downtown; both our phones were on the Rogers network, and I didn't even know where to find a phone book.
It was amazing to think that one provider could go out and that would render me completely useless in my own city. I’ve outsourced memory to my phone for so long that I only know places I frequent.
He ended up using a TD. I failed prairie hospitality.
"Although Bell and Telus offered to help, Rogers quickly determined that it would not be able to transfer its customers to its rivals’ networks because certain elements of the Rogers network, such as its centralized user database, were inaccessible as a result of the outage."
It sounds like their control/management plane (with the user database) was dependent on their data plane. So a data plane outage was more challenging to mitigate than it would have been with a decoupled architecture. Good lesson for any architecture.
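One way to picture that decoupling, purely as an illustration: keep a read-only copy of the data the control plane needs reachable even when the primary path is gone, so authentication doesn't die with the data plane. The names and the cached-snapshot idea below are assumptions for the sketch, not how Rogers' subscriber database actually works.

    # Illustrative sketch: an authorization check that degrades gracefully when
    # the central subscriber database is unreachable, instead of failing outright.

    LOCAL_SNAPSHOT = {"subscriber-123": {"status": "active"}}  # periodically synced copy

    class CentralDbUnreachable(Exception):
        pass

    def central_lookup(subscriber_id):
        # Pretend the data-plane outage has taken the central database with it.
        raise CentralDbUnreachable("core network down")

    def authorize(subscriber_id):
        try:
            record = central_lookup(subscriber_id)
        except CentralDbUnreachable:
            # Degraded mode: fall back to the last-known-good local copy so the
            # control plane keeps working while the data plane is repaired.
            record = LOCAL_SNAPSHOT.get(subscriber_id)
        return record is not None and record.get("status") == "active"

    print(authorize("subscriber-123"))  # True even with the central DB down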
The fact that the CTO was replaced so quickly is not just a sign of scapegoating (which I am sure it is), but also the truth is that so many of these C-level tech executives are utterly incompetent at their jobs, and it takes a crisis for many to realize it. The C-levels at so many large companies are akin to high-level political appointments at government agencies. It's a good thing the lower levels know what's going on. In any crisis, the know-nothing C-levels are first to be ejected, where they can go look for their next appointment.
At the level of CTO of a major telecom, I don't see his firing as scapegoating; it's accountability. The buck has to stop with someone, and the CTO is responsible for the technical infrastructure of the company.
I think the parent's point was that absent a crisis the same (incompetent) CTO would still be in his job. So if that's what it takes to boot a CTO, who will then just land in another executive position, it's a poor mechanism for ensuring technology orgs (in old companies) are well-run.
If the current CTO is fired, the next CTO will pay more attention to reliability. This was clearly an organizational failure. Blame free retrospectives have a purpose, but if there is no accountability at the top level, there can simply be repeated failures regardless of the "lessons learned."
Rogers is also known to historically have had a somewhat 'flatter' and less mature network architecture as compared to their peers, Telus and Bell. Likely owing to their roots as a cable television provider as opposed to landline telephony - ie: dumb pipe vs. less-dumb pipe.
Lots of things went wrong here, to name a few:
- lack of rigour in their process that allowed a critical modification without understanding its effects in context
- lack of monitoring and metrics to alert them immediately of the problem
- lack of emergency rollback capability to revert to the known-good config (see the sketch after this list)
- lack of independent business-continuity comms channels. if you're a Rogers CTO or senior responsible adult and you don't have a backup Telus or Bell SIM card (and vice versa), you learned why you need one.
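To make the monitoring and rollback items above concrete, here's a toy sketch. The health signal, the threshold, and the config shape are all invented; real network automation would do this against actual device state.

    import copy

    def route_count_health(state):
        # Invented health signal: how many routes the core is holding.
        return state["routes_installed"]

    def apply_change(state, change):
        new_state = copy.deepcopy(state)
        change(new_state)
        return new_state

    def guarded_change(state, change, max_routes=900_000):
        known_good = copy.deepcopy(state)       # the emergency rollback point
        candidate = apply_change(state, change)
        if route_count_health(candidate) > max_routes:
            print("health check failed, reverting to known-good config")
            return known_good
        return candidate

    core = {"routes_installed": 500_000}
    filter_removed = lambda s: s.update(routes_installed=1_200_000)  # e.g. a full table leaks in
    core = guarded_change(core, filter_removed)
    print(core)  # still {'routes_installed': 500000}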
Also lack of downtime procedures. They seemed to have assumed that nothing could go wrong when you should assume that everything could go wrong.
From their own documentation, in mid-2015 (kinda late, but better than never), the providers issued each other their SIMs to use in these situations. Who qualified exactly is a mystery.
> CKOT-FM and CJDL-FM (Tillsonburg) were not able to air live programming from 4:45AM EDT to 10:58PM on the same day. During that time, evergreen programming was aired from an MP3 player at the base of the transmitter until an Internet connection was re-established between the studio and the transmitter site
I'm impressed they went to these lengths to get services up!
Bah - your 3rd example is just a single site, and the second was actually caused by BGP. I'm not convinced BGP wasn't the problem with the first either ;-)
BGP is something that makes everything more resilient when it's achieving its goals; it's not some unknown assassin. Large-scale BGP harm almost always comes from a trusted, well-intentioned peer making an honest mistake.
Both my home internet and mobile are through Rogers (for the time being). I had no access to the internet for 15 hours, 6am to 9pm. Couldn't do my job as a remote developer.
And all through the day I kept thinking to myself "I bet someone pushed an update to prod, causing this. And I'm glad that this time it wasn't me."
> But, at 4:43 a.m. on July 8, a piece of code was introduced that deleted a routing filter. In telecom networks, packets of data are guided and directed by devices called routers, and filters prevent those routers from becoming overwhelmed, by limiting the number of possible routes that are presented to them.
> Deleting the filter caused all possible routes to the internet to pass through the routers, resulting in several of the devices exceeding their memory and processing capacities. This caused the core network to shut down.
Lesson no 1: Do not design your system to have a single point of failure.
> But, in the early hours, the company’s technicians had not yet pinpointed the cause of the catastrophe. Rogers apparently considered the possibility that its networks had been attacked by cybercriminals.
I mean, if you just pushed a config change and the whole network goes kaput, take a look at the config change before you start suspecting hackers.
I heard that the teams were having trouble communicating with each other and so the ones who pushed the config might not have been the ones looking for hackers.
This is why some hospitals still use the old pager systems to contact people in the city. One hospital-owned antenna on a battery can coordinate a lot of people. I don’t know what the equivalent to that would be in this case though.
Radiocommunications Regulations in Canada indicate that “radiocommunications in support of industrial, business or professional activities” cannot be done over ham radio (Section 47, Subsection C, Para iii).
I would imagine using it to run nurse-doctor comms would constitute professional use.
I have my Ham license in America and so it may be different for our maple syrup friends up north. But one thing that is stressed repeatedly throughout preparing for the exam is that when an emergency happens most of the restrictions fly out the door as the emphasis is on helping address the emergency. So I am assuming the FCC equivalent wouldn't be too peeved.
Of course, I'm sure this runs up against the question of whether this is an emergency or not.
A telecommunications outage, by itself, is not an immediate threat to life or limb to you, and you'd need to exhaust other reasonable options to communicate before that emergency provision comes into play.
(Many hospitals do also have ham radio operators as part of the disaster plans and operate during their drills.)
Let's not make ham radio operation out to be some complex voodoo that requires months of training. You tune your radio to a given frequency and press the PTT button.
Yes sure, there are repeaters and other details to know about, but tx/rx on a single frequency isn't exactly the rocket science that sad hams make it out to be.
I don't think it immediately went 'kaput' -- and then internal communications would have also been affected so that would delay communicating the failure as well. So, it might have been hard to initially tie it to the config change.
"In the Rogers network, one IP routing manufacturer uses a design that limits the number of routes that are presented by the Distribution Routers to the core routers. The other IP routing vendor relies on controls at its core routers. The impact of these differences in equipment design and protocols are at the heart of the outage that Rogers experienced."
I read this as: we were halfway through a big project to replace 6509s with new Juniper switches but that takes a few years, so in the meantime we sometimes make configuration changes to each router that have different behaviors.
That really can't be the lesson. You can have redundancy, but there's still some central spot where that redundancy gets configured and controlled, and you can mess that up. So sure, maybe then you'd have multiple controllers and have them vote or something to see who wins. But then that configuration needs to be known to them, and you could mess that up. No matter how much you design, there's still a point at the top of the stack where it's a human being typing numbers into a text file or something equivalent.
And the Big Failures are always around this spot in the stack. Things like routing topology control (also top-level DNS configuration, another famous fat-finger lever) are "single points of failure" more or less by definition.
The router config is not the top of the stack though. There should be orchestration systems that push out changes to these devices and can roll them back when required.
In general, you can't do any kind of automated rollback when the network goes down (same deal for DNS, which is why I mentioned it -- similar datacenter-wide failures have been due to stuff like stampeding initialization herds, etc...). I get what you're saying, I'm just pointing out that it's a little naive. No matter how high you make your stack of abstractions, there's still a human being at the top of the stack with fat fingers.
Sure but that’s why I think the take-away here is to design the system such that you can do such automated rollbacks. For example, the devices should also be connected to some sort of isolated management network. Granted, pushing out a config to the routers could also break their connections to said management network, but it is less likely anyways.
Basically they configured all their routers to install all the routes in the GRT?
Seems like an innocent and unavoidable mistake.
My opinion: you know how Juniper routers require you to commit a change? That should be enforced by dedicated firmware, and there should be (optional) multiple rollback confirmations. So you commit a change and confirm; an hour later you have to confirm again, or else a hard reset and rollback takes place, with a protocol in place so the rollback can be delayed when other devices in connected segments are also rolling back.
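For reference, Junos' "commit confirmed" already does the first half of this: the change is rolled back automatically unless you confirm it within a window. A rough Python sketch of that confirm-or-rollback behaviour, with the cross-device coordination proposed above left out:

    import threading

    class Device:
        """Toy model of a confirm-or-rollback commit. Timings are compressed;
        a real implementation would live in firmware, not a Python timer."""

        def __init__(self, config):
            self.running = config
            self.previous = None
            self.pending = None

        def commit_confirmed(self, new_config, rollback_after=2.0):
            self.previous = self.running
            self.running = new_config
            # If nobody confirms within the window, revert automatically.
            self.pending = threading.Timer(rollback_after, self._rollback)
            self.pending.start()

        def confirm(self):
            if self.pending:
                self.pending.cancel()
                self.pending = None

        def _rollback(self):
            print("no confirmation received, reverting to previous config")
            self.running = self.previous

    d = Device({"route_filter": "present"})
    d.commit_confirmed({"route_filter": None})
    # Nobody calls d.confirm(), so the filter-less config is rolled back
    # automatically about two seconds later.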
Avoidable mistakes are still mistakes, and mistakes happen, even when they shouldn't.
There is no one thing you could do to avoid catastrophic failure. The best thing you can do is break your system up entirely and simply never do any change which can affect the entire system. Which is still very hard to do.
Imagine you have a route that crosses all of Canada. For each end of Canada you want at least 3 separate networks that can each carry 1/2 of the traffic. On failure of one network, somebody has to take that failed network's traffic and send half to each of the two remaining networks. But if somebody fucks up and sends all of that traffic to just one of the networks, that network will be flooded and inoperable, and with the first network also inoperable, that leaves only one network, carrying 1/3 of the country's traffic, online. Whatever customers designed their own stuff to depend on the first or second network are also screwed. This is just one of a million different scenarios that needs complex controls to prevent a (mostly) catastrophic failure.
There is nobody using formal proofs to show the design makes sense, so it's just humans winging it and hoping their contingencies work out.
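Spelling out the arithmetic in that scenario (same numbers as above: each network nominally carries 1/3 of the traffic and has capacity for 1/2 of it):

    from fractions import Fraction

    capacity = Fraction(1, 2)
    load = {"A": Fraction(1, 3), "B": Fraction(1, 3), "C": Fraction(1, 3)}

    # Network A fails. Correct response: split its 1/3 evenly across B and C.
    # Each ends up at 1/3 + 1/6 = 1/2, i.e. exactly at capacity.
    good = {"B": load["B"] + load["A"] / 2, "C": load["C"] + load["A"] / 2}
    print(all(v <= capacity for v in good.values()))  # True

    # Botched response: dump all of A's traffic onto B.
    # B ends up at 1/3 + 1/3 = 2/3 > 1/2, gets flooded and fails as well,
    # leaving only C online, carrying just 1/3 of the country's traffic.
    bad = {"B": load["B"] + load["A"], "C": load["C"]}
    print(bad["B"] <= capacity)  # False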
Only if you have incompetent people. Everyone overlooks things but forgetting to commit a change is on you. Ideally the rollback would revert the network state.
> Deleting the filter caused all possible routes to the internet to pass through the routers, resulting in several of the devices exceeding their memory and processing capacities. This caused the core network to shut down.
I don't get this. If a switch can handle only x number of routes, it won't get "overwhelmed" if you try to add more; it would just refuse to accept any more routes. Why would the network go down completely? These things are designed with extremely robust software: you simply start getting error logs saying that the device has run out of memory, but operation continues. Most of the heavy lifting is done in hardware chips, so the software just can't program new routes, but the routing itself should continue unaffected.
> it won't get "overwhelmed" if you try to add more, it would just refuse to accept any more routes
That's not how it worked in old Cisco routers: as more and more routes are added, at some point the router runs out of fast memory (TCAM) and becomes either too slow or not operational at all. Unless you manually configure a route-number limit, it will install more routes into the FIB than it can handle. So one has to monitor TCAM utilization and, based on that, configure various limits to prevent its exhaustion.
Maybe TCAM exhaustion is still a problem even in modern Cisco equipment; I've moved to another field and don't know much about it anymore.
I've personally tested this scenario on relatively newer Cisco devices and it literally just prints a message on the console to the tune of "TCAM exhausted, refusing to add route A.B.C.D/X" and continues working without problems. I think there is a config on some models to install the route in the slow path, but by default it just doesn't use the route.
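A toy contrast between those two behaviours. The numbers and policies below are invented and are not how any particular Cisco or Juniper line card works; the point is only the difference between refusing excess routes and exhausting resources.

    def install_routes(route_count, capacity=1_000_000, policy="refuse"):
        """Invented policies: 'refuse' drops installs past the limit and keeps
        forwarding (what the parent observed); 'exhaust' keeps accepting until
        resources run out (roughly what the Rogers report describes)."""
        if route_count <= capacity:
            return f"installed {route_count} routes, healthy"
        if policy == "refuse":
            return (f"installed {capacity} routes, refused {route_count - capacity}, "
                    "still forwarding on existing routes")
        return "memory/processing exhausted, device stops processing traffic"

    print(install_routes(400_000))                      # a filtered table: fine either way
    print(install_routes(1_800_000, policy="refuse"))   # graceful degradation
    print(install_routes(1_800_000, policy="exhaust"))  # a core-wide outage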
It's funny that all of Canada's internet providers have marketed bundling of services as an advantage (bundle and save money!). I'd love for someone to spoof the ads to show that all you're doing is helping create a single point of failure instead of redundancy.
Copied from the docx - pruning the bits leading into and out of hidden content.
The Outage
The configuration change deleted a routing filter and allowed for all possible routes to the Internet to pass through the routers. As a result, the routers immediately began propagating abnormally high volumes of routes throughout the core network. Certain network routing equipment became flooded, exceeded their capacity levels and were then unable to route traffic, causing the common core network to stop processing traffic. As a result, the Rogers network lost connectivity to the Internet for all incoming and outgoing traffic for both the wireless and wireline networks for our consumer and business customers.
The Recovery
To resolve the outage, the Rogers Network Team assembled in and around our Network Operations Centre (“NOC”) and re-established access to the IP network. They then started the detailed process of determining the source of the outage, leading to identifying the three Distribution Routers as the cause. Once determined, the team then began the process of restarting all the Internet Gateway, Core and Distribution Routers in a controlled manner to establish connectivity to our wireless (including 9-1-1), enterprise and cable networks which deliver voice, video and data connectivity to our customers. Service was slowly restored, starting in the afternoon and continuing over the evening. Although Rogers continued to experience some instability issues over the weekend that did impact some customers, the network had effectively recovered by Friday night.
what was the root cause of the outage (including what processes, procedures or safeguards failed to prevent the outage, such as planned redundancy or patch upgrade validation procedures);
Like many large Telecommunications Services Providers (“TSPs”), Rogers uses a common core network, essentially one IP network infrastructure, that supports all wireless, wireline and enterprise services. The common core is the brain of the network that receives, processes, transmits and connects all Internet, voice, data and TV traffic for our customers.
Again, similar to other TSPs around the world, Rogers uses a mixed vendor core network consisting of IP routing equipment from multiple tier one manufacturers. This is a common industry practice as different manufacturers have different strengths in routing equipment for Internet gateway, core and distribution routing. Specifically, the two IP routing vendors Rogers uses have their own design and approaches to managing routing traffic and to protect their equipment from being overwhelmed. In the Rogers network, one IP routing manufacturer uses a design that limits the number of routes that are presented by the Distribution Routers to the core routers. The other IP routing vendor relies on controls at its core routers. The impact of these differences in equipment design and protocols are at the heart of the outage that Rogers experienced.
The Rogers outage on July 8, 2022, was unprecedented. As discussed in the previous response, it resulted during a routing configuration change to three Distribution Routers in our common core network. Unfortunately, the configuration change deleted a routing filter and allowed for all possible routes to the Internet to be distributed; the routers then propagated abnormally high volumes of routes throughout the core network. Certain network routing equipment became flooded, exceeded their memory and processing capacity and were then unable to route and process traffic, causing the common core network to shut down. As a result, the Rogers network lost connectivity internally and to the Internet for all incoming and outgoing traffic for both the wireless and wireline networks for our consumer and business customers.
how did the outage impact Rogers’ own staff and their ability to determine the cause of the outage and restore services;
At the early stage of the outage, many Rogers’ network employees were impacted and could not connect to our IT and network systems. This impeded initial triage and restoration efforts as teams needed to travel to centralized locations where management network access was established. To complicate matters further, the loss of access to our VPN system to our core network nodes affected our timely ability to begin identifying the trouble and, hence, delayed the restoral efforts.
Despite these hurdles, our preestablished business continuity plans enabled staff to converge at specific rally points. Those equipped with emergency SIMs on alternate carriers that enabled our teams to switch carriers and assist in the initial coordination efforts. Further, we rapidly relocated our employees to two of our main offices in the GTA (# #). The critical network employees were able to gain physical access to our network equipment. Other essential employees were able to use alternate SIM cards, as per our “Alternate Carrier SIM Card Program” (described in Rogers(CRTC)11July2022-1.xiii below). Other employees were able to work from # #. Together, these groups were able to establish the necessary team to identify the cause of the outage and recover the network.
what contingencies, if any, did Rogers have in place to ensure that its staff could communicate with each other particularly in the early hours of the outage;
On July 17th, 2015, the Canadian Telecom Resiliency Working Group (“CTRWG”), formerly called Canadian Telecom Emergency Preparedness Association, established reciprocal agreements between Rogers and Bell, and between Rogers and TELUS, to exchange alternate carrier SIM cards in support of Business Continuity. This is to allow TSPs to communicate within their organizations in the event of loss of their respective networks. Bell, Rogers and TELUS took the lead to provide SIM cards to all CTRWG members.
# #.
When it was realized that Rogers entire core network was offline, employees started swapping out our Rogers SIM cards with our alternate carrier SIM Cards. This previously established contingency plan allowed us to begin communicating within our organization in the early hours of the outage and to start restoring services.
The opposite: they are effectively covering for each other. They (the orgs, more generally) are complicit in allowing this dysfunctional rent-seeking monopoly to continue and expand. This was a political failure long before the "coding error" took place.
I would strongly argue that Comcast would be a huge step backwards even for the Canadians. "Comcast Is America's Most Hated Company" [0]. Rogers' 10.2 million wireless subscribers lost connectivity for maybe a couple of days. Comcast has been failing at servicing more than 31 million broadband internet subscribers for decades. Comcast has also done everything they can to destroy net neutrality. The sooner Comcast is a thing of the past, the better.
Having dealt with most major Canadian ISPs and Comcast, I think the only Canadian ISP I would rank above them is Telus. Maybe Bell depending on the province.
The implication in the article is that this happened during a scheduled change window (the 6th change in a set of 8), but it is still just a strong implication. I'm still of the apoplectically hawkish view that it was an attack, and equivocal language just reinforces that belief, but charitably, I'm less and less attached to that view.
Let's say this was a mistake. When they tested the config in a pre-prod environment, they would have noticed the redistribution of those routes. If they were pasting the config into a router instead of sending it via scp, maybe there's a mouse buffer/paste error, but you don't do that on a core device anymore. Judging by the response time, they weren't doing this onsite at the console either and their OOB access didn't work because it probably used the bunged cell network. Not negligent, but certainly risky.
I have personally made routing mistakes that caused peer resets, like mistyping a prefix on a static route that overrode the route to a peer interface over an exchange, and I have also overloaded circuits because of misguided path prepends. I haven't had enable privs on a core router in probably 20 years, but this wholesale redistribution is what happens when you take one peer's routes and announce them to another with you as the origin, known as "announcing the internet." As I remember, this isn't a single filter.
Not to muddy it, and it doesn't help my attack theory much, and please call this out as bullshit with a correction if you recognize it to be such, but someone who used to work there told me something to the effect that their architecture had combined some of the older ASNs from previous Rogers acquisitions, and that was part of the problem: they essentially have the old ASes speaking iBGP to each other internally within Rogers, with only the single Rogers AS facing the internet as a peer, sort of the way some of the old mega-ASNs did it. It implies there were different policy domains with different levels of maturity inside the main AS, and those internal peers don't filter routes they receive from the core. When you have a network that has grown organically over the last couple of decades, this happening at least once is inevitable. Legacy policies just redistributed what they were sent, and between flapping, dampening penalties, transiting and backhauling mobile voice traffic over IP networks, and backup and OOB reachability, this is one of those perfect storms where for years the network worked magically, and then one day it failed just as magically.
Undoubtedly, there are probably some senior engineers sitting around saying, "I told you this would happen, and we've been trying to tell you for years," but when you know this stuff, it's your responsibility to also be persuasive.
I was very attached to the attack theory for a bunch of reasons, but the conceptual chain of events above makes more sense to me, and it is a lot more like a rubber gasket on a fuel tank expanding too rapidly on an unexpectedly cool day. Obvious in hindsight and simple to explain, but invisible up until the moment of ignition.
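To make the "announcing the internet" failure mode concrete, a toy export policy sketch. The prefixes and route sources below are invented; the safe version only re-announces routes you originate or carry for customers, while the broken one re-exports everything learned from one peer to the next.

    # Invented routes, tagged with where we learned them.
    routes = {
        "203.0.113.0/24":  "peer",      # someone else's route, learned at an exchange
        "198.51.100.0/24": "peer",
        "192.0.2.0/24":    "customer",  # a route we are actually paid to carry
        "198.18.0.0/15":   "internal",  # illustrative stand-in for our own prefix
    }

    def export_routes(filtered=True):
        if filtered:
            # Safe policy: announce only what we originate or transit for customers.
            return [p for p, src in routes.items() if src in ("customer", "internal")]
        # "Announcing the internet": re-export everything with us in the path.
        return list(routes)

    print(export_routes(filtered=True))   # 2 routes
    print(export_routes(filtered=False))  # all 4, including other peers' routes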
Botching a prod update -- that happens to the best of us. And sorry to say I doubt Rogers would attract the best of us but I digress. But only kind of digress.
When the human waste hits the circular cooler device, that's when we find out how well an organization is built and managed.
Who is surprised here it took them this long to recover? The necessary no-blame culture, temporary removal of decision barriers, all-hands-on-deck to get back to normal: all the things the luckier among us take for granted are guaranteed to be missing here.
> Who is surprised here it took them this long to recover?
They have no independent management infrastructure. Their management network rides logically on top of their main physical network. So when the latter went down you probably had people going to the physical locations of various DCs. But then you have to coördinate between them, and if your employees use your own cell phone service for communication, and that's down…
So the first couple of hours was figuring out what went wrong, then it was having people scramble to locations to plug in serial consoles, then it was buying a bunch of SIM cards from other companies, and finally folks managed to wrap their heads around the big picture and come up with a way to bootstrap the network.
Sounds like some employees already had those for such an event, which is impressive. From the article:
> some employees started swapping out their SIM cards for Bell or Telus SIM cards that they had received back in 2015 as part of an emergency contingency plan established between the wireless carriers.
An independent management infrastructure would also help with assessment as well as recovery. There might be a need to require either such an independent network, or at least one that does not reuse your own internal network as the backup link for the management interface.
"The average system administrator salary in Canada is $71,981 per year or $36.91 per hour. Entry-level positions start at $61,425 per year, while most experienced workers make up to $94,400 per year." (https://ca.talent.com/salary?job=system+administrator)
They can be cost effective only if you can travel in quiet parts of the year. I was also mainly referring to buying things in Europe/US/UK with a salary paid in a weaker currency.
I doubt any of the IT infrastructure in Canada is built by the “best of us”. And it shows: banking, healthcare, telecom, they are all stuck somewhere in the 90s.
The best of us can't be bothered with those insignificant things! They're too busy building algorithms to squeeze every penny out of advertisement dollars on social media.
I am a dev with 4 years experience. I haven’t been able to get a bank, telecom, or especially health care org to offer me more than 80K Canadian a year. 64K USD.
These are sleepy places where you go to work 15 hours a week and on demanding days call in sick.
And can you blame them? That's where the money is.
And even if you are into charity, it's usually more efficient for you as an individual to make more money and then donate to whatever GiveWell lists at the top of their list of most bang-for-your-buck charities, instead of working in a soup kitchen or at a Canadian bank.
oh I am absolutely not blaming anyone! The free market is an efficient machine in this case. It simply means that there is room for innovation in those industries like healthcare and banking. We have already started to see this innovation with companies like One Medical and Square.
In general, I found it helpful to not look at 'free markets' as a binary Yes/No thing, but as a continuum with lots of different aspects. Free markets are best, but markets that are only slightly freer are (typically) already slightly better. Incremental improvements help.
For example, mobile phone connectivity is still very regulated, but in many countries competition seems to be much fiercer there than for home broadband.
About the last point: the (partial) privatisation of British Rail was a mess, far from a textbook case. But still, ridership numbers that had been in seemingly terminal decline picked up in absolute terms, and rail's relative share of the overall transportation market increased too.
The impact of deregulation of (American) airlines has been tremendous. Flying is now cheaper than ever. But to address my point: flying is still very strictly regulated, just less so than before (especially in the areas of pricing and competition).
Despite common complaints on the Internet, the US is actually still really, really good at running railroads. It's just that their area of competence is freight rail. Passenger rail in the US is anemic.
Guess which part is in government hands, and which one is comparatively free market and less regulated? However, neither area is completely free market or completely state controlled.
Another example: a few years ago Germany legalized long distance busses. The old ban was a hangover from the Nazi era.
https://www.dw.com/en/regulations-eased-on-long-distance-bus... is an overview article from just before the legalization of busses came into effect. I leave it as an exercise for the reader to find some sources from afterwards. Google Translate might be helpful, if you don't read German.
And, of course, legalization and deregulation here just mean different (and a bit less) regulation. There's still plenty of rules. We are talking about Germany, after all.
I like OneMedical (I'm a customer) but I don't know how much "innovation" it's done in healthcare. That's not to say it's their fault (healthcare is a startup graveyard), but you can't really go to OneMedical for anything beyond the most trivial, and you can book some stuff / send messages using an app.
Kaiser arguably did more innovation than OneMedical, as far as I can tell.
Isn't it funny, though, how many people say they can't help working soulless jobs because the capitalist machine controls their lives, but also say they refuse to feel the least bit invested in their jobs because they don't let their lives be controlled by the capitalist machine?
Working purely for money and being emotionally detached from your work is one of the bad consequences of capitalism -- in fact it's the most hellish consequence of capitalism that most people working software development jobs are likely to personally experience. I don't understand why people working at adtech companies talk about being emotionally checked out all day as a healthy way to adapt to the system.
I grew up in East Germany. Socialist workers' paradise!
I can tell you that jobs sucked compared to comparatively capitalist West Germany.
(Basically, capitalism might or might not be bad; but incremental improvements to it have a much better track record than trying to switch to a completely different system.)
I agree! I work for a company that makes a product that I think is a positive force in the world rather than a destructive one. I work for less money than I could at Google or Facebook or Amazon and allow myself to get emotionally invested in the work, and I think having the opportunity to make that choice has made my life much better. I know this is a luxury that not everybody gets under capitalism, but I'm always confused that so many people who could have it willingly pass it up to make more money doing a job that they can only justify by pretending they don't have a choice.
It’s actually worse than that. I would hardly count myself among the best, but I’m certainly experienced and would absolutely love to try and fix the absolutely appalling state of health care records for example.
But besides there being no money in it, between regulatory hurdles, fragmented ecosystems and lack of incentives for individual doctors or hospitals to be good at sharing or maintaining these types of records, I could give the software away for free and still fail to gain traction.
In all of these spaces I don’t see how it progresses without government regulation and/or incentives, because the incumbents have literally no reason to improve.
Having to deal with legacy systems and interoperability is a significantly bigger issue than developer pay, as evidenced by the very well-functioning digital infrastructure in much poorer countries that just leapfrogged the old stuff.
But in poorer countries, those companies probably pay relatively well. Sure, they make less than Canadian developers at big banks, but in relative terms they make more, so they would be better off in many cases.
They pay relatively well in Canada too, and, for argument's sake, the US, which probably has the best-paid devs across the board, has a financial infrastructure from hell when it comes to payments or money transfers.
It's just tech debt. If you've ever been to China, it's crazy how well WeChat works. There's just nothing like it in the EU or the US because we have 20 different decades old things for everything that people can't or won't replace.
When I was deciding on where to immigrate to in 2006, I had job offers from the UK, Canada and the USA.
I ruled out the USA because of the CFAA. I do not like second-guessing whether everything I do online violates a badly written federal law. Of course, United States v. Elcom Ltd. didn't make me much happier. Three Felonies A Day came out later and it just strengthens all this. We should also mention the almost complete lack of a social safety net and the insane health care system.
A friend has one of his jobs with a telecom company. Everything there is political. He bets that the day the outage happened, as many devs as possible would have tried to call in sick as that is what happens at his firm (Telus/Bell).
I would have. I call in sick during prod failures at one of my gov jobs.
In jobs with no real rewards for excellence but decent job security, a crisis is a reason to get out the door.
We have this where I work and I have mixed feelings about it. On one hand, it's certainly freeing that we don't get fired for mistakes. On another, it's strange to feel that the people who objectively caused the headache for everyone else go without serious attention to remediation.
IMO, we don't have to name names, but Root Cause Analysis should include the decision tree that the primary motivator chose. No blame culture also hampers a lot of insider-threat analysis.
I think the idea behind no-blame culture is that a single person shouldn't be able to cause a massive incident on their own (assuming no malice), so finding a victim to put the blame on isn't helpful anyway. Also, if you start blaming people, everyone will go into permanent cover-my-ass mode, and this will usually just lead to some poor junior, who couldn't play the game yet, getting the full blow.
Of course, letting people run rampant and acting as if everything is fine is not great, but there's a good chance it might be an organizational problem and, if not, an HR problem (which is not necessarily related to a specific incident).
I think the idea behind no-blame is that there are basically three ways that an outage could be the fault of an individual: maliciousness, incompetence, or bad process. In the first two cases, the individual should be fired, and in the third case, blaming the individual is counterproductive, as it not only doesn't do anything to mitigate the issue but also distracts from the real cause. The only gray area I can think of is where it's not clear from a single incident whether an individual is actually incompetent or if they're "borderline competent" (maybe more competent at other aspects of the job role, and it's a balance to figure out whether their net value is worth keeping around), but in that case, I think that's also probably better dealt with at the management level and kept out of the root cause analysis.
> IMO, we don't have to name names, but Root Cause Analysis should include the decision tree that the primary motivator chose.
I definitely agree with this! I actually have seen cases where people have asked for revisions to a root cause analysis not due to a lack of naming individuals but a lack of sufficient explanation for what the anonymous individual actually did to cause the outage, which I think is fair feedback to give.
> No blame culture also hampers a lot of insider-threat analysis.
Yeah, if no-blame culture is being used as a shield for maliciousness or incompetence, I think that could have a harmful effect on things. I guess I always just interpreted "no-blame" as actually meaning "no blame for honest mistakes", but I guess that's too long to be catchy.
If incompetent people got into a position where they can cause a production outage, whose fault is that? Sure, they should be moved / have privileges stripped -- consequences are fine -- but once again, in my opinion, aside from maliciousness an outage is always an organizational mistake.
I have twenty years of experience in the framework we work with and I can't roll a production hotfix without QA approval.
What this comes down to is: if prod goes down, I should be able to focus on getting it back up, not worrying about whether I have a job tomorrow. That worry is not conducive to success.
No-blame culture and no-consequence culture are not the same thing. Just because you don’t get fired doesn’t mean the org should allow someone the capability to make the same mistake, whether that’s access/permissions/change in org etc
No-blame works without exceptions when the team makes it impossible to screw up big (test coverage, multiple reviews, modularity, canaries, DR strategy, etc.).
If you do retros and the same thing keeps coming up and it's always coming from the same person, then the manager will have to solve that.