Question to Google employees: Why do you guys suffer global outages? This is you...

illumin8 · on Nov 12, 2018

Google has a global SDN (software-defined network) that gives them some unique and beneficial capabilities, like being able to onboard traffic in the closest CDN POP and letting it ride over the Google backbone to the region your systems are running in.

The problem is that running a global SDN like this means if you do something wrong, you can have outages that impact multiple regions simultaneously.

This is why AWS has strict regional isolation and will never create cross-region dependencies (outside of some truly global services like IAM and Route 53 that have sufficient redundancy that they should (hopefully) never go down).

Disclaimer: I work for AWS, but my opinions are my own.

tehlike · on Nov 11, 2018

2 outage in 5 years sounds pretty low, to be honest.

Disclaimer: google employee in ads, who worked on many many fires throughout the years, but talking from my personal perspective and not from my employer. I am sure we are striving to have 0, but realistically, i have seen many that says things happen. Learn, and improve.

noselasd · on Nov 11, 2018

The issue people have with it is that it's global, not regional, indicating that there are dependencies in the entire architecture that people does not expect to be there.

B-Con · on Nov 12, 2018

There are many other possible causes for global outages, that specific one is not high on my list of likely culprits.

blazespin · on Nov 11, 2018

Yes, hello, canaries anyone

tehlike · on Nov 11, 2018

Plenty of bugs happen despite canaries.

hexchain · on Nov 12, 2018

Just like YouTube a few weeks ago?

tehlike · on Nov 12, 2018

Yt, ads, a bunch of services. Sure.

yeukhon · on Nov 11, 2018

5 years? I remember a major outage maybe in the past year.

kenhwang · on Nov 11, 2018

I believe there was a multi-hour global YouTube/Bigtable/Cloud SQL/Datastore outage in October.

Then there was the global load balancer outage in July.

Looking though the incident history, there were essentially monthly multi-region or global service disruptions of various services.

origami777 · on Nov 12, 2018

Most feature rich cloud? I think that title belongs to AWS.

jkaplowitz · on Nov 12, 2018

You're right in terms of breadth officially covered. But if you look at the features where they both officially have support, there are many examples where the GCP version is more reliable and usable than the AWS version. Even GKE is an example of this, despite the outage in node pool creation that we're discussing here. Way better than EKS.

(Disclosure: I worked for Google, including GCP, for a few years ending in 2015. I don't work or speak for them now and have no inside info on this outage.)

illumin8 · on Nov 12, 2018

I think you're going to have to back up a claim like this with some facts.

GKE being the exception, since it was launched a couple years before EKS. AWS clearly has way more services, and the features are way deeper than GCP.

Just compare virtual machines and managed databases, AWS has about 2-3x more types of VMs (VMs with more than 4TB of RAM, FPGAs, AMD Epyc, etc.), and in databases, more than just MySQL and PostgreSQL. When you start looking at features you get features that you just can't get in GCP, like 16 read-replicas, point in time recovery, backtrack, etc.

Disclaimer: I work for AWS but my opinions are my own.

jkaplowitz · on Nov 12, 2018

Each platform has features the other platform doesn't, even though AWS has more.

Some of GCP's unique compelling features include live VM migration that makes it less relevant when a host has to reboot, the new life that has recently been put into Google App Engine (both flexible environment and the second generation standard environment runtimes), the global load balancer with a single IP and no pre-warming, and Cloud Spanner.

In terms of feature coverage breadth I started my previous comment by agreeing that AWS was ahead, and I still reaffirm that. But if you randomly select a feature that they both have to a level which purports to meet a given customer requirement, the GCP offering will frequently have advantages over the AWS equivalent.

Examples besides GKE: BigQuery is better regarded than Amazon Redshift, with less maintenance hassle. And EC2 instance, disk, and network performance is way more variable than GCE which generally delivers what it promises.

One bit of praise for AWS: when Amazon does document something, the doc is easier to find and understand, and one is less likely to find something out of date in a way that doesn't work. But GCP is more likely to have documented the thing in the first place, especially in the case of system-imposed limits.

To be clear, I want there to be three or four competitive and widely used cloud options. I just think GCP is now often the best of the major players in the cases where its scope meets customer needs.

illumin8 · on Nov 12, 2018

Redshift is not a direct competitor with BigQuery. It's a relational data warehouse. BigQuery more directly competes with Athena, which is a managed version of Apache Presto, and my personal opinion is that Athena is way better than BigQuery because I can query data that is in S3 (object storage) without having to import it into BigQuery first.

Disk and network performance is extremely consistent with AWS so long as you use newer instance types and storage types. You can't reasonably compare the old EBS magnetic storage to the newer general purpose SSD and provisioned IOPS volume types, and likewise, newer instances get consistent non-blocking 25gbps network performance.

I'm not so sure I would praise our documentation; it is one of the areas that I wish we were better at. Some of the less used services and features don't have excellent documentation, and in some cases you really have to figure it out on your own.

GCP is a pretty nice system overall, but most of the time when I see comparisons, when GCP looks better its because the person making the comparison is comparing the AWS they remember from 5-6 years ago with the GCP of today, which would be like comparing GAE from 2012 with today.

jkaplowitz · on Nov 12, 2018

The comments I made about Redshift vs BigQuery and about disk/network/etc reflect current opinions of colleagues who use AWS currently (or recently in some cases) and extensively, not 5-6 year old opinions. Even my own last use of AWS was maybe 2-3 years ago, when Redshift was AWS's closest competitor to BigQuery and when I saw disk/network issues directly.

You're right that Athena seems like the current competitor to BigQuery. This is one of those things that are easy to overlook when people made the comparison as recently as a couple of years ago (before Athena was introduced) and Redshift vs BigQuery is still often the comparison people make. This is where Amazon's branding is confusing to the customer: so many similar but slightly different product niches, filled at different times by entirely different products with entirely unrelated names.

When adding features, GCP would usually fill adjacent niches like "serverless Redshift" by adding a serverless mode to Redshift, or something like that, and behavior would be mostly similar. Harder to overlook and less risky to try.

Meanwhile, when Athena was introduced, people who had compared Redshift and BigQuery and ruled out the former as too much hassle said "ah, GCP made Amazon introduce a serverless Redshift. But it's built on totally different technology. I wonder if it will be one of the good AWS products instead of the bad ones." (Yes, bad ones exist. Amazon WorkMail is under the AWS umbrella but basically ignored, to give one example.)

And then they go back to the rest of their day, since moving products (whether from Redshift or BigQuery) to Athena would not be worth the transition cost, and forget about Athena entirely.

On the disk/network question, no I didn't see performance problems with provisioned IOPS volume types, but that doesn't matter: for GCE's equivalent of EBS magnetic storage, they do indeed give what they promise, at way less cost than their premium disk types. There's no reason it isn't a fair comparison.

And for the "instance" part of my EC2 performance comment, I was referring to a noisy neighbor problem where sometimes a newly created instance would have much worse CPU performance than promised and so sometimes delete and recreate was the solution. GCE does a much better job at ensuring the promised CPUs.

I'm glad AWS and GCP have lots of features, improve all the time, and copy each other when warranted. But I don't think the general thrust of my comparison has gone invalid, even if my recent data is more skewed toward GCP and my AWS data is skewed toward 2-3 years old. Only the specifics have changed (and the feature gap narrowed with respect to important features).

electrum · on Nov 14, 2018

Presto is not an Apache project (although it is open source under the Apache License).

hacknat · on Nov 12, 2018

Yeah. Perhaps feature rich was an overstatement. I meant that when GCP does do a product it works like I’d expect it to work and has the features I need. Not always the case with a AWS, particularly around ELBs and VPCs.

_wmd · on Nov 11, 2018

It is a natural effect of building massive yet flat homogeneous systems, failures tend to be greatly amplified.

Most of what you can read of Google's approach will teach you their ideal computing environment is a single planetary resource, pushing any natural segmentation and partitioning out of view.

toomuchtodo · on Nov 11, 2018

> I’m sorry to say this, but it is the equivalent of going bankrupt from a trust perspective.

It's the opposite really: the expectation that service providers have no unexpected downtime is unrealistic, and it's strange this idea persists.

Twirrim · on Nov 11, 2018

(disclaimer: I work for another cloud provider)

I agree, in general, outages are almost inevitable, but global outages shouldn't occur. It suggests at least a couple of things:

1) Bad software deployments, without proper validation. A message elsewhere in this post on HN suggest that problems have been occurring for at least 5 days, which makes me think this is the most likely situation. If this is the case, presumably given this is multiple days in to the issue, rolling back isn't an option. That doesn't say good things about their testing or deployment stories, and possibly their monitoring of the product? Even if the deployment validation processes failed to catch it, you'd really hope alarming would have caught it.

or:

2) Regions aren't isolated from each other. Cross-region dependencies are bad, for all sorts of obvious reasons.

toomuchtodo · on Nov 11, 2018

That shouldn't, but they do. S3 goes down [1]. The AWS global console goes down, right after Prime Day outages [2]. Lots of Google Cloud services go down [3, current thread]. Tens of Azure services go down hard [4].

Are software development and release processes improving to mitigate these outages? We don't know. You have to trust the marketing. Will regions ever be fully isolated? We don't know. Will AWS IAM and console ever not be global services? We don't know.

Blah blah blah "We'll do better in the future". Right. Sure. Some service credits will get handed out and everyone will forget until the next outage.

Disclaimer: Not a software engineer, but have worked in ops most of my career. You will have downtime, I assure you. It is unavoidable, even at global scale. You will never abstract and silo everything per region.

[1] https://www.theregister.co.uk/2017/03/01/aws_s3_outage/

[2] https://www.cnbc.com/2018/07/16/aws-hits-snag-after-amazon-p...

[3] https://www.cnet.com/news/google-cloud-issues-causes-outages...

[4] https://www.datacenterknowledge.com/uptime/microsoft-blames-...

ignoramous · on Nov 12, 2018

Can't speak for Google, but Facebook and Salesforce chose Cells for HA.

http://highscalability.com/blog/2012/5/9/cell-architectures....

toomuchtodo · on Nov 12, 2018

Doesn't look like it was all that helpful to Facebook (as of 1542038976). Facebook.com errors out currently.

> Facebook Platform Appears to be down

> A check of https://developers.facebook.com/status/dashboard/ returns an error and I'm unable to login with facebook to some of my mobile apps.

https://news.ycombinator.com/item?id=18434262

doodliego · on Nov 12, 2018

Look how frequent and detailed Amazon's update logs are in that first Register article. Multiple updates throughout the day going into some detail.

Latteland · on Nov 12, 2018

When I was at google, the big outages were almost always bad routing to the service - it was never that the service couldn't handle the load, and bad service instances were kind of hidden. my service did have some problems on new releases, but because we had multiple instances we could just redirect traffic to the instances we hadn't updated, so they stayed up.

manigandham · on Nov 11, 2018

The major issue is that outages are global instead of regional, effectively making it impossible to design around using the typical region/zone redundancy.

tw04 · on Nov 12, 2018

Because they sell themselves as being far more reliable than internal IT. If they weren't selling on uptime, people probably wouldn't be quite so critical of downtime.

toomuchtodo · on Nov 12, 2018

As a technology practitioner, it is your failing if you believe them.

tw04 · on Nov 12, 2018

Let me know the next time you hear about the CIO of a fortune 500 asking his technology practitioners to validate what he read in Gartner and heard from Diane Greene.

toomuchtodo · on Nov 12, 2018

My advice would be to find opportunities to get paid to tell people the right answer, not to implement the wrong answer against your better judgement. Hot job market right now, more jobs than talent, all that jazz.

If you're stuck implementing a suboptimal solution, that's not your fault, and not the intent of my above comment.

engineeringwoke · on Nov 12, 2018

Lots of wisdom in this comment

rodgerd · on Nov 12, 2018

The pitch from cloud vendors always includes the idea that the cloud is more reliable than any in-house shop can achieve. So the expectation is set by the vendors.

chumboslice · on Nov 12, 2018

2 outages in 5 years.

5. Years.

Nothing to see here, move along.

talonx · on Nov 12, 2018

2 "global" outages. If it had been limited to a service, or a region, there would be nothing to see.

dvdgsng · on Nov 12, 2018

it was limited to GKE, wasn't it?

talonx · on Nov 12, 2018

Global here refers to the geographical spread of the service, GKE in this case, measured in regions, not the number of services.

Edit: I saw your point a bit late. It was limited to GKE, which makes my initial comment about "service" incorrect, and it was global, which keeps my comment about "region" correct. On a related note, an SRE from GKE posted on Slack that GCE was out of resources and so GKE faced resource exhaustion as well [1][2] - so it _might_ have been a multi-service outage.

1.https://googlecloud-community.slack.com/messages/C0B9GKTKJ/c...

2. https://googlecloud-community.slack.com/archives/C0B9GKTKJ/p...

hartem_ · on Nov 11, 2018

I’d be curious to know what alternatives are you considering at this point?

hacknat · on Nov 12, 2018

Azure and AWS.

tomcam · on Nov 12, 2018

I believe this is a fair question. I’d really like to understand what Google thinks about this.

asdfasgasdgasdg · on Nov 12, 2018

Not to minimize here (well, yes, a little), but this was a UI-only outage, from what I can tell. You could still create the pools from the command-line. It doesn't seem unreasonable to have a single, global UI server, as long as the API gateway is distributed and not subject to global outages.

rlancer · on Nov 12, 2018

Was certainly not UI only

asdfasgasdgasdg · on Nov 12, 2018

OK. Perhaps I misunderstood. In the status page, it says:

Affected customers can use gcloud command [1] in order to create new Node Pools. [1] https://cloud.google.com/sdk/gcloud/reference/container/node...

That led me to believe that only the web UI was affected.