It's a fantastic talk that taught me several important principles about measuring performance.
Gil Tene also developed HdrHistogram, which has been ported to a bunch of languages, and is a great, lightweight way to collect accurate performance histograms: http://hdrhistogram.org/.
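For reference, here is a minimal sketch of recording and querying latencies with the Python port (the `hdrhistogram` package on PyPI); the constructor and method names are assumed from that port:

```python
# Minimal sketch assuming the Python port of HdrHistogram
# (pip install hdrhistogram); method names come from that port.
from hdrh.histogram import HdrHistogram

# Track values from 1 us up to 1 hour (stored here as microseconds),
# with 3 significant digits of precision.
hist = HdrHistogram(1, 60 * 60 * 1_000_000, 3)

for latency_us in (120, 250, 3_100, 98, 47_000):   # example observations
    hist.record_value(latency_us)

print("p50:", hist.get_value_at_percentile(50))
print("p99:", hist.get_value_at_percentile(99))
print("max:", hist.get_max_value())
```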
> A common pattern in these systems is that there's some frontend, which could be a service or some Javascript or an app, which calls a number of backend services to do what it needs to do.
I think an important idea here is that you should be trying to measure the experience of a user (or as close to it as possible). If there is a slow service somewhere in your stack but it has no impact on user experience, then who cares? Conversely, if users are complaining that the app feels sluggish, then it doesn't matter if all your graphs say that everything is OK.
I find it helpful to split up graphs/monitoring into two categories: 1) if these graphs look fine then the service is probably fine, and 2) if problems are being reported then these graphs might give an insight into why things are going wibbly. In general, we alert on the former and diagnose with the latter. Of course, it's nigh on impossible to get perfect metrics that track actual user experience, but we've definitely found it worthwhile to try to get as close as possible.
---
Another fun problem with summary statistics is that they can easily "lie" when the API does a variable amount of work. For example, if you have a "get updates" API that is called regularly to see what changed since the last call, you end up with two modes: 1) a small amount of time since the last call, so the call is super fast, and 2) a large amount of time since the last call, so the call is slow. In any given time period the vast majority of calls are going to be super quick, but every user hits the slow case when they open the app for the first time each day. The result is summary statistics that all but ignore those slow API calls when opening the app.
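A toy simulation (all numbers invented) shows how the per-call percentiles can look healthy even though every user hits the slow mode once a day:

```python
import random

random.seed(1)

# Toy model of a "get updates" API: routine polls are fast, but each user's
# first call of the day has a day's worth of updates to fetch and is slow.
latencies_ms = []
slow_calls = 0
for user in range(1_000):
    calls_today = random.randint(20, 60)
    for i in range(calls_today):
        if i == 0:                                    # first open of the day
            latencies_ms.append(random.gauss(2_500, 400))
            slow_calls += 1
        else:                                         # routine poll
            latencies_ms.append(random.gauss(60, 15))

latencies_ms.sort()
p50 = latencies_ms[int(0.50 * len(latencies_ms))]
p90 = latencies_ms[int(0.90 * len(latencies_ms))]
print(f"p50={p50:.0f}ms p90={p90:.0f}ms slow calls={slow_calls / len(latencies_ms):.1%}")
# p50 and p90 look fine, yet 100% of users hit the slow mode today.
```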
The flip side is that errors often pollute service latency statistics. If your service is capable of serving a fast failure, for example by returning 503 instantly for all requests when it is overloaded, you need another dimension in your statistics to handle that.
Another point often missed is the diagnostic value of tail measurements. One of the first things I do at any job is replace the 90th percentile with the maximum in all plots.
Sure, it gets messier, and definitely less visually appealing, but the reaction by others has uniformly been "Did we have this data available all along and just never showed it?!"
It's also worth mentioning that even in a system where technically tail latencies aren't a big problem, psychologically they are. If you visit a site 20 times and just one of those are slow, you're likely to associate it mentally with "slow site" rather than "fast site".
My experience has always been people dismissing it when I show them the worst case. "Well, yeah, but that's timeouts downstream and etc etc and so of course it's hitting worst case". They then never have an answer when I ask "How do you know that?" or, the times it has happened, "How do you explain it taking longer than the configured request timeout?"
Generally I want 90th, 99th, and max on the same graph/plot. Otherwise there's less sense of the overall distribution.
90th percentile is the fast path. You notice overall effects of growth, load, regressions, etc. 99th percentile (or sometimes 99.9th) is often the actual SLA/SLO so it should be on the plot. Max, as you point out, is necessary to actually see the long tail.
The other thing that's important is to make sure the timeseries and plotting software isn't averaging the max value; it's easy to miss spikes of latency because someone thought the graphs looked nicer with avg(1 min) vs max(1 min) or a tsdb is automatically rewindowing/aggregating with the wrong function. To a lesser extent even percentile buckets can be deceiving if they're too big. 99th percentile with 5-min buckets can miss significant but brief (~3 second) degradations.
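As a toy illustration (numbers made up), re-windowing per-minute maxima with avg instead of max smears a 3-second spike into the noise:

```python
# Per-minute maxima with a single 3-second spike in minute 7 (made-up data).
per_minute_max_ms = [110, 95, 120, 105, 130,    # minutes 0-4
                     100, 115, 3000, 125, 90]   # minutes 5-9

def rewindow(values, size, agg):
    """Re-aggregate fixed-size windows with the given function."""
    return [agg(values[i:i + size]) for i in range(0, len(values), size)]

print("max(5 min):", rewindow(per_minute_max_ms, 5, max))
print("avg(5 min):", rewindow(per_minute_max_ms, 5, lambda w: sum(w) / len(w)))
# max(5 min): [130, 3000]     -> the spike stays visible
# avg(5 min): [112.0, 686.0]  -> the spike is smeared into the average
```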
> Max, as you point out, is necessary to actually see the long tail.
What a strange way to put it. No one wants to see the long tail. The long tail is the means, not the end.
> To a lesser extent even percentile buckets can be deceiving if they're too big. 99th percentile with 5-min buckets can miss significant but brief (~3 second) degradations.
Now that's the actual reason to care about max. Otherwise you have values (and often negative user experiences) getting lost in the aggregation.
> One of the first things I do at any job is replace the 90th percentile with the maximum in all plots.
Really? I find percentiles to be more informative. Looking at the maximum is like looking at an error log: it basically throws out all performance data except data from a tiny slice of time. A bad maximum shows you that a service failed at least once, but you often already knew or expected that. A bad 90th percentile tells you that a lot of users are experiencing poor performance.
With fat-tailed distributions, which is what our latencies tend to look like, the signal is in the tail. The central/common observations are just noise and tell you little to nothing about how the system performs.
That said, I do look at percentiles in addition to the max. But I find the 99th, 99.9th, and the max together tells me a lot more about the system performance than the uninformative stuff at the 90th and below.
min(max_time, const_reasonable_max) is also a good graph if your software supports it. It stops outliers from polluting the view. After all, your users will leave or refresh after a few seconds, so all that matters is that the response took longer than a minute; it's irrelevant whether it took 20 minutes.
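A quick sketch of the clamping idea, with an arbitrary 5-second cap chosen purely for illustration:

```python
# Clamp each window's max before plotting so one 20-minute outlier doesn't
# flatten the rest of the graph. The 5-second cap is an arbitrary choice.
CAP_MS = 5_000

window_max_ms = [180, 240, 1_200_000, 310, 4_800]    # one 20-minute outlier
clamped = [min(m, CAP_MS) for m in window_max_ms]
print(clamped)    # [180, 240, 5000, 310, 4800]
```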
Due to the systems in question themselves timing out relatively soon, this is what I in practice end up looking at anyway.
It's a good point, even though it cuts both ways: given some assumptions about the tail behaviour of latencies, the 20 minute extreme event is a treasure trove for estimating the probabilities of smaller tail events.
For most of the services I've looked this closely at, maximum would be a proxy for load. Or in the case of a benchmark suite you'd expect max to increase with the number of iterations. What do you find the maximum useful for?
This is why I don't like require.js: one script requires this script, which requires that script. If there is one hiccup somewhere down the line, the whole page has to wait. One of my clients had their website made and wondered why it was so slow (the designers said it needed a faster server), but I found out it was requesting hundreds of .js files in roughly 10 waves, causing the whole page to take up to 10 seconds to load completely.
It sounds like you and your clients need to learn more about modern JS bundling techniques. Making a waterfall of requests can be okay in development but should never ship to production.
A lovely thing about tail latency in the chains the post talks about is how one slow service can cascade. Especially in serial chains, when one component is slow, the rest are waiting, in the meantime holding resources like memory, sockets, and CPU cycles that could be used to service other requests. In the worst cases, those other services start responding slowly to other requests, resulting in further degradation.
Having circuit breakers and carefully tuning timeouts can help.
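For illustration, a minimal circuit-breaker sketch (not any particular library's API): after a number of consecutive failures it fails fast for a cool-down period instead of letting callers queue up behind a slow dependency.

```python
import time

class CircuitBreaker:
    """Minimal sketch: open after max_failures consecutive errors, fail fast
    while open, and let one probe call through after the cool-down."""

    def __init__(self, max_failures=5, reset_after_s=30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None      # half-open: allow one probe call
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```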
I'd take an 'unavailable' lasting for a short time while connections exponentially back off with jitter over minutes or hours of degraded service for everyone.
This is the #1 thing that people get wrong. It's something that otherwise smart software engineers get wrong because they don't have enough of a background in data analytics.
The problem has two parts. One part is that once you reduce an observation to summary statistics, you can't go back. The other part is that web services usually generate too much data if you don't summarize.
I'd still try to store each event with all associated measurements as one item, using either a dedicated time-series DB like InfluxDB or Timescale, or even something like Elasticsearch. If one event is one HTTP request and your service receives a million visits every day, you still only have to store one million events per day, which is about 12 events per second.
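Roughly the shape of such a raw event (field names and the write call are hypothetical placeholders; the point is one record per request rather than a pre-aggregated statistic):

```python
# One raw event per HTTP request; field names are hypothetical placeholders.
event = {
    "timestamp": "2024-01-15T09:30:12.481Z",
    "route": "/api/updates",
    "status": 200,
    "latency_ms": 87.3,
    "backend_calls": 14,
    "region": "eu-west-1",
}
# tsdb_client.write("http_requests", event)   # hypothetical write call
```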
I’m used to dealing with much larger numbers; maybe that’s the difference. 12 HTTP requests per second is something I can handle with a ten-year-old laptop on a home DSL connection.
Typically you also want to record backend information. One HTTP request at the front end might be twenty or a hundred backend requests.
Can’t think of a way to reduce this to guidelines. I’ve seen a bunch of different errors. Mostly it’s a bad understanding of the math, or a bad understanding of what you’d want to do with the collected data.
Just one example that comes to mind: suppose you want average (mean) latency. So you record, for every minute, the average latency for requests during that minute. Now you want to calculate the average latency for an entire day… but it’s impossible. You can calculate the daily average of the per-minute averages, but you’re averaging over minutes, rather than averaging over requests.
This is just a simple example, but you can see how a seemingly innocuous decision sabotaged your ability to run the query you want.
Average is actually one of the "nicer" summary statistics to work with because you can recalculate it over a different aggregation level if you kept the volume. It's statistics like median, percentile that you have to worry about. In my experience, people over-rely on averages just because they're easy to re-aggregate when they really should be using something else.
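A small worked example (invented numbers) of why keeping the volume matters:

```python
# Per-minute averages can't be averaged directly; keeping the request count
# per minute lets you recompute the true per-request average. Invented data.
minutes = [
    {"count": 1000, "avg_ms": 50},    # busy minute, fast
    {"count": 10,   "avg_ms": 900},   # quiet minute, slow
]

naive = sum(m["avg_ms"] for m in minutes) / len(minutes)
weighted = (sum(m["count"] * m["avg_ms"] for m in minutes)
            / sum(m["count"] for m in minutes))

print(f"average of per-minute averages: {naive:.1f} ms")     # 475.0
print(f"true per-request average:       {weighted:.1f} ms")  # ~58.4
```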
Yeah, median and percentiles are awful. You usually want them only at the very top-end, as the final summary of some system. Like, “page renders within 500ms for 90% of users”.
So, well meaning engineers will collect percentiles on individual components in the system, with the idea that if you want percentiles at the end, just collect percentiles everywhere. You’re left with garbage data that can’t be aggregated in a way that makes any sense.
That's not at all true. Any meaningful aggregation requires storing the empirical distribution of data for individual components. A percentile histogram is an excellent approximation to this, when done properly.
Histograms can be added and aggregated. You can always estimate the mean from a histogram.
So if you compute a histogram of latency values every minute, you can add them up for successive minutes to get the distribution over larger intervals.
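A sketch of that merging with plain bucket counts (bucket upper bound in ms mapped to count; bucket boundaries and counts invented):

```python
from collections import Counter

# Per-minute latency histograms: bucket upper bound (ms) -> request count.
minute_1 = Counter({100: 950, 500: 40, 2000: 10})
minute_2 = Counter({100: 700, 500: 250, 2000: 50})

# Merging is just per-bucket addition, so per-minute histograms roll up
# losslessly into hourly or daily ones.
hour = minute_1 + minute_2
print(hour)    # Counter({100: 1650, 500: 290, 2000: 60})

# Quantiles and means can then be estimated from the merged buckets,
# which is exactly what you can't do if you only kept each minute's p90.
```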
I think you actually agree with what I'm saying... which is that discarding the histogram and e.g. just collecting the 90%ile will leave you unable to do the analysis you want.
Latencies are patently not normally distributed. I could offer you 100:1 odds on that bet.
But the problem with averaging averages is that it makes no sense mathematically. The average (well, the arithmetic mean, to be technical) has a well-defined meaning and interpretation in terms of the original data points. When you start averaging averages (or, dear Lord, percentiles!) you lose the last glimmer of interpretability you had.
They're Poisson distributed, by construction. Queuing service time is actually the go-to example when first introducing the Poisson distribution in an intro statistics class. No need to make a bet. Anything that can't be negative isn't normally distributed.
The bigger issue here, aside from the fact that you may be taking a simple average when you need a weighted average to aggregate your individual point estimates (as others have pointed out), is that a sufficiently symmetric Poisson distribution can be approximated by a normal distribution, but response latency is not symmetric. Time travel isn't possible, so you can't ever get negative response times to balance out the tail events on the right side of the distribution. Tail events with respect to latency can only be bad, not good. They're a bet with a hard cap on winnings but potentially unbounded losses. As such, you really need to consider what the worst case is and the expected frequency of that worst case.
Service latencies are typically not Poisson distributed, partially because their queuing systems are non-trivial, but mostly because there's typically a lot more going on than just queues.
Thank you, that is quite interesting! As you can see, even though I have an applied math degree, I still can't just analytically infer a distribution. You need empirical results. This reminds me of the Jellyfish network topology for data centers (https://www.researchgate.net/publication/51943691_Jellyfish_...), where they found completely random route selection to be faster and more scalable than any other existing solution.
I suspect, naively, that latency purely from network appliance queueing probably gives a mixture of Poissons in a straight line bus topology. But add in arbitrary routing and the much more complicated schedulers of the servers themselves, and you get no well-behaved distribution at all.
It's especially interesting to compare the SLAs given by data center providers that tend to always use 99th or even 99.9th percentile worst case performance, with application metrics using arithmetic averages.
If all you have is 50 averages, there is no way to combine them to get the average of the underlying data, because you don't have the original "length(x)".
A popular way to do it wrong is to use summary statistics that regularly emit a quantile, for example collecting the p90 from every server every minute. That discards far too much information. Instead, collect the CDF of service times. CDFs can be combined in a lossless way, and they carry their own denominators. Basically, if you are using the typical Prometheus setup, you should be getting the right data.
To go to the next level I would suggest also emitting the sum and sum of squares, which will make your summary statistic far more useful.
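A sketch of why count, sum, and sum of squares are enough to merge exactly and recover the mean and spread (server names and numbers invented; for production you would want a numerically stabler variance formula):

```python
import math

def merge(parts):
    """Combine per-server (count, sum, sum of squares) and recover
    the mean and standard deviation of the pooled data."""
    n = sum(p["n"] for p in parts)
    s = sum(p["sum"] for p in parts)
    s2 = sum(p["sum_sq"] for p in parts)
    mean = s / n
    variance = s2 / n - mean ** 2       # population variance from raw moments
    return mean, math.sqrt(max(variance, 0.0))

server_a = {"n": 1000, "sum": 52_000.0, "sum_sq": 3_000_000.0}
server_b = {"n": 400,  "sum": 90_000.0, "sum_sq": 25_000_000.0}
print(merge([server_a, server_b]))      # pooled mean and standard deviation
```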
> A popular way to do it wrong is to use summary statistics that regularly emit a quantile, for example collecting the p90 from every server every minute. That discards far too much information. Instead, collect the CDF of service times. CDFs can be combined in a lossless way, and they carry their own denominators. Basically, if you are using the typical Prometheus setup, you should be getting the right data.
And choose the right quantization for histograms/CDFs.
> To go to the next level I would suggest also emitting the sum and sum of squares, which will make your summary statistic far more useful.
Histograms should also give you the sum. I haven't seen sum of squares in many monitoring setups; is this for underlying metrics that are differences/errors?
The general guideline is to store enough data that all measures (mean, quantile, min, max) compose across different queries, by ensuring that measure(Xs) = measure(map(measure, arbitrary_partition(Xs))) for any set of metrics Xs and partitioning function. In other words, no matter how you measure distinct subsets of an original metric, the measure of all those subsets gives the same result as measuring the whole set. This is only possible if enough data is stored in the metric about the underlying values in the set.
This allows aggregation for efficiency, usually either by partitioning the data by short time windows (1 second or more usually) and storing measures over each duration or by storing a sampled subset of original values, both of which can later be queried to produce measures over arbitrary periods of time.
A few measures can be supported exactly with pre-aggregation (min, max, mean, variance), some can be supported with bounded error (quantiles) or statistical error (pretty much any measure over sampled values).
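A minimal sketch of that partition-invariance property for measures built from mergeable state (count, sum, min, max):

```python
from dataclasses import dataclass

@dataclass
class Agg:
    """Mergeable aggregate: enough state to recompute mean, min, and max."""
    n: int
    total: float
    lo: float
    hi: float

    @classmethod
    def of(cls, xs):
        return cls(len(xs), sum(xs), min(xs), max(xs))

    def merge(self, other):
        return Agg(self.n + other.n, self.total + other.total,
                   min(self.lo, other.lo), max(self.hi, other.hi))

xs = [12.0, 3.5, 7.25, 44.0, 1.0, 18.5]
whole = Agg.of(xs)
parts = Agg.of(xs[:2]).merge(Agg.of(xs[2:4])).merge(Agg.of(xs[4:]))
assert whole == parts    # same measure no matter how the set was partitioned
print(whole.total / whole.n, whole.lo, whole.hi)
```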
This reminded me of the similar cascading effect you get with availability across the aggregate of many services as discussed in "The Calculus of Service Availability": https://queue.acm.org/detail.cfm?id=3096459
One complaint here is that serial/parallel is not the way to think about most modern architectures. In modern architectures you are typically working with decoupled event buses which invert the relationship with dependencies. In this case, you become resilient to many negative impacts of tail latency as you're inherently eventually consistent.
I don’t know that I’d call this modern so much as just a different school of thought.
Using streaming/event buses and getting rid of the chain of calls in front of the initial request definitely has upsides. Syncing data all over the place, with consumers forced to interpret data from outside their domain, has large downsides too, though.
[0] https://www.infoq.com/presentations/latency-response-time/