This is the #1 thing that people get wrong. It's something that otherwise smart software engineers get wrong because they don't have enough of a background in data analytics.
The problem has two parts. One part is that once you reduce an observation to summary statistics, you can't go back. The other part is that web services usually generate too much data if you don't summarize.
I'd still try to store each event with all associated measurements as one item, using either a dedicated time-series DB like InfluxDB or Timescale, or even something like Elasticsearch. If one event is one HTTP request and your service receives a million visits every day, you still only have to store one million events per day, which is about 12 events per second.
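Roughly what I mean by "one event as one item", as a sketch (the field names here are made up, not a real schema):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical event record: one HTTP request stored whole, so any
# statistic can still be computed later from the raw measurements.
@dataclass
class RequestEvent:
    timestamp: datetime
    endpoint: str
    status_code: int
    latency_ms: float
    backend_calls: int   # e.g. number of downstream requests fanned out
    bytes_sent: int

event = RequestEvent(
    timestamp=datetime.now(timezone.utc),
    endpoint="/checkout",
    status_code=200,
    latency_ms=37.4,
    backend_calls=23,
    bytes_sent=5120,
)
```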
I’m used to dealing with much larger numbers, maybe that’s the difference. 12 HTTP requests per second is something I can handle with a ten-year-old laptop on a home DSL connection.
Typically you also want to record backend information. One HTTP request at the front end might be twenty or a hundred backend requests.
Can’t think of a way to reduce this to guidelines. I’ve seen a bunch of different errors. Mostly it’s a bad understanding of the math, or a bad understanding of what you’d want to do with the collected data.
Just one example that comes to mind: suppose you want average (mean) latency. So you record, for every minute, the average latency for requests during that minute. Now you want to calculate the average latency for an entire day… but it’s impossible. You can calculate the daily average of the per-minute averages, but you’re averaging over minutes, rather than averaging over requests.
This is just a simple example, but you can see how a seemingly innocuous decision sabotaged your ability to run the query you want.
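To make it concrete with made-up numbers: the average of per-minute averages weights a quiet minute the same as a busy one, which is not the per-request average. You'd need each minute's request count to fix it:

```python
# Two made-up minutes: a quiet one and a busy one.
minutes = [
    {"count": 10,   "mean_latency_ms": 50.0},   # quiet minute
    {"count": 1000, "mean_latency_ms": 500.0},  # busy minute
]

# Naive "average of averages": weights both minutes equally.
naive = sum(m["mean_latency_ms"] for m in minutes) / len(minutes)

# Correct per-request mean: weight each minute's mean by its request count.
total_requests = sum(m["count"] for m in minutes)
weighted = sum(m["count"] * m["mean_latency_ms"] for m in minutes) / total_requests

print(naive)     # 275.0 ms
print(weighted)  # ~495.5 ms -- the busy minute dominates, as it should
```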
Average is actually one of the "nicer" summary statistics to work with, because you can re-aggregate it at a different level as long as you also kept the volume (the request count). It's statistics like the median and percentiles that you have to worry about. In my experience, people over-rely on averages just because they're easy to re-aggregate, when they really should be using something else.
Yeah, median and percentiles are awful. You usually want them only at the very top-end, as the final summary of some system. Like, “page renders within 500ms for 90% of users”.
So, well-meaning engineers will collect percentiles on individual components in the system, with the idea that if you want percentiles at the end, you should just collect percentiles everywhere. You're left with garbage data that can't be aggregated in any way that makes sense.
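To illustrate with synthetic numbers (just a sketch, not a real system): the average of per-server p90s is not the fleet-wide p90, and there's no way to recover the latter from the former.

```python
import random

random.seed(0)

def p90(xs):
    # Simple nearest-rank percentile; fine for an illustration.
    xs = sorted(xs)
    return xs[int(0.9 * (len(xs) - 1))]

# Synthetic latencies for two servers: one mostly fast, one mostly slow.
server_a = [random.expovariate(1 / 20) for _ in range(10_000)]   # ~20 ms mean
server_b = [random.expovariate(1 / 200) for _ in range(1_000)]   # ~200 ms mean

per_server_p90s = [p90(server_a), p90(server_b)]
avg_of_p90s = sum(per_server_p90s) / len(per_server_p90s)
true_p90 = p90(server_a + server_b)

print(avg_of_p90s)  # a blend of roughly ~46 and ~460 -- not a meaningful quantile
print(true_p90)     # the actual fleet-wide p90, computed from the raw values
```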
That's not at all true. Any meaningful aggregation requires storing the empirical distribution of data for individual components. A percentile histogram is an excellent approximation to this, when done properly.
Histograms can be added and aggregated. You can always estimate the mean from a histogram.
So if you compute a histogram of latency values every minute, you can add them up for successive minutes to get the distribution over larger intervals.
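Rough sketch of what that looks like, assuming all minutes share the same bucket bounds (the bounds here are made up):

```python
# Shared bucket upper bounds in ms (hypothetical; real systems choose these carefully).
BUCKETS = [10, 25, 50, 100, 250, 500, 1000]

def histogram(latencies_ms):
    """Count how many observations fall into each bucket (last bucket is overflow)."""
    counts = [0] * (len(BUCKETS) + 1)
    for x in latencies_ms:
        for i, upper in enumerate(BUCKETS):
            if x <= upper:
                counts[i] += 1
                break
        else:
            counts[-1] += 1
    return counts

def merge(h1, h2):
    """Histograms over the same buckets merge by element-wise addition."""
    return [a + b for a, b in zip(h1, h2)]

def approx_mean(counts):
    """Estimate the mean from bucket midpoints (crude, but illustrative)."""
    lowers = [0] + BUCKETS
    uppers = BUCKETS + [2 * BUCKETS[-1]]  # arbitrary cap for the overflow bucket
    mids = [(lo + hi) / 2 for lo, hi in zip(lowers, uppers)]
    total = sum(counts)
    return sum(m * c for m, c in zip(mids, counts)) / total if total else float("nan")

minute_1 = histogram([12, 18, 40, 90, 300])
minute_2 = histogram([8, 22, 35, 45, 700])
hour_so_far = merge(minute_1, minute_2)
print(approx_mean(hour_so_far))
```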
I think you actually agree with what I'm saying... which is that discarding the histogram and e.g. just collecting the 90%ile will leave you unable to do the analysis you want.
Latencies are patently not normally distributed. I could offer you 100:1 odds on that bet.
But the problem with averaging averages is that it makes no sense mathematically. The average (well, the arithmetic mean, to be technical) has a well-defined meaning and interpretation in terms of the original data points. When you start averaging averages (or, dear Lord, percentiles!) you lose the last glimmer of interpretability you had.
They're Poisson distributed, by construction. Queuing service time is actually the go-to example when first introducing the Poisson distribution in an intro statistics class. No need to make a bet. Anything that can't be negative isn't normally distributed.
The bigger issue here, aside from the fact that you may be taking a simple average when you need a weighted average to aggregate your individual point estimates (as others have pointed out), is asymmetry. You can approximate a sufficiently symmetric Poisson distribution with a normal distribution, but response latency is not symmetric: time travel isn't possible, so you can never get negative response times to balance out the tail events on the right side of the distribution. Tail events with respect to latency can only be bad, not good. They're a bet with a hard cap on winnings but potentially unbounded losses. As such, you really need to consider what the worst case is and the expected frequency of that worst case.
Service latencies are typically not Poisson distributed, partially because their queuing systems are non-trivial, but mostly because there's typically a lot more going on than just queues.
Thank you, that is quite interesting! As you can see, even though I have an applied math degree, I still can't just analytically infer a distribution. You need empirical results. This reminds me of the Jellyfish network topology for data centers (https://www.researchgate.net/publication/51943691_Jellyfish_...), where they found completely random route selection to be faster and more scalable than any other existing solution.
I suspect, naively, that latency purely from network appliance queueing probably gives a mixture of Poissons in a straight line bus topology. But add in arbitrary routing and the much more complicated schedulers of the servers themselves, and you get no well-behaved distribution at all.
It's especially interesting to compare the SLAs given by data center providers that tend to always use 99th or even 99.9th percentile worst case performance, with application metrics using arithmetic averages.
If all you have is 50 averages, there is no way to combine them to get the average of the underlying data, because you don't have the original "length(x)".
A popular way to do it wrong is to use summary statistics that regularly emit a quantile, for example don't collect the p90 from every server every minute. That discards far too much information. Instead, collect the CDF of service times. CDFs can be combined in a lossless way, and they carry their own denominators. Basically if you are using the typical Prometheus setup, you should be getting the right data.
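For illustration, a sketch of merging cumulative bucket counts ("observations <= bound") across two servers and reading a quantile off the merged counts. This is in the spirit of Prometheus-style histograms but not its actual API; the bounds and counts are made up:

```python
# Cumulative bucket counts, like a sampled CDF. Upper bounds are shared
# across servers; the last entry stands in for +Inf.
UPPER_BOUNDS = [50, 100, 250, 500, float("inf")]

# Hypothetical per-server cumulative counts for the same minute.
server_a = [700, 900, 980, 995, 1000]
server_b = [100, 300, 700, 950, 1000]

# Because the bounds match, cumulative counts merge by simple addition,
# and the last entry is the total observation count (the "denominator").
merged = [a + b for a, b in zip(server_a, server_b)]

def quantile(q, bounds, cumulative):
    """Estimate a quantile by linear interpolation inside the containing bucket."""
    total = cumulative[-1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in zip(bounds, cumulative):
        if rank <= count:
            if bound == float("inf"):
                return prev_bound  # can't interpolate into the overflow bucket
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + frac * (bound - prev_bound)
        prev_bound, prev_count = bound, count
    return prev_bound

print(quantile(0.9, UPPER_BOUNDS, merged))  # fleet-wide p90 estimate
```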
To go to the next level I would suggest also emitting the sum and sum of squares, which will make your summary statistic far more useful.
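Sketch of why count + sum + sum of squares are worth carrying: they merge by addition and give you the exact mean and variance over any grouping (the naive variance formula here is fine for illustration; real code might want a numerically stabler variant):

```python
import math

def summarize(xs):
    """Mergeable summary: count, sum, and sum of squares."""
    return {"n": len(xs), "s": sum(xs), "s2": sum(x * x for x in xs)}

def merge(a, b):
    """Summaries merge by adding each field."""
    return {k: a[k] + b[k] for k in a}

def mean_and_stddev(agg):
    mean = agg["s"] / agg["n"]
    # Population variance: E[x^2] - (E[x])^2
    var = agg["s2"] / agg["n"] - mean * mean
    return mean, math.sqrt(max(var, 0.0))

minute_1 = summarize([12.0, 18.0, 40.0])
minute_2 = summarize([8.0, 22.0, 700.0])
print(mean_and_stddev(merge(minute_1, minute_2)))  # exact, not estimated
```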
> A popular way to do it wrong is to use summary statistics that regularly emit a quantile, for example don't collect the p90 from every server every minute. That discards far too much information. Instead, collect the CDF of service times. CDFs can be combined in a lossless way, and they carry their own denominators. Basically if you are using the typical Prometheus setup, you should be getting the right data.
And choose the right quantization for histograms/CDFs.
> To go to the next level I would suggest also emitting the sum and sum of squares, which will make your summary statistic far more useful.
Histograms should also give you the sum. I haven't seen sum of squares in many monitoring setups; is this for underlying metrics that are differences/errors?
The general guideline is to store enough data that all measures (mean, quantile, min, max) are commutative over different queries, by ensuring that measure(Xs) = measure(map(measure, arbitrary_partition(Xs))) for any set of metrics Xs and any partitioning function. In other words, no matter how you split the original metric into distinct subsets, measuring those subsets and then measuring the results gives the same answer as measuring the whole set. This is only possible if enough data is stored in the metric about the underlying values in the set.
This allows aggregation for efficiency, usually either by partitioning the data by short time windows (1 second or more usually) and storing measures over each duration or by storing a sampled subset of original values, both of which can later be queried to produce measures over arbitrary periods of time.
A few measures can be supported exactly with pre-aggregation (min, max, mean, variance), some can be supported with bounded error (quantiles) or statistical error (pretty much any measure over sampled values).
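As a sketch of that guideline (not any particular system's format): a small mergeable aggregate where measuring the merge of any partition equals measuring the whole set, for the measures it carries.

```python
from dataclasses import dataclass

@dataclass
class Agg:
    count: int
    total: float
    lo: float
    hi: float

    @classmethod
    def of(cls, xs):
        """Pre-aggregate a batch of raw values."""
        return cls(len(xs), sum(xs), min(xs), max(xs))

    def merge(self, other):
        """Combine two pre-aggregated partitions without losing exactness."""
        return Agg(
            self.count + other.count,
            self.total + other.total,
            min(self.lo, other.lo),
            max(self.hi, other.hi),
        )

    @property
    def mean(self):
        return self.total / self.count

# measure(Xs) == measure(merge over any partition of Xs)
xs = [3.0, 7.0, 1.0, 9.0, 4.0, 6.0]
whole = Agg.of(xs)
parts = Agg.of(xs[:2]).merge(Agg.of(xs[2:5])).merge(Agg.of(xs[5:]))
assert (whole.count, whole.total, whole.lo, whole.hi) == (parts.count, parts.total, parts.lo, parts.hi)
print(whole.mean, parts.mean)  # identical
```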