This is the #1 thing that people get wrong. It's something that otherwise smart software engineers get wrong because they don't have enough of a background in data analytics.
The problem has two parts. One part is that once you reduce an observation to summary statistics, you can't go back. The other part is that web services usually generate too much data if you don't summarize.
I'd still try to store each event with all associated measurements as one item, using either a dedicated time-series DB like InfluxDB or Timescale, or even something like Elasticsearch. If one event is one HTTP request and your service receives a million visits every day, you still only have to store one million events per day, which is about 12 events per second.
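Roughly what I mean by "one event as one item", as a sketch (the field names here are made up, not a real schema):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical event record: one HTTP request stored whole, so any
# statistic can still be computed later from the raw measurements.
@dataclass
class RequestEvent:
    timestamp: datetime
    endpoint: str
    status_code: int
    latency_ms: float
    backend_calls: int   # e.g. number of downstream requests fanned out
    bytes_sent: int

event = RequestEvent(
    timestamp=datetime.now(timezone.utc),
    endpoint="/checkout",
    status_code=200,
    latency_ms=37.4,
    backend_calls=23,
    bytes_sent=5120,
)
```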
I’m used to dealing with much larger numbers, maybe that’s the difference. 12 HTTP requests per second is something I can handle with a ten-year-old laptop on a home DSL connection.
Typically you also want to record backend information. One HTTP request at the front end might be twenty or a hundred backend requests.
Can’t think of a way to reduce this to guidelines. I’ve seen a bunch of different errors. Mostly it’s a bad understanding of the math, or a bad understanding of what you’d want to do with the collected data.
Just one example that comes to mind: suppose you want average (mean) latency. So you record, for every minute, the average latency for requests during that minute. Now you want to calculate the average latency for an entire day… but it’s impossible. You can calculate the daily average of the per-minute averages, but you’re averaging over minutes, rather than averaging over requests.
This is just a simple example, but you can see how a seemingly innocuous decision sabotaged your ability to run the query you want.
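To make it concrete with made-up numbers: the average of per-minute averages weights a quiet minute the same as a busy one, which is not the per-request average. You'd need each minute's request count to fix it:

```python
# Two made-up minutes: a quiet one and a busy one.
minutes = [
    {"count": 10,   "mean_latency_ms": 50.0},   # quiet minute
    {"count": 1000, "mean_latency_ms": 500.0},  # busy minute
]

# Naive "average of averages": weights both minutes equally.
naive = sum(m["mean_latency_ms"] for m in minutes) / len(minutes)

# Correct per-request mean: weight each minute's mean by its request count.
total_requests = sum(m["count"] for m in minutes)
weighted = sum(m["count"] * m["mean_latency_ms"] for m in minutes) / total_requests

print(naive)     # 275.0 ms
print(weighted)  # ~495.5 ms -- the busy minute dominates, as it should
```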
Average is actually one of the "nicer" summary statistics to work with, because you can re-aggregate it at a different level as long as you also kept the volume (the request count). It's statistics like the median and percentiles that you have to worry about. In my experience, people over-rely on averages just because they're easy to re-aggregate, when they really should be using something else.
Yeah, median and percentiles are awful. You usually want them only at the very top-end, as the final summary of some system. Like, “page renders within 500ms for 90% of users”.
So, well-meaning engineers will collect percentiles on individual components in the system, with the idea that if you want percentiles at the end, you should just collect percentiles everywhere. You're left with garbage data that can't be aggregated in any way that makes sense.
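To illustrate with synthetic numbers (just a sketch, not a real system): the average of per-server p90s is not the fleet-wide p90, and there's no way to recover the latter from the former.

```python
import random

random.seed(0)

def p90(xs):
    # Simple nearest-rank percentile; fine for an illustration.
    xs = sorted(xs)
    return xs[int(0.9 * (len(xs) - 1))]

# Synthetic latencies for two servers: one mostly fast, one mostly slow.
server_a = [random.expovariate(1 / 20) for _ in range(10_000)]   # ~20 ms mean
server_b = [random.expovariate(1 / 200) for _ in range(1_000)]   # ~200 ms mean

per_server_p90s = [p90(server_a), p90(server_b)]
avg_of_p90s = sum(per_server_p90s) / len(per_server_p90s)
true_p90 = p90(server_a + server_b)

print(avg_of_p90s)  # a blend of roughly ~46 and ~460 -- not a meaningful quantile
print(true_p90)     # the actual fleet-wide p90, computed from the raw values
```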
That's not at all true. Any meaningful aggregation requires storing the empirical distribution of data for individual components. A percentile histogram is an excellent approximation to this, when done properly.
Histograms can be added and aggregated. You can always estimate the mean from a histogram.
So if you compute a histogram of latency values every minute, you can add them up for successive minutes to get the distribution over larger intervals.
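Rough sketch of what that looks like, assuming all minutes share the same bucket bounds (the bounds here are made up):

```python
# Shared bucket upper bounds in ms (hypothetical; real systems choose these carefully).
BUCKETS = [10, 25, 50, 100, 250, 500, 1000]

def histogram(latencies_ms):
    """Count how many observations fall into each bucket (last bucket is overflow)."""
    counts = [0] * (len(BUCKETS) + 1)
    for x in latencies_ms:
        for i, upper in enumerate(BUCKETS):
            if x <= upper:
                counts[i] += 1
                break
        else:
            counts[-1] += 1
    return counts

def merge(h1, h2):
    """Histograms over the same buckets merge by element-wise addition."""
    return [a + b for a, b in zip(h1, h2)]

def approx_mean(counts):
    """Estimate the mean from bucket midpoints (crude, but illustrative)."""
    lowers = [0] + BUCKETS
    uppers = BUCKETS + [2 * BUCKETS[-1]]  # arbitrary cap for the overflow bucket
    mids = [(lo + hi) / 2 for lo, hi in zip(lowers, uppers)]
    total = sum(counts)
    return sum(m * c for m, c in zip(mids, counts)) / total if total else float("nan")

minute_1 = histogram([12, 18, 40, 90, 300])
minute_2 = histogram([8, 22, 35, 45, 700])
hour_so_far = merge(minute_1, minute_2)
print(approx_mean(hour_so_far))
```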
I think you actually agree with what I'm saying... which is that discarding the histogram and e.g. just collecting the 90%ile will leave you unable to do the analysis you want.
Latencies are patently not normally distributed. I could offer you 100:1 odds on that bet.
But the problem with averaging averages is that it makes no sense mathematically. The average (well, the arithmetic mean, to be technical) has a well-defined meaning and interpretation in terms of the original data points. When you start averaging averages (or, dear Lord, percentiles!) you lose the last glimmer of interpretability you had.
They're Poisson distributed, by construction. Queuing service time is actually the go-to example when first introducing the Poisson distribution in an intro statistics class. No need to make a bet. Anything that can't be negative isn't normally distributed.
The bigger issue here, aside from the fact that you may be taking a simple average when you need a weighted average to aggregate your individual point estimates (as others have pointed out), is asymmetry. You can approximate a sufficiently symmetric Poisson distribution with a normal distribution, but response latency is not symmetric: time travel isn't possible, so you can never get negative response times to balance out the tail events on the right side of the distribution. Tail events with respect to latency can only be bad, not good. They're a bet with a hard cap on winnings but potentially unbounded losses. As such, you really need to consider what the worst case is and the expected frequency of that worst case.
Service latencies are typically not Poisson distributed, partially because their queuing systems are non-trivial, but mostly because there's typically a lot more going on than just queues.
Thank you, that is quite interesting! As you can see, even though I have an applied math degree, I still can't just analytically infer a distribution. You need empirical results. This reminds me of the Jellyfish network topology for data centers (https://www.researchgate.net/publication/51943691_Jellyfish_...), where they found completely random route selection to be faster and more scalable than any other existing solution.
I suspect, naively, that latency purely from network appliance queueing probably gives a mixture of Poissons in a straight line bus topology. But add in arbitrary routing and the much more complicated schedulers of the servers themselves, and you get no well-behaved distribution at all.
It's especially interesting to compare the SLAs given by data center providers that tend to always use 99th or even 99.9th percentile worst case performance, with application metrics using arithmetic averages.
If all you have is 50 averages, there is no way to combine them to get the average of the underlying data, because you don't have the original "length(x)".
A popular way to do it wrong is to use summary statistics that regularly emit a quantile, for example don't collect the p90 from every server every minute. That discards far too much information. Instead, collect the CDF of service times. CDFs can be combined in a lossless way, and they carry their own denominators. Basically if you are using the typical Prometheus setup, you should be getting the right data.
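For illustration, a sketch of merging cumulative bucket counts ("observations <= bound") across two servers and reading a quantile off the merged counts. This is in the spirit of Prometheus-style histograms but not its actual API; the bounds and counts are made up:

```python
# Cumulative bucket counts, like a sampled CDF. Upper bounds are shared
# across servers; the last entry stands in for +Inf.
UPPER_BOUNDS = [50, 100, 250, 500, float("inf")]

# Hypothetical per-server cumulative counts for the same minute.
server_a = [700, 900, 980, 995, 1000]
server_b = [100, 300, 700, 950, 1000]

# Because the bounds match, cumulative counts merge by simple addition,
# and the last entry is the total observation count (the "denominator").
merged = [a + b for a, b in zip(server_a, server_b)]

def quantile(q, bounds, cumulative):
    """Estimate a quantile by linear interpolation inside the containing bucket."""
    total = cumulative[-1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in zip(bounds, cumulative):
        if rank <= count:
            if bound == float("inf"):
                return prev_bound  # can't interpolate into the overflow bucket
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + frac * (bound - prev_bound)
        prev_bound, prev_count = bound, count
    return prev_bound

print(quantile(0.9, UPPER_BOUNDS, merged))  # fleet-wide p90 estimate
```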
To go to the next level I would suggest also emitting the sum and sum of squares, which will make your summary statistic far more useful.
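Sketch of why count + sum + sum of squares are worth carrying: they merge by addition and give you the exact mean and variance over any grouping (the naive variance formula here is fine for illustration; real code might want a numerically stabler variant):

```python
import math

def summarize(xs):
    """Mergeable summary: count, sum, and sum of squares."""
    return {"n": len(xs), "s": sum(xs), "s2": sum(x * x for x in xs)}

def merge(a, b):
    """Summaries merge by adding each field."""
    return {k: a[k] + b[k] for k in a}

def mean_and_stddev(agg):
    mean = agg["s"] / agg["n"]
    # Population variance: E[x^2] - (E[x])^2
    var = agg["s2"] / agg["n"] - mean * mean
    return mean, math.sqrt(max(var, 0.0))

minute_1 = summarize([12.0, 18.0, 40.0])
minute_2 = summarize([8.0, 22.0, 700.0])
print(mean_and_stddev(merge(minute_1, minute_2)))  # exact, not estimated
```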
> A popular way to do it wrong is to use summary statistics that regularly emit a quantile, for example don't collect the p90 from every server every minute. That discards far too much information. Instead, collect the CDF of service times. CDFs can be combined in a lossless way, and they carry their own denominators. Basically if you are using the typical Prometheus setup, you should be getting the right data.
And choose the right quantization for histograms/CDFs.
> To go to the next level I would suggest also emitting the sum and sum of squares, which will make your summary statistic far more useful.
Histograms should also give you the sum. I haven't seen sum of squares in many monitoring setups; is this for underlying metrics that are differences/errors?
The general guideline is to store enough data that all measures (mean, quantile, min, max) are commutative over different queries, by ensuring that measure(Xs) = measure(map(measure, arbitrary_partition(Xs))) for any set of metrics Xs and any partitioning function. In other words, no matter how you split the original metric into distinct subsets, measuring those subsets and then measuring the results gives the same answer as measuring the whole set. This is only possible if enough data is stored in the metric about the underlying values in the set.
This allows aggregation for efficiency, usually either by partitioning the data by short time windows (1 second or more usually) and storing measures over each duration or by storing a sampled subset of original values, both of which can later be queried to produce measures over arbitrary periods of time.
A few measures can be supported exactly with pre-aggregation (min, max, mean, variance), some can be supported with bounded error (quantiles) or statistical error (pretty much any measure over sampled values).
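As a sketch of that guideline (not any particular system's format): a small mergeable aggregate where measuring the merge of any partition equals measuring the whole set, for the measures it carries.

```python
from dataclasses import dataclass

@dataclass
class Agg:
    count: int
    total: float
    lo: float
    hi: float

    @classmethod
    def of(cls, xs):
        """Pre-aggregate a batch of raw values."""
        return cls(len(xs), sum(xs), min(xs), max(xs))

    def merge(self, other):
        """Combine two pre-aggregated partitions without losing exactness."""
        return Agg(
            self.count + other.count,
            self.total + other.total,
            min(self.lo, other.lo),
            max(self.hi, other.hi),
        )

    @property
    def mean(self):
        return self.total / self.count

# measure(Xs) == measure(merge over any partition of Xs)
xs = [3.0, 7.0, 1.0, 9.0, 4.0, 6.0]
whole = Agg.of(xs)
parts = Agg.of(xs[:2]).merge(Agg.of(xs[2:5])).merge(Agg.of(xs[5:]))
assert (whole.count, whole.total, whole.lo, whole.hi) == (parts.count, parts.total, parts.lo, parts.hi)
print(whole.mean, parts.mean)  # identical
```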