If you are writing an article about time series databases and you don't mention KDB - straight to jail. It is the grandfather of time series databases and predates Influx by about a decade. It is still the fastest out there too. It is used by just about every major financial and trading institution in the US and Europe.
Everybody thinks TSDBs are something new-ish, but they've been around since the days of APL. All you youngins disappoint me every time you write about time series, vector languages, or data-oriented programming and entirely neglect all the work that comes under the APL/vector umbrella. SoA and DOD have been around for 50+ years, and they didn't start with C++ or Pandas.
Now the creator, Arthur Whitney, has a new one out called Shakti that is even faster (but it has also ditched some of the "niceties" of Q): https://shakti.com/
I control-F'ed and didn't find it before I wrote that comment. I searched for KDB and APL, but found nothing.
EDIT: yeah, that sentence is new. I think some of the sentences around Influx being the "first" have been adjusted too, now saying it is the first "mainstream" one and calling other DBs niche. It's also a little off: KDB+ is more popular in large banks and trading firms than HFT places.
If the author ends up reading this... as someone who is ignorant of but curious about time series databases, I found it frustrating to see an example of what relational data looks like, but then no example of what time series data looks like.
It might take a bit more context or setup, but this article is an opportunity to educate folks who aren't deep in the same trench.
Edit: in fairness, this article doesn't describe itself as "a Postgres developer's guide to getting started with time series databases" and maybe that's what I was unfairly hoping I would find.
The naive relational time series table looks like:
propertyid, timestamp, value
Then add a covering index so your reads only ever hit the index. This works completely fine for low billions of rows. After that I suggest using ClickHouse. It is less general, but at a large enough scale you need to make some tradeoffs. It is completely fine to start with a relational DB in many cases though.
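A minimal sketch of that table and covering index in Postgres (names and types are just illustrative; assumes a unix-epoch timestamp):

    -- Naive relational time series table: one row per observation.
    CREATE TABLE readings (
        propertyid  text             NOT NULL,  -- e.g. a stock symbol
        ts          bigint           NOT NULL,  -- unix timestamp
        value       double precision NOT NULL
    );

    -- Covering index: range scans by property and time are answered
    -- entirely from the index, without touching the table itself.
    CREATE INDEX readings_covering_idx
        ON readings (propertyid, ts) INCLUDE (value);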
The article mentioned stock prices, so let's use your schema:
LLY.NYSE, 1726377148, 924.38
SHOP.NYSE, 1726377216, 72.45
SHOP.NYSE, 1726377245, 72.41
LLY.NYSE, 1726377284, 924.39
LLY.NYSE, 1726377310, 924.36
Okay, so you're appending values that capture a moment in time for a given index, with the understanding that you're almost never going to revise a tuple. So far as we're concerned, time only flows in one direction. Periodicity is determined by the client application but could be throttled due to a bottleneck, resulting in fewer writes.
I can imagine how storing in blocks based on time ranges, plus the fact that the time value only ever goes up, makes this very easy to index.
With even this simple example, it's clear what this type of datastore could be useful for, even though I'm not convinced you couldn't do pretty exciting things in Redis or sqlite3.
That said, I'm still significantly confused why the hell you can only store numeric values. It just seems like a very arbitrary constraint.
If you use a bulk insert pattern you might be able to get ~50K inserts per second. If you have to insert one record at a time, it will likely drop down to ~3K or so.
The key is the covering index: you don't want to hit the actual table at all. You should re-organize the index periodically as well in order to ensure optimal query times.
This is suboptimal in many ways, as the records are tiny yet there is per-row overhead. However, it works fine up to low billions of rows in general. I think it would work fine in SQLite as well. Redis is a very different animal however, so a different strategy would be needed.
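A rough sketch of the bulk-insert pattern against a table like the one above (batch size and values are just illustrative, taken from the earlier example):

    -- Multi-row insert: one statement carries many rows, so the
    -- per-statement overhead is amortized across the batch.
    INSERT INTO readings (propertyid, ts, value) VALUES
        ('LLY.NYSE',  1726377148, 924.38),
        ('SHOP.NYSE', 1726377216, 72.45),
        ('SHOP.NYSE', 1726377245, 72.41);

    -- Periodic maintenance to keep index scans fast, e.g.:
    -- REINDEX INDEX CONCURRENTLY readings_covering_idx;

For sustained ingest, COPY (or your driver's batch API) is usually faster still than plain INSERT statements.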
Migrating the same schema to ClickHouse when you need it (I have no affiliation with them) will give you a 50X lower data footprint, increase your ingest rates by 10X, and speed up your queries by about 2X. However, if you don't really need that, don't bother adding another component to your infrastructure (imo). Also, ClickHouse is much more of a purpose-built Ferrari, i.e. you can't expect it to do all kinds of "off schema" queries (e.g. give me all the stock values on Jan 10 @ 10:32:02.232am). In a relational DB you can add indexes and get results reasonably well. In ClickHouse, schema == query pattern, essentially. Naturally, you should know what you are doing with both models, but with ClickHouse, don't treat it like a "magic data box" that can handle anything you throw at it (beyond a certain scale at least).
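For illustration, the same data in ClickHouse might look roughly like this (a sketch, not a tuned schema); the ORDER BY key is the "schema == query pattern" part:

    -- MergeTree stores rows sorted by the ORDER BY key, so queries that
    -- filter by propertyid and a time range read mostly sequential data.
    CREATE TABLE readings
    (
        propertyid  String,
        ts          DateTime,
        value       Float64
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(ts)
    ORDER BY (propertyid, ts);

Queries that follow that pattern (filter by propertyid, then a time range) fly; ad-hoc queries that cut across it don't get the same help, which is the tradeoff described above.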
The constraint is not inherent to TSDBs generally. InfluxDB supports string values, for example, as mentioned in the article. You also have log databases, like Loki, that are designed like a TSDB except they only store strings.
My guess is that constraining to numbers greatly simplifies the implementation, especially so for the query language and aggregation functions.
There are some compression algos (e.g. Gorilla) which do a good job at lossless compression. However, the key is not to try to store arbitrary data types in the same table (don't use "OO" ideas here).
Non-TSDB databases can do an adequate job — up to a certain point. However, getting good performance comes down to query access paths.
For enormous amounts of data, you want your data to be sequential (a few batch reads are generally faster than lots of random ones) and sorted. Databases like Postgres don't store rows in index order, and their indexes are geared towards small fetches as opposed to big ranges of sequential data. (Postgres does have BRIN indexes, but they're not that great for this.) Other databases do support clustered tables, but they're still page-based underneath and still suboptimal for other reasons. It's a good learning exercise to compare traditional databases with something like ClickHouse, which was designed from the ground up to store data in this optimized way. (It's a general-purpose OLAP database, so columns can be any kind of data, not just numbers, and data doesn't need to be time-based; so it's useful for much more than "pure" TSDBs.)
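For reference, the BRIN index mentioned above is a one-liner in Postgres (assuming the readings table sketched earlier); it is tiny, but it only helps when rows are physically laid out roughly in time order:

    -- BRIN stores min/max summaries per block range, so time-range queries
    -- can skip whole blocks, but only if inserts arrive roughly in time order.
    CREATE INDEX readings_ts_brin ON readings USING brin (ts);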
As for numbers: For metric data you're storing numeric measurements. The whole point is applying aggregations to the data, like calculating an average or a percentile over a certain time period. Storing numbers also has the advantage of allowing downsampling. For many purposes you don't need 1s granularity for all your data. So as you "roll" data over into old time periods, you can downsample by storing the data points aggregated as mean/min/max/etc. New data can be exact down to the second, while old data can be exact down to the hour, for example.
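A minimal sketch of that kind of rollup in plain SQL (the readings_hourly table is hypothetical, and this assumes the readings table from earlier):

    -- Downsample old raw readings into hourly mean/min/max aggregates.
    INSERT INTO readings_hourly (propertyid, hour_ts, avg_value, min_value, max_value)
    SELECT
        propertyid,
        (ts / 3600) * 3600 AS hour_ts,   -- truncate unix timestamp to the hour
        avg(value),
        min(value),
        max(value)
    FROM readings
    WHERE ts < extract(epoch FROM now() - interval '30 days')
    GROUP BY 1, 2;

After the rollup, raw rows older than the cutoff can be deleted or moved to cheaper storage.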
You absolutely need a covering index in a relational DB. This way the data is read entirely from the index. The table just essentially goes along for the ride. The extra storage is a little distasteful from a conceptual / academic perspective but it works ok below a certain scale (low B rows). Beyond that use ClickHouse (but understand how it works - don't treat it as a "magic data box").
> That said, I'm still significantly confused why the hell you can only store numeric values. It just seems like a very arbitrary constraint.
"In mathematics, a time series is a series of data points indexed (or listed or graphed) in time order."
A numeric datapoint is by far the largest use case, so time series databases optimize for that. Most allow you to attach several labels of course, i.e. you could store country border crossings as a numeric count with labels identifying the two countries involved, rather than as strings.
I don't believe there is any restriction in InfluxDB but I can't remember for sure. Of course in a relational DB you can absolutely store strings but create a separate table for it (same in ClickHouse). If you have a somewhat restricted set of strings, ClickHouse will laughably compress it down to almost nothing using the low cardinality approach.
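A sketch of what that looks like in ClickHouse (table and column names are just illustrative):

    -- LowCardinality dictionary-encodes the column, so a restricted set of
    -- strings compresses down to almost nothing.
    CREATE TABLE string_events
    (
        ts      DateTime,
        source  LowCardinality(String),
        message String
    )
    ENGINE = MergeTree
    ORDER BY (source, ts);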
However, strings aren't as common, since usually you are storing a numeric value of some sort.
Being familiar with KDB, and somewhat so with InfluxDB, it strikes me as a challenging space. I suspect there just isn't much money in it.
Oddly, InfluxDB has raised amounts of VC money approaching the current market cap of KDB's parent company. I know VC raised and market cap are not directly comparable, but what is the hoped-for enterprise value at exit of NewCo if the IncumbentCo is worth Z?
My take is that orgs with real revenue-generating time series data challenges, budget and dev staff to cook up solutions have long ago bought KDB licenses or rolled their own in-house column store.
Orgs using time series DBs for telemetry/observability/etc. type "back office" problems (where you are willing to downsample/conflate/offload history) either don't want to pay a dime, or want a fully formed SaaS solution with a pretty GUI, alerting, etc. like DataDog, which they will pay something like $10M/year for.
Also K (1993) and A+ (1988) although the former only became public in 1998 or so, and the latter in 2003 - they were only available inside Morgan Stanley in the beginning IIRC.
I was working in K2 in 1999, and it was public then. This is back when it had the built in GUI and dependency graph. I remember the first time somebody showed it to me, and I didn't think it was very impressive. It took about 2-3 months of messing with it before I realized how very wrong I was.
We used a customized version of OpenTSDB at eBay to do monitoring of everything on the platform, including the search infrastructure. This was during the time eBay built a private cloud.
RRDTool was the generalised version of the TSDB that was born for the specific use case, MRTG. It could be used for anything. Hopefully I've remembered that correctly!
I remember when a bunch of WSO2 contribs broke off, made their own (AGPL) ESB, and created a 'ESB Performance' website that cherry picked specific scenarios to make WSO2 look worse than their product.
This was made funnier because the 'consultant' (really a buddy of management that used us as an experiment for all sorts of things; at least the code was readable, and I learned lots of languages fixing things, lol) wound up having us go with the AGPL product, and it wasn't until I said to the team 'hey, let's try using the WSO2 docs as a cross reference' that we got productive.
I do miss the 'dramatic readings' of UltraESB docs, learned a colleague understood Pashto [0].
[0] - After a specific 'verse', the colleague pointed out one of the odd phrases in the docs that prompted this practice, was a colloquialism in that language.
A bit of an opinionated and partial version of history (Graphite was big before InfluxDB and Prometheus came out, and the landscape is bigger than what they mentioned), but it might be a good enough starting point to learn about this.
They are definitely using a more narrow definition of time series database than I expected - they don't mention RRDtool, for example, which is GPL and has been around for decades.
Graphite[1] has an interesting plugin ecosystem, like Grafana[2], and has been around since 2009. The Architecture of Open Source Applications book[3] has a nice overview of the architecture. I really like how it reuses the Linux filesystem as a data store, making archiving very simple.
I'm using VictoriaMetrics (VM) to store basic weather data, like temperature and humidity. My initial setup was based on Prometheus, but it seemed very hard to set a high data retention value; the default was something like 15 days if I recall correctly.
Since I would actually like to store all recorded values permanently, I could partially achieve this with VM which let me set a higher threshold, like 100 years. Still not 'forever' as I would have liked, but I guess me and my flimsy weather data setup will have other things than the retention threshold to worry about in 100 years.
Would be nice to learn the reason why an infinite threshold is not allowed.
Mimir [1] is what we use where I work. We are very happy with it, and we have very long retention. Previously, our Prometheus setup was extremely slow if you went past today, but Mimir partitions the data to make it extremely fast to query even long time periods. We also used Thanos for a while, but Mimir apparently worked better.
I have done Mimir deployments. I am generally very happy with Mimir. It's very cost efficient. It does require someone who knows enough to admin it though.
I didn't pick Thanos because I really like the horizontally scaled blob store architecture the Grafana crew put together.
They're out of tech posters' zeitgeist but AFAIK they are each still maintained and fulfilling people's needs. Just not as much commentary or front-of-mind-share.
Usually performance and storage concerns. You can set effectively infinite retention on Prometheus, but after a long enough period you're going to like querying it even less.
Most TSDBs aren't built for use cases like "query hourly/daily data over many years". Many use cases aren't looking further than 30 days because they're focused on things like app or device uptime and performance metrics, and if they are running on longer time frames they're recording (or keeping, or consolidating data to) far fewer data points to keep performance usable.
My biggest beef with VictoriaMetrics is I can't delete and re-ingest a range of data. I.e. if I discover yesterday's data was messed up due to some issue in downstream data processing, I can't delete yesterday's data and re-ingest the cleaned-up data (or even just delete yesterday's bad data).
The only workaround I've found so far is to dump out the whole timeseries which could go back months/years, delete the timeseries, fix/delete data in the dump and re-ingest the whole thing. This really, really sucks.
I haven't actually used Timescale but it was on my list to try after various gripes with how InfluxDB worked and limitations on tags.
Broadly speaking, I can say I'd trust TimescaleDB over Influx for a lot of cases since it's using Postgres under the hood. While I don't know precisely what issues we had adopting Influx, I know we had some as we 'scaled up', and that was when stuff like tag limits was brought up a lot. (Interestingly, our app was 'barely passing' because we metric'd a lot, but we also kept stuff just short enough for everyone to be OK with it, mostly because the metrics we got were very useful.)
AWS interestingly launched a managed InfluxDB option under the “Timestream” product this year.
I’ve personally found TimescaleDB to be a much easier to work with platform, even if there is more overhead in defining a schema and enabling compression, etc.
I've been using Influx v1 on a small scaleway instance for a while.
I loved it and the tooling around it as well.
I would get notifications on Telegram if events needing attention happened.
But v3 is still under development so I'll explore Prometheus.
I might expect that for a specialized task, specialized database technology would outperform general relational databases. But what specific technologies have significant impacts on time series db efficiency?
I wonder if the in-house metrics systems at big tech firms like Google and Facebook should be counted as ‘proprietary’ for these purposes. I suppose not because one can’t really pay to get them internally.
Can anyone comment on why these are superior to ClickHouse? I really like ClickBench, which compares the various products' performance (and is open source).
I have no affiliation with ClickHouse, but in my experience, everything I have tried (a regular relational DB (Postgres), InfluxDB, TimescaleDB) is significantly inferior to it. However, I wouldn't bother with it unless you have enough scale to justify it (imo that is > low billions of rows).