Making indexes smaller is nice even when you have a ton of storage, since more of them can fit into the hot set. However, as someone who runs terabytes of databases, I'd say "just provision more storage" is always a valid option, especially if you are outside the cloud. If you have your own hardware, new enterprise NVMe SSDs are about $80/TB, and DDR4 RAM is around $1.20/GB. Four hours of engineering time (very roughly $1000) buys either 800 GB of RAM or 12 TB of storage.
Well, there's a multiplier between the nominal capacity and the capacity you actually need to purchase for your whole system. You're not buying just that 1 TB. You probably want at least two live failover servers, maybe more. Plus however many layers of backups and disaster recovery; a year of weekly backups makes your 1 TB into 50, even if it's offline storage.
My own company struggles with that - throwing more storage on the live db is easy, so we've kept doing that for years, but pushing around multi-terabyte backups is getting cumbersome and we're going to have to slim down the data in prod even at the cost of engineer effort.
> a year of weekly backups makes your 1 TB into 50, even if it's offline storage.
A 20TB drive (~18TB usable) is $370 + whatever power it will use.
6 of them in RAID6 will give you 72TB of backup storage; add an extra copy and, say, an extra drive as a spare, and that's 13 * $370 = $4810 USD for ~5 years, plus whatever power it will use (~50-55W if the drives idle most of the time and just write a backup once a day).
I think people have become completely disconnected from what it actually costs to run something, because of the vastly higher resale prices of storage and bandwidth in the cloud.
> My own company struggles with that - throwing more storage on the live db is easy, so we've kept doing that for years, but pushing around multi-terabyte backups is getting cumbersome and we're going to have to slim down the data in prod even at the cost of engineer effort.
...but that's definitely the bigger part of that issue, cloud or not. Backup is slow, restore is slow, and anything that involves either of the two is also slow.
I hear this point from time to time, but you have to realise that consumer-grade hardware would never be used in any reasonable company. The business risk of not having a support contract in place is too high, so you need a support contract, which means buying supportable hardware, which means enterprise-grade gear, which means enterprise pricing. Your 50TB just turned into a 10TB array in a disk shelf plus some tape storage, at probably 20-30 grand total, plus support.
If you decide to be scammed by enterprise hardware vendors instead of hiring competent people because your management is a bunch of hacks, that's your company's problem.
Anyway, a Seagate Exos SAS 20TB drive is $409; you can redo the math for "true enterprise" grade hardware. An "enterprise" drive from a SAN vendor will be a normal hard drive with some bits in the firmware changed so the SAN can reject non-SAN-vendor rebrands. They die just the same; we had stacks of them to prove it.
Hell, some of our Intel enterprise NVMes died suspiciously quickly compared to others, but that might be a fluke...
Engineering hours: there's a good chance you pay for that once, and that solves the problem.
10 SSDs: will require rack space, electricity, PCIe slots, timely replacement, management software... most of these will be recurring expenses. If done once, sometimes the existing infrastructure can absorb them (e.g. you might already have empty space in a rack, or spare PCIe slots), but that amortization only works in small numbers.
Another aspect of this trade-off: systems inevitably lose performance per unit of equipment as they grow due to management expenses and increased latency. So, if you keep solving problems by growing the system, overall, the system will become more and more "sluggish" until it becomes unserviceable.
On the other hand, solutions which minimize the system resources necessary to accomplish a task increase overall performance per unit. In other words, they create a higher-quality system, which is an asset in its own right.
I mean, I'm also a proponent of "doing it right" most of the time, but throwing hardware at the problem is a mighty fine solution in the sub-TB range.
> 10 SSDs: will require rack space, electricity, PCIe slots, timely replacement, management software... most of these will be recurring expenses. If done once, sometimes the existing infrastructure can absorb them (e.g. you might already have empty space in a rack, or spare PCIe slots), but that amortization only works in small numbers.
10 NVMe SSDs fit into 1U.
The cloud always comes out terrible on hardware. The cloud is not selling hardware; buying raw hardware there is a terrible deal. It sells spike capacity, and it sells the software running on top of it.
There are many cloud services that would be a lot of work to recreate on-site (even if you only need a small part of their functionality), but if you don't use them and just need to throw raw hardware at the problem consistently, a rack of hardware will be a far cheaper solution even after incurring the engineering cost to deploy and automate it.
We have 7 racks; the time spent managing the hardware is a minuscule fraction, the time spent managing the automation on it is still a small part of our time, and we're an ops team of just 3.
Depending on the scale and complexity, if you make no effort to control resource usage, costs grow exponentially, sometimes even without growth in the business, because the requirements just become more complex. For a certain optimization, you may save 1TB today but end up saving 2TB some years down the road; with just a few of these decisions you could be looking at a difference of an order of magnitude or more, sometimes even at larger scale. Overall, there's always a balance to be struck.
> DDR4 RAM is around $1.20/GB. Four hours of engineering time (very roughly $1000) buys either 800 GB of RAM or 12 TB of storage.
$1000 doesn't buy you 800GB more of the RAM that your server likely takes, which is likely to run $3-6/GB ($3 for uncertified, $6 for motherboard manufacturer certified RAM), plus $0.50-$1/GB for the DIMM slots over and above your baseline RAM config.
It can still sometimes be smart to "throw hardware at it", but I don't think you can go from 128 GB of RAM today to 1 TB of RAM tomorrow for $1K in most cases.
"Just provision more storage" is usually the answer on the cloud as well, unless someone has a concrete idea of how to use less storage. The math is a bit trickier, though, since it's a monthly price rather than a one-time cost (although if your database is growing at a constant rate, then that one-time provisioning is really just a monthly price on-prem as well).
Even on RDS, the price of 20GB is negligible for most companies and isn't worth the added effort and engineer salaries spent looking into it.
From DBA perspective, that's a cool find. Thanks for sharing.
> Four hours of engineering time (very roughly $1000)
No one in my proximity doing any sort of engineering work, be it dev/sysadmin/ML/data science/whatever, earns anything close to that. I'm astonished by the figure of $250/hour for a full-time job (consulting is likely a different story).
I'm very curious what kind of engineering role that could be, and in which industry/country/company, if you can name a few and/or share some other info.
I typically calculate the cost of engineering hours as about 2x their salary costs. This captures the additional employer-paid taxes, benefits, etc, as well as ensuring that I don't under-value the cost, which is typically the bigger threat when providing estimates.
I know good SREs (fancy sysadmin plus some devops, but with more class) earning anywhere from around $120k/yr up to about $300k/yr. This is in the USA. Everyone I know started in SF, but since the COVID diaspora they are all around the US now.
True, but it's not as though the 10X expense is for nothing. A SAN enables a lot more than local storage. I didn't mean to suggest that there wasn't a way to achieve lower costs on storage hardware directly, just that in practice, for a variety of reasons, that quoted cost is not representative.
And in the case of a database it doesn't give you much. You still need 2 nodes for redundancy, and if you don't need a rack's worth of storage for that database, using a SAN for it is a waste of money.
I know the corporate idea of "everything on SAN, nothing outside of SAN" very well; it's just plainly solving political problems, not technical ones.
Your understanding of SAN usage is narrow, and incorrect in the case of my organization.
One major benefit that SAN-backed database storage provides us, when combined with VMs, is the ability to spin up another database instance against already-existing data in the SAN (e.g. a staging DB that looks at prod data).
The post mentions the B-tree deduplication that is present in PostgreSQL 13 but not in 12, the version they're using; at the same time, they note that the vast majority of values in some of their foreign key indexes are NULL.
I have to wonder whether B-tree deduplication would have helped with that particular case. The PostgreSQL 13 documentation seems to imply so, as far as I can tell [0] (under 63.4.2):
> B-Tree deduplication is just as effective with “duplicates” that contain a NULL value, even though NULL values are never equal to each other according to the = member of any B-Tree operator class.
I don't think it would be as effective as the partial index applied in the post; I'm just curious.
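For anyone curious what the partial-index approach looks like, here's a minimal sketch; the table and column names are made up for illustration and are not from the post:

    -- Hypothetical table where most rows have a NULL cancellation_id.
    CREATE TABLE bookings (
        id              bigint PRIMARY KEY,
        cancellation_id bigint   -- NULL for the vast majority of rows
    );

    -- Regular index: every NULL still gets an index entry.
    CREATE INDEX bookings_cancellation_id_full
        ON bookings (cancellation_id);

    -- Partial index: the NULL rows are simply not indexed at all.
    CREATE INDEX bookings_cancellation_id_partial
        ON bookings (cancellation_id)
        WHERE cancellation_id IS NOT NULL;

    -- Compare the on-disk sizes of the two.
    SELECT pg_size_pretty(pg_relation_size('bookings_cancellation_id_full'))    AS full_index,
           pg_size_pretty(pg_relation_size('bookings_cancellation_id_partial')) AS partial_index;

Queries that filter on cancellation_id = <some value> can still use the partial index, since an equality condition implies the IS NOT NULL predicate.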
In a previous discussion of this article, an HN user did the math: pg12 uses 16 bytes per NULL and pg13 uses 6.32 bytes per NULL (https://news.ycombinator.com/item?id=25989467). So definitely some pretty significant savings there.
Wow, thanks for this great writeup! I found it really useful not just for the partial-index "find" that is the focus of the post (though that was a great point and something to be aware of), but for the general overview of good techniques and things to watch for in Postgres if you're worried about using space inefficiently. Saving this one for reference!
One minor "word to the wise", cause I thought this was a great post but also has the potential to be misused: if you work at a startup or early stage company, it is nearly always the better decision to throw more disk space at a storage problem like this than worry about optimizing for size. Developers are expensive, disk space is cheap.
> if you work at a startup or early stage company, it is nearly always the better decision to throw more disk space at a storage problem like this than worry about optimizing for size. Developers are expensive, disk space is cheap
This is great advice. In general at the beginning it is better to keep things as simple as possible.
At one fast-growing startup I worked at, one of the founders insisted we keep upgrading just one machine (plus redundancy and backups). It was a great strategy! It kept the architecture super simple: easy to manage, easy to debug and recover. For the first 5 years of the company, the whole thing ran on one server, while growing exponentially and serving millions of users globally.
After seeing that, it’s clear to me you should only upgrade when needed, and in the simplest, most straightforward way possible.
There's a lot of emphasis here on just adding storage, but that misses the fact that unnecessary indexing also affects write and read performance (potentially quite significantly).
Using a partial index where it obviously matches the use case (like when a majority of values are null) is just correct modeling and should not be considered a premature optimization or a waste of developer time.
> There's a lot of emphasis here on just adding storage, but that misses the fact that unnecessary indexing also affects write and read performance (potentially quite significantly).
Again, for a startup or early stage company, spending developer time on read and write performance is almost certainly a waste. (Obviously if it's bad enough to be a blocker then fix it, but if it's bad enough to be a blocker then you've noticed it already).
I spent last week freeing up 200GB from our 600GB db with just reindex and pg_repack. The worst offender was a table with 17GB of data that had 142GB of indexes; a reindex took it down to 21GB. The table's indexing is crazy, with multiple indexes over different sets of columns.
A contributing factor to the huge indexes, I think, was the distribution of the data. The table has had inserts, updates and deletes continuously since 2015. Data is more likely to be deleted the older it gets, so there's more data from recent years, but about 0.1% of the data is still from 2015. I think maybe this skewed distribution with a very long tail meant vacuum had a harder time dealing with the index bloat.
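For reference, a rough sketch of the kind of commands involved; the index and table names here are placeholders, REINDEX ... CONCURRENTLY needs PostgreSQL 12+, and pg_repack is a separate extension with its own CLI:

    -- Rebuild a bloated index without blocking writes (PostgreSQL 12+).
    REINDEX INDEX CONCURRENTLY my_big_table_some_idx;

    -- See how much space the table vs. its indexes actually take.
    SELECT pg_size_pretty(pg_table_size('my_big_table'))   AS table_size,
           pg_size_pretty(pg_indexes_size('my_big_table')) AS index_size;

    -- pg_repack itself is run from the shell against the whole table, e.g.:
    --   pg_repack --table=my_big_table mydb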
Wow, they are not kidding; this truly is "we saved 20GB with one weird trick". We get a lot of requests from users wanting to use these unusual PostgreSQL index forms that are largely unheard of in old-school Oracle / SQL Server shops. I didn't realize they were indexing NULL values either.
The fact that this is news to you makes me feel better about not considering it myself. I suspect we have a few large indexes that could be given the same treatment.
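If it helps anyone hunting for candidates, here's a rough sketch of how you might spot them; the schema name is an assumption, and null_frac comes from the planner statistics, so run ANALYZE first:

    -- Biggest indexes in the database, largest first.
    SELECT c.relname                               AS index_name,
           pg_size_pretty(pg_relation_size(c.oid)) AS index_size
    FROM pg_class c
    JOIN pg_index i ON i.indexrelid = c.oid
    WHERE c.relkind = 'i'
    ORDER BY pg_relation_size(c.oid) DESC
    LIMIT 20;

    -- Columns with the highest fraction of NULLs, per planner stats.
    SELECT tablename, attname, null_frac
    FROM pg_stats
    WHERE schemaname = 'public'
    ORDER BY null_frac DESC
    LIMIT 20;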
On the first large-scale project I worked on, my team couldn't figure out why operations had slowed down as the data set grew.
Indexes have O(log n) insertion time per record. If you had 1,000 records in your test database, then as you approach 65k records your insertion time goes up by 60% (2^10 vs 2^16 records). Success slows everything down, and there are only so many server upgrades available.
Add a couple of new indexes for obscure features the business was asking for, and now you're up to double.
Double a very small number. If that's seriously an issue, use a faster disk. So many people are trying to run a DB on EBS with fewer IOPS than my PC from 2015.
I manage plenty of DBs with hundreds of millions of records and 40+ indexes per table/collection...
Insertion should still be extremely fast for such a small index, right?
A binary search within a B-tree page takes <100 cycles, and a full B-tree traversal over 100M records should still be measured in microseconds, if not hundreds of nanoseconds.
‘It depends’ - don’t forget about locking overhead, needing to flush multiple pages/leaves as thresholds get hit, disk I/O wait times, etc.
At the beginning it’s rarely noticeable, but it does exist, and under heavy load or with a lot of index writes it can cause very noticeable overhead.
Yes, Postgres keeps track of the most common values. If it knows 0 is common and "where val = 0" will keep 90% of the rows, it might choose to use a table scan instead of the index.
An index is most efficient when it represents a unique set of values. It's still very useful to have an index for grouping, but if there are only a few distinct groups relative to the number of rows, you will end up wasting a lot of space and cycles searching through the index.
If zero is by far the most common value, it could be slower to use the index when searching for zeroes. How so? Reading values that aren't included in the index requires following a reference back to the table row. If 90% of the rows in your query are zeroes, you'd be better off not using that column in your query planning. A naive filter, possibly even after a table scan, is likely to be faster than using the index.
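A quick sketch of what that looks like in practice; the table is made up, and the exact plans you get will depend on your data and settings:

    -- Made-up table where the value 0 dominates (~90% of rows).
    CREATE TABLE events (id bigserial PRIMARY KEY, val int);
    INSERT INTO events (val)
    SELECT CASE WHEN random() < 0.9 THEN 0 ELSE (random() * 1000)::int + 1 END
    FROM generate_series(1, 1000000);
    CREATE INDEX events_val_idx ON events (val);
    ANALYZE events;

    -- The planner sees that 0 matches ~90% of the rows and will
    -- typically pick a sequential scan here...
    EXPLAIN SELECT * FROM events WHERE val = 0;

    -- ...but an index (or bitmap) scan for a rare value.
    EXPLAIN SELECT * FROM events WHERE val = 123;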
Where can you learn these DBA skills for Postgres? If it were SQL Server I'd say do an MS cert prep course and get the certification, but I don't imagine that exists for pg.