Making indexes smaller is nice even when you have a ton of storage, since more of them can fit into the hot set. However, as someone who runs terabytes of databases, I'd say "just provision more storage" is always a valid option, especially if you are outside the cloud. If you have your own hardware, new enterprise NVMe SSDs are about $80/TB, and DDR4 RAM is around $1.20/GB. Four hours of engineering time (very roughly $1000) buys either 800 GB of RAM or 12 TB of storage.
Well, there's a multiplier between the nominal capacity and the capacity you actually need to purchase for your whole system. You're not buying just that 1 TB. You probably want at least two live failover servers, maybe more. Plus however many layers of backups and disaster recovery; a year of weekly backups makes your 1 TB into 50, even if it's offline storage.
My own company struggles with that - throwing more storage on the live db is easy, so we've kept doing that for years, but pushing around multi-terabyte backups is getting cumbersome and we're going to have to slim down the data in prod even at the cost of engineer effort.
> a year of weekly backups makes your 1 TB into 50, even if it's offline storage.
A 20TB drive (~18TB usable) is $370 + whatever power it will use.
6 of them in RAID6 will give you 72TB of backup storage; add an extra copy and, say, an extra drive as a spare, and that's 13 * $370 = $4810 USD for ~5 years, plus whatever power it will use (~50-55W if the drives idle most of the time and just write a backup once a day).
I think people have become completely disconnected from what it actually costs to run something, because of the vastly higher resale prices of storage and bandwidth in the cloud.
> My own company struggles with that - throwing more storage on the live db is easy, so we've kept doing that for years, but pushing around multi-terabyte backups is getting cumbersome and we're going to have to slim down the data in prod even at the cost of engineer effort.
...but that's definitely the bigger part of that issue, cloud or not. Backup is slow, restore is slow, and anything that involves either of the two is also slow.
I hear this point from time to time, but you have to realise that consumer-grade hardware would never be used in any reasonable company. The business risk of not having a support contract in place is too high, so you need a support contract, which means buying supportable hardware, which means enterprise-grade gear, which means enterprise pricing. Your 50TB just turned into a 10TB array in a disk shelf plus some tape storage, at probably 20-30 grand total, plus support.
If you decide to be scammed by enterprise hardware vendors instead of hiring competent people because your management is a bunch of hacks, that's your company's problem.
Anyway, a Seagate Exos SAS 20TB drive is $409; you can redo the math for "true enterprise" grade hardware. An "enterprise" drive from a SAN vendor will be a normal hard drive with some bits in the firmware changed so the SAN can reject non-SAN-vendor rebrands. They die just the same; we had stacks of them to prove it.
Hell, some of our Intel enterprise NVMes died suspiciously quickly compared to others, but that might be a fluke...
Engineering hours: there's a good chance you pay for that once, and that solves the problem.
10 SSDs: will require rack space, electricity, PCIe slots, timely replacement, management software... most of these will be recurring expenses. If done once, sometimes the existing infrastructure can absorb them (e.g. you might already have empty space in a rack, or spare PCIe slots), but that amortization only works in small numbers.
Another aspect of this trade-off: systems inevitably lose performance per unit of equipment as they grow due to management expenses and increased latency. So, if you keep solving problems by growing the system, overall, the system will become more and more "sluggish" until it becomes unserviceable.
On the other hand, solutions which minimize the system resources necessary to accomplish a task increase overall performance per unit. In other words, they create a higher-quality system, which is an asset in its own right.
I mean, I'm also a proponent of "doing it right" most of the time, but throwing hardware at the problem is a mighty fine solution in the sub-TB range.
> 10 SSDs: will require rack space, electricity, PCIe slots, timely replacement, management software... most of these will be recurring expenses. If done once, sometimes the existing infrastructure can absorb them (e.g. you might already have empty space in a rack, or spare PCIe slots), but that amortization only works in small numbers.
10 NVMe SSDs fit into 1U.
The cloud always comes out terrible on hardware. The cloud is not selling hardware; buying raw hardware there is a terrible deal. It sells spike capacity, and it sells the software running on top of it.
There are many cloud services that would be a lot of work to recreate on-site (even if you only need a small part of their functionality), but if you don't use them and just need to throw raw hardware at the problem consistently, a rack of hardware will be a far cheaper solution even after incurring the engineering cost to deploy and automate it.
We have 7 racks; the time spent managing the hardware is a minuscule fraction, the time spent managing the automation on it is still a small part of our time, and we're an ops team of just 3.
Depending on the scale and complexity, if you make no effort to control resource usage, costs grow exponentially, sometimes even without growth in the business, because the requirements just become more complex. For a certain optimization, you may save 1TB today but end up saving 2TB some years down the road; with just a few of these decisions you could be looking at a difference of an order of magnitude or more, sometimes even at larger scale. Overall, there's always a balance to be struck.
> DDR4 RAM is around $1.20/GB. Four hours of engineering time (very roughly $1000) buys either 800 GB of RAM or 12 TB of storage.
$1000 doesn't buy you 800GB more of the RAM that your server likely takes, which is likely to run $3-6/GB ($3 for uncertified, $6 for motherboard manufacturer certified RAM), plus $0.50-$1/GB for the DIMM slots over and above your baseline RAM config.
It can still sometimes be smart to "throw hardware at it", but I don't think you can go from 128 GB of RAM today to 1 TB of RAM tomorrow for $1K in most cases.
"Just provision more storage" is usually the answer on the cloud as well, unless someone has a concrete idea of how to use less storage. The math is a bit trickier, though, since it's a monthly price rather than a one-time cost (although if your database is growing at a constant rate, then that one-time provisioning is really just a monthly price on-prem as well).
Even on RDS, the price of 20GB is negligible for most companies and isn't worth the added effort and engineer salaries spent looking into it.
From DBA perspective, that's a cool find. Thanks for sharing.
> Four hours of engineering time (very roughly $1000)
No one in my proximity doing any sort of engineering work, be it dev/sysadmin/ML/data science/whatever, earns anything close to that. I'm astonished by the figure of $250/hour for a full-time job (consulting is likely a different story).
I'm very curious what kind of engineering role that could be, and in which industry/country/company, if you can name a few and/or share some other info.
I typically calculate the cost of engineering hours as about 2x their salary costs. This captures the additional employer-paid taxes, benefits, etc, as well as ensuring that I don't under-value the cost, which is typically the bigger threat when providing estimates.
I know good SREs (fancy sysadmin plus some devops, but with more class) earning anywhere from around $120k/yr up to about $300k/yr. This is in the USA. Everyone I know started in SF, but since the COVID diaspora they are all around the US now.
True, but it's not as though the 10X expense is for nothing. A SAN enables a lot more than local storage. I didn't mean to suggest that there wasn't a way to achieve lower costs on storage hardware directly, just that in practice, for a variety of reasons, that quoted cost is not representative.
And in the case of a database it doesn't give you much. You still need 2 nodes for redundancy, and if you don't need a rack's worth of storage for that database, using a SAN for it is a waste of money.
I know the corporate idea of "everything on SAN, nothing outside of SAN" very well; it's just plainly solving political problems, not technical ones.
Your understanding of SAN usage is narrow, and incorrect in the case of my organization.
One major benefit that SAN-backed database storage provides us, when combined with VMs, is the ability to spin up another database instance against already-existing data in the SAN (e.g. a staging DB that looks at prod data).
The post mentions the B-tree deduplication that is present in PostgreSQL 13 but not in 12, the version they're using; at the same time, they note that the vast majority of values in some of their foreign key indexes are NULL.
I have to wonder whether B-tree deduplication would have helped with that particular case. The PostgreSQL 13 documentation seems to imply so, as far as I can tell [0] (under 63.4.2):
> B-Tree deduplication is just as effective with “duplicates” that contain a NULL value, even though NULL values are never equal to each other according to the = member of any B-Tree operator class.
I don't think it would be as effective as the partial index applied in the post; I'm just curious.
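For anyone curious what the partial-index approach looks like, here's a minimal sketch; the table and column names are made up for illustration and are not from the post:

    -- Hypothetical table where most rows have a NULL cancellation_id.
    CREATE TABLE bookings (
        id              bigint PRIMARY KEY,
        cancellation_id bigint   -- NULL for the vast majority of rows
    );

    -- Regular index: every NULL still gets an index entry.
    CREATE INDEX bookings_cancellation_id_full
        ON bookings (cancellation_id);

    -- Partial index: the NULL rows are simply not indexed at all.
    CREATE INDEX bookings_cancellation_id_partial
        ON bookings (cancellation_id)
        WHERE cancellation_id IS NOT NULL;

    -- Compare the on-disk sizes of the two.
    SELECT pg_size_pretty(pg_relation_size('bookings_cancellation_id_full'))    AS full_index,
           pg_size_pretty(pg_relation_size('bookings_cancellation_id_partial')) AS partial_index;

Queries that filter on cancellation_id = <some value> can still use the partial index, since an equality condition implies the IS NOT NULL predicate.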
In a previous discussion of this article, an HN user did the math: pg12 uses 16 bytes per NULL and pg13 uses 6.32 bytes per NULL (https://news.ycombinator.com/item?id=25989467). So definitely some pretty significant savings there.
Wow, thanks for this great writeup! I found it really useful not just for the partial-index "find" that is the focus of the post (though that was a great point and something to be aware of), but for the general overview of good techniques and things to watch for in Postgres if you're worried about using space inefficiently. Saving this one for reference!
One minor "word to the wise", cause I thought this was a great post but also has the potential to be misused: if you work at a startup or early stage company, it is nearly always the better decision to throw more disk space at a storage problem like this than worry about optimizing for size. Developers are expensive, disk space is cheap.
> if you work at a startup or early stage company, it is nearly always the better decision to throw more disk space at a storage problem like this than worry about optimizing for size. Developers are expensive, disk space is cheap
This is great advice. In general at the beginning it is better to keep things as simple as possible.
At one fast-growing startup I worked at, one of the founders insisted we keep upgrading just one machine (plus redundancy and backups). It was a great strategy! It kept the architecture super simple: easy to manage, easy to debug and recover. For the first 5 years of the company, the whole thing ran on one server, while growing exponentially and serving millions of users globally.
After seeing that, it’s clear to me you should only upgrade when needed, and in the simplest, most straightforward way possible.
There's a lot of emphasis here on just adding storage, but that misses the fact that unnecessary indexing also affects write and read performance (potentially quite significantly).
Using a partial index where it obviously matches the use case (like when a majority of values are null) is just correct modeling and should not be considered a premature optimization or a waste of developer time.
> There's a lot of emphasis here on just adding storage, but that misses the fact that unnecessary indexing also affects write and read performance (potentially quite significantly).
Again, for a startup or early stage company, spending developer time on read and write performance is almost certainly a waste. (Obviously if it's bad enough to be a blocker then fix it, but if it's bad enough to be a blocker then you've noticed it already).
I spent last week freeing up 200GB from our 600GB db with just reindex and pg_repack. The worst offender was a table with 17GB of data that had 142GB of indexes; a reindex took it down to 21GB. The table's indexing is crazy, with multiple indexes over different sets of columns.
A contributing factor to the huge indexes, I think, was the distribution of the data. The table has had inserts, updates and deletes continuously since 2015. Data is more likely to be deleted the older it gets, so there's more data from recent years, but about 0.1% of the data is still from 2015. I think maybe this skewed distribution with a very long tail meant vacuum had a harder time dealing with the index bloat.
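For reference, a rough sketch of the kind of commands involved; the index and table names here are placeholders, REINDEX ... CONCURRENTLY needs PostgreSQL 12+, and pg_repack is a separate extension with its own CLI:

    -- Rebuild a bloated index without blocking writes (PostgreSQL 12+).
    REINDEX INDEX CONCURRENTLY my_big_table_some_idx;

    -- See how much space the table vs. its indexes actually take.
    SELECT pg_size_pretty(pg_table_size('my_big_table'))   AS table_size,
           pg_size_pretty(pg_indexes_size('my_big_table')) AS index_size;

    -- pg_repack itself is run from the shell against the whole table, e.g.:
    --   pg_repack --table=my_big_table mydb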
Wow, they are not kidding; this truly is "we saved 20GB with one weird trick". We get a lot of requests from users wanting to use these unusual PostgreSQL index forms that are largely unheard of in old-school Oracle / SQL Server shops. I didn't realize they were indexing NULL values either.
The fact that this is news to you makes me feel better about not considering it myself. I suspect we have a few large indexes that could be given the same treatment.
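If it helps anyone hunting for candidates, here's a rough sketch of how you might spot them; the schema name is an assumption, and null_frac comes from the planner statistics, so run ANALYZE first:

    -- Biggest indexes in the database, largest first.
    SELECT c.relname                               AS index_name,
           pg_size_pretty(pg_relation_size(c.oid)) AS index_size
    FROM pg_class c
    JOIN pg_index i ON i.indexrelid = c.oid
    WHERE c.relkind = 'i'
    ORDER BY pg_relation_size(c.oid) DESC
    LIMIT 20;

    -- Columns with the highest fraction of NULLs, per planner stats.
    SELECT tablename, attname, null_frac
    FROM pg_stats
    WHERE schemaname = 'public'
    ORDER BY null_frac DESC
    LIMIT 20;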
On the first large-scale project I worked on, my team couldn't figure out why operations had slowed down as the data set grew.
Indexes have O(log n) insertion time per record. If you had 1,000 records in your test database, then as you approach 65k records your insertion time goes up by 60% (2^10 vs 2^16 records). Success slows everything down, and there are only so many server upgrades available.
Add a couple of new indexes for obscure features the business was asking for, and now you're up to double.
Double a very small number. If that's seriously an issue, use a faster disk. So many people are trying to run a DB on EBS with fewer IOPS than my PC from 2015.
I manage plenty of DBs with hundreds of millions of records and 40+ indexes per table/collection...
Insertion should still be extremely fast for such a small index, right?
A binary search within a B-tree page takes <100 cycles, and a full B-tree traversal over 100M records should still be measured in microseconds, if not hundreds of nanoseconds.
‘It depends’ - don’t forget about locking overhead, needing to flush multiple pages/leaves as thresholds get hit, disk I/O wait times, etc.
At the beginning it’s rarely noticeable, but it does exist, and under heavy load or with a lot of index writes it can cause very noticeable overhead.
Yes, Postgres keeps track of the most common values. If it knows 0 is common and "where val = 0" will keep 90% of the rows, it might choose to use a table scan instead of the index.
An index is most efficient when it represents a unique set of values. It's still very useful to have an index for grouping, but if there are only a few distinct groups relative to the number of rows, you will end up wasting a lot of space and cycles searching through the index.
If zero is by far the most common value, it could be slower to use the index when searching for zeroes. How so? Reading values that aren't included in the index requires following a reference back to the table row. If 90% of the rows in your query are zeroes, you'd be better off not using that column in your query planning. A naive filter, possibly even after a table scan, is likely to be faster than using the index.
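A quick sketch of what that looks like in practice; the table is made up, and the exact plans you get will depend on your data and settings:

    -- Made-up table where the value 0 dominates (~90% of rows).
    CREATE TABLE events (id bigserial PRIMARY KEY, val int);
    INSERT INTO events (val)
    SELECT CASE WHEN random() < 0.9 THEN 0 ELSE (random() * 1000)::int + 1 END
    FROM generate_series(1, 1000000);
    CREATE INDEX events_val_idx ON events (val);
    ANALYZE events;

    -- The planner sees that 0 matches ~90% of the rows and will
    -- typically pick a sequential scan here...
    EXPLAIN SELECT * FROM events WHERE val = 0;

    -- ...but an index (or bitmap) scan for a rare value.
    EXPLAIN SELECT * FROM events WHERE val = 123;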
Where can you learn these DBA skills for Postgres? If it were SQL Server I'd say do an MS cert prep course and get the certification, but I don't imagine that exists for pg.