I didn't vote it down but there are huge tradeoffs here: 1. Row bloat. 2. Bad plan...

hinkley · on Dec 19, 2023

1. Most of the problem with row 'bloat' is the complexity of indexes: row deletion increases the number of required indexes, and lacking those indexes is a footgun.

2. Got a citation for that? Pretty sure #3 makes that not true, but I'm open to being educated.

3. Yes, and? You're going to have those indexes anyway. That's the point.

4. You don't have to have all rows in a single table just because you took delete out of any user-facing query logic.

5. I think you need some more examples.
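As a rough illustration of point 4, here is a minimal sketch of taking delete out of the user-facing path by moving the row into a separate table. It uses Python's built-in `sqlite3`; the `docs` and `docs_deleted` tables and the `soft_delete` helper are hypothetical names made up for this example, not anything from the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE docs (id INTEGER PRIMARY KEY, owner_id INTEGER, body TEXT);
    -- Hypothetical archive table: same columns plus a deletion timestamp.
    CREATE TABLE docs_deleted (
        id INTEGER PRIMARY KEY, owner_id INTEGER, body TEXT, deleted_at TEXT
    );
""")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?, ?)",
    [(1, 10, "a"), (2, 10, "b"), (3, 11, "c")],
)


def soft_delete(conn, doc_id):
    """Move a row to the archive table instead of destroying it."""
    with conn:  # one transaction: copy first, then remove from the live table
        conn.execute(
            "INSERT INTO docs_deleted "
            "SELECT id, owner_id, body, datetime('now') FROM docs WHERE id = ?",
            (doc_id,),
        )
        conn.execute("DELETE FROM docs WHERE id = ?", (doc_id,))


soft_delete(conn, 2)
live_ids = [r[0] for r in conn.execute("SELECT id FROM docs ORDER BY id")]
archived_ids = [r[0] for r in conn.execute("SELECT id FROM docs_deleted")]
```

With this layout the live table stays small and ordinary queries need no `deleted_at` filter, at the price of a second schema to keep in sync.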

hobs · on Dec 19, 2023

And the number of rows in the table, effectively bloating it for all time.

Deleting rows should generally be via the same type of query patterns you use to find them, or that's pretty weird.

I think #4 is something that most people don't implement. IME they just tombstone in their own table, and then all of the other problems are pretty rampant.

Having separate indexes for deletes might be a thing, but taking a history of every change to a thing in a relational way is still a big pain because of the schema cost. Merging changes in an audit table when it's really a slowly changing dimension is really weird to most query patterns.

To #2 - it depends on your query engine, but if every query does have filtered indexes there's still a cost to having two tables in one table (ignoring #4)
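For anyone unfamiliar with filtered (partial) indexes, a small sketch in SQLite via Python's `sqlite3`. The `docs` table, its columns, and the index name `idx_live_owner` are hypothetical, and other engines plan differently, so treat this as one engine's behavior rather than a general rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE docs (
        id INTEGER PRIMARY KEY, owner_id INTEGER, deleted_at TEXT
    );
    -- Partial index: only live (non-tombstoned) rows are indexed.
    CREATE INDEX idx_live_owner ON docs(owner_id) WHERE deleted_at IS NULL;
""")
# 100 rows, 5 owners; even ids are tombstoned, odd ids are live.
rows = [(i, i % 5, None if i % 2 else "2023-12-19") for i in range(1, 101)]
conn.executemany("INSERT INTO docs VALUES (?, ?, ?)", rows)

# The query repeats the index's WHERE term so the planner can use the index.
query = "SELECT id FROM docs WHERE owner_id = ? AND deleted_at IS NULL"
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (3,)).fetchall()
plan_text = " ".join(row[3] for row in plan)  # 4th column is the plan detail

live_for_owner = len(conn.execute(query, (3,)).fetchall())
print(plan_text)
```

The index itself stays small because it skips tombstoned rows, but the dead rows still sit in the table, so scans, backups, and any query that forgets the `deleted_at IS NULL` term still pay for them.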

hinkley · on Dec 19, 2023

> And the number of rows in the table, effectively bloating it for all time.

Particularly in the case of growth-oriented companies, if your user base is growing exponentially, then half of your records are less than a year old anyway. This bloat is not the problem you make it out to be. And as I said elsewhere, a row that is tombstoned can be deleted offline at some cadence that doesn’t break your other workflows, for instance when nearly all of its siblings are also dead.
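A minimal sketch of that kind of offline purge, assuming "siblings" means rows sharing a `parent_id`. The `items` table, the `purge_dead_groups` helper, and the 0.9 threshold are all hypothetical choices for illustration; a real job would run on a schedule and probably in batches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, parent_id INTEGER, deleted_at TEXT)"
)
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        # Parent 1: every sibling is tombstoned.
        (1, 1, "2023-01-01"), (2, 1, "2023-01-02"),
        (3, 1, "2023-01-03"), (4, 1, "2023-01-04"),
        # Parent 2: one tombstone among live siblings.
        (5, 2, "2023-06-01"), (6, 2, None), (7, 2, None), (8, 2, None),
    ],
)


def purge_dead_groups(conn, min_dead_fraction=0.9):
    """Hard-delete tombstoned rows whose sibling group is mostly dead."""
    with conn:  # single transaction for the whole purge
        cur = conn.execute(
            """
            DELETE FROM items
            WHERE deleted_at IS NOT NULL
              AND parent_id IN (
                  SELECT parent_id FROM items
                  GROUP BY parent_id
                  -- (deleted_at IS NOT NULL) is 0 or 1, so AVG is the dead fraction
                  HAVING AVG(deleted_at IS NOT NULL) >= ?
              )
            """,
            (min_dead_fraction,),
        )
    return cur.rowcount


purged = purge_dead_groups(conn)
remaining = [r[0] for r in conn.execute("SELECT id FROM items ORDER BY id")]
```

Here the fully dead group under parent 1 is removed, while the lone tombstone under parent 2 is left alone until more of its siblings die.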

hobs · on Dec 19, 2023

Users per second and rows per second are not really that strongly correlated - plenty of services create an excessive amount of noise for a fairly low amount of "real rows".

You can push on devs to fix things like pointless logical updates, but it's really easy to have a rogue process create copies of your entire dataset each time it runs.

arthurcolle · on Dec 19, 2023

I agree with all of your comments. This is why I asked the initial question, because literally all of his initial comments seem perfectly valid. I appreciate the discourse, and I work on this stuff every day, so I was just confused.