Distributed SQLite: Paradigm shift or hype? (kerkour.com)
252 points by mooreds on April 9, 2024 | 157 comments


LiteFS author here. I don't disagree with any points in the article but perhaps a reframing could help. I previously wrote a tool called Litestream that would do disaster recovery for a single-node SQLite server and I still think it's a great default option for people starting new projects. Unless you're doing very database-specific things, most SQL will carry over between SQLite and Postgres and MySQL, especially if you add ORMs in the mix. Pick the one that gets you writing code the fastest and you can switch down the road if you need it.

Rather than a paradigm shift or hype, I see distributed SQLite as an extension of a path that devs can go down. With Litestream, the most common complaint I got was that devs were worried that they couldn't horizontally scale with SQLite and they'd be stuck. While you probably won't hit vertical scaling limits of SQLite on most projects, it still caused concern. So LiteFS became a "next step" that a dev could take if they ever got to that point. It doesn't need to be your starting point.

As for the "hacky" solution of txid, I'm not sure why that's hacky. Your application isn't required to use it or the optional built-in proxy, but it's available if it fits your application's needs. It also works for plugging legacy applications into distributed SQLite without retrofitting the code. The proposed solution of caching seems orthogonal to the discussion of distributed application data. I don't think any database provider would suggest avoiding caching when it's appropriate, but there are plenty of downsides to caching. Hell, cache invalidation is one of the two hardest problems in computer science.


>most SQL will carry over between SQLite and Postgres and MySQL, especially if you add ORMs in the mix

I think this goes underappreciated, or rather the opposite is overstated.

Sure there are some edge cases that don't work the same, but most apps won't hit those.

My _biggest_ gripe with SQLite so far is the lack of column reordering like other DBs. And my simplistic understanding is that the others do it exactly the same way as you'd do it manually with SQLite - the table gets _replaced_ with an identical table with the columns correctly ordered and the data is shoved into the new table.
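For reference, the manual replacement dance can be sketched like this with Python's built-in sqlite3 (table and column names invented, and skipping the extra steps a real migration needs for indexes, triggers, views and foreign keys):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- original table with an awkward column order
    CREATE TABLE item (description TEXT, id INTEGER, title TEXT);
    INSERT INTO item VALUES ('first', 1, 'one');

    -- the manual reorder: build a replacement, copy, drop, rename
    BEGIN;
    CREATE TABLE item_new (id INTEGER, title TEXT, description TEXT);
    INSERT INTO item_new (id, title, description)
        SELECT id, title, description FROM item;
    DROP TABLE item;
    ALTER TABLE item_new RENAME TO item;
    COMMIT;
""")
cols = [row[1] for row in conn.execute("PRAGMA table_info(item)")]
print(cols)  # ['id', 'title', 'description']
```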


If you want a more convenient way to do column reordering (and other advanced alter table operations) in SQLite my sqlite-utils CLI tool can do this:

    sqlite-utils transform data.db mytable \
      -o id -o title -o description
That will change the order of the columns in the specified table such that id, title and description come first.

The same command can handle many other operations such as changing column types, renaming columns or assigning a new primary key.

https://sqlite-utils.datasette.io/en/stable/cli.html#transfo...


> Sure there are some edge cases that don't work the same, but most apps won't hit those.

That really depends on your modelling style. If you like things like types, SQL-side processing (eg using functions), or covering indexes, then you’ll hit issues every five minutes in sqlite.

SQLite really wants the logic (including consistency logic) in the application, just compare the list of aggregate functions in postgres versus sqlite, or consider that you have to enable FKs on a per-connection basis.

Which I guess is why ORMs help a lot: they are generally based on application-side logic and LCD database.
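The per-connection FK point is easy to demonstrate; a small Python sketch (table names invented). Note the commit before the PRAGMA, because `foreign_keys` is a no-op while a transaction is open:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY);
    CREATE TABLE child (pid INTEGER REFERENCES parent(id));
""")

# Default: foreign keys are NOT enforced, so this orphan row is accepted.
conn.execute("INSERT INTO child VALUES (999)")
conn.execute("DELETE FROM child")
conn.commit()  # PRAGMA foreign_keys silently no-ops inside a transaction

conn.execute("PRAGMA foreign_keys = ON")  # must be repeated on every connection
fk_enforced = False
try:
    conn.execute("INSERT INTO child VALUES (999)")
except sqlite3.IntegrityError:
    fk_enforced = True
print(fk_enforced)  # True
```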


I'm pretty sure SQLite has covering indexes. And the relatively new strict mode should enforce at least basic types (though if you want to enforce your own rules for things like dates you're still on your own).
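A quick sketch of what STRICT tables (SQLite 3.37+) do and don't catch, with invented table names; on an ordinary table the declared type is only an affinity hint:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Ordinary table: a TEXT value in an INTEGER column is silently accepted.
conn.execute("CREATE TABLE loose (n INTEGER)")
conn.execute("INSERT INTO loose VALUES ('not a number')")

# STRICT table: the declared type is actually enforced (needs SQLite 3.37+).
strict_rejected = False
if sqlite3.sqlite_version_info >= (3, 37, 0):
    conn.execute("CREATE TABLE tight (n INTEGER) STRICT")
    try:
        conn.execute("INSERT INTO tight VALUES ('not a number')")
    except sqlite3.Error:
        strict_rejected = True
print(strict_rejected)
```

As the comment says, richer rules (valid dates, ranges, custom domains) are still on you, e.g. via CHECK constraints.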


> I'm pretty sure SQLite has covering indexes.

I checked to be sure I had not missed it, and didn’t find anything. You have expressions and conditions, but no covering. Obviously you can kinda emulate it by adding the columns you want to cover to the key, but…

> though if you want to enforce your own rules for things like dates you're still on your own

That’s what I was talking about, having richer types, and the ability to create more (especially domains).

Strict tables provide the table stakes of actually enforcing the all-of-5-types SQLite has built in. AFAIK a database-wide strict mode is something that's still being discussed, if it ever becomes reality.


SQLite has what they call a "covering index", see point 9 here: https://www.sqlite.org/optoverview.html

My impression is that this mechanism is less general than what one finds in full-fat client-server SQL databases.


Ya, if my reading is correct this is the poor man's covering index: if all the requested data is in the index key, the query will not hit the table, so you can add additional fields at the end of the key to get index-only scans (at a storage cost, and some flexibility cost, e.g. it doesn't work with unique indexes).

I guess it's less of an issue in sqlite than in databases with richer datatypes in the sense that all datatypes are ordered and thus indexable.
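You can see this mechanism kick in via EXPLAIN QUERY PLAN; a small sketch with invented names, where the extra column is simply appended to the key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (a INTEGER, b TEXT, c TEXT);
    CREATE INDEX t_a ON t (a);       -- plain index: lookup, then table hit for b
    CREATE INDEX t_a_b ON t (a, b);  -- b appended to the key purely to "cover" it
""")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT b FROM t WHERE a = 1"
).fetchall()
# the detail column should report a COVERING INDEX, i.e. the table itself
# is never touched for this query
print(plan[0][3])
```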


Can you elaborate? I was living under the impression that what SQLite has is exactly what a covering index is...


With a "proper" covering index (an INCLUDE clause in SQL Server or Postgres for example) you add data to the index value. This means it can be retrieved just by looking into the index but

- it's not constrained (e.g. to be orderable)

- it does not affect the behaviour of the index, so you can have covering data in a UNIQUE index, or in a PK constraint (although for the latter one might argue a clustered index is superior)

- it only takes space in leaf nodes, not interior nodes, so you get better occupancy of interior node pages, fewer pages to traverse during lookup, and better cache residency

- and finally the intent is clearer: when you put everything in the key it does not tell the reader what's what and why it's there, and thus makes it harder to evaluate changes


In PostgreSQL a covering index can be configured which includes extra information from columns that aren't part of the searchable index itself. It's documented quite well here: https://www.postgresql.org/docs/current/indexes-index-only-s...

    CREATE INDEX tab_x_y
    ON tab(x) INCLUDE (y);


"Obviously you can kinda emulate it by adding the columns you want to cover to the key"

Right, that seems like a good solution to me.


It’s a workaround, but it bloats the interior pages of the index with the covering data, which increases the size of the index and makes lookup less efficient (as they have to traverse more interior pages, and since there are more pages those are less likely to remain in cache).


I use SQLite in my personal projects, not professionally. I was wondering if you could elaborate on what you mean by 'consistency logic' in the context of SQLite.


Why would you want to reorder columns? SQLite reads in a whole record at a time to access any column.


Because, as in structs, padding slack can lead to a surprising amount of overhead.


That's not the case:

"SQLite does not pad or align columns within a row. Everything is tightly packed together using minimal space."

https://sqlite.org/forum/info/06ad7f81fea46401


One reason to reorder columns with SQLite is that if a column is usually null or has the default value, SQLite will not store the column at all if it is at the end of the row. It only saves a couple of bytes per column, but it is a reason to get these columns at the end.


AFAIK position has nothing to do with nulls, a null is a 0 byte in the header and has no payload in the row: https://www.sqlite.org/fileformat.html#record_format


Continue reading that section:

"Missing values at the end of the record are filled in using the default value for the corresponding columns defined in the table schema."

If you have a table with 5 columns and you only insert the first 3 columns (based on create table column order) because the last 2 values are null or default, SQLite will only insert 3 type bytes in the header. However, if the first column (in create table order) is the one you omit, SQLite has to include its type byte, even if the value is null.


I reorder columns all the time for neater readability of "select *" queries.


Not true.


column reordering is simple to fix with this migration script https://david.rothlis.net/declarative-schema-migration-for-s...

if you are using Zig (and like to live on the bleeding edge), you can also just use my library which includes similar script and also a simple query builder https://github.com/cztomsik/fridge?tab=readme-ov-file#migrat...


I think the bigger issue for many is that tooling, infra(provider), in-house knowledge/skill/experience as well as optimizations may differ quite a lot.

Of course, this will differ a lot between projects.


SQLite's handling of dates is pretty kludgy.

It stores them as strings, so to do something like extract just the year from a date, you have to do CAST(substr(game_date, 0, 5) AS INTEGER).

Hackish and error prone.


It is a fairly low-level abstraction, but one that does not require a verbose API. There is nothing error prone or hackish about what you have written; it will work for all inputs, it is just low level. You are just used to having other people write this code for you and give you a library. With newer versions of SQLite you could also write

    CAST(strftime('%Y', game_date) AS INTEGER)

Which is somewhat higher level and less easily mistyped
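If you want to check that the two forms agree, a quick sanity check in Python (using a literal date instead of a real game_date column):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
year_cast, year_strftime = conn.execute("""
    SELECT CAST(substr('2024-04-09', 0, 5) AS INTEGER),
           CAST(strftime('%Y', '2024-04-09') AS INTEGER)
""").fetchone()
print(year_cast, year_strftime)  # 2024 2024
```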


That's much better, thanks. In case I ever need to do years < 1000 or > 10000 :-)

Still, having that all over a query looks ugly. SQL can be unreadable enough as it is without all the joins/table renaming.

I just want something more readable like EXTRACT(year from date), like you can in Postgres et al.

Would also be nice if there was a native timestamp like there is in, pretty much every other database.

I'm sensitive to "feature creep" but this doesn't seem like too big of an ask.


I agree it's less obviously correct, but I bet you could add the extension to sqlite if you feel strongly about it. As an aside, '%Y' is documented to only work in sqlite for years >= 0000 and <= 9999, so it would behave exactly the same as the code you wrote. Especially because you already didn't have to worry about years less than 1000: the ISO8601 format used for serializing dates in sqlite normalizes them with leading zeros.

for instance `select date(-50000000000, "unixepoch");` returns `0385-07-25`

Interestingly %Y doesn't seem to handle negative dates either if you need to handle BC, so I guess that is one downside for both. This is one reason I sometimes prefer to use low level code even when it is less obviously correct with a cursory glance, because abstractions may not mean what you think they mean, or even worse, may be lying to you. At least with low level code I can reason about how it would behave under certain edge cases I might care about.


The paradigm shift that's going to come with distributed SQLite isn't to the edge, it's going to be to users' devices. I believe the DX of building local-first apps is going to hit the criticality point in the next year or so and its popularity is going to explode.

With dynamic partial replication you can synchronise a subset of your database to a SQLite db on your user's device, eliminating the network from the UI interaction loop. Then with the emerging eventually consistent syncing systems, many using CRDTs, it's possible to have conflict-free eventual consistency. Just read and write to a local database and it will sync with your central server or other clients in the background, both in realtime or after working offline.

I work on one such system at ElectricSQL, but there are many people building variants of this such as Evolu, SQLsync, CR-SQLite, and PowerSync.

That's all not to say SQLite on the edge isn't really damn cool!


You've been able to build these apps for years now, hell PouchDB was released in 2012. I don't think anything significant is going to push for more of these apps--the DX isn't really that great compared to remote write--there are more abstractions, less caching, way more corner cases, and I call BS on your conflict free utopia.


I'm using Pouch/Couch on a web app, and the sync is really good DX - I got all the functionality I wanted, including easy undo, in a day or so. So yeah, what more DX would actually make people adopt local-first?

I don't think it's a dev problem, but a business one. No consumer is demanding local-first, and no company wants to give up that profitable data. We're doing it because we're purposely not interested in users' data and resiliency is part of our value—but we're still holding some of it for consumer convenience.


Local First (afaiu it) doesn’t have to focus on privacy, though it definitely enables it as a product use case. Outside the privacy focus, it doesn’t remove access to the profitable data (only delays it) while enabling user access patterns that aren’t reliant on 24/7 availability of remote data stores. I think there’s a clear business/product benefit here in that consumer app usage and performance aren’t tied to the users current network quality, and service availability requirements can be less stringent which directly translates to $$ (which can be part of everyone’s value). It probably introduces unique tradeoffs though, and I expect you have a good idea of what those are given you use the pattern. I imagine a data heavy local app syncing with a remote backend unpredictably is its own headache.

WRT Pouch/Couch, and this take is probably annoying to a user of it like you, but it’s an ecosystem that you need to have a reason to get into - IOW it ain’t your grandmas SQL :)


You're right, as I understand it myself, it's more about overcoming network burps, a major point of user friction. I guess local-first is not the exact term for what I'm thinking, which is that an app should not need the Internet to work, but it can add conveniences, foremost being able to get to your setup from another device. That's how some of my most used tools like to-do list and notes work. In this scenario, the server could be dumbly storing the data without understanding it, and the user could point to another data store.

I'm generally a Postgres user not because I love SQL but because I hate drama. That said, my Pouch/Couch use case is fairly simple and probably well aligned with the intent so far and so the paradigm clicks for me.


I think you’re on the right track, though local-first doesn’t necessarily mean you give up on centralized data. That’s where the sync ultimately ends up, right?


All of this makes me think we'll just be building meteor apps but with sqlite instead of mongo and minimongo whaha


Do you think SQLite performance on user devices changes anything for UX/DX? i.e. if SQLite were 10x faster than today on user devices in the future.


What do you think about non-SQL CRDT frameworks like Automerge and Y.js and how they fit into this local-first future?


I think this skips one mega benefit for apps.

I’ve been using liteFS in production for a couple months.

Your web app is able to resolve db queries instantly.

You don’t need loading states if you’re using complex charts and other frontend JS that waits for data.

All the data is resolved so fast and you can just return all your data like more traditional apps, and the load times are insane.

If you’re multi-region you can deploy one app instance there, instead of 2-3 services: a Postgres read replica, maybe something like Redis.

It really replaces both of those; assuming you have a read-heavy app, it works great.


I think you always need loading states to account for slow network, or am I missing something?


>to account for slow network

Or a slow anything else, e.g., SQLite queries. This thread has focused on the network aspects, and they do stand out since there can be such a large gap between a network call vs a local SSD read. But we're still talking about a database, which could be huge (presumably it's the main, single DB for the whole app). And there is still all of the actual SQLite work that needs to happen to execute a query, plus opportunities for really bad performance b/c of bad queries, lack of indexes, all the usual suspects. Not to mention the load on the owning process itself. So, I'm agreeing that a loading state is needed.


I think this strongly depends on the application and use-case. I've worked at two businesses so far, and they target small and medium-sized companies in a tenant-style manner.

All data for one customer that is important enough to load easily fits within less than 5MB. That is of course not counting logs and such, but it's all "important" user-specific data. It's not -that- dissimilar from a small to medium-sized redux store in complexity. Lots of toggles, forms, raw text and some relations.

Of course this architecture doesn't scale to the enterprise level, or to other certain heavily data-driven applications (like imagine running the entirety of your sentry database in-browser?), but that's what architecture is -for-! Pick one that synergizes well with your use-case!


The SQLite database is located on the application server, so there is no network between the DB and the app.


Assuming you're serving a frontend that makes network calls to a backend, you'll need to handle loading states in the frontend regardless of how the backend retrieves its data.


The idea is that you're not doing that.


Unless your database is in the browser, you are always going to be at mercy of network latencies talking to the backend.


You're just saying, even if all you were doing was fetching a static JSON blob from the memory of the frontend server, you'd still want load states, right? (That makes sense, I'm just checking my understanding.)


Yup, exactly. Phones change wifi networks, routers drop packets, load balancers get overloaded. Hard to fully eliminate tail latencies.


The key here is to make a single API call to the backend which then runs 100+ SQL queries at once and combines the results into a single JSON response - that way you're only paying the network cost once.

See https://www.sqlite.org/np1queryprob.html

I've implemented GraphQL on top of SQLite and found it to be an amazingly good match, because the biggest weakness of GraphQL is that it makes it easy to accidentally trigger 100s of queries in one request and with SQLite that really doesn't matter.
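A rough illustration of why per-query overhead stops mattering in-process: hundreds of classic N+1-shaped point queries against a local SQLite file complete in a handful of milliseconds (toy schema, untuned timing):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO post VALUES (?, ?)",
                 [(i, f"post {i}") for i in range(1000)])

# 1000 individual point queries, one per row: the classic N+1 shape.
start = time.perf_counter()
titles = [conn.execute("SELECT title FROM post WHERE id = ?", (i,)).fetchone()[0]
          for i in range(1000)]
elapsed = time.perf_counter() - start
print(len(titles), f"{elapsed * 1000:.1f}ms")  # typically a few ms in total
```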


FYI: We're offering a SQLite hosting service which automatically creates a full-fledged GraphQL endpoint for your database: https://www.airsequel.com


That doesn't eliminate the need for loading states at all, and it's already solved by things like React Server Components or Remix (see the waterfall removal graphic: https://remix.run/)

i think this discussion is confusing the use of sqlite in local-first apps, where there's no loading states because the database is in the browser. you can use sqlite on your server, but you still need a "loading state".

even with postgres, if your data and server are colocated, the time between the 2 is already almost 0

now maybe the argument is your servers are deployed globally each with an sqlite db. that's not all that different from global postgres read replicas


If the data and the server are colocated on the same machine, the database network overhead is almost zero. But that's not necessarily true otherwise. For an HTTP request that incurs a single query, you can round it to zero, but you have to be careful not to write request handlers that incur lots of queries if you're trying to stay below the 100ms threshold.


> I've implemented GraphQL on top of SQLite and found it to be an amazingly good match

Could you give a pointer to the repository, or is this part of Datasette?


Nevermind, found it. Just don't know the license.

https://github.com/simonw/datasette-graphql


It's Apache 2 (there is a badge in the README) but I realize now I forgot to drop a copy of the license into the repo - fixing that now.


You can with sqlite in webassembly


In theory, but there are still a lot of sharp edges. e.g. I wanted to use sqlite in-browser with an ORM, but few ORMs support such a setup.


And your users are sitting in front of the application server?


I “use the platform” so clicking a new tab shows the browser loading state. If you’re on a slow network, that’s the way I go about it.


You can run postgres on the same host as the web server too. Isn't that going to get you most of that same benefit in speed?


It’s better than running it on a separate server but if you’re using Postgres you’re always hitting the network stack on the machine, with SQLite that isn’t the case.


Postgres is great, but managing a fleet of them (one in each web server) and ensuring they are all working fine would bring a lot of operational complexity.

SQLite, on the other hand, skips all of that with its simple in-process model.


I just start postgres and it's solid. I don't understand this statement. With Dokku it's trivial.


the cool thing with sqlite is that you can compile your whole app into a single binary, so you don't need docker for example (or it's trivial to dockerize it afterwards if you really insist)


It is. I do this on plenty of hobby Laravel apps.


Me too, but I guess there could be even higher gains with SQLite when it's in-process...


How is it for writes? Would a CRM type system benefit from liteFS setup?


In Fly's implementation:

> LiteFS’ use of FUSE limits the write throughput to about 100 transactions per second so write-heavy applications may not be a good fit.

https://fly.io/docs/litefs/faq/#what-are-the-tradeoffs-of-us...


Seems that 100 writes per second is more than enough for a typical CRM (for a local business).


The article says "LiteFS supports roughly 100 writes per second".


We have so many distributed X applications nowadays that all try to solve the same problem, either in the same or different ways. I think we first have to come up with a simple, distributed, open-source storage solution. In the cloud, we have things like AWS S3, which is a very reliable distributed storage, but for self-hosting, we have:

Ceph, with which I have much experience, is a very solid and quite bulletproof storage solution that offers S3 protocol and FS. However, maintaining it in the long run is really challenging. You better become a Ceph expert.

SeaweedFS struggles with managing large data groups. It's inspired by an outdated Facebook study (Haystack) and is intended for storing and sharing large images. However, I think it's only average—it has poor documentation, underwhelming performance, and a confusing set of components to install. Its design allows each server process to use one big file for storage, bypassing slow file metadata operations. It offers various access points through gateways.

MinIO has evolved a lot recently, making it hard to evaluate. MinIO relies on many small databases. Currently, it's phasing out some features, like the gateway, and mainly consists of two parts: a command line interface (CLI) and a server. While MinIO's setup is complex, SeaweedFS's setup is much simpler. MinIO also seems to be moving from an open-source model towards a more commercial one, but I have not closely followed this transition.

All of these solutions are not simple enough to be the base for a distributed database application. What we really need is something like an Ext4 successor, let's call it Ext5, with native distributed storage capabilities in the most dead-simple way. ZFS is another good candidate: it has already solved the problem of distributing storage across multiple hard drives within one server very well, but it still lacks a good solution for distributing storage across different hard drives on different servers connected via a network.

Yes, I know there is the CAP theorem, so it is really a hard challenge to solve, but I think we can do better in terms of self-hosted solutions.


> While MinIO's setup is complex, SeaweedFS's setup is much simpler.

Are you sure you are not talking in reverse?

I find Minio single binary deployment very easy, and you also complained about SeaweedFS's complexity in the previous paragraph.


There is also Garage: https://garagehq.deuxfleurs.fr/


> In the cloud, we have things like AWS S3, which is a very reliable distributed storage

Yes, but S3 is basically a standardized protocol at this point. There are many both open and commercial alternatives, like Cloudflare R2 (no egress). So depending on the reason for self-hosting (such as preventing lock-in), S3 might be the least important thing to actually move away from. It’s way more difficult to migrate away from eg a proprietary db, sometimes by design.


Tigris Data [1] seems like a promising open-source solution [2] in this space.

[1] https://www.tigrisdata.com/ [2] https://github.com/tigrisdata-archive/tigris


This is kind of missing the point. I can only speak for the company that sponsors LiteFS (I don't think any edge computing companies put 'otoolep up to doing rqlite, for instance). We love Postgres. Most of our users use Postgres. And Postgres works fine in edge/backhaul configurations; companies were doing geographically distributed Postgres read replicas long, long before we came around. That's where we got the idea.

The idea of LiteFS is:

* Most apps are read-heavy, and some of them are overwhelmingly read-heavy.

* If you're read-heavy and can thus get away with it, there's a pretty significant performance win in replacing the networked n-tier architecture with SQLite, because Postgres round trips add up over the lifecycle of any given request.

* In fact, that's so much the case that --- as Richard Hipp has been pointing out for over a decade --- you can blow off the N+1 problem and just write natural queries.

I think Ben Johnson would be the first to tell you that LiteFS isn't a perfect fit for every application. We have systems that use both Postgres and LiteFS.

Meanwhile, you have to dig to find it on our web page! It's an open source project that works everywhere Linux does. Dial back the cynicism a bit! :)

A plug here for:

https://kerkour.com/sqlite-for-servers

Same author, and one of the best SQLite articles ever.


Man! I can’t agree more. It does seem foreign right now, but I truly do believe edge computing will become another “CDN” type abstraction, but it will take time for folks to catch up.

Can’t wait for the day that projects like litefs are just a default that nobody knows about, lol.

Go from “technology nobody knows about” to “technology nobody knows about, but runs the world.”


> What a nightmare! Now your application code spills into your database and our initial goal of simplifying application development is nothing but a long-forgotten dream. All of that for what? To save a few milliseconds to display a web page.

I think "a few milliseconds" vastly understates this: if you want to run your application closer to users, even just across the US, each query is (at least) 70ms just to get over the network and back again.

"Application code spills into your database" was a bad thing when you wrote one language (say, Java, or PHP) and another language (PSQL/TSQL/etc) for your "stored procedures", but that's not what most modern databases are advocating for.

Instead, and not unlike something like React Server Components (RSC), you can choose whether to run code close to the user or closer to the DB (for transactions) in the same language as your application, because it's still part of your application code. This is the model that Durable Objects[1], our coordinated storage service, uses.

Disclaimer: I work on D1 & Durable Objects at Cloudflare, so I'm likely to be called biased here, but it's not like we haven't a) thought about this deeply and b) actually use D1 and Durable Objects to build distributed systems at Cloudflare.

[1]: https://blog.cloudflare.com/durable-objects-easy-fast-correc...


I think it's a false dichotomy to frame this as a SQLite vs Postgres situation. SQLite is an amazing, super fast single-writer database without a network service, so it is mostly used locally as a library, while Postgres is a full-fledged RDBMS. Both have use cases where they are the better option, and there are many other software products out there to be used for similar purposes.

Notable SQLite use cases: https://www.sqlite.org/famous.html

Postgres does not have a similar page: https://www.postgresql.org/about/press/faq/


    > While SQLite is a really amazing database, most teams will benefit from avoiding it and going the PostgreSQL way instead.

    > Bazillions of engineering hours have been spent to make Postgres the best backend database and choosing SQLite will inevitably force you to reinvent what Postgres already had for many years, in a fragile and buggy way.
Could the same not also be said for MS SQL Server, Oracle, Sybase, MySQL, or MariaDB? The author offers no supporting evidence for this statement.

Rewrite that: "Bazillions of engineering hours have been spent to make XYZ the best backend database..."


> Rewrite that: "Bazillions of engineering hours have been spent to make XYZ the best backend database..."

What you're saying (and what the author is saying) however is clashing with the reality of so many developers using sqlite and being happy with it.

I'd suggest to rewrite it another way:

> Bazillions of developers think they'll need a full-fledged database for their new project while sqlite will cover most of their needs.


This is a good reply. I want to clarify why I excluded SQLite from that list: it is not a database server (all of the others are); it is an embedded database. I say that with zero disrespect. I am 110% a SQLite fanboi. Honestly, I have no experience with PostgreSQL, but I have heard a lot of good things about it. The community looks amazing.


> I want to clarify why I excluded SQLite in that list: It is not a database server (all of the others are); it is an embedded database.

Agree. The confusion probably comes from the fact that the SQLite programming language interfaces still have the "connection" abstraction.

Also, it would be great if there was a simple way to simply "load this entire database into memory". It's not too difficult to manually copy tables, but it's much slower than it could be. Even a smallish ~250 MB database was taking like 30 seconds to copy row by row.
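FWIW the backup API already does a page-level "load this database into memory", which should be far faster than copying row by row. A sketch in Python (paths and table names invented):

```python
import os
import sqlite3
import tempfile

# Build a small on-disk database to copy from.
path = os.path.join(tempfile.mkdtemp(), "data.db")
disk = sqlite3.connect(path)
disk.execute("CREATE TABLE t (x)")
disk.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])
disk.commit()

# backup() streams whole pages into the target database.
mem = sqlite3.connect(":memory:")
disk.backup(mem)
print(mem.execute("SELECT count(*) FROM t").fetchone()[0])  # 1000
```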


Without giving evidence, I do agree that Postgres is a better database that serves the general case better. I use both in production for different purposes, but Postgres makes a lot more sense than SQLite as our primary database.


Fair comment. First hand experience is important. HN is great for people sharing their technical experiences. I would not say that is "without giving evidence"!


Among the various free backend hosting options currently available, does Cloudflare stand out as the best choice with its offerings such as Workers and D1? I would love to hear users' experiences with these services, or about any informative resources that discuss the costs associated with scaling applications on the infra.


Something I've been thinking about is partitioning my SQLite. Instead of storing all users' data in one mega table, what if I made a SQLite database for each user? Provided users never talk to each other, I think this might work?


> Instead of storing all users' data in one mega table, what if I made a SQLite database for each user? Provided users never talk to each other, I think this might work?

I'm doing this in a project I'm developing for language learning, except that you have both shared databases for content, and individual databases for view logs, preferences, and so on. What I actually do is open a :memory: database, ATTACH all the appropriate databases. Transactions work just fine, but because the shared database is basically read-only, then there's no write contention because each user is just writing to their own database. Overall it makes queries easier, because you don't even need to include the user (or the language). (Of course, the flip side is that getting stats on all the users and languages is more difficult.)

Currently it's just single server, but it should be possible to read-replicate the content, and actually move the write replica of the study database to a local server. It should also make it straightforward to let people download their own information: just hand them the actual SQLite file.

If I ever grow large enough that I need multiple servers in different geos, I'll write up my experience and post it here.


You can attach to databases dynamically in queries and join across them. I probably wouldn't (in an ordinary data model) do per-user, but I would consider it for different functional areas.


There's a limit on how many databases you can attach to the same connection (SQLITE_LIMIT_ATTACHED), it defaults to 10.


Worth noting that this limit can be raised up to 125 (as I'm sure Simon is aware).

I would say that if one needs to query across more SQLite files than that, it's definitely time for a different data policy.


I believe this is what one of the companies mentioned in the article, Turso, can help you do: a per-tenant database.


If you want to query across users, which you probably want for analytics, that is going to be a massive PITA.


You should probably use a specialized DB for analytics (a.k.a. OLAP DB) anyway. As long as you have an automated way of replicating data from SQLite to your OLAP DB, everything should be fine.


You can slice it and dice it any way you want, really. The constraint is often what data needs to be written within a transaction. You'll have to figure out a way to reliably apply a consistent schema to all these database files somehow and keep track of them.


One of the things I appreciate about SQLite is being able to keep all the schema initialization and upgrades in the application itself, which are then checked into git and can be tested like mad with throw-away copies of the data.

Here's a package in golang I wrote to help with that process:

https://pkg.go.dev/gitlab.com/martyros/sqlutil@v0.0.0-202312...
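Not that package's API, but a generic sketch of the same idea in Python, using SQLite's user_version pragma to track which migrations have been applied:

```python
import sqlite3

# Migrations run in order; PRAGMA user_version records how many have
# been applied, so re-running is a no-op. SQL here is illustrative.
MIGRATIONS = [
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)",
    "ALTER TABLE users ADD COLUMN email TEXT",
]

def migrate(conn: sqlite3.Connection) -> None:
    version = conn.execute("PRAGMA user_version").fetchone()[0]
    for i, sql in enumerate(MIGRATIONS[version:], start=version):
        with conn:  # each migration commits (or rolls back) atomically
            conn.execute(sql)
            conn.execute(f"PRAGMA user_version = {i + 1}")

conn = sqlite3.connect(":memory:")
migrate(conn)
```

Because the whole thing lives in the application, it can be exercised against throwaway copies of the data exactly as described above.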


This setup can work great and even support elements that are shared between users, if you also give those elements their own DB. I’m working on a prototype to support this natively in Prisma.


at least for a while this is how bluesky/atproto worked. afaik they only ran into issues when the number of users on each server overwhelmed how many files would fit comfortably in a single directory (which is obviously a large number)

https://news.ycombinator.com/item?id=38171322


that's still how it works, we just shard our users across multiple hosts


SQLite handles many tables just fine; per-tenant is more reasonable than per-user.


What portion of writes typically have to go to the primary server? If users are local, their data is often local as well. In other words, not all writes need to first go to the central server from where they are distributed back to all edge databases.

I can see that the organization running the application needs a global view of all data. But regional users perhaps don't. Often they just need to know their own data.

Write first to edge then copy to central database rather than write first to central database then trickle down to the site from where the write originated in. Just wondering what portion of applications could use this alternative design.


It’s been done and it becomes quite complicated very quickly. Then it becomes necessary to manage that complexity and you end up with something like Spanner.


This could work in very small projects. Still very risky.

Say you flag a write as "local". Later, some other place in your app starts relying on this write in another locality. If you don't update your write spec from "local" to "primary", you don't have a consistent database anymore, but you will make decisions thinking that you do.

Now consider a team of 10. Or 20...

Mayhem can spiral very quickly from there.


Don't get caught up in marriage to a particular tech flavor. Instead, architect and use a generic db adapter pattern that translates a more generic SQL/GraphQL flavor into whatever is necessary and efficient underneath for that particular DBMS or store. Meta does this for many warm storage services. It makes changing underlying DBMSes and scalability architectures much easier to develop and reason about than if every app has to know everything about its backend(s).
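A toy illustration of the adapter idea in Python; the backend names and the commented-out Postgres branch are hypothetical, and the point is just that application code programs against the common DB-API cursor interface while only the factory knows which backend is in play:

```python
import sqlite3

def get_connection(backend: str = "sqlite"):
    """Return a DB-API 2.0 connection for the configured backend."""
    if backend == "sqlite":
        return sqlite3.connect(":memory:")
    # elif backend == "postgres":
    #     import psycopg
    #     return psycopg.connect("dbname=app")  # hypothetical DSN
    raise ValueError(f"unknown backend: {backend}")

# Application code only ever sees the generic cursor interface.
conn = get_connection()
cur = conn.cursor()
cur.execute("SELECT 1")
```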


I don't believe Cloudflare D1 has read replicas yet? They describe how it will work in detail in a blog post [1], but in the future tense.

> We’re actively working on global read replication and realizing the above proposal (share feedback In the #d1 channel on our Developer Discord).

Perhaps it will be out by the time the book is finished.

[1] https://blog.cloudflare.com/building-d1-a-global-database


D1 is still pretty limited in my experience. No read replicas really kills any of the meaningful benefit of the architecture, and being built off of a (mostly) SQLite-compliant API makes me nervous as it isn't really SQLite at all.

Last year's major Cloudflare outage really was the final straw for me with regards to D1. I don't mean that as a knock at Cloudflare at all, the situation sounded horrible and I appreciate how the entire team responded to it. I just worry that internal responses to fundamental infrastructure issues will leave newer projects like D1 on ice for a year or two.


D1 (the ability to query your database at the edge) didn't get knocked offline during that outage though?


We did actually run into D1 issues during the outage, though I don't remember exact details on what was down. Our issues may have been API-related, with reads still functional.

My concern, and this is very much my own speculation with no context of what has happened or is happening internally at Cloudflare, is that the outage exposed some serious issues that will take time to fix safely. The fact that D1 still doesn't support replication is an indication to me that it has been deprioritized, likely along with other newer and less used products, while the infrastructure updates are dealt with.


> The fact that D1 still doesn't support replication is an indication to me that it has been deprioritized, likely with other newer and less used products, while the infrastructure updates are dealt with.

D1 is definitely not deprioritized. We're heads down on replication, and it's important for us to get it right. Takes time!


There’s no way of knowing from the outside, but nothing in the April 1 announcement for D1 suggests deprioritization to me. They just announced that they’re doing read replication, describing how it will work in detail. Why preannounce it if they’re not working on it?

My guess is that they wanted it for 1.0 but the release slipped. It happens.


I wonder if SQLite will ever be modified to allow concurrent writes. Right now it locks the whole DB on write [1]. But since SQLite usage is rapidly expanding from its embedded db roots, one has to wonder if the devs have considered creating some sort of SQLite daemon, or if the underlying architecture would have to be changed so drastically it wouldn't be worth it.

1. https://www.sqlite.org/faq.html#q5


There's an official branch that implements BEGIN CONCURRENT: https://www.sqlite.org/cgi/src/doc/begin-concurrent/doc/begi...


In most SQL implementations, row modification involves table-level locking anyway, so you can put tables in different SQLite db files and achieve the same "concurrent" writes.


WAL2 + BEGIN CONCURRENT

...to pretty much eliminate write contention. I know, it sucks these aren't built in yet.

Also if we got some sort of router + map reduce helper (Vitess, Citus -like) it'd make massively distributed SQLite a lot more viable. Setups that don't hammer a single master would make all the difference.

Postgres' main disadvantage at scale is all the additional machinery required (backups, failover, proxy, no DDL replication with built-in logical replication .... UGH!!), even when you're using Citus. Feels like it forces you into k8s with the amount of orchestration you need to run.


this does not eliminate write contention, it just moves a bunch of the problem into the application as now you have to resolve conflicts yourself


> just moves a bunch of the problem into the application

Would you mind elaborating?



Another relevant example is how MicroK8s uses distributed sqlite (dqlite) under the hood to make their Kubernetes distribution highly available, instead of using etcd. https://www.cncf.io/wp-content/uploads/2020/11/MicroK8s-High...


Most apps never need to scale. Also, worried about scaling? Just use an ORM that allows you to switch from SQLite to Postgres. It’s as simple as that.


Unless you're using database-specific features. One of the biggest advantages of Postgres is how incredible the ecosystem is. It doesn't work for everything, but I have an ORM, multiple kinds of text search (vector, inverted indexes, trigrams), recursive and graph-like queries (though that's admittedly less of an issue if n+1 isn't a problem), row-level ACLs, locks, etc.

It's really nice to have all of that power available in one piece of infrastructure.


Not really a problem if you go from sqlite to postgres. Which sqlite feature is missing from Postgres?


Not all SQLite queries are just going to straight up work in Postgres.

e.g. handling of dates is a big one. In SQLite they are just strings (kludgy IMO), whereas Postgres has the timestamp data type.
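For example, date handling in SQLite goes through string-based date functions rather than a native type:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# created_at is declared TEXT: SQLite has no timestamp type, so dates
# are stored as strings and compared via date functions.
conn.execute("CREATE TABLE events (name TEXT, created_at TEXT)")
conn.execute("INSERT INTO events VALUES ('launch', '2024-04-09 12:00:00')")

# String-based date arithmetic, SQLite style. In Postgres the column
# would be TIMESTAMP and the comparison would be native.
row = conn.execute(
    "SELECT name FROM events "
    "WHERE datetime(created_at) > datetime('2024-01-01')"
).fetchone()
```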


This is saying: "just don't try to solve hard data storage problems". Not all applications are CRUD.


You're missing the point. Most software doesn't need to scale or solve hard data storage problems, and if it ends up having to, you can always upgrade to Postgres with minimal effort. That makes SQLite an attractive option if you won't immediately benefit from Postgres' rich features.


We're just going to be speaking past each other, I think. GP clearly stated they lean heavily on Postgres technologies (by calling out specifics like pgvector). Stating that the applications that don't need those technologies, indeed don't need them, is tautological!


The problem with Postgres is that out of the box, it is not much different from SQLite in terms of availability and scaling. You need to add a PgBouncer before it can handle a large number of connections, and you need a really complicated dance to make it highly available, with read replicas and hot standbys. This is why I prefer MongoDB for that matter.


If you're willing to accept eventual consistency (a big ask, but acceptable in some scenarios) then there are options like marmot [1] that replicate CDC over NATS.

[1]: https://github.com/maxpert/marmot


The major downside to marmot right now is schema changes don't propagate.

Otherwise, keeping a very close eye on it.


It is a pain, but if your use case allows for scheduled downtime, it's not terrible.

I'm looking at Marmot for high availability. Not necessarily horizontal scaling, but instead having a backup server or two that have a constant up-to-date copy of the live db but also can take over if the main server dies, and the main server can then sync the data back when it comes online.


The only way I know of doing concurrent writes in SQLite is to open a transaction, accumulate a bunch of writes, and then commit them. Otherwise it is dog slow. And it has to be a single process, or you get data corruption. Has this changed somehow?
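That batching pattern looks like this in Python's sqlite3; the win is a single fsync at commit instead of one per statement:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k INTEGER, v TEXT)")

rows = [(i, f"value-{i}") for i in range(10_000)]

# One transaction for the whole batch. Committing after every single
# INSERT forces a sync per row, which is where the slowness comes from.
with conn:  # opens a transaction, commits on success
    conn.executemany("INSERT INTO kv VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
```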


This is due to SQLite calling fsync a lot by default, to be on the safe side. You can use pragma journal_mode = "wal" and pragma synchronous = "normal" and it should be much faster, without risking corruption if your server powers down unexpectedly, though recent commits may be lost (at least in theory; you should still make regular backups, which can be done from within a running SQLite instance using the backup command).
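A sketch of those settings with Python's sqlite3 (file name hypothetical); in WAL mode, synchronous=NORMAL is the relaxation that skips the per-commit fsync while keeping the database safe from corruption:

```python
import sqlite3

conn = sqlite3.connect("app.db")
# WAL appends commits to a log instead of rewriting the main database
# file, and lets readers proceed while a writer is active.
conn.execute("PRAGMA journal_mode = WAL")
# NORMAL skips the fsync on every commit; on power loss the most recent
# commits may be lost, but the database itself stays consistent.
conn.execute("PRAGMA synchronous = NORMAL")
```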


Yes it has been years since the WAL mode fixed this. And WAL2 is even better now.


Some development frameworks for local-first apps support both SQLite and PostgreSQL. The advantage of using PostgreSQL is that when you add a cloud option to your local-first app, the migration becomes much easier.


> What if there was a simpler solution?

Then TFA mentions caching without talking at all about how that is a solution, let alone a simpler solution.


Is there a distributed version of SQLite that keeps its embedded library feature? For example, it could use EBS and S3 for shared storage, allowing for distributed read and write access, and possibly even multiple concurrent reads and writes.

Should this be available, numerous lightweight web applications could operate without having to set up a separate PostgreSQL or MySQL database.


Yes, that's (for instance) what Litefs does.


this page just says 'Forbidden'


not working for me currently either. i’m on safari/ios.


Yeah, says Forbidden in firefox or curl but loads in chromium - welcome to the open web of browser interoperability etc etc where servers now just decide your user agent or tls handshake or whatever is fishy.


also doesn't work on desktop safari for me


thank you!


I updated Chrome and reloaded.


[flagged]


Out of curiosity, did I stumble on an actual feud of sorts, or since they're a prolific commenter did they just not recognize you? Do you have links to the first four?


I have no idea what's going on here, but I got the same error and really did fix it by updating Chrome and restarting.


I, too, am curious. I have been searching comments for a bit but haven't found anything.


Also I think sqlite would not be a good fit if any sort of slow workers / background jobs are required.


Why not?

(I'm planning a background job system based around SQLite at the moment)


Because typically all writes need to happen from a single process, so if you want to run multiple processes writing to the same DB, you need to synchronize them somehow.

If you are running a workers loop together with your http serving loop, running on the same process is awkward: you would need to stop serving your webapp each time you want to deploy new workers. Also you would need to wait until all workers are done before you could redeploy the app. If one of the workers does something unexpected, it could take down your webapp together with any other workers running, etc.

If you used multiple processes you would need to perform sync through some IPC or something like redis to perform writes sequentially, but using a DB that already ships as a daemon would fit the problem better.
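A minimal single-process sketch of that serialization: one dedicated writer thread owns the SQLite connection, and other threads (e.g. HTTP handlers) submit work through a queue:

```python
import queue
import sqlite3
import threading

jobs = queue.Queue()
results = []

def writer_loop():
    # The connection lives entirely in this thread, so all writes are
    # serialized without any cross-process coordination.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jobs (payload TEXT)")
    while True:
        payload = jobs.get()
        if payload is None:  # shutdown sentinel
            break
        with conn:
            conn.execute("INSERT INTO jobs VALUES (?)", (payload,))
        results.append(payload)

t = threading.Thread(target=writer_loop)
t.start()
for p in ("a", "b", "c"):  # any thread may enqueue writes
    jobs.put(p)
jobs.put(None)
t.join()
```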


Thanks, that's useful.

My design should be OK - I'm planning on having the workers retrieve jobs and send back their results via an HTTP API to a single process that wraps the SQLite database (Datasette with a custom plugin).


I'm not super familiar with Datasette but to a certain extent it does seem like you already implemented some sort of sqlite daemon over http? Is it fair to say Datasette is "like postgrest but for sqlite"? When I was thinking of using sqlite for bg jobs that's the kind of thing I wanted to avoid, but maybe you are already there :-)

In case it helps, I've been investigating bg jobs too and saw a bunch of resources that can be helpful. One is a hn post about a job queue on top of pg [0] that has some cool pointers. Someone mentioned the "transactional outbox" pattern [1]. Separately, I found this video about implementing a work queue with Nats JS [2].

I suspect you could implement the outbox pattern in Datasette and provide a way to offload the jobs to any external queue, but NATS/JetStream seems nice since it provides all the building blocks to implement "exactly once" delivery, dead letter queues, heartbeats to ensure the workers complete the work, etc, and it is very easy to run. I think it could save you a lot of the tricky work of implementing all these features with SQL(ite).

The overall design would be something like:

    def trigger():
        """Enqueue a job transactionally: either fully succeeds or fully fails."""
        job_id = transaction {
          job_data = ...
          enqueue job_data into outbox
        }
        # This can fail, but no biggie: you will still need to poll in case of failure,
        # so no jobs will be created without their backing data, and no jobs will be dropped.
        transaction { remove job_id and send to nats }

    def background_poll():
        """Poll the outbox in case we inserted into the outbox but failed to deliver to the queue."""
        try periodically { transaction { remove from outbox and send to nats } }
... then in another process or processes, the workers would talk to nats/js to perform the work.

Elsewhere someone described a job system that only relied on a database without support of events (that is, not pg), and required creating a sessions table to ensure workers complete the jobs, etc [3]. This is the kind of thing that I think could be simplified by using nats/js or another external job queue.

--

0: https://news.ycombinator.com/item?id=38349716

1: https://microservices.io/patterns/data/transactional-outbox....

2: https://www.youtube.com/watch?v=7Jp3tyCGMZs

3: https://forum.cockroachlabs.com/t/how-to-implement-a-work-qu...


SQLite is neither shift nor hype; it's a misunderstanding of the underlying architecture challenges.

SQLite has pros and cons, like anything else. When it comes to web frameworks and app platforms, the fundamental problem is that many committed fully to serverless/edge and ignored the loss of persistent storage. You just can't use a local database when using serverless, or any type of distributed compute/rendering for that matter.

Databases are centralized by design, you can dodge some of that complexity with clever synchronization protocols but you are still limited to having a single primary DB at the end of the day.

For read-heavy use cases, tools like Turso can be invaluable. If database writes are more common, you'll always be limited by network latency. More importantly for most modern web apps, whether you render HTML in the browser or a server you can't avoid loading states. IMO you might as well lean on the platform and use server rendering whenever persistent state is involved.


"Something went wrong! Try reloading the page. If the problem persists, please update your web browser to the latest version."

RIP website.

edit: it's back


Should have used SQLite.


Something can be both hype and useful. Paradigm shift, no, because there's no fundamental architectural difference from any other replicated sql database.

SQLite only really works as a throwaway database. For that it's pretty great. Although a lot of the current use cases for SQLite could have just been a .ini file.


SQLite is getting a lot heavier. This is becoming more like MySQL or Postgres.


No it's not. SQLite is still the only "serverless" database of those three. Nothing has changed, aside from the companies selling variants of it, but that's not SQLite's fault


The most recent SQLite amalgamation zip file is 2732008 bytes.

Version 3.26.0 from 2018 is 2286469 bytes.

That's a 20% increase in 6 years, but it's still just 2.6MB of compressed C.


The fate of all projects if they continue on long enough.



