montanalow's comments | Hacker News

There is a lot of latency involved in shuffling data for modern, complex ML systems in production. In our experience these costs dominate end-to-end user-experienced latency, rather than the actual model or ANN algorithms, which unfortunately limits what is achievable for interactive applications.

We've extended Postgres w/ open source models from Huggingface, as well as vector search and classical ML algorithms, so that everything can happen in the same process. It's significantly faster and cheaper, which leaves a large latency budget available for expanding model and algorithm complexity. In addition, open source models have already surpassed OpenAI's text-embedding-ada-002 in quality, not just speed. [1]

Here is a series of posts explaining how to implement a typical ML-powered application as a single SQL query that runs in a single process, with memory shared between models and feature indexes, including learned embeddings and reranking models.

- Generating LLM embeddings with open source models in the database[2]

- Tuning vector recall [3]

- Personalize embedding results with application data [4]

This allows a single SQL query to accomplish what would normally be an entire application w/ several model services and databases.

e.g. for a modern chatbot built across various services and databases:

  -> application sends user input data to embedding service
      <- embedding model generates a vector to send back to application
  -> application sends vector to vector database
      <- vector database returns associated metadata found via ANN
  -> application sends metadata for reranking
      <- reranking model prunes less helpful context
  -> application sends finished prompt w/ context to generative model
      <- model produces final output
  -> application streams response to user
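The first few hops above can collapse into one statement when embedding and retrieval run inside the database. A hypothetical sketch, assuming PostgresML's pgml.embed function and a pgvector-style `<=>` distance operator (the model, table, and column names here are illustrative, not from the posts):

```sql
-- Illustrative schema: documents(id, body, embedding vector(384))
WITH query AS (
  SELECT pgml.embed('intfloat/e5-small', 'user input text')::vector AS q
)
SELECT d.id, d.body
FROM documents d, query
ORDER BY d.embedding <=> query.q  -- ANN search, no round trip to a vector DB
LIMIT 10;
```

A reranking model could be applied as another function call in the same query, keeping the entire context-building step in one process.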
[1]: https://huggingface.co/spaces/mteb/leaderboard

[2]: https://postgresml.org/blog/generating-llm-embeddings-with-o...

[3]: https://postgresml.org/blog/tuning-vector-recall-while-gener...

[4]: https://postgresml.org/blog/personalize-embedding-vector-sea...

Github: https://github.com/postgresml/postgresml


RAM and the Postgres shared buffers, as well as the OS page cache, can also be important factors when there is an active working subset; e.g. the active sessions on a website might be reused hundreds of times per session, with only a relatively small portion of them active at any time.

Shared RAM across all cores can be a reason to use fewer, larger machines rather than more, smaller ones in such a case. Postgres does give you options either way, though.
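For reference, the shared buffer pool can be inspected and resized from SQL; the size below is purely illustrative, and the new value only takes effect after a server restart:

```sql
SHOW shared_buffers;                       -- current setting (default is often 128MB)
ALTER SYSTEM SET shared_buffers = '8GB';   -- illustrative value; requires a restart
```

Sizing it against the active working subset is what lets repeated session lookups stay in memory.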


Yup. In both cases, I was presuming a worst-case workload — e.g. an OLAP workload where you have to read all the data in the DB to answer a query, and "all the data" is far larger than RAM; or 10k concurrent queries from different tenants who all care about different things. That's really where bottlenecks will hit you. "More RAM" does help when your problems aren't worst-case problems. :)


Low-effort comment from someone who didn't read the post.

- Multiple formats were compared

- Duckdb is not a production ready service

- Pandas isn't used

You seem to be trolling.


How would I be able to respond to the post in detail if I didn't read it? What a bizarre, defensive response. To address your points:

> - Multiple formats were compared

Yes, but not a zero-copy or efficient format like FlatBuffers. Copy count was mentioned as one of the highlights of PostgresML:

> PostgresML does one in-memory copy of features from Postgres

> - Duckdb is not a production ready service

What issues did you have with DuckDB? You could use some other in-memory store like Plasma if you don't like DuckDB.

> - Pandas isn't used

That was responding to this point in the post:

> Since Python often uses Pandas to load and preprocess data, it is notably more memory hungry. Before even passing the data into XGBoost, we were already at 8GB RSS (resident set size); during actual fitting, memory utilization went to almost 12GB.

> You seem to be trolling.

By criticizing the blog post?


How are you doing online ML inference, without fetching data?


You get it. One tier is better than two. Python can't be one tier unless it loads the full dataset, which is generally not feasible for production online inference. PostgresML is one tier, and supports the traditional Python use cases.


Why can't Python be 1 tier? It's a general-purpose, extensible language. It can do anything that PostgreSQL can do.


The converse is also true. We may be missing diagnoses for people born in September, who may struggle but not quite meet the definitions, which often include academic performance.


Even "academic performance" isn't a clear enough predictor. It isn't even a single axis!

I was a "perfect" student all through grade school, except I had terrible grades in almost every class. Why? Because I almost never did homework. I usually aced tests without even trying, though. And I could always provide meaningful input into class discussions. But that isn't enough for the strict traditional structure of academia.

Teachers would make me stay after school to try to catch up on homework, and I would just wait patiently until it was late enough in the evening that they would give up and let me go home. This was a recurring pattern from 4th grade all through 12th. I only graduated high school because I finished hundreds of pages of make-up work over the last couple of weeks.

The reality is I couldn't do homework. I could focus and sit still, which must have meant I didn't have any attention or hyperactivity disorder...

...but that's not what ADHD is. It's an executive functioning disorder.

ADHD is being diagnosed more today because it has been incredibly underdiagnosed in the past. It's still significantly underdiagnosed in adults today.


This largely mirrors my experience. You know how people talk about addiction, and how they are "drawn" to their poison, like by some force? The polar opposite of that is what I felt with homework. Like trying to push two identical magnetic poles together.


> Like trying to push two identical magnetic poles together

...with atrophied hands.

I knew I could, I knew I should, but I could never do. I could only watch myself, never participate.

It's really validating to have words like "executive dysfunction" to describe this pattern, as opposed to "lazy", "tired", and "unmotivated" - which are all I had growing up.


What may be missing from cloud is alignment of incentives. If you waste more compute, you increase their profit margins. That would explain things the author questions, like general latency increases.


I think what you're missing is that XGBoost is worthless without data to use for inference. That data can come from in-process, or over the wire. One is fast; one is not.


Well, imagine an nginx plugin that runs XGBoost. Or even a standalone Rust/C++ microservice that provides XGBoost via a standard HTTP interface. The data might come from the filesystem, or be loaded from a network location on startup/reload and kept in memory.

Basically, PostgreSQL is a stateful service, and stateful services are always a major pain to manage: you need to back them up, migrate them, think about scaling... Sometimes they are inevitable, but that does not seem to be the case here.

If you have CI/CD set up and do frequent deploys, it will be much easier and more reproducible to include the models in the build artifact and have them loaded from the filesystem along with the rest of the code.


Stateful services are indeed more painful to manage than stateless ones. But ignoring state (data fetch time) for ML, as if the model artifact were the only important component, is... not a winning strategy.


As a contributor, I think it's interesting when comments focus on the language (Python vs Rust) rather than the architecture (local vs remote). Inference is embarrassingly parallelizable, whether via Python Flask or Postgres replicas. I think the interesting thing is that data retrieval costs tend to dominate other costs, and yet are often ignored.

ML algorithms get a lot of focus and hype. Data retrieval, not so much.
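As a sketch of what colocating retrieval with inference looks like, assuming a pgml.predict-style scoring function and an illustrative table (the model name, table, and feature columns here are made up):

```sql
-- One round trip: features never leave the database process,
-- and the model is applied row-by-row where the data lives
SELECT u.id,
       pgml.predict('churn_model', ARRAY[u.visits, u.purchases, u.days_active])
FROM users u
WHERE u.last_seen > now() - interval '1 day';
```

The equivalent two-tier design would serialize every row over the wire to a model service and back, paying network and copy costs on each request.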


For anyone who skips the intro and just goes to the results, this is what they see: https://imgur.com/tEK73e8

A suggestion: clean up the blog post's charts and headers to make it much, much clearer that what's being compared isn't Python vs PostgresML.


Another suggestion: Don't build your identity around a language or platform. They come and go. Except SQL. It's been around for longer than either of us.


Agreed, which is why I use Postgres for most of my work unless I absolutely can't.


That is the reason many older developers tend to do everything (business logic, etc.) in DB stored procedures, functions, and views. The cost of getting the data is native, no connection pooling is needed, and with V8/Python integration in PG, the language you use is a non-issue. If you are dealing with a large amount of data in a DB, why not just do everything there? SQL databases have cursors and MERGE, which make manipulating large sets of data much easier than moving them to another language environment.
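A minimal sketch of that style, with illustrative table and function names; the business logic runs as a function next to the data, so no rows ever cross a connection:

```sql
-- Illustrative: apply a bulk discount in-database instead of
-- fetching rows into an app server, mutating them, and writing back
CREATE OR REPLACE FUNCTION apply_discount(pct numeric)
RETURNS integer
LANGUAGE plpgsql AS $$
DECLARE
  updated integer;
BEGIN
  UPDATE orders
  SET total = total * (1 - pct / 100)
  WHERE total > 1000;
  GET DIAGNOSTICS updated = ROW_COUNT;  -- report how many rows changed
  RETURN updated;
END;
$$;

SELECT apply_discount(5);  -- 5% off qualifying orders, one round trip
```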


Those well-defined slices of pie you're referring to are often exactly the political boundaries OP is referring to.

This is the degenerate form of Conway's Law.

