
There's only one Frontend Engineer position listed on your careers page. If one wanted to be considered for the Compilers Scala Engineer position, should they apply to the Frontend position, or is the compiler position no longer available?


Last time I emailed them, the compiler position was filled.


You can spin up test Kubernetes clusters with different underlying container runtimes using Minikube. If you want to play around with a cluster running containerd instead of the Docker container runtime, use:

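    # start a throwaway cluster whose container runtime is containerd instead of Docker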
    minikube start --container-runtime=containerd

Use this to convince yourself that all your current Docker images will still deploy and work as usual.


What exactly are "materialized views"?


It's a query whose results you save in a cache table in the database, so that the next time it's queried, the results can be served from the cache.

Typically, in a traditional RDBMS, the query is defined as a SQL view, which you either have to refresh manually or can have refreshed periodically.

Using streaming systems like Kafka, it's possible to continuously update the cached results based on the incoming data, so you get a near-realtime, up-to-date query result.

Writing the stream processing to update the materialized view yourself can be complex; using SQL to do it, as Materialize enables, makes you a lot more productive.


“In computing, a materialized view is a database object that contains the results of a query. For example, it may be a local copy of data located remotely, or may be a subset of the rows and/or columns of a table or join result, or may be a summary using an aggregate function.”

https://en.m.wikipedia.org/wiki/Materialized_view


Let's start with views. A database view is a "stored query" that presents itself as a table, which you can further query against.

If you have a view "bar":

    CREATE VIEW bar AS
    SELECT x * 2 AS a, y + 1 AS b FROM foo;
and then you `SELECT a FROM bar`, then the "question" you're really asking is just:

    SELECT a FROM (SELECT x * 2 AS a, y + 1 AS b FROM foo)
— which, with efficient query planning, boils down to

    SELECT x * 2 AS a FROM foo
It's especially important to note that the `y + 1` expression from the view definition isn't computed in this query. The inner query from the view isn't "compiled" — forced into some fixed shape — but rather sits there in symbolic form, "pasted" into your query, where the query planner can then manipulate and optimize/streamline it further to suit the needs of the outer query.
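A quick way to convince yourself of this, in e.g. Postgres (the exact EXPLAIN output wording varies by version):

    EXPLAIN VERBOSE SELECT a FROM bar;
    -- the plan's "Output:" list shows only (foo.x * 2);
    -- y + 1 appears nowhere, since nothing forced it to be computed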

-----

To materialize something is to turn it from symbolic-expression form, into "hard" data — a result-set of in-memory row-tuples. Materialization is the "enumeration" in a Streams abstraction, or the forcing of a "thunk" in a lazy-evaluation language. It's the master screw that forces all the activity dependent on it — that would otherwise stay abstract — to "really happen."

Databases don't materialize anything unless they're forced to. If you do a query like

    SELECT false FROM (SELECT * FROM foo WHERE x = 1) AS t
...no work (especially no IO) actually happens, because no data from the inner query needs to be materialized to resolve the outer query.

Streaming data out of the DB to the user requires serialization [= putting the data in a certain wire format], and serialization requires materialization [= having the data available in memory in order to read and re-format it.] So whatever final shape the data returned from your outermost query has when it "leaves" the DB, that data will always get materialized. But other processes internal to the DB may sometimes require data to be materialized as well.

Materialization is costly — it's usually the only thing forcing the DB to actually read the data on disk, for any columns it wasn't filtering by. Many of the optimizations in RDBMSes — like the elimination of that `y + 1` above — have the goal of avoiding materialization, and the disk-reads / memory allocations / etc. that materialization requires.

-----

Those definitions out of the way, a "materialized view" is something that acts similar to a view (i.e. is constructed in terms of a stored query, and presents itself as a queriable table) but which — unlike a regular view — has been pre-materialized. The query for a matview is still stored, but at some point in advance of querying, the RDBMS actually runs that query, fully materializes the result-set from it, and then caches it.

So, basically, a materialized view is a view with a cached result-set.

Like any cache, this result-set cache increases read-time efficiency in the case where the original computation was costly. (There's no point in "upgrading" a view into a matview if your queries against the plain view were already cheap enough for your needs.)

But like any cache, it needs to be maintained, and can become out-of-sync with its source.

Although materialized views are a long-standing RDBMS feature, not all SQL RDBMSes implement them. MySQL/MariaDB do not, for example. (Which is why you'll find that much of the software world just pretends matviews don't exist when designing their DB architectures: if it ever needs to run on MySQL, it can't use matviews.)

The naive approach that some other RDBMSes (e.g. Postgres) take to materialized views is to only offer manual, full-pass recalculation of the cached result-set, via an explicit command (`REFRESH MATERIALIZED VIEW foo`). This works with "small data"; but at scale, this approach can be so time-consuming for large and complex backing queries that by the time the cache is rebuilt, it's already out-of-date again!
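In Postgres, that looks like this (a sketch, with a hypothetical `sales` table):

    CREATE MATERIALIZED VIEW daily_sales AS
    SELECT date_trunc('day', sold_at) AS day, sum(amount) AS total
    FROM sales
    GROUP BY 1;

    -- later, the manual full-pass rebuild of the cached result-set:
    REFRESH MATERIALIZED VIEW daily_sales;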

Because there are RDBMSes that either don't have matviews, or don't have scalable matviews, many application developers just avoid the RDBMS's built-in matview abstraction and build their own. Thus, another large swathe of the world's database architecture will either use cron jobs to regularly run and materialize a query, dumping its results back into a table in the same DB; or define on-INSERT/UPDATE/DELETE triggers on "primary" tables that transform and upsert data into "secondary" denormalized tables. These are both approaches to "simulating" matviews, portably, on an RDBMS substrate that isn't guaranteed to have them.
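A minimal sketch of the trigger-based simulation, in Postgres syntax, with hypothetical `orders` ("primary") and `order_totals` ("secondary") tables:

    -- the "primary" table that applications write to
    CREATE TABLE orders (
        id          serial  PRIMARY KEY,
        customer_id int     NOT NULL,
        amount      numeric NOT NULL
    );

    -- the hand-maintained "secondary" table standing in for a matview
    CREATE TABLE order_totals (
        customer_id int     PRIMARY KEY,
        total       numeric NOT NULL
    );

    -- on every insert into the primary table, upsert the aggregate
    CREATE FUNCTION bump_order_total() RETURNS trigger AS $$
    BEGIN
        INSERT INTO order_totals (customer_id, total)
        VALUES (NEW.customer_id, NEW.amount)
        ON CONFLICT (customer_id)
        DO UPDATE SET total = order_totals.total + EXCLUDED.total;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER orders_bump_totals
        AFTER INSERT ON orders
        FOR EACH ROW EXECUTE FUNCTION bump_order_total();

    -- (UPDATEs and DELETEs on orders would need their own triggers)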

Other RDBMSes (e.g. Oracle, SQL Server, etc.) do have scalable materialized views, a.k.a. "incrementally materialized" views. These work less like a view with a cache, and more like a secondary table with write-triggers on primary tables to populate it — but all handled under-the-covers by the RDBMS itself. You just define the matview, and the RDBMS sees the data-dependencies and sets up the write-through data flow.
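Sketched in Oracle's dialect (hypothetical tables again; real fast-refresh has more requirements than shown here, hence the extra COUNT columns):

    -- a change log on the base table, so the DB can refresh incrementally
    CREATE MATERIALIZED VIEW LOG ON orders
        WITH ROWID (customer_id, amount) INCLUDING NEW VALUES;

    -- maintained incrementally, on every commit against "orders"
    CREATE MATERIALIZED VIEW order_totals_mv
        REFRESH FAST ON COMMIT AS
    SELECT customer_id,
           SUM(amount)   AS total,
           COUNT(amount) AS amount_count,  -- required for fast refresh of SUM
           COUNT(*)      AS row_count
    FROM orders
    GROUP BY customer_id;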

Incrementally-materialized views are great for what they're designed for (reporting, mostly); but they aren't intended to be the bedrock for an entire architecture. Building matviews on top of matviews on top of matviews gets expensive fast, because even fancy enterprise RDBMSes like Oracle don't realize, when populating table X, that writing to X will in turn write to matview Y, which will in turn "fan out" to matviews {A,B,C,D}, etc. These RDBMSes' matviews were never intended to support complex "dataflow graphs" of updates like this, and so there's too much overhead (e.g. read-write contention on index locks) to actually make these setups practical. And it's very hard for these DBMSes to change this, as their matviews' caches are fundamentally reliant on database table storage engines, which just aren't the right ADT to hold data with this sort of lifecycle.

-----

Materialize is an "RDBMS" (though it's not, really) engineered from the ground up to make these sorts of dataflow graphs of matviews-on-matviews-on-matviews practical, by doing its caching completely differently.

Materialize looks like a SQL RDBMS from the outside, but Materialize is not a database — not really. (Materialize has no tables. You can't "put" data in it!) Instead, Materialize is a data streaming platform that caches any intermediate materialized data it's forced to construct during the streaming process, so that other consumers can work off those same intermediate representations without recomputing the data.

If you've ever worked with Akka Streams, or Elixir's Flows, or for that matter with Apache Beam (née Google Dataflow), Materialize is that same kind of pipeline. But one where all the plumbing work of creating intermediate representations (normally a procedural map/reduce/partition kind of thing) is done by defining SQL matviews; and where the final output isn't a fixed output of the pipeline, but rather comes from running an arbitrary SQL query against any arbitrary matview defined in the system.
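To make that concrete, here's roughly what driving it looks like (hedged: the connection details are made up, and the CREATE SOURCE syntax is as I recall it from their docs, so treat it as approximate):

    -- attach a Kafka topic as a streaming source
    CREATE SOURCE orders
    FROM KAFKA BROKER 'kafka:9092' TOPIC 'orders'
    FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081';

    -- an incrementally-maintained matview over the stream
    CREATE MATERIALIZED VIEW order_totals AS
    SELECT customer_id, sum(amount) AS total
    FROM orders
    GROUP BY customer_id;

    -- ordinary SQL reads come straight out of the maintained cache
    SELECT * FROM order_totals;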


> Most RDBMSes (e.g. Postgres) only offer manual (`REFRESH MATERIALIZED VIEW foo`) full-pass recalculation of the cached result-set for matviews.

"Most" here seems very much wrong, at least of major products: Oracle has an option for on-commit (rather than manual) and incremental/incremental-if-possible (FAST/FORCED) refresh, so it is limited to neither only-manual nor only-full-pass recalculation. SQL Server indexed views (their matview solution) are automatically incrementally updated as base tables change, they don't even have an option for manual full-pass recalculation, AFAICT. DB2 materialized query tables (their matview solution) have an option for immediate (on-commit) refresh (not sure if the algo here is always full-pass, but its at a minimum not always manual.) Firebird and MySQL/MariaDB don't have any support for materialized views at all (though of course you can manually simulate them with additional tables updated by triggers.) Postgres seems to be the only major RDBMS with both material view support and the limitation of only on-demand full-pass recalculation of matviews (for that matter, except maybe DB2 having the full-pass limitation, it seems to be the only one with either the only-manual or only-full-pass limitation.)


I think it's true that many databases offer incremental updates, and that it was incorrect to say manual refreshes are the state of the art.

The important point is that Materialize can do it for almost any query, very efficiently, compared to existing options. That opens a lot of possibilities.


> The important point is that Materialize can do it for almost any query, very efficiently, compared to existing options. That opens a lot of possibilities.

Yes, this does seem like a very big deal.


You're right; I updated my comment.


That was a fantastic and illuminating update, thank you.


This is a fantastic comment!

One small thing: we do now have tables[1]! At the moment they are ephemeral and only support inserts -- no update/delete. We will remove both of those limitations over time, though!

[1]: https://materialize.com/docs/sql/create-table/


This is an outstanding explanation. Much better than mine.


Updated results of a query - e.g. if you do some aggregation or filtering on a table, or join two tables, or anything of the sort - a materialized view will give you the up-to-date results of that query in a separate table.


Suppose you have normalized your data schema, up to at least 3NF, perhaps even further up to 4NF, 5NF or (as Codd intended) BCNF.

Great! You are now largely liberated from introducing many kinds of anomaly at insertion time. And you'll often only need to write once for each datum (modulo implementation details like write amplification), because a normalised schema has "a place for everything and everything in its place".

Now comes time to query the data. You write some joins, and all is well. But a few things start to happen. One is that writing joins over and over becomes laborious. What you'd really like is some denormalised intermediary views, which transform the fully-normalised base schema into something that's more convenient to query. You can also use this to create an isolation layer between the base schema and any consumers, which will make future schema changes easier and possibly improve security.
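For instance, with hypothetical `orders`/`customers` tables, one such convenience view might be:

    CREATE VIEW order_summary AS
    SELECT o.id, o.ordered_at, c.name AS customer_name, o.amount
    FROM orders o
    JOIN customers c ON c.id = o.customer_id;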

The logical endpoint of doing so is the Data Warehouse (particularly in the Kimball/star schema/dimensional modelling style). You project your normalised data, which you have high confidence in, into a completely different shape that is optimised for fast summarisation and exploration. You use this as a read-only database, because it massively duplicates a lot of information that could otherwise have been derived via query (for example, instead of a single "date" field, you have fields for day of week, day of month, day of year, week of year, whether it's a holiday ... I've built tables which include columns like "days until major conference X" and "days since last quarterly release").

Now we reach the first problem. It's too slow! Projecting that data from the normalised schema requires a lot of storage and compute. You realise after some head-scratching that your goal all along was to pay that cost upfront so that you can reap the benefits at query time. What you want is a view that has the physical characteristics of a table: you want to write out the results of the query, but still treat it like a view. You've "materialized" the view.

Now the second problem. Who, or what, does that projection? Right now that role is filled by ETL, "Extract, Transform and Load". Extract from the normalised system, transform it into the denormalised version, then load that into a data warehouse. Most places do this on a regular cadence, such as nightly, because it just takes buckets and buckets of work to regenerate the output every time.

Now enters Materialize, who have a secret weapon: timely dataflow. The basic outcome is that instead of re-running an entire view query to regenerate the materialized view, they can, from a given datum, determine exactly what will change in the materialized view and only update that. That makes such views potentially thousands of times cheaper. You could even run the normalised schema and the denormalised projections on the same physical set of data -- no need for the overhead and complexity of ETL, no need to run two database systems, no need to wait (without the added complexity of a full streaming platform).


That's a great description! Does Materialize describe how they implement timely dataflow?

At my current company, we have built some systems like this. Where a downstream table is essentially a function of a dozen upstream tables.

Whenever one of the upstream tables changes, its primary key is published to a queue; some worker translates this upstream primary key into a set of downstream primary keys, and publishes those downstream primary keys to a compacted queue.

The compacted queue is read by another worker, which "recomputes" each dirty key, one at a time; this involves fetching the latest-and-greatest version of each upstream table.

This last worker is the bottleneck, but it's optimized by per-key caching, so we only fetch the latest-and-greatest version once per update. It can also be safely and arbitrarily parallelized, since the stream it reads from is partitioned on key.


> Does materialize describe how they implement timely dataflow?

It's open source (https://github.com/TimelyDataflow/timely-dataflow), and also extensively written about both in academic research papers and documentation for the project itself. The GitHub repo has pointers to all of that. See also differential dataflow (https://github.com/timelydataflow/differential-dataflow).


Here's a 15-minute introduction to Timely Dataflow by Frank, our co-founder: https://www.youtube.com/watch?v=yOnPmVf4YWo


I was a graduate student at CU Boulder right before Dr. Larry Baggett retired. Once, while studying for my analysis prelim, I asked him a question off an old exam that I was struggling with, and he immediately whipped out a piece of paper and pen and started writing down integrals without skipping a beat. What a truly amazing and brilliant person. Thanks, Dr. Baggett!


I thought about doing this at one point but ran into API fees for looking up ISBN-related details. Can you discuss what API you are using, the fees you pay to the API owner, and how that is incorporated into your fee? Also, did you write your own ISBN phone-scanner app, or are you leveraging some other, possibly open-source, application? Any details would be great to hear. Nice job!


Wouldn't it be awesome if this map turned out to be a good predictor of the states Obama wins in the next election? Reddit could be the new place where candidates come to do "Reddit Hall Meetings" as opposed to "Town Hall Meetings". Does this service already exist? Can I call dibs on this as my startup idea? If I can't and someone else builds it, I'd love to help. I'm a noob programmer with mediocre skills across the stack looking to get my hands and brain dirty.

