Building the world’s fastest website analytics (2021) (usefathom.com)
52 points by viraptor on Oct 15, 2022 | 38 comments


They would have saved a whole lot of time by just using ClickHouse, which is a US company that has completely severed ties with Russia [1]. Altinity has offered managed ClickHouse since 2020 [2], and now ClickHouse Cloud does too [3].

[1] https://clickhouse.com/blog/we-stand-with-ukraine

[2] https://altinity.com/cloud-database/

[3] https://clickhouse.com/cloud


If you're really confident that your SaaS application will only ever have that single workload (a single-table group-by) in perpetuity, it's a workable choice. But that's a big bet against the reality of the diverse workloads scale-up SaaS applications require today. Sure, you can stitch together specialized stores to make your own accidental distributed database for your app, but a simpler, more straightforward, and lower-latency alternative is available. Yes, I'm a SingleStore employee, but the argument is worth considering on its merits. It's also worth trying for free in production. The free tier for production is not time-bound, just capacity-bound.


> Sure, you can stitch together specialized stores to make your own accidental distributed database for your app, but a simpler, more straightforward, and lower-latency alternative is available.

Organizations like Facebook, Amazon, Netflix, Cloudflare, and many, many others have been combining specialized data stores for many years. Stonebraker, Madden, and colleagues summarized the argument for this approach in 2007. [0] At this point I can't think of any major SaaS application I've seen recently that did not have multiple data store types. That includes event streams, which are data stores as well.

[0] http://nms.csail.mit.edu/~stavros/pubs/hstore.pdf

Disclaimer: I work for Altinity.


So, you think that the 99.99% of businesses around the world who are not in the cloud hyperscaler business should all build their systems as if they had the engineering resources of cloud hyperscalers? I've seen first-hand how enterprises with Web 2.0 envy made costly, multi-year mistakes in technology choices and architecture with that approach. Be inspired by the engineering prowess of FB, AWS, Netflix, etc., for sure, but don't make the mistake of directly copying the approach of a $470B (AWS) or $86B (FB) company when your company's situation is nothing like theirs. Stated another way, you can't coach exactly like Pep Guardiola if you don't have his resources and his players. Existing enterprises have a totally different starting point. Enterprises simply can't follow the hyperscaler hegemony for their digital transformation strategies and expect to win. Also, think about who is served by building inefficient, accidental distributed databases underneath your SaaS applications when that can be avoided by using a proven technology alternative that already exists. It's the CSPs you're paying in the monthly cloud bill for every byte stored, every byte computed, and every byte moved.


MySQL, ClickHouse, and Kafka are all proven technologies. You can run them yourself or choose any of a number of excellent cloud services to run them for you. It's not that hard to combine them and let each part serve the purpose for which it is designed. I've seen companies from small startups to organizations the size of Facebook take this approach. It's a reasonable architecture with many thousands of successful examples.

My own company is not especially large, but we run MySQL, ClickHouse, and Prometheus to manage our own SaaS platform for ClickHouse. We are happy with the result. (Though it would be nice to use ClickHouse as a Prometheus backend.)


> it would be nice to use ClickHouse as a Prometheus backend

Well... that's already possible, and it works great! As you might know, https://qryn.dev turns ClickHouse into a powerful Prometheus *remote_write* backend. The Go/cloud version supports full PromQL queries against ClickHouse transparently (the JS/Node version transpiles to LogQL instead). Performance-wise, it's well on par with Prometheus, Mimir, and VictoriaMetrics in our internal benchmarks (which included ClickHouse as part of the resource set), handling millions of inserts/s with broad client compatibility. The same goes for logs (LogQL) and traces (Tempo).

Disclaimer: I work on qryn


You should probably focus on evangelizing the features of your own database instead of broadcasting your ignorance on those of a competitor.


Yandex (the Google of Russia) is a large shareholder of ClickHouse Inc [1]. Yes, this may be a European Yandex subsidiary.. I really don't care to dig, but I think it's dishonest to say the company doesn't have Russian backers. Has ClickHouse Inc forced Yandex to divest its large ownership stake?

[1] https://www.crunchbase.com/organization/clickhouse/company_f...


> I really don't care to dig

Then why are you opining on the subject?


Because half-truths like your statements bother me? Yandex is a large investor in ClickHouse Inc. (at least as far as the public record shows). Have they forced Yandex to divest its holdings?


It's literally in the article I linked (https://clickhouse.com/blog/we-stand-with-ukraine). As you're incapable of reading, let me highlight it for you:

> We have no operations in Russia, no Russian investors, and no Russian members of our Board of Directors.

It's disappointing that after 11+ years of failing to find traction for MemSQL, this is the hill you die on. You're running your mouth in some random comments thread about a shitty 2-person analytics platform whose author clearly doesn't know the first thing about databases and just went with the first Twitter ad he came across.

Under Raj's "leadership" you know damn well MemSQL/SingleStore will be dead within two years. The hyper-aggressive sales pressure had some success during the lofty bull market due to clueless IT managers with massive budgets, but that time has come to an end. You guys missed the window to IPO to get your own exit and your runway is quickly diminishing.


I read the article. And yes, this was probably a mistake to engage with you on it.

Yandex N.V. (a Dutch holding company of Yandex Russia) is listed as an investor in ClickHouse Inc. That is all I'm stating.


This article makes a lot more sense when you know that Singlestore/MemSQL gifted Jack Ellis shares.


Lmao. Hey, if I wasn't running Fathom, Dev Rel @ SingleStore is the only role I'd consider in tech right now. And I would definitely try to get some of those sweet shares as part of compensation. Alas, I don't own any SingleStore shares.


Hah. What a joke. So it's all astroturfing and fake evangelism.


By migrating to SingleStore, we didn’t have to change any persistence logic (we were already using SQL and Laravel Eloquent). Having explored doing that for Elasticsearch, it wasn’t something we wanted to do.

Looking back, I’m glad we went in this direction. Fathom has grown beyond what we could’ve imagined. Let’s see what happens over the next five years.


ClickHouse is a great technology for a niche scenario. The lack of a good query optimizer (QO) is one big thing missing, but unfortunately not the only one.

I know a company that forked ClickHouse but had to rebuild virtually everything to make it a proper database.

There are already plenty of technologies more mature than ClickHouse for addressing such scenarios.


ClickHouse is almost ubiquitous for web analytics services.

For example, here is news from today about Splitbee: https://news.ycombinator.com/item?id=33334104 No surprise it is using ClickHouse: https://splitbee.io/blog/new-pricing


While reading the article, I found that they are rediscovering the tricks, already well-known for ClickHouse:

> The biggest mistake I made was that I kept the UPDATEs in our code (we update the previous pageview with the duration, remove the bounce, etc.). In a few weeks, I'll be moving to 100% append-only by utilizing negative numbers.

> For example, if you want to set bounce_rate to 0%, you would write a 1 for bounce_rate on the first pageview and then insert a duplicate with -1 for bounce_rate. And for the duplicate row, you'd have nothing set for pageviews, visits and uniques, so it would all group nicely.

This is nearly a copy-paste from ClickHouse documentation.
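The append-only "negative numbers" trick quoted above (essentially what ClickHouse's CollapsingMergeTree documents) can be sketched in a few lines. This is a toy illustration using the column names from the quoted passage, not Fathom's actual code:

```python
# Toy sketch of the append-only "negative numbers" trick described above.
# Instead of UPDATE-ing a pageview row, we append a compensating row whose
# value cancels the original under SUM(). Column names follow the article.

rows = []

def record_pageview(pathname):
    # First sighting: counts as a pageview and, initially, as a bounce.
    rows.append({"pathname": pathname, "pageviews": 1, "bounce_rate": 1})

def cancel_bounce(pathname):
    # Visitor viewed a second page: append the "duplicate with -1" row.
    # pageviews stays 0 here, so totals still group correctly.
    rows.append({"pathname": pathname, "pageviews": 0, "bounce_rate": -1})

record_pageview("/pricing")
cancel_bounce("/pricing")
record_pageview("/docs")

# Equivalent of SUM(...) GROUP BY pathname over the appended rows.
totals = {}
for r in rows:
    t = totals.setdefault(r["pathname"], {"pageviews": 0, "bounce_rate": 0})
    t["pageviews"] += r["pageviews"]
    t["bounce_rate"] += r["bounce_rate"]

print(totals)  # /pricing ends with 1 pageview, 0 bounces; /docs with 1 and 1
```

The cancellation never needs to find and rewrite the original row, which is exactly why it suits insert-optimized columnstores.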

> We shard on UUID, and then we set SiteId as the sort key. We do this because we want to utilize something called "local joins" in SingleStore. Long story short, events can be joined with event_properties (allowing you to have thousands of dynamic properties per event you track), and it's fast.

This statement is slightly disappointing as well. No magic, no advantage over ClickHouse.
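The "local joins" point boils down to co-location: when two tables are distributed on the same key, every join match lives on the same shard and no data moves between nodes. Here is a generic toy illustration of that idea (not SingleStore's actual execution plan; the table and key names are made up):

```python
# Toy sketch of shard-local joins: both tables are hash-distributed on the
# same key (site_id), so every join match is guaranteed to be shard-local
# and the join can run independently per shard with no network shuffle.

N_SHARDS = 4

def shard_of(key):
    return hash(key) % N_SHARDS

events = [{"site_id": s, "event": e} for s, e in [("a", "view"), ("b", "click")]]
props  = [{"site_id": s, "prop": p} for s, p in [("a", "utm=x"), ("b", "utm=y")]]

shards = [{"events": [], "props": []} for _ in range(N_SHARDS)]
for row in events:
    shards[shard_of(row["site_id"])]["events"].append(row)
for row in props:
    shards[shard_of(row["site_id"])]["props"].append(row)

# The join runs per shard; matching rows are always on the same shard.
joined = []
for shard in shards:
    for e in shard["events"]:
        for p in shard["props"]:
            if e["site_id"] == p["site_id"]:
                joined.append((e["event"], p["prop"]))

print(sorted(joined))
```

If the two tables were sharded on different keys, the same join would require repartitioning one side across the network, which is the cost the article is avoiding.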

The outcome can be read as follows: "use SingleStore exactly as you'd use ClickHouse and maybe it will work alright".

SingleStore advantages:

- good compatibility with MySQL dialect;

- offers good UPDATE/DELETE for fresh data (invalidated by this article, as they dropped the usage of UPDATE for better performance);

- good support for JOINs (invalidated by this article, as they ended up using local JOINs);

SingleStore disadvantages:

- not open-source, worse brand recognition, and the old brand (MemSQL) did not work well either;

- performance claims don't hold up under validation;

- an older company, and many of the initial developers no longer work there;

Fathom sounds like a toy use case as the data amount is too low. ClickHouse is almost universally selected for similar and larger use cases, with hundreds of billions of events each day.

PS. Nevertheless, the post is hilarious:

> I was suffering from low energy in the two weeks leading up to this migration, and I was feeling awful throughout migration week, especially on migration day. On Sunday 14th March 2021, two days after we had finished the migration, my wife showed me an already-half-used bag of coffee beans I had been drinking through most of this migration, and the bag said "DECAF." Divorce proceedings are underway.


ClickHouse is great at single-table analytical queries (GROUP BY + aggregates + filters over a single table). It can match any top-tier columnstore database at this, maybe even beat them at smaller scales, and if that is all you need, it will do the job well.

Beyond this, it has a naive query optimizer and a limited ability to run distributed joins. This is why you won't find well-known analytical benchmark results like TPC-H and TPC-DS for ClickHouse vs. other SQL data warehouses.

SingleStoreDB has been doing real-time analytics much longer than ClickHouse and has a much more mature feature set at this point [1][2]. Our go-to-market is more big-enterprise driven, so we are definitely much less well known among developers. Revenue-wise, I suspect we are the leader in real-time analytics (see disclaimer below; I'm biased, but our revenue is reaching thresholds for IPO readiness).

Also, if you follow Jack's later posts you'll see he replaces DynamoDB and Redis with SingleStoreDB as well. Now we're getting into places ClickHouse doesn't tread at all...

Disclaimer: I'm one of the cofounders of MemSQL/SingleStoreDB (still working away on making it better all these years later...).

[1] https://www.singlestore.com/blog/the-technical-capabilities-...

[2] https://dl.acm.org/doi/abs/10.1145/3514221.3526055


Can you provide SingleStore results for TPC-H and TPC-DS? I can't find them. Why?

[1] https://www.tpc.org/tpcds/results/tpcds_results5.asp?orderby...

[2] https://www.tpc.org/tpch/results/tpch_results5.asp?orderby=d...

Disclaimer: I work for ClickHouse


Almost no one publishes "official" TPC results these days (maybe other than Oracle and some of the Chinese vendors).

Most other cloud DWs have public TPC-H or TPC-DS results that are easily googlable. ClickHouse is missing for a reason...

  - https://research.gigaom.com/report/data-warehouse-cloud-benchmark/
  - https://www.databricks.com/blog/2021/11/02/databricks-sets-official-data-warehousing-performance-record.html
  - https://celerdata.com/blog/starrocks-queries-outperform-clickhouse-apache-druid-and-trino
  - https://aws.amazon.com/blogs/big-data/amazon-redshift-continues-its-price-performance-leadership/
Our results are here: https://www.singlestore.com/blog/tpc-benchmarking-results/


It's worth noting that ClickHouse and MySQL work quite well together. ClickHouse has a similar SQL dialect and can read data directly from MySQL. (Awesome feature, BTW; in some cases queries even run faster that way.)

It's straightforward to build applications that do transaction processing in MySQL and analytics in ClickHouse. You can migrate data out of MySQL that does not belong there. You can also use CDC to mirror transaction data into ClickHouse for analytic processing. [0] The resulting applications are quite robust and have the advantage that different parts can scale independently.
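The mirroring pattern described above can be sketched with a minimal watermark-based sync loop. This is a toy illustration only: `sqlite3` stands in for both MySQL (the OLTP side) and ClickHouse (the analytics side), and real pipelines would use binlog-based CDC tooling rather than polling:

```python
# Toy CDC-style mirroring: copy rows newer than a watermark from the
# transactional store into the analytics store. sqlite3 stands in for
# both MySQL and ClickHouse; table and column names are made up.
import sqlite3

oltp = sqlite3.connect(":memory:")  # stand-in for MySQL
olap = sqlite3.connect(":memory:")  # stand-in for ClickHouse

oltp.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
olap.execute("CREATE TABLE orders (id INTEGER, amount REAL)")

oltp.executemany("INSERT INTO orders (amount) VALUES (?)", [(9.5,), (20.0,)])

watermark = 0  # highest id already mirrored to the analytics side

def sync():
    """Pull rows the analytics store has not seen yet."""
    global watermark
    new_rows = oltp.execute(
        "SELECT id, amount FROM orders WHERE id > ?", (watermark,)
    ).fetchall()
    if new_rows:
        olap.executemany("INSERT INTO orders VALUES (?, ?)", new_rows)
        watermark = max(r[0] for r in new_rows)

sync()                                                  # mirrors rows 1-2
oltp.execute("INSERT INTO orders (amount) VALUES (12.5)")
sync()                                                  # mirrors row 3 only

total = olap.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 42.0
```

Each side can then scale and be tuned independently, which is the robustness point made above.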

Building applications that use the strengths of each of the component databases is a natural way to scale systems.

Disclaimer: I'm the author of the cited article and I work for Altinity.

[0] https://altinity.com/blog/using-clickhouse-as-an-analytic-ex...


Sounds pretty cool. It's interesting to see how the landscape for this has changed over the last 15 years, since I worked at Splunk.

I just started working at FeatureBase a few weeks ago, here in Austin. It's Open Source and we have a 5 minute guide to get it running here: https://docs.featurebase.com/. We also have a cloud/serverless option.

FeatureBase is a B-tree database which uses Roaring Bitmaps. This makes it suitable for doing analytical queries on massive data sets immediately after ingestion. You do have to model the data properly, however.
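The bitmap-index idea behind this can be sketched in plain Python: keep one bitmap per (column, value) pair, so filters become bitwise ANDs and counts become popcounts. This is a toy using Python ints as bitsets, not FeatureBase's implementation; Roaring Bitmaps add compressed containers on top of the same idea:

```python
# Toy bitmap-index sketch: one bitmap (a Python int used as a bitset) per
# attribute value. Equality filters become AND/OR over bitmaps, and counting
# matches becomes a popcount. Illustration only, not FeatureBase's internals.

index = {}  # (column, value) -> bitmap of matching row ids

def insert(row_id, row):
    for col, val in row.items():
        key = (col, val)
        index[key] = index.get(key, 0) | (1 << row_id)

insert(0, {"country": "US", "browser": "firefox"})
insert(1, {"country": "US", "browser": "chrome"})
insert(2, {"country": "DE", "browser": "firefox"})

# "WHERE country = 'US' AND browser = 'firefox'" as a single bitwise AND:
hits = index[("country", "US")] & index[("browser", "firefox")]
count = bin(hits).count("1")  # popcount of the result bitmap
print(count)  # 1 matching row (row 0)
```

Because the per-value bitmaps are built at ingest time, this is what allows analytical filters to run immediately after ingestion, at the cost of modeling the data up front.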


I have mixed feelings reading this article. There was no stated requirement around code refactoring, yet multiple solutions were axed because of "how much of our application I was going to need to refactor". Maybe make that a hard requirement then, and don't invest time learning tech that doesn't meet it? Also, was that a joke about the divorce? Either way, bad taste and a red flag on the personality side for me.


My wife thought it was funny ;)


SingleStore is pretty amazing since it supports UPDATE, DELETE, and joins with other tables.

ClickHouse updates are painful. I looked at Postgres with the Citus extension, and it only supports appends.

Their paper on how they organize the columnstore and row store buffer is pretty neat. https://images.go.singlestore.com/Web/MemSQL/%7Bd1e77ba1-0e3...

It’s really cool to see usefathom take off, doing millions in ARR with two engineers. SingleStore has to have been a high-leverage choice so they can focus on product. Power to them.

Meanwhile, we’ve been grappling with a mixture of Mongo, Snowflake, and Postgres for more than a year with a decent-sized team. The curse of VC-funded startups is that with more money and more engineers, a likely outcome is more complexity. The infra becomes a complex beast that takes many quarters/years to build because there isn’t that market pressure to be time- and money-efficient.


At some point, isn't it better to build your own database? And choose the trade offs that fit your project. Or pay/hire someone to do it.


I honestly can't think of any time where "build your own database" is a good solution, unless you're a database company or maybe Google.


Definitely not. No regrets after making the move. March next year will be the two year mark.


I'm curious how you (and your team I presume) came to decide you need to build your own database?


We didn’t decide that, we used SingleStore


That sounds like a really bad solution, actually.

Building a reliable DBMS with any kind of decent performance can easily take years. And by then, it's still not sure that you're doing any better than any of the existing solutions.


And that point is when you can’t find someone to host it for you.

Though, I’d be sorely tempted to spin up a bfc of clickhouse on bare metal.


That one was tried, among many others covered in the article.


ClickHouse isn’t really relevant for a lot of us. SingleStore outperforms ClickHouse, and the latter isn’t MySQL wire compatible. Far more features come with SingleStore too. And if you speak to people running ClickHouse, they’re also maintaining a Postgres setup. With SingleStore, you get your OLTP and OLAP in one database. So our users & sites tables sit in memory (backed by disk), meaning ultra-fast read/write speeds (comparable to clustered/high-availability Redis). And then we put our pageviews/events in columnstore (on disk), which offers rapid performance for analytical queries.


These people don't have the in-house skills to maintain a simple SQL server, let alone optimize their schema (MySQL being too slow to sum/group 11m records in a timely manner? Please.)

Attempting to make your own DB while being ignorant of competing products would be an absolute disaster.


I completely agree! For the people following along at home who don’t want to move away from MySQL: can you let us know how to achieve a <10ms GROUP BY aggregation with a high-cardinality column (11 million distinct values for “pathname”)?

The query was:

SELECT SUM(pageviews) as total, pathname FROM pageviews GROUP BY pathname ORDER BY total DESC LIMIT 10.

If we don’t hear an answer from you, I’ll be really upset. Otherwise, we may have to add your ideas to the article!



