cmollis's comments

Yes, that is correct (I also worked there 2000-01). Bertelsmann offered $1B to settle the suit with the labels (including the ones Bertelsmann owned, somewhat ironically) so that the subscription service we were building could go live, but this was rejected out of hand. After that, it was an existential moment for the entire industry.. until Steve Jobs pitched what Apple had been working on (iTunes). My personal recollection is that at probably no other time would the music industry have agreed to the terms Jobs wanted for the iTunes service, but there were literally no other viable options at the time, so they agreed. You know the rest.


Yes. DuckDB works very well with Parquet scans on S3 right now.


Does it work well with Hive tables storing Parquet files on S3?


we've had tornadoes in NJ..


The East Coast gets microtornadoes that deshingle roofs or tip trees over.

The Midwest gets MS Paint eraser tool tornadoes.


The East Coast gets EF3 and EF4 tornadoes.


We get about one a decade in upstate NY.


sounds a bit like what Iceberg does with writes on parquet.


edit: author here

Yeah, that's a good point. We use a copy-on-write implementation similar to many other storage systems (such as Iceberg) that offer time-travel, branching, etc., where snapshots are represented by pointers to immutable snapshot-layer or delta files (or whatever you want to call them).

In our case, we wanted to give customers a very similar, seamless experience using DuckDB on MotherDuck as they would have with local DuckDB. To provide this parity, we needed to extend DuckDB's native storage to support these capabilities (time-travel, branching, etc.). This led us to implementing Differential Storage.
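
To make that concrete, here's a toy sketch of the general copy-on-write pattern (illustrative only, not our actual implementation; all names are made up):

    # Snapshots never rewrite data; they just point at immutable layer/delta
    # files, so time-travel and branching become pointer operations.
    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Snapshot:
        snapshot_id: int
        files: tuple                # immutable data/delta files visible to this snapshot

    @dataclass
    class Catalog:
        snapshots: dict = field(default_factory=dict)   # snapshot_id -> Snapshot
        branches: dict = field(default_factory=dict)    # branch name -> snapshot_id
        next_id: int = 0

        def commit(self, branch, new_files):
            # copy-on-write commit: reuse the parent's file list, append new delta files
            parent = self.snapshots.get(self.branches.get(branch))
            files = (parent.files if parent else ()) + tuple(new_files)
            snap = Snapshot(self.next_id, files)
            self.snapshots[snap.snapshot_id] = snap
            self.branches[branch] = snap.snapshot_id
            self.next_id += 1
            return snap

        def branch(self, src, dst):
            # branching is just copying a pointer; no data is copied
            self.branches[dst] = self.branches[src]

    cat = Catalog()
    cat.commit("main", ["layer-0.parquet"])
    cat.commit("main", ["delta-1.parquet"])
    cat.branch("main", "dev")                # zero-copy branch
    cat.commit("dev", ["delta-2.parquet"])
    # time-travel = reading any older snapshot_id; its file list never changes

Reading a branch at a point in time is then just resolving a snapshot id and scanning the files it points to.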


I've been testing DuckDB's ability to scan multi-TB Parquet datasets in S3, and I have to say I've been pretty impressed with it. I've run some pretty hairy SQL (window functions, multi-table joins, etc.).. stuff that takes less time in Athena, but not by that much. Coupled with its ability to pull and join that data with information in RDBMSs like MySQL, it makes for a really compelling tool. Strangely, the least performant operations were the MySQL lookups (I had to SET GLOBAL mysql_experimental_filter_pushdown=true;). Anyway.. definitely worth another look.. I'm using v 9.2.
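
For anyone curious, the shape of what I was running looked roughly like this (bucket, table, and column names are placeholders; adjust for your own data and credentials):

    import duckdb

    con = duckdb.connect()

    # S3 access via the httpfs extension
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    con.execute("SET s3_region='us-east-1'")   # assumed region; credentials via env or SET s3_access_key_id/...

    # attach MySQL so it can be joined against the Parquet scans
    con.execute("INSTALL mysql")
    con.execute("LOAD mysql")
    con.execute("ATTACH 'host=db.example.com user=reader database=crm' AS mysqldb (TYPE MYSQL)")
    con.execute("SET GLOBAL mysql_experimental_filter_pushdown=true")

    # window function over a hive-partitioned Parquet dataset, joined to a MySQL lookup table
    df = con.execute("""
        SELECT e.customer_id,
               c.name,
               SUM(e.amount) OVER (PARTITION BY e.customer_id ORDER BY e.ts) AS running_total
        FROM read_parquet('s3://my-bucket/events/year=*/month=*/*.parquet',
                          hive_partitioning = 1) AS e
        JOIN mysqldb.customers AS c ON c.id = e.customer_id
        WHERE e.year = 2023
    """).df()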


I always thought it was easier to read code than to write it.


Other way around in my estimation.


I've read every line of a 300,000 line web application in less than a year working at a company. I doubt I could write that many lines of code in that amount of time.


300,000 lines of code doesn't seem unattainable for a single developer in a year. If I'm in the zone, I can easily bang out 5-10k lines of code over a weekend if I know what I want to write.

But that's the thing: a lot of development involves not writing any code, as you re-think your abstractions, plan the architecture of the next bits you'll write, debug what you've written, etc.

So I don't think we're necessarily talking about speed when we say that it's harder to read code than to write it. I think reading -- and truly understanding -- code (especially when it's someone else's, or even yours, that you haven't seen in a long time) can require quite a bit more mental effort than writing code.


Surely you jest; nobody writes a full shopping platform by themselves in a year. Yet I read it all and deleted the old code we didn't need.


Interesting. I usually think about it at a different scale; it's much easier to write something small and clever than it is to read and understand it later.


What language do you use? In UI code you can reach 300k LOC quite quickly.


PHP, so some of it was JS and some HTML


I've been testing Polars and DuckDB recently.. Polars is an excellent dataframe package.. extremely memory-efficient and fast. I've experienced some issues with hive-partitioning of a large S3 dataset that DuckDB doesn't have. I've scanned multi-TB S3 Parquet datasets with DuckDB on my M1 laptop, executing some really hairy SQL (stuff I didn't think it could handle).. window functions, joins to other Parquet datasets just as large, etc. Very impressive software. I haven't done the same types of things in Polars yet (just simple selects).
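
For reference, the two scans look roughly like this (paths and columns are placeholders, and the Polars cloud/hive options assume a fairly recent Polars version):

    import duckdb
    import polars as pl

    # DuckDB: glob scan; partition columns show up as regular columns
    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    rel = con.sql("""
        SELECT customer_id, SUM(amount) AS total
        FROM read_parquet('s3://my-bucket/events/year=*/month=*/*.parquet',
                          hive_partitioning = 1)
        WHERE year = 2023          -- pruned to matching partitions
        GROUP BY customer_id
    """)

    # Polars: lazy scan of the same layout; filters on partition columns
    # should prune files before data is downloaded
    lf = (
        pl.scan_parquet(
            "s3://my-bucket/events/**/*.parquet",
            hive_partitioning=True,
            storage_options={"aws_region": "us-east-1"},   # assumed region
        )
        .filter(pl.col("year") == 2023)
        .group_by("customer_id")
        .agg(pl.col("amount").sum().alias("total"))
    )
    df = lf.collect()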


Hmm, how are you getting acceptable performance doing that? Is your data in a delta table or something?

I've actually personally found that DuckDB is tremendously slow against the cloud, though perhaps I'm going through the wrong API?

I'm using https://duckdb.org/docs/guides/import/s3_import.

My data is hive-partitioned; when I monitor my network throughput, I only get a few MB/s with DuckDB but can achieve 1-2 GB/s through Polars.

Very possible it's a case of PEBKAC though.


Also, DuckDB allows you to convert scan results directly to both pandas and Polars dataframes.. so you can mix and match based on need.
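
For example (trivial stand-in query; needs pandas and polars installed):

    import duckdb

    rel = duckdb.sql("SELECT range AS i FROM range(5)")   # stand-in for a read_parquet(...) scan

    pdf = rel.df()    # materialize as a pandas DataFrame
    pldf = rel.pl()   # ...or as a Polars DataFrame

    # and it works the other way too: DuckDB can query in-memory dataframes
    # by variable name (replacement scans)
    duckdb.sql("SELECT sum(i) FROM pldf").fetchall()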


1. Hacker News.. absolutely my favorite daily web experience. The nerdy eclecticism of the topics, coupled with the stripped-down experience, is just unequaled on the Web, in my view.
2. Writing software.. after almost 40 years (including college), I still love learning new languages and new techniques.. still love solving problems.. after 40 years, I've even gotten somewhat passable at it.
3. Golf.. even though I'm not very good..


+1


This part is confusing to me in the doc.. I assume you're using the httpfs (S3) extension and perhaps scanning the Parquet files (which I think is actually streamed.. e.g. querying for specific column values in a series of Parquet files). We have a huge dataset of hive-partitioned Parquet files in S3 (e.g. /customerid/year/month/<series of parquet files>). Can I just scan these files using a glob pattern to retrieve data, like I can with Athena? The extension doc seems to indicate that I can (from the doc: SELECT * FROM read_parquet('s3://bucket/*/file.parquet', HIVE_PARTITIONING = 1) where year=2013;). Or do I need to know which Parquet files I'm looking for in S3 and bring them down to work on locally? If it's the former, then it seems equivalent to Athena..


No, you can definitely use globs in DuckDB.

And no, you don't have to know the exact Parquet file. You treat the hive-partitioned data as a single dataset and DuckDB will scan it automatically (partition elimination, predicate pushdown, etc. are all done automatically).

https://duckdb.org/docs/data/partitioning/hive_partitioning
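
For example, assuming the directory names follow the key=value convention that hive_partitioning expects (e.g. customerid=123/year=2013/month=07/...; the bucket name is a placeholder):

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    con.execute("SET s3_region='us-east-1'")   # assumed region; credentials from your environment

    df = con.execute("""
        SELECT *
        FROM read_parquet('s3://my-bucket/customerid=*/year=*/month=*/*.parquet',
                          hive_partitioning = 1)
        WHERE customerid = 123 AND year = 2013
    """).df()
    # only the matching customerid=123/year=2013 prefixes are read; the other
    # partitions are eliminated and remaining filters are pushed into the scan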


Ok.. thanks.. I'll try it out. I can think of a few use-cases we have where this might be a good alternative to Athena.

