cmollis's comments

Yes, that is correct (I also worked there 2000-01). Bertelsmann offered $1B to settle the suit with the labels (including the ones Bertelsmann owned, somewhat ironically) so that the subscription service we were building could go live, but this was rejected out of hand. After that, it was an existential moment for the entire industry.. until Steve Jobs pitched what Apple had been working on (iTunes). My personal recollection is that at probably no other time would the music industry have agreed to the terms Jobs wanted for the iTunes service, but there were literally no other viable options at the time, so they agreed. You know the rest.


Yes. DuckDB works very well with Parquet scans on S3 right now.


Does it work well with Hive tables storing Parquet files on S3?


we've had tornadoes in NJ..


The East Coast gets microtornadoes that deshingle roofs or tip trees over.

The Midwest gets MS Paint eraser tool tornadoes.


The East Coast gets EF3 and EF4 tornadoes.


We get about one a decade in upstate NY.


sounds a bit like what Iceberg does with writes on parquet.


edit: author here

Yeah, that's a good point. We use a copy-on-write implementation similar to many other storage systems (such as Iceberg) that offer time-travel, branching, etc., where snapshots are represented by pointers to immutable snapshot-layer or delta files (or whatever you want to call them).

In our case, we wanted to give customers a very similar, seamless experience using DuckDB on MotherDuck as they would have with local DuckDB. To provide this parity, we needed to extend DuckDB's native storage to support these capabilities (time-travel, branching, etc.). This led us to implementing Differential Storage.
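
To make that concrete, here's a toy sketch of the general copy-on-write pattern (illustrative only, not our actual implementation; all names are made up):

    # Snapshots never rewrite data; they just point at immutable layer/delta
    # files, so time-travel and branching become pointer operations.
    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Snapshot:
        snapshot_id: int
        files: tuple                # immutable data/delta files visible to this snapshot

    @dataclass
    class Catalog:
        snapshots: dict = field(default_factory=dict)   # snapshot_id -> Snapshot
        branches: dict = field(default_factory=dict)    # branch name -> snapshot_id
        next_id: int = 0

        def commit(self, branch, new_files):
            # copy-on-write commit: reuse the parent's file list, append new delta files
            parent = self.snapshots.get(self.branches.get(branch))
            files = (parent.files if parent else ()) + tuple(new_files)
            snap = Snapshot(self.next_id, files)
            self.snapshots[snap.snapshot_id] = snap
            self.branches[branch] = snap.snapshot_id
            self.next_id += 1
            return snap

        def branch(self, src, dst):
            # branching is just copying a pointer; no data is copied
            self.branches[dst] = self.branches[src]

    cat = Catalog()
    cat.commit("main", ["layer-0.parquet"])
    cat.commit("main", ["delta-1.parquet"])
    cat.branch("main", "dev")                # zero-copy branch
    cat.commit("dev", ["delta-2.parquet"])
    # time-travel = reading any older snapshot_id; its file list never changes

Reading a branch at a point in time is then just resolving a snapshot id and scanning the files it points to.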


I've been testing DuckDB's ability to scan multi-TB Parquet datasets in S3, and I have to say I've been pretty impressed with it. I've run some pretty hairy SQL (window functions, multi-table joins, etc.).. stuff that takes less time in Athena, but not by that much. Coupled with its ability to pull and join that data with information in RDBMSs like MySQL, it makes for a really compelling tool. Strangely, the least performant operations were the MySQL lookups (I had to SET GLOBAL mysql_experimental_filter_pushdown=true;). Anyway.. definitely worth another look.. I'm using v 9.2.
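
For anyone curious, the shape of what I was running looked roughly like this (bucket, table, and column names are placeholders; adjust for your own data and credentials):

    import duckdb

    con = duckdb.connect()

    # S3 access via the httpfs extension
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    con.execute("SET s3_region='us-east-1'")   # assumed region; credentials via env or SET s3_access_key_id/...

    # attach MySQL so it can be joined against the Parquet scans
    con.execute("INSTALL mysql")
    con.execute("LOAD mysql")
    con.execute("ATTACH 'host=db.example.com user=reader database=crm' AS mysqldb (TYPE MYSQL)")
    con.execute("SET GLOBAL mysql_experimental_filter_pushdown=true")

    # window function over a hive-partitioned Parquet dataset, joined to a MySQL lookup table
    df = con.execute("""
        SELECT e.customer_id,
               c.name,
               SUM(e.amount) OVER (PARTITION BY e.customer_id ORDER BY e.ts) AS running_total
        FROM read_parquet('s3://my-bucket/events/year=*/month=*/*.parquet',
                          hive_partitioning = 1) AS e
        JOIN mysqldb.customers AS c ON c.id = e.customer_id
        WHERE e.year = 2023
    """).df()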


I always thought it was easier to read code than to write it.


Other way around in my estimation.


I've read every line of a 300,000 line web application in less than a year working at a company. I doubt I could write that many lines of code in that amount of time.


300,000 lines of code doesn't seem unattainable for a single developer in a year. If I'm in the zone, I can easily bang out 5-10k lines of code over a weekend if I know what I want to write.

But that's the thing: a lot of development involves not writing any code, as you re-think your abstractions, plan the architecture of the next bits you'll write, debug what you've written, etc.

So I don't think we're necessarily talking about speed when we say that it's harder to read code than to write it. I think reading -- and truly understanding -- code (especially when it's someone else's, or even yours, that you haven't seen in a long time) can require quite a bit more mental effort than writing code.


Surely you jest; nobody writes a full shopping platform by themselves in a year. Yet I read it all and deleted the old code we didn't need.


Interesting. I usually think about it at a different scale; it's much easier to write something small and clever than it is to read and understand it later.


What language do you use? In UI code you can reach 300k LOC quite quickly.


PHP, so some of it was JS and some HTML


I've been testing Polars and DuckDB recently.. Polars is an excellent dataframe package.. extremely memory-efficient and fast. I've experienced some issues with hive-partitioning of a large S3 dataset that DuckDB doesn't have. I've scanned multi-TB S3 Parquet datasets with DuckDB on my M1 laptop, executing some really hairy SQL (stuff I didn't think it could handle).. window functions, joins to other Parquet datasets just as large, etc. Very impressive software. I haven't done the same types of things in Polars yet (just simple selects).
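
For reference, the two scans look roughly like this (paths and columns are placeholders, and the Polars cloud/hive options assume a fairly recent Polars version):

    import duckdb
    import polars as pl

    # DuckDB: glob scan; partition columns show up as regular columns
    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    rel = con.sql("""
        SELECT customer_id, SUM(amount) AS total
        FROM read_parquet('s3://my-bucket/events/year=*/month=*/*.parquet',
                          hive_partitioning = 1)
        WHERE year = 2023          -- pruned to matching partitions
        GROUP BY customer_id
    """)

    # Polars: lazy scan of the same layout; filters on partition columns
    # should prune files before data is downloaded
    lf = (
        pl.scan_parquet(
            "s3://my-bucket/events/**/*.parquet",
            hive_partitioning=True,
            storage_options={"aws_region": "us-east-1"},   # assumed region
        )
        .filter(pl.col("year") == 2023)
        .group_by("customer_id")
        .agg(pl.col("amount").sum().alias("total"))
    )
    df = lf.collect()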


Hmm, how are you getting acceptable performance doing that? Is your data in a delta table or something?

I've actually personally found that DuckDB is tremendously slow against the cloud, though perhaps I'm going through the wrong API?

I'm using https://duckdb.org/docs/guides/import/s3_import.

My data is hive-partitioned; when I monitor my network throughput, I only get a few MB/s with DuckDB but can achieve 1-2 GB/s through Polars.

Very possible it's a case of PEBKAC though.


Also, DuckDB allows you to convert scan results directly to both pandas and Polars dataframes.. so you can mix and match based on need.
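
For example (trivial stand-in query; needs pandas and polars installed):

    import duckdb

    rel = duckdb.sql("SELECT range AS i FROM range(5)")   # stand-in for a read_parquet(...) scan

    pdf = rel.df()    # materialize as a pandas DataFrame
    pldf = rel.pl()   # ...or as a Polars DataFrame

    # and it works the other way too: DuckDB can query in-memory dataframes
    # by variable name (replacement scans)
    duckdb.sql("SELECT sum(i) FROM pldf").fetchall()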


1. Hacker News.. absolutely my favorite daily web experience. The nerdy eclecticism of the topics, coupled with the stripped-down experience, is just unequaled on the Web, in my view.
2. Writing software.. after almost 40 years (including college), I still love learning new languages and new techniques.. still love solving problems.. after 40 years, I've even gotten somewhat passable at it.
3. Golf.. even though I'm not very good..


+1


This part is confusing to me in the doc.. I assume you're using the httpfs (S3) extension and perhaps scanning the Parquet files (which I think is actually streamed.. e.g. querying for specific column values in a series of Parquet files). We have a huge dataset of hive-partitioned Parquet files in S3 (e.g. /customerid/year/month/<series of parquet files>). Can I just scan these files using a glob pattern to retrieve data, like I can with Athena? The extension doc seems to indicate that I can (from the doc: SELECT * FROM read_parquet('s3://bucket/*/file.parquet', HIVE_PARTITIONING = 1) where year=2013;). Or do I need to know which Parquet files I'm looking for in S3 and bring them down to work on locally? If it's the former, then it seems equivalent to Athena..


No, you can definitely use globs in DuckDB.

And no, you don't have to know the exact Parquet file. You treat the hive-partitioned data as a single dataset and DuckDB will scan it automatically (partition elimination, predicate pushdown, etc. are all done automatically).

https://duckdb.org/docs/data/partitioning/hive_partitioning
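
For example, assuming the directory names follow the key=value convention that hive_partitioning expects (e.g. customerid=123/year=2013/month=07/...; the bucket name is a placeholder):

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    con.execute("SET s3_region='us-east-1'")   # assumed region; credentials from your environment

    df = con.execute("""
        SELECT *
        FROM read_parquet('s3://my-bucket/customerid=*/year=*/month=*/*.parquet',
                          hive_partitioning = 1)
        WHERE customerid = 123 AND year = 2013
    """).df()
    # only the matching customerid=123/year=2013 prefixes are read; the other
    # partitions are eliminated and remaining filters are pushed into the scan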


Ok.. thanks.. I'll try it out. I can think of a few use-cases we have where this might be a good alternative to Athena.

