Thanks for clarifying my poorly worded description, that’s exactly what I meant. Like in the example given, the difference is 10-4=6, let’s call this the naive_greedy_miss_factor. Can we choose three other denominations so that NGMF is > 6?
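A brute-force sketch of how you could search for such denominations (the name `ngmf` is just my shorthand, and this checks every amount up to a bound rather than proving anything):

```python
def greedy_count(denoms, amount):
    # Greedy: repeatedly take the largest coin that still fits.
    count = 0
    for c in sorted(denoms, reverse=True):
        count += amount // c
        amount %= c
    return count if amount == 0 else float("inf")

def optimal_count(denoms, amount):
    # Classic DP for the true minimum number of coins.
    best = [0] + [float("inf")] * amount
    for a in range(1, amount + 1):
        for c in denoms:
            if c <= a and best[a - c] + 1 < best[a]:
                best[a] = best[a - c] + 1
    return best[amount]

def ngmf(denoms, max_amount):
    # Largest greedy-vs-optimal gap over all amounts up to max_amount.
    return max(greedy_count(denoms, a) - optimal_count(denoms, a)
               for a in range(1, max_amount + 1))
```

For example, `ngmf((1, 3, 4), 10)` is 1 (at amount 6, greedy takes 4+1+1 while the optimum is 3+3); to answer the question you'd sweep candidate triples and keep the max.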
I wouldn't trust a taxi driver's predictions about the future of economics and society, why would I trust some database developer's? Actually, I take that back. I might trust the taxi driver.
The point is that you don't have to "trust" me, you need to argue with me: we need to discuss the future. That way we can form ideas that help us judge whether one politician or another is right when we are called to vote. We can also form stronger ideas to try to influence people who right now have only a vague understanding of what AI is and what it could become. We are the ones who will vote and choose our future.
Life is too short to have philosophical debates with every self promoting dev. I'd rather chat about C style but that would hurt your feelings. Man I miss the days of why the lucky stiff, he was actually cool.
Sorry boss, I'm just tired of the debate itself. It assumes a certain level of optimism, while I'm skeptical that meaningfully productive applications of LLMs etc. will be found once hype settles, let alone ones that will reshape society like agriculture or the steam engine did.
Whether it is a taxi driver or a developer, when someone starts from flawed premises, I can either engage and debate or tune out and politely humor them. When the flawed premises are deeply ingrained political beliefs it is often better to simply say, "Okay buddy. If you say so..."
We've been over the topic of AI employment doom several times on this site. At this point it isn't a debate. It is simply the restating of these first principles.
You shouldn't care about the "who" at all; you should look at the arguments. If the taxi driver doesn't know anything real, it should be plainly obvious, and you can say so with arguments rather than attacking the person's background. In fact, your comment commits one of the most common logical fallacies (ad hominem), and combines several at once.
What you're really saying is that the database presented in OP is not useful because it only handles DQL.
1. SQL can be thought of as being composed of several smaller languages: DDL, DQL, DML, DCL.
2. columnq-cli is only a CLI to a query engine, not a database. As such, it only supports DQL by design.
3. I have the impression that outside of data engineering/DBA, people are rarely taught the distinction between OLTP and OLAP workloads [1]. The latter often utilizes immutable data structures (e.g. columnar storage with column compression), or provides limited DML support, see e.g. the limitations of the DELETE statement in ClickHouse [2], or the list of supported DML statements in Amazon Athena [3]. My point -- as much as this tool is useless for transactional workloads, it is perfectly capable of some analytical workloads.
I'm curious why you think of DAX as a virtue. My poor SQL-shaped peg brain has never really fit the DAX hole of MS software.
Also, it always struck me as something too complex for the non-technical folks, and not expressive enough for tech-literate analysts/data engineers &c.
Just options. VBA is there as well. Excel's virtue is not specializing in any specific task, but being versatile enough to express a multitude of business solutions. 'Excel is my database' wasn't always a punchline.
That's more than an equally domain-specific product like Qlik, and more than a specific vendor tool like Tableau. And anyway, if PowerBI didn't have a pain point it wouldn't be an MS product.
That's what I do at $dayjob whenever I have to do windowing &c. Figuring out this stuff in Pandas is a waste of time. Before I discovered DuckDB, I would re-learn the API every damn time.
I came up with a little utility function, which you can implement yourself :)
Years of unpicking others' use of R's sqldf (which by default used to copy the entire data frame to a SQLite db, run the query, then copy the result set back) whenever they complained their R code was too slow have taught me a visceral, negative reaction to the name and the pattern.
Glad to see DuckDB finally delivering on the promise of running SQL against in-memory dataframes.
I like interface-only packages in the Julia ecosystem: e.g., Tables.jl enables the development of several packages for querying tabular data that work across many concrete implementations, and Plots.jl separates the high-level plotting interface from the plotting backend.
It's true. I've spent a small but nontrivial amount of time learning and using Polars, but it's just a nonstarter for most work projects. Not only does no one else know it exists, let alone how to use it, but it doesn't integrate with (to my knowledge) any ETL or ML Python library. You have to convert to pandas or NumPy, which is costly and to some extent defeats the purpose.
The to-NumPy conversion is free if you don't have missing data, which is usually the case by the time you send it over to an ML library.
Even if it's not zero-copy, it's still not a big deal. Pandas makes a lot more copies internally. I truly wouldn't worry about that single copy if you're getting an order-of-magnitude speedup overall.
I stand corrected. The conversion felt relatively slow to me, but it was a large dataset and there were definitely missing values. Overall the benefits to speed and API cleanliness might be worth it, though it feels a bit gross to convert Spark to pandas to Polars to NumPy to DMatrix.
That said, it’s so much better than pandas for data manip that I’ll probably still try to use it.
Are you the author? If so, thanks for being so responsive on GitHub. You fixed basically every issue I had almost immediately back when I was learning Polars. It was awesome.
Yep, that's me. Glad to help. :) There's still room for parallelization when converting to a matrix; I'll take a look. I haven't given that conversion much effort yet because it's often a one-time conversion at the end of a pipeline.
I don't know DuckDB but polars could dethrone pandas. We're planning on using it to create our pipeline. Ibis-project is another solution if anyone wants to check it out.
I haven't touched pandas in months, but I also found it quite tiring to deal with.
Does your setup allow for an end-to-end solution? I mean, can I sink time into that setup and feel like I have everything I need for regular data wrangling?
I'm sure Pandas is amazing, but as a newbie I found myself writing a lot of transformation logic with plain Python data structures because it's just so much easier.
Maybe I'm dumb but going around the docs sometimes was like :/
Author of the post and siuba here. I'm pretty interested in exploring polars as a backend, and, if that works well, versions of the SQL backends that translate to SQL based on the polars method API :).
(I haven't really used it, but it looks promising)
Hey, I love siuba. Haven't had a chance to use it much but it scratches an itch for me. For years I've grumbled about how Python isn't flexible enough to accommodate tidyverse style libraries, as it lacks pipes and lazy evaluation (or macros), but siuba has managed to be very nice to use.