
The main reason for not using causal inference is not that data scientists don't know about the different approaches or can't imagine something equivalent (there is a lot of reinvention); forecasting is one of the most common tasks, after all.

The main reason is that they generally work for software companies where it's easier, and less susceptible to analyst influence, to implement the suggested change and test it with a Randomized Controlled Trial. I remember running an analysis that found that gender was a significant explanatory factor for behavior on our site; my boss asked (dismissively): What can we do with that information? If there is an assumption about how things work that doesn't translate to a product change, that insight isn't useful; if there is a product intuition, testing the product change itself is key, and there's no reason to delay that.

There are cases where RCTs are hard to organize (for example, multi-sided platform businesses) or where changes can't be tested in isolation (major brand changes). Those tend to benefit from the techniques described there, and they tend to have dedicated teams. But this is a classic case of a complicated tool that doesn't fit most use cases.



Causal inference is actually also really hard to benchmark. A colleague of mine started an effort to be able to properly reproduce and compare results. The algorithms also often do not scale well.

Every time we wanted to use this on real data, it was just a little bit too much effort, and the results were not conclusive because it is hard to verify huge graphs. My colleague, for example, wanted to apply it to explain risk confounders in investment funds.

I personally also do not like the definition of causality they base it on.


One way to test this is through a placebo test, where you shift the treatment, such as moving it to an earlier date, which I have seen used successfully in practice. Another approach is to test the sensitivity of each feature, which is often considered more of an art than a science. In practice, I haven't observed much success with this method.
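
For illustration, here is a minimal sketch of that date-shifting placebo check, assuming a daily metric in a pandas DataFrame and a deliberately naive pre/post difference-in-means estimator; all names and numbers are made up:

    import numpy as np
    import pandas as pd

    def naive_effect(df, treatment_date):
        # Difference in mean metric after vs. before the (real or placebo) date.
        post = df.loc[df["date"] >= treatment_date, "metric"].mean()
        pre = df.loc[df["date"] < treatment_date, "metric"].mean()
        return post - pre

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "date": pd.date_range("2024-01-01", periods=120, freq="D"),
        "metric": rng.normal(100, 5, 120),
    })

    real_date = pd.Timestamp("2024-03-01")
    print("effect at the real date:", naive_effect(df, real_date))

    # Shift the "treatment" to earlier dates where nothing happened; if these
    # placebo effects rival the real one, the estimate is picking up trend or
    # noise rather than the change itself.
    for placebo_date in pd.date_range("2024-01-15", "2024-02-15", freq="7D"):
        print(f"placebo effect at {placebo_date.date()}:",
              naive_effect(df, placebo_date))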


You don't need to look at a graph at all though, right? There are plenty of tests that can help you identify factors that could be significantly affecting your distribution.


If you want to make causal inferences you really do have to look at a graph that includes both observed and probable unobserved causes to get any real sense of what’s going on. Automated methods absent real thinking about the data generating process are junk.


“Graph” here means the directed acyclic graph encoding the causal relationships, not a chart of a distribution.
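
For instance, a toy version of such a graph, sketched with networkx and entirely made-up variable names, might look like this:

    import networkx as nx

    # Hypothetical example: income confounds both ad exposure and purchases.
    g = nx.DiGraph()
    g.add_edges_from([
        ("income", "ad_exposure"),    # confounder -> treatment
        ("income", "purchase"),       # confounder -> outcome
        ("ad_exposure", "purchase"),  # treatment  -> outcome (effect of interest)
    ])
    assert nx.is_directed_acyclic_graph(g)

    # The graph, not any chart, is what tells you what to adjust for: here you
    # would condition on "income" to block the back-door path
    # ad_exposure <- income -> purchase.
    print(sorted(g.predecessors("ad_exposure")))  # ['income']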


You can only select among features that you have measured.


Go on, please. What definition, and algorithms with scaling problems?


A/B experiments are definitely a gold standard, as they provide true causality measurement (if implemented correctly). However, they are often expensive to run: you need to implement the feature in question (which has less than a 50% chance of working out) and then collect data for 1-4 weeks before being able to make the decision. As a result, only a small number of business decisions today rely on A/B tests. Observational causal inference can help bring causality into many of the remaining decisions, which need to be made quicker or cheaper.
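
As a rough illustration of why the data collection alone takes weeks, here is a back-of-the-envelope sample-size calculation using the standard two-proportion approximation; the baseline rate, detectable lift, and traffic numbers are purely hypothetical:

    from scipy import stats

    baseline = 0.05        # hypothetical current conversion rate
    mde = 0.005            # minimum detectable absolute lift (0.5 pp)
    alpha, power = 0.05, 0.80

    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    p_bar = baseline + mde / 2

    # Standard two-proportion sample-size approximation, per arm.
    n_per_arm = 2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / mde ** 2

    daily_users_per_arm = 5_000  # hypothetical traffic split
    print(f"~{n_per_arm:,.0f} users per arm, "
          f"i.e. ~{n_per_arm / daily_users_per_arm:.0f} days of data")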


The “gold standard” has failure modes that seem to be ignored.

E.g.: making UI elements jump around unpredictably after a page load may increase the number of ad clicks simply because users can’t reliably click on what they actually wanted.

I see A/B testing turning into a religion where it can’t be argued with. “The number went up! It must be good!”


That's generally because the metrics you are looking at do not represent what users care about. That's a different problem from the testing methodology, and it's often overlooked and a lot more important.

I’ve argued that A/B testing training should focus on that skill a lot more than Welch’s theory, but I had to record my own classes for that to happen.


But those metrics are hard to move, so you target secondary metrics.

The problem with that strategy becomes obvious when you spell out the consequences: measurably improving the product is hard, so you measure something else and hope you get product improvements.


There can be a real ethical dilemma when applying A/B testing in a medical setting. Placing someone with an incurable disease in the control group is condemning them to death, while in the treatment group they might have a chance. On the other hand, without a proper A/B testing methodology the drug's efficacy cannot be established. So far no perfect solution to the dilemma has been found.


> in a control group

The control group gets the current standard treatment, not nothing (in case that was a source of confusion). Plus they typically don't have to pay for it which is a benefit for them.

Large trials today will typically conduct interim analyses and will have pre-defined guidelines for when to stop the trial because the new treatment is either clearly providing a benefit or is clearly futile.

Here is an example of such a study: https://www.ahajournals.org/doi/10.1161/CIRCHEARTFAILURE.111...
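
To see why those stopping guidelines have to be pre-specified rather than ad hoc, here is a rough simulation sketch (illustrative numbers only) of how naively "peeking" at a fixed 5% threshold after every interim batch inflates the false-positive rate:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    n_sims, n_looks, n_per_look = 2_000, 5, 200

    false_positives = 0
    for _ in range(n_sims):
        # No true effect in either arm.
        control = rng.normal(0.0, 1.0, n_looks * n_per_look)
        treated = rng.normal(0.0, 1.0, n_looks * n_per_look)
        for look in range(1, n_looks + 1):
            k = look * n_per_look
            _, p = stats.ttest_ind(treated[:k], control[:k], equal_var=False)
            if p < 0.05:  # naive rule: stop and declare an effect at any look
                false_positives += 1
                break

    # Comes out well above the nominal 5%; pre-defined group-sequential
    # boundaries (e.g. O'Brien-Fleming) keep the overall error rate controlled.
    print(f"false-positive rate with naive peeking: {false_positives / n_sims:.2f}")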


Most therapeutic trials nowadays are "intent to treat", so subjects receive either the standardized tx or the experimental tx in the randomization. Many of them also have crossovers, such that when a measurable benefit (as defined by the protocol) is seen, subjects on the standard tx can be moved over to the experimental arm.


It's not really an ethical dilemma until you know it works, and then usually if the evidence is strong enough they'll cut the trial early.


All the alternative methods require the same sacrifice. More importantly, most suggested treatments fail to cure deadly conditions or have major side effects or risks that are just as unethical to thrust upon people untested.

If you look at it properly, i.e. evaluate what your actions should be before the test (do nothing, impose an untested treatment, or test with a proper control to learn what to do for the majority of the population), the answer is rarely ambiguous.

There is a debate to be had on how much pre-clinical work to be done before clinical testing, but those are increasingly automated, cheap, and fast, so we often reach the point where a double-blind test is the next logical step.

The argument you present is based on either an unwarranted confidence in treatments, or information that wasn’t available when the decision had to be made.


You can end the trial early when it's clear the treatment is working. This just happened last week with Ozempic for kidney disease caused by diabetes. https://www.wxyz.com/news/health/ask-dr-nandi/novo-nordisk-e...


Causal inference is useful, but it's neither quicker nor cheaper.


Agree that it is hard today. A person you might know is trying to prove that it doesn't have to be: https://www.motifanalytics.com/blog/bringing-more-causality-...

We’d love to chat more with you on the topic - feel free to hit Sean or me on LinkedIn.


I am a big fan of what Sean and you are trying to do; I wrote up a chapter about it this weekend, actually. But I'm worried that you both have worked for companies where a lot of work had already been done to identify relevant dimensions (metrics and categories), and where automating causality (or rather, estimating factors on a pre-existing causal graph, because that's the sleight of hand the word "causality" does) only made sense once that level of maturity had been reached.

But to reach that point, before having relevant dimensions, there has to be a lot of work, generally motivated by disappointing experiments. “Why didn’t that work?” is often answered by “Because our goal is too remote from our actions—here’s a better proxy” or “Because this change only makes sense to 8% of our users, here’s how we can split them.”

I'm worried that too many people will think the tool itself is enough, rather than a complement to that maturity in understanding a company's users. This 'solutionism' is widespread among data tools: https://www.linkedin.com/posts/bertilhatt_the-potential-gap-...


Thank you for clarifying.

Reading some of your posts I think we agree more than disagree. A big difference from most new analytics tools you see today is that we don't want to provide a magic "solution" (which is bound to over-promise and under-deliver) but rather a generic tool to quickly define and try out different business categories on the data.

Followed you on LinkedIn for more in-depth takes.


It is likely to be cheaper and quicker to run a counterfactual test in the computer than in real life.

The question is how reliable it is.


> As a result only a small number of business decisions today rely on a/b tests.

The default for all code changes at Netflix is they’re A/B tested.


An expensive test is better than an expensive mistake :) At the scale of hundreds of decisions made with the inherent biases of the product/biz/ops teams, that misalignment in direction can be catastrophic.


You can apply it to estimate the impact of any business decision if you have data, so it's not only IT companies that can benefit from it. However, a problem arises when the results don't align with the business's expectations. I have firsthand experience with projects being abandoned simply because the results didn't meet expectations.



