I've been self-teaching in the causal inference space for a long time, and model evaluation is a concern for me. My biggest concern is falsification of hypotheses. In ML, you have a clear mechanism to check estimation/prediction through holdout approaches. In classical statistics, you have model metrics that can be used to define reasonable rejection regions for hypothesis tests. But causal inference doesn't seem to have this, outside traditional model fit metrics or ML holdout assessment. So is the only way a model gets deemed acceptable through prior biases?
If my understanding is right, this means that each model has to be hand-crafted, adding significant technical debt to complex systems, and we can't get ahead of the assessment. And yet, it's probably the only way forward for viable AI governance.
> In ML, you have a clear mechanism to check estimation/prediction through holdout approaches.
To be clear, you can overfit while your validation loss does not increase. If your train and test data are too similar, then no holdout will help you measure generalization. You have to remember that datasets are proxies for the thing you're actually trying to model; they are not the thing itself. You can usually see this when testing on in-class but out-of-distribution data, i.e., data outside your train/test distribution (e.g. data from someone else).
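To make that concrete, here's a minimal sketch (everything below is synthetic and made up): the iid holdout says the model generalizes, while in-class data from a different source says otherwise, because the model leaned on a shortcut feature that only exists in one collection pipeline.

```python
# Toy sketch (all synthetic): an iid holdout looks great while data from a
# different source, same classes, does not -- because the model latched onto
# a shortcut feature that only exists in one collection pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_data(n, spurious=True):
    # true signal lives in the first two features
    X = rng.normal(size=(n, 5))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)
    if spurious:
        # a shortcut feature that tracks the label in *this* pipeline only
        X[:, 4] = y + rng.normal(scale=0.1, size=n)
    return X, y

X, y = make_data(2000, spurious=True)          # "our" data
X_ext, y_ext = make_data(500, spurious=False)  # someone else's data, same task

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("iid holdout accuracy:     ", clf.score(X_te, y_te))    # looks excellent
print("external-source accuracy: ", clf.score(X_ext, y_ext))  # noticeably worse
```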
You have to be careful, because there are a lot of small and non-obvious things that can fuck up statistics. There are all kinds of aggregation "paradoxes" (Simpson's, Berkson's) and other things that can creep in, and this gets more perilous the bigger your model is. The story of the Monty Hall problem is a great example of how easy it is to get the wrong answer while it seems like you're doing all the right steps.
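A toy version of Simpson's, just to show how little it takes (the numbers are purely illustrative):

```python
# Simpson's paradox in a few rows: treatment A has the better recovery rate
# within every severity group, yet B looks better in the aggregate, because
# A was mostly given to severe cases and B to mild ones.
import pandas as pd

df = pd.DataFrame({
    "severity":  ["mild", "mild", "severe", "severe"],
    "treatment": ["A", "B", "A", "B"],
    "recovered": [81, 234, 192, 55],
    "total":     [87, 270, 263, 80],
})
df["rate"] = df["recovered"] / df["total"]

# per-group rates: A wins in both severity groups
print(df.pivot(index="severity", columns="treatment", values="rate"))

# aggregate rates: B wins overall
agg = df.groupby("treatment")[["recovered", "total"]].sum()
print(agg["recovered"] / agg["total"])
```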
As for the article, the author is far too handwavy about causal inference. The reason we tend not to do it is that it is fucking hard and it scales poorly. Models like Autoregressive models (careful here) and Normalizing Flows can do causal inference (and causal discovery) fwiw (essentially you need explicit density models with tractable densities: I'm referring to Goodfellow's taxonomy). But things get funky once you have a lot of variables, because there are indistinguishable causal graphs (see Hyvärinen and Pajunen).

Then there are also the issues with the types of causality (see Judea Pearl's Ladder of Causation), and counterfactual inference is FUCKING HARD, but the author just acts like it's no big deal. Then he starts conflating it with weaker forms of causal inference. Correlation is the weakest form of causation, despite our oft-chanted saying that "correlation does not equal causation" (which is still true; it's just that correlation sits in that class, and the saying is really getting at confounding variables). This very much does not scale. Similarly, discovery won't scale, because you have to permute so many variables in the graph. The curse of dimensionality hits causal analysis HARD.
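To put a number on "discovery won't scale": just counting the candidate DAGs on n variables (Robinson's recurrence, before you even get to equivalence classes or the statistics of any tests) already blows up super-exponentially.

```python
# Number of labeled DAGs on n nodes via Robinson's recurrence:
#   a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k(n-k)) * a(n-k),  a(0) = 1
from math import comb

def num_dags(n, _cache={0: 1}):
    if n not in _cache:
        _cache[n] = sum(
            (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
            for k in range(1, n + 1)
        )
    return _cache[n]

for n in range(1, 11):
    print(n, num_dags(n))
# 3 nodes: 25 graphs; 5 nodes: 29,281; 10 nodes: ~4.2e18
```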
To be clear, the mechanism for checking ML doesn't really check ML. There's really little value in a confidence interval conditional on the same experimental conditions that produced the dataset on which the model is trained. I'd often say it's actively harmful, since it's mostly misleading.
Insofar as causal inference has no such 'check', it's because there never was any. Causal inference is about dispelling that illusion.
> Insofar as causal inference has no such 'check', it's because there never was any. Causal inference is about dispelling that illusion.
Aye, and that's the issue I'm trying to understand. How to know if model 1 or model 2 is more "real" or, for lack of a better term, more useful and reflective of reality?
We can focus on a particular philosophical point, like parsimony / Occam's razor, but as far as I can tell that isn't always sufficient.
There should be some way to determine the likelihood that a model's structure is right, beyond "trust me, it works!" If there is, I'm trying to understand it!
> How to know if model 1 or model 2 is more "real" or, for lack of a better term, more useful and reflective of reality?
I just want to second MJ's points here. You have to remember that 1) all models are wrong and 2) it's models all the way down. Your data is a model: it models the real-world distribution, what we might call the target distribution, which is likely intractable and often very different from your data under various conditions. Your metrics are models: obviously, given the previous point, but also in the less obvious sense that even with perfect data they are still models. Your metrics all have limitations, and you must be careful to understand clearly what they are measuring, rather than what you think they are measuring. This is an issue of alignment, and the vast majority of people do not consider precisely what their metrics mean and instead rely on the general consensus (a great ML example: FID does not measure fidelity, it is a distance measure between distributions. But you shouldn't stop there; that's just the start). These things get especially fuzzy in higher dimensions, where geometries are highly non-intuitive. It is best to remember that metrics are guides, not targets (Goodhart).
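Since FID came up: a bare-bones version of it on stand-in features makes the point. It is the Fréchet distance between two Gaussians fitted to feature sets, so it scores distributions; a "generator" that returns memorised real samples has perfect per-sample fidelity and still gets a terrible FID. (The usual Inception feature extractor is omitted entirely; the features here are synthetic.)

```python
# FID = ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2)),
# computed between Gaussians fitted to the two feature sets.
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_a, feats_b):
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    eps = 1e-6 * np.eye(cov_a.shape[0])        # ridge keeps sqrtm well-behaved
    covmean = sqrtm((cov_a + eps) @ (cov_b + eps))
    if np.iscomplexobj(covmean):               # numerical noise can go complex
        covmean = covmean.real
    return float(np.sum((mu_a - mu_b) ** 2) + np.trace(cov_a + cov_b - 2 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 64))
fresh = rng.normal(size=(1000, 64))                  # same distribution
memorised = real[:5][rng.integers(0, 5, size=1000)]  # real samples, collapsed distribution

print(fid(real, fresh))      # small: distributions match
print(fid(real, memorised))  # large: every sample is "real", distribution is wrong
```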
> There should be some way to determine the likelihood that a model's structure is right, beyond "trust me, it works!" If there is, I'm trying to understand it!
I mean, we can use likelihood ;) if we model density, of course. But that's not the likelihood that your model is the correct model; it's the likelihood that, given the data you have, your model's parameterization can reasonably model the sampling distribution of that data. These are subtly different, and the difference is the point from above. And then we've got to know whether you're actually operating in the right number of dimensions. Are you approximating PCA like a typical VAE? Is the bottleneck enough for proper parameterization? Is your data in sufficient dimensionality? Does the fucking manifold hypothesis even hold for your data? What about the distribution assumption? IID? And don't get me started on indistinguishability in large causal graphs (references in another comment).
So in practice it's best just to try to make a model that is robust to your data, but to always maintain suspicion of it. After all, all models are wrong, and you're trying to model the process behind the data, not just have a model of the dataset itself.
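For what it's worth, here's the likelihood point in code: held-out log-likelihood ranks parameterizations by how well they fit the sampling distribution of this data, and nothing more (toy data, arbitrary model choices).

```python
# Held-out log-likelihood scores fit to the sampling distribution; it does
# not tell you whether any of these models is structurally "the" right one.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# data actually produced by a lognormal-ish process
data = np.exp(0.6 * rng.normal(size=(3000, 1)))
train, held_out = data[:2000], data[2000:]

for k in (1, 2, 8):
    gm = GaussianMixture(n_components=k, random_state=0).fit(train)
    # mean log-likelihood per sample on data the fit never saw
    print(k, "components:", round(gm.score(held_out), 3))
# more components fit the sampling distribution better; none of them says
# anything about the mechanism that generated the data
```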
Evaluation is fucking hard (it is far too easy to make mistakes)
I always love finding out there are others in the tech world who care about the nuance around evaluation math and not just benchmarks. Often it feels like I'm alone. So thank you!
In general, you can't, and most of reality isn't knowable. That's a problem with reality, and with us.
I'd take a Bayesian approach across an ensemble of models, based on the risk of each being right/wrong.
Consider whether Drug A causes or cures cancer. If there's some circumstantial evidence of it causing cancer at rate X in population Y with risk factors Z -- and otherwise broad circumstantial evidence of it curing at rate A in pop B with features C...
then what? Then create various scenarios under these (likely contradictory) assumptions. Formulate an appropriate risk. Derive some implied policies.
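A toy version of that, with every number invented purely for illustration: weight the contradictory hypotheses, then compare policies by expected harm.

```python
# Two contradictory causal hypotheses about Drug A, a prior weight on each,
# and the (made-up) harm of each policy under each hypothesis.
hypotheses = {
    # name: (prior weight, harm if we APPROVE, harm if we RESTRICT)
    "A causes cancer in risk group Z": (0.3, 80.0, 5.0),
    "A cures cancer in population B":  (0.7, 0.0, 40.0),
}

def expected_harm(policy_index):
    return sum(weight * harms[policy_index]
               for weight, *harms in hypotheses.values())

print("approve :", expected_harm(0))   # 0.3*80 + 0.7*0  = 24.0
print("restrict:", expected_harm(1))   # 0.3*5  + 0.7*40 = 29.5
# the policy comes from the risk-weighted ensemble, not from pretending one
# model is simply "true"
```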
This is the reality of how almost all actual decisions are made in life, and necessarily so.
The real danger is when ML is used to replace that, and you end up with extremely fragile systems that automate actions of unknown risk -- on the basis that they were "99.99% accurate", i.e. they considered uncontrolled experimental condition E1 and not E2...E10_0000, which actually occur.
> How to know if model 1 or model 2 is more "real" or, for lack of a better term, more useful and reflective of reality?
You don't. Given observational data alone, it's typically only possible to determine which d-separation equivalence class you're in. Identifying the exact causal structure requires intervening experimentally.
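A small simulation of what landing in the same equivalence class means in practice (linear-Gaussian toy, arbitrary coefficients): the chain X -> Y -> Z and its reversal fit the observational data equally well, so no fit metric will separate them.

```python
# Fit the same observational data under two Markov-equivalent factorizations:
# p(x) p(y|x) p(z|y)  versus  p(z) p(y|z) p(x|y).
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(size=n)                          # data really comes from X -> Y -> Z
y = 0.8 * x + rng.normal(scale=0.6, size=n)
z = -0.5 * y + rng.normal(scale=0.7, size=n)

def gauss_ll(target, *parents):
    """Mean log-likelihood of a fitted linear-Gaussian conditional."""
    X = np.column_stack(parents + (np.ones_like(target),)) if parents \
        else np.ones((len(target), 1))
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    var = (target - X @ beta).var()
    return -0.5 * (np.log(2 * np.pi * var) + 1)

ll_forward  = gauss_ll(x) + gauss_ll(y, x) + gauss_ll(z, y)  # X -> Y -> Z
ll_backward = gauss_ll(z) + gauss_ll(y, z) + gauss_ll(x, y)  # X <- Y <- Z
print(ll_forward, ll_backward)  # identical up to floating point
```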
> There should be some way to determine the likelihood that a model's structure is right
Why? If the information isn't there, it isn't there. No technique can change that.
Acyclic structure on the variables is a very strong presupposition that, honestly, does not describe many systems in engineering well, so I don't like this idea of boiling causality down solely to DAG-dependent phrases like "d-separation" or "exact causal structure". Exact causal structure, a.k.a. actual causality, is particular to one experimental run under one intervention.
D-separation still works for cyclic graphs, it just can't rule out causal relationships between variables that lie on the same cycle. And neither can any other functional-form-agnostic method, because in general feedback loops really do couple everything to everything else.
More rigorously: given a graph G for a structural equation model S, construct a DAG G' as follows
- Find a minimal subgraph C_i transitively closed under cycle membership (so a cycle, all the cycles it intersects, all the cycles they intersect, and so on)
- Replace each C_i with a complete graph C'_i on the same number of vertices, preserving outgoing edges.
- Add edges from the parents of any vertices in C_i (if not in C_i themselves) to all vertices in C'_i
- Repeat until acyclic
d-separation in G' then entails independence in S, given reasonable smoothness assumptions whose details I don't remember off the top of my head.
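A rough networkx transcription of those steps, reading "transitively closed under cycle membership" as a non-trivial strongly connected component and "complete graph" as all pairwise directed edges; how the mutual within-block edges are then treated (bidirected arcs vs. some ordering, and how that squares with "repeat until acyclic") is part of the detail elided above, so treat it as illustrative only.

```python
# Sketch of the cycle-collapsing construction, under the assumptions stated
# above. The blocks it produces mark exactly the variables whose pairwise
# causal relationships can't be ruled out (they share a cycle).
import networkx as nx

def collapse_cycles(G: nx.DiGraph) -> nx.DiGraph:
    H = G.copy()
    for scc in nx.strongly_connected_components(G):
        if len(scc) < 2:
            continue                      # no cycle through a lone vertex
        outside_parents = {p for v in scc for p in G.predecessors(v)} - scc
        # every pair inside the block becomes adjacent
        H.add_edges_from((u, v) for u in scc for v in scc if u != v)
        # every outside parent of any block member now feeds the whole block;
        # outgoing edges are already preserved by the copy of G
        H.add_edges_from((p, v) for p in outside_parents for v in scc)
    return H

# feedback loop x <-> y, plus u -> x and y -> z
G = nx.DiGraph([("u", "x"), ("x", "y"), ("y", "x"), ("y", "z")])
print(sorted(collapse_cycles(G).edges()))
# [('u', 'x'), ('u', 'y'), ('x', 'y'), ('y', 'x'), ('y', 'z')]
```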
This isn't a quality of fit issue (and even if it were, linear models are not always sufficient). The problem is that different causal structures can entail the same set of correlations, which makes them impossible to distinguish through observation alone.
Grandparent commenter here -- I'm glad I communicated my concern sufficiently; I feel like you and mjburgess have nailed it. Fit metrics alone aren't sufficient to determine whether a model is appropriate to use (even ignoring the issues of p-hacking and other ills).