Causal Inference and Missing Data Group at Inria
22.04.2024
Sven Klaassen
University of Hamburg
Economic AI
Jan Teichert-Kluge
University of Hamburg
Philipp Bach
University of Hamburg
Economic AI
Victor Chernozhukov
Massachusetts Institute of Technology
Martin Spindler
University of Hamburg
Economic AI
\[ \begin{align} Y &= \theta_0 D + g_0(X) + \varepsilon, & \mathbb{E}[\varepsilon | X, D] = 0 \label{eq:plr1} \\ D &= m_0(X) + \vartheta, & \mathbb{E}[\vartheta | X] = 0 \label{eq:plr2} \end{align} \]
Frisch-Waugh-Lovell style approach: \(\theta_0\) can be consistently estimated by partialling out \(X\), i.e., by regressing the residual \(Y - \mathbb{E}[Y|X]\) on the residual \(D - \mathbb{E}[D|X]\)
Neyman Orthogonality \[ \left.\partial_\eta \mathbb{E}[\psi(W; \theta_0, \eta)] \right|_{\eta=\eta_0} = 0 \] ensures that the moment condition identifying \(\theta_0\) is insensitive to small perturbations of the nuisance function \(\eta\) around \(\eta_0\)
Assumptions for the nuisance elements (see Chernozhukov et al. (2018)): \(\lVert \hat{m} - m_0 \rVert_{P,2} \times \big( \lVert \hat{m} - m_0 \rVert_{P,2} + \lVert \hat{\ell} - \ell_0\rVert_{P,2}\big) \le \delta_N N^{-1/2}\)
Under some regularity conditions, the estimator \(\hat{\theta}\) concentrates in a \(1/\sqrt{n}\)-neighborhood of \(\theta_0\) and \[ \sqrt{n}(\hat{\theta} - \theta_0) \to \mathcal{N}(0,\sigma^2) \]
\(\Rightarrow\) a simple approach to control for confounders, as long as they can be used as inputs to ML models!
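A minimal sketch of this workflow with the `DoubleML` Python package (the synthetic data, column names, and random-forest learners below are illustrative assumptions, not the setup used in the paper):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from doubleml import DoubleMLData, DoubleMLPLR

# Illustrative synthetic data: outcome y, treatment d, confounders x1..x5
rng = np.random.default_rng(42)
n, p = 1000, 5
X = rng.normal(size=(n, p))
d = X[:, 0] + rng.normal(size=n)
y = 0.5 * d + X[:, 0] ** 2 + rng.normal(size=n)
df = pd.DataFrame(np.column_stack([y, d, X]),
                  columns=["y", "d"] + [f"x{i}" for i in range(1, p + 1)])

# Wrap the data and pass ML learners for the nuisance functions
# l_0(X) = E[Y|X] and m_0(X) = E[D|X] (learners passed positionally)
dml_data = DoubleMLData(df, y_col="y", d_cols="d")
dml_plr = DoubleMLPLR(dml_data,
                      RandomForestRegressor(n_estimators=200),
                      RandomForestRegressor(n_estimators=200),
                      n_folds=5)
dml_plr.fit()
print(dml_plr.summary)  # estimate, standard error and confidence interval for theta_0
```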
Use pretrained models for unstructured data, e.g., BERT and BEiT
Basic models for tabular data seem sufficient
Monitor the nuisance losses
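A sketch of how such pretrained representations can be obtained with Hugging Face `transformers` (the checkpoint and the mean pooling are assumptions; for images, `BeitImageProcessor`/`BeitModel` work analogously):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Pretrained text encoder; BEiT can be loaded the same way for image inputs
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

texts = ["Great movie, I loved every minute.", "Terrible plot and wooden acting."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state   # (batch, tokens, hidden_dim)
embeddings = hidden.mean(dim=1).numpy()            # mean-pooled document embeddings

# These embeddings can be fed into the nuisance learners for l_0 and m_0,
# or the encoder can be fine-tuned end-to-end as part of the nuisance models.
```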
We generate a semi-synthetic dataset with a known treatment effect parameter
To generate credible confounding, it is based on the labels or outcomes of the corresponding supervised learning task
We use the following datasets:
| Modality | Dataset | Target \(\tilde{X}\) | Control \(X\) |
|---|---|---|---|
| Tabular | DIAMONDS Wickham (2016) | \(\log(\text{Price})\) | Carat, Cut, Color, Clarity, … |
| Text | IMDB Maas et al. (2011) | Sentiment | Review Text |
| Image | CIFAR-10 Krizhevsky (2009) | Label | Image |
We generate a semi-synthetic dataset according to the underlying PLR model \[ \begin{align} Y &= \theta_0 D + \tilde{g}_0(\tilde{X}) + \varepsilon, \\ D &= \tilde{m}_0(\tilde{X}) + \vartheta, \end{align} \] where \(\tilde{X}= (\tilde{X}_{\text{tab}}, \tilde{X}_{\text{txt}}, \tilde{X}_{\text{img}})\) with the following additive structure \[ \begin{align} \tilde{g}_0(\tilde{X}) &= \sum_{\text{mod}\in\{\text{tab},\text{txt},\text{img}\}} \tilde{g}_{\text{mod}}(\tilde{X}_{\text{mod}}) \\ \tilde{m}_0(\tilde{X}) &= \sum_{\text{mod}\in\{\text{tab},\text{txt},\text{img}\}} \tilde{m}_{\text{mod}}(\tilde{X}_{\text{mod}}) \end{align} \] and \(\varepsilon, \vartheta \sim \mathcal{N}(0, 1)\).
The effect on the outcome \(Y\) is generated via a standardized version of the target variable to balance the confounding impact of all modalities: \[ \begin{align} \tilde{g}_{\text{mod}}(\tilde{X}_{\text{mod}}) = \frac{\tilde{X}_{\text{mod}} - \mathbb{E}[\tilde{X}_{\text{mod}}]}{\sigma_{\tilde{X}_{\text{mod}}}}, \quad \text{mod}\in\{\text{tab}, \text{txt}, \text{img}\} \end{align} \]
Further, to ensure strong confounding, the impact on the treatment \(D\) is defined via: \[ \begin{align} \tilde{m}_{\text{mod}}(\tilde{X}_{\text{mod}}) = -\tilde{g}_{\text{mod}}(\tilde{X}_{\text{mod}}), \quad \text{mod}\in\{\text{tab}, \text{txt}, \text{img}\} \end{align} \]
The treatment effect is set to \(\theta_0=0.5\) with \(n=50{,}000\) samples in the dataset
Both \(\tilde{g}_0(X)\) and \(\tilde{m}_0(X)\) are rescaled to ensure a signal-to-noise ratio of \(2\) for \(Y\) and \(D\) (given unit variances of the error terms)
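A schematic numpy sketch of this data-generating process (the placeholder per-modality targets and the exact rescaling convention for the signal-to-noise ratio are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_0, n = 0.5, 50_000

# Placeholder per-modality targets (in the study: log price, sentiment, image label)
targets = {"tab": rng.normal(size=n),
           "txt": rng.integers(0, 2, size=n).astype(float),
           "img": rng.integers(0, 10, size=n).astype(float)}

# Standardize each target so every modality contributes comparable confounding
g_mod = {k: (v - v.mean()) / v.std() for k, v in targets.items()}
g0 = sum(g_mod.values())
m0 = -g0                                   # strong confounding: m_mod = -g_mod

# Rescale so that Var(signal) / Var(error) = 2 with unit-variance errors
g0 *= np.sqrt(2.0 / g0.var())
m0 *= np.sqrt(2.0 / m0.var())

D = m0 + rng.normal(size=n)                # treatment equation
Y = theta_0 * D + g0 + rng.normal(size=n)  # outcome equation
```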
Results of Simulation Study. Reported: mean ± SD over five random train-test splits
| | Baseline | Embedding | Deep |
|---|---|---|---|
| \(r^2(Y, \hat{\ell}_0)\) | \(0.31 \pm 0.01\) | \(0.87 \pm 0.02\) | \(\mathbf{0.90 \pm 0.01}\) |
| \(r^2(D, \hat{m}_0)\) | \(0.31 \pm 0.01\) | \(0.87 \pm 0.02\) | \(\mathbf{0.90 \pm 0.01}\) |
| \(\hat{\theta}\) | \(-0.32 \pm 0.01\) | \(\mathbf{0.28 \pm 0.01}\) | \(0.27 \pm 0.01\) |
Higher = better (best in bold)
| Variable | Description |
|---|---|
| Sales Rank | Sales rank as weighted mean for the last 30 days |
| Price | Price as weighted mean for the last 30 days |
| Text | Combination of Title, Category, Description, etc. |
| Image | Image of the product |
Continuous Variables
Categorical Variables
Scatterplots with OLS Regression of Tabular Variables and \(\ln(Q)\)
\[ \ln(Q) = \theta_0 \ln(P) + g_0(X) + \varepsilon \]
\(\Rightarrow\) The causal parameter \(\theta_0\) can be interpreted as price elasticity of demand!
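For example, an estimate of \(\hat{\theta}_0 \approx -0.11\) (as obtained with the tabular \(\texttt{DoubleMLPLR}\) specification below) would mean that a 1% price increase is associated with roughly a 0.11% decrease in quantity demanded, holding the confounders fixed.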
OLS (tabular controls only):

| 2.5 % | \(\hat{\theta}\) | 97.5 % |
|---|---|---|
| \(-0.072\) | \(-0.046\) | \(-0.019\) |
DoubleMLPLR (tabular controls only):

| 2.5 % | \(\hat{\theta}\) | 97.5 % |
|---|---|---|
| \(-0.132\) | \(-0.1098\) | \(-0.080\) |
| Key | Used Confounders |
|---|---|
| \(\texttt{img}\) | \(X = (X_{img})\) |
| \(\texttt{imgtab}\) | \(X = (X_{img}, X_{tab})\) |
| \(\texttt{txtimg}\) | \(X = (X_{txt}, X_{img})\) |
| \(\texttt{txt}\) | \(X = (X_{txt})\) |
| \(\texttt{txttab}\) | \(X = (X_{txt}, X_{tab})\) |
| \(\texttt{txtimgtab}\) | \(X = (X_{txt}, X_{img}, X_{tab})\) |
\(\Rightarrow\) The aim is to emphasize the benefits of utilizing unstructured data.
| Model | Covariates | \(R^2_{l_0}\) | \(R^2_{m_0}\) | \(\hat{\theta}_0\) |
|---|---|---|---|---|
| \(\texttt{OLS}\) | \(X=X_{tab}\) | \(0.3300\) | - | \(-0.0455\) |
| \(\texttt{DoubleMLPLR}\) | \(X=X_{tab}\) | \(0.5986\) | \(0.1884\) | \(-0.1098\) |
| \(\texttt{DoubleMLPLRDeep}\)¹ | \(X=(X_{tab}, X_{txt}, X_{img})\) | \(0.5990\) | \(0.6765\) | \(-0.2794\) |
In case you have questions or comments, feel free to contact me
Example products from the dataset: "Kinsmart Set of 4 McLaren 720s Toy | …", "Pixar Cars Mack Uncle Lightning…"
The naive approach minimizes the following MSE
\[\begin{align} \min_{\theta} \mathbb{E}[(Y - D\theta - g_0(X))^2] \end{align}\]
This implies the following moment equation
\[\begin{align} \mathbb{E}[\underbrace{(Y - D\theta_0 - g_0(X))D}_{=:\psi (W, \theta_0, \eta)}]&=0 \end{align}\]
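Note that this naive score is not Neyman orthogonal: its Gateaux derivative with respect to the nuisance function \(g\) at \(g_0\) is \[ \partial_r \mathbb{E}\big[(Y - D\theta_0 - g_0(X) - r\,\delta g(X))D\big]\Big|_{r=0} = -\mathbb{E}[D\,\delta g(X)] = -\mathbb{E}[m_0(X)\,\delta g(X)], \] which is non-zero in general, so first-order estimation errors in \(\hat{g}\) feed directly into the estimate of \(\theta_0\).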
The partialling-out approach, in contrast, minimizes
\[\begin{align} \min_{\theta} \mathbb{E}\big[\big(Y - \mathbb{E}[Y|X] - (D-\mathbb{E}[D|X])\theta\big)^2\big] \end{align}\]
which implies
\[\begin{align} \mathbb{E}\big[\underbrace{\big(Y - \mathbb{E}[Y|X] - (D-\mathbb{E}[D|X])\theta_0\big)(D-\mathbb{E}[D|X])}_{=:\psi (W, \theta_0, \eta)}\big]&=0 \end{align}\]
\[\begin{align} \psi (W, \theta_0, \eta) = & (Y - D\theta_0 - g_0(X))D \end{align}\]
Regression adjustment score
\[\begin{align} \eta &= g(X), \\ \eta_0 &= g_0(X). \end{align}\]
\[\begin{align} \psi (W, \theta_0, \eta_0) = \big((Y- \mathbb{E}[Y|X]) - (D-\mathbb{E}[D|X])\theta_0\big)(D-\mathbb{E}[D|X]) \end{align}\]
Neyman-orthogonal score (Frisch-Waugh-Lovell)
\[\begin{align} \eta &= (\ell(X), m(X)), \\ \eta_0 &= ( \ell_0(X), m_0(X)), \\ &= ( \mathbb{E} [Y \mid X], \mathbb{E}[D \mid X]). \end{align}\]
Inference is based on a score function \(\psi(W; \theta, \eta)\) that satisfies the moment condition \[\mathbb{E}[\psi(W; \theta_0, \eta_0)] = 0,\]
where \(W:=(Y,D,X,Z)\) and \(\theta_0\) is its unique solution. The score obeys the Neyman orthogonality condition \[\left.\partial_\eta \mathbb{E}[\psi(W; \theta_0, \eta)] \right|_{\eta=\eta_0} = 0.\]
\(\partial_{\eta}\) denotes the pathwise (Gateaux) derivative operator
Neyman orthogonality ensures that the moment condition identifying \(\theta_0\) is insensitive to small perturbations of the nuisance function \(\eta\) around \(\eta_0\)
Using a Neyman-orthogonal score eliminates the first order biases arising from the replacement of \(\eta_0\) with a ML estimator \(\hat{\eta}_0\)
PLR example: Partialling-out score function \[\psi(\cdot)= (Y-E[Y|X]-\theta (D - E[D|X]))(D-E[D|X])\]
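A quick check confirms Neyman orthogonality of this score: perturbing the nuisance components \(\ell\) and \(m\) in directions \(\delta\ell\) and \(\delta m\) and differentiating the averaged score at \((\ell_0, m_0)\) gives \[\begin{align} \partial_r \mathbb{E}\big[\psi\big(W;\theta_0,(\ell_0 + r\,\delta\ell,\ m_0)\big)\big]\Big|_{r=0} &= -\mathbb{E}\big[\delta\ell(X)\,(D - m_0(X))\big] = 0, \\ \partial_r \mathbb{E}\big[\psi\big(W;\theta_0,(\ell_0,\ m_0 + r\,\delta m)\big)\big]\Big|_{r=0} &= \mathbb{E}\big[\delta m(X)\,\big(\theta_0(D - m_0(X)) - \varepsilon\big)\big] = 0, \end{align}\] where both terms vanish because \(\mathbb{E}[D - m_0(X) \mid X] = 0\) and \(\mathbb{E}[\varepsilon \mid X, D] = 0\).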
The nuisance parameters are estimated with high-quality (fast-enough converging) machine learning methods.
Different structural assumptions on \(\eta_0\) lead to the use of different machine-learning tools for estimating \(\eta_0\) Chernozhukov et al. (2018) (Section 3)
Rate requirements depend on the causal model and the orthogonal score, e.g., the product rate condition shown earlier (see Chernozhukov et al. (2018))
To avoid the biases arising from overfitting, a form of sample splitting is used at the stage of producing the estimator of the main parameter \(\theta_0\).
Efficiency gains by using cross-fitting (swapping roles of samples for train / hold-out)
There exist regularity conditions, such that the DML estimator \(\tilde{\theta}_0\) concentrates in a \(1/\sqrt{N}\)-neighborhood of \(\theta_0\) and the sampling error is approximately \[\sqrt{N}(\tilde{\theta}_0 - \theta_0) \sim N(0, \sigma^2),\] with \[\begin{align}\begin{aligned}\sigma^2 := J_0^{-2} \mathbb{E}(\psi^2(W; \theta_0, \eta_0)),\\J_0 = \mathbb{E}(\psi_a(W; \eta_0)).\end{aligned}\end{align}\]
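A minimal numpy/scikit-learn sketch of the cross-fitted partialling-out estimator and the plug-in standard error implied by these formulas (the learners are placeholders; the `doubleml` package performs these steps internally):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_plr_crossfit(y, d, X, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimator for the PLR model (float numpy arrays)."""
    res_y, res_d = np.zeros_like(y), np.zeros_like(d)
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        # Nuisance fits l_hat(X) ~ E[Y|X] and m_hat(X) ~ E[D|X] on the training folds
        l_hat = RandomForestRegressor().fit(X[train], y[train])
        m_hat = RandomForestRegressor().fit(X[train], d[train])
        # Out-of-fold residuals on the held-out fold
        res_y[test] = y[test] - l_hat.predict(X[test])
        res_d[test] = d[test] - m_hat.predict(X[test])
    theta_hat = np.sum(res_d * res_y) / np.sum(res_d ** 2)
    # sigma^2 = J_0^{-2} E[psi^2] with psi = (res_y - theta*res_d)*res_d and J_0 = -E[res_d^2]
    psi = (res_y - theta_hat * res_d) * res_d
    J0 = -np.mean(res_d ** 2)
    sigma2 = np.mean(psi ** 2) / J0 ** 2
    return theta_hat, np.sqrt(sigma2 / len(y))
```

Applied to data from the semi-synthetic design above, `dml_plr_crossfit(Y, D, features)` should recover an estimate close to \(\theta_0 = 0.5\) whenever the features carry the confounding information.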
DoubleML Deep