# Estimating Treatment Effects After Weighting with Multiply Imputed Data

Multiply imputed data always makes things a little harder. Essentially, you have to perform each step of the analysis in each imputed dataset and then combine the results together in a special way. For basic regression analysis, the `mice`

package makes fitting models and combining estimates simple. But when we want to do propensity score matching or weighting before fitting our regression models, and when the quantity we want to estimate is not just a coefficient in a regression model, things get a bit harder.

For doing matching or weighting in multiply imputed data, the R package `{MatchThem}`

does the job. It essentially provides wrappers for `MatchIt::matchit()`

and `WeightIt::weightit()`

for multiply imputed data. It extends `{mice}`

’s functionality for fitting regression models in multiply imputed data by automatically incorporating the matched or weighted structure into the estimation of the outcome models. It uses `mice::pool()`

to pool estimates across multiply imputed data.

But for estimating treatment effects, it’s often not as simple as using a regression coefficient. If we include covariates in our outcome model but want a marginal effect, we need to use an average marginal effects procedure (i.e., g-computation) to compute it within each imputed dataset, and then combine the results afterward. The `{marginaleffects}`

package provides a wonderful interface for performing g-computation, but for multiply imputed data, it can require some programming by the analyst. In this guide, I’ll show you how to do that programming to combine treatment effect estimates across multiple imputed datasets.

An alternative to using `{marginaleffects}`

is to use the `{clarify}`

package. `{clarify}`

can also be used to perform g-computation, but it uses simulation-based inference to compute the uncertainty bounds for the estimate. An advantage of simulation-based inference for multiply imputed data is that combining estimates across imputed datasets is much more straightforward. In this guide, I’ll also show you how to use `{clarify}`

to combine treatment effect estimates across imputed datasets.

### Packages we’ll need

We will need the following packages for this demonstration: `cobalt`

, `mice`

, `MatchThem`

, `WeightIt`

, `marginaleffects`

, and `clarify`

.

### The data

As usual, we’ll be using a version of the `lalonde`

dataset. Here will use the `lalonde_mis`

dataset in `{cobalt}`

, which has missing values.

```
data("lalonde_mis", package = "cobalt")
summary(lalonde_mis)
```

```
## treat age educ race married nodegree re74 re75 re78
## Min. :0.0000 Min. :16.00 Min. : 0.00 black :243 Min. :0.0000 Min. :0.0000 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.:0.0000 1st Qu.:20.00 1st Qu.: 9.00 hispan: 72 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 238.3
## Median :0.0000 Median :25.00 Median :11.00 white :299 Median :0.0000 Median :1.0000 Median : 984.5 Median : 585.4 Median : 4759.0
## Mean :0.3013 Mean :27.36 Mean :10.27 Mean :0.4158 Mean :0.6303 Mean : 4420.2 Mean : 2170.3 Mean : 6792.8
## 3rd Qu.:1.0000 3rd Qu.:32.00 3rd Qu.:12.00 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 7626.9 3rd Qu.: 3202.0 3rd Qu.:10893.6
## Max. :1.0000 Max. :55.00 Max. :18.00 Max. :1.0000 Max. :1.0000 Max. :35040.1 Max. :25142.2 Max. :60307.9
## NA's :20 NA's :40 NA's :39
```

You can see there are some missing values in `married`

, `re74`

, and `re75`

.

### Imputing the data

Here, we’ll use `{mice}`

to impute the data. Although typically something like 20 imputation is sufficient, for the method `{clarify}`

uses, it needs way more, so we’ll use 50. We’ll use the default settings, but you should tailor the imputation to fit the needs of your dataset. (I always like to use a machine learning method for my imputations). We’ll also set a seed to ensure replicability.

```
library("mice")
set.seed(12345)
imp <- mice(lalonde_mis, m = 50, printFlag = FALSE)
```

`mice()`

returns a `mids`

object, which contains the imputed datasets. Although we could extract the datasets using `complete()`

, we’ll supply this object directly to our function for estimating the propensity score weights.

### Weighting the imputed data

We’ll use `MatchThem::weightthem()`

to estimate propensity score weights in the imputed datasets. We could also use `MatchThem::matchthem()`

to do matching; the process is basically identical^{1}. Here we’ll use logistic regression (🤢) to estimate ATT weights to keep things quick and simple.

```
library("MatchThem")
w.imp <- weightthem(treat ~ age + educ + race + married + nodegree +
re74 + re75, data = imp, method = "ps",
estimand = "ATT")
```

Let’s assess balance using `{cobalt}`

.

```
library("cobalt")
bal.tab(w.imp, stats = c("m", "ks"), abs = TRUE)
```

```
## Balance summary across all imputations
## Type Mean.Diff.Adj Max.Diff.Adj Mean.KS.Adj Max.KS.Adj
## prop.score Distance 0.0235 0.0379 0.1166 0.1327
## age Contin. 0.1120 0.1343 0.3053 0.3146
## educ Contin. 0.0352 0.0485 0.0369 0.0412
## race_black Binary 0.0024 0.0036 0.0024 0.0036
## race_hispan Binary 0.0003 0.0007 0.0003 0.0007
## race_white Binary 0.0022 0.0030 0.0022 0.0030
## married Binary 0.0168 0.0236 0.0168 0.0236
## nodegree Binary 0.0191 0.0250 0.0191 0.0250
## re74 Contin. 0.0097 0.0281 0.2027 0.2261
## re75 Contin. 0.0075 0.0286 0.1388 0.1648
##
## Average effective sample sizes across imputations
## 0 1
## Unadjusted 429. 185
## Adjusted 100.19 185
```

Balance could be a bit better on `age`

, but we’re going to move on because we have things to do.

### Fitting the outcome models

Our next step is to fit the outcome model in each imputed dataset. Here, our outcome will be `re78 == 0`

, i.e., whether a unit’s earnings in 1978 were 0. Ideally, treatment reduces this risk. Although our estimand will be a risk ratio, because we’re doing g-computation, we can fit a model for the outcome that actually makes sense rather than choosing one based on the convenient interpretation of its coefficients. So, we’ll fit a probit outcome model to really hit home that we need a post-estimation method to estimate our quantity of interest and can’t rely on our model coefficients.

Although `{MatchThem}`

has functionality for fitting models to the imputed datasets that incorporate the weights, for our purposes, it is better to extract the imputed datasets and fit each model manually in a loop. We’ll use `glm()`

to do so, though the `{MatchThem}`

and `{WeightIt}`

documentation may recommend `survey::svyglm()`

because it correctly computes the robust standard errors. We’ll do that later using `{marginaleffects}`

and `{clarify}`

functions so it’s okay that we don’t do it now. We’ll use a quasi-binomial model because we have weights.

```
fits <- lapply(complete(w.imp, "all"), function(d) {
glm(I(re78 == 0) ~ treat + age + educ + married + race +
nodegree + re74 + re75, data = d,
weights = weights, family = quasibinomial("probit"))
})
```

If we wanted to interpret the pooled coefficients from our outcome model (and we had included correct estimation of the standard errors, which we didn’t here), we could use `pool(fits) |> summary()`

to get them. But none of that is true here so we’ll move on and save the pooling till after we estimate the quantity of interest.

## The `{marginaleffects}`

workflow

Now we have our list of models. Our next step is to estimate the ATT risk ratio in each one (with the correct standard error) and pool the results. If the only quantity we want is the treatment effect, this is easy. We can use `marginaleffects::avg_comparisons()`

on each model and then use `mice::pool()`

to pool the results. In our call to `avg_comparisons()`

, we need to subset the data used to fit each model to just the treated units and supply this to `newdata`

, supply the name of the variable containing the weights to `wts`

^{2}, supply the robust standard error type (HC3) to `vcov`

, and specify that we want the log risk ratio of the average estimated potential outcomes by supplying `"lnratioavg"`

to `transform_pre`

^{3}.

```
library("marginaleffects")
comp.imp <- lapply(fits, function(fit) {
avg_comparisons(fit, newdata = subset(fit$data, treat == 1),
variables = "treat", wts = "weights", vcov = "HC3",
transform_pre = "lnratioavg")
})
pooled.comp <- mice::pool(comp.imp)
```

Finally, we can use `summary()`

on the resulting object, adding the arguments `conf.int = TRUE`

to request confidence intervals and `exponentiate = TRUE`

to get the risk ratio from the log risk ratio.

```
summary(pooled.comp, conf.int = TRUE,
exponentiate = TRUE)
```

```
## term contrast estimate std.error statistic df p.value 2.5 % 97.5 %
## 1 treat ln(mean(1) / mean(0)) 0.9321569 0.2097534 -0.3349366 610.5055 0.737788 0.617436 1.407298
```

We find a risk ratio of approximately 0.932, 95% CI: [0.617, 1.407], indicating that in our sample, the risk of having zero earnings in 1978 decreased slightly for those who received treatment, but we don’t have strong evidence for such an effect in the population.

Although this is nice and simple, things get a bit more complicated when we want to estimate multiple comparisons at the same time, estimate the marginal risks, or perform a more complex analysis. Additional programming is required to make `mice::pool()`

compatible with these more complex quantities. We’ll demonstrate how to hack `{marginaleffects}`

to make it work using the instructions in the `{marginaleffects}`

vignette on multiple imputation.

We’ll be using `avg_predictions()`

on each model to compute the marginal risks under each treatment level, which uses a similar syntax to `comparisons()`

. The challenge comes in that `avg_predictions()`

produces two rows of output (one for each treatment level), which are not correctly distinguished by `mice::pool()`

. So, we’ll have to create a new custom class and write a new `tidy()`

method for our class.

First, we’ll generate our marginal risks and assign the output our new class, which is arbitrary but which I will call `"pred_imp_custom"`

.

```
pred.imp <- lapply(fits, function(fit) {
out <- avg_predictions(fit, newdata = subset(fit$data, treat == 1),
variables = "treat", wts = "weights",
vcov = "HC3", by = "treat")
# the next line assigns our custom class
class(out) <- c("pred_imp_custom", class(out))
return(out)
})
```

Next, we’ll write our new `tidy()`

method. (Make sure to replace `treat`

everywhere you see it with the name of your treatment variable.) We won’t actually be using this function at all; it is called internally by `mice::pool()`

.

```
tidy.pred_imp_custom <- function(x, ...) {
out <- marginaleffects:::tidy.predictions(x, ...)
out$term <- paste("treat =", out$treat)
return(out)
}
```

Finally, we can use `mice::pool()`

and `summary()`

to get our marginal risks:

`mice::pool(pred.imp) |> summary(conf.int = TRUE)`

```
## term estimate std.error statistic df p.value 2.5 % 97.5 %
## 1 treat = 0 0.2607090 0.04264062 6.114100 609.4350 1.734761e-09 0.1769686 0.3444494
## 2 treat = 1 0.2430092 0.03197686 7.599534 611.9484 1.120645e-13 0.1802115 0.3058069
```

Taking the ratio of these risks gives us the risk ratio we computed above.

Note that you have to customize the `tidy()`

method in a slightly different way when you are estimating treatment effects in subgroups. I’ll leave that as an exercise to the reader, or you can hire me to do it for you :)

## The `{clarify}`

workflow

The `{clarify}`

workflow for multiply imputed data is very similar to its workflow for regular data. How simulation-based inference works broadly is that sets of parameters are drawn from a distribution after fitting the model; this distribution is often assumed to be multivariate normal with the mean vector equal to the estimated coefficients and the covariance equal to the asymptotic covariance matrix of the coefficients. Many (e.g., 1000) sets of coefficients are drawn, and a quantity of interest is computed using each set, forming a “posterior” distribution of the quantity of interest. This posterior is then used for inference: its standard deviation can be used as the quantity’s standard error, and its quantiles can be used as confidence intervals. For more information on this methodology, see the `{clarify}`

website and its references.

With multiply imputed data, this process is done for the model fit to each imputed dataset, and then the distributions of the quantities of interest are simply combined to form a single distribution, which is used for inference. In Bayesian terms, this would be called “mixing draws”. The variance of this mixture distribution approaches the variance of the estimate computed using Rubin’s rules when the number of imputations is high.

To use `{clarify}`

, we supply the list of fitted models to `clarify::misim()`

, which draws the coefficients from their implied distributions from each model. We also need to specify the method for computing the covariance matrix (here, using the same HC3 robust covariance we used with `{marginaleffects}`

to account for the weights). We will only request 200 replications per fitted model since we have 50 imputations, which gives us 10,000 replicates (likely more than enough for stable inference).

```
library("clarify")
sim.imp <- misim(fits, n = 200, vcov = "HC3")
sim.imp
```

```
## A `clarify_misim` object
## - 10 coefficients, 50 imputations with 200 simulated values each
## - sampled distributions: multivariate t(604)
```

(Note: because we used a quasi-binomial model, a scaled t-distribution was used to draw the coefficients. In practice this will give similar draws to a normal distribution.)

The output of `misim()`

is then fed to a function for computing the quantity of interest in each draw; here, we’ll be using `clarify::sim_ame()`

, which is appropriate for computing marginal risks in a subset of the data (i.e., the ATT risk ratio). We supply the treatment variable to `var`

and subset the data to just the treated units using `subset`

to request the ATT. Although we can use the `contrast`

argument to request the (log) risk ratio, we can compute that afterward quickly from the marginal risks. (Using `cl = 3`

uses parallel computing with 3 cores but only if you are on a Mac. See the `sim_ame()`

documentation for more information on how to use the `cl`

argument.)

```
sim.att <- sim_ame(sim.imp, var = "treat",
subset = treat == 1, cl = 3,
verbose = FALSE)
sim.att
```

```
## A `clarify_est` object (from `sim_ame()`)
## - Average marginal effect of `treat`
## - 10000 simulated values
## - 2 quantities estimated:
## E[Y(0)] 0.2605322
## E[Y(1)] 0.2428401
```

To compute the risk ratio, we can use `transform()`

:

`sim.att <- transform(sim.att, RR = `E[Y(1)]`/`E[Y(0)]`)`

Finally, we can compute out confidence intervals and p-values around the estimated marginal risks and risk ratio using `summary()`

:

`summary(sim.att, null = c(RR = 1))`

```
## Estimate 2.5 % 97.5 % P-value
## E[Y(0)] 0.261 0.187 0.354 .
## E[Y(1)] 0.243 0.188 0.313 .
## RR 0.932 0.630 1.421 0.76
```

Here, we find a risk ratio of approximately 0.932, 95% CI: [0.63, 1.421]. The estimates, confidence intervals, and p-values we get from the two methods line up well.

By default, `{clarify}`

uses quantile-based confidence intervals and computes the p-values by inverting them (i.e., finding the largest confidence level that yields an interval that excludes the null value and computing the p-value as one minus that level). Wald confidence intervals and p-values can also be request by setting `method = "wald"`

in the call to `summary()`

, but these are only recommended if the quantity has a normal distribution (which the risk ratio does not).

### Explaining differences between the approaches

Both the delta method- and simulation-based inference approaches are valid, but sometimes you will get results that disagree. The estimates of the quantities of interest may disagree because of how `mice::pool()`

and `clarify::sim_ame()`

combine estimates across imputations.

Rubin’s rules involve simply taking the mean of the estimates across imputations. This works well when the quantity is collapsible, linear, or has a symmetric (ideally normal) distribution. If the quantity of interest is none of those but can be transformed from a quantity that does have those properties, Rubin’s rules can be apply to this intermediate quantity before transforming the estimate to get the final results. This is exactly what we did in the `{marginaleffects}`

workflow when we computed the log risk ratio before pooling and then exponentiating the pooled log risk ratio to arrive at the risk ratio. If we had gone straight into pooling the risk ratio, the resulting estimate might not have been consistent.

`{clarify}`

works by first using Rubin’s pooling rules on the model coefficients, which we assume to be normally distributed, and then computing the quantity of interest in each imputed dataset using draws from the pooled coefficients. A benefit of this strategy is that we don’t have to wonder whether the quantity of interest satisfies the above properties. The resulting estimates will be consistent because no pooling is done on them; the pooling happens only in the first step.

Confidence intervals may differ slightly between the two methods, and this could be due to two reasons: 1) the delta method and simulation-based inferences naturally compute confidence intervals in different ways, with the delta method using a first-order Taylor series approximation and assuming normality of the quantity of interest, and simulation-based inference using simulation to generate a “posterior” for the quantity of interest and using its quantiles as the interval; and 2) simulation-based inference requires many imputations for the variance of the posterior to equal the variance of the Rubin’s rules pooled estimate. More imputations is always better for both methods, so do as many as you can.

How should you choose between the delta method and simulation-based inference? Use whichever will get you published, of course! (Just kidding.) Use the one you find most trustworthy, that your audience will find the most trustworthy, and that balances the assumptions you are willing to make with the desired precision of the estimate. You might also use the one that seems more natural to you, either conceptually or in terms of usability. Frankly, I find `{clarify}`

to be easier to use when the quantity of interest is more complicated than a single comparison (e.g., for subgroup analysis or for computing average marginal risks), but `{marginaleffects}`

can be faster, doesn’t rely on a stochastic process, and is better-backed by statistical theory. Confirming you get similar results with both methods is always a good idea, and the plotting diagnostics in `{clarify}`

can be used to determine whether any difference might be due to the failure of the delta method due to violation of one of its assumptions.

The key differences is that pair membership needs to be accounted for in estimation of the variance of the outcome model coefficients; this is usually as simply as specifying

`vcov = ~subclass`

to functions in`{marginaleffects}`

or`{clarify}`

.↩︎This actually isn’t necessary for the ATT but it’s walys good practice.↩︎

Note: we need the log risk ratio because Rubin’s pooling rules don’t apply to the risk ratio but do to the log risk ratio. We will exponentiate the log risk ratio and its confidence interval after pooling.↩︎