What's New in `WeightIt` Version 1.0.0

My R package WeightIt has a huge new update, one of the biggest since I started the project. Version 1.0.0 introduces a few breaking changes, including the possibility that old results will not align with results from newer versions of the package. In most cases, though, this only means improvements (i.e., better balance).

For those who don’t know, WeightIt is an R package designed to provide access to propensity score weighting (also known as inverse probability weighting) and its variations in a way that facilitates the use of advanced and modern methods and best practices. It provides a simple, unified interface to many different methods of estimating weights to balance groups in observational studies, including those that use generalized linear models, machine learning methods, and convex optimization. WeightIt provides support for binary, multi-category, continuous, and longitudinal treatments and directly interfaces with cobalt for assessing balance after weighting.

Version 1.0.0 has several new features that I wanted to explain in a blog post rather than just in the package’s NEWS document, because it’s important to me that these new features are appreciated. There are three primary updates that I’ll discuss, along with some minor ones. Those three are updates to the covariate balancing propensity score, a new method called inverse probability tilting, and new support for fitting weighted outcome regression models that account for estimation of the weights in the standard errors.

New implementation of CBPS

The covariate balancing propensity score (CBPS, Imai and Ratkovic 2014) is a method of estimating propensity scores using generalized linear models (e.g., logistic regression). I described CBPS in my last blog post, so I’ll be brief here. Essentially, you have the score equations for a logistic regression, the roots of which are the logistic regression coefficients. You also have moment conditions that correspond to mean balance on the included covariates in the weighted sample. There are two versions of CBPS: the just-identified version, which finds the logistic regression coefficients that solve only the balance moment conditions, and the over-identified version, which finds the logistic regression coefficients that attempt to solve both the logistic regression score equations and the balance moment conditions. Because in general it is impossible to solve both sets of conditions exactly with a single set of coefficients, generalized method of moments (GMM) estimation is used to weight the conditions in a specific way that manages the trade-off between achieving balance and maximizing the likelihood. This weighting takes the form of a weighting matrix, which is a function of the coefficients and can either be estimated once, in what is called the “two-step” estimator, or be continuously updated as the coefficients are estimated.
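
For readers who want to see the conditions written out, here is a sketch for a binary treatment and the ATE (the notation is mine, not the package’s). Writing $\pi_i = \pi(X_i; \beta)$ for the propensity score given covariates $X_i$ and treatment $A_i$, the balance moment conditions are

$$ g(\beta) = \frac{1}{n} \sum_{i=1}^n \left( \frac{A_i}{\pi_i} - \frac{1 - A_i}{1 - \pi_i} \right) X_i, $$

and the logistic regression score equations are $s(\beta) = \frac{1}{n} \sum_{i=1}^n (A_i - \pi_i) X_i$. The just-identified CBPS solves $g(\beta) = 0$ exactly, while the over-identified CBPS stacks $g$ and $s$ into $\bar{g}(\beta)$ and minimizes the GMM objective

$$ \hat{\beta} = \arg\min_\beta \; \bar{g}(\beta)^\top W(\beta)^{-1} \bar{g}(\beta), $$

with $W(\beta)$ the weighting matrix just described.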

There are versions of CBPS for the ATT, ATC, and ATE (logistic regression on its own is a CBPS for the ATO), as well as versions for multi-category and continuous treatments. The multi-category version replaces the logistic regression coefficients with coefficients in a multinomial logistic regression, and the continuous treatment version replaces the logistic regression coefficients with linear regression coefficients for the generalized propensity score (Fong, Hazlett, and Imai 2018). There is also a version for longitudinal treatments (Imai and Ratkovic 2015), but that has never been supported by WeightIt so I won’t discuss it here.

Previously, estimating the CBPS propensity scores and weights by specifying method = "cbps" in weightit() was done entirely in the CBPS package, which was written by the developers of the method. The package is great and does what it claims to do, but its code is very hard to read and debug, it is not very customizable, and there are ways in which the package is limited or buggy. I decided to finally figure out what CBPS was for myself as part of a broader task of understanding M-estimation (discussed later), and in doing so I realized I had the skills to program my own CBPS, and so that’s what I did, taking some inspiration from the original package. This means you can use CBPS in WeightIt without requiring CBPS as a dependency, and there are some additional options not available in CBPS that are now available to WeightIt users. I’ll outline the main changes below:

  • The default methods are now different, which is important when comparing the output of weightit() to that of CBPS::CBPS() or versions of WeightIt prior to 1.0.0. Now, the default is to use the just-identified version of CBPS, whereas previously the default was the over-identified version. The just-identified version has several benefits: 1) it is faster (since the over-identified version has to fit the just-identified version first anyway), 2) it yields better balance on the means, and 3) it is compatible with using M-estimation to account for uncertainty in the weights. It is still possible to request the over-identified version by setting over = TRUE, and whether the two-step estimator is used can be controlled by the twostep argument, which retains its previous default (TRUE, which is faster but potentially less accurate). See the example after this list.

  • With multi-category treatments, the ATT and ATE can be targeted natively and the estimation supports any number of treatment groups. Previously, CBPS for the ATT and for more than 4 groups were only supported in an ad hoc way because these options are not available in the CBPS package. Now they are programmed directly in, so the results for them align with what one would expect based on theory. It should also be much faster to run because only one optimization is required (whereas previously multiple calls to CBPS::CBPS() were required). There was also a bug in the over-identified version of the CBPS for multi-category treatments in the CBPS package; it turns out this bug nudged the coefficients toward better balance, which is usually a good thing, but it meant they no longer lined up with what you would expect based on theory. This bug is not present in the new WeightIt implementation.

  • With continuous treatments, the weighting matrix for the over-identified CBPS now has a slightly different form from that in CBPS. (Specifically, it does not integrate out the treatment, which is done for the other treatment types). This makes it more likely to converge. That said, one should probably always use the just-identified version.

  • Link functions other than logit are allowed for binary treatments. The CBPS as described in Imai and Ratkovic (2014) was general enough that any generalized linear model could be used, but focus was primarily on logistic regression because it is the most common model for binary outcomes. The CBPS package only supports logistic regression. WeightIt supports almost any link (in particular, logit, probit, complementary log-log, cauchit, log, and identity). It turns out these are very easy to implement; the only differences are changing the inverse link used to compute the propensity scores and modifying the logistic regression score equations to include additional terms that are required for non-canonical links. In practice, logistic regression is all one should need, but it is nice to have these other options available in case you get better balance or precision using one of them, just as is the case when using propensity scores estimated from generalized linear models alone.
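
Here is a minimal sketch of these options in action. I use the lalonde dataset that ships with cobalt purely for illustration; the argument names follow the descriptions above.

```r
library(WeightIt)
data("lalonde", package = "cobalt")

# Just-identified CBPS (the new default)
w1 <- weightit(treat ~ age + educ + race + married + re74,
               data = lalonde, method = "cbps", estimand = "ATT")

# Over-identified CBPS with the continuously updated weighting matrix
w2 <- weightit(treat ~ age + educ + race + married + re74,
               data = lalonde, method = "cbps", estimand = "ATT",
               over = TRUE, twostep = FALSE)

# A non-logit link for a binary treatment
w3 <- weightit(treat ~ age + educ + race + married + re74,
               data = lalonde, method = "cbps", estimand = "ATT",
               link = "probit")

summary(w1)
```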

Some things are missing from the original CBPS implementation, though. Other arguments to CBPS::CBPS() (e.g., baseline.formula and diff.formula) are no longer supported, and auxiliary CBPS outputs, like the regression coefficients on the scale of the original predictors or the J-statistic for testing the over-identification conditions, are no longer returned. I doubt most users of CBPS rely on these, so I didn’t want to spend time computing and returning them when people just want the propensity scores and weights. If you do want these values, let me know and I’ll add them in, or just use CBPS().

I’m really proud of this implementation of CBPS, as it represents a culmination of my recent learning about M-estimation and generalized method of moments. There is still plenty about CBPS theory I don’t understand, but I’m happy to reduce a dependency and take charge of new advancements on the method and its implementation. I owe a great deal to the authors of the original CBPS papers and to the developers and maintainers of the CBPS package, especially Kosuke Imai and Christian Fong, who have long supported my work and appreciated its value. I also want to thank Syd Amerikaner on CrossValidated, who provided a key insight that helped me program the over-identified estimators.

New weighting method: inverse probability tilting (IPT)

IPT is a method of estimating propensity score weights described by Graham, De Xavier Pinto, and Egel (2012). It works almost identically to the just-identified CBPS: estimating equations corresponding to covariate balance are specified, and their roots, which correspond to coefficients in a generalized linear model for the propensity score, are found and used to compute the estimated propensity scores. For the ATT and ATC, the implementation is identical to the just-identified CBPS. For the ATE, IPT uses a slightly different method that estimates propensity scores for the treated and control units separately to ensure exact mean balance not just between the treatment groups but also between each group and the original sample (so-called “three-way balance,” Chan, Yam, and Zhang 2016). The original paper on IPT only described the method for “missing data problems”, which usually means estimating a counterfactual mean in an ATE framework, but Sant’Anna and Zhao (2020) describe a version for the ATT that is just as easy to implement. For multi-category treatments, rather than fitting a single multinomial logistic regression model as CBPS does, IPT fits what is effectively a separate logistic regression model for each category. This again guarantees three-way balance but differs from the CBPS implementation.

Previously, IPT was not available in any package in R. I wrote a gist to implement it before I really understood what it was, but now it is fully available in WeightIt by setting method = "ipt" in weightit(). Its only dependency is the rootSolve package, which quickly finds the roots of score equations. I could have just used optim() like I did with CBPS and entropy balancing, but rootSolve is quite a bit faster and often more reliable. As with CBPS, links besides the logit link are available, and all methods are compatible with M-estimation.
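
A minimal sketch of requesting IPT, using the same lalonde data as above:

```r
# IPT weights for the ATE; exact mean balance should hold in each group
w_ipt <- weightit(treat ~ age + educ + race + married + re74,
                  data = lalonde, method = "ipt", estimand = "ATE")
summary(w_ipt)
cobalt::bal.tab(w_ipt)  # check the (exact) mean balance
```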

The original authors proposed a specification test that involves testing whether the coefficients for the treated and control groups in the ATE differ from each other, and that is available in their Stata package. I didn’t put it into WeightIt, but it wouldn’t be crazy to implement, so, again, let me know if you’re interested.

In general, I would recommend IPT over CBPS and entropy balancing for binary and multi-category treatments due to its speed and some theoretical advantages. For all estimands, IPT is doubly robust, meaning that if either the propensity score model (including its link function) is correctly specified or the outcome model is linear in the covariates, the estimate is consistent. For the ATE, entropy balancing doesn’t retain this property, and CBPS is less protected against incidental error in the presence of heavy effect modification. Entropy balancing is best suited for continuous treatments (IPT doesn’t even have a version for continuous treatments), as the limitations of the linear model used in CBPS prevent certain important balancing conditions from being satisfied.

Fitting Weighted Regression Models

Okay, this is probably the biggest of the three updates. WeightIt now has functionality to fit weighted outcome regression models in such a way that the uncertainty in the estimation of the weights is accounted for in the variance matrix of the estimated parameters. This is huge because all otherwise accessible methods for estimating the variance of regression parameters after weighting treat the weights as fixed, which in most cases yields inferences that are too conservative (i.e., inappropriately imprecise; Austin 2022), and in some cases yields inferences that are anti-conservative (inappropriately precise; Reifeis and Hudgens 2022). This functionality is facilitated through the new function glm_weightit() (and its wrapper, lm_weightit()), which are mostly wrappers around glm() but estimate the variance matrix in a way that incorporates information about the estimation of the weights. This is done in a few ways, but the most notable is through M-estimation, which I briefly mentioned above.

M-estimation is a framework for estimating parameters that are defined as the solution to a system of estimating equations, in which the parameters can have arbitrary relationships, including across models. In this case, the parameters of the weighted outcome model depend on the weights, which in turn depend on the parameters used to estimate the weights (e.g., the coefficients in the propensity score model). By stacking the estimating equations for the weight models with those for the outcome model, we can fully account for uncertainty in both models when computing the variance of the outcome model parameters. This means that when we use g-computation based on the outcome model, we get a treatment effect estimate whose standard error correctly accounts for uncertainty in the estimation of the weights.
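
For those who like to see the mechanics spelled out, here is a sketch of the stacked estimating equations and sandwich variance, following the general presentation in Stefanski and Boos (2002); the notation is mine, not the package’s. With $\gamma$ the weight-model parameters and $\beta$ the outcome-model parameters, we solve

$$ \sum_{i=1}^n \psi(O_i; \theta) = \sum_{i=1}^n \begin{pmatrix} \psi_\gamma(O_i; \gamma) \\ \psi_\beta(O_i; \beta, \gamma) \end{pmatrix} = 0, \qquad \theta = (\gamma, \beta), $$

and estimate the variance with the sandwich formula

$$ \hat{V}(\hat{\theta}) = \frac{1}{n} A_n^{-1} B_n (A_n^{-1})^\top, \qquad A_n = -\frac{1}{n} \sum_{i=1}^n \frac{\partial \psi(O_i; \hat{\theta})}{\partial \theta^\top}, \qquad B_n = \frac{1}{n} \sum_{i=1}^n \psi(O_i; \hat{\theta}) \, \psi(O_i; \hat{\theta})^\top. $$

Because $\psi_\beta$ depends on $\gamma$ through the weights, $A_n$ is not block-diagonal, and the variance of $\hat{\beta}$ picks up the uncertainty in $\hat{\gamma}$.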

The way this works is as follows: one uses weightit() or weightitMSM() to estimate the balancing weights, and then one fits a model using glm_weightit() with the output of weightit() or weightitMSM() supplied to the weightit argument. This fits a weighted regression model using glm(), and, if available, estimates the variance matrix using M-estimation. Components in the weightit object contain the required information to account for the uncertainty of the weights in estimation of the variance of the outcome model parameters.
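
In code, the workflow looks like this (continuing with the lalonde data from above):

```r
# Estimate the weights
W <- weightit(treat ~ age + educ + race + married + re74,
              data = lalonde, method = "glm", estimand = "ATT")

# Fit the weighted outcome model; supplying W to the weightit argument
# lets glm_weightit() account for estimation of the weights
fit <- glm_weightit(re78 ~ treat * (age + educ + married + re74),
                    data = lalonde, weightit = W)
summary(fit)
```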

This is only available for methods that support M-estimation, which unfortunately does not include all of them: only weights estimated using GLM propensity scores, entropy balancing, IPT, and just-identified CBPS are supported. This is still a pretty good list. For the others, the default is to return the HC0 robust variance matrix, which treats the weights as fixed but is robust to heteroscedasticity or other misspecification of the outcome model likelihood (this is also true of the M-estimation variance).

Currently, no other package in R lets you do this, despite many recommendations for more accurate standard error estimation in the literature (Reifeis and Hudgens 2022; Austin 2022; Lunceford and Davidian 2004; Williamson, Forbes, and White 2014; Gabriel et al. 2024). You would have to program this yourself or use the highly context-specific code provided in these articles. Other packages, in particular PSweight, do allow you to estimate treatment effects that account for estimation of the weights in the standard error of the treatment effect, but only for the weighted difference in means or augmented inverse probability weighting (AIPW) (i.e., not for weighted g-computation) and only for GLM-based propensity scores. Stata offers teffects ipwra, which provides the corrected standard error estimates for weighted g-computation-based estimates of the treatment effect, but it also only supports GLM propensity scores and a few estimands.

glm_weightit() supports another method of estimating uncertainty: the bootstrap. Both the traditional bootstrap and the fractional weighted bootstrap (Xu et al. 2020) are supported. These resample the original data, estimate the weights, and fit the weighted outcome model automatically. Previously, one would have to program the bootstrap oneself, which wasn’t really that hard, but now it is done in a single function call. The bootstrap confidence intervals computed are Wald-type intervals, which perform worse than more advanced types like the BCa interval; still, because the estimators are asymptotically normal, the Wald-type intervals should work well. I recommend using the fractional weighted bootstrap when possible rather than the traditional bootstrap. The choice of variance estimator is controlled by the vcov argument.
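
As a sketch, requesting the fractional weighted bootstrap might look like the following; my understanding is that vcov = "FWB" selects it and R controls the number of replications, so treat those argument values as assumptions to check against the documentation.

```r
# Fractional weighted bootstrap with 500 replications (argument values
# per my reading of the docs; verify before use)
fit_fwb <- glm_weightit(re78 ~ treat * (age + educ + married + re74),
                        data = lalonde, weightit = W,
                        vcov = "FWB", R = 500)
summary(fit_fwb)
```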

glm_weightit() also supports estimation of multinomial logistic regression models by setting family = "multinomial". The outcome model can be flexible, e.g., including polynomials or splines, and its output is compatible with most regression post-processing functions, including all functions in marginaleffects, which can be used to perform g-computation to estimate the treatment effect. glm_weightit() also has support for clustered covariance matrices for all variance estimation methods. My hope is to add support for fitting Cox proportional hazards models as well, though I still barely understand them. Any advice on this would be greatly appreciated.
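
For example, g-computation of the ATT from the fit above can be done with marginaleffects:

```r
library(marginaleffects)

# Average treatment effect in the treated: contrast predictions under
# each treatment level, averaged over the treated units
avg_comparisons(fit, variables = "treat",
                newdata = subset(lalonde, treat == 1))
```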

For general theory on M-estimation, I recommend Stefanski and Boos (2002), and for its application to weighted regression models, I recommend Gabriel et al. (2024). I also think looking at the WeightIt source code can be illuminating; I think I did a good job of keeping it pretty simple and providing example code that anyone could use to program their own M-estimators. It’s really a lot easier than I thought and I’m glad I took the time to learn. I know it’s a fundamental technique for many biostatisticians and econometricians, but I always used to see it as a highly advanced and arcane method that I would never come to understand, until I came to understand it (3.5 years out from my PhD).

Other updates

There are a few smaller updates to WeightIt that I would feel bad about omitting.

  • First, the package documentation now uses Roxygen. This doesn’t affect use but might make it easier for someone reading the source code.

  • I added a new function, calibrate(), which calibrates propensity scores using Platt scaling as recommended by Gutman, Karavani, and Shimoni (2022). This can slightly improve performance when a machine learning model is used to estimate the propensity score, but is unlikely to be useful otherwise.

  • A new argument, quantile, can be supplied to weightit() for methods that also accept moments and int; it can be used to balance quantiles of the covariate distribution (e.g., the median) instead of just the means and means of moments. This is based on Beręsewicz (2023). (See the example after this list.)

  • trim() now has a drop argument; setting it to TRUE sets the weights of all trimmed units to 0 (effectively dropping or censoring them). Previously you could only “truncate” the weights, i.e., set all weights higher than a given quantile to the weight at that quantile, but weight trimming (dropping units with extreme weights) has an extensive history as well (Thoemmes and Ong 2016; Crump et al. 2009). In my opinion, it is always better to change your estimand prospectively (e.g., to the ATO) than to censor units, but the option is available. (Also shown in the example after this list.)

  • I made a small change to how g-computation is done; following Gabriel et al. (2024), I no longer recommend incorporating the weights in the g-computation step beyond including them in the outcome model. An exception is made when targeting an estimand other than the ATT, ATC, or ATE or when using sampling weights. In practice, it will not make a big difference whether the weights are included or not for the ATE (which is the only estimand affected by this change), but this brings my recommendations in line with those in the literature and makes them consistent with teffects ipwra in Stata.

  • For multi-category treatments, the default multinomial logistic regression propensity scores (i.e., with method = "glm") require no dependencies. Other ways of estimating the propensity scores are possible by specifying the multi.method argument.
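
To close out these smaller updates, here is a hedged sketch of the quantile argument and the new drop argument to trim(); the exact form quantile accepts may differ slightly from what I show here.

```r
# Balance the median of age (entropy balancing accepts moments/int,
# so it also accepts quantile); the named-list form is my assumption
w_q <- weightit(treat ~ age + educ + married, data = lalonde,
                method = "ebal", quantile = list(age = 0.5))

# Set weights above the 95th percentile to 0 instead of truncating them
w_t <- trim(w1, at = 0.95, drop = TRUE)
summary(w_t)
```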

Closing Thoughts

Thank you so much for reading and checking out WeightIt! I hope these new features improve your research or lead to new discoveries. Please let me know if any of my packages have helped you in your work; it makes a big difference to know that my efforts help people.

References

Austin, Peter C. 2022. “Bootstrap Vs Asymptotic Variance Estimation When Using Propensity Score Weighting with Continuous and Binary Outcomes.” Statistics in Medicine 41 (22): 4426–43. https://doi.org/10.1002/sim.9519.
Beręsewicz, Maciej. 2023. “Survey Calibration for Causal Inference: A Simple Method to Balance Covariate Distributions.” https://doi.org/10.48550/arXiv.2310.11969.
Chan, Kwun Chuen Gary, Sheung Chi Phillip Yam, and Zheng Zhang. 2016. “Globally Efficient Non-Parametric Inference of Average Treatment Effects by Empirical Balancing Calibration Weighting.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78 (3): 673–700. https://doi.org/10.1111/rssb.12129.
Crump, R. K., V. J. Hotz, G. W. Imbens, and O. A. Mitnik. 2009. “Dealing with Limited Overlap in Estimation of Average Treatment Effects.” Biometrika 96 (1): 187–99. https://doi.org/10.1093/biomet/asn055.
Fong, Christian, Chad Hazlett, and Kosuke Imai. 2018. “Covariate Balancing Propensity Score for a Continuous Treatment: Application to the Efficacy of Political Advertisements.” The Annals of Applied Statistics 12 (1): 156–77. https://doi.org/10.1214/17-AOAS1101.
Gabriel, Erin E., Michael C. Sachs, Torben Martinussen, Ingeborg Waernbaum, Els Goetghebeur, Stijn Vansteelandt, and Arvid Sjölander. 2024. “Inverse Probability of Treatment Weighting with Generalized Linear Outcome Models for Doubly Robust Estimation.” Statistics in Medicine 43 (3): 534–47. https://doi.org/10.1002/sim.9969.
Graham, Bryan S., Cristine Campos De Xavier Pinto, and Daniel Egel. 2012. “Inverse Probability Tilting for Moment Condition Models with Missing Data.” The Review of Economic Studies 79 (3): 1053–79. https://doi.org/10.1093/restud/rdr047.
Gutman, Rom, Ehud Karavani, and Yishai Shimoni. 2022. “Propensity Score Models Are Better When Post-Calibrated.”
Imai, Kosuke, and Marc Ratkovic. 2014. “Covariate Balancing Propensity Score.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76 (1): 243–63. https://doi.org/10.1111/rssb.12027.
———. 2015. “Robust Estimation of Inverse Probability Weights for Marginal Structural Models.” Journal of the American Statistical Association 110 (511): 1013–23. https://doi.org/10.1080/01621459.2014.956872.
Lunceford, Jared K., and Marie Davidian. 2004. “Stratification and Weighting via the Propensity Score in Estimation of Causal Treatment Effects: A Comparative Study.” Statistics in Medicine 23 (19): 2937–60. https://doi.org/10.1002/sim.1903.
Reifeis, Sarah A, and Michael G Hudgens. 2022. “On Variance of the Treatment Effect in the Treated When Estimated by Inverse Probability Weighting.” American Journal of Epidemiology 191 (6): 1092–97. https://doi.org/10.1093/aje/kwac014.
Sant’Anna, Pedro H. C., and Jun Zhao. 2020. “Doubly Robust Difference-in-Differences Estimators.” Journal of Econometrics 219 (1): 101–22. https://doi.org/10.1016/j.jeconom.2020.06.003.
Stefanski, Leonard A., and Dennis D. Boos. 2002. “The Calculus of M-Estimation.” The American Statistician 56 (1): 29–38. https://doi.org/10.1198/000313002753631330.
Thoemmes, Felix J., and Anthony D. Ong. 2016. “A Primer on Inverse Probability of Treatment Weighting and Marginal Structural Models.” Emerging Adulthood 4 (1): 40–59. https://doi.org/10.1177/2167696815621645.
Williamson, Elizabeth J., Andrew B. Forbes, and Ian R. White. 2014. “Variance Reduction in Randomised Trials by Inverse Probability Weighting Using the Propensity Score.” Statistics in Medicine 33 (5): 721–37. https://doi.org/10.1002/sim.5991.
Xu, Li, Chris Gotwalt, Yili Hong, Caleb B. King, and William Q. Meeker. 2020. “Applications of the Fractional-Random-Weight Bootstrap.” The American Statistician 74 (4): 345–58. https://doi.org/10.1080/00031305.2020.1731599.
Noah Greifer
Statistical Consultant and Programmer