The 2026 Stata Biostatistics and Epidemiology Virtual Symposium brings together researchers in biostatistics and epidemiology from around the world to discuss current theory and applied methods using Stata. The proceedings consist of invited talks by top Stata users on a virtual platform, so you can attend this one-day event from wherever you are.
All times Central Standard Time
10:00 a.m.
Estimating breast cancer incidence using multiple imputation with chained equations (MICE)
Anna Johansson, Karolinska Institutet
Breast cancer is not one disease but many different subtypes. When estimating breast cancer incidence in the population, we use routine registry data. Information on breast cancer subtype is sometimes missing in these registry data, and such missingness is more common in certain patient groups and thus not random. Hence, it is appropriate to use multiple imputation with chained equations (MICE) when estimating subtype-specific breast cancer incidence. I will give examples of how we have applied MICE to Swedish breast cancer data, which choices we made in order to build an imputation model (using mi impute), as well as challenges in combining the imputed estimates using Rubin's rules (using mi estimate).
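As a rough illustration of the pooling step mentioned above, the following Python sketch applies Rubin's rules to a set of imputation-specific estimates. It is not the speaker's code and not Stata's mi estimate implementation: the function name pool_rubin and the example numbers are hypothetical, the estimates are assumed to be on a scale where a normal approximation is reasonable (for example, log incidence rates), and the degrees-of-freedom adjustment is left out.

```python
import math


def pool_rubin(estimates, variances):
    """Pool imputation-specific estimates with Rubin's rules.

    estimates: one point estimate per imputed data set
    variances: the squared standard error of each estimate
    Returns the pooled estimate and its standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates with matching variances")
    q_bar = sum(estimates) / m                               # pooled point estimate
    within = sum(variances) / m                              # within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    total = within + (1 + 1 / m) * between                   # total variance
    return q_bar, math.sqrt(total)


# Hypothetical log incidence rates and variances from five imputed data sets
log_rates = [-7.10, -7.05, -7.12, -7.08, -7.07]
variances = [0.0040, 0.0042, 0.0039, 0.0041, 0.0040]
pooled, se = pool_rubin(log_rates, variances)
print(f"pooled log rate = {pooled:.3f}, SE = {se:.3f}")
```

The between-imputation term is what carries the extra uncertainty due to the missing subtype information; with identical estimates across imputations it vanishes and the pooled standard error reduces to the ordinary one.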
10:30 a.m.
Regression models for accuracy estimation
Niels Henrik Bruun, Aalborg University Hospital
I present regression-based methods for estimating and comparing diagnostic accuracy measures while addressing the STARD 2015 requirements. Key metrics include sensitivity, specificity, AUC, PPV, NPV, and accuracy. True-positive and false-positive rates, independent of prevalence, are estimated using OLS regression with robust variance. The derived measures, PPV, NPV, and accuracy, are computed from prevalence, sensitivity, and specificity using nonlinear formulas. For single-modality analysis, sensitivity and specificity are obtained by regressing test outcomes on the "true" values, such as those obtained from pathology. For multimodality studies on the same subjects, data are stacked with a modality indicator, and mixed-effects models with random intercepts are used to account for correlation. A new confreg command combines regression and nonlinear estimation to estimate accuracy metrics under dependency structures. These methods provide a flexible framework for robust comparisons of diagnostic performance across instruments.
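The nonlinear formulas linking the derived measures to prevalence, sensitivity, and specificity can be sketched as follows. This Python fragment is only an illustration of the standard identities; it is not the confreg command, it produces no confidence intervals, and the function name derived_measures is hypothetical.

```python
def derived_measures(prevalence, sensitivity, specificity):
    """Compute PPV, NPV, and accuracy from prevalence, sensitivity, and specificity.

    All inputs are proportions strictly between 0 and 1.
    """
    for value in (prevalence, sensitivity, specificity):
        if not 0.0 < value < 1.0:
            raise ValueError("inputs must be proportions strictly between 0 and 1")
    true_pos = sensitivity * prevalence                # P(test+, disease+)
    false_neg = (1 - sensitivity) * prevalence         # P(test-, disease+)
    true_neg = specificity * (1 - prevalence)          # P(test-, disease-)
    false_pos = (1 - specificity) * (1 - prevalence)   # P(test+, disease-)
    return {
        "ppv": true_pos / (true_pos + false_pos),
        "npv": true_neg / (true_neg + false_neg),
        "accuracy": true_pos + true_neg,
    }


# Hypothetical values chosen only to show the call
print(derived_measures(prevalence=0.10, sensitivity=0.85, specificity=0.95))
```

Because prevalence enters every formula, the same sensitivity and specificity can yield very different PPV and NPV in populations with different disease frequency.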
11:00 a.m.
Break
11:15 a.m.
wqsreg - A Stata command for weighted quantile sum regression
Marta Ponzano, Department of Life Sciences, Health and Health Professions, Link Campus University; Department of Health Sciences, University of Genoa
Additional authors:
Stefano Renzetti, Department of Medicine and Surgery, University of Parma
Andrea Bellavia, Department of Environmental Health, Harvard T.H. Chan School of Public Health; TIMI Study Group, Brigham and Women's Hospital, Harvard Medical School
Weighted quantile sum (WQS) regression is a statistical method for quantifying the association between a set of possibly correlated predictors and a health outcome, estimating the joint effect of the predictors as well as their individual contributions to the total effect. We present wqsreg, to the best of our knowledge the first Stata command for WQS regression, implemented for continuous, binary, and count outcomes. The execution of the command involves two sequential steps: 1) estimating the weights and constructing the WQS index under specific constraints and 2) modeling its association with the outcome. wqsreg integrates several flexible components of the framework such as bootstrap, training/validation, and repeated holdout procedures; it returns regression estimates as well as graphical displays of the individual weights. wqsreg requires Stata version 11 or higher and is freely available on GitHub. We present an application of the command to exposome data exploring the association between 38 exposures and a continuous outcome while adjusting for a set of covariates. We anticipate that our contribution will further promote the use of appropriate statistical methods for handling multiple correlated predictors.
11:45 a.m.
Summarizing data from continuous glucose monitors using the cgmstats package
Natalie Daya Malek, Johns Hopkins University
The use of wearable continuous glucose monitors (CGMs) is growing rapidly. The latest generation of CGM systems does not require fingerstick calibration, is minimally invasive, and is frequently used in research studies. CGM sensors are typically worn for up to 2 weeks and record interstitial glucose measurements every minute to every 15 minutes, depending on the sensor used. CGM systems generate hundreds of measurements per day and thousands of measurements in one person over a single wear. There is a need for tools that allow researchers to efficiently organize and summarize the wealth of data on glucose patterns produced by CGM systems. We developed the cgmstats package, which generates CGM summary measures from a variety of CGM systems and allows the user to flexibly define ranges and generate data visualizations. We provide an overview of the cgmstats package and examples of its use. The cgmstats package supports rigorous and reproducible analyses of CGM data.
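To make the idea of CGM summary measures with user-defined ranges concrete, here is a minimal Python sketch. It does not reproduce the cgmstats package or its syntax: the function name summarize_cgm, the default bounds of 70 and 180 mg/dL, and the sample readings are all assumptions made for illustration, and readings are assumed to be equally spaced so that the share of readings approximates the share of time.

```python
from statistics import mean, stdev


def summarize_cgm(readings, low=70.0, high=180.0):
    """Summarize glucose readings (mg/dL) against a user-defined range.

    Returns the mean, the coefficient of variation, and the percentage of
    readings below, within, and above the range [low, high].
    """
    n = len(readings)
    if n < 2:
        raise ValueError("need at least two readings")
    if low >= high:
        raise ValueError("low must be smaller than high")
    below = sum(1 for g in readings if g < low)
    above = sum(1 for g in readings if g > high)
    average = mean(readings)
    return {
        "mean": average,
        "cv_percent": 100 * stdev(readings) / average,
        "pct_below": 100 * below / n,
        "pct_in_range": 100 * (n - below - above) / n,
        "pct_above": 100 * above / n,
    }


# Hypothetical readings from a short stretch of sensor wear
sample = [65, 82, 110, 145, 190, 210, 160, 120, 95, 88]
print(summarize_cgm(sample))
# A tighter, user-chosen range
print(summarize_cgm(sample, low=70, high=140))
```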
12:15 p.m.
Lunch
1:15 p.m.
Demographic estimation and projection methods using Stata: Mortality, fertility, and multistate population dynamics
Jerônimo Muniz, Federal University of Minas Gerais
Reliable demographic analysis in settings with incomplete or imperfect data requires flexible and transparent estimation and projection tools. This paper presents an integrated suite of Stata-based methods for estimating mortality and fertility and for projecting populations by age, sex, and additional characteristics. First, I revisit intercensal approaches to mortality estimation, including census-based, death distribution, and iterative methods, and introduce tools for constructing single-decrement life tables and estimating age-specific net migration using two population age distributions and intercensal deaths. Second, I describe an enhanced implementation of the own-children method for estimating age-specific fertility rates, providing graphical summaries of recent fertility patterns, weighted subgroup estimates, and a wide range of reproductive indicators derived from biological mother-child links. Third, I present a matrix-based projection framework for forecasting population dynamics under specified schedules of fertility, mortality, and migration, supporting one- and two-sex models as well as multistate classifications such as region, race, or health status. Empirical illustrations draw on census and register data from Vietnam, Brazil, and Sweden, demonstrating applicability across diverse demographic contexts. Together, these methods offer a coherent and extensible toolkit for demographic estimation and projection using standard data sources.
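The matrix-based projection idea can be illustrated with a one-sex, closed-population sketch in the style of a Leslie matrix. This is a simplified stand-in, not the speaker's toolkit: it ignores migration, two-sex and multistate extensions, and an open-ended final age group, and the function name project_population and all input values are hypothetical.

```python
def project_population(population, fertility, survival, steps=1):
    """Project an age-structured population forward with a Leslie-type model.

    population: counts by age group at the start
    fertility:  births per person in each age group over one projection step
    survival:   proportion surviving from age group i to i + 1 (length n - 1)
    The oldest age group is assumed to die out at each step.
    """
    n = len(population)
    if len(fertility) != n or len(survival) != n - 1:
        raise ValueError("fertility needs n entries and survival needs n - 1")
    current = list(population)
    for _ in range(steps):
        # First row of the Leslie matrix: births produced by each age group
        births = sum(f * p for f, p in zip(fertility, current))
        # Sub-diagonal: survivors move up one age group
        aged = [s * p for s, p in zip(survival, current[:-1])]
        current = [births] + aged
    return current


# Hypothetical three-age-group population
start = [100.0, 80.0, 50.0]
rates = [0.0, 1.0, 0.5]
surv = [0.9, 0.5]
print(project_population(start, rates, surv, steps=2))
```

Writing the step as "births plus shifted survivors" is equivalent to multiplying the population vector by a matrix whose first row holds the fertility schedule and whose sub-diagonal holds the survival proportions.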
2:00 p.m.
Modeling longitudinal core temperature in a crossover trial of farmworkers in California
Maria Montez Rath, Stanford University
Analyzing core body temperature in field settings presents unique challenges, including high-frequency longitudinal measurements, individual physiological variability, and the environmental noise of active work shifts. This presentation discusses a comprehensive workflow in Stata for processing and modeling data from a crossover trial designed to evaluate cooling interventions (bandanas and mitts) among California farmworkers. I detail the steps necessary to move from raw, minute-by-minute sensor data to statistical inference. Key methodological hurdles addressed include (1) high-frequency data cleaning and the use of mipolate for data interpolation; (2) data smoothing using lowess to manage data artifacts; (3) the calculation of area under the curve (AUC) using integ; and (4) the application of mixed-effects REML regression to account for the crossover design, including trial week, carryover effects, and time-invariant physiological covariates (BMI, age, and sex). While the primary focus is on the analytical steps rather than the efficacy of the interventions, I demonstrate how Stata’s margins and contrast commands and the community-contributed coefplot package can be used to visualize complex longitudinal results and their sensitivity to model specifications. This toolkit offers a reproducible framework for researchers handling complex thermal or physiological time-series data in occupational health.
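For readers unfamiliar with the AUC step, the following Python sketch computes the area under a temperature curve with the trapezoidal rule. It is one common way to obtain an AUC and is offered only as an illustration: the abstract does not state which integ settings the speaker uses, and the function name trapezoid_auc, the optional baseline, and the sample values are hypothetical.

```python
def trapezoid_auc(times, values, baseline=0.0):
    """Area between a measured curve and a baseline, by the trapezoidal rule.

    times:    strictly increasing measurement times (for example, minutes)
    values:   measurements taken at those times (for example, degrees Celsius)
    baseline: level subtracted from every value before integrating
    """
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two matching time/value pairs")
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        if width <= 0:
            raise ValueError("times must be strictly increasing")
        left = values[i - 1] - baseline
        right = values[i] - baseline
        area += width * (left + right) / 2  # area of one trapezoid
    return area


# Hypothetical core temperatures recorded every 15 minutes
minutes = [0, 15, 30, 45, 60]
temps = [37.0, 37.4, 37.9, 37.6, 37.2]
print(trapezoid_auc(minutes, temps, baseline=37.0))
```

Subtracting a baseline turns the result into degree-minutes above that level, which can be easier to compare across workers than the raw area.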
3:00 p.m.
Adjourn