
The 19th Italian Stata Conference will take place on 25 September 2025 in Milan. There will also be an optional workshop on 26 September.
Meet researchers from different disciplinary areas, discover new applications that highlight Stata's capabilities for applied research, exchange new community-contributed commands developed for Stata, and interact directly with statisticians from StataCorp.
All times are CEST (UTC+2).
8:30–9:00 | Registration |
9:00–10:25 | Session I: Exploiting the potential of Stata 19, I
Linking frames in Stata Abstract:
This presentation gives an overview of data frames in Stata.
I demonstrate the basics of working with multiple datasets in
Stata. I cover most of the frames suite of commands, touching
on frame creation and management, linking frames, copying
variables from linked frames, alias variables, and working with
a set of frames.
Jeff Pitblado
StataCorp
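As a minimal sketch of the frames workflow outlined above (the dataset and variable names are hypothetical, for illustration only):

```stata
* Create a second frame and load a lookup dataset into it
frame create counties
frame counties: use counties.dta   // hypothetical county-level file

* Link the current (person-level) frame to the county frame
frlink m:1 countyid, frame(counties)

* Copy a variable from the linked frame into the current frame
frget median_income, from(counties)

* Or reference a linked variable without copying it, via an alias
fralias add urban_share, from(counties)
```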
The new cate command: An overview Abstract:
This presentation offers a concise overview of the cate command,
a new tool introduced in Stata 19 for estimating conditional
average treatment effects (CATEs). CATEs quantify how the
impact of a treatment varies across individuals or subgroups
defined by observed characteristics, thus enabling a more nuanced
understanding of treatment-effect heterogeneity and supporting
the design of targeted policy interventions.
Giovanni Cerulli
IRCrES-CNR
10:25–10:45 | Break |
10:45–12:15 | Session II: Community-contributed commands, I
xtbreak: Testing and estimating structural breaks in time-series and panel data in Stata Abstract:
Identifying structural change is a crucial step in analysis of time series
and panel data. The longer the time span, the higher the likelihood
that the model parameters have changed as a result of major
disruptive events, such as the 2007–2008 financial crisis and the 2020
COVID-19 outbreak. Detecting the existence of breaks and dating
them is therefore necessary not only for estimation purposes but also
for understanding drivers of change and their effect on relationships.
This talk introduces a new community-contributed command called
xtbreak, which provides researchers with a complete toolbox for
analyzing multiple structural breaks in time-series and panel data.
xtbreak can detect the existence of breaks, determine their number
and location, and provide break date confidence intervals. A special
emphasis of the talk will be put on Python integration to gain
speed advantages.
Jan Ditzen
Libera Università di Bolzano
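A schematic illustration of the toolbox described above, with options abbreviated from memory rather than the command's documentation (variable names are hypothetical; see `help xtbreak` for the exact syntax):

```stata
* Test for the existence of structural breaks in a panel regression
xtbreak test y x1 x2

* Estimate the number and location of breaks, with confidence
* intervals for the break dates (here, up to 2 breaks assumed)
xtbreak estimate y x1 x2, breaks(2)
```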
Variance components in panel data Abstract:
A preliminary and crucial step in any empirical research on panel
data, whether longitudinal, time-series cross-section, or multilevel, is
to study the nature and relevance of the components that influence
the variability of the variables, particularly the dependent variable.
Each panel dataset can be considered as a set of grouped data,
whether these are temporal observations nested within individuals or
individuals nested within groups and supergroups. The fundamental
steps for guiding the modeling strategies to be adopted are as follows: breaking
down the total variability into variances between and within clusters,
also in terms of percentage shares; assessing whether there are
relevant common factors within clusters and, in the case of temporal
observations, whether these are stationary or not; and comparing the
relevance and significance of group and individual effects depending
on whether they are considered fixed or random.
Maria Elena Bontempi
Università di Bologna
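The decomposition steps described above can be sketched with standard Stata commands (variable names are hypothetical):

```stata
* Declare the panel structure
xtset id year

* Break total variability into between and within components
xtsum y

* Intraclass correlation: share of variance at the cluster level
loneway y id

* Compare random vs. fixed treatment of individual effects:
* random intercept with estimated variance components ...
mixed y x || id:
* ... against a fixed-effects specification
xtreg y x, fe
```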
fffuroot: Implementing unit-root and stationarity tests with smooth breaks approximated by flexible Fourier forms in Stata Abstract:
This work describes the Stata implementation of unit-root and
stationarity tests with flexible Fourier forms as in Enders and Lee
(2012a, 2012b) and Becker, Enders, and Lee (2006).
Giovanni Bruno
Università Bocconi
12:15–13:00 | Session III: Stata tips and tricks
xtplot2 Abstract:
The xtplot2 command investigates the structure of panel datasets
with respect to unbalancedness and values using heat plots. It
allows the researcher a quick and efficient way to gain insights
into the structure.
Jan Ditzen
Libera Università di Bolzano
Automating episode splitting: Introducing the splitting command for Stata Abstract:
Event history analysis (also known as survival analysis) is a well-established
analytical tool in the social sciences and research
more broadly, and it is particularly useful when researchers aim
to estimate the effect of time-varying variables. Survival analysis
is well supported in Stata via numerous built-in commands. In
particular, stsplit facilitates breaking the time axis into episodes
to include time-varying covariates in the analysis. While
stsplit is straightforward to use when the time axis must be split
at the point a change occurs in a dichotomous variable, the
procedure becomes less intuitive when dealing with polytomous
variables.
Davide Bussi
Università degli Studi di Milano-Bicocca
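The built-in episode-splitting workflow the talk builds on can be sketched as follows (variable names and time points are hypothetical):

```stata
* Declare survival-time data: time to event/censoring plus failure flag
stset time, failure(event) id(id)

* Split each subject's spell at fixed time points, creating one
* episode per interval; the new variable records the interval start
stsplit period, at(12 24 36)

* A time-varying covariate can then be defined per episode
gen post_treatment = (period >= 24)   // hypothetical threshold
```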
xtgetpca Abstract:
Extracting principal components from panel data is a common task.
However, no dedicated Stata solution exists. xtgetpca fills
this gap. It allows for different types of standardization, removal of
fixed effects, and unbalanced panels.
Jan Ditzen
Libera Università di Bolzano
13:00–14:00 | Lunch |
14:00–15:20 | Session IV: Exploiting the potential of Stata 19, II
Meta-analysis in Stata Abstract:
Many studies attempt to answer similar research questions. For
instance, you may have results from studies asking, “What is the
association between unemployment and mental health?” Or you
may have results from studies asking, “How does motherhood
affect women’s wages?” The results from different studies may be
inconclusive or conflicting. Meta-analysis is a statistical technique
for combining the results from several similar studies. It allows
us to explore the variation across studies and, when appropriate,
provide a single estimate for the effect size of interest. In this
presentation, I show how to use the meta suite of commands to
perform meta-analysis in Stata.
Gabriela Ortiz
StataCorp
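A minimal sketch of the meta suite workflow described above (variable names are hypothetical):

```stata
* Declare meta-analysis data: effect sizes and their standard errors
meta set effect_size std_err, studylabel(study)

* Random-effects summary of the pooled effect and heterogeneity
meta summarize

* Forest plot of study-level and pooled estimates
meta forestplot

* Funnel plot to inspect small-study effects
meta funnelplot
```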
Consensus clustering in Stata Abstract:
This work considers consensus clustering in Stata, combining
bootstrapped k-means with hierarchical clustering based on a
co-association matrix. The method addresses the possible
inherent instability of partitioning-based clustering by aggregating
results from multiple bootstrap samples, improving robustness
and reproducibility. In this respect, at each iteration, k-means
clustering is applied, and the results are collected in a large-scale
cluster assignment matrix. A consensus matrix is then created
to measure the cooccurrence of observations within the same
cluster across all iterations. This matrix is transformed into a
dissimilarity structure and in this way subjected to hierarchical
clustering in order to obtain a final, stable partition.
This framework shows how consensus clustering can be performed robustly and efficiently in Stata. It uses a combination of Stata routines, bootstrap sampling, and optimized Mata routines to compute the co-association matrix, ensuring computational efficiency. The approach is broadly applicable to clustering tasks in the social sciences, economics, epidemiology, and other fields where cluster stability is critical.
Carlo Drago
Università degli Studi Niccolò Cusano
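A hedged sketch of the final aggregation step described above, assuming the co-association matrix `C` (observation-by-observation co-occurrence proportions) has already been built in Mata from the bootstrap loop:

```stata
* Convert co-occurrence proportions into dissimilarities in Mata
mata: D = 1 :- C                  // C assumed built from the bootstrap loop
mata: st_matrix("D", D)

* Hierarchical (Ward) clustering directly on the dissimilarity matrix
clustermat wardslinkage D, name(consensus) add

* Cut the dendrogram into the desired number of final clusters
cluster generate group = groups(4), name(consensus)
```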
15:20–15:35 | Break |
15:35–16:35 | Session V: Community-contributed commands, II
outdetect: Outlier detection for inequality and poverty analysis Abstract:
Extreme values are common in survey data and represent a
recurring threat to the reliability of both poverty and inequality
estimates. The adoption of a consistent criterion for outlier
detection is useful in many practical applications, particularly
when international and intertemporal comparisons are involved. In
this talk, I discuss a simple univariate detection procedure to
flag outliers. I present outdetect, a command that implements
the procedure and provides useful diagnostic tools. The output
of outdetect compares statistics obtained before and after the
exclusion of outliers, with a focus on inequality and poverty
measures. Finally, I carry out an extensive sensitivity exercise
where the same outlier detection method is applied consistently
to per capita expenditure across more than 30 household budget
surveys. The results are clear and provide a sense of the influence
of extreme values on poverty and inequality estimates.
Giulia Mancini
Università degli Studi di Sassari
rdlasso: A Stata command for high-dimensional regression discontinuity designs Abstract:
The rdlasso command implements regression discontinuity
designs (RDD) with high-dimensional covariates in Stata.
The procedure is based on the methodology developed by
Kreiss and Rothe (2023), and extends it to both sharp and
fuzzy designs. Covariate selection is performed through a
lasso-based local estimation, ensuring valid inference under
approximate sparsity.
The command is built using Stata’s Python integration via the SFI module and automates all steps of the estimation process—from covariate selection to bandwidth choice and bias-corrected treatment-effect estimation. The syntax allows for flexible user control while remaining fully embedded in the Stata environment. rdlasso enables Stata users to apply machine learning techniques for causal inference without requiring programming in external platforms such as R or Python. The command generates output variables that can be used for further postestimation analysis within the same session. An option automatically distinguishes between sharp and fuzzy designs, making the tool both user-friendly and methodologically complete. The implementation is illustrated through a step-by-step example and an empirical application. The command contributes to the growing set of tools for modern causal analysis in Stata, particularly in high-dimensional settings.
Marianna Nitt
Sapienza – Università di Roma
16:35–17:40 | Session VI: Exploiting the potential of Stata 19, III
Automated data extraction from unstructured text using LLMs: A scalable workflow for Stata users Abstract:
In several data-rich domains such as finance, medicine, law,
and scientific publishing, most of the valuable information is
embedded in unstructured textual formats, from clinical notes
and legal briefs to financial statements and research papers.
These sources are rarely available in structured formats suitable
for immediate quantitative analysis. This presentation introduces
a scalable and fully integrated workflow that employs large
language models (LLMs), specifically ChatGPT 4.0 via API, in
conjunction with Python and Stata to extract structured variables
from unstructured documents and make them ready for further
statistical processing in Stata.
As a representative use case, I demonstrate the extraction of information from a SOAP clinical note, treated as a typical example of unstructured medical documentation. The process begins with a single PDF and extends to an automated pipeline capable of batch-processing multiple documents, highlighting the scalability of this approach. The workflow involves PDF parsing and text preprocessing using Python, followed by prompt engineering designed to optimize the performance of the LLM. In particular, the temperature parameter is tuned to a low value (for example, 0.0–0.3) to promote deterministic and concise extraction, minimizing variation across similar documents and ensuring consistency in output structure.
Once the LLM returns structured data, typically in JSON or CSV format, it is seamlessly imported into Stata using custom .do scripts that handle parsing (insheet), transformation (split, reshape), and data cleaning. The final dataset is used for exploratory or inferential analysis, with visualization and summary statistics executed entirely within Stata.
The presentation also addresses critical considerations, including the computational cost of using commercial LLM APIs (token-based billing), privacy and compliance risks when processing sensitive data (such as patient records), and the potential for bias or hallucination inherent to generative models. To assess the reliability of the extraction process, I report evaluation metrics such as cosine similarity (for text alignment and summarization accuracy) and F1 score (for evaluating named-entity and numerical-field extraction). By bridging the capabilities of LLMs with Stata's powerful analysis tools, this workflow equips researchers and analysts with an accessible method to unlock structured insights from complex unstructured sources, extending the reach of empirical research into previously inaccessible text-heavy datasets.
Loreta Isaraj
IRCrES-CNR
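On the Stata side, importing the pipeline's structured output might look like the following sketch (file and variable names are hypothetical; the abstract mentions insheet, while import delimited is its modern equivalent):

```stata
* Import the CSV returned by the LLM extraction pipeline
import delimited using extracted_notes.csv, clear varnames(1)

* Split a semicolon-delimited field into separate string variables
split diagnoses, parse(";") gen(dx)

* Reshape one-row-per-document data to long form for analysis
reshape long dx, i(note_id) j(dx_num)

* Basic cleaning and summary entirely within Stata
drop if missing(dx)
tabulate dx, sort
```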
Text mining and hierarchical clustering in Stata: An applied approach for real-time policy monitoring, forecasting, and literature mapping Abstract:
This presentation shows an applied framework for text mining and
clustering in the Stata environment and provides practical tools
for policy-relevant research in economics and health economics.
With the growing amount of unstructured textual data—from
financial news and analyst reports to scientific publications—
there is an increasing demand for scalable methods to classify
and interpret such information for evidence-based policy and
forecasting.
The first part illustrates Stata's capacity to integrate with Python, used here to implement hierarchical clustering from scratch with TF-IDF vectorization and cosine distance. This technique is applied to economic text sources, such as headlines or institutional communications, with the aim of segmenting documents into a fixed or silhouette-optimized number of clusters. The approach allows researchers to identify patterns in the data, uncover latent themes, and organize information for macroeconomic forecasting, sentiment analysis, or real-time policy monitoring.
In the second part, I focus on literature mapping in health economics. Using a curated corpus of article titles related to telemedicine and diabetes, I apply a native Stata pipeline based on text normalization and clustering to identify thematic areas within the literature. The approach promotes organized reviews in health technology assessment and policy evaluation and makes evidence synthesis more accessible. By combining native Stata capabilities with Python-enhanced workflows, I provide applied researchers with an accessible and policy-relevant toolkit for unsupervised text classification in multiple domains.
Carlo Drago
Università degli Studi Niccolò Cusano
|
17:40–18:00 | Open panel discussion with Stata developers
Contribute to the Stata community by sharing your feedback with StataCorp's developers. From feature improvements to bug fixes and new ways to analyze data, we want to hear how Stata can be made better for our users.
20:00 | Conference social dinner (optional) |
Workshop information forthcoming
Conference fees include breaks, lunch, and course materials.
Conference fees (VAT not incl.) | Student | Other
---|---|---
Conference only | €70 | €110
Conference + workshop | €262 | €420
Registration deadline is 15 September 2025.
Visit the official conference page for more information.
TStat is delighted to sponsor, via our project “Investing in Young Researchers”, two (2) full-time PhD students from any of the countries for which TStat is the official Stata distributor. Sponsorship covers both the first day of the conference and the workshop. Travel expenses are to be paid by the participant. To apply for sponsorship, please send your curriculum vitae to [email protected].
The logistics organizer for the 2025 Italian Stata Conference is TStat S.r.l., the distributor of Stata for Italy, Albania, Bosnia and Herzegovina, Greece, Kosovo, North Macedonia, Malta, Montenegro, Serbia, Slovakia, and Slovenia.
View the proceedings of previous Stata Conferences and Users Group meetings.