The 2018 Canadian Stata Conference was held on 27 July at the Morris J. Wosk Centre for Dialogue. The proceedings and presentation slides are available below.
Approaches to imputing missing data in complex survey data
Abstract: Complex survey data collected by government agencies are both expensive and valuable. Producing a complete dataset is important, but missing data in complex survey data pose some unique challenges. Commonly used statistical software packages such as Stata, SAS, and SUDAAN each have a procedure to impute the missing data. However, unlike the procedures for describing and analyzing complex survey data, the imputation procedures implemented by these three software programs are fundamentally different. The three approaches will be described, and an example will show the similarities and differences. Recent developments in this area at the Census Bureau will also be discussed.
Calibrating survey weights in Stata
Abstract: Calibration is a method for adjusting the sampling weights and is often used to account for nonresponse and underrepresented groups in the population. Another benefit of calibration is smaller variance estimates compared with estimates using unadjusted weights. Stata implements two methods for calibration: the raking-ratio method and the generalized regression method. Stata supports calibration for the estimation of totals, ratios, and regression models. Calibration is also supported by each survey variance-estimation method implemented in Stata. In this presentation, I will show how to use calibration in survey data analysis using Stata.
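To make the raking-ratio idea concrete, here is a minimal Python sketch of iterative proportional fitting on two margins. It is an illustration only and not Stata's implementation: the function names, the six-respondent dataset, and the population totals are all made up for the example.

```python
def rake(weights, margins, iterations=100, tol=1e-10):
    """Rake weights so weighted counts match known totals on each margin.

    margins is a list of (categories, totals) pairs: categories gives each
    respondent's category on that margin; totals maps category -> known total.
    """
    w = list(weights)
    for _ in range(iterations):
        max_change = 0.0
        for categories, totals in margins:
            # Current weighted total in each category of this margin.
            current = {}
            for wi, c in zip(w, categories):
                current[c] = current.get(c, 0.0) + wi
            # Scale every weight by (known total / current total).
            for i, c in enumerate(categories):
                factor = totals[c] / current[c]
                w[i] *= factor
                max_change = max(max_change, abs(factor - 1.0))
        if max_change < tol:  # all margins already match: stop early
            break
    return w


def weighted_totals(weights, categories):
    """Sum the weights within each category."""
    totals = {}
    for wi, c in zip(weights, categories):
        totals[c] = totals.get(c, 0.0) + wi
    return totals


# Hypothetical sample: six respondents, each with a base weight of 10.
sex = ["m", "m", "f", "f", "f", "f"]
region = ["n", "s", "n", "s", "n", "s"]
base = [10.0] * 6

# Hypothetical known population totals for each margin.
raked = rake(base, [(sex, {"m": 30.0, "f": 30.0}),
                    (region, {"n": 25.0, "s": 35.0})])

print(weighted_totals(raked, sex))
print(weighted_totals(raked, region))
```

The sketch adjusts one margin at a time and repeats until no adjustment factor differs meaningfully from one. The generalized regression method mentioned in the abstract is a different technique and is not shown.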
Causal Inference, Endogeneity, and Data Science
Multiple fractional response with endogenous binary explanatory variables: An application to consumers
Abstract: Contactless credit cards are a payment innovation combining the speed and convenience of paying cash with desirable features of credit card payments, for example, enhanced record keeping and the ability to earn rewards. There have been several attempts to measure the impact that contactless credit card adoption has on consumers' use of cash for making point-of-sale transactions. Fung, Huynh, and Sabetti (2014) use data from the Bank of Canada's 2009 Methods-of-Payment survey to estimate that contactless adoption results in a decline of 10% for the volume share of purchases made with cash. This analysis was undertaken when use and acceptance of contactless payment were still nascent. Chen, Felt, and Huynh (2017), by contrast, find no impact on the cash share. Their work exploited panel-data structure to better control for unobserved heterogeneity across consumers. Part of the difficulty in measuring the impact of contactless adoption on cash usage is the obvious endogeneity issue: it is unclear whether adoption of contactless technology lowers cash usage or whether cash-intensive consumers are less likely to adopt contactless, perhaps for other reasons, for example, a preference for anonymity. Huynh, Schmidt-Dengler, and Stix (2014) show that merchant acceptance also plays a crucial role in cash usage, further complicating the causality issue as contactless terminals, while increasing over time, are certainly not ubiquitous. Recent work by Nam (2016) using an approach developed by Wooldridge (2014) allows us to address this problem and provide a more robust model of payment choice and contactless adoption. We utilize data from the Bank of Canada's 2013 Methods-of-Payment survey. The survey included a three-day payments diary that tracks respondents' purchases over the course of three days; this allows us to calculate cash, debit, and credit shares.
These shares have an obvious dependence: an increase in the cash share will necessarily lead to a decrease in either debit or credit because the shares must add to one. Nam's estimator allows us to model this effect while simultaneously accounting for the endogenous contactless adoption decision, hence providing more reliable estimates of the impact on cash. We implement the estimator in Stata and provide a method for bootstrapping error estimates.
Chen, H., M. H. Felt, and K. P. Huynh. 2017. Retail payment innovations and cash usage: Accounting for attrition using refreshment samples. Journal of the Royal Statistical Society, Series A 180: 503–530.
Fung, B., K. P. Huynh, and L. Sabetti. 2014. The impact of retail payment innovations on cash usage. Journal of Financial Market Infrastructure 12: 1–29.
Huynh, K. P., P. Schmidt-Dengler, and H. Stix. 2014. The role of card acceptance in the transaction demand for money. Bank of Canada Staff Working Paper 2014-44.
Nam, S. 2016. Multiple fractional response variables with a binary endogenous explanatory variable. Mimeo.
Wooldridge, J. M. 2014. Quasi-maximum likelihood estimation and testing for nonlinear models with endogenous explanatory variables. Journal of Econometrics 182: 226–234.
Bank of Canada
Bounding a causal effect using relative correlation restrictions
Abstract: Causal inference generally relies on strong assumptions of exogeneity and selection on observables. Applied researchers regularly make these assumptions but are often concerned that their results may be sensitive to small violations of them. This presentation will describe one approach to this problem: inferring the correlation between the treatment and unobservables from the observed correlation between the treatment and the observable/control variables. I will also describe implementation of this method in Stata and some practical considerations in its use.
Simon Fraser University
Learning about selection: An improved correction procedure
Abstract: Machine learning techniques are utilized in this presentation to improve upon the selection correction procedure of Dahl (2002). Dahl's nonparametric method is widely used in the empirical economics literature to control for selection bias; however, it relies on a strong identification assumption. This single index sufficiency assumption (SISA) imposes restrictions on the error terms of the selection equation that are likely violated in many applications. This contribution establishes a modified correction procedure that uses variable selection techniques to relax this assumption. Identification in this alternative procedure relies on a restriction that is data driven and is a relaxation of the SISA. Variable selection is performed by employing the post-double-lasso estimator of Belloni, Chernozhukov, and Hansen (2014). This is implemented in Stata using lassopack, a set of community-contributed commands by Ahrens, Hansen, and Schaffer. I perform a numerical experiment that establishes that this method is preferable to traditional correction procedures in all cases, except where researchers have strong a priori reasons to suspect that the SISA holds. Machine learning methods, combined with the insights of Lee (1983), can therefore be used to control for selection bias, while overcoming the curse of dimensionality, without the imposition of overly strong distributional assumptions.
University of British Columbia
A new Stata command for the Random Forest algorithm
Abstract: Random Forest is a statistical machine-learning algorithm for prediction and classification under supervised learning. Our Stata command randomforest implements this algorithm through a plugin to the WEKA library. randomforest is available for Windows/Mac/Linux. We will review the algorithm and illustrate randomforest with two examples: 1) prediction of the election outcomes for individual constituencies of the 2017 British Election Study data and 2) prediction of household income from the 2016 US Consumer Finance Survey data.
University of Waterloo
Efficient dynamic documents using Stata
Abstract: Stata 15 includes three new commands for producing dynamic documents: dyndoc, putdocx, and putpdf. These commands have generated much interest in the user community; this has led to a large amount of community-contributed software. In this presentation, I'll give some tips about how to use the commands efficiently both with official Stata software and with some of these community-contributed tools.
Exporting cartography data from Stata to GIS systems
Abstract: geotools is a community-contributed set of tools for exporting data from Stata datasets in the ubiquitous ShapeFile and GeoJSON formats. These formats are supported by numerous online and offline GIS systems, including ESRI's ArcView/ArcGIS products, Google API, and other GIS and data-visualization systems. The input data may come from the user's own data collection, such as with the use of GPS sensors in the growing segment of CAPI data collection software, or they may be a product of geospatial data analysis in Stata. The output can be used as layers in composite multilayer maps, as interactive maps, etc. geotools does not require online access or other software to produce its output. In the presentation, I will give an overview of the functionality and options of geotools and relate it to other community-contributed Stata modules for GIS capabilities and file formats.
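For readers unfamiliar with GeoJSON, the following Python sketch shows what point data look like in that format. It does not use or reproduce geotools: the function name `to_geojson`, the variable names, and the two coordinate pairs are hypothetical and serve only to illustrate the target file structure.

```python
import json


def to_geojson(rows, lon_key="lon", lat_key="lat"):
    """Convert a list of dict rows into a GeoJSON FeatureCollection of points."""
    features = []
    for row in rows:
        # Every variable other than the coordinates becomes a feature property.
        properties = {k: v for k, v in row.items()
                      if k not in (lon_key, lat_key)}
        features.append({
            "type": "Feature",
            # GeoJSON stores positions as [longitude, latitude].
            "geometry": {"type": "Point",
                         "coordinates": [row[lon_key], row[lat_key]]},
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up observations, such as GPS readings recorded during an interview.
rows = [
    {"id": 1, "lon": -123.10, "lat": 49.30, "respondents": 12},
    {"id": 2, "lon": -123.20, "lat": 49.20, "respondents": 7},
]

collection = to_geojson(rows)
print(json.dumps(collection, indent=2))
```

Note that the coordinate order is longitude first, which is a common source of misplaced points when data are exported by hand.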
The World Bank
Stata for an introductory biostatistics course—Some useful insights
Abstract: I present instructional aids using Stata that I have found useful for an introductory course on biostatistics taught at the University of Toronto. Particularly useful tools include CDF graphs that highlight the fact that treatment effects in logit and other binary response models depend on the variance of the latent underlying continuous variable; animations that show the relationship between hypothesis tests on a parameter value and the corresponding confidence interval; and a slightly generalized form of the power-by-simulation Stata program developed by A. H. Feiveson.
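The idea behind power by simulation can be sketched in a few lines of Python. This is not Feiveson's program or the presenter's generalization of it. It is a minimal illustration under stated assumptions: two equal-sized groups with normally distributed outcomes, and a large-sample z critical value used in place of the t distribution. The function name and all parameter values are hypothetical.

```python
import random
import statistics


def simulated_power(n_per_group, effect, sd=1.0, alpha=0.05,
                    reps=1000, seed=12345):
    """Estimate power as the share of simulated datasets that reject H0."""
    rng = random.Random(seed)
    # Two-sided critical value from the standard normal (an approximation).
    z_crit = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(reps):
        # Simulate one dataset under the assumed effect size.
        control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, sd) for _ in range(n_per_group)]
        diff = statistics.fmean(treated) - statistics.fmean(control)
        se = (statistics.variance(control) / n_per_group
              + statistics.variance(treated) / n_per_group) ** 0.5
        # Run the test on the simulated data and record whether it rejects.
        if abs(diff / se) > z_crit:
            rejections += 1
    return rejections / reps


power_half_sd = simulated_power(n_per_group=50, effect=0.5)
size_under_null = simulated_power(n_per_group=50, effect=0.0)
print(power_half_sd, size_under_null)
```

Setting the effect to zero gives a useful classroom check: the rejection rate should then be close to the nominal significance level.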
University of Toronto
Upgrading business statistics curriculum to meet the needs of knowledge workers
Abstract: Business faculties are often the largest units in universities and colleges in North America. During 2013–14, over 300,000 graduate and undergraduate business degrees were conferred by North American business schools accredited by the Association to Advance Collegiate Schools of Business (AACSB). For the same period, over 1.1 million students were enrolled in graduate and undergraduate programs in business faculties. At the undergraduate level, most business students are required to take at least one, and in most cases two, courses in business statistics and analytics. A quick review of course outlines and the table of contents of popular textbooks in business statistics will reveal that not much has changed over the past few decades in the way statistics are taught to business students. Despite the emergence of big data, advances in computing power, and the availability of open-source software and open data, business statistics curricula still follow the learning paths established before the successive revolutions in computing. Thus, students are still taught how to conduct a battery of inferential tests, while most tests could be replaced with regression models. Consider that instructors continue to spend one or more lectures introducing t-tests in undergraduate courses, while the same output could be readily obtained from a regression model with a continuous dependent and a categorical explanatory variable. In this presentation, I highlight the need to update the curriculum for courses in business statistics. I make the case to replace inferential tests, for example, t-tests and correlation tests, with regression models and to introduce regression-driven inferential statistics earlier in the course rather than at the very end, as is still the practice today.
I also highlight the need to introduce basic machine-learning algorithms to the curriculum so that one can narrow the gap between the analytic skills desired by businesses and the statistical training imparted to business students.
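The claim that a t-test can be read off a regression is easy to verify numerically. The Python sketch below compares the pooled-variance two-sample t statistic with the t statistic on the slope from a regression of the outcome on a 0/1 group indicator. The data values and function names are invented for the illustration.

```python
def pooled_t(group0, group1):
    """Classical two-sample t statistic with pooled variance."""
    n0, n1 = len(group0), len(group1)
    m0, m1 = sum(group0) / n0, sum(group1) / n1
    ss0 = sum((v - m0) ** 2 for v in group0)
    ss1 = sum((v - m1) ** 2 for v in group1)
    pooled_var = (ss0 + ss1) / (n0 + n1 - 2)
    return (m1 - m0) / (pooled_var * (1 / n0 + 1 / n1)) ** 0.5


def regression_slope_t(x, y):
    """t statistic on the slope from OLS of y on a constant and x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = (ssr / (n - 2) / sxx) ** 0.5
    return slope / se


# Hypothetical outcomes for two groups.
group0 = [4.1, 5.0, 3.8, 4.6, 5.2]
group1 = [5.9, 6.3, 5.1, 6.8, 5.5, 6.0]

# Stack the data and code group membership as a 0/1 indicator.
y = group0 + group1
x = [0.0] * len(group0) + [1.0] * len(group1)

t_test = pooled_t(group0, group1)
t_reg = regression_slope_t(x, y)
print(t_test, t_reg)
```

The two statistics agree up to floating-point error because the slope on the indicator equals the difference in group means and the regression residual variance equals the pooled variance.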
Inference with clustered data
Abstract: This article introduces clusteff, a new Stata command for checking the severity of cluster heterogeneity in cluster-robust analyses. Cluster heterogeneity can cause a size distortion leading to under-rejection of the null hypothesis. Carter, Schnepel, and Steigerwald (2015) develop the effective number of clusters to reflect a reduction in the degrees of freedom, thereby mirroring the distortion caused by assuming homogeneous clusters. clusteff generates the effective number of clusters. We provide a decision tree for cluster-robust analysis, demonstrate the use of clusteff, and recommend methods to minimize the size distortion.
UC Santa Barbara
Fast and wild: Bootstrap inference in Stata using boottest
Abstract: The Stata package boottest implements a wide variety of bootstrap tests, including tests for linear regression models that are robust to one-way or multiway clustering. I explain how these tests work and provide empirical examples. In the one-way case, the program can generate the bootstrap data in two different ways, using the wild bootstrap or the wild cluster bootstrap. In the two-way case, it can do so in four different ways, using the wild bootstrap or three variants of the wild cluster bootstrap. For each method, four different p-values can be calculated to handle all types of one-sided and two-sided tests.
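As a rough illustration of the one-way wild cluster bootstrap, the Python sketch below tests a zero slope in a simple regression. It is not boottest and does not reproduce its fast algorithm or its options. The assumptions are: a single regressor plus a constant, residuals from the model restricted by the null hypothesis, Rademacher (plus or minus one) weights drawn once per cluster, a cluster-robust standard error without a small-sample correction, and a symmetric two-sided p-value. The simulated data and all names are hypothetical.

```python
import random


def slope_and_cluster_t(x, y, cluster):
    """OLS slope of y on x (with constant) and its cluster-robust t statistic."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    # Sum the scores (x - xbar) * residual within each cluster.
    scores = {}
    for xi, yi, g in zip(x, y, cluster):
        resid = yi - intercept - slope * xi
        scores[g] = scores.get(g, 0.0) + (xi - xbar) * resid
    se = sum(s ** 2 for s in scores.values()) ** 0.5 / sxx
    return slope, slope / se


def wild_cluster_bootstrap_p(x, y, cluster, reps=999, seed=2018):
    """Symmetric bootstrap p-value for H0: slope = 0."""
    rng = random.Random(seed)
    _, t_obs = slope_and_cluster_t(x, y, cluster)
    # Under H0 the restricted model is a constant, so fitted values are ybar.
    ybar = sum(y) / len(y)
    restricted_resid = [yi - ybar for yi in y]
    groups = sorted(set(cluster))
    extreme = 0
    for _ in range(reps):
        # One Rademacher draw per cluster keeps within-cluster dependence.
        sign = {g: rng.choice((-1.0, 1.0)) for g in groups}
        y_star = [ybar + sign[g] * r
                  for g, r in zip(cluster, restricted_resid)]
        _, t_star = slope_and_cluster_t(x, y_star, cluster)
        if abs(t_star) >= abs(t_obs):
            extreme += 1
    return extreme / reps


# Simulated data: 12 clusters of 10 observations with a shared cluster shock.
data_rng = random.Random(1)
x, y, cluster = [], [], []
for g in range(12):
    x_shock, y_shock = data_rng.gauss(0, 1), data_rng.gauss(0, 1)
    for _ in range(10):
        xi = x_shock + data_rng.gauss(0, 1)
        x.append(xi)
        y.append(1.0 + 0.5 * xi + y_shock + data_rng.gauss(0, 1))
        cluster.append(g)

slope_hat, t_hat = slope_and_cluster_t(x, y, cluster)
p_value = wild_cluster_bootstrap_p(x, y, cluster)
print(slope_hat, t_hat, p_value)
```

Omitting the small-sample correction does not change the p-value here, because the same constant factor would scale the observed and the bootstrap t statistics alike.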
Wishes and grumbles
Abstract: Stata developers present will carefully and cautiously consider wishes and grumbles from Stata users in the audience. Questions, and possibly answers, may concern reports of present bugs and limitations or requests for new features in future releases of the software.
Leslie-Anne Keown (Chair)
University of California–San Francisco
Bank of Canada
University of Waterloo
Calgary Statistical Support
Registration and accommodations
Registration is now closed.
The optional users dinner was held at Blue Water Cafe on 27 July at 6:00.
Blue Water Cafe
1095 Hamilton Street
Vancouver BC V6B 5T4
Vancouver Marriott Pinnacle Downtown Hotel
1128 West Hastings Street
Vancouver BC V6E 4R5
Morris J. Wosk Centre for Dialogue
Simon Fraser University
Asia Pacific Hall
580 W. Hastings St.
Vancouver, BC V6B 5K3