       July 27–28
The Stata Conference was held July 27–28, 2017, but you can still view the program and presentation slides (below) and the conference photos.
| 9:00–9:20 | 
              Abstract:
                This presentation introduces two user-written Stata
                commands related to the data and calculations of
                demographic life tables, whose most prominent feature is
                the calculation of life expectancy at birth. The first
                command, hmddata, provides a convenient interface
                to the Human Mortality Database (HMD,
                www.mortality.org), a database widely used for mortality
                data by demographers, health researchers, and social
                scientists. Different subcommands of hmddata
                allow data from this database to be easily loaded,
                transformed, reshaped, tabulated, and graphed. The
                second command, lifetable, produces demographic
                period life tables. The main features are that life
                table columns can be flexibly calculated using any valid
                minimum starting information; abridged tables can be
                generated from complete ones; finally, a Stata dataset
                can hold any number of life tables, and the various
                lifetable subcommands can operate on any subset
                of them.
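                The abstract does not show command syntax; as a flavor of the underlying calculation, here is a minimal period life-table sketch in plain Stata. The variables age (single years of age) and mx (central death rates), and the choice a(x) = 0.5, are assumptions.

                    * period life table from age and mx (assumed variables);
                    * a(x) = 0.5 person-years per decedent is assumed throughout
                    sort age
                    gen qx = mx/(1 + 0.5*mx)            // death probability in the interval
                    quietly replace qx = 1 in l         // close the open-ended last age group
                    gen lx = 100000 in 1                // radix: survivors at exact age 0
                    quietly replace lx = lx[_n-1]*(1 - qx[_n-1]) in 2/l
                    gen dx = lx*qx                      // deaths in the interval
                    gen Lx = lx - 0.5*dx                // person-years lived in the interval
                    quietly replace Lx = lx/mx in l     // person-years in the open interval
                    gsort -age
                    gen Tx = sum(Lx)                    // person-years remaining above age x
                    sort age
                    gen ex = Tx/lx                      // life expectancy; ex in row 1 is e0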
               Additional information: Baltimore17_Schneider.pdf 
 Daniel C. Schneider, Max Planck Institute for Demographic Research
| 9:40–10:00 | 
              Abstract:
                There has been extensive research indicating
                gender-based differences among STEM subjects,
                particularly mathematics (Albano and Rodriguez 2013;
                Lane, Wang, and Magone 1996). Similarly, gender-based
                differential item functioning (DIF) has been researched
                because of the disadvantages females face in STEM
                subjects when compared with their male counterparts.
                Given that, this study will apply the multiple
                indicators multiple causes (MIMIC) model, a type of
                structural equation model, to detect the presence of
                gender-based DIF using the Program for International
                Student Assessment (PISA) mathematics data from students
                in the United States of America and then predict the DIF
                using math-related covariates. This study will build
                upon a previous study that explored the same data using
                the hierarchical generalized linear model and will be
                confirmatory in nature. Based on the results of the
                previous study, it is expected that several items will
                exhibit DIF that disadvantages females and that
                mathematics-based self-efficacy will predict the DIF.
                However, additional covariates will also be explored,
                and the two models will be compared in terms of their
                DIF detection and the subsequent modeling of DIF.
                Implications of these results include continued underachievement by females relative to their male counterparts. These gender differences can further manifest at the national level, causing U.S. students as a whole to underperform internationally. Last, the efficacy of the MIMIC model for detecting and predicting DIF will be illustrated, so that it may become more widely used to model and better understand such differences.
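                As a sketch of the modeling approach (not the authors' code), a MIMIC model with a uniform-DIF path can be specified with Stata's gsem; all item and covariate names below are hypothetical.

                    * latent math ability measured by binary items and regressed on the
                    * grouping covariate; a direct path from the covariate to a studied
                    * item captures uniform DIF (all variable names are hypothetical)
                    gsem (MathAb -> item1-item10, logit) ///
                         (MathAb <- female)              ///
                         (item3  <- female)              // significant path = uniform DIF on item3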
               Additional information: Baltimore17_Krost.pptx 
 Kevin Krost, Virginia Tech; Joshua Cohen, Virginia Tech
| 10:00–10:20 | 
              Abstract:
                In 2001, I gave a presentation on three-valued logic.
                Since then, I have developed some ideas that grew out of
                that investigation, leading to new insights about
                missing values and to the development of five-valued
                logic. I will also show how these notions extend to
                numeric computation and to an abstract generalization of
                the principles involved. This is not about analysis;
                this is about data construction and preparation, and it
                is a possibly interesting conceptual tool.
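                For readers unfamiliar with the underlying issue, Stata's handling of missing values in logical expressions is what motivates multi-valued logic; a small illustration with the auto data:

                    * missing numeric values sort above all numbers, so (x > k) is
                    * true when x is missing -- the classic three-valued-logic trap
                    sysuse auto, clear
                    count if rep78 > 3                        // silently includes missing rep78
                    count if rep78 > 3 & !missing(rep78)      // excludes it explicitly
                    gen byte highrep = rep78 > 3 if !missing(rep78)   // 1, 0, or missing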
               Additional information: Baltimore17_Kantor.pptx multi_valued_logic.docx 
 David Kantor, Data for Decisions
| 10:40–11:10 | 
              Abstract:
                parallel lets you run Stata faster, sometimes faster than Stata/MP itself. By organizing your job into several Stata instances, parallel allows you to work with out-of-the-box parallel computing. Using the parallel prefix, you can get faster simulations, bootstrapping, reshaping of big data, and so on, without having to know a thing about parallel computing. Even without Stata/MP installed on your computer, parallel has been shown to speed up computations by a factor of two, four, or more, depending on how many processors your computer has.
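                A minimal usage sketch, assuming the package is installed (the cluster count and the task being parallelized are placeholders):

                    * split subsequent work across four child Stata instances
                    parallel setclusters 4
                    * run a by-observation task on each instance's block of the data
                    parallel: replace y = y^2              // y is a placeholder variable
                    * parallelized bootstrap of a regression
                    parallel bs, reps(1000): regress y x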
               Additional information: Baltimore17_Quistorff.pdf 
 Brian Quistorff, Microsoft; George G. Vega Yon, University of Southern California
| 11:10–11:40 | 
              Abstract:
                The inclusion of the Java API for Stata provides users and user-programmers with exciting opportunities to leverage a wide array of existing work in the context of their Stata workflow. This talk will introduce tools designed to help others integrate Java libraries into their workflow: the Stata Maven Archetype and the StataJavaUtilities library. In addition to a higher-level overview, the presentation will show examples of using existing Java libraries to expand statistical models in psychometrics, to send yourself email when a job completes, to compute phonetic string encodings and string distances, and to access file and operating-system properties, as well as examples to use as starting points for developing Java plugins in Stata.
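                On the Stata side, the entry point for such plugins is javacall, which invokes a static Java method taking a String[] argument; the class, method, and message below are hypothetical.

                    * call a static int send(String[] args) method on a class found on
                    * the ado-path or in a jar; names are hypothetical
                    javacall org.example.Mailer send, args("job finished")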
               Additional information: Baltimore17_Buchanan
 Billy Buchanan, Fayette County Public Schools
| 11:40–12:00 | 
              Abstract:
                In recent years, very large datasets have become
                increasingly prevalent in most social sciences. However,
                some of the most important Stata commands
                (collapse, egen, merge,
                sort, etc.)  rely on algorithms that are not well
                suited for big data. In my talk, I will present the
                ftools package, which contains plugin alternatives to
                these commands and performs up to 20 times faster on large datasets (see the benchmarks at https://github.com/sergiocorreia/ftools/#benchmarks). Further, I will explain the underlying algorithm and Mata function and show how to use this function to create new Stata commands and to speed up existing packages.
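                For example, the package's commands are near drop-in replacements for the built-ins (assuming ftools is installed; variable names are placeholders):

                    * ftools replacements mirror the built-in syntax
                    fsort id                                     // instead of sort
                    fegen g = group(state year)                  // instead of egen group()
                    fcollapse (sum) sales (mean) price, by(id)   // instead of collapse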
               Additional information: Baltimore17_Correia.pdf 
 Sergio Correia, Board of Governors of the Federal Reserve System
| 1:00–1:30 | 
              Abstract:
                Part of the art of coding is writing as little as
                possible to do as much as possible. The presentation
                expands on this truism.  Examples are given of Stata
                code to yield graphs and tables in which most of the
                real work is happily delegated to workhorse commands. In
                graphics, a key principle is that graph twoway is
                the most general command, even when you do not want
                rectangular axes. Variations on scatter- and line plots
                are precisely that, variations on scatter- and line
                plots. More challenging illustrations include commands
                for circular and triangular graphics, in which x and y
                axes are omitted, with an inevitable but manageable cost
                in re-creating scaffolding, titles, labels, and other
                elements. In tabulations and listings, the better-known
                commands sometimes seem to fall short of what you want.
                However, a few preparation commands (such as
                generate, egen, collapse, or
                contract) followed by list,
                tabdisp, or _tab can get you a long way.
                The examples range in scope from a few lines of
                interactive code to fully developed programs. The
                presentation is thus pitched at all levels of Stata
                users.
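                A small self-contained instance of the tabulation strategy described above, using only official commands:

                    * a preparation command plus a display command: contract reduces the
                    * data to frequencies, and tabdisp lays them out as a two-way table
                    sysuse auto, clear
                    contract rep78 foreign
                    tabdisp rep78 foreign, cell(_freq)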
               Additional information: Baltimore17_Cox.pptx 
 Nicholas Cox, Durham University, United Kingdom
| 1:30–2:20 | 
              Abstract:
                Part of reproducible research is eliminating manual steps such as hand-editing documents. Stata 15 introduces several commands that facilitate automated document production, including dyndoc for converting dynamic Markdown documents to web pages, putdocx for creating Word documents, and putpdf for creating PDF files. These commands let you mix formatted text and Stata output and embed Stata graphs, inline Stata results, and tables containing the output of selected Stata commands. We will show these commands in action, demonstrating how to automate the production of documents in various formats and include Stata results in those documents.
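                For instance, a Word report with an embedded estimation table can be produced entirely from a do-file (the file name is arbitrary):

                    * minimal putdocx sketch: a title plus a table of the last estimates
                    sysuse auto, clear
                    regress mpg weight foreign
                    putdocx begin
                    putdocx paragraph, style(Title)
                    putdocx text ("Fuel economy regression")
                    putdocx table results = etable
                    putdocx save myreport.docx, replace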
               Additional information: Baltimore17_Peng
 Hua Peng, StataCorp
| 2:40–3:10 | 
              Abstract:
                We compare a variety of methods for predicting the
                probability of a binary treatment (the propensity
                score), with the goal of comparing otherwise like cases
                in treatment and control conditions for causal inference
                about treatment effects. Better prediction methods can
                under some circumstances improve causal inference by
                reducing both the finite sample bias and variability of
                estimators, but sometimes, better predictions of the
                probability of treatment can increase bias and variance,
                and we clarify the conditions under which different
                methods produce better or worse inference (in terms of
                mean squared error of causal impact estimates).
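                In Stata, such comparisons amount to swapping the treatment-model specification inside teffects; a sketch with official commands and a built-in example dataset:

                    * same estimator, different propensity-score models
                    webuse cattaneo2, clear
                    teffects ipw (bweight) (mbsmoke mmarried mage fbaby, logit)   // logit score
                    teffects ipw (bweight) (mbsmoke mmarried mage fbaby, probit)  // probit score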
               Additional information: Baltimore17_Nichols.pdf 
 Austin Nichols, Abt Associates; Linden McBride, Cornell University
| 3:10–3:40 | 
              Abstract:
                In this paper, we create an algorithm to predict which
                students are eventually going to drop out of U.S. high
                school using information available in ninth grade. We
                show that using a naive model, as implemented in many schools, leads to poor predictions. We then explain how schools can obtain more precise predictions by exploiting the big data available to them, as well as more sophisticated quantitative techniques. We also compare the performance of econometric techniques like logistic regression with machine learning tools such as support vector machines, boosting, and LASSO. We offer practical advice on how to apply machine learning methods in Stata to the high-dimensional datasets available in education.
                Model parameters are calibrated by taking into account
                policy goals and budget constraints.
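                A minimal sketch of the hold-out evaluation workflow with official commands (dropout and the predictors are hypothetical variables):

                    * 70/30 train/test split and an out-of-sample confusion table
                    set seed 12345
                    gen byte train = runiform() < 0.7
                    logit dropout x1 x2 x3 if train
                    predict double p                          // predicted dropout probabilities
                    gen byte flagged = p > 0.5 if !missing(p)
                    tabulate dropout flagged if !train        // performance on held-out data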
               Additional information: Baltimore17_Sansone.pdf 
 Dario Sansone, Georgetown University
| 4:00–4:20 | 
              Abstract:
                We present a new Stata package for small-area
                estimations of poverty and inequality implementing
                methodologies from Elbers, Lanjouw, and Lanjouw (2003).
                Small-area methods attempt to overcome the low representativeness of surveys within areas or the lack of data for specific areas and subpopulations. This is accomplished by incorporating information from outside sources. A common outside source is census data, which often lack detailed information on welfare. Thus far, a major limitation on such analysis in Stata has been the memory required to work with census data. The povmap package introduces new Mata functions and a plugin that circumvent the memory limitations that arise when working with big data.
               Additional information: Baltimore17_Nguyen.pdf 
 Minh Nguyen, Paul Andres Corral Rodas, Joao Pedro Wagner De Azevedo, and Qinghua Zhao, World Bank
| 4:20–4:40 | 
              Abstract:
                We present examples of how to construct interactive maps
                in Stata, using only built-in commands available even in
                secure environments. One can also use built-in commands
                to smooth geographic data as a pre-processing step.
                Smoothing can be done using methods from twoway contour or predictions from a GMM model as described in Drukker, Prucha, and Raciborski (2013). The basic approach to creating a map in Stata is twoway area with the options nodropbase cmissing(n) yscale(off) xscale(off), applied to a polygon “shapefile” dataset (often created with Kevin Crow's user-written shp2dta, possibly with a change of projection using programs by Robert Picard), with multiple calls to area with if qualifiers to build a choropleth, or scatter to superimpose point data. This approach is automated by several user-written commands and works well for static images but is less effective for web content, where a JavaScript entity is desirable. However, it is
                straightforward to write out the requisite information
                using the file command and to use open-source map
                tools to create interactive maps for the web. We present
                two useful examples.
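                The static-map recipe in outline, assuming the user-written shp2dta is installed and states.shp is a hypothetical shapefile:

                    * convert a shapefile to Stata datasets, then draw the polygons
                    shp2dta using states.shp, database(usdb) coordinates(uscoord) genid(id)
                    use uscoord, clear
                    twoway area _Y _X, nodropbase cmissing(n) yscale(off) xscale(off)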
               Additional information: Baltimore17_Lauer.pdf 
 Ali Lauer, Abt Associates
| 4:40–5:00 | 
              Abstract:
                We provide examples of how one can use satellite or
                other remote sensing data in Stata, with a variety of
                analysis methods, including examples of measuring
                economic disadvantage using satellite imagery.
               Additional information: Baltimore17_Nisar.pdf 
 Hiren Nisar, Abt Associates
| 9:00–9:20 | 
              Abstract:
                We developed an ado-file to easily estimate three
                selected occupational segregation indicators with
                standard errors using a bootstrap procedure.  The
                indicators are the Duncan and Duncan (1955) dissimilarity index, the Gini coefficient based on the distribution of jobs by gender (see Deutsch et al. [1994]), and the Karmel and MacLachlan (1988) index of labor market segregation. This routine can be easily applied to conventional labor market microdata in which occupation classification, industry, and occupational category variables are usually available. As an illustration, we present estimates of both occupational and industry segregation by gender drawn from Colombian household survey microdata. The estimation of
                occupational segregation measures with standard errors
                proves to be useful in assessing statistical differences
                in segregation measures within labor market groups and
                over time.
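                For reference, the Duncan and Duncan index that the routine bootstraps is simple to compute directly; the variables male and female (employment counts per occupation) are assumptions:

                    * Duncan & Duncan (1955) dissimilarity index from occupation counts
                    egen M = total(male)
                    egen F = total(female)
                    gen d_i = abs(male/M - female/F)
                    egen double D = total(d_i)
                    display "dissimilarity index = " D[1]/2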
               Additional information: Baltimore17_Isaza-Castro.pdf 
 Jairo G. Isaza-Castro, Karen Hernandez, Karen Guerrero, and Jessy Hemer, Universidad de la Salle
| 9:20–9:40 | 
              Abstract:
                Cluster randomized trials (CRTs), where clusters (for
                example, schools or clinics) are randomized but
                measurements are taken on individuals, are commonly used
                to evaluate interventions in public health and social
                science. Because CRTs typically involve only a few
                clusters, simple randomization frequently leads to
                baseline imbalance of cluster characteristics across
                treatment arms, threatening the internal validity of the
                trial. In CRTs with a small number of clusters, classic
                approaches to balancing baseline characteristics—such
                as matching and stratification—have several drawbacks,
                especially when the number of baseline characteristics
                the researcher desires to balance is large (Ivers et al.
                2012). An alternative approach is constrained
                randomization, whereby an allocation scheme is randomly
                selected from a subset of all possible allocation
                schemes based on the value of a balancing criterion
                (Raab and Butcher 2001). Subsequently, an adjusted
                permutation test can be used in the analysis, which
                provides increased efficiency under constrained
                randomization compared with simple randomization (Li et
                al. 2015). We describe constrained randomization and
                permutation tests for the design and analysis of CRTs
                and provide examples to demonstrate the use of our newly
                created Stata package, cvcrand, which uses Mata to efficiently process large allocation matrices, to implement constrained randomization and permutation tests.
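                The permutation-test half of the workflow can be sketched with official Stata; note that permute below draws unconstrained permutations, whereas the package's adjusted test permutes only within the constrained allocation space (y, treat, and the data are hypothetical):

                    * permutation test of a cluster-level treatment effect
                    permute treat _b[treat], reps(1000) seed(42): regress y treat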
               Additional information: Baltimore17_Gallis.pdf 
 John Gallis, Fan Li, Hengshi Yu, and Elizabeth L. Turner, Duke University
| 9:40–10:00 | 
              Abstract:
                Researchers constructing measurement models must decide
                how to proceed when an initial specification fits
                poorly. Common approaches include search algorithms that
                optimize fit and piecemeal changes to the item list or
                the error specification. The former approach may yield a
                good-fitting model that is inconsistent with theory or
                may fail to identify the best-fitting model because of
                local optimization issues.  The latter suffers from poor
                reproducibility and may also fail to identify the
                optimal model. We outline a new approach that defines a
                computationally tractable specification space based on
                theory. We use the example of a hypothesized latent
                variable with 25 candidate indicators divided across 5
                content areas. Using Stata’s tuples command, we
                identify all combinations of indicators containing >=1
                indicator per content area. In our example, this yields
                7,294 models. We estimate each model on a derivation
                dataset and select candidate models with fit statistics
                that are acceptable or could be rendered acceptable by
                permitting correlated errors. Eight models met these criteria. We evaluate modification indices, respecify if
                there is theoretical justification for correlated
                errors, and select a final model based on fit
                statistics. In contrast to other methods, this approach
                is easily replicable and may result in a model that is
                consistent with theory and has acceptable fit.
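                In outline, the enumeration-and-fit loop looks like this (the community-contributed tuples must be installed; the indicator names are hypothetical, and the real application additionally enforces the one-per-content-area constraint):

                    * tuples leaves the combinations in locals tuple1...tuple`ntuples'
                    tuples x1 x2 x3 x4 x5, min(3)
                    forvalues i = 1/`ntuples' {
                        quietly sem (F -> `tuple`i'')
                        quietly estat gof, stats(rmsea)
                        display "model `i' (`tuple`i''): RMSEA = " r(rmsea)
                    }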
               Additional information: Baltimore17_Dougherty.pptx 
 Geoff Dougherty and Lorraine Dean, Johns Hopkins Bloomberg School of Public Health
| 10:40–11:10 | 
              Abstract:
                We present response surface coefficients for a large
                range of quantiles of the Elliott, Rothenberg, and Stock
                (Econometrica 1996) DF-GLS unit-root tests for
                different combinations of the number of observations and
                the lag order in the test regressions, where the latter
                can be either specified by the user or endogenously
                determined. The critical values depend on the method
                used to select the number of lags. The Stata command ersur is presented, and its use is illustrated with an empirical example that tests the validity of the expectations hypothesis of the term structure of interest rates.
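                The underlying test is available in official Stata as dfgls; a sketch of the regression that ersur wraps with response-surface critical values:

                    * the Elliott-Rothenberg-Stock DF-GLS unit-root test on built-in data
                    webuse lutkepohl2, clear
                    dfgls ln_inv, maxlag(4)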
	       Additional information: Baltimore17_Baum.pdf 
 Christopher Baum, Boston College and DIW Berlin; Jesús Otero, Universidad del Rosario, Colombia
| 3:10–3:40 | 
              Abstract:
                Estimating the causal effect of a treatment is
                challenging when selection into the treatment is based
                on contemporaneous unobservable characteristics, and the
                outcome of interest is represented by a series of
                correlated binary outcomes. Under these assumptions,
                traditional nonlinear panel-data models, such as the
                random-effects logistic model, will produce biased
                estimates of the treatment effect because of correlation
                between the treatment variable and model unobservables.
                In this presentation, I will introduce a new Stata
                estimation command, etxtlogit, that can estimate
                a model where the outcome is a series of J correlated
                logistic binary outcomes and selection into the
                treatment is based on contemporaneous unobservable
                characteristics. The presentation will introduce the new
                estimation command, present Monte Carlo evidence, and
                offer empirical examples. Special cases of the model
                will be discussed, including applications based on the
                explanatory (behavioral) Rasch model, a model from item
                response theory (IRT).
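                For contrast, the naive random-effects logit that the abstract notes is biased under endogenous treatment is the usual (panel and variable names hypothetical):

                    * benchmark model: ignores correlation between treat and unobservables
                    xtset id time
                    xtlogit y treat x1, re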
               Additional information: Baltimore17_Rabbitt.pdf 
 Matthew P. Rabbitt, Economic Research Service, U.S. Department of Agriculture
| 11:40–12:00 | 
              Abstract:
                A continuation ratio model represents a variant of an
                ordered regression model that is suited to modeling
                processes that unfold in stages, such as educational
                attainment. The parameters for covariates in continuation ratio models may be constrained to be equal across stages, subject to a proportionality constraint, or allowed to vary freely. Currently, there
                are three user-written Stata commands that fit
                continuation ratio models. Each of these commands fits
                some subset of continuation ratio models involving
                parameter constraints, but none of them offer complete
                coverage of the range of possibilities. In addition, all
                the commands rely on reshaping the data into a
                stage-case format to facilitate estimation. The new
                crreg command expands the options for
                continuation ratio models to include the possibility for
                some or all of the covariates to be constrained to be
                equal, to freely vary, or to have a proportionality
                constraint across stages. The crreg command
                relies on Stata’s ML routines for estimation and
                avoids reshaping the data. The crreg command
                includes options for three different link functions (the
                logit, probit, and cloglog) and supports Stata’s
                survey and multiple imputation suites of commands.
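                The stage-case expansion that crreg renders unnecessary can be sketched as follows (educ codes the highest stage attained, 1-4; all names are hypothetical):

                    * traditional person-stage expansion for a continuation-ratio logit
                    expand educ                       // one record per stage reached
                    bysort id: gen stage = _n
                    drop if stage == 4                // no transition beyond the top stage
                    gen byte cont = stage < educ      // continued past this stage?
                    logit cont i.stage x              // pooled continuation-ratio model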
               Additional information: Baltimore17_Bauldry.pdf 
 Shawn Bauldry, Purdue University; Jun Xu, Ball State University; Andrew Fullerton, Oklahoma State University
| 1:00–1:30 | 
              Abstract:
                When I was in graduate school, I was taught that
                multivariate methods were the future of data analysis.
                In that dark computer stone age, multivariate meant
                multivariate analysis of variance (MANOVA), linear
                discriminant function analysis (LDA), canonical
                correlation analysis (CA), and factor analysis (which
                will not be discussed in this presentation). Statistical
                software has evolved considerably since those ancient
                days. MANOVA, LDA, and CA are still around but have been
                eclipsed and pushed aside by newer, sexier
                methodologies. These three methods have been consigned
                to the multivariate dustbin, so to speak.  This
                presentation will review MANOVA, LDA, and CA, discuss
                the connections among the three approaches, and
                highlight the positives and negatives of each approach.
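                All three methods remain one-liners in official Stata; for example, with the auto data and foreign as the grouping variable:

                    sysuse auto, clear
                    manova mpg weight length = foreign              // MANOVA
                    discrim lda mpg weight length, group(foreign)   // linear discriminant analysis
                    canon (mpg weight) (length displacement)        // canonical correlations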
               Additional information: Baltimore17_Ender.pdf 
 Phil Ender, UCLA (Ret.)
| 1:30–2:20 | 
              Abstract:
		In survival analysis, right-censored data have been studied extensively
		and can be analyzed using Stata's extensive suite of survival commands,
		including streg for fitting parametric survival models. Right-censored
		data are a special case of interval-censored data. Interval-censoring
		occurs when the failure time of interest is not exactly observed but is
		only known to lie within some interval. Left-censoring, which occurs
		when the failure is known to happen some time before the observed time,
		is also a special case of interval-censoring. Survival data may contain
		a mixture of uncensored, right-censored, left-censored, and
		interval-censored observations. In this talk, I will describe basic
		types of interval-censored data and demonstrate how to fit parametric
		survival models to these data using Stata's new stintreg command. I
		will also discuss postestimation features available after this command.
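                A minimal invocation (Stata 15); ltime and rtime are hypothetical interval-boundary variables, with rtime missing for right-censored observations:

                    * Weibull model for failure times known only to lie in (ltime, rtime]
                    stintreg age i.treat, interval(ltime rtime) distribution(weibull)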
               Additional information: Baltimore17_Yang.pdf 
 Xiao Yang, StataCorp
| 2:40–3:10 | 
              Abstract:
                I use the new extended regression command eoprobit to estimate the effect of an endogenous treatment on an ordinal profit outcome.
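                A sketch of the command (Stata 15; the outcome, covariate, treatment, and instrument names are hypothetical):

                    * ordered probit with an endogenous binary treatment
                    eoprobit profitcat x1, entreat(treat = z1 x1)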
               Additional information: Baltimore17_Drukker.pdf 
 David M. Drukker, StataCorp
| 3:40–4:30 | Wishes and grumbles, StataCorp |
Renaissance Baltimore Harborplace Hotel
202 East Pratt Street
Baltimore, MD 21202
The conference venue is near several tourist attractions, including the USS Constellation and other vessels in the harbor, the American Visionary Art Museum, and the National Aquarium.
Scientific committee

Joe Canner (Chair), Department of Surgery, Johns Hopkins University
John McGready, Department of Biostatistics, Johns Hopkins University
Austin Nichols, Abt Associates
Sharon Weinberg, Applied Statistics and Psychology, New York University