This page contains only historical information and is not about the current release of Stata. Please see our Stata 18 page for information on the current version of Stata.


Data management in Stata 9

Stata’s data-management features are now documented in a single volume for easy reference. The volume covers match-merges, file and variable management, and sorting, as well as more advanced features, such as collecting statistics from any command over groups and reshaping datasets from wide to long and vice versa.

New features include the ability to read and write datasets in the format required for FDA NDAs, support for reading and writing XML, support for simultaneous multiple language labels, filtering files, and additional support for ODBC.

Here are the details.

  • There is a new manual [D] Data Management, and the data-management commands have been moved from [R] to [D]. See [D] intro for an expanded what’s new for data-management capabilities.

  • Existing command set type now has a permanently option. You can now permanently set the default datatype to either float (the factory default) or double.

  • New commands xmlsave and xmluse save and restore datasets in Extensible Markup Language (XML) format. Data may be saved or used in either Stata dta XML format or Microsoft Excel’s SpreadsheetML format. See [D] xmlsave.

  • New commands fdasave, fdause, and fdadescribe save, use, and describe files in the format required by the U.S. Food and Drug Administration (FDA) for new drug and device applications (NDAs). These commands are designed to assist people making submissions to the FDA, but the commands are general enough for use in transferring data between SAS and Stata. The FDA format is identical to the SAS XPORT Transport format. See [D] fdasave.

  • Value labels may now be up to 32,000 characters long.

  • Existing command label has a new subcommand language that lets you create and use datasets containing different variable, value, and data labels, which might be in different languages. See [D] label language.

  • Datasets from the examples in the Stata manuals can now be browsed, described, and used. Type help dta contents, or select File > Example datasets... from the Stata menu.

  • statsby is now a prefix command; see [U] 11.1.10 Prefix commands. For information on its new syntax, see [D] statsby. Enhancements to statsby include

    • Rather than requiring a list of expressions for the statistics to collect, statsby now collects a default set.

    • Expressions to be computed and saved can now be grouped together as equations; see exp list.

    • String variables are now allowed.

    • Weights are now allowed.

    • New option force forces statsby to work with survey estimators. By default, this is prevented because the method statsby uses to select subsamples will generally not produce appropriate standard error estimates with survey data (the subpop option must be used with survey data).

    • Dots showing the progress of computations are now shown by default.

    • New option nolegend suppresses the table reporting on what statsby is running.
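The prefix syntax described above can be sketched as follows. This is a minimal illustration that assumes the auto dataset shipped with Stata; the choice of regress, summarize, and the by-variables is only for demonstration.

```stata
* Load the auto dataset shipped with Stata
sysuse auto, clear

* Prefix syntax: with no expression list, statsby collects a default
* set of statistics (the coefficients, _b) from each by-group
statsby, by(foreign) clear: regress mpg weight

* The data in memory now hold one observation per value of foreign
list

* Explicit expressions are still allowed; nolegend suppresses the
* table reporting on what statsby is running
sysuse auto, clear
statsby mean=r(mean) sd=r(sd), by(rep78) clear nolegend: summarize price
list
```

Because the clear option replaces the data in memory with the collected statistics, the dataset is reloaded before the second call.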

  • New command filefilter copies an input file to an output file while converting a specified ASCII or binary pattern to another pattern; see [D] filefilter.

  • New command expandcl replicates clusters of unique observations, much like expand, but for clustered data; see [D] expandcl.

  • New command tostring converts numeric variables to string; see [D] tostring.

  • Existing command codebook now allows if and in qualifiers; see [D] codebook.

  • New command rmdir removes an existing directory (folder); see [D] rmdir.

  • New command clonevar makes an identical copy of an existing variable; see [D] clonevar.

  • Existing commands icd9 and icd9p have been updated to use the V21 codes; see [D] icd9 and [D] icd9p.

  • Existing command encode has new option noextend that prevents adding new value label mappings; see [D] encode.

  • Existing command odbc for accessing Open DataBase Connectivity (ODBC) data sources has the following enhancements:

    • ODBC is now supported under Mac OS X and Linux systems that use the iODBC Driver Manager. For more information on configuring ODBC for Mac and Linux, see the FAQ at http://www.stata.com/support/faqs/data-management/configuring-odbc/.

    • odbc has new subcommands odbc insert and odbc exec for writing data to an ODBC data source. Positioned updates can be performed using the odbc exec command.

    • odbc has a new subcommand sqlfile for batch processing SQL instructions.

    • odbc load has a new option sqlshow for debugging SQL communication with ODBC drivers.

    • odbc load has new options allstring and datestring, which import either all data or just dates as strings.

    See [D] odbc.

  • Existing command merge has the following new features:

    • It now accepts multiple using files.

    • New option nosummary suppresses creating variables that summarize how the records were merged.

    • New option sort sorts the master and using datasets if they are not already sorted.

    • Existing options unique, uniqmaster, and uniqusing now require you to specify matching variables.

    • Warning messages are now given when matching variables do not uniquely identify observations.

    See [D] merge.
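A merge with multiple using files might look like the sketch below. It uses the Stata 9 form of the command (later releases changed the merge syntax), and the file names using1 and using2 and the variables id, x, and y are hypothetical, created here only so that the example is self-contained.

```stata
* Build two small using datasets keyed on a hypothetical variable id
clear
set obs 3
generate id = _n
generate x = 10*_n
save using1, replace

clear
set obs 3
generate id = _n
generate y = 100*_n
save using2, replace

* Master dataset with the same key
clear
set obs 3
generate id = _n

* Match-merge on id with two using files in one command; the sort
* option sorts the datasets on id if they are not already sorted
merge id using using1 using2, sort
list
```

Here id uniquely identifies observations in every dataset, which is the situation the sort option is intended for.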

  • Existing commands merge and append now incorporate all notes from the using dataset that do not already appear in the master dataset, unless new option nonotes is specified; see [D] merge and [D] append.

  • Existing command contract has new options cfreq(), percent(), cpercent(), float, and format() to create frequency and percentage variables; see [D] contract.

  • Existing commands corr2data and drawnorm now support triangular specification of the correlation or covariance matrix; see [D] corr2data and [D] drawnorm.

  • Existing command separate has new option shortlabel to specify that shorter variable labels be created; see [D] separate.

  • Existing command outfile has new option missing that preserves both standard and extended missing values when the comma option is also specified; see [D] outfile.

  • Existing command clear now performs mata: mata clear in addition to everything else; see [D] clear.
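Several of the smaller additions listed above can be tried together. The following sketch assumes the auto dataset shipped with Stata; the new variable names (price2, rep78_str, n, pct) and the language name de are arbitrary choices made for illustration.

```stata
sysuse auto, clear

* clonevar: an identical copy of a variable, including its labels and format
clonevar price2 = price

* tostring: convert a numeric variable to a string variable
tostring rep78, generate(rep78_str)

* codebook now accepts an if qualifier
codebook mpg if foreign == 1

* label language: keep a second set of labels alongside the default set
label language de, new
label variable price "Preis"
label language default
describe price

* contract with the new percent() option: one observation per value
* of foreign, with frequency and percentage variables
contract foreign, freq(n) percent(pct)
list
```

contract replaces the data in memory, so it is placed last.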

Functions and expressions

  • The limit for the number of dyadic operators has been increased from 200 to 500; see limits.

  • The default matrix size (matsize) for Intercooled Stata is now 200, rather than 40. The default for Stata/SE remains 400, and for Small Stata, 40.

  • The following new functions have been added in the context of expressions, such as generate newvar = exp or if exp:
    name         purpose
    binormal()   bivariate normal cumulative distribution
    atan2()      two-argument arc tangent
    regexm()     regular expression matching
    regexr()     regular expression replacement
    regexs()     regular expression subexpressions
    indexnot()   position of the first character of s1 not found in s2
    See [D] functions or type help followed by the function name, such as help binormal().

    In addition, a host of new functions are available through Mata; see [M-4] intro — Index and guide to functions.
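As a brief illustration of the new expression functions, the sketch below builds a one-observation dataset. The variable names s, vernum, and s2 and the sample string are hypothetical.

```stata
clear
set obs 1
generate str20 s = "Stata 9"

* regexm() tests for a match; regexs(n) then returns the nth
* parenthesized subexpression of that match
generate vernum = regexs(2) if regexm(s, "([A-Za-z]+) ([0-9]+)")

* regexr() replaces the first substring that matches the pattern
generate s2 = regexr(s, "[0-9]+", "X")
list

* atan2(y, x): angle of the point (x, y); for (1, 1) this is pi/4
display atan2(1, 1)

* indexnot(s1, s2): position of the first character of s1 not in s2;
* here the third character, "a", is not a digit
display indexnot("12a4", "0123456789")

* binormal(h, k, rho): joint cumulative probability of a standard
* bivariate normal with correlation rho
display binormal(0, 0, 0)
```

Using regexs() in the same expression as regexm(), guarded by the if qualifier, keeps the subexpression tied to the match that produced it.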

  • The following existing functions have been renamed:
    old name new name
    Old names continue to work. The functions were renamed because the new names are better and because Mata uses the new names, so you can use the same names in both environments.

  • The following existing functions now have two names, and you can use either:
    Name 1       Name 2
    lower()      strlower()
    upper()      strupper()
    proper()     strproper()
    ltrim()      strltrim()
    rtrim()      strrtrim()
    trim()       strtrim()
    reverse()    strreverse()
    string()     strofreal()
    int()        trunc()
    length()     strlen()
    In this case, throughout the Stata documentation, we use name 1, but you can use name 1 or name 2 in your Stata expressions. Name 2 matches the name of the Mata function that does the same thing, so you may want to standardize on name 2.

  • The following egen functions have been renamed:
    old name     new name
    any()        anyvalue()
    eqany()      anymatch()
    neqany()     anycount()
    rfirst()     rowfirst()
    rlast()      rowlast()
    rmean()      rowmean()
    rmin()       rowmin()
    rmiss()      rowmiss()
    robs()       rownonmiss()
    rsd()        rowsd()
    rsum()       rowtotal()
    sum()        total()
    The new names are more consistent. Old names continue to work but are not documented.
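
The renamed egen functions are called in the usual way. The short sketch below uses hypothetical variables a and b on a three-observation dataset.

```stata
clear
set obs 3
generate a = _n
generate b = 2*_n
replace b = . in 2

* Row-wise functions under their new names
egen avg   = rowmean(a b)
egen tot   = rowtotal(a b)
egen nmiss = rowmiss(a b)

* total() is the new name for sum(): the sum of a over all observations
egen grand = total(a)
list
```

In the second observation b is missing, so rowmean() averages only a, and rowmiss() reports one missing value.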