Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# Re: st: Confirming whether a variable is binary or continuous

From: Nick Cox <[email protected]>
To: [email protected]
Subject: Re: st: Confirming whether a variable is binary or continuous
Date: Mon, 19 Mar 2012 10:11:45 +0000

```
I agree with Cameron to this extent: There isn't a precise soluble
problem here without a precise definition of binary variable and,
implicitly or explicitly, of continuous variable.

Bert did flag an interest in 0s and 1s, but other people might have
other questions.

I don't get the impression that Bert wants to do statistical analysis
on his data to investigate what measurement scale it is, or might be.
I get the impression he wants to do data management.

I don't want to get bogged down in terminology here, but binary
variables have also been called dichotomous, indicator, dummy,
quantal, Boolean and logical and no doubt other names too. (If you
know of other names used in English, I'd like to add them to my
collection.)

Tying together various earlier comments, and making some others:

String included or numeric only? A string variable with values "male"
and "female" is binary in many people's eyes and can easily be mapped
to a numeric binary variable.
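
A minimal sketch of such a mapping, assuming a hypothetical string
variable named sex that holds "male", "female" or the empty string
(missing):

    * sex and female are illustrative names, not from the thread
    generate byte female = (sex == "female") if !missing(sex)

Anything other than "female" or the empty string becomes 0, so this
assumes the spellings are clean. -encode- is another route, but it
numbers the categories from 1 upwards, so the result would not be 0/1
without further recoding.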

Two distinct values? One weak criterion is that just two distinct
values occur. That would allow, say, 24 and 42 as the two values. The
obvious drawback is that it cannot separate binary variables from
variables that just happen to have two distinct values but in
principle could have many more.
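
As an illustration of this weak criterion, here is one way to count
distinct nonmissing values with official Stata. The auto data and the
variable foreign are assumptions chosen only for the example:

    sysuse auto, clear
    quietly tabulate foreign
    display r(r)

-tabulate- leaves the number of rows in r(r) and ignores missing
values unless the -missing- option is specified. A result of 2
satisfies the criterion without saying anything about which two
values occur.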

Zero or one only? One strict definition is that the values must be
restricted to 0 or 1.

What if there is only one distinct value in practice? That could be a
problem, as various analyses won't be possible, but you have to decide
what to do.

Missing values? Whatever the definition, missing values may also
occur. No data management program in Stata is serious unless it copes
intelligently with missing data when they occur.

It's implicit in Bert's postings that for his purposes, if a variable
isn't binary, then it's continuous. I'll comment only that many
researchers use much more elaborate taxonomies and terminologies.

Here is a sketch of one possible program:

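program isbinary, rclass
    version 10.1
    syntax [varlist]

    * keep numeric variables only; string variables are ignored
    qui ds `varlist', has(type numeric)
    local varlist `r(varlist)'

    * a variable qualifies if every value is 0, 1 or missing
    foreach v of local varlist {
        capture assert missing(`v') | inlist(`v', 0, 1)
        if _rc == 0 local binary `binary' `v'
    }

    * list any binary variables found and return their names in r(varlist)
    if "`binary'" != "" describe `binary', simple
    return local varlist "`binary'"
end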
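
A hedged usage sketch, assuming the program above has been defined in
the current session and using the auto data purely as an example:

    sysuse auto, clear
    isbinary
    return list
    isbinary mpg foreign make

With no arguments all variables are examined. After each call the
names of the variables that passed the check are left in r(varlist),
which -return list- displays.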

Notes.

1. No -varlist- need be specified, but either way string variables are ignored.

2. This example program encapsulates one choice (binary variables are
numeric, and may be 0, 1 or missing (and nothing else)). Other choices
are, as emphasised, clearly possible.

3. To be useful (most) data management programs leave saved results too.

4. Continuous variables are just the complement on this definition.
See -ds- or -findname- (SJ, SSC) for tools here.
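
A sketch of that complement using macro list operators, assuming
-isbinary- as defined above. The local macro names are illustrative:

    isbinary
    local binary `r(varlist)'
    quietly ds, has(type numeric)
    local numeric `r(varlist)'
    * numeric variables that are not binary on this definition
    local continuous : list numeric - binary
    display "`continuous'"

The subtraction keeps the variables in the order of the first list,
so dataset order is preserved.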

As it happens, -min()- and -max()- options were added to
-distinct- (SSC, SJ) before this thread got under way. A version with
those options is in press at the Stata Journal. I'll send a copy to
Kit Baum for SSC. With this version of -distinct-

distinct, max(2)
distinct, min(2) max(2)

would be other answers to this question. Note: as just mentioned, you
can't do this yet with any -distinct- that you have unless you are one
of the program authors.

I can't see that there is a Stata way to tell apart a variable which
is just 0 or 1 in practice from one which can only be 0 or 1 in
principle. (Again, missing values aside.)  It's a subject-matter
decision.

Nick

On Mon, Mar 19, 2012 at 3:02 AM, Cameron McIntosh <[email protected]> wrote:
> I think that the only way to decide how to proceed is to first approach this issue conceptually (i.e., think about it) -- based on your content area expertise, is the covariate in question truly binary (qualitative) or do the observed categories merely discretize a latent continuous process? If the former, you can use observed categorical variable methodology to examine the covariate distributions by treatment group (e.g., chi-square tests of independence and related methods for contingency tables); if the latter, then you may be into tetrachorics and the like.
>
> MacDonald, P.L., & Gardner, R.C. (2000). Type I Error Rate Comparisons of Post Hoc Procedures for I × J Chi-Square Tables. Educational and Psychological Measurement, 60(5), 735-754.
>
> Bentler, P.M. (2011). Can Interval-level Scores be Obtained from Binary Responses? UCLA Preprint #622. http://preprints.stat.ucla.edu/622/Bentler%20Interval%20Scores%20from%20Binary%20Responses.pdf
>
> Ulrich, R., & Wirtz, M. (2004). On the correlation of a naturally and an artificially dichotomized variable. British Journal of Mathematical and Statistical Psychology, 57(2), 235–251.
>
> Ledesma, R.D., Macbeth, G., & Valero-Mora, P. (2011). Software for Computing the Tetrachoric Correlation Coefficient. Revista Latinoamericana de Psicología, 43(1), 181-189. http://openjournal.konradlorenz.edu.co/index.php/rlpsi/article/viewFile/459/463
>
> Greer, T., Dunlap, W.P., & Beatty, G.O. (2003). A Monte Carlo Evaluation of the Tetrachoric Correlation Coefficient. Educational and Psychological Measurement, 63(6), 931-950.
>
> Bonett, D.G., & Price, R.M. (2005). Inferential Methods for the Tetrachoric Correlation Coefficient. Journal of Educational and Behavioral Statistics, 30(2), 213-225.
>
> Long, M.A., Berry, K.J., & Mielke, P.W., Jr. (2009). Tetrachoric Correlation: A Permutation Alternative. Educational and Psychological Measurement, 69(3), 429-437.
>
> Genest, C., & Lévesque, J.-M. (2009). Estimating correlation from dichotomized normal variables. Journal of Statistical Planning and Inference, 139(11), 3785-3794.
>
> Choi, J., Peters, M., & Mueller, R.O. (2010). Correlational analysis of ordinal data: from Pearson’s r to Bayesian polychoric correlation. Asia Pacific Education Review, 11(4), 459-466.
>
> Cam
>
>> Date: Mon, 19 Mar 2012 00:28:27 +0000
>> Subject: Re: st: Confirming whether a variable is binary or continuous
>> From: [email protected]
>> To: [email protected]
>>
>> Your program just echoes its own input, confirming that what you
>> specify is a binary variable is indeed binary and what you specify is
>> a continuous variable is indeed continuous. It does no checking
>> whatsoever.
>>
>> I am puzzled about why you think that is useful and indeed in what
>> sense it is a solution to your original problem.
>>
>> Nick
>>
>> On Sun, Mar 18, 2012 at 5:07 PM, Bert Jung <[email protected]> wrote:
>>
>> > Thanks all for these helpful insights.  I wanted to share my solution
>> > which, if clumsy, works for me.  The basic idea is to check whether a
>> > particular variable is part of the continuous or binary varlist and
>> > then proceed as appropriate.
>> >
>> > This approach keeps intact the order specified in varlist.  I am
>> > collecting estimation output and wanted the order to remain as
>> > specified by the user.
>> >
>> > This is just a minimum working example; obviously various checks and
>> > balances are in order.
>> >
>> > Cheers Bert
>> >
>> >
>> >
>> > cap program drop varcheck
>> > program varcheck, nclass
>> >
>> >        syntax varlist, contvars(varlist) binaryvars(varlist)
>> >
>> >        * Loop over all variables in varlist; this approach keeps the order
>> >        * in -varlist- intact
>> >        foreach v of local varlist {
>> >
>> >                * (a) Is variable part of the variables specified in "contvars"?
>> >                local contvar: list v in contvars
>> >
>> >                if `contvar'==1 {
>> >                        di "`v' is specified as continuous variable"
>> >                }
>> >
>> >
>> >                * (b) Is variable part of the variables specified in "binaryvars"?
>> >                local propvar: list v in binaryvars
>> >
>> >                if `propvar'==1 {
>> >                        di "`v' specified as binary variable"
>> >                }
>> >        }
>> >
>> > end
>> >
>> >
>> > sysuse auto, clear
>> >
>> > varcheck mpg price foreign weight, contvars(mpg price weight) binaryvars(foreign)
>> >
>> >
>> >
>> >

>> >> On 03/16/12, Bert Jung  <[email protected]> wrote:
>>
>> >>> I am writing a short program to make a balance table that compares
>> >>> covariates across a treatment and control group.  I am looking for a
>> >>> way to confirm whether a variable is binary in order to use -prtest-
>> >>> for proportions rather than -ttest- for continuous variables.
>> >>>
>> >>> One option is to check the actual data values and do -prtest- if there
>> >>> are only 0's and 1's.  But a continuous but rare outcome could
>> >>> accidentally also take these values, e.g. the number of
>> >>> hospitalizations in the past 3 months.
>> >>>
>> >>> Is there a safer way to confirm that a variable is binary?
>> >>>

```