Re: st: -label define- and -replace- when a variable may be missing

From	Michael McCulloch <[email protected]>
To	[email protected]
Subject	Re: st: -label define- and -replace- when a variable may be missing
Date	Sun, 9 Mar 2014 20:50:40 -0700

Thanks, Joseph and Phil, for the elegant suggestions.


Best wishes,
Michael McCulloch

--
Pine Street Foundation, since 1989
124 Pine Street | San Anselmo | California | 94960-2674  
P: (415) 407-1357 | F: (206) 338-2391 | http://www.PineStreetFoundation.org

On Mar 9, 2014, at 6:44 PM, Joseph Coveney wrote:

> Squaring the dataset to a standardized form can also be done by -append-ing to a
> standardized template empty dataset (see below); this would avoid the use of
> -capture- in production code if that's a concern.  Regardless, if I were the OP,
> I would want to understand why the datasets arrive in an inconsistent format.
> And I would worry whether the dataset supplier could accidentally (or otherwise)
> flag more than one type of training as positive, because then the collection of
> training types into a single variable with nine value labels (one for each
> possibility) would fail.  Perhaps it's time for the OP to wade upstream closer
> to the source in order to take a look at how the data are recorded and
> processed.
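
If -capture- is kept rather than replaced by the template approach, its scope can be narrowed so that only the expected error is ignored. The following is a minimal sketch under assumptions, not code from the thread: the three-observation demo dataset is hypothetical, and the only error treated as harmless is r(110), Stata's "already defined" return code.

```stata
* Hypothetical incoming data: only item 1 arrived
clear
set obs 3
generate byte what_types_of_training_did___1 = 1

forvalues i = 1/9 {
    capture generate byte what_types_of_training_did___`i' = 0
    * r(110) means the variable already exists, which is expected here;
    * any other nonzero return code is re-raised instead of being hidden
    if !inlist(_rc, 0, 110) {
        error _rc
    }
}
describe, fullnames
```

An existing item keeps its incoming values, a missing item is created as all zeros, and anything unexpected (for example, a typo in the variable name that makes it invalid) still stops the do-file.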
> 
> Joseph Coveney
> 
> . clear *
> 
> . set more off
> 
> . set seed `=date("2014-03-10", "YMD")'
> 
> . 
> . *
> . * Incoming dataset
> . *
> . quietly set obs 20
> 
> . foreach i in 1 3 5 9 {
>  2.     generate byte what_types_of_training_did___`i' = 0
>  3. }
> 
> . replace what_types_of_training_did___1 = 1
> (20 real changes made)
> 
> . generate byte pid = _n
> 
> . generate str1 sex = cond(runiform() < 0.5, "F", "M")
> 
> . generate int age = floor(20 + 20 * runiform())
> 
> . set linesize 79
> 
> . describe, fullnames
> 
> Contains data
>  obs:            20                          
> vars:             7                          
> size:           160                          
> -------------------------------------------------------------------------------
>              storage  display     value
> variable name   type   format      label      variable label
> -------------------------------------------------------------------------------
> what_types_of_training_did___1
>                byte   %8.0g                  
> what_types_of_training_did___3
>                byte   %8.0g                  
> what_types_of_training_did___5
>                byte   %8.0g                  
> what_types_of_training_did___9
>                byte   %8.0g                  
> pid             byte   %8.0g                  
> sex             str1   %9s                    
> age             int    %8.0g                  
> -------------------------------------------------------------------------------
> Sorted by:  
>     Note:  dataset has changed since last saved
> 
> . tempfile incoming
> 
> . quietly save `incoming'
> 
> . 
> . *
> . * Standardized template empty dataset
> . *
> . drop _all
> 
> . forvalues i = 1/9 {
>  2.     quietly generate byte what_types_of_training_did___`i' = .
>  3. }
> 
> . 
> . *
> . * Squaring incoming dataset(s) by appending to template
> . *
> . append using `incoming'
> 
> . 
> . describe, fullnames
> 
> Contains data
>  obs:            20                          
> vars:            12                          
> size:           260                          
> -------------------------------------------------------------------------------
>              storage  display     value
> variable name   type   format      label      variable label
> -------------------------------------------------------------------------------
> what_types_of_training_did___1
>                byte   %8.0g                  
> what_types_of_training_did___2
>                byte   %8.0g                  
> what_types_of_training_did___3
>                byte   %8.0g                  
> what_types_of_training_did___4
>                byte   %8.0g                  
> what_types_of_training_did___5
>                byte   %8.0g                  
> what_types_of_training_did___6
>                byte   %8.0g                  
> what_types_of_training_did___7
>                byte   %8.0g                  
> what_types_of_training_did___8
>                byte   %8.0g                  
> what_types_of_training_did___9
>                byte   %8.0g                  
> pid             byte   %8.0g                  
> sex             str1   %9s                    
> age             int    %8.0g                  
> -------------------------------------------------------------------------------
> Sorted by:  
>     Note:  dataset has changed since last saved
> 
> . list pid-age *1 in -2/l, abbreviate(30) noobs
> 
>  +--------------------------------------------------+
>  | pid   sex   age   what_types_of_training_did___1 |
>  |--------------------------------------------------|
>  |  19     M    38                                1 |
>  |  20     F    39                                1 |
>  +--------------------------------------------------+
> 
> . 
> . exit
> 
> end of do-file
> 
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Phil Schumm
> Sent: Monday, March 10, 2014 05:12
> To: Statalist Statalist
> Subject: Re: st: -label define- and -replace- when a variable may be missing
> 
> [OP redacted for brevity]
> 
> 
> You can do this two ways: (1) write the code to do the desired translation
> (i.e., from 9 vars into 1) in a way that can accommodate fewer than 9 input
> variables, or (2) fill in any missing variables first, and then perform the
> translation.  I tend to prefer the latter, which results in a workflow like
> 
>               standardized        transformed
>    raw  --->    dataset     --->    dataset
>          A                   B
> 
> where A is a series of steps which yield a dataset in "standard" form, and B
> includes whatever transformations of the data are necessary prior to analysis,
> distribution, or whatever.  Thus, in the example above, A might include
> something like
> 
>    forv i = 1/9 {
>        cap gen byte what_types_of_training_did___`i' = 0
>    }
> 
> which you might then follow with a few tests, such as ensuring that the 9 items
> are truly mutually exclusive (as required if you want to collapse them into a
> single variable).  Note that this would even handle the case where none of the
> variables exists (e.g., if none of the first batch of respondents provided an
> answer to the question).
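
A minimal sketch of such a test, assuming the items are coded 0/1 as in the examples in this thread. The five-observation demo dataset and the helper variable n_flagged are hypothetical and serve only to make the snippet self-contained.

```stata
* Hypothetical demo data: 5 respondents, only items 1 and 3 supplied
clear
set obs 5
generate byte what_types_of_training_did___1 = _n <= 2
generate byte what_types_of_training_did___3 = _n == 3

* Step A: fill in any item that did not arrive
forvalues i = 1/9 {
    capture generate byte what_types_of_training_did___`i' = 0
}

* Test: no respondent may be flagged for more than one training type
egen byte n_flagged = rowtotal(what_types_of_training_did___*)
assert n_flagged <= 1
```

If a supplier ever flags two types for the same respondent, the -assert- stops the do-file at step A, before the collapse into a single variable can silently pick one of them.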
> 
> Separation of A and B (typically in different set(s) of do-files) in the data
> management context has two important advantages:
> 
> 1) It allows you to write simpler code in B, which makes it more readable and
> maintainable, and cuts down on errors, and
> 
> 2) It makes it easier to reuse the code in B in different contexts (as long as
> you pass it a dataset in standard form, which is where automated testing comes
> in handy).
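
As an illustration of how simple B becomes once the dataset is in standard form, here is a sketch of collapsing the nine indicators into one labeled variable. It assumes the items are mutually exclusive; the name training_type and the label text "Training type #" are hypothetical placeholders, since the actual label wording is not given here.

```stata
* Hypothetical standardized data: all nine indicators present
clear
set obs 4
forvalues i = 1/9 {
    generate byte what_types_of_training_did___`i' = 0
}
replace what_types_of_training_did___1 = 1 in 1
replace what_types_of_training_did___5 = 1 in 2
replace what_types_of_training_did___9 = 1 in 3

* Step B: collapse the nine indicators into one labeled variable
generate byte training_type = .
forvalues i = 1/9 {
    replace training_type = `i' if what_types_of_training_did___`i' == 1
    * Placeholder label text; substitute the real wording
    label define training_type `i' "Training type `i'", add
}
label values training_type training_type
tabulate training_type, missing
```

Because every one of the nine variables is guaranteed to exist, the loop needs no -capture- and no existence checks. Respondents with no item flagged are left as missing.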
> 
> 
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/

