Statalist The Stata Listserver



st: RE: FW: Cleanup of messy variable


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: FW: Cleanup of messy variable
Date   Thu, 19 Oct 2006 18:36:41 +0100

I think you need to clean up at source. 

Some of the problems look fairly clear 
and can be fixed with a -subinstr()- 
function in a -replace-. Some look more
difficult to diagnose. 

For example, "998" as an element looks like
a miscoding for "9 98" and the action would 
then be 

replace myvar = subinstr(myvar, "998", "9 98", .) 
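
To see that kind of fix in context, here is a minimal sketch on
made-up data. The three values are copied from rows of the posted
table but the data set itself is a toy; -myvar- is just the
placeholder name used above. The -trim()- and -itrim()- step is my
assumption about how to deal with the inconsistent spacing and the
apparently right-justified values; it is not something established
from the real data.

```stata
* Toy data: three values taken from the posted frequency table
clear
input str22 myvar
"1 2 3 4 5 7 8 998"
"1 298"
"                   88"
end

* Remove leading/trailing blanks and collapse runs of internal blanks
replace myvar = itrim(trim(myvar))

* Split the fused element "998" into "9 98"
replace myvar = subinstr(myvar, "998", "9 98", .)

* "298" is handled the same way, one pattern at a time
replace myvar = subinstr(myvar, "298", "2 98", .)

list
```

Each pattern needs its own judgement: "298" is probably "2 98",
but elements such as "51098" or "85998" could be read in more than
one way, which is why checking at source is safer.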

Once you have cleaned up, some of your 
questions can be answered using -tabsplit-
from -tab_chi- on SSC. 
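
A sketch of that step, assuming -myvar- has already been cleaned so
that codes are separated by single spaces. Only the simplest call
is shown; check the help for -tabsplit- for its options before
relying on anything more.

```stata
* Install the package that contains -tabsplit- (needs net access)
ssc install tab_chi

* Tabulate the separate elements of the cleaned string variable
tabsplit myvar
```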

Others will require a different data structure
based on a -split- and then a -reshape-. 
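
A sketch of what that restructuring might look like, again on toy
data and assuming the cleanup is done. The names -id-, -code-,
-seq-, -anygamble-, -internet- and -first- are made up for the
example. Treating codes 1 to 11 as "any gambling" and code 10 as
Internet gambling follows the code list in the question.

```stata
* Toy data standing in for the cleaned variable (values are made up)
clear
input str22 myvar
"1 2 98"
"5 10 98"
"88"
"99"
end

* Respondent identifier (assumed; use the real one if it exists)
gen long id = _n

* One numeric variable per element: code1, code2, ...
split myvar, generate(code) destring

* One observation per respondent per reported code
reshape long code, i(id) j(seq)
drop if missing(code)

* Any gambling: at least one code from 1 to 11
egen byte anygamble = max(inrange(code, 1, 11)), by(id)

* Internet gambling: code 10 reported
egen byte internet = max(code == 10), by(id)

* Proportions, using one observation per respondent
egen byte first = tag(id)
summarize anygamble internet if first

* Frequency of each form of gambling, most common first
tabulate code if inrange(code, 1, 11), sort
```

The means reported by -summarize- are the proportions asked for,
and the sorted table gives the most common forms directly.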

Nick 
n.j.cox@durham.ac.uk 

Honey, Wayne, DOH
 
> We have a data set with a poorly designed string variable of 
> the form str%22s.  This variable allowed for multiple 
> responses to be coded in the following manner: 
> 
> 01.  Cards (21, Black Jack, Poker, etc.)
> 02.  Animals (Roosters, dogs, horses, frogs, ducks)
> 03.  Sports (football, baseball, pool, golf)(incl. pools, 
> w/friends or bookie)
> 04.  Dice games of any type (Craps, etc.)
> 05.  Lottery or numbers (Quick Pick, Road Runner, scratch cards, etc.)
> 06.  Bingo
> 07.  Raffles or sweepstakes
> 08.  Slot machines, video machines or other gambling machines
> 09.  Pull Tabs, punch cards 
> 10.  Internet Gambling
> 11.  Other, please specify: ______________________________  
> SAM (575-594)
> 
> 88.  Never Gamble  GO TO NEXT MODULE
> 98.  No other
> 77.  Don't Know/Not Sure
> 99.  Refused  GO TO NEXT MODULE
> 
> The respondent was free to respond in any way they chose and 
> the interviewers were trained to select from among 15 
> possible response codes.  Codes 01 through 10 were assigned 
> to particular forms of gambling.  Code 11 was used to 
> identify types of gambling that couldn't be coded according 
> to the 10 identified responses.  
> Codes 77, 88, and 99 are self-explanatory.  If the respondent 
> reported one or more types of gambling, the interviewer coded 
> as many forms as were relevant, then entered 98 to indicate 
> that no additional types of gambling were reported.  
> 
> Consequently, we have a variable with a wide variety of 
> responses (see frequency table, below, showing the first and 
> last few rows).
> 
> 	1 2 3 4 5 7 8 998     |          1        0.03        7.19
> 	1 2 3 4 5 898         |          1        0.03        7.22
> 	1 2 3 51098           |          1        0.03        7.25
> 	1 2 4 5 7 898         |          1        0.03        7.28
> 	1 2 498               |          1        0.03        7.31
> 	1 2 81098             |          1        0.03        7.34
> 	1 2 898               |          1        0.03        7.37
> 	1 298                 |          7        0.21        7.58
> 	1 3 898               |          1        0.03        7.61
> 	1 398                 |          3        0.09        7.70
> 	1 4 5 898             |          1        0.03        7.73
> 	1 4 598               |          2        0.06        7.79
> 	1 4 8 9 5 798         |          1        0.03        7.82
> 	1 4 898               |          1        0.03        7.85
> 	1 498                 |          3        0.09        7.94
> 	1 5 2 798             |          1        0.03        7.97
> 	50 85998              |          1        0.03       40.16
> 	5898                  |          1        0.03       40.19
> 	77                    |          1        0.03       40.22
> 	                   88 |          1        0.03       40.25
> 	88                    |      1,974       59.39       99.64
> 	89 898                |          1        0.03       99.67
> 	99                    |         11        0.33      100.00
> 
> 
> Ultimately, we would like to summarize the results in a few 
> simple ways:
> 1. Proportion of adults participating in gambling of any form
> 2. Proportion of adults participating in Internet gambling 
> (as a new form that should be monitored)
> 3. Most common form of gambling
> 4. 3 most common forms of gambling
> 
> Clearly, the structure of the variable does not lend itself 
> to efficient use.  Note that, in addition to the problem of 
> multiple responses stored in a single variable, spacing does 
> not appear to be consistent and some records even have a 
> right justification while most appear to be left justified 
> within the 22 columns.  I don't know if this justification is 
> real or only apparent.
> 
> Any advice on how to work with this variable using Stata 9.2 
> (generate other variables summarizing responses, etc.) would 
> be greatly appreciated.

