Statalist The Stata Listserver



st: RE: RE: FW: Cleanup of messy variable


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: RE: FW: Cleanup of messy variable
Date   Thu, 19 Oct 2006 18:41:10 +0100

replace myvar = subinstr(myvar, "998", "9 98", .)
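
A minimal sketch of what that line does, assuming a toy dataset whose
three values are copied from the frequency table quoted below (-myvar-
is just the placeholder name used above):

```stata
clear
input str22 myvar
"1 2 3 4 5 7 8 998"
"1 298"
"88"
end

* insert the missing space wherever "998" occurs
replace myvar = subinstr(myvar, "998", "9 98", .)

* the first value becomes "1 2 3 4 5 7 8 9 98";
* "1 298" and "88" contain no "998" and are left unchanged
list
```

Other run-together forms such as "298" would need their own rule.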

Nick 
n.j.cox@durham.ac.uk 

> -----Original Message-----
> From: owner-statalist@hsphsun2.harvard.edu
> [mailto:owner-statalist@hsphsun2.harvard.edu]On Behalf Of Nick Cox
> Sent: 19 October 2006 18:37
> To: statalist@hsphsun2.harvard.edu
> Subject: st: RE: FW: Cleanup of messy variable
> 
> 
> I think you need to clean up at source. 
> 
> Some of the problems look fairly clear 
> and can be fixed with a -subinstr()- 
> function in a -replace-. Some look more
> difficult to diagnose. 
> 
> For example, "998" as an element looks 
> like a miscoding for "9 98" and the action would 
> then be 
> 
> replace myvar = subinstr(myvar, "998", "998", .) 
> 
> Once you have cleaned up, some of your 
> questions can be answered using -tabsplit-
> from -tab_chi- on SSC. 
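> 
> A minimal sketch, assuming an Internet connection for the install and
> that -myvar- already holds space-separated codes:
> 
> ```stata
> * install the tab_chi package from SSC (includes -tabsplit-)
> ssc install tab_chi
> * tabulate the space-separated parts of the string variable
> tabsplit myvar
> ```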
> 
> Others will require a different data structure
> based on a -split- and then a -reshape-. 
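> 
> A rough sketch of that restructuring, assuming -myvar- has already been
> cleaned so that it holds only digits and spaces, with every code
> separated by a space. The names -id-, -code-, -seq-, -internet- and
> -first- are made up for illustration and must not already exist in
> the data.
> 
> ```stata
> * one identifier per respondent (hypothetical name)
> generate long id = _n
> * strip leading/trailing blanks and collapse runs of internal blanks
> replace myvar = itrim(trim(myvar))
> * one numeric variable per code: code1, code2, ...
> split myvar, generate(code) destring
> * one observation per respondent per code
> reshape long code, i(id) j(seq)
> * frequency of each form of gambling (codes 01 to 11)
> tabulate code if inrange(code, 1, 11), sort
> * flag respondents reporting Internet gambling (code 10)
> egen byte internet = max(code == 10), by(id)
> * pick one observation per respondent for the summary
> egen byte first = tag(id)
> summarize internet if first
> ```
> 
> The mean reported by the last command is the proportion of all
> respondents with code 10 among their responses.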
> 
> Nick 
> n.j.cox@durham.ac.uk 
> 
> Honey, Wayne, DOH
>  
> > We have a data set with a poorly designed string variable of 
> > the form str%22s.  This variable allowed for multiple 
> > responses to be coded in the following manner: 
> > 
> > 01.  Cards (21, Black Jack, Poker, etc.)
> > 02.  Animals (Roosters, dogs, horses, frogs, ducks)
> > 03.  Sports (football, baseball, pool, golf)(incl. pools, 
> > w/friends or bookie)
> > 04.  Dice games of any type (Craps, etc.)
> > 05.  Lottery or numbers (Quick Pick, Road Runner, scratch cards, etc.)
> > 06.  Bingo
> > 07.  Raffles or sweepstakes
> > 08.  Slot machines, video machines or other gambling machines
> > 09.  Pull Tabs, punch cards 
> > 10.  Internet Gambling
> > 11.  Other, please specify: ______________________________  
> > SAM (575-594)
> > 
> > 88.  Never Gamble  GO TO NEXT MODULE
> > 98.  No other
> > 77.  Don't Know/Not Sure
> > 99.  Refused  GO TO NEXT MODULE
> > 
> > The respondent was free to respond in any way they chose and 
> > the interviewers were trained to select from among 15 
> > possible response codes.  Codes 01 through 10 were assigned 
> > to particular forms of gambling.  Code 11 was used to 
> > identify types of gambling that couldn't be coded according 
> > to the 10 identified responses.  
> > Codes 77, 88, and 99 are self-explanatory.  If the respondent 
> > reported one or more types of gambling, the interviewer coded 
> > as many forms as were relevant, then entered 98 to indicate 
> > that no additional types of gambling were reported.  
> > 
> > Consequently, we have a variable with a wide variety of 
> > responses (see frequency table, below, showing the first and 
> > last few rows).
> > 
> > 	1 2 3 4 5 7 8 998     |          1        0.03        7.19
> > 	1 2 3 4 5 898         |          1        0.03        7.22
> > 	1 2 3 51098           |          1        0.03        7.25
> > 	1 2 4 5 7 898         |          1        0.03        7.28
> > 	1 2 498               |          1        0.03        7.31
> > 	1 2 81098             |          1        0.03        7.34
> > 	1 2 898               |          1        0.03        7.37
> > 	1 298                 |          7        0.21        7.58
> > 	1 3 898               |          1        0.03        7.61
> > 	1 398                 |          3        0.09        7.70
> > 	1 4 5 898             |          1        0.03        7.73
> > 	1 4 598               |          2        0.06        7.79
> > 	1 4 8 9 5 798         |          1        0.03        7.82
> > 	1 4 898               |          1        0.03        7.85
> > 	1 498                 |          3        0.09        7.94
> > 	1 5 2 798             |          1        0.03        7.97
> > 	50 85998              |          1        0.03       40.16
> > 	5898                  |          1        0.03       40.19
> > 	77                    |          1        0.03       40.22
> > 	                   88 |          1        0.03       40.25
> > 	88                    |      1,974       59.39       99.64
> > 	89 898                |          1        0.03       99.67
> > 	99                    |         11        0.33      100.00
> > 
> > 
> > Ultimately, we would like to summarize the results in a few 
> > simple ways:
> > 1. Proportion of adults participating in gambling of any form
> > 2. Proportion of adults participating in Internet gambling 
> > (as a new form that should be monitored)
> > 3. Most common form of gambling
> > 4. 3 most common forms of gambling
> > 
> > Clearly, the structure of the variable does not lend itself 
> > to efficient use.  Note that, in addition to the problem of 
> > multiple responses stored in a single variable, spacing does 
> > not appear to be consistent and some records even have a 
> > right justification while most appear to be left justified 
> > within the 22 columns.  I don't know if this justification is 
> > real or only apparent.
> > 
> > Any advice on how to work with this variable using Stata 9.2 
> > (generate other variables summarizing responses, etc.) would 
> > be greatly appreciated.
> 
> *
> *   For searches and help try:
> *   http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
> 



