
RE: st: -word()- with non space separator

From: Nick Cox
Subject: RE: st: -word()- with non space separator
Date: Wed, 23 Sep 2009 19:27:24 +0100

Because there are often occasions when you know that all variables in a
dataset should be fed to -destring-. 

Examples are when people copy and paste from a spreadsheet but include
too many header lines, so that even after -drop-ping the first few
observations every variable is then string. -destring- is then the tool
to clear up the mess. 

The problem in this thread lies at the opposite end of the spectrum from
that one as _only_ the variables just produced are of concern. 

In fact looking at your code again, I see that you could have used the
-destring- option of -split-, so that 

split stringanswer, generate(comp) parse(:)
destring, replace

would have been telescoped to 

split stringanswer, generate(comp) parse(:) destring

Nick 

Martin Weiss

If these were the thoughts of the developers who adapted your code for
official Stata deployment, why would they make the -varlist- optional? 

Nick Cox

OK, so you intended to write that. 

I know how -destring- works; I did write it originally.... 

It is still inefficient, and poor style, to feed to -destring- anything
you know does not need its attention. Even worse, in doing this for
real, you might make unintentional changes to other variables lying
around in your dataset. So, I still think -destring, replace- an
ill-advised example. 
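
To make the point concrete, here is a minimal sketch on made-up data. The variable id and its values are assumptions for illustration only; stringanswer and the comp stub are the names used earlier in the thread.

```stata
* Toy data: id is a string identifier whose leading zeros matter
clear
input str3 id str9 stringanswer
"007" "1:2:3"
"012" "7:9"
end

split stringanswer, generate(comp) parse(:)

* Restrict -destring- to the variables -split- just created.
* A bare -destring, replace- would also convert id to numeric,
* so that "007" would become 7.
destring comp*, replace

describe
```

With the varlist given, id stays a string variable and only comp1, comp2, ... are converted.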

Nick 

P.S. on-lookers should not fear that this is getting personal. Martin
and I bump into each other quite often, and very amiably. 

Martin Weiss

" P.S. -destring, replace- is a typo for -destring comp*, replace-."

Nope, fully intentional. -destring- does not require an argument, but
picks out the ones that are suitable for it automatically.

Nick Cox

Yes indeed. I'm focusing entirely on Jeph's objection to my solution. 

Your solution works, but the merits of other solutions, especially if
they are more direct, remain of interest. 

Nick 

P.S. -destring, replace- is a typo for -destring comp*, replace-. 

Martin Weiss

I posted code that knows the maximum two hours ago...

Nick Cox

Not knowing the highest value in advance would bite equally hard with
the method in your previous post, which works from 1 upwards to a
specified maximum, so that objection seems unconvincing to me. 
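
For what it is worth, the maximum can be read off the data before either loop is run. What follows is only a sketch under assumptions, not the code Martin posted: it assumes every piece of myvar is a positive integer, and the names piece, maxval and max are made up for illustration.

```stata
* Toy data in the thread's colon-separated format; values are made up
clear
input str12 myvar
"1:2:3"
"11:12"
"7"
"2:11:21"
end

* Find the largest value present, working on a temporary copy of the data
preserve
keep myvar
split myvar, generate(piece) parse(:) destring
egen maxval = rowmax(piece*)
summarize maxval, meanonly
local max = r(max)
restore

* Feed that maximum to the downward loop quoted later in this thread
clonevar work = myvar
quietly forvalues i = `max'(-1)1 {
    generate byte myvar_`i' = strpos(work, "`i'") > 0
    replace work = subinstr(work, "`i'", "", .)
}
drop work
```

With thousands of variables the same steps could be wrapped in a -foreach- over the variable names, at the cost of one -split- per variable.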

Jeph Herrin

Thanks. I also thought of something like this, but
didn't want to pursue it, if that makes sense. For
one thing, I have literally thousands of variables and
don't know ahead of time what the highest number I
need is.

As for the structure, it may not be the worst, but it
is surely not the best.

Nick Cox wrote:

> Another way to do it: 
> 
> clonevar work = myvar 
> 
> qui forval i = 29(-1)1 { 
> 	gen myvar_`i' = strpos(work, "`i'") > 0 
> 	replace work = subinstr(work, "`i'", "", .) 
> } 
> 
> Here 29 is in general whatever highest number you need. 
> 
> In words, in addition to the -strpos()- logic, 
> 
> 1. Work on a copy, because we're going to change it. 
> 
> 2. Work downwards, from high values down to 1. 
> 
> 3. Once you've checked for a longer string, zap it so that it doesn't
> later confuse the search for shorter strings. 
> 
> Incidentally, don't knock the format (or structure). When Uli Kohler and
> I wrote up the tricks we knew for multiple responses (in this sense), it
> was pretty clear to us that all such formats or structures have some big
> advantages and disadvantages. Our efforts are accessible at 
> 
> FAQ     . . . . . . . . . . . . . .  Dealing with multiple responses
>         . . . . . . . . . . . . . . . . . . N. J. Cox and U. Kohler
>         4/05    How do I deal with multiple responses?
>                 http://www.stata.com/support/faqs/data/multresp.html
> 
> SJ-3-1  pr0008   Speaking Stata: On structure & shape: the case of mult. resp.
>         . . . . . . . . . . . . . . . . . . . N. J. Cox & U. Kohler
>         Q1/03   SJ 3(1):81--99                          (no commands)
>         discussion of data manipulations for multiple response data
> 
> Nick 
> 
> Jeph Herrin
> 
> Solved - this does it:
> 
>      forv i=1/9 {
>           * the first alternative covers a string holding only that one value
>           gen byte myvar_`i'= regexm(myvar,"^`i'$|^`i':|:`i':|:`i'$")
>      }
> 
> 
> Jeph Herrin wrote:
> 
>> I have a dataset in which many variables are in
>> the most useless format imaginable. If a question
>> has multiple checkboxes as possible answers, the
>> response is stored as a string, with a number indicating
>> each box checked and these numbers separated by colons.
>> Thus:
>>
>>                 myvar
>>       1:2:3:5:6:7:8:9
>>               1:2:3:6
>>       1:2:3:4:5:7:8:9
>>           1:2:3:5:7:9
>>         1:2:3:5:7:8:9
>>             2:3:4:6:9
>>       1:2:3:5:6:7:8:9
>>             1:2:7:8:9
>>                   7:9
>>
>> This variable takes 9 values, so I want to split into 9
>> different indicator variables, myvar_1-myvar_9, each
>> indicating whether that number was selected. -split-
>> does not work, because of the differing number of values
>> per string. That is, it produces myvar_1 which equals "7"
>> for the last obs.
>>
>> So I am looking for a way to check whether a given string
>> contains a given integer, which would allow me to
>>
>>    forv i=1/9 {
>>     gen byte myvar_`i'= [`i' is in myvar list]
>>    }
>>
>> As long as there are just 9 values, I can use -strpos()-
>> to check for the presence of the digit, but some of my variables
>> run into tens and twenties, in which case, e.g., searching for "1"
>> returns true even if there is only "11".
>>
>> The only solutions I see are to first -split- and
>> then check all the new indicators, or run through a series of
>> checks such as (matches "1:" but not ":1").  I don't like
>> either: Is there a direct way to check to see if a given integer
>> is in the list?
>>
>> I think there may be a regex solution, but my Perl programming
>> days are so far behind me that I've not been able to come up
>> with one.
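
One direct check that needs no regular expression is to wrap both the list and the integer in the separator, so that every value, including the first and the last, sits between two colons. This is a minimal sketch on made-up data, not a solution taken from the thread; the upper limit 21 is an assumption chosen to suit the toy values.

```stata
* Toy data in the format described above; values are made up
clear
input str12 myvar
"1:2:3"
"11:12"
"7"
"2:11:21"
end

* ":" + myvar + ":" turns "11:12" into ":11:12:", so searching for ":1:"
* cannot be fooled by the "1" inside "11" or "12"
forvalues i = 1/21 {
    generate byte myvar_`i' = strpos(":" + myvar + ":", ":`i':") > 0
}

list myvar myvar_1 myvar_2 myvar_11, noobs
```

Because whole delimited tokens are compared, the loop can run in either direction and leaves myvar unchanged.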
