The Stata listserver

st: RE: cleaning a specific data structure


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   st: RE: cleaning a specific data structure
Date   Fri, 21 Nov 2003 13:20:45 -0000

(I'm reposting the original mailing and my reply.
The original mailing was HTML, which I spotted, and corrected
for, and it carried an accompanying winmail.dat, which
I didn't spot; that stuck to my reply mail like dirt
on a shoe. The original posting will appear as
complete gibberish to recipients of the digest
version of the list. As often mentioned, please do
_not_ send mailjunk to the list.)

================================

Radu Ban

> The data is organized like this, numbers are made-up for this
> description:
>
> id dummy descriptor
> 13 1 <blank>
> 13 0 abc
> 13 1 <blank>
> 14 0 <blank>
> 14 0 def
> 14 0 def
>
> The idea is that the id variable should be unique, but for some
> reason it is not.  This means that both the dummy and descriptor
> should have the same values across the id groups. A complication
> is that for the dummy, if there's a "1" in a group all the group
> should be "1".
>
> I want to reduce this to a clean version which looks like this:
>
> id dummy descriptor
> 13 1 abc
> 14 0 def
>
> For the dummy part I dealt with it like this (probably a convoluted
> method):
> bysort id: egen maxdummy = max(dummy)
> replace dummy = maxdummy
> bysort id: keep if _n == 1
>
> But I am a bit stuck on how to deal with the string descriptor. I
> mean I know one way of doing by splitting the data and then
> merging it back but there has to be a more efficient way.

I think you are right: you can do all you want in one place.

The dummy can be sorted out your way, or this way:

bysort id (dummy) : replace dummy = dummy[_N]

as 1s will get sorted to the end.
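
A minimal sketch of this on the made-up data from the original
question (the -input- block simply retypes those six observations,
with "" standing in for <blank>):

```stata
clear
input id dummy str3 descriptor
13 1 ""
13 0 "abc"
13 1 ""
14 0 ""
14 0 "def"
14 0 "def"
end

* Within each id, sort on dummy so that any 1 ends up last,
* then copy that last value to every observation of the group.
bysort id (dummy) : replace dummy = dummy[_N]

list, sepby(id)
```

One caveat, which is an assumption about data not shown in the
question: numeric missing values sort after every number, so if
dummy could be missing within an id, dummy[_N] would pick up the
missing value rather than a 1.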

If I understand correctly, the descriptor can be
sorted out similarly:

bysort id (descriptor) : replace descriptor = descriptor[_N]

as the empty strings will get sorted to the beginning.
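
As a sketch on the same made-up data (again retyped with -input-,
"" standing in for <blank>), the string case works the same way
as the numeric one:

```stata
clear
input id dummy str3 descriptor
13 1 ""
13 0 "abc"
13 1 ""
14 0 ""
14 0 "def"
14 0 "def"
end

* Within each id, sort on descriptor so that empty strings come first
* and a non-empty value (if there is one) comes last, then copy
* that last value to every observation of the group.
bysort id (descriptor) : replace descriptor = descriptor[_N]

list, sepby(id)
```

This only gives a sensible answer if there is at most one distinct
non-empty descriptor per id; otherwise the alphabetically last one
is copied without any warning.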

However, before you do that you should test the
assumption that all (non-empty) descriptors are
identical within -id-:

gen empty = mi(descriptor)
bysort id empty (descriptor) : assert descriptor[1] == descriptor[_N]

On the last, see also
http://www.stata.com/support/faqs/data/diff.html
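
Putting the pieces together, here is a sketch of the whole clean-up
on the made-up data from the question. The final -keep- is taken
from the original posting; the -isid- check at the end is an
addition of mine, not something mentioned in the thread.

```stata
clear
input id dummy str3 descriptor
13 1 ""
13 0 "abc"
13 1 ""
14 0 ""
14 0 "def"
14 0 "def"
end

* 1. Check that non-empty descriptors agree within each id.
*    Empty and non-empty values are put in separate by-groups,
*    so only like is compared with like.
gen byte empty = mi(descriptor)
bysort id empty (descriptor) : assert descriptor[1] == descriptor[_N]
drop empty

* 2. Spread the "best" value of each variable across its id group.
bysort id (dummy) : replace dummy = dummy[_N]
bysort id (descriptor) : replace descriptor = descriptor[_N]

* 3. Keep one observation per id and confirm id is now unique.
bysort id : keep if _n == 1
isid id

list
```

On these six observations this leaves two, 13 1 abc and 14 0 def,
which is the clean version asked for. If the -assert- fails on real
data, the conflicting ids need to be looked at before step 2 is run.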

Nick
[email protected]


