st: RE: RE: levelsof for many categories without sorting


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   st: RE: RE: levelsof for many categories without sorting
Date   Wed, 7 Sep 2005 15:44:02 +0100

See also Roger Newson's -sencode- on SSC, which 
is designed for an overlapping problem. 

Nick 
[email protected] 

Nick Cox
 
> Note for anyone interested: 
> 
> -levelsof- as implemented in Stata 9 differs 
> subtly from -levels- as added to Stata 8 
> during its lifetime. 
> 
> That aside, I am very surprised at Iwan's
> report that -levelsof- reports categories
> according to their order of occurrence in the data. 
> That contradicts not just the help file, but 
> also the code as I read it (and for that matter
> as I wrote it, originally). StataCorp would like
> to see evidence, I am sure. I suspect Iwan's 
> impression is mistaken, but I am not sure why 
> it arises. 
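A quick way to check the sorting behaviour is to run -levelsof- on toy data. This is a minimal sketch, assuming a numeric variable -id- with made-up values that first occur in the order 3, 1, 2; the macro name -levels- is arbitrary.

```stata
* Toy data (hypothetical values): first occurrences are 3, then 1, then 2
clear
input id
3
1
2
3
1
end

* -levelsof- collects the distinct values of id into the local macro levels
levelsof id, local(levels)

* Per the help file the list is sorted, so this should show 1 2 3, not 3 1 2
display "`levels'"
```

If the list ever came back as 3 1 2, that would be the evidence StataCorp would want to see.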
> 
> The general problem to which -levelsof- is 
> one solution is discussed in 
> 
> http://www.stata.com/support/faqs/data/foreach.html
> 
> A fairly general strategy for going through all
> possible levels 
> 
> *	according to their order of first occurrence in the data 
> 
> is as follows. 
> (This circumvents problems arising when -levelsof- 
> cannot cope.) 
> 
> Suppose we have an identifier, say -id-. 
> 
> First generate an observation number: 
> 
> gen long obs = _n 
> 
> Now we sort by -id-, breaking ties by 
> -obs-. The first observation in each block 
> then carries information on first occurrence. 
> We copy the observation number of first 
> occurrence to each other occurrence of the same id. 
> 
> bysort id (obs) : replace obs = obs[1] 
> 
> Now we tag ids from 1 to whatever, according 
> to first occurrence: 
> 
> bysort obs : gen group = _n == 1
> replace group = sum(group) 
> 
> Those familiar with -egen, group()- may
> recognise the basic idea here. 
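The steps so far can be tried on toy data. This is a minimal sketch, assuming a numeric -id- with made-up values that first occur in the order 30, 10, 20; it uses -long- for -group-, a small departure from the code above that should only matter with a very large number of groups.

```stata
* Toy data (hypothetical values): first occurrences are 30, then 10, then 20
clear
input id
30
10
30
20
10
end

* Record the original observation number
gen long obs = _n

* Within each id, copy the observation number of first occurrence
bysort id (obs) : replace obs = obs[1]

* Flag the first observation of each block, then cumulate the flags
bysort obs : gen long group = _n == 1
replace group = sum(group)

* id 30 should now be group 1, id 10 group 2, id 20 group 3
list id obs group, sepby(group)
```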
> 
> Now the number of groups is identifiable from 
> 
> su group, meanonly 
> local max = r(max) 
> 
> Typically then you loop over groups: 
> 
> forval i = 1/`max' { 
> 	...
> } 
> 
> Within that loop, a look-up technique to 
> get the identifier concerned is, for 
> a numeric identifier: 
> 
> su id if group == `i', meanonly 
> 
> All identifiers in each group are the same, 
> so it matters little whether we pick up 
> the minimum, the mean or the maximum: 
> 
> local which = r(min) 
> 
> will do, for example. 
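Put together for a numeric identifier, the whole procedure might look like the following. It is a sketch on made-up data, with -display- standing in for whatever the loop is really meant to do.

```stata
* Toy data (hypothetical values): first occurrences are 30, then 10, then 20
clear
input id
30
10
30
20
10
end

* Number the ids 1, 2, ... by order of first occurrence
gen long obs = _n
bysort id (obs) : replace obs = obs[1]
bysort obs : gen long group = _n == 1
replace group = sum(group)

* Count the groups
su group, meanonly
local max = r(max)

* Loop over groups, looking up the id belonging to each
forval i = 1/`max' {
	su id if group == `i', meanonly
	local which = r(min)
	display "group `i' is id `which'"
}
```

With these data the loop visits the ids in the order 30, 10, 20.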
> 
> If the identifier -id- is a string variable, a little 
> more work is needed. Outside the loop, 
> 
> replace obs = _n 
> 
> Inside the loop, 
> 
> su obs if group == `i', meanonly 
> local which = id[`r(min)'] 
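The string case can be sketched in the same way. The data are again made up, with -id- holding "c", "a", "b" in that order of first occurrence. The look-up relies on the data remaining in the sort order left by the grouping step, so that -obs- matches the current observation numbers.

```stata
* Toy data (hypothetical values): first occurrences are "c", then "a", then "b"
clear
input str3 id
"c"
"a"
"c"
"b"
end

* Number the ids 1, 2, ... by order of first occurrence
gen long obs = _n
bysort id (obs) : replace obs = obs[1]
bysort obs : gen long group = _n == 1
replace group = sum(group)

* Outside the loop: reset obs to the current observation number
replace obs = _n

su group, meanonly
local max = r(max)

* Inside the loop: find an observation in the group, then read id from it
forval i = 1/`max' {
	su obs if group == `i', meanonly
	local which = id[`r(min)']
	display "group `i' is id `which'"
}
```

Do not re-sort the data between the -replace obs = _n- step and the loop, or the subscript will point at the wrong observation.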
> 
> Nick 
> [email protected] 
> 
> Barankay, Iwan
> > 
> > I find the command "levelsof" very useful for cutting down the 
> > time spent on loops when I run through the categories of a variable 
> > (e.g. the location_ids of a large survey).
> > 
> > What I also like is that the local macro generated by 
> > levelsof is, so it seems to me, still in the order in which 
> > the categories appear in the data, rather than sorted, which is 
> > needed at times (even though the help file of levelsof says 
> > otherwise). Usually, when a list is entered into a local, it is 
> > then sorted.
> > 
> > The problem of course is that there are constraints on 
> > levelsof when it hits the character limit.
> > 
> > My question is:
> > 
> > What can I use instead of levelsof that (i) handles a large number of 
> > categories without hitting the character constraint and (ii) 
> > keeps the categories in the order in which they appear in the data 
> > rather than sorting them?
> > 
> > (i) is much more important than (ii), but if someone has an 
> > elegant solution for (ii) I would love to hear of it.



