st: RE: levelsof for many categories without sorting


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: levelsof for many categories without sorting
Date   Wed, 7 Sep 2005 15:12:16 +0100

Note for anyone interested: 

-levelsof- as implemented in Stata 9 differs 
subtly from -levels- as added to Stata 8 
during its lifetime. 

That aside, I am very surprised at Iwan's
report that -levelsof- reports categories
according to their order of occurrence in the data. 
That contradicts not just the help file, but 
also the code as I read it (and for that matter
as I wrote it, originally). StataCorp would like
to see evidence, I am sure. I suspect Iwan's 
impression is mistaken, but I am not sure why 
it arises. 
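
For anyone who wants to check the ordering directly, here is a 
minimal sketch on made-up data (the values are invented purely 
for illustration). The help file says that -levelsof- returns 
the levels sorted, so the list shown should be in sorted order 
and not in order of occurrence. 

```stata
* Toy data: ids deliberately NOT in sorted order of occurrence
clear
input id
3
1
2
3
1
end

* Put the distinct values of id into a local macro
levelsof id, local(levels)

* Show what came back
display "`levels'"
```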

The general problem to which -levelsof- is 
one solution is discussed in 

http://www.stata.com/support/faqs/data/foreach.html

A fairly general strategy for going through all
possible levels according to their order of first 
occurrence in the data is as follows. 
(This circumvents problems that arise when -levelsof- 
cannot cope, notably when the list of levels would 
exceed the limit on macro length.) 

Suppose we have an identifier, say -id-. 

First generate an observation number: 

gen long obs = _n 

Now we sort by -id-, breaking ties by 
-obs-. The first observation in each block 
then carries information on first occurrence. 
We copy the observation number of first 
occurrence to each other occurrence of the same id. 

bysort id (obs) : replace obs = obs[1] 

Now we tag ids from 1 to whatever, according 
to first occurrence: 

bysort obs : gen group = _n == 1
replace group = sum(group) 

Those familiar with -egen, group()- may
recognise the basic idea here. 
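
To make that connection concrete, here is a sketch on made-up 
data (values and the variable name -group2- are invented for 
illustration). Once -obs- holds the observation number of first 
occurrence, -egen, group()- applied to -obs- should number the 
groups in the same way, because it numbers groups by the sorted 
values of its argument. 

```stata
* Toy data: id 30 occurs first, then 10, then 20
clear
input id
30
10
20
30
10
end

* Steps as described above
gen long obs = _n
bysort id (obs) : replace obs = obs[1]
bysort obs : gen group = _n == 1
replace group = sum(group)

* Same numbering via -egen-: groups follow sorted values of obs
egen long group2 = group(obs)

* The two versions should agree in every observation
assert group2 == group
```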

Now the number of groups is identifiable from 

su group, meanonly 
local max = r(max) 

Typically then you loop over groups: 

forval i = 1/`max' { 
	...
} 

Within that loop, a look-up technique to 
get the identifier concerned is, for 
a numeric identifier: 

su id if group == `i', meanonly 

All identifiers in each group are the same, 
so it matters little whether we pick up 
the minimum, the mean or the maximum: 

local which = r(min) 

will do, for example. 
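
Putting the pieces together for a numeric identifier, here is a 
self-contained sketch on made-up data. The local -order- is not 
part of the method; it is added only to collect the identifiers 
in the order in which the loop visits them. 

```stata
* Toy data: id 30 occurs first, then 10, then 20
clear
input id
30
10
20
30
10
end

* Observation number of first occurrence, copied within each id
gen long obs = _n
bysort id (obs) : replace obs = obs[1]

* Number the ids 1, 2, ... by order of first occurrence
bysort obs : gen group = _n == 1
replace group = sum(group)

* How many groups are there?
su group, meanonly
local max = r(max)

* Loop over groups, looking up the identifier each time
local order
forval i = 1/`max' {
	su id if group == `i', meanonly
	local which = r(min)
	display "group `i': id `which'"
	* (illustration only) remember the order of visiting
	local order `order' `which'
}

display "`order'"
```

Here the loop should visit 30, then 10, then 20: the order of 
first occurrence, not sorted order. 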

If the identifier -id- is a string variable, a little 
more work is needed. Outside the loop, 

replace obs = _n 

Inside the loop, 

su obs if group == `i', meanonly 
local which = id[`r(min)'] 
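
The string case can be sketched in the same way on made-up data. 
As before, the local -order- is added only to record the order 
in which identifiers are visited, and the values are invented. 

```stata
* Toy data: "c" occurs first, then "a", then "b"
clear
input str5 id
"c"
"a"
"b"
"c"
"a"
end

* Same steps as for a numeric identifier
gen long obs = _n
bysort id (obs) : replace obs = obs[1]
bysort obs : gen group = _n == 1
replace group = sum(group)
su group, meanonly
local max = r(max)

* Outside the loop: obs now holds the CURRENT observation number
replace obs = _n

local order
forval i = 1/`max' {
	* Find one observation in this group ...
	su obs if group == `i', meanonly
	* ... and read the string identifier from it
	local which = id[`r(min)']
	display "group `i': id `which'"
	* (illustration only) remember the order of visiting
	local order `order' `which'
}

display "`order'"
```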

Nick 
n.j.cox@durham.ac.uk 

Barankay, Iwan
> 
> I find the command "levelsof" very useful to cut down the 
> time on loops when I run through the categories of a variable 
> (e.g. the location_ids of a large survey).
> 
> What I also like is that the local macro generated by 
> levelsof is, so it seems to me, still in the order in which 
> the values appear in the data and is not sorted, which is needed 
> at times (even though the help file of levelsof says 
> otherwise). Usually when a list is entered into a local it is 
> then sorted.
> 
> The problem of course is that there are constraints on 
> levelsof when it hits the character limit.
> 
> My question is:
> 
> What can I use instead of levelsof that (i) handles a large number of 
> categories without hitting the character constraint and (ii) 
> also keeps the categories in the order in which they appear in the data 
> and does not sort them?
> 
> (i) is much more important than (ii) but if someone has an 
> elegant solution for (ii) I would love to hear of it.
> 
