Statalist The Stata Listserver



Re: st: RE: crosstab with a large dataset


From   Austin Nichols <[email protected]>
To   [email protected]
Subject   Re: st: RE: crosstab with a large dataset
Date   Thu, 2 Feb 2006 20:55:29 -0500

My guess is that you will want to condition not only on similarity of
names but on the number of students per name--the reason you are
assuming "Grant" and "Sonya Grant" are the same person is not only the
name in common but the fact that "Grant" has got one kid, right?  I
would (as a first step, anyway) calculate the number of kids per
teacher/school combo, and then look at the cases that seem to have low
numbers relative to their school (some of these might be special ed
classes or the like, but the majority are likely miscoded):
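. * students per teacher within each school
. egen nkids=count(schim), by(schim teacher)
. * average class size in each school
. egen avgnkids=mean(nkids), by(schim)
. tab nkids avgnkids
. * pick the lowest-numbered school with a one-student "class" and list its teachers
. su schim if nkids==1
. tab teacher if schim==r(min)
etc.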
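
To get a single compact listing instead of one table per school, something
like the following might work. It is only a sketch: the toy data reuse four
rows from the quoted table, schim is assumed to be a numeric school ID, the
names combo and n are made up for illustration, and the 0.5 cutoff is
arbitrary. Run it as a do-file (the // comments need that).

```stata
* Toy data (an assumption): four teachers from school 2766, with the class
* sizes shown in the quoted table; schim is taken to be a numeric school ID.
clear
input schim str20 teacher n
2766 "CAMPBELL"    24
2766 "GRANT"        1
2766 "SONYA GRANT" 25
2766 "WELLING"     25
end
expand n          // one observation per student
drop n

egen nkids = count(schim), by(schim teacher)   // students per teacher within school
egen avgnkids = mean(nkids), by(schim)         // student-weighted mean class size per school
egen combo = tag(schim teacher)                // marks one observation per school/teacher pair

* 0.5 is an arbitrary cutoff, not something from the data
sort schim teacher
list schim teacher nkids avgnkids if combo & nkids < 0.5*avgnkids, sepby(schim) noobs
```

With the toy data only GRANT should be listed. On the full data the list
would still need to be checked by eye, since small classes can be genuine.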

But the process cannot really be automated--you want to combine
"Grany" and "Grant" if their class sizes add up to the right number,
too, not just names that have a word in common, right?
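
As a rough aid for that manual check, one could group names by their last
word (the word(teacher, -1) idea in Nick's message quoted below) and show the
combined class size next to each variant. This is a sketch under the same
assumptions as before: toy data taken from the quoted table, a numeric schim,
and made-up names lastname, pairtag, nvariants, and nlast. It will not catch
typos such as "Grany", so it narrows the list rather than deciding anything.

```stata
* Toy data (an assumption), built the same way as a student-level file
clear
input schim str20 teacher n
2766 "CAMPBELL"    24
2766 "GRANT"        1
2766 "SONYA GRANT" 25
2766 "WELLING"     25
end
expand n          // one observation per student
drop n

egen nkids = count(schim), by(schim teacher)        // students per name variant
gen lastname = word(teacher, -1)                    // last word of the name
egen pairtag = tag(schim teacher)                   // one observation per school/teacher pair
egen nvariants = total(pairtag), by(schim lastname) // distinct spellings sharing a last word
egen nlast = count(schim), by(schim lastname)       // combined students across those spellings

* show only last names that appear under more than one spelling in a school
sort schim lastname teacher
list schim teacher nkids nlast if pairtag & nvariants > 1, sepby(schim lastname) noobs
```

Comparing nlast with the usual class size in the school is then the judgment
call: a combined total that looks like one ordinary class suggests a miscoded
name, while a total that looks like two classes suggests two teachers.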

On 2/2/06, Nick Cox <[email protected]> wrote:
> The main issue here seems to be getting Stata to
> be smart enough to recognise (for example)
> that "GRANT" and "SONYA GRANT" are the same
> person. You could try working in terms of
> last name only, which would be
>
> word(teacher, -1)
>
> -- but this might create the opposite problem
> of conflating different teachers.
>
> Alternatively there are various handles
> in -groups- on SSC that might be useful.
>
> Nick
> [email protected]

> Gushta, Matthew would
> > like to basically crosstab school and teacher variables, so that only
> > unique teacher values appear within each school. you can see that each
> > school is presented in a separate table and teacher "grant" appears
> > twice in school 2766 (see the syntax and sample output below).
> >
> > ...given 2105 districts and 5262 teachers, this output is quite
> > cumbersome.
> >
> > is there a simpler, more compressed format for such output? i.e., a
> > single table?
> >       TEACHER |      Freq.     Percent        Cum.
> > --------------+-----------------------------------
> >      CAMPBELL |         24        7.50        7.50
> >     DOLORESCO |         23        7.19       14.69
> > FLEMING RACHE |         25        7.81       22.50
> >         GRANT |          1        0.31       22.81
> >   SONYA GRANT |         25        7.81       84.38
> >      STAUFFER |         25        7.81       92.19
> >       WELLING |         25        7.81      100.00
> > --------------+-----------------------------------
> >         Total |        320      100.00



