Statalist



RE: RE: st: AW: Create a flag variable for 10 most frequent values


From   "Cohen, Elan" <cohened@upmc.edu>
To   "'statalist@hsphsun2.harvard.edu'" <statalist@hsphsun2.harvard.edu>
Subject   RE: RE: st: AW: Create a flag variable for 10 most frequent values
Date   Tue, 17 Nov 2009 10:16:58 -0500

Thank you everyone.  I had just finished writing a solution similar to Jeph's but without the generalizations Nick's solution offers.  -nmodes- will definitely do the trick.

Thanks again,

- Elan
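
For readers following the thread, here is a minimal sketch of how the updated command might be applied to the original question. It assumes the -modes- 1.4.0 code quoted below has been saved as modes.ado somewhere on the adopath. The variable name flag is only illustrative, and the call is the same as the last example in the quoted help file.

```stata
* Sketch only: assumes modes.ado (version 1.4.0, quoted below) is installed
sysuse auto, clear

* show the 10 most frequent values of turn (more if frequencies tie at the
* cutoff) and create an indicator for observations taking those values
modes turn, nmodes(10) generate(flag)

* check the result: 1 = value is among the displayed modes, 0 = otherwise
tabulate turn flag, missing
```

Because the full dataset is kept, flag can then be used in any later if qualifier.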


> -----Original Message-----
> From: owner-statalist@hsphsun2.harvard.edu 
> [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Nick Cox
> Sent: Tuesday, November 17, 2009 10:09 AM
> To: statalist@hsphsun2.harvard.edu
> Subject: RE: RE: st: AW: Create a flag variable for 10 most frequent values
> 
> I agree with these criteria. In addition, a general solution to this
> should be able to tackle
> 
> Missing values
> Weights 
> Ties in frequency (e.g. there may not be exactly 10 modes) 
> 
> As promised earlier, here is an update of -modes- earlier published in
> the STB and the SJ. An update follows in the Stata Journal. 
> 
> *! NJC 1.4.0 17 November 2009 
> * NJC 1.3.0 13 May 2003            (SJ3-2: sg113_1)
> * NJC 1.2.0 15 June 1999 
> * NJC 1.1.2 23 December 1998
> * NJC 1.1.1 29 October 1998
> program modes, sort 
>         version 8.0
>         syntax varname [if] [in] [fweight aweight/] ///
> 	[ , Min(int 0) Nmodes(int 0) GENerate(str) ]
> 
> 	if "`generate'" != "" { 
> 		capture confirm new variable `generate' 
> 		if _rc { 
> 			di as err "generate() requires new variable name"
> 			exit _rc 
> 		}
> 	} 
> 
> 	if `min' & `nmodes' { 
> 		di as err "may not specify both min() and nmodes()"
> 		exit 198
> 	}
> 	
> 	quietly { 
> 		marksample touse, strok
> 		count if `touse' 
> 		if r(N) == 0 error 2000 
> 		
> 		tempvar freq 
> 		if "`exp'" == "" local exp = 1 
> 		bysort `touse' `varlist' : ///
> 			gen double `freq' = sum(`exp') * `touse'
> 		by `touse' `varlist' : ///
> 			replace `freq' = (_n == _N) * `freq'[_N] 
> 		label var `freq' "Freq."
> 
> 		if `min' > 0 { 
> 			local which "`freq' >= `min'" 
> 		}	
> 		else if `nmodes' > 0 { 
> 			sort `touse' `freq' `varlist' 
> 			count if `freq' 
> 			local nmodes = min(`nmodes', r(N)) 
> 			local which "`freq' >= `freq'[_N - `nmodes' + 1]"
> 		} 	
> 		else {
> 			su `freq', meanonly
> 			local max = r(max)
> 			local which "`freq' == `max'" 
> 		}	
> 		
> 		count if `which'
> 		if r(N) == 0 {
> 			di as err "no such modes in data"
> 			exit 498
> 		}
> 	}
> 
> 	tabdisp `varlist' if `which', c(`freq')
> 
> 	quietly	if "`generate'" != "" { 
> 		gen byte `generate' = `which' if `touse' 
> 		bysort `touse' `varlist' (`generate') : ///
> 		replace `generate' = `generate'[_N] if `touse' 
> 	} 		
> 	
> end
> 
> 
> ------------------------------------------------------------------------
> help for modes                          (SJ9-4: sg113_2; SJ3-2: sg113_1)
> ------------------------------------------------------------------------
> 
> Tabulation of mode(s)
> 
>         modes varname [weight] [if exp] [in range] [ , { min(#) |
>                  nmodes(#) } generate(newvar) ]
> 
> 
> Description
> 
>     modes tabulates the mode(s) of varname, that is, the value(s) of
>     varname that occur most frequently. varname may be numeric or
>     string.  fweights and aweights are allowed. Missing values are
>     ignored.
> 
>     modes is most obviously useful with a discrete or categorical
>     variable.  Continuous variables may need to be placed in bins or
>     classes first.
> 
> 
> Options
> 
>     min(#) specifies that all values with a frequency of # or more
>         should be shown.
> 
>     nmodes(#) specifies that # modes should be shown. However, if ties
>         in frequency make identification of precisely # modes
>         arbitrary, all such tied modes will be shown. Note that fewer
>         modes will be shown if fewer than # modes exist.
> 
>         min() and nmodes() may not be specified together.
> 
>     generate(newvar) generates an indicator variable that is missing
>         if varname is missing or observations are excluded by if or
>         in, 1 whenever the value of varname is one of the displayed
>         modes, and 0 otherwise.
> 
> 
> Examples
> 
>     . modes rep78
>     . modes rep78 if foreign
>     . modes mpg, min(3)
>     . modes mpg, nmodes(3)
>     . modes turn, nmodes(10) gen(flag)
> 
> 
> Author 
> 
>     Nicholas J. Cox, Durham University, U.K.
>     n.j.cox@durham.ac.uk
> 
> 
> Acknowledgments 
> 
>     A problem posed by Sylvain Friederich led to the nmodes() option.
>     A problem posed by Elan Cohen led to the generate() option.
> 
> 
> Also see
> 
>     STB:     STB-50 sg113
>     Online:  help for tabulate, kdensity, egen
> 
> Nick 
> n.j.cox@durham.ac.uk 
> 
> Martin Weiss
> 
> As discussed last night between me and Sergiy: You want the whole dataset
> with all variables intact plus one that denotes membership in the "club of
> most frequent values of mpg"...
> 
> gjhxmu@sina.com
> 
> Suppose we need to flag the 5 most frequent values. How about the
> following commands?
> 
> sysuse auto, clear
> keep mpg
> bys mpg: egen mycount=count(mpg)
> bys mycount: g num=_n
> gsort num -mycount
> g tag=_n<=5
> bys mycount: egen rank5=max(tag)
> 
> 
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
> 
> 
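
The tie-handling rule described for nmodes() in the quoted help file can also be sketched with built-in commands only. This is an illustration under stated assumptions, not the code of -modes- itself: the variable names freq, first and flag and the local macros k and cutoff are hypothetical, and weights are not handled.

```stata
* Sketch only: flag the k most frequent values of mpg, keeping ties at the cutoff
sysuse auto, clear
local k 5

* frequency of each distinct nonmissing value
bysort mpg: gen long freq = _N if !missing(mpg)

* tag one observation per distinct nonmissing value
egen byte first = tag(mpg)

* tagged observations first, highest frequency at the top
gsort -first -freq mpg

* do not ask for more modes than there are distinct values
quietly count if first
local k = min(`k', r(N))

* the frequency of the k-th ranked value is the cutoff
local cutoff = freq[`k']

* 1 if the value's frequency reaches the cutoff, 0 otherwise, missing if mpg is missing
gen byte flag = freq >= `cutoff' if !missing(mpg)
tabulate mpg if flag
```

Comparing frequencies with a cutoff, rather than taking the first k rows, is what lets every value tied at the k-th frequency be flagged.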


