Statalist


RE: RE: st: AW: Create a flag variable for 10 most frequent values


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: RE: st: AW: Create a flag variable for 10 most frequent values
Date   Tue, 17 Nov 2009 15:08:39 -0000

I agree with these criteria. In addition, a general solution to this
problem should be able to handle

Missing values
Weights 
Ties in frequency (e.g. there may not be exactly 10 modes) 

As promised earlier, here is an update of -modes-, previously published
in the STB and the SJ. The update will also appear in the Stata Journal.

*! NJC 1.4.0 17 November 2009 
* NJC 1.3.0 13 May 2003            (SJ3-2: sg113_1)
* NJC 1.2.0 15 June 1999 
* NJC 1.1.2 23 December 1998
* NJC 1.1.1 29 October 1998
program modes, sort 
        version 8.0
        syntax varname [if] [in] [fweight aweight/] ///
	[ , Min(int 0) Nmodes(int 0) GENerate(str) ]

	if "`generate'" != "" { 
		capture confirm new variable `generate' 
		if _rc { 
			di as err "generate() requires new variable name"
			exit _rc 
		}
	} 

	if `min' & `nmodes' { 
		di as err "may not specify both min() and nmodes()"
		exit 198
	}
	
	quietly { 
		marksample touse, strok
		count if `touse' 
		if r(N) == 0 error 2000 
		
		tempvar freq 
		if "`exp'" == "" local exp = 1 
		bysort `touse' `varlist' : ///
			gen double `freq' = sum(`exp') * `touse'
		by `touse' `varlist' : ///
			replace `freq' = (_n == _N) * `freq'[_N] 
		label var `freq' "Freq."

		if `min' > 0 { 
			local which "`freq' >= `min'" 
		}	
		else if `nmodes' > 0 { 
			sort `touse' `freq' `varlist' 
			count if `freq' 
			local nmodes = min(`nmodes', r(N)) 
			local which "`freq' >= `freq'[_N - `nmodes' + 1]"
		} 	
		else {
			su `freq', meanonly
			local max = r(max)
			local which "`freq' == `max'" 
		}	
		
		count if `which'
		if r(N) == 0 {
			di as err "no such modes in data"
			exit 498
		}
	}

	tabdisp `varlist' if `which', c(`freq')

	quietly	if "`generate'" != "" { 
		gen byte `generate' = `which' if `touse' 
		bysort `touse' `varlist' (`generate') : ///
		replace `generate' = `generate'[_N] if `touse' 
	} 		
	
end
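
The frequency step inside the program can be tried on its own. What
follows is only a minimal sketch, assuming the auto dataset shipped with
Stata and no weights; the variable name freq is illustrative and stands
in for the temporary variable used above.

```stata
* Sketch of the frequency step, run outside the program.
* Assumes the auto data; freq is an illustrative (not temporary) name.
sysuse auto, clear

* Running count within each distinct value of mpg ...
bysort mpg : gen double freq = sum(1)

* ... then keep the group total on the last observation only, 0 elsewhere.
by mpg : replace freq = (_n == _N) * freq[_N]

* One nonzero entry per distinct value: its frequency.
list mpg freq if freq > 0, noobs
```

Holding the total on just one observation per group is what lets the
later sort on frequency and the count of nonzero frequencies work in
terms of distinct values rather than observations.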


------------------------------------------------------------------------
help for modes                          (SJ9-4: sg113_2; SJ3-2: sg113_1)
------------------------------------------------------------------------

Tabulation of mode(s)

        modes varname [weight] [if exp] [in range] [ , { min(#) |
                 nmodes(#) } generate(newvar) ]


Description

    modes tabulates the mode(s) of varname, that is, the value(s) of
    varname that occur most frequently. varname may be numeric or
    string.  fweights and aweights are allowed. Missing values are
    ignored.

    modes is most obviously useful with a discrete or categorical
    variable.  Continuous variables may need to be placed in bins or
    classes first.


Options

    min(#) specifies that all values with a frequency of # or more
        should be shown.

    nmodes(#) specifies that # modes should be shown. However, if ties
        in frequency make identification of precisely # modes
        arbitrary, all such tied modes will be shown. Note that fewer
        modes will be shown if fewer than # modes exist.

        min() and nmodes() may not be specified together.
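
    The handling of ties under nmodes() can be checked on a toy dataset.
    This is a sketch only, assuming the program above has been saved as
    modes.ado somewhere on the adopath; the variable names x and flag
    are made up for the illustration.

```stata
* Toy data: 1 occurs 3 times, 2 and 3 occur twice each, 4 occurs once.
clear
input x
1
1
1
2
2
3
3
4
end

* Asking for 2 modes: values 2 and 3 tie for second place,
* so all of 1, 2 and 3 are shown and flagged.
modes x, nmodes(2) generate(flag)
list x flag, noobs sepby(x)
```

    Here the cut-off frequency is that of the second most frequent
    value, and every value at or above that frequency qualifies.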

    generate(newvar) generates an indicator variable that is missing
        if varname is missing or observations are excluded by if or
        in, 1 whenever the value of varname is one of the displayed
        modes, and 0 otherwise.


Examples

    . modes rep78
    . modes rep78 if foreign
    . modes mpg, min(3)
    . modes mpg, nmodes(3)
    . modes turn, nmodes(10) gen(flag)
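
    None of the examples above uses weights. As a hedged sketch, again
    assuming modes.ado is installed, frequency weights might be used
    with data that have already been reduced to one observation per
    distinct value. The use of contract here is an illustration only,
    not part of the original help.

```stata
* Reduce the auto data to one observation per value of rep78;
* contract stores the counts in _freq by default.
sysuse auto, clear
contract rep78

* The stored counts serve as frequency weights.
* Missing values of rep78 are ignored by modes.
modes rep78 [fweight = _freq], generate(flag)
list rep78 _freq flag, noobs
```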


Author 

    Nicholas J. Cox, Durham University, U.K.
    n.j.cox@durham.ac.uk


Acknowledgments 

    A problem posed by Sylvain Friederich led to the nmodes() option.
    A problem posed by Elan Cohen led to the generate() option.


Also see

    STB:     STB-50 sg113
    Online:  help for tabulate, kdensity, egen

Nick 
n.j.cox@durham.ac.uk 

Martin Weiss

As discussed last night between me and Sergiy: You want the whole
dataset with all variables intact plus one that denotes membership in
the "club of most frequent values of mpg"...

gjhxmu@sina.com

Suppose we need to flag the 5 most frequent values. How about the
following code?

sysuse auto, clear
keep mpg
bys mpg: egen mycount=count(mpg)
bys mycount: g num=_n
gsort num -mycount
g tag=_n<=5
bys mycount: egen rank5=max(tag)

