Statalist



Re: st: RE: disregarding duplicate observations in a variable list


From   "Jessica Looze" <[email protected]>
To   [email protected]
Subject   Re: st: RE: disregarding duplicate observations in a variable list
Date   Wed, 7 Jan 2009 21:41:13 -0500

Thank you Nick and Scott for your suggestions. I tried Nick's
suggestion first, as an egen command seems the more efficient of the
two. However, when I entered the command

egen nvals = rownvals(emp1_97 emp2_97 emp3_97 emp1_98 emp2_98 emp3_98)

(after saving Nick's ado files of course) I received the error message

unexpected end of line
<istmt> incomplete
r(3000);

Unsure what this meant, I did a search and found a reference to this
message in an archived Statalist conversation.

http://www.stata.com/statalist/archive/2006-04/msg00434.html

This discussion seems to indicate that this message has to do with the
pickiness of Mata when "if" is involved. I am not very advanced at
writing programs, so looking through your programs, Nick, I am
uncertain how to tweak them (if tweaking is even the issue). Maybe there
is something else I need to be doing here?

Thank you again,
Jessica Looze

On Wed, Jan 7, 2009 at 10:58 AM, Nick Cox <[email protected]> wrote:
> The problem is that of counting duplicate _values_ across a varlist and
> within each observation. (The terminology of duplicate observations
> would imply a problem for -duplicates-, but that command does not help
> here.)
>
> Jessica's code borrowed from the -egenmore- package is to do with
> counting values that are positive and non-missing. That won't help
> either, as the values would be counted regardless of whether they are
> distinct, as Jessica realises. There isn't a very easy way to go further
> down that path, although it would be possible.
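One way of going further down that path can be sketched as follows: compare each variable with the variables already processed and count a value only if it has not been seen earlier in the same observation. This is a minimal sketch, not code from the thread; the variable names `njobs` and `new` are hypothetical, and the two observations are the example data quoted further down.

```stata
* Sketch only: loop-based count of distinct non-missing values per observation.
* -njobs- and -new- are illustrative names, not part of any package.
clear
input id emp1_97 emp2_97 emp3_97 emp1_98 emp2_98 emp3_98
1 9701 9702 9703 9801 9701 .
2 9701 .    .    9701 .    .
end

gen njobs = 0
local seen
foreach v of varlist emp1_97 emp2_97 emp3_97 emp1_98 emp2_98 emp3_98 {
    * start by treating every non-missing value as new
    gen byte new = !missing(`v')
    * ... then cancel it if an earlier variable holds the same value
    foreach w of local seen {
        replace new = 0 if `v' == `w'
    }
    replace njobs = njobs + new
    drop new
    local seen `seen' `v'
}
list id njobs
```

The number of comparisons grows with the square of the number of variables, so this is only comfortable for short variable lists.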
>
> Note that the -egenmore- package is on SSC. (Please remember to explain
> where programs you use come from.)
>
> The problem is however very close to that discussed in an FAQ
>
> FAQ     . . . . . . . Counting distinct strings across a set of variables
>         . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
>         7/04    How do I count the number of distinct strings
>                 across a set of variables?
>
> <http://www.stata.com/support/faqs/data/distinctstrings.html>
>
> One strategy discussed there starts with a -reshape-. Scott Merryman has
> followed a similar line in his suggestions.
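A reshape-based strategy of this kind might look roughly like the sketch below. It is not the FAQ's code or Scott Merryman's; it assumes an identifier variable `id`, uses the example data quoted further down, and introduces the hypothetical names `slot`, `tag`, and `njobs`. The `merge 1:1` syntax assumes Stata 11 or later.

```stata
* Sketch only: count distinct job IDs per respondent via -reshape long-.
clear
input id emp1_97 emp2_97 emp3_97 emp1_98 emp2_98 emp3_98
1 9701 9702 9703 9801 9701 .
2 9701 .    .    9701 .    .
end

preserve
* give the job variables a common stub so they can be stacked
local j 0
foreach v of varlist emp1_97-emp3_98 {
    local ++j
    rename `v' emp`j'
}
reshape long emp, i(id) j(slot)
drop if missing(emp)

* tag one observation per distinct (id, emp) pair, then total the tags
egen byte tag = tag(id emp)
egen njobs = total(tag), by(id)
keep id njobs
duplicates drop
tempfile counts
save `counts'
restore

* bring the counts back to the original wide data
merge 1:1 id using `counts', nogenerate
list id njobs
```

A respondent with no non-missing job IDs drops out of the long data, so `njobs` would be missing rather than 0 for that respondent after the merge.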
>
> Since that FAQ was written, writing an -egen- function based on a Mata
> workhorse has come to seem a good way to do this. In fact, the
> -rowmedian()- function for -egen- in the -egenmore- package has most of
> the code needed. As the problem arises for numeric variables as well as
> for string variables, two functions could be useful.
>
> * -------------------- put in _grownvals.ado on your adopath
> * number of distinct non-missing numeric values in each observation
> * NJC 1.0.0 7 Jan 2009
> program _grownvals
>        version 9
>        gettoken type 0 : 0
>        gettoken h    0 : 0
>        gettoken eqs  0 : 0
>
>        syntax varlist(numeric) [if] [in] [, BY(string)]
>        if `"`by'"' != "" {
>                _egennoby rownvals() `"`by'"'
>                /* NOTREACHED */
>        }
>
>        marksample touse, novarlist
>        quietly {
>                mata : row_nvals("`varlist'", "`touse'", "`h'", "`type'")
>        }
> end
>
> mata :
>
> void row_nvals(string scalar varnames,
>                string scalar tousename,
>                string scalar nvalsname,
>                string scalar type)
> {
>        real matrix y
>        real colvector nvals, row
>
>        st_view(y, ., tokens(varnames), tousename)
>        nvals = J(rows(y), 1, .)
>
>        for(i = 1; i <= rows(y); i++) {
>                row = y[i,]'
>                nvals[i] = length(uniqrows(select(row, (row :< .))))
>        }
>
>        st_addvar(type, nvalsname)
>        st_store(., nvalsname, tousename, nvals)
> }
>
> end
> * end of _grownvals.ado
>
> * -------------------- put in _growsvals.ado on your adopath
> * number of distinct non-missing string values in each observation
> * NJC 1.0.0 7 Jan 2009
> program _growsvals
>        version 9
>        gettoken type 0 : 0
>        gettoken h    0 : 0
>        gettoken eqs  0 : 0
>
>        syntax varlist(string) [if] [in] [, BY(string)]
>        if `"`by'"' != "" {
>                _egennoby rowsvals() `"`by'"'
>                /* NOTREACHED */
>        }
>
>        marksample touse, novarlist
>        quietly {
>                mata : row_svals("`varlist'", "`touse'", "`h'", "`type'")
>        }
> end
>
> mata :
>
> void row_svals(string scalar varnames,
>                string scalar tousename,
>                string scalar svalsname,
>                string scalar type)
> {
>        string matrix y
>        string colvector row
>        real colvector svals
>
>        st_sview(y, ., tokens(varnames), tousename)
>        svals = J(rows(y), 1, .)
>
>        for(i = 1; i <= rows(y); i++) {
>                row = y[i,]'
>                svals[i] = length(uniqrows(select(row, (row :!= ""))))
>        }
>
>        st_addvar(type, svalsname)
>        st_store(., svalsname, tousename, svals)
> }
>
> end
> * end of _growsvals.ado
>
>
> You can invoke these functions, once the program files are in place, by
>
> egen nvals = rownvals(<numeric varlist>)
>
> egen svals = rowsvals(<string varlist>)
>
> I'll add those functions to -egenmore- in due course.
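As an illustration, a call on the example data quoted further down might look like the sketch below. It assumes `_grownvals.ado` has been saved on the adopath exactly as listed above, with the call beginning `mata : row_nvals(` on a single line in the saved file; a call split across two lines by mail wrapping is one plausible source of an `<istmt> incomplete` error.

```stata
* Sketch only: requires _grownvals.ado (listed above) on the adopath.
clear
input id emp1_97 emp2_97 emp3_97 emp1_98 emp2_98 emp3_98
1 9701 9702 9703 9801 9701 .
2 9701 .    .    9701 .    .
end

* number of distinct non-missing job IDs in each observation
egen nvals = rownvals(emp1_97 emp2_97 emp3_97 emp1_98 emp2_98 emp3_98)
list id nvals
```

If the ado-file was edited after a failed first attempt, typing `discard` before rerunning should make Stata reload it.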
>
> Nick
> [email protected]
>
> Jessica Looze
>
> I am trying to create a variable that indicates the number of jobs an
> individual has held during a period of years. The dataset I am using,
> NLSY97, records each respondent's work history in a roster format.
> This roster assigns each job a unique ID indicating the year the job
> began. For example, the roster for respondent #1 might look like:
>
> ID     Year     Job1     Job2     Job3
> 1      1997     9701     9702     9703
> 1      1998     9801     9701     .
>
> So, during these two years, this respondent held four different jobs
> (9701 extending over into 1998).
>
> My data looks something like this:
>
> ID     EMP1_97     EMP2_97     EMP3_97     EMP1_98     EMP2_98     EMP3_98
> 1      9701        9702        9703        9801        9701        .
> 2      9701        .           .           9701        .           .
>
> I have been working with the row operations suggested in the egenmore
> help entry. My current working code looks like that on this manual
> page:
>
> gen any = 0
> gen all = 1
> gen count = 0
> foreach v of varlist emp1_97 emp2_97 emp3_97 emp1_98 emp2_98 emp3_98 {
>     replace any = max(any, inrange(`v', 0, .))
>     replace all = min(all, inrange(`v', 0, .))
>     replace count = count + inrange(`v', 0, .)
> }
>
> From here, I cannot figure out how to modify the variable count, so
> that it disregards duplicate IDs.
>
> Any suggestions would be much appreciated.
>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>