Re: st: identify unique string values within lists of elements over chosen time windows

From   Denisa Mindruta <>
Subject   Re: st: identify unique string values within lists of elements over chosen time windows
Date   Fri, 22 Mar 2013 04:46:40 -0700 (PDT)

Dear Nick, this has been a very helpful conversation! For anyone else 
potentially interested in this posting:

Another solution, proposed by Dimitriy on Stack Overflow, was to use
collapse (sum) new=n, by(obs year) after creating the indicator n that flags the 
first occurrence of a string value. But Dimitriy's solution requires the 
additional step of merging the new variable back into the original dataset.
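
A minimal sketch of that collapse-and-merge route, assuming the sample data from the original question. The indicator name n and the new variable name new follow the quoted command; the year variable is yr here, as in the data, and the tempfile name original is hypothetical.

```stata
* sample data from the original question; "" marks an empty string
clear
input obs yr str4 var1 str4 var2 str4 var3
1 90 str1 str2 str3
1 91 str1 str4 str5
2 90 str3 str4 ""
2 91 str4 str5 ""
2 93 str3 str5 ""
2 94 str7 "" ""
end

* keep a copy of the wide data to merge back into later
tempfile original
save `original'

* long layout: one row per obs, yr and slot
reshape long var, i(obs yr) j(which)

* n = 1 the first time a string appears within obs, 0 otherwise
bysort obs var (yr): gen n = _n == 1 & !missing(var)

* one row per obs and yr holding the count of first occurrences
collapse (sum) new = n, by(obs yr)

* the extra step: bring var1-var3 back alongside the count
merge 1:1 obs yr using `original', nogenerate
sort obs yr
list, sepby(obs)
```

The collapse leaves only obs, yr and new in memory, which is why the merge is needed to recover the original variables.
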
I also asked Nick whether reshaping is the most "efficient" way of approaching 
the issue and here is what he said. I quote Nick:

"(MORE) Further comments focused largely on efficiency, meaning here speed 
rather than space. (Storage space could be biting the poster.) 

Without a restructure, here using reshape, the problem  is a triple loop: over 
identifiers, over observations for each  identifier and over variables. Possibly 
the two outer loops can be  collapsed to one. But an explicit loop over 
observations is usually slow  in Stata. 

With the restructuring solutions proposed by Dimitriy and myself, by: operations 
go straight to compiled code and are relatively fast: reshape is interpreted 
code and entails file manipulations, so can be slow. On the other hand reshape 
can be fast to write down with some experience, and it really is worth acquiring 
the fluency with reshape which comes with experience. In addition to the help 
for reshape and the manual entry, see the FAQ on reshape I wrote on 

Another consideration is what else you want to do with this kind of  dataset. If 
there are going to be other problems of similar character,  they will usually be 
easier with a long structure as produced by reshape, so keeping that structure 
will be a good idea."
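
As a sketch of what keeping the long structure could look like: the count is computed with egen's total() function and the data are left long. This swaps egen total() in for the running sum used in the thread and is not code from the thread; first and expect are hypothetical variable names, and expect simply holds the newval column from the original question so that the result can be checked.

```stata
* sample data from the original question, plus the expected counts
clear
input obs yr str4 var1 str4 var2 str4 var3 expect
1 90 str1 str2 str3 3
1 91 str1 str4 str5 2
2 90 str3 str4 "" 2
2 91 str4 str5 "" 1
2 93 str3 str5 "" 0
2 94 str7 "" "" 1
end

* long layout: one row per obs, yr and slot
reshape long var, i(obs yr) j(which)

* flag the earliest year in which each string occurs within obs
bysort obs var (yr): gen byte first = _n == 1 & !missing(var)

* put the per-year count on every row of the obs-yr group
egen newval = total(first), by(obs yr)

* stops with an error if any count differs from the expected value
assert newval == expect

sort obs yr which
list obs yr which var newval, sepby(obs yr)
```

Because the data stay long, further by: operations on the strings can follow directly, without a second reshape.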

----- Original Message ----
From: Nick Cox <>
Sent: Fri, March 22, 2013 4:27:35 AM
Subject: Re: st: identify unique string values within lists of elements over 
chosen time windows

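* sample data from the original question; "" marks an empty string
clear
input obs     yr   str4 var1 str4  var2 str4   var3
1        90    str1    str2    str3
1        91    str1    str4    str5
2        90    str3    str4    ""
2        91    str4    str5    ""
2        93    str3    str5    ""
2        94    str7    ""      ""
end
* one row per obs, yr and slot; the string lands in var
reshape long var , i(obs yr) j(which)
* flag the earliest year in which each string occurs within obs
bysort obs var (yr) : gen new = _n == 1 & !missing(var)
* running sum within obs and yr, then copy the group total to every row
bysort obs yr : replace new = sum(new)
by obs yr : replace new = new[_N]
* back to the original wide layout, now with the count in new
reshape wide var, i(obs yr) j(which)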


On Thu, Mar 21, 2013 at 11:22 PM, Denisa Mindruta <> wrote:
> Hi everyone. I have an unbalanced, large panel dataset, where each observation
> can take multiple string values (each string is stored in a separate variable).
> At each point in time, I need to count how many of the string values taken by an
> observation are "new", meaning that they do not show up among the values taken
> by the same observation in previous years. How should I approach this problem?
>
> Thanks! Below is a description of the data. I need to calculate newval.
> obs     yr    var1    var2    var3    newval
> 1       90    str1    str2    str3    3
> 1       91    str1    str4    str5    2
> 2       90    str3    str4            2
> 2       91    str4    str5            1
> 2       93    str3    str5            0
> 2       94    str7                    1
*   For searches and help try:

