Statalist The Stata Listserver

[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

st: RE: first|string option for collapse

From   "Nick Cox" <>
To   <>
Subject   st: RE: first|string option for collapse
Date   Fri, 6 Oct 2006 10:03:21 +0100

In problems like this, there is often a clash of 
experts, between the by-pilots and the merge mavens. 
The by-pilots love to show how a quick -sort- and an 
application of -by:- can get you where you want to be. 
The merge mavens set up multiple files and then put them 
together with some clever choreography, -merge- magic 
making light work of a mess. 

You have found a way to a solution that culminates in
a -merge-, but understandably want something simpler. 

If I understand this correctly, there is no need 
for a hit on -collapse-. You just need some prior 
surgery on -org_name- using -by:-. 

In case of name conflict, you want the first -org_name- 
used for each -id- to be definitive. Spreading this to 
all occurrences of a given -id- is a standard problem 
for -by:-. (There is some database terminology 
for this; is it indeed "spreading"?) 

sort id, stable 
by id : replace org_name = org_name[1] 

This works with both strings and numerics. 

Now your data are ready for -collapse-. 

Recall that under the aegis of -by:- subscripts are 
reckoned within each group of observations defined by
-by:-. The -stable- option takes account of observation
order, but in most such problems people will have a time
or sequence identifier. 

For a discursive tutorial on -by:-, see

SJ-2-1  pr0004  . . . . . . . . . . Speaking Stata:  How to move step by: step
        Q1/02   SJ 2(1):86-102                                   (no commands)
        explains the use of the by varlist : construct to tackle
        a variety of problems with group structure, ranging from
        simple calculations for each of several groups to more
        advanced manipulations that use the built-in _n and _N

Alternatively, use the manual index to identify various sections
on -by:-, and then read them all to consolidate your by-pilot 


Caleb Southworth
> A problem that comes up occasionally when collapsing data on 
> organizations
> is the inability to take the first instance of a variable in 
> the set being
> collapsed. This would be most useful when that variable is a 
> string, but I
> believe is also of general use. Below is my work around, but perhaps
> someone has a better one?
> Imagine data on organizations:
> id	org_name	_1935	_1936
> 1	foo		10	.
> 1	foo_blah	.	12
> 2	noo		54	55
> I have duplicate ids and I want to collapse the data, but I 
> don't want to
> lose the name. The non-existent command would be
> -collapse (first) org_name (sum) _*, by(id)-
> instead
> save orig
> duplicates drop id, force
> keep id org_name
> save temp_name, replace
> use orig, clear
> collapse (sum) _*, by(id)
> merge id using temp_name, sort
> (first) in collapse could also be used to priviledge the data 
> on one type
> of case, i.e. the start date of an organization: - sort id start- then
> collapse keeping earliest start date.

*   For searches and help try:

© Copyright 1996–2017 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index