Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

RE: st: questions about duplicate observations


From   "Wen Xia Ge" <[email protected]>
To   <[email protected]>
Subject   RE: st: questions about duplicate observations
Date   Tue, 27 May 2008 12:01:58 -0400

Hi Nick,

Thanks for your suggestion. Your suggestion works well for the second
approach. But I still do not figure out how to use -collapse- to get the
dataset described in the first approach. That is, for firms with
multiple bond issues in a year, I just want to keep the issue with the
largest offering_amt (firms with single bond issue will remain in the
dataset). I tried the following:

 bysort yeara cnum (offering_amt) : gen max = offering_amt[_N]
 collapse max bond_yield maturity (and some other variables which are
not listed here), by(yeara cnum)

It will give the means of the listed variables. In this case, max is OK
(it is the largest offering_amt), but I want to keep the orignial
bond_yield, maturity etc associated with the issue with the largest
offering_amt. e.g., for issue 7, 8 and 9, issue 7 and 8 should be
removed, and just variables associated with issue 9 will be remained in
the dataset. 

I tried to use -duplicates drop-, but I cannot sort data in descending
order, because the error message says -gsort- cannot be combined with
-by-. 

I would appreciate it if you could give me more clues. 

Wenxia




-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of n j cox
Sent: Tuesday, May 27, 2008 10:22 AM
To: [email protected]
Subject: Re: st: questions about duplicate observations

.
Use -by:- for your calculations. -collapse- only at the end. Recall that
the mean of a constant is that constant.

e.g.

bysort yeara cnum (offering_amt) : gen max = offering_amt[_N]

by yeara cnum : gen num = sum(bond_yield * offering_amt) by yeara cnum :
gem den = sum(offering_amt) by yeara cnum : gen avewt_bind_yield =
num[_N] / den[_N]

Nick
[email protected]

Wen Xia Ge <[email protected]>

I have a dataset that contains thousands of firms with bond issues from
1980-2004. Some firms have single bond issue in a given year, while some
firms have multiple bond issues in a given year. For example, the
dataset looks like this (I have more variables than listed here):

issue_id  issuer_id  issuer_cusip  offering_year  offering_amt
bond_yield  maturity	
1	           3	000361	    1989	     ###
###	       ###	
2	           3      000361	    1993	     ###
###	       ###	
3	           4	00077D	    1994	     ###
###	       ###	
4	           4	00077D	    1994	     ###
###	       ###	
5	           4	00077T	    1993	     ###
###	       ###	
6	           4	00077T	    1993	     ###
###	       ###	
7	           6	000800	    1982	   700000
###	       ###	
8	           6	000800	    1982	   680000
###	       ###	
9	           6	000800	    1982	   800000
###	       ###	
10	           6	000800	    1984	     ###
###	       ###	


For firms with single bond issue in a year, they will be kept in the
dataset as it is. I want to use two approaches to handle firms with
multiple bond issues in a given year:

First, I want to keep the issue with the largest offering_amt in the
dataset (for example, for company 000800, drop issue 7 and 8, keep issue
9, for year 1982, and keep issue 10 for year 1984), so that in the
resulting panel data, in a given year, a company can only show up at
most once (with the largest offering amount);

Second, I want to calculate weighted average bond_yield (using
offering_amt as the weights), average offering_amt and average maturity,
and save it as a new observation while dropping original multiple bond
issues. For example, in the resulting panel data, there will be one
observation for 000800 in 1982, the offering_amt will take the average
of 700000, 680000, and 800000, and the bond_yiled will take the weighted
value of the three bond_yield on issue 7, 8 and 9.

I tried the following command:
sort yeara cnum offering_amt
collapse (last) offering_amt, by(yeara cnum)

And:

sort yeara cnum
collapse (max) offering_amt, by(yeara cnum)

The resulting dataset contains only three variables: yeara cnum and
offering_amt, but I need to keep other variables.


*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2024 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index