
RE: st: recognizing patterns within two columns of data

From	Nick Cox <[email protected]>
To	"'[email protected]'" <[email protected]>
Subject	RE: st: recognizing patterns within two columns of data
Date	Thu, 7 Jul 2011 11:11:08 +0100

The advice here sounds an appropriate caution, but much bigger problems with this solution are not mentioned. 

Note that -vallist- (SSC) does nothing here that -levelsof- (official Stata) does not also do. In fact, there is much more engineering behind -levelsof-, which is just -vallist- made official, and it has been much more thoroughly tested on larger sets of values. (The main reasons for -vallist- to continue to be visible have nothing to do with anything used here.)

Further, commands like 

local temp1=r(list)

will just truncate their arguments at 244 characters, so this code won't work for any serious dataset. Fixing this by something like 

local temp1 `r(list)' 

would remove that problem. The sticking-point for this solution then becomes the same kind of problem in another guise, namely an assumption that a list of holders can be held within a string variable, which cannot be more than 244 characters long. 
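
To illustrate the difference between the two forms, here is a minimal sketch. The toy data come from Dalhia's example; the macro names viaexp, viacopy, len1 and len2 are made up for illustration, and -levelsof- stands in for -vallist-. With so short a list both macros should hold the same text; the truncation described above would only show once the list is longer than the limit.

```stata
* Toy data taken from the example in this thread
clear
input str5 company str7 holder
compA holderA
compB holderA
compA holderB
compB holderB
compC holderB
end

* -levelsof- leaves the distinct values in r(levels)
levelsof company, clean

* Form 1: the right-hand side is evaluated as a string expression
local viaexp = r(levels)

* Form 2: the contents of r(levels) are copied as macro text
local viacopy `r(levels)'

* Compare lengths with a macro extended function, so that no
* further string expression is evaluated along the way
local len1 : length local viaexp
local len2 : length local viacopy
display "evaluated as expression: `len1' characters"
display "copied as macro text:    `len2' characters"
```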

Without knowing anything about Dalhia's real data, my guess is that such an assumption may bite, so watch out. 

Nick 
[email protected] 

P.S. On a matter of style, note that Subrata's code 

egen group_hold=group(hold_list)
tostring group_hold, replace
vallist group_hold
local temp3=r(list)
foreach x of local temp3{
	vallist company if group_hold=="`x'"
	local temp4=r(list)
	replace comp_list="`temp4'" if group_hold=="`x'"
}

incorporates some needless to-and-fro, turning a well-behaved integer variable into a string and then calling up -vallist- when the answer is predictable in advance: 

egen group_hold=group(hold_list)
su group_hold, meanonly
forval x = 1/`r(max)' {
	vallist company if group_hold==`x'
	replace comp_list="`r(list)'" if group_hold==`x'
}

should have the same effect. However, this is just tinkering, as the larger problems mentioned above still remain. 
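
As a sketch of the point made earlier about -levelsof-, the same loop can be written without -vallist- at all. The toy data and the macro name comps are assumptions for illustration; hold_list is typed in directly here rather than built as in Subrata's code. This still stores each list in a string variable, so the length limit on string variables discussed above is not avoided.

```stata
* Toy data: each company with its (already built) list of holders
clear
input str5 company str20 hold_list
compA "holderA holderB"
compB "holderA holderB"
compC "holderB"
end

* One integer group per distinct list of holders
egen group_hold = group(hold_list)
gen comp_list = ""

* Groups run from 1 to r(max), so loop over the integers directly
summarize group_hold, meanonly
forvalues x = 1/`r(max)' {
	* put the distinct companies of this group into local macro comps
	levelsof company if group_hold == `x', local(comps) clean
	replace comp_list = "`comps'" if group_hold == `x'
}

list hold_list comp_list
```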

SUBRATA BHATTACHARYYA

You might want to try this, though you will need the -vallist- package
(please use -findit- to locate and install it).
I stored the data you provided in a variable named comp_hold and
then split it into company and holder. Then I used -vallist- to
identify the distinct values and used those in a macro to get this
output:
     +-------------------------------+
     |       hold_list     comp_list |
     |-------------------------------|
  1. | holderA holderB   compA compB |
  2. |         holderB         compC |
     +-------------------------------+

I hope this works. This is what I wrote:

split comp_hold
ren comp_hold1 company
ren comp_hold2 holder
sort company holder
gen hold_list=""
gen comp_list=""
vallist company
local temp1=r(list)
foreach x of local temp1{
	vallist holder if company=="`x'"
	local temp2=r(list)
	replace hold_list="`temp2'" if company=="`x'"
}
egen group_hold=group(hold_list)
tostring group_hold, replace
vallist group_hold
local temp3=r(list)
foreach x of local temp3{
	vallist company if group_hold=="`x'"
	local temp4=r(list)
	replace comp_list="`temp4'" if group_hold=="`x'"
}
duplicates drop hold_list, force
list hold_list comp_list

I hope this works for you. FYI, I used Stata 11.2. One small piece of
advice: please check that -vallist- can capture all the company names
or holder names in one go. I am not sure whether it can return a full
list of the names if your data set is too large. In that case, you
might want to split your file into manageable pieces.

On Thu, Jul 7, 2011 at 11:37 AM, Dalhia <[email protected]> wrote:

> Hello, Thanks. But egen group won't work since the holders are not the same. CompA and B (which I want grouped together) are owned by holderA and by holderB. The link is that these two companies are owned by people who also own shares in the other company - holderA owns shares in compA and also compB; similarly holderB owns shares in compA and also in compB. I want to identify those companies that are linked by multiple common owners.
>
> Example:
> compA holderA
> compB holderA
> compA holderB
> compB holderB
> compC holderB
>
> What I want:
> compA group1
> compB group1
>
> Thanks for your help. I appreciate it.
>
> Dalhia
>
> --- On Wed, 7/6/11, Nick Cox <[email protected]> wrote:
>
> > From: Nick Cox <[email protected]>
> > Subject: RE: st: recognizing patterns within two columns of data
> > To: "'[email protected]'" <[email protected]>
> > Date: Wednesday, July 6, 2011, 7:50 PM
> > -egen, group()- ?
> >
> > Nick
> > [email protected]
> >
> >
> > Austin Nichols
> >
> > Do you want to make an identifier as in
> > http://www.stata.com/statalist/archive/2011-07/msg00170.html
> > ?
> >
> > On Wed, Jul 6, 2011 at 10:12 AM, Dalhia <[email protected]> wrote:
> > >
> > > I would like some advice on how to do the following. Here is how the data looks:
> > >
> > > compA holderA
> > > compB holderA
> > > compC holderL
> > > compD holderH
> > > compA holderB
> > > compB holderB
> > > compC holderB
> > >
> > > Above, there was more than one instance where compA and compB had the same holder. In a large database, how do I identify instances where a set of comps appear repeatedly with the same holders?
