Bookmark and Share

Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down on April 23, and its replacement, statalist.org is already up and running.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Merging files based on name and year


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Merging files based on name and year
Date   Thu, 25 Aug 2011 09:52:22 +0100

I don't think regex is the only approach here. You could consider
using -split-.

A strategy here is to insert new parsing characters yourself. For
example suppose that ; is not used, which you can check by

assert strpos(name, ";") == 0

Then put ; after each terminal element such as "Inc" (there's probably
jargon I don't know)

clonevar work = name
replace work = subinstr(work, "Inc", "Inc;", .)
replace work = subinstr(work, "LLC", "LLC;", .)
replace work = subinstr(work, "Corp", "Corp;", .)

and so on.

You can get all the terminal elements from your file with just
individual names. It is the last word (word(,-1)) of the company name.
You can put that into a new variable and -tab- the results.

You may need to fix exceptions, which will be shown by the tabulation above.

Then -split- on ; and then -reshape-.

Sometimes a very primitive approach like this is much quicker than
spending hours trying to do it a cleverer way. (If someone were
exceptionally fluent with regular expressions that wouldn't be true.)

When regex works it can be a spectacular solution but with many messy
problems it is often a very long way round.

Nick

On Thu, Aug 25, 2011 at 9:24 AM, Joseph Monte <hmjc66@gmail.com> wrote:

> The file below is the master file containing names of companies and
> years. ABC Inc is one company, XYZ Corp is another company, PNG LLC is
> a third company. I have a total of 1100 different companies. As shown
> below, sometimes two or more companies are listed in the same field
> (there are a maximum of 5 companies listed in the same field). The
> year column has only one year for each observation. There are a total
> of 800 observations in this file.
>
> name                    year
> ABC Inc                 1986
> XYZ Corp                        1994
> ABC Inc XYZ Corp        2001
> PNG LLC                 2005
> XYZ Corp PNG LLC        2007
>
>
> I have a second file with data in the following format. The 1100
> companies are listed as shown below. YR8084 means the years 1980-1984,
> YR8591 means the years 1985-1991, and so on. The numbers below each
> year are scores assigned to each company during a certain period. For
> example, ABC Inc is assigned a score of 6 during 1980-1984, 7 from
> 1985-1991, and 9 from 2001-2004. ABC Inc is not assigned a score
> during other periods. Scores range from 1 to 9 and may be up to 3
> decimal places.
>
> name            YR8084  YR8591  YR9200  YR0104  YR0507  YR0809
> ABC Inc             6                7                              9
> XYZ Corp                                     2               5              6              6
> PNG LLC                              7               7              7
>  7                   7
>
>
> I want the master file to include a column with scores as shown below.
> For example, ABC Inc gets a score of 7 in 1986, XYZ Corp gets a score
> of 5 in 1994. For observations with two or more names, I want a simple
> average of scores. For example, for ABC Inc XYZ Corp, the score will
> be (9+6)/2=7.5. If a company has a year in the master file, then it
> definitely has a score for that year (i.e. time period) in the second
> file.
>
> name                    year                score
> ABC Inc                 1986                   7
> XYZ Corp                        1994                   5
> ABC Inc XYZ Corp        2001                  7.5
> PNG LLC                 2005                   7
> XYZ Corp PNG LLC        2007                  6.5
>
> I am using Stata 12. I expect I would need the -regexm()- command to
> split the company names, then -reshape- to get all company names one
> below the other and then -merge-. Since there are 1100 companies, I
> would need some kind of a loop to use the -regexm() command. I am
> having trouble writing the code.
>

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   Site index