Bookmark and Share

Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down at the end of May, and its replacement, statalist.org is already up and running.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Merging files based on name and year


From   Joseph Monte <hmjc66@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Merging files based on name and year
Date   Sun, 4 Sep 2011 16:03:00 +0100

Thanks for the help, Nick.

Joe


On Sun, Sep 4, 2011 at 10:45 AM, Nick Cox <njcoxstata@gmail.com> wrote:
> Another is
>
> egen meanscore = mean(score), by(name)
> bysort name (score) : replace meanscore = . if missing(score[_N])
>
>
> On Sun, Sep 4, 2011 at 9:29 AM, Nick Cox <njcoxstata@gmail.com> wrote:
>> Many ways to this. One is
>>
>> egen nmissing = total(missing(score)), by(name)
>> egen meanscore = mean(score) if nmissing == 0, by(name)
>>
>> Another is
>>
>> bysort name (score) : gen meanscore = sum(score)
>> by name : replace meanscore = cond(missing(score[_N}), ., meanscore[_N]/_N)
>>
>> Nick
>>
>> On Sun, Sep 4, 2011 at 9:11 AM, Joseph Monte <hmjc66@gmail.com> wrote:
>>> Nick,
>>>
>>> Thanks for the help. It worked well. Unfortunately, I'm stuck at the
>>> final stage where I need to take the average score for each name
>>> conditional on a score being available for each name. Unfortunately, I
>>> do have missing observations.
>>>
>>> . input str1 name score
>>>
>>>            name       score
>>>  1. "A" 7
>>>  2. "A" 8
>>>  3. "A" .
>>>  4. "B" 6
>>>  5. "B" 7
>>>  6. "C" 5
>>>  7. end
>>>
>>> . egen meanscore = mean(score), by(name)
>>>
>>> . list
>>>
>>>     +--------------------+
>>>     | name   score   meanscore |
>>>     |--------------------|
>>>  1. |  A      8      7.5 |
>>>  2. |  A      7      7.5 |
>>>  3. |  A      .      7.5 |
>>>  4. |  B      7      6.5 |
>>>  5. |  B      6      6.5 |
>>>     |--------------------|
>>>  6. |  C      5        5 |
>>>     +--------------------+
>>>
>>> I want meanscore not to be calculated for A since there is a missing
>>> observation (i.e. there should be 3 blanks instead of 7.5). How do I
>>> get around this issue? I tried adapting the code from the link below
>>> but was not successful.
>>>
>>> http://www.stata.com/support/faqs/data/anyall.html
>>>
>>> Thanks,
>>>
>>> Joe
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Aug 25, 2011 at 9:52 AM, Nick Cox <njcoxstata@gmail.com> wrote:
>>>> I don't think regex is the only approach here. You could consider
>>>> using -split-.
>>>>
>>>> A strategy here is to insert new parsing characters yourself. For
>>>> example suppose that ; is not used, which you can check by
>>>>
>>>> assert strpos(name, ";") == 0
>>>>
>>>> Then put ; after each terminal element such as "Inc" (there's probably
>>>> jargon I don't know)
>>>>
>>>> clonevar work = name
>>>> replace work = subinstr(work, "Inc", "Inc;", .)
>>>> replace work = subinstr(work, "LLC", "LLC;", .)
>>>> replace work = subinstr(work, "Corp", "Corp;", .)
>>>>
>>>> and so on.
>>>>
>>>> You can get all the terminal elements from your file with just
>>>> individual names. It is the last word (word(,-1)) of the company name.
>>>> You can put that into a new variable and -tab- the results.
>>>>
>>>> You may need to fix exceptions, which will be shown by the tabulation above.
>>>>
>>>> Then -split- on ; and then -reshape-.
>>>>
>>>> Sometimes a very primitive approach like this is much quicker than
>>>> spending hours trying to do it a cleverer way. (If someone were
>>>> exceptionally fluent with regular expressions that wouldn't be true.)
>>>>
>>>> When regex works it can be a spectacular solution but with many messy
>>>> problems it is often a very long way round.
>>>>
>>>> Nick
>>>>
>>>> On Thu, Aug 25, 2011 at 9:24 AM, Joseph Monte <hmjc66@gmail.com> wrote:
>>>>
>>>>> The file below is the master file containing names of companies and
>>>>> years. ABC Inc is one company, XYZ Corp is another company, PNG LLC is
>>>>> a third company. I have a total of 1100 different companies. As shown
>>>>> below, sometimes two or more companies are listed in the same field
>>>>> (there are a maximum of 5 companies listed in the same field). The
>>>>> year column has only one year for each observation. There are a total
>>>>> of 800 observations in this file.
>>>>>
>>>>> name                    year
>>>>> ABC Inc                 1986
>>>>> XYZ Corp                        1994
>>>>> ABC Inc XYZ Corp        2001
>>>>> PNG LLC                 2005
>>>>> XYZ Corp PNG LLC        2007
>>>>>
>>>>>
>>>>> I have a second file with data in the following format. The 1100
>>>>> companies are listed as shown below. YR8084 means the years 1980-1984,
>>>>> YR8591 means the years 1985-1991, and so on. The numbers below each
>>>>> year are scores assigned to each company during a certain period. For
>>>>> example, ABC Inc is assigned a score of 6 during 1980-1984, 7 from
>>>>> 1985-1991, and 9 from 2001-2004. ABC Inc is not assigned a score
>>>>> during other periods. Scores range from 1 to 9 and may be up to 3
>>>>> decimal places.
>>>>>
>>>>> name            YR8084  YR8591  YR9200  YR0104  YR0507  YR0809
>>>>> ABC Inc             6                7                              9
>>>>> XYZ Corp                                     2               5              6              6
>>>>> PNG LLC                              7               7              7
>>>>>  7                   7
>>>>>
>>>>>
>>>>> I want the master file to include a column with scores as shown below.
>>>>> For example, ABC Inc gets a score of 7 in 1986, XYZ Corp gets a score
>>>>> of 5 in 1994. For observations with two or more names, I want a simple
>>>>> average of scores. For example, for ABC Inc XYZ Corp, the score will
>>>>> be (9+6)/2=7.5. If a company has a year in the master file, then it
>>>>> definitely has a score for that year (i.e. time period) in the second
>>>>> file.
>>>>>
>>>>> name                    year                score
>>>>> ABC Inc                 1986                   7
>>>>> XYZ Corp                        1994                   5
>>>>> ABC Inc XYZ Corp        2001                  7.5
>>>>> PNG LLC                 2005                   7
>>>>> XYZ Corp PNG LLC        2007                  6.5
>>>>>
>>>>> I am using Stata 12. I expect I would need the -regexm()- command to
>>>>> split the company names, then -reshape- to get all company names one
>>>>> below the other and then -merge-. Since there are 1100 companies, I
>>>>> would need some kind of a loop to use the -regexm() command. I am
>>>>> having trouble writing the code.
>>>>>
>>>>
>>>> *
>>>> *   For searches and help try:
>>>> *   http://www.stata.com/help.cgi?search
>>>> *   http://www.stata.com/support/statalist/faq
>>>> *   http://www.ats.ucla.edu/stat/stata/
>>>>
>>>
>>> *
>>> *   For searches and help try:
>>> *   http://www.stata.com/help.cgi?search
>>> *   http://www.stata.com/support/statalist/faq
>>> *   http://www.ats.ucla.edu/stat/stata/
>>>
>>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   Site index