Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
Joseph Monte <hmjc66@gmail.com> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: Merging files based on name and year |

Date |
Sun, 4 Sep 2011 16:03:00 +0100 |

Thanks for the help, Nick. Joe On Sun, Sep 4, 2011 at 10:45 AM, Nick Cox <njcoxstata@gmail.com> wrote: > Another is > > egen meanscore = mean(score), by(name) > bysort name (score) : replace meanscore = . if missing(score[_N]) > > > On Sun, Sep 4, 2011 at 9:29 AM, Nick Cox <njcoxstata@gmail.com> wrote: >> Many ways to this. One is >> >> egen nmissing = total(missing(score)), by(name) >> egen meanscore = mean(score) if nmissing == 0, by(name) >> >> Another is >> >> bysort name (score) : gen meanscore = sum(score) >> by name : replace meanscore = cond(missing(score[_N}), ., meanscore[_N]/_N) >> >> Nick >> >> On Sun, Sep 4, 2011 at 9:11 AM, Joseph Monte <hmjc66@gmail.com> wrote: >>> Nick, >>> >>> Thanks for the help. It worked well. Unfortunately, I'm stuck at the >>> final stage where I need to take the average score for each name >>> conditional on a score being available for each name. Unfortunately, I >>> do have missing observations. >>> >>> . input str1 name score >>> >>> name score >>> 1. "A" 7 >>> 2. "A" 8 >>> 3. "A" . >>> 4. "B" 6 >>> 5. "B" 7 >>> 6. "C" 5 >>> 7. end >>> >>> . egen meanscore = mean(score), by(name) >>> >>> . list >>> >>> +--------------------+ >>> | name score meanscore | >>> |--------------------| >>> 1. | A 8 7.5 | >>> 2. | A 7 7.5 | >>> 3. | A . 7.5 | >>> 4. | B 7 6.5 | >>> 5. | B 6 6.5 | >>> |--------------------| >>> 6. | C 5 5 | >>> +--------------------+ >>> >>> I want meanscore not to be calculated for A since there is a missing >>> observation (i.e. there should be 3 blanks instead of 7.5). How do I >>> get around this issue? I tried adapting the code from the link below >>> but was not successful. >>> >>> http://www.stata.com/support/faqs/data/anyall.html >>> >>> Thanks, >>> >>> Joe >>> >>> >>> >>> >>> >>> >>> On Thu, Aug 25, 2011 at 9:52 AM, Nick Cox <njcoxstata@gmail.com> wrote: >>>> I don't think regex is the only approach here. You could consider >>>> using -split-. >>>> >>>> A strategy here is to insert new parsing characters yourself. For >>>> example suppose that ; is not used, which you can check by >>>> >>>> assert strpos(name, ";") == 0 >>>> >>>> Then put ; after each terminal element such as "Inc" (there's probably >>>> jargon I don't know) >>>> >>>> clonevar work = name >>>> replace work = subinstr(work, "Inc", "Inc;", .) >>>> replace work = subinstr(work, "LLC", "LLC;", .) >>>> replace work = subinstr(work, "Corp", "Corp;", .) >>>> >>>> and so on. >>>> >>>> You can get all the terminal elements from your file with just >>>> individual names. It is the last word (word(,-1)) of the company name. >>>> You can put that into a new variable and -tab- the results. >>>> >>>> You may need to fix exceptions, which will be shown by the tabulation above. >>>> >>>> Then -split- on ; and then -reshape-. >>>> >>>> Sometimes a very primitive approach like this is much quicker than >>>> spending hours trying to do it a cleverer way. (If someone were >>>> exceptionally fluent with regular expressions that wouldn't be true.) >>>> >>>> When regex works it can be a spectacular solution but with many messy >>>> problems it is often a very long way round. >>>> >>>> Nick >>>> >>>> On Thu, Aug 25, 2011 at 9:24 AM, Joseph Monte <hmjc66@gmail.com> wrote: >>>> >>>>> The file below is the master file containing names of companies and >>>>> years. ABC Inc is one company, XYZ Corp is another company, PNG LLC is >>>>> a third company. I have a total of 1100 different companies. As shown >>>>> below, sometimes two or more companies are listed in the same field >>>>> (there are a maximum of 5 companies listed in the same field). The >>>>> year column has only one year for each observation. There are a total >>>>> of 800 observations in this file. >>>>> >>>>> name year >>>>> ABC Inc 1986 >>>>> XYZ Corp 1994 >>>>> ABC Inc XYZ Corp 2001 >>>>> PNG LLC 2005 >>>>> XYZ Corp PNG LLC 2007 >>>>> >>>>> >>>>> I have a second file with data in the following format. The 1100 >>>>> companies are listed as shown below. YR8084 means the years 1980-1984, >>>>> YR8591 means the years 1985-1991, and so on. The numbers below each >>>>> year are scores assigned to each company during a certain period. For >>>>> example, ABC Inc is assigned a score of 6 during 1980-1984, 7 from >>>>> 1985-1991, and 9 from 2001-2004. ABC Inc is not assigned a score >>>>> during other periods. Scores range from 1 to 9 and may be up to 3 >>>>> decimal places. >>>>> >>>>> name YR8084 YR8591 YR9200 YR0104 YR0507 YR0809 >>>>> ABC Inc 6 7 9 >>>>> XYZ Corp 2 5 6 6 >>>>> PNG LLC 7 7 7 >>>>> 7 7 >>>>> >>>>> >>>>> I want the master file to include a column with scores as shown below. >>>>> For example, ABC Inc gets a score of 7 in 1986, XYZ Corp gets a score >>>>> of 5 in 1994. For observations with two or more names, I want a simple >>>>> average of scores. For example, for ABC Inc XYZ Corp, the score will >>>>> be (9+6)/2=7.5. If a company has a year in the master file, then it >>>>> definitely has a score for that year (i.e. time period) in the second >>>>> file. >>>>> >>>>> name year score >>>>> ABC Inc 1986 7 >>>>> XYZ Corp 1994 5 >>>>> ABC Inc XYZ Corp 2001 7.5 >>>>> PNG LLC 2005 7 >>>>> XYZ Corp PNG LLC 2007 6.5 >>>>> >>>>> I am using Stata 12. I expect I would need the -regexm()- command to >>>>> split the company names, then -reshape- to get all company names one >>>>> below the other and then -merge-. Since there are 1100 companies, I >>>>> would need some kind of a loop to use the -regexm() command. I am >>>>> having trouble writing the code. >>>>> >>>> >>>> * >>>> * For searches and help try: >>>> * http://www.stata.com/help.cgi?search >>>> * http://www.stata.com/support/statalist/faq >>>> * http://www.ats.ucla.edu/stat/stata/ >>>> >>> >>> * >>> * For searches and help try: >>> * http://www.stata.com/help.cgi?search >>> * http://www.stata.com/support/statalist/faq >>> * http://www.ats.ucla.edu/stat/stata/ >>> >> > > * > * For searches and help try: > * http://www.stata.com/help.cgi?search > * http://www.stata.com/support/statalist/faq > * http://www.ats.ucla.edu/stat/stata/ > * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**References**:**Re: st: Merging files based on name and year***From:*Joseph Monte <hmjc66@gmail.com>

**Re: st: Merging files based on name and year***From:*Nick Cox <njcoxstata@gmail.com>

**Re: st: Merging files based on name and year***From:*Nick Cox <njcoxstata@gmail.com>

- Prev by Date:
**Re: st: Thread-Index: Acxq/JXymUkm/inSSKKCXH1bWyPDJw==** - Next by Date:
**st: Missing value in -merge- output but not in _merge variable** - Previous by thread:
**Re: st: Merging files based on name and year** - Next by thread:
**Re: st: Effect size after bootstrapping** - Index(es):