Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down at the end of May, and its replacement, **statalist.org** is already up and running.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
Nick Cox <njcoxstata@gmail.com> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: Merging files based on name and year |

Date |
Sun, 4 Sep 2011 10:45:43 +0100 |

Another is egen meanscore = mean(score), by(name) bysort name (score) : replace meanscore = . if missing(score[_N]) On Sun, Sep 4, 2011 at 9:29 AM, Nick Cox <njcoxstata@gmail.com> wrote: > Many ways to this. One is > > egen nmissing = total(missing(score)), by(name) > egen meanscore = mean(score) if nmissing == 0, by(name) > > Another is > > bysort name (score) : gen meanscore = sum(score) > by name : replace meanscore = cond(missing(score[_N}), ., meanscore[_N]/_N) > > Nick > > On Sun, Sep 4, 2011 at 9:11 AM, Joseph Monte <hmjc66@gmail.com> wrote: >> Nick, >> >> Thanks for the help. It worked well. Unfortunately, I'm stuck at the >> final stage where I need to take the average score for each name >> conditional on a score being available for each name. Unfortunately, I >> do have missing observations. >> >> . input str1 name score >> >> name score >> 1. "A" 7 >> 2. "A" 8 >> 3. "A" . >> 4. "B" 6 >> 5. "B" 7 >> 6. "C" 5 >> 7. end >> >> . egen meanscore = mean(score), by(name) >> >> . list >> >> +--------------------+ >> | name score meanscore | >> |--------------------| >> 1. | A 8 7.5 | >> 2. | A 7 7.5 | >> 3. | A . 7.5 | >> 4. | B 7 6.5 | >> 5. | B 6 6.5 | >> |--------------------| >> 6. | C 5 5 | >> +--------------------+ >> >> I want meanscore not to be calculated for A since there is a missing >> observation (i.e. there should be 3 blanks instead of 7.5). How do I >> get around this issue? I tried adapting the code from the link below >> but was not successful. >> >> http://www.stata.com/support/faqs/data/anyall.html >> >> Thanks, >> >> Joe >> >> >> >> >> >> >> On Thu, Aug 25, 2011 at 9:52 AM, Nick Cox <njcoxstata@gmail.com> wrote: >>> I don't think regex is the only approach here. You could consider >>> using -split-. >>> >>> A strategy here is to insert new parsing characters yourself. For >>> example suppose that ; is not used, which you can check by >>> >>> assert strpos(name, ";") == 0 >>> >>> Then put ; after each terminal element such as "Inc" (there's probably >>> jargon I don't know) >>> >>> clonevar work = name >>> replace work = subinstr(work, "Inc", "Inc;", .) >>> replace work = subinstr(work, "LLC", "LLC;", .) >>> replace work = subinstr(work, "Corp", "Corp;", .) >>> >>> and so on. >>> >>> You can get all the terminal elements from your file with just >>> individual names. It is the last word (word(,-1)) of the company name. >>> You can put that into a new variable and -tab- the results. >>> >>> You may need to fix exceptions, which will be shown by the tabulation above. >>> >>> Then -split- on ; and then -reshape-. >>> >>> Sometimes a very primitive approach like this is much quicker than >>> spending hours trying to do it a cleverer way. (If someone were >>> exceptionally fluent with regular expressions that wouldn't be true.) >>> >>> When regex works it can be a spectacular solution but with many messy >>> problems it is often a very long way round. >>> >>> Nick >>> >>> On Thu, Aug 25, 2011 at 9:24 AM, Joseph Monte <hmjc66@gmail.com> wrote: >>> >>>> The file below is the master file containing names of companies and >>>> years. ABC Inc is one company, XYZ Corp is another company, PNG LLC is >>>> a third company. I have a total of 1100 different companies. As shown >>>> below, sometimes two or more companies are listed in the same field >>>> (there are a maximum of 5 companies listed in the same field). The >>>> year column has only one year for each observation. There are a total >>>> of 800 observations in this file. >>>> >>>> name year >>>> ABC Inc 1986 >>>> XYZ Corp 1994 >>>> ABC Inc XYZ Corp 2001 >>>> PNG LLC 2005 >>>> XYZ Corp PNG LLC 2007 >>>> >>>> >>>> I have a second file with data in the following format. The 1100 >>>> companies are listed as shown below. YR8084 means the years 1980-1984, >>>> YR8591 means the years 1985-1991, and so on. The numbers below each >>>> year are scores assigned to each company during a certain period. For >>>> example, ABC Inc is assigned a score of 6 during 1980-1984, 7 from >>>> 1985-1991, and 9 from 2001-2004. ABC Inc is not assigned a score >>>> during other periods. Scores range from 1 to 9 and may be up to 3 >>>> decimal places. >>>> >>>> name YR8084 YR8591 YR9200 YR0104 YR0507 YR0809 >>>> ABC Inc 6 7 9 >>>> XYZ Corp 2 5 6 6 >>>> PNG LLC 7 7 7 >>>> 7 7 >>>> >>>> >>>> I want the master file to include a column with scores as shown below. >>>> For example, ABC Inc gets a score of 7 in 1986, XYZ Corp gets a score >>>> of 5 in 1994. For observations with two or more names, I want a simple >>>> average of scores. For example, for ABC Inc XYZ Corp, the score will >>>> be (9+6)/2=7.5. If a company has a year in the master file, then it >>>> definitely has a score for that year (i.e. time period) in the second >>>> file. >>>> >>>> name year score >>>> ABC Inc 1986 7 >>>> XYZ Corp 1994 5 >>>> ABC Inc XYZ Corp 2001 7.5 >>>> PNG LLC 2005 7 >>>> XYZ Corp PNG LLC 2007 6.5 >>>> >>>> I am using Stata 12. I expect I would need the -regexm()- command to >>>> split the company names, then -reshape- to get all company names one >>>> below the other and then -merge-. Since there are 1100 companies, I >>>> would need some kind of a loop to use the -regexm() command. I am >>>> having trouble writing the code. >>>> >>> >>> * >>> * For searches and help try: >>> * http://www.stata.com/help.cgi?search >>> * http://www.stata.com/support/statalist/faq >>> * http://www.ats.ucla.edu/stat/stata/ >>> >> >> * >> * For searches and help try: >> * http://www.stata.com/help.cgi?search >> * http://www.stata.com/support/statalist/faq >> * http://www.ats.ucla.edu/stat/stata/ >> > * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:**Re: st: Merging files based on name and year***From:*Joseph Monte <hmjc66@gmail.com>

**References**:**Re: st: Merging files based on name and year***From:*Joseph Monte <hmjc66@gmail.com>

**Re: st: Merging files based on name and year***From:*Nick Cox <njcoxstata@gmail.com>

- Prev by Date:
**Re: st: creating a foreach loop for principal components** - Next by Date:
**Re: st: Effect size after bootstrapping** - Previous by thread:
**Re: st: Merging files based on name and year** - Next by thread:
**Re: st: Merging files based on name and year** - Index(es):