Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down on April 23, and its replacement, **statalist.org** is already up and running.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
Joseph Monte <hmjc66@gmail.com> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: Merging files based on name and year |

Date |
Sun, 4 Sep 2011 09:11:30 +0100 |

Nick, Thanks for the help. It worked well. Unfortunately, I'm stuck at the final stage where I need to take the average score for each name conditional on a score being available for each name. Unfortunately, I do have missing observations. . input str1 name score name score 1. "A" 7 2. "A" 8 3. "A" . 4. "B" 6 5. "B" 7 6. "C" 5 7. end . egen meanscore = mean(score), by(name) . list +--------------------+ | name score meanscore | |--------------------| 1. | A 8 7.5 | 2. | A 7 7.5 | 3. | A . 7.5 | 4. | B 7 6.5 | 5. | B 6 6.5 | |--------------------| 6. | C 5 5 | +--------------------+ I want meanscore not to be calculated for A since there is a missing observation (i.e. there should be 3 blanks instead of 7.5). How do I get around this issue? I tried adapting the code from the link below but was not successful. http://www.stata.com/support/faqs/data/anyall.html Thanks, Joe On Thu, Aug 25, 2011 at 9:52 AM, Nick Cox <njcoxstata@gmail.com> wrote: > I don't think regex is the only approach here. You could consider > using -split-. > > A strategy here is to insert new parsing characters yourself. For > example suppose that ; is not used, which you can check by > > assert strpos(name, ";") == 0 > > Then put ; after each terminal element such as "Inc" (there's probably > jargon I don't know) > > clonevar work = name > replace work = subinstr(work, "Inc", "Inc;", .) > replace work = subinstr(work, "LLC", "LLC;", .) > replace work = subinstr(work, "Corp", "Corp;", .) > > and so on. > > You can get all the terminal elements from your file with just > individual names. It is the last word (word(,-1)) of the company name. > You can put that into a new variable and -tab- the results. > > You may need to fix exceptions, which will be shown by the tabulation above. > > Then -split- on ; and then -reshape-. > > Sometimes a very primitive approach like this is much quicker than > spending hours trying to do it a cleverer way. (If someone were > exceptionally fluent with regular expressions that wouldn't be true.) > > When regex works it can be a spectacular solution but with many messy > problems it is often a very long way round. > > Nick > > On Thu, Aug 25, 2011 at 9:24 AM, Joseph Monte <hmjc66@gmail.com> wrote: > >> The file below is the master file containing names of companies and >> years. ABC Inc is one company, XYZ Corp is another company, PNG LLC is >> a third company. I have a total of 1100 different companies. As shown >> below, sometimes two or more companies are listed in the same field >> (there are a maximum of 5 companies listed in the same field). The >> year column has only one year for each observation. There are a total >> of 800 observations in this file. >> >> name year >> ABC Inc 1986 >> XYZ Corp 1994 >> ABC Inc XYZ Corp 2001 >> PNG LLC 2005 >> XYZ Corp PNG LLC 2007 >> >> >> I have a second file with data in the following format. The 1100 >> companies are listed as shown below. YR8084 means the years 1980-1984, >> YR8591 means the years 1985-1991, and so on. The numbers below each >> year are scores assigned to each company during a certain period. For >> example, ABC Inc is assigned a score of 6 during 1980-1984, 7 from >> 1985-1991, and 9 from 2001-2004. ABC Inc is not assigned a score >> during other periods. Scores range from 1 to 9 and may be up to 3 >> decimal places. >> >> name YR8084 YR8591 YR9200 YR0104 YR0507 YR0809 >> ABC Inc 6 7 9 >> XYZ Corp 2 5 6 6 >> PNG LLC 7 7 7 >> 7 7 >> >> >> I want the master file to include a column with scores as shown below. >> For example, ABC Inc gets a score of 7 in 1986, XYZ Corp gets a score >> of 5 in 1994. For observations with two or more names, I want a simple >> average of scores. For example, for ABC Inc XYZ Corp, the score will >> be (9+6)/2=7.5. If a company has a year in the master file, then it >> definitely has a score for that year (i.e. time period) in the second >> file. >> >> name year score >> ABC Inc 1986 7 >> XYZ Corp 1994 5 >> ABC Inc XYZ Corp 2001 7.5 >> PNG LLC 2005 7 >> XYZ Corp PNG LLC 2007 6.5 >> >> I am using Stata 12. I expect I would need the -regexm()- command to >> split the company names, then -reshape- to get all company names one >> below the other and then -merge-. Since there are 1100 companies, I >> would need some kind of a loop to use the -regexm() command. I am >> having trouble writing the code. >> > > * > * For searches and help try: > * http://www.stata.com/help.cgi?search > * http://www.stata.com/support/statalist/faq > * http://www.ats.ucla.edu/stat/stata/ > * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:**Re: st: Merging files based on name and year***From:*Nick Cox <njcoxstata@gmail.com>

- Prev by Date:
**Re: st: Package -ghansen- now available in SSC** - Next by Date:
**Re: st: Merging files based on name and year** - Previous by thread:
**st: on converting dates in stata** - Next by thread:
**Re: st: Merging files based on name and year** - Index(es):