[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: data management issue (names listed differently)

From   Jeph Herrin <>
Subject   Re: st: data management issue (names listed differently)
Date   Wed, 02 Jul 2008 17:30:14 -0400


I faced the problem recently with hospital data; we only had hospital
names, and needed to identify the hospitals uniquely for the analysis.
I also had the same problem once before with drug info. Here's what I
did both times, translated to your situation (if your variable with
teamnames is -teamname-); it may work for you.

1. create a list of all the text names

   . u mydatafile, clear
   . bys teamname: keep if _n==1
   . outsheet teamname using teamnames.csv, replace

2. read this file teamnames.csv into Excel and add a
   second column -code-. in the second column, give the
   same number to every row that represents the same team.
   this is slightly time consuming, but if they are sorted
   on -teamname- it will just take a few minutes. one
   advantage here is you can always recheck your codings.

3. export this from excel as codes.csv; read it into Stata
   and sort on teamname

   . insheet codes.csv, clear
   . compress
   . sort teamname
   . save codes, replace

4. Now you have a file (codes.dta) which has a list of the team names
   in all their variety, and you can merge this into your original
   file to assign a common code to every team that is the same

   . u mydatafile, clear
   . sort teamname
   . merge teamname using codes

5. Now you can identify the teams by their codes; if you want you can
   assign a common name

   . bys code: replace teamname=teamname[_n-1] if _n>1

hope this helps,

Rufus Peabody wrote:
Hey all,

I'm working with a dataset that contains a few variable containing the name of different college football teams. The problem is, they are not spelled consistently (i.e. Miami(FL) and Miami Florida; USC and Southern Cal). In many cases the spelling differs only in that there is an extra space after the school name for some. What I'd like to do (and I'm pretty sure is possible) is create a master file with all the school names and possible spellings, which I can then somehow merge with my original dataset (and any future datasets with these teams) to create a consistent spelling. How do I go about doing this? Specifically, if I have, say three variables containing spelling 1, spelling 2, and spelling 3 of a school, and I want to use spelling 1 in another dataset, how can I merge with a variable that has some schools with spellling 1 and others with spelling 2 or 3?

Thanks a lot,
* For searches and help try:

*   For searches and help try:

© Copyright 1996–2017 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index