Statalist



Re: st: data management issue (names listed differently)


From   David Bell <[email protected]>
To   "[email protected]" <[email protected]>
Subject   Re: st: data management issue (names listed differently)
Date   Wed, 2 Jul 2008 12:51:50 -0400

Rufus,

I like Eva's first advice. It isn't a matter of how many observations you have; it's a matter of how many spellings there are. The advantage of -inlist- is that when you discover a new spelling, you can correct it in your code, whereas if you use a separate file you have to open, change, and close the file before -merge-ing. In my own experience, I find that a do-file provides better documentation than a separate file.

Dave
====================================
David C. Bell
Professor of Sociology
Indiana University Purdue University Indianapolis (IUPUI)
(317) 278-1336
====================================




On Jul 2, 2008, at 12:09 PM, Rufus Peabody wrote:


Eva,

Much thanks for the advice.  I am still wondering how I can merge with
a variable that has a mixture of CorrectSpelling and WrongSpelling.
Cleaning it up manually is extremely time-consuming since there are
thousands of observations.

Thanks,
Rufus

On Jul 2, 2008, at 8:42 AM, Eva Poen wrote:

Rufus,

are there too many schools/spellings to do it manually (i.e. -replace
school = "USC" if inlist(school, "Southern Cal","SouthCal")- )?
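A minimal sketch of this do-file approach, assuming a string variable named school and using only the example spellings mentioned in this thread (the toy data are hypothetical):

```stata
* Hypothetical toy data; the variable name "school" follows the thread
clear
input str20 school
"USC"
"Southern Cal"
"SouthCal"
"Miami(FL)"
"Miami Florida"
end

* Map each known variant onto one canonical spelling
replace school = "USC" if inlist(school, "Southern Cal", "SouthCal")
replace school = "Miami(FL)" if school == "Miami Florida"

* Check which distinct spellings remain
tabulate school
```

When a new variant turns up, it is added to the relevant -inlist()- call and the do-file is rerun. Note that -inlist()- accepts only a limited number of string arguments, so a school with many variants may need more than one -replace- line.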

In any case, I would recommend that you clean up your school variable
to make your task as easy as possible. That includes stripping
leading/trailing blanks using -trim()-, and converting everything to
lower case (-lower()-). -itrim()- will reduce multiple, consecutive
internal blanks to one for you. All of this will help in reducing the
number of replacements you have to do.
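The cleanup described above might look like the following sketch. The data are hypothetical, and the function names are the ones used in this message; newer Stata releases document the same functions as -strtrim()-, -strlower()- and -stritrim()-.

```stata
* Hypothetical toy data with stray blanks and mixed case
clear
input str20 school
"USC "
"  usc"
"Southern   Cal"
"southern cal"
end

* Strip leading/trailing blanks, collapse runs of internal blanks,
* then convert everything to lower case
replace school = lower(itrim(trim(school)))

* Inspect the distinct spellings that are left to reconcile
tabulate school
```

After this step, spellings that differed only in spacing or capitalisation collapse into a single value, so only genuinely different names (e.g. "usc" versus "southern cal") still need a replacement rule.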

As a general strategy, you could compile a list (or data set) of all
the spellings you have, after cleaning up. If you go for a data set,
it could have two variables, CorrectSpelling and WrongSpelling. It
should then be possible to use -merge- to add the correct spelling to
data sets where the wrong spelling is present. For this to work you
need to make sure that there are no ambiguous wrong spellings, i.e.
abbreviations that may relate to more than one school.
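One way to sketch this crosswalk strategy is shown below. The data are hypothetical, and the -merge m:1- syntax is an assumption that requires Stata 11 or later (older releases use a different -merge- syntax and need sorted data). Listing each correct spelling in the WrongSpelling column as well, mapped to itself, lets a variable that mixes correct and wrong spellings be merged in a single pass.

```stata
* Crosswalk: one row per known spelling (hypothetical values)
clear
input str20 WrongSpelling str20 CorrectSpelling
"southern cal"  "usc"
"southcal"      "usc"
"usc"           "usc"
"miami florida" "miami(fl)"
"miami(fl)"     "miami(fl)"
end

* Stops with an error if any spelling is listed twice (an ambiguity)
isid WrongSpelling
rename WrongSpelling school
tempfile crosswalk
save `crosswalk'

* Data to be cleaned (hypothetical values)
clear
input str20 school
"USC "
"Southern Cal"
"Miami Florida"
"Ohio State"
end
replace school = lower(itrim(trim(school)))

* Many observations per spelling in the data, one row per spelling in the crosswalk
merge m:1 school using `crosswalk', keep(master match)

* Rows with _merge == 1 are spellings not yet in the crosswalk
list school if _merge == 1

replace school = CorrectSpelling if _merge == 3
drop CorrectSpelling _merge
```

In practice the crosswalk would be saved as a permanent file rather than a tempfile, so that it can be reused with future data sets; the unmatched list shows which spellings still need to be added to it.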

Hope this helps,
Eva




2008/7/2 Rufus Peabody <[email protected]>:
Hey all,

I'm working with a dataset that contains a few variables containing the names
of different college football teams. The problem is, they are not spelled
consistently (e.g. Miami(FL) and Miami Florida; USC and Southern Cal). In
many cases the spelling differs only in that there is an extra space after
the school name for some. What I'd like to do (and I'm pretty sure it is
possible) is create a master file with all the school names and possible
spellings, which I can then somehow merge with my original dataset (and any
future datasets with these teams) to create a consistent spelling. How do I
go about doing this? Specifically, if I have, say, three variables containing
spelling 1, spelling 2, and spelling 3 of a school, and I want to use
spelling 1 in another dataset, how can I merge with a variable that has some
schools with spelling 1 and others with spelling 2 or 3?

Thanks a lot,
Rufus
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/



