Statalist The Stata Listserver



st: Re: Combine uppercase and lowercase text


From   "Sergiy Radyakin" <Radyakin@aoek.uni-hannover.de>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: Re: Combine uppercase and lowercase text
Date   Thu, 22 Feb 2007 11:48:35 +0100

Hi,

if you want to replace whole observations by a common spelling, it takes a couple of minutes to implement (things are a bit tougher if you have to go word-by-word within every observation).

The general scheme is:

1. Backup your data

2. Keep only this variable of interest

3. Create a frequency list. Use -contract- with the option -freq(n)- to create a new variable n, which shows how often each spelling occurs. E.g.:
contract text,freq(n)

     +------+
     | text |
     |------|
  1. | Text |
  2. | Text |
  3. | text |
  4. | text |
  5. | TEXT |
     +------+

. contract text, freq(n)

     +----------+
     | text   n |
     |----------|
  1. | TEXT   1 |
  2. | Text   2 |
  3. | text   2 |
     +----------+

4. Now you have a frequency list. Create a new variable with the lowercase spelling:
gen text_low=lower(text)
This variable identifies a dictionary-entry group: all spellings of the same word (in your case it can be several words) share one value of text_low.

5. Sort by group and frequency:
sort text_low n

     +---------------------+
     | text   n   text_low |
     |---------------------|
  1. | TEXT   1       text |
  2. | Text   2       text |
  3. | text   2       text |
     +---------------------+


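6. Now you can go group by group and assign a representative spelling to each group (in my listing there is only one group, "text"). After the sort, the most frequent spelling is the last observation in its group. In a tie, as between "Text" and "text" here, which spelling comes last is arbitrary unless you add a tie-breaker to the sort in step 5, e.g. -sort text_low n text-.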
. by text_low : gen spelling=text[_N]

. l

     +--------------------------------+
     | text   n   text_low   spelling |
     |--------------------------------|
  1. | TEXT   1       text       text |
  2. | Text   2       text       text |
  3. | text   2       text       text |
     +--------------------------------+

7. It just happens that spelling equals text_low here; that need not always be the case.
Drop n and text_low.

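8. Now you have a dictionary which translates "TEXT", "Text" and "text" into "text".

9. Sort this data by text and save.

10. Get your original data back.

11. Merge the two datasets by the text variable. Each observation now carries its common spelling in the variable spelling, which you can copy into text.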

12. Done
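
Steps 1-12 could be sketched in a do-file roughly as follows, using the example data from the original question. This is only a sketch under assumptions: it uses -preserve-/-restore- and a temporary file (the name dict is hypothetical) in place of a manual backup, it adds text as a tie-breaker to the sort, and it relies on -merge m:1-, which needs Stata 11 or later. In Stata 8.2 the older -merge text using ...- syntax on sorted data would be needed instead.

```stata
* Sketch of steps 1-12; assumes Stata 11 or later for -merge m:1-
clear
input str15 text
"some text"
"Some Text"
"SOME TEXT"
"some other text"
"some other text"
"Some other text"
"Some other text"
"SoMe TeXt"
"SoMe TeXt"
"Some Other Text"
end

preserve                                    // step 1: keep the original data safe
keep text                                   // step 2: only the variable of interest
contract text, freq(n)                      // step 3: one row per distinct spelling
generate text_low = lower(text)             // step 4: key that groups the spellings
sort text_low n text                        // step 5: text added as a tie-breaker
by text_low: generate spelling = text[_N]   // step 6: most frequent spelling is last
drop n text_low                             // step 7
tempfile dict                               // hypothetical name for the dictionary
save `dict'                                 // steps 8-9: -merge m:1- needs no sort
restore                                     // step 10: original data back

merge m:1 text using `dict', nogenerate     // step 11
replace text = spelling                     // apply the common spelling
drop spelling
list
```

With the tie-breaker, a tie within a group goes to the spelling that sorts last, which puts lowercase letters after uppercase ones. Whether that is the right rule for ties is a judgment call.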


Things get a bit more complicated if you want to go word-by-word. Then you build a full list of all words with a double loop: an outer loop over observations and an inner loop over the words in each observation. You process this list as above to get a translation dictionary. You can't simply merge the two datasets anymore (unless you have a very limited dictionary, where you can create all possible "sentences" first). So you will have to run the double loop again (observation by observation, and word by word), looking up each word in the dictionary.
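
One possible way to do the word-by-word variant without explicit loops is sketched below. Note that this swaps in a different technique from the double loop described above: it splits each observation into words with -split-, reshapes to one row per word, and merges on the word. All variable names (id, pos, w, w_low, fixed) and the tempfile name dict are hypothetical, and it assumes Stata 11 or later and words separated by blanks.

```stata
* Word-by-word sketch for a do-file; assumes Stata 11+ (-merge m:1-).
clear
input str15 text
"some text"
"Some Text"
"SOME TEXT"
"Some other text"
end

generate long id = _n
split text, generate(w)            // w1, w2, ...: one word per variable
reshape long w, i(id) j(pos)       // one row per (observation, word position)
drop if w == ""                    // padding rows from shorter observations

preserve                           // build the word dictionary as in steps 2-9
contract w, freq(n)
generate w_low = lower(w)
sort w_low n w                     // w added as a tie-breaker
by w_low: generate spelling = w[_N]
keep w spelling
tempfile dict
save `dict'
restore

merge m:1 w using `dict', nogenerate
sort id pos
by id: generate fixed = spelling if _n == 1
by id: replace fixed = fixed[_n-1] + " " + spelling if _n > 1
by id: keep if _n == _N            // last row holds the full rebuilt string
keep id text fixed
list
```

Two side effects are worth knowing: runs of several blanks are collapsed to one when the words are glued back together, and an observation whose text is empty would be dropped by the -drop if- line.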

If the results are to be displayed to a human reader, it can be irritating to see Tokyo, new york, MOSCOW.
So even if these were the most common spellings in the original data, one would still prefer: Tokyo, New York, Moscow.
You might want to interface with Google or another online reference to try to guess the correct spelling (this will take an incredible amount of time for a large dataset). Alternatively, get a large local dictionary file and search there. A Google search turns up plenty. One that is easy to obtain is here: http://wordlist.sourceforge.net/

Best regards, Sergiy




----- Original Message -----
From: "Friedrich Huebler" <huebler@rocketmail.com>
To: <statalist@hsphsun2.harvard.edu>
Sent: Thursday, February 22, 2007 1:15 AM
Subject: st: Combine uppercase and lowercase text



My data has string variables with text in uppercase or lowercase
letters. I would like to replace observations that are identical once
capitalization is ignored (e.g., "TEXT" and "text") by the most
common spelling. In some cases there are ties. So far I have only
managed to replace all such observations by their lowercase variant,
as in the example below. I am stumped and would appreciate any advice
on how I should proceed. I use Stata 8.2.

Friedrich Huebler

clear
gen str15 text = ""
input
"some text"
"Some Text"
"SOME TEXT"
"some other text"
"some other text"
"Some other text"
"Some other text"
"SoMe TeXt"
"SoMe TeXt"
"Some Other Text"
end
count
local n = r(N)
forvalues i = 1/`n' {
    local t = lower(text[`i'])
    replace text = "`t'" if lower(text) == "`t'"
}







*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
*



© Copyright 1996–2014 StataCorp LP