Statalist The Stata Listserver


Re: st: Combine uppercase and lowercase text


From   "Sebastian F. Büchte" <sfbuechte@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Combine uppercase and lowercase text
Date   Thu, 22 Feb 2007 08:27:10 +0100

Friedrich,

my idea would be to first group the text entries while ignoring
capitalization, then count how often each exact spelling occurs
within its group, and finally sort within each group by that
occurrence count and create a new variable that holds the most
common spelling. In case of a tie, which spelling gets chosen is
somewhat random; it would be up to you to introduce a further sort
criterion.

My Stata solution would look like the following:

clear
gen str15 text = ""
input
"some text"
"Some Text"
"SOME TEXT"
"some other text"
"some other text"
"Some other text"
"Some other text"
"SoMe TeXt"
"SoMe TeXt"
"Some Other Text"
end
tempvar lotext
tempvar textgrp
tempvar comspelling

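* lowercase copy used to group entries regardless of capitalization
gen `lotext'=lower(text)
* running group number (for reference; not needed below)
bys `lotext': gen `textgrp'=1 if _n==1
replace `textgrp'=sum(`textgrp')

* number of occurrences of each exact spelling
bys `lotext' text: gen `comspelling'=_N
* within each group, sort by that count and take the last (most common) spelling
bys `lotext' (`comspelling'): gen newtext=text[_N]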
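
The tie-break mentioned above could be made reproducible by adding a further sort key. What follows is only a minimal sketch on the example data, not a definitive solution; the variable names lowtext, freq and best are illustrative, and the choice to let ties go to the spelling that sorts last is an assumption, not something Friedrich asked for:

```stata
* Sketch only: lowtext, freq and best are illustrative names
clear
input str15 text
"some text"
"Some Text"
"SOME TEXT"
"some other text"
"some other text"
"Some other text"
"Some other text"
"SoMe TeXt"
"SoMe TeXt"
"Some Other Text"
end

* key that ignores capitalization
gen str15 lowtext = lower(text)

* number of times each exact spelling occurs
bysort lowtext text: gen freq = _N

* within each key, sort by frequency and then by spelling, so that
* the last observation holds the most common spelling; among equally
* common spellings the one that sorts last is used (assumed rule)
bysort lowtext (freq text): gen str15 best = text[_N]

replace text = best
drop lowtext freq best
list
```

Because the variables in parentheses only determine the sort order within each by-group, every observation in a group receives the same spelling, and rerunning the code gives the same result even when there are ties.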

I bet there are more elegant ways out in the wild, and I am
looking forward to learning about them.

Regards
Sebastian


On 2/22/07, Friedrich Huebler <huebler@rocketmail.com> wrote:
My data has string variables with text in uppercase or lowercase
letters. I would like to replace observations that are identical once
capitalization is ignored (e.g., "TEXT" and "text") by the most
common spelling. In some cases there are ties. So far I have only
managed to replace all such observations by their lowercase variant,
as in the example below. I am stumped and would appreciate any advice
on how I should proceed. I use Stata 8.2.

Friedrich Huebler

clear
gen str15 text = ""
input
 "some text"
 "Some Text"
 "SOME TEXT"
 "some other text"
 "some other text"
 "Some other text"
 "Some other text"
 "SoMe TeXt"
 "SoMe TeXt"
 "Some Other Text"
end
count
local n = r(N)
forvalues i = 1/`n' {
 local t = lower(text[`i'])
 replace text = "`t'" if lower(text) == "`t'"
}







*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



