Statalist The Stata Listserver



RE: st: Combine uppercase and lowercase text


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: Combine uppercase and lowercase text
Date   Thu, 22 Feb 2007 11:01:55 -0000

Sebastian's approach is mine too, but it can be 
done a little more directly. 

We agree that text is similar if the lower cased
version is identical. 

gen lowercase = lower(text) 

The frequency of each distinct spelling is 
calculated by 

bysort lowercase text : gen freq = _N 

The most frequent version has the highest 
value of -freq-. If you -sort- within 
values of -lowercase- by -freq-, then the 
most frequent value of -text- is at the end. 

bysort lowercase (freq text) : gen mostfrequent = text[_N] 

Here I am rather arbitrarily splitting ties: among equally 
frequent spellings, the one that sorts last on -text- is chosen. 
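
The steps above can be put together on the example data from this thread. What follows is only a sketch: the variable names order, firstobs, negfirst and firstseen are made up for illustration and appear nowhere in the thread, and the second half shows just one possible tie-breaking rule (prefer the spelling that appears first in the data). Stata sorts strings in byte order, so uppercase letters sort before lowercase ones; with the original command the tie in the "some other text" group should therefore go to the all-lowercase spelling.

```stata
clear
input str15 text
"some text"
"Some Text"
"SOME TEXT"
"some other text"
"some other text"
"Some other text"
"Some other text"
"SoMe TeXt"
"SoMe TeXt"
"Some Other Text"
end

* remember the original order (hypothetical helper variable)
gen long order = _n

* group spellings that agree once case is ignored
gen lowercase = lower(text)

* frequency of each distinct spelling
bysort lowercase text : gen freq = _N

* most frequent spelling; ties go to the spelling sorting last
bysort lowercase (freq text) : gen mostfrequent = text[_N]

* alternative tie rule (an assumption, not from the thread):
* among equally frequent spellings, prefer the one seen first
bysort lowercase text (order) : gen long firstobs = order[1]
gen long negfirst = -firstobs
bysort lowercase (freq negfirst) : gen firstseen = text[_N]

sort order
list text freq mostfrequent firstseen
```

If the result looks right, -replace text = mostfrequent- would then overwrite the original variable, which is what the original question asked for.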

Nick 
n.j.cox@durham.ac.uk 

Sebastian F. Büchte
 
> My idea would be to first group text entries while ignoring the
> capitalization, then count the occurrences within these groups of each
> entry with respect to capitalization, and finally sort within each
> group by occurrence count and create a new variable which holds the
> most common spelling. In case of a tie it's somewhat random which
> spelling will be chosen; it would be up to you to introduce some
> further sort criterion.
> 
> My Stata solution would look like the following:
> 
> clear
> gen str15 text = ""
> input
>  "some text"
>  "Some Text"
>  "SOME TEXT"
>  "some other text"
>  "some other text"
>  "Some other text"
>  "Some other text"
>  "SoMe TeXt"
>  "SoMe TeXt"
>  "Some Other Text"
> end
> tempvar lotext
> tempvar textgrp
> tempvar comspelling
> 
> gen `lotext'=lower(text)
> bys `lotext': gen `textgrp'=1 if _n==1
> replace `textgrp'=sum(`textgrp')
> 
> bys `lotext' text: gen `comspelling'=_N
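> * sort by count within each lowercase group, so that text[_N]
> * is the most common spelling for the whole group
> bys `lotext' (`comspelling'): gen newtext=text[_N]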
> 
> I bet there are more elegant ways out in the wild and I am
> looking forward to learning about them.
> 
> Regards
> Sebastian
> 
> 
> On 2/22/07, Friedrich Huebler <huebler@rocketmail.com> wrote:
> > My data has string variables with text in uppercase or lowercase
> > letters. I would like to replace observations that are identical once
> > capitalization is ignored (e.g., "TEXT" and "text") by the most
> > common spelling. In some cases there are ties. So far I have only
> > managed to replace all such observations by their lowercase variant,
> > as in the example below. I am stumped and would appreciate any advice
> > on how I should proceed. I use Stata 8.2.
> >
> > Friedrich Huebler
> >
> > clear
> > gen str15 text = ""
> > input
> >  "some text"
> >  "Some Text"
> >  "SOME TEXT"
> >  "some other text"
> >  "some other text"
> >  "Some other text"
> >  "Some other text"
> >  "SoMe TeXt"
> >  "SoMe TeXt"
> >  "Some Other Text"
> > end
> > count
> > local n = r(N)
> > forvalues i = 1/`n' {
> >  local t = lower(text[`i'])
> >  replace text = "`t'" if lower(text) == "`t'"
> > }
> >
