Statalist


Re: st: Ignore accents while sorting international characters


From   "Austin Nichols" <austinnichols@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Ignore accents while sorting international characters
Date   Thu, 19 Jun 2008 11:37:05 -0400

John LeBlanc <leblancj@dal.ca> et al.:
I would make a stronger statement than John-Paul Ferguson: it's
probably impossible to do for the general case, as different fonts can
assign different numeric codes to characters that differ from one
another only by a diacritical mark.  If you can specify the mapping you
want (between
characters and numeric codes) you can write a gsort2.ado that will
sort as you want, but you can also just generate a new variable that
will sort as you want, which is what a gsort2.ado would do, so there
is little to be gained.  If you want to see how Stata will sort your
string, type:

forv i=32/255 {
 di char(`i') _c
}

and note that capital letters get sorted before lower-case, which come
before all characters with diacritical marks. So you can predict how
this will come out:

clear
input str2 a
 ok
 Ok
 no
 zz
 ók
end
sort a
li

Also note different folks might want different orderings, even if
numeric codes were perfectly stable, e.g. consider ö in Swedish or
German:
http://en.wikipedia.org/wiki/Swedish_alphabet
http://en.wikipedia.org/wiki/German_alphabet#Sorting
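
A minimal sketch of the "generate a new variable" approach, assuming a
string variable named key as in Svend Juul's example quoted below. The
list of accented characters and their replacements is only an
illustrative assumption; it would need to be extended to whatever
characters actually occur in the data and to the ordering wanted:

```stata
* key2 is a hypothetical sort key: a copy of key with accents stripped
gen key2 = key
* illustrative mapping only; word j of from is replaced by word j of to
local from "é è ê ô ö"
local to   "e e e o o"
local n : word count `from'
forv j = 1/`n' {
 local f : word `j' of `from'
 local t : word `j' of `to'
 qui replace key2 = subinstr(key2, "`f'", "`t'", .)
}
* key as a secondary sort key, so that e.g. ecole comes before école
sort key2 key
li key
```

Because the mapping is written out explicitly, someone who wants a
different ordering for a given character can edit the two lists rather
than the sort itself.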

On Wed, Jun 18, 2008 at 10:13 PM, John-Paul Ferguson <jpferg@mit.edu> wrote:
> Looking at the source for gsort reveals that it's mostly engaged in macro
> manipulation with an occasional call to sort to do the basic work. Since
> sort itself is a built-in command, it would almost HAVE to be Stata that
> made any such modification.
>
> John-Paul Ferguson
>
> Quoting John LeBlanc <leblancj@dal.ca>:
>
>> Thanks; I was hoping that Stata had a built-in option to ignore accents.
>> Some software with sort routines has the ability to give characters with
>> diacritical marks the same value as their unaccented equivalents. Is this
>> not an issue for non-English Stata users? Is there sufficient desire to
>> justify asking Stata for this feature, e.g., as an option to gsort?
>>
>>
>> John
>>
>> On Wed, 18 Jun 2008 12:53:14 +0200, Svend Juul wrote:
>>
>> John LeBlanc wrote:
>>
>> How does one ignore accents while sorting international characters?
>>
>> sort & gsort deliver this:
>>
>> ecole
>> school
>> école
>>
>> What I'd like is this:
>> ecole
>> école
>> school
>>
>> ============================================================
>>
>> I believe that you must generate a second variable with no accents
>> to get it right:
>>
>>   gen str10 key2=key
>>   replace key2 = subinstr(key2,"é","e",.)
>>   replace key2 = subinstr(key2,"ô","o",.)
>>   ...
>>   sort key2 key
>>
>> I included key as a secondary sort key to make é come after e.
>>
>> Hope this helps
>> Svend
>>
>>
