Statalist



RE: st: Ignore accents while sorting international characters


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   RE: st: Ignore accents while sorting international characters
Date   Thu, 19 Jun 2008 17:05:02 +0100

My position on this is very close to Austin's. 

Contrary to John-Paul's view, nothing really depends on StataCorp taking action. As Svend has shown and as Austin has indicated, you can write your own code to obtain the sort order you want. For a variety of very good reasons, that is better done by generating your own sort keys, not by any kind of hack on -sort- itself. 

For this to be implemented by StataCorp, they would have to know exactly what order to implement in all sorts of (human) languages. That's hardly a practical request. Doing it for most languages, or most "common" languages, is not good enough. It's easier and better that users can forge the tools to get precisely what they want. 

A detail not mentioned yet is that sorting implies that Stata records explicitly which variable or variables the data are sorted by. That record would be difficult to maintain if the sort key were one of the variables, but with accents ignored, which reinforces the point above about generating your own sort keys. 
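
A minimal sketch of that point, assuming a toy dataset and the variable names key and key2 borrowed from Svend's example further down. With an explicit sort key, what Stata records as the sort order is simply the variables named in -sort-, and the -sortedby- extended macro function reports them. The string literals assume the do-file uses the same encoding as the data.

```stata
* Toy data (illustrative only)
clear
input str10 key
"school"
"école"
"ecole"
end

* Explicit accent-free sort key; key stays as the tie-breaker
generate str10 key2 = subinstr(key, "é", "e", .)
sort key2 key

* Ask Stata which variables it has recorded as the sort order
local sorted : sortedby
display "`sorted'"
list key key2
```

Here the recorded sort variables should be key2 and key, both ordinary variables in the dataset, so nothing about the sort order is hidden from the user.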

Nick
[email protected] 

Austin Nichols

John LeBlanc <[email protected]> et al.:
I would make a stronger statement than John-Paul Ferguson--it's
probably impossible to do for the general case, as different fonts can
map characters that are a bit like another modulo a diacritical mark
to different codes.  If you can specify the mapping you want (between
characters and numeric codes) you can write a gsort2.ado that will
sort as you want, but you can also just generate a new variable that
will sort as you want, which is what a gsort2.ado would do, so there
is little to be gained.  If you want to see how Stata will sort your
string, type:

forv i=32/255 {
 di char(`i') _c
}

and note that capital letters get sorted before lower-case, which come
before all characters with diacritical marks. So you can predict how
this will come out:

clear
input str2 a
 ok
 Ok
 no
 zz
 ök
end
sort a
li

Also note different folks might want different orderings, even if
numeric codes were perfectly stable, e.g. consider ö in Swedish or
German:
http://en.wikipedia.org/wiki/Swedish_alphabet
http://en.wikipedia.org/wiki/German_alphabet#Sorting
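
To illustrate how two users might want two orderings from the same data, here is a rough sketch taking ö as the example. The variable names key_de, key_sv, and rank_de are made up for illustration. It assumes one German convention (ö sorted like o) and the Swedish convention (ö as a separate letter after z); for the Swedish case it relies on the byte code of ö being higher than that of z, which happens to give the wanted result for ö but would not handle every Swedish letter correctly.

```stata
* Toy data (illustrative only)
clear
input str4 a
"öl"
"zz"
"ol"
end

* German-style key: treat ö like o, with a as the tie-breaker
generate str4 key_de = subinstr(a, "ö", "o", .)
sort key_de a
generate rank_de = _n
list a rank_de

* Swedish-style key: leave ö alone so that it sorts after z
generate str4 key_sv = a
sort key_sv
list a rank_de
```

The same three strings come out in a different order under each key, which is why a single built-in "ignore accents" rule could not suit everyone.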

On Wed, Jun 18, 2008 at 10:13 PM, John-Paul Ferguson <[email protected]> wrote:
> Looking at the source for gsort reveals that it's mostly engaged in macro
> manipulation with an occasional call to sort to do the basic work. Since
> sort itself is a built-in command, it would almost HAVE to be Stata that
> made any such modification.
>
> Quoting John LeBlanc <[email protected]>:
>
>> Thanks; I was hoping that Stata had a built-in option to ignore accents.
>> Some software with sort routines can give characters with diacritical
>> marks the same value as their unaccented counterparts. Is this not an
>> issue for non-English Stata users? Is there sufficient desire to justify
>> asking StataCorp for this feature, e.g., as an option to gsort?
>>
>> On Wed, 18 Jun 2008 12:53:14 +0200, Svend Juul wrote:
>>
>> John LeBlanc wrote:
>>
>> How does one ignore accents while sorting international characters?
>>
>> sort & gsort deliver this:
>>
>> ecole
>> school
>> école
>>
>> What I'd like is this:
>> ecole
>> école
>> school
>>
>> ============================================================
>>
>> I believe that you must generate a second variable with no accents
>> to get it right:
>>
>>   gen str10 key2=key
>>   replace key2 = subinstr(key2,"é","e",.)
>>   replace key2 = subinstr(key2,"ô","o",.)
>>   ...
>>   sort key2 key
>>
>> I included key as a secondary sort key to make é come after e.
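
Svend's approach extends to several characters with a loop over a mapping. The sketch below is only an illustration: the toy data and the particular list of accented characters in the locals from and to are assumptions, and you would substitute whatever mapping suits your language. The string literals assume the do-file uses the same encoding as the data.

```stata
* Toy data (illustrative only)
clear
input str10 key
"ecole"
"school"
"école"
"hôtel"
"hotel"
end

generate str10 key2 = key

* Hypothetical mapping: accented characters and their plain replacements,
* paired by position in the two lists
local from "é è ê ô"
local to   "e e e o"
local n : word count `from'
forvalues i = 1/`n' {
    local f : word `i' of `from'
    local t : word `i' of `to'
    replace key2 = subinstr(key2, "`f'", "`t'", .)
}

* key as secondary sort key puts each accented form after its plain form
sort key2 key
list key key2
```

Keeping the mapping in two locals means the ordering rule lives in one visible place and can be changed without touching the rest of the code.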
