Re: st: String variable behaving badly


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: String variable behaving badly
Date   Thu, 11 Oct 2012 11:48:51 +0100

In problems like this I suspect the presence of non-standard
characters, say characters that look like spaces but aren't, or
unprintable characters.

I wrote -charlist- (SSC) as a utility which yields a listing of
distinct characters present in a string variable.

. charlist var1
 CEGaeglnovwy

. ret li

macros:
              r(chars) : " CEGaeglnovwy "
           r(sepchars) : "  C E G a e g l n o v w y   "
              r(ascii) : "32 67 69 71 97 101 103 108 110 111 118 119 121 160 "

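The tell-tale detail (better seen in the saved results) is the
presence of char(160), the last code listed in r(ascii). It displays
like a space but is not the same character as a regular space,
char(32), so values containing it are treated as distinct strings.
Replace those characters by regular spaces: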

.  replace var1 = subinstr(var1, char(160), " ", .)

and the multiple personae of Ms Cawley collapse into one.
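
For anyone wanting to check the fix rather than trust it, the steps
above can be sketched as a short do-file. This is only a sketch: var1
is the variable name used in the example above, the extra tidying with
trim() and itrim() is my assumption about what is wanted and not part
of the fix itself, and it is safest to run it on a copy of the data.

```stata
* Sketch only: var1 is the variable name from the example above.
* -charlist- is user-written; install it once from SSC.
ssc install charlist

* How many observations contain char(160)?
count if strpos(var1, char(160)) > 0

* Replace every char(160) with a regular space.
replace var1 = subinstr(var1, char(160), " ", .)

* Optional tidying (an assumption, not part of the fix above):
* strip leading/trailing spaces and collapse runs of internal spaces.
replace var1 = itrim(trim(var1))

* Stop with an error if any char(160) survives, then check again.
assert strpos(var1, char(160)) == 0
charlist var1
tabulate var1
```

If the replacement worked, 160 should no longer appear in r(ascii) and
-tabulate- should show a single entry where there were two. The same
pattern applies to any other unwanted character that -charlist-
reveals: substitute its code for 160.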

(I'd score very low on a sports quiz but I do remember her as a tennis player.)

Nick

On Thu, Oct 11, 2012 at 11:29 AM, Anna Reimondos <areimondos@gmail.com> wrote:
> Dear Statalist
>
> I am currently cleaning a survey dataset with a variety of numeric as
> well as string variables. I recently discovered some very odd
> behaviour with one of the string variables.
>
> An extract of the data containing two variables (an ID variable and
> the problematic string variable) is available here:
>
> http://wikisend.com/download/508418/stringdata.dta
>
> In the dataset are the 23 responses from people who answered a
> question about who they believe is the most influential sports person
> in Australia. All these 23 people answered the same thing 'Evonne
> Goolagong Cawley' (a famous sports lady).
>
> The problem is that when I do a simple tab of the variable there are
> two entries for Evonne Goolagong Cawley instead of just one. I don't
> understand what is happening. In the dataset you can see that the
> first 2 respondents are somehow being identified as having a different
> answer to the rest of the people even though the spelling is exactly
> the same. I have tried trimming the data, triple checking the spelling
>  and so on, but can't get to the bottom of this and it is driving me
> up the wall.
>
> Just for reference this 'issue' is affecting other entries as well,
> where what I think looks like exactly the same response is not
> recognised as such.
>  Any help would be much appreciated.
>
> I am using Stata 12.1
