Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Anna Reimondos <areimondos@gmail.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: String variable behaving badly |
Date | Thu, 11 Oct 2012 21:56:50 +1100 |
Oh my goodness thank you so much!

Anna

On Thu, Oct 11, 2012 at 9:48 PM, Nick Cox <njcoxstata@gmail.com> wrote:
> In problems like this I suspect the presence of non-standard
> characters, say characters that look like spaces but aren't, or
> unprintable characters.
>
> I wrote -charlist- (SSC) as a utility which yields a listing of
> distinct characters present in a string variable.
>
> . charlist var1
> CEGaeglnovwy
>
> . ret li
>
> macros:
> r(chars) : " CEGaeglnovwy "
> r(sepchars) : " C E G a e g l n o v w y "
> r(ascii) : "32 67 69 71 97 101 103 108 110 111 118 119 121 160 "
>
> The tell-tale detail (better seen in the saved results) is the
> presence of char(160). Replace those characters by regular spaces
>
> . replace var1 = subinstr(var1, char(160), " ", .)
>
> and the multiple personae of Ms Cawley collapse into one.
>
> (I'd score very low on a sports quiz but I do remember her as a tennis player.)
>
> Nick
>
> On Thu, Oct 11, 2012 at 11:29 AM, Anna Reimondos <areimondos@gmail.com> wrote:
>> Dear Statalist
>>
>> I am currently cleaning a survey dataset with a variety of numeric as
>> well as string variables. I recently discovered some very odd
>> behaviour with one of the string variables.
>>
>> An extract of the data containing two variables (an ID variable and
>> the problematic string variable) is available here:
>>
>> http://wikisend.com/download/508418/stringdata.dta
>>
>> In the dataset are the 23 responses from people who answered a
>> question about who they believe is the most influential sports person
>> in Australia. All these 23 people answered the same thing 'Evonne
>> Goolagong Cawley' (a famous sports lady).
>>
>> The problem is that when I do a simple tab of the variable there are
>> two entries for Evonne Goolagong Cawley instead of just one. I don't
>> understand what is happening. In the dataset you can see that the
>> first 2 respondents are somehow being identified as having a different
>> answer to the rest of the people even though the spelling is exactly
>> the same. I have tried trimming the data, triple checking the spelling
>> and so on, but can't get to the bottom of this and it is driving me
>> up the wall.
>>
>> Just for reference this 'issue' is affecting other entries as well,
>> where what I think looks like exactly the same response is not
>> recognised as such.
>> Any help would be much appreciated.
>>
>> I am using Stata 12.1
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/
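
For anyone wanting to try the fix quoted above without the original file, here is a minimal sketch on made-up data. The toy dataset and the variable name has160 are my own assumptions and not part of the thread; only the subinstr() call comes from Nick's reply. It puts char(160) into two of four otherwise identical values, flags them, applies the replacement, and tabulates again. Note that this matches the Stata 12.1 setting of the question; in Stata 14 or later, where strings are Unicode, the corresponding character would probably need uchar(160) instead, so treat that as something to check rather than a given.

```stata
* Toy data (hypothetical): four identical answers, the first two of which
* get char(160) in place of ordinary spaces
clear
set obs 4
generate str30 var1 = "Evonne Goolagong Cawley"
replace var1 = subinstr(var1, " ", char(160), .) in 1/2

* Before the fix, -tabulate- reports two look-alike categories
tabulate var1

* Flag observations that contain char(160) (has160 is an illustrative name)
generate byte has160 = strpos(var1, char(160)) > 0
tabulate has160

* The fix from the thread: replace every char(160) by a regular space
replace var1 = subinstr(var1, char(160), " ", .)

* After the fix, only one category should remain
tabulate var1
```

Flagging before replacing is optional, but it shows which observations were affected, which may help when the same problem touches other entries in the variable.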