Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: String variable behaving badly

From	Anna Reimondos <[email protected]>
To	[email protected]
Subject	Re: st: String variable behaving badly
Date	Thu, 11 Oct 2012 21:56:50 +1100

Oh my goodness, thank you so much!

Anna

On Thu, Oct 11, 2012 at 9:48 PM, Nick Cox <[email protected]> wrote:
> In problems like this I suspect the presence of non-standard
> characters, say characters that look like spaces but aren't, or
> unprintable characters.
>
> I wrote -charlist- (SSC) as a utility which yields a listing of
> distinct characters present in a string variable.
>
> . charlist var1
>  CEGaeglnovwy
>
> . return list
>
> macros:
>               r(chars) : " CEGaeglnovwy "
>            r(sepchars) : "  C E G a e g l n o v w y   "
>               r(ascii) : "32 67 69 71 97 101 103 108 110 111 118 119 121 160 "
>
> The tell-tale detail (better seen in the saved results) is the
> presence of char(160), a non-breaking space that looks just like a
> regular space but is a different character. Replace those characters
> with regular spaces
>
> . replace var1 = subinstr(var1, char(160), " ", .)
>
> and the multiple personae of Ms Cawley collapse into one.
>
> (I'd score very low on a sports quiz but I do remember her as a tennis player.)
>
> Nick
>
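A minimal sketch of the diagnosis and fix described above, on made-up
data rather than the file from the original post. The variable has160
and the two input rows are hypothetical, and the sketch assumes the
byte-based strings of Stata 12, as used in this thread. In later
releases with Unicode strings, a non-breaking space may be stored
differently (for example as uchar(160)), so check before relying on
char(160).

```stata
* Hypothetical data: two values that look identical on screen
clear
input str30 var1
"Evonne Goolagong Cawley"
"Evonne Goolagong Cawley"
end

* Plant char(160) in place of the first space in observation 1
replace var1 = subinstr(var1, " ", char(160), 1) in 1

* The two values are tabulated as distinct
tabulate var1

* Flag observations that contain char(160)
generate byte has160 = strpos(var1, char(160)) > 0
list var1 has160

* Replace char(160) with a regular space, then tidy spacing
replace var1 = subinstr(var1, char(160), " ", .)
replace var1 = itrim(trim(var1))

* The values now collapse into a single category
tabulate var1
```

Flagging before replacing shows which observations were affected, which
can help when other string variables may have the same problem.
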
> On Thu, Oct 11, 2012 at 11:29 AM, Anna Reimondos <[email protected]> wrote:
>> Dear Statalist
>>
>> I am currently cleaning a survey dataset with a variety of numeric as
>> well as string variables. I recently discovered some very odd
>> behaviour with one of the string variables.
>>
>> An extract of the data containing two variables (an ID variable and
>> the problematic string variable) is available here:
>>
>> http://wikisend.com/download/508418/stringdata.dta
>>
>> In the dataset are the 23 responses from people who answered a
>> question about who they believe is the most influential sports person
>> in Australia. All these 23 people answered the same thing 'Evonne
>> Goolagong Cawley' (a famous sports lady).
>>
>> The problem is that when I do a simple tab of the variable there are
>> two entries for Evonne Goolagong Cawley instead of just one. I don't
>> understand what is happening. In the dataset you can see that the
>> first 2 respondents are somehow being identified as having a different
>> answer to the rest of the people even though the spelling is exactly
>> the same. I have tried trimming the data, triple checking the spelling
>> and so on, but can't get to the bottom of this and it is driving me
>> up the wall.
>>
>> For reference, this issue affects other entries as well: responses
>> that look exactly the same to me are not recognised as identical.
>>
>> Any help would be much appreciated.
>>
>> I am using Stata 12.1.
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/
