Stata The Stata listserver

AW: st: Re: destringing values led to Stata recoding them as missing


From   "Christian Holz" <[email protected]>
To   <[email protected]>
Subject   AW: st: Re: destringing values led to Stata recoding them as missing
Date   Sat, 28 Aug 2004 12:26:58 +0200

Hi Suzy,

first, I think it is worth stressing that you should consider very carefully
the point Daniel raised recently: although it is technically not hard to
remove nonnumeric characters from your strings so that the destring command
can produce numbers (see below), you should be sure that this is what you
want.

You should think carefully about whether your data are really numeric in the
sense of an interval or ratio (or at least ordinal) level of measurement. You
can of course run, for example, a regression with an RHS variable in which
1002=diabetes, 1003=malaria and 1004=hernia or whatever, and Stata will give
you estimates for that regression, but those coefficients will not be
meaningfully interpretable.
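
If the codes are to be used as categories rather than as quantities, one option is to leave the numbers alone and let encode build a labelled categorical variable. The following is only a minimal sketch on the var1 values from this thread; the name var1_cat and the outcome y in the final comment are hypothetical, and the i. notation assumes Stata 11 or later.

```stata
* Rebuild var1 from the example in this message
clear
input str5 var1
"1235-"
"1233"
"38568"
"126"
end

* Map each distinct code to an integer 1, 2, 3, ... and attach the
* original code to it as a value label
encode var1, generate(var1_cat)

* Frequencies by code; the labels, not the integers, carry the meaning
tabulate var1_cat

* In a model the codes would enter as indicator (dummy) variables, e.g.
*   regress y i.var1_cat
* where y is a hypothetical outcome variable
```

The integers that encode assigns are arbitrary category numbers, so they should only be used through indicator variables, never as a continuous regressor.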
Those objections aside, you can use the following code to remove everything
that is not a digit from your string variables.

foreach varname of varlist xvar {
	local i 1
	* loop over observations
	while `i' <= _N {
		local digit 1
		local tempstring ""
		* copy only the characters "0" through "9" into tempstring
		while `digit' <= length(`varname'[`i']) {
			local s_digit = substr(`varname'[`i'], `digit', 1)
			if ("`s_digit'" >= "0" & "`s_digit'" <= "9") {
				local tempstring "`tempstring'`s_digit'"
			}
			local digit = `digit' + 1
		}
		quietly replace `varname' = "`tempstring'" in `i'
		local i = `i' + 1
	}
	* convert the cleaned string variable to numeric
	destring `varname', replace
}

Please note that you have to write the names of all the variables that you
want to convert to numeric in place of xvar in the opening foreach line.
Please note further that the code will replace all the values in your
original variables with numeric ones.
The program works as follows:

Original (from your message):
. d

Contains data
  obs:             4                          
 vars:             4                          
 size:           100 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
patient         float  %9.0g                  
var1            str5   %9s                    
var2            str6   %9s                    
var3            str6   %9s                    
-------------------------------------------------------------------------------
Sorted by:  
     Note:  dataset has changed since last saved

. l

     +-----------------------------------+
     | patient    var1     var2     var3 |
     |-----------------------------------|
  1. |    1001   1235-    V2347      456 |
  2. |    1002    1233   143135   E28950 |
  3. |    1003   38568   05476-    89076 |
  4. |    1004     126      333    v5678 |
     +-----------------------------------+


After running the code (which may take some time in your case, with 300k
observations), the data look as follows:

. d

Contains data
  obs:             4                          
 vars:             4                          
 size:            80 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
patient         float  %9.0g                  
var1            long   %10.0g                 
var2            long   %10.0g                 
var3            long   %10.0g                 
-------------------------------------------------------------------------------
Sorted by:  
     Note:  dataset has changed since last saved

. l

     +----------------------------------+
     | patient    var1     var2    var3 |
     |----------------------------------|
  1. |    1001    1235     2347     456 |
  2. |    1002    1233   143135   28950 |
  3. |    1003   38568     5476   89076 |
  4. |    1004     126      333    5678 |
     +----------------------------------+
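
A possibly faster route than looping over every observation is to let destring drop the unwanted characters itself through its ignore() option, writing the result to new variables with generate() so that the original strings are kept. This is only a sketch on the four example observations; the names var1_num, var2_num, var3_num and var1_clean are made up for illustration, and the last two commands assume Stata 14 or later, where ustrregexra() is available.

```stata
* Rebuild the four-observation example from this message
clear
input float patient str5 var1 str6 var2 str6 var3
1001 "1235-" "V2347"  "456"
1002 "1233"  "143135" "E28950"
1003 "38568" "05476-" "89076"
1004 "126"   "333"    "v5678"
end

* Characters to drop: the dash plus all upper- and lower-case letters
local junk "-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

* generate() creates new numeric variables and leaves the strings untouched
destring var1 var2 var3, generate(var1_num var2_num var3_num) ignore("`junk'")

* Alternative, assuming Stata 14 or later: strip every non-digit with a
* regular expression, then convert the cleaned copy
generate var1_clean = ustrregexra(var1, "[^0-9]", "")
destring var1_clean, replace

list
```

On these data the new numeric variables should match the listing above, while var1 to var3 stay available as strings in case the letters and dashes turn out to carry meaning.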


Best wishes
Christian.


-----Original Message-----
From: [email protected]
[mailto:[email protected]] On behalf of Suzy
Sent: 28 August 2004 05:26
To: [email protected]
Subject: Re: st: Re: destringing values led to Stata recoding them as
missing

Dear Daniel,
I used the destring option because I wasn't able to analyze the data as 
is - I would get error messages about not being able to analyze 
strings. These values are codes that represent disorders, so you are 
correct. But since I am a fairly new user of Stata, I just figured that 
it couldn't read those values because of the dashes or the alphanumeric 
characters, since the datapoints that were only numbers were read and 
analyzed with no problem.

Daniel Egan wrote:

>Hi Suzy,
>
>Just to be clear, are you sure you want to create numeric values? The usual
>reason for destringing a variable is that it IS a numeric variable that has
>typos which cause it to be regarded as text. Is this a continuous
>variable that does have a numeric (linear etc.) relationship? If each of
>these string variables represents different disorders, you should have a
>good methodological reason for making them numeric. Otherwise, keep them in
>an "apples and oranges" arrangement of strings, i.e. diabetes (1003) is not
>"one more than" malaria (1002)...
>
>In essence, if you want to use each of these variables as categoricals,
>they are fine as is - as strings. You will be able to analyze them as
>strings, in a categorical or dummy variable sense.
>
>
>I may be way off here, but just wanted to make sure you knew you could
>analyze them as is.....
>
>Apologies if I am being obvious.
>
>Dan
>
>----- Original Message ----- 
>From: "Suzy" <[email protected]>
>To: <[email protected]>
>Sent: Friday, August 27, 2004 5:44 PM
>Subject: st: destringing values led to Stata recoding them as missing
>
>
>| Dear Statalisters;
>|
>| I have seven variables of over 300,000 observations each. Within each
>| variable, I have over 2000 different values. These datapoints
>| represent specific codes - for example: (72200 = intervertebral disc
>| disorder). Within each of these seven variables, there are datapoints
>| (values) with dashes or letters (e.g. 4109- or V2389). The majority
>| of the values, though, are purely numeric (23405). I used the destring
>| option so that I could analyze the data, and Stata treated all those
>| datapoints that included dashes and letters as missing. Now there is a
>| period . where there used to be a value. I have two questions:
>|
>| 1. Will the restring option restore the datapoints?
>|
>| 2. How can I successfully "destring" these values so that I can include
>| them in my analysis?
>|
>| Any help and/or specific code would be very helpful as I am only
>| marginally competent with Stata basics.
>|
>| Thank you!
>| Suzy
>|
>|
>| *
>| *   For searches and help try:
>| *   http://www.stata.com/support/faqs/res/findit.html
>| *   http://www.stata.com/support/statalist/faq
>| *   http://www.ats.ucla.edu/stat/stata/
>|