Stata The Stata listserver
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

st: RE: Date Data/Asserts


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: Date Data/Asserts
Date   Sat, 27 Mar 2004 15:54:26 -0000

You identify the problems very clearly, and you 
are aware of some useful tools here. However, 
a variable so messy is not going to surrender
faced with just a single weapon of mess dissection. 

Here's one way of proceeding. 

I copied your data and argued as follows. 

1. The value is most commonly the first word
of the string, but we want it as a number. 

. gen fvalue = real(word(ferritin,1)) 
(140 missing values generated)

2. The date is most commonly the second word
of the string, interpreted as a (recent) daily date. 

. gen fdate = date(word(ferritin,2),"mdy",2050)
(162 missing values generated)

3. I see lots of values with "U"s. I want to check
that when "U" occurs it occurs only as itself. 

. assert ferritin == "U" if index(ferritin, "U") 

Here -assert- says nothing. No news is good news, 
i.e. Stata assents to the assertion. (In this 
case, all "U"s are bad news, i.e. no information
at all.) 

4. So I know all about the "U" problem. What values
are missing, but not because of "U"? 

. list ferritin fvalue if mi(fvalue) & ferritin != "U" 

     +------------------------+
     |      ferritin   fvalue |
     |------------------------|
178. |        normal        . |
225. |                      . |
286. | >1400 3/24/99        . |
324. |   > 2K 1/7/02        . |
     +------------------------+

You have to decide what to do about these. 

5. What dates are missing, but not because of 
these? 

. list ferritin fdate if mi(fdate) & ferritin != "U" 

     +---------------------------+
     |          ferritin   fdate |
     |---------------------------|
 66. |         5760 5/02       . |
 70. |              4968       . |
 98. |         4782 6/02       . |
117. |         1137 4/02       . |
121. |         3237 3/01       . |
     |---------------------------|
126. |         1060 8/02       . |
148. | 1422 ng/ml 3/1998       . |
154. |         1014 8/01       . |
178. |            normal       . |
209. |  646 mg/ml 9/2000       . |
     |---------------------------|
210. |              1938       . |
212. |               127       . |
225. |                         . |
278. |          129 6/00       . |
289. |              2000       . |
     |---------------------------|
300. |              4995       . |
301. |         2748 9/02       . |
302. |        187 3/2000       . |
303. |               489       . |
305. |        3564 11/01       . |
     |---------------------------|
307. |          862 1990       . |
324. |       > 2K 1/7/02       . |
333. |         3039 2/02       . |
338. |   1777 ng/ml 5/02       . |
352. |           40 3/02       . |
     |---------------------------|
357. |               953       . |
     +---------------------------+

Similarly, you have to decide what to do here. There 
are, it seems, various kinds of problem: 

* No date supplied. 
* A year only supplied. 
* A month and year only supplied. 
* A day, month and year supplied. 
* More than two words in the string. 

In fact, we would have been better off 
trying to extract the date from 
-word(ferritin,-1)-, and that would 
yield one further daily date. 

You can use -assert- to test for truth 
or falsity. Thus 

. assert 42 == int(42) 

asserts that 42 is an integer. More 
usefully 

. assert x == int(x) 

asserts that -x- contains integer 
values only, as only for integers 
is -x- unchanged by applying the -int()- 
function. 

Similarly 

. assert real(date(stringvar, "mdy", 2050)) < . 

asserts that all values of stringvar can 
be treated as daily dates. By itself -assert- has 
very little syntax, and using it requires knowledge
of other parts of Stata. 

Nick 
n.j.cox@durham.ac.uk 

Ward Hagar
> 
> I am new to Stata and new to databases. I've been trying to 
> get an access database "Stata ready". I can insheet it well 
> and have "cleaned up" most of the variables. However, I've 
> hit a snag with one variable named "Ferritin", reproduced 
> below for those interested. (n= 362). 
> 
> My goal is to separate the value form the date.
> 
> But note:
> 1. Most have a four digit value
> 2. Many have a date of the test in an inconsistent format
> 3. Many have "U" for unknown
> 4. One has "normal" for a value
> 5. A few have the ">" operator before the value
> 6. One has the units for the measurement
> 
> 
> I've tried the date() functions, split, and word() and all 
> cause some other problem.
> 
> This is important because the ferritin is the first of a list 
> of similarly (mis)coded variables.
> 
> Is there a Stata-esque approach to this, or am I left with 
> line-by-line cleanup?
> 
> Second question is whether ASSERT can be used to test whether 
> a value is a date, integer, string, etc?
> 
> Many thanks.
> 
> . l ferritin
> 
>      +-------------------+
>      |          ferritin |
>      |-------------------|
>   1. |       540 11/6/02 |
>   2. |                 U |

> 348. |                 U |
> 349. |                 U |
> 350. |       132 2/20/02 |
>      |-------------------|
> 351. |                 U |
> 352. |           40 3/02 |
> 353. |        97 3/29/02 |
> 354. |      6151 5/22/02 |
> 355. |                 U |
>      |-------------------|
> 356. |       1390 5/8/02 |
> 357. |               953 |
> 358. |         54 8/2/02 |
> 359. |      2779 8/12/02 |
> 360. |                 U |
>      |-------------------|
> 361. |                 U |
> 362. |                 U |
>      +-------------------+

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index