Some room still remains for comment.
Ekaterina started the thread with the implication that her date variable
was numeric, and finished with an example in which it is string. Let
-date- be what you start with. I'll look at both possibilities.
Further solutions
=================
-date- is string:
gen sdate = date if length(date) == 6
replace sdate = "20" + date if missing(sdate) & date < "1000"
replace sdate = "19" + date if missing(sdate)
gen ndate = date(sdate, "YM")
-date- is numeric:
gen ndate = date if date >= 10000
replace ndate = 200000 + date if missing(ndate) & date < 1000
replace ndate = 190000 + date if missing(ndate)
replace ndate = date(string(ndate), "YM")
This solution is (deliberately) pedestrian, meaning it takes short
simple steps.
Various remarks
===============
1. I like -tostring- as much as most of its users but it is not needed
here. Using -tostring- to create a variable you only want briefly is
unnecessary when -string()- will do the same thing on the fly.
2. Eric used expressions like
varold - (int(varold/10000)*10000)
Note that this is just -mod(varold, 10000)-.
3. Martin cited a piece in the Stata Journal on -cond()-. Note that the
first author is David Kantor.
4. Tirthankar downloaded -catenate- from SSC (which he didn't use) and
-todate- ditto (which he did). Note that -catenate- is made obsolete by
-egen, concat()- (unless you have Stata 5 or Stata 6) while -todate- is
made obsolete by Stata 10's -date()-.
Nick
[email protected]
Ekaterina Hertog (2)
====================
thank you so very much for all the brilliant advice!
I have tried different options suggested by Eric, Martin and Tirthankar
and finally opted for a slightly modified version of Martin's suggestion
as it seemed it would be most flexible, even allowing em to take into
account the fact that some of my dates come from 20th and some for the
21st century (I could do it with the topyear option) Just in case you
might be interested this is what I did:
clear
input str10 stredu1st
200109
197104
197504
196504
196904
8804
8404
0202
6304
8304
end
compress
gen mydate= cond(length(stredu1st)>4,
stredu1st, substr(stredu1st,1,2)+" "+substr(stredu1st,3,4))
gen edu1st = cond(length(stredu1st)>4,
date(mydate, "YM"), date(mydate, "YM" , 2007))
format edu1st %td
list stredu1st mydate edu1st, noobs
Eric de Souza (2)
=================
Wouldn't the following be much simpler for her?
replacing varnew by whatever her date variable is.
input varold
199804
199702
8705
7502
end
gen varnew = varold - (int(varold/10000)*10000)
gen year = 1900 + int(varnew/100)
gen month = varnew - (int(varnew/100)*100)
gen date = ym(year, month)
format %tm date
Or is there a problem with the above?
Martin Weiss (2)
===============
So it seems Ekaterina can employ several methods, based on the -cond-
function. She can read up on best practice with regard to this function
in Nick`s column
http://www.stata-journal.com/sjpdf.html?articlenum=pr0016
If she is willing to make sure that all her two digit years hail from
the 20th century, she could
*************
clear*
input edu_start_date_1 // :mylabel , auto // str10 double byte
197104
197504
196504
196904
8804
8404
6304
8304
end
compress
tostring edu_start_date_1, gen(stredu1st)
gen mydate=cond(length(stredu1st)>4, /// stredu1st, ///
substr(stredu1st,1,2)+" "+substr(stredu1st,3,4))
gen edu1st = cond(length(stredu1st)>4, /// date(mydate, "YM"), ///
date(mydate, "19YM"))
format edu1st %td
list edu_start_date_1 mydate edu1st, noobs
*************
force a hole into the string after the year digits, and the date
function would understand her.
Alternatively, she could make the same assumption and add the "19" to
all four digit strings:
******************
clear*
input edu_start_date_1 // :mylabel , auto // str10 double byte
197104
197504
196504
196904
8804
8404
6304
8304
end
compress
tostring edu_start_date_1, gen(stredu1st)
gen mydate=cond(length(stredu1st)>4, /// stredu1st, ///
"19"+stredu1st)
gen edu1st = date(mydate, "YM")
format edu1st %td
list edu_start_date_1 mydate edu1st, noobs
******************
She should check very carefully whether the results match her
expectations
:-)
Eric de Souza (1)
=================
And -di %td date("0675", "M19Y")- works
Martin Weiss (1)
================
The -cond- function would let you condition on the length of the string
so you could tailor the command:
-gen mydate = cond(length(stredu1st)>4, ... )-
I would love to fill in the remainder of this command but I just cannot
figure out at the moment why
-di %td date("197506", "YM")-
works and
-di %td date("7506", "19YM")-
does not...
Tirthankar Chakravarty
======================
A fairly crude hack follows. I do this two ways to check that there is
no inconsistency. Then there is a slightly simpler solution below:
/* Begin */
clear
input str10 date
197104
197504
196504
196904
8804
8404
6304
8304
end
g date1=trim(date)
g newvar=substr(date1, 3,.) if length(date1)>4 replace newvar=date1 if
length(date1)<5
// or:
g date2 = "19"
ssc install catenate, replace
g str10 newvar2 = date2+date1 if length(date1)<5 replace newvar2=date1
if length(date1)>4
drop date1 date2
ssc install todate, replace
todate newvar2, gen(edu1st_1) pattern(yyyymm) todate newvar,
gen(edu1st_2) pattern(yymm) cend(2000) assert edu1st_1==edu1st_2 drop
newvar* list edu1st*
/* End */
A simpler option could be to run NJ Cox's -todate- twice and then
combine the two output variables - some manipulation using -length()-
will be required in this case as well.
/* Simple option */
clear
input str10 date
197104
197504
196504
196904
8804
8404
6304
8304
end
todate date if length(date)>4, gen(edu1st_1) pattern(yyyymm) todate date
if length(date)<5, gen(edu1st_2) pattern(yymm) cend(2000) replace
edu1st_1=edu1st_2 if(edu1st_1==.) drop edu1st_2 rename edu1st_1 edu1st
list edu1st
/* End */
Ekaterina Hertog (1)
====================
I have got a variable containing the month and year an individual
started his or her education. Only Stata thinks the values in this
variable are numbers and I want to turn them into dates.
If all the numbers followed the same pattern that will not be a problem.
for example I could do it like this:
tostring edu_start_date_1, gen(stredu1st)
gen edu1st = date(stredu1st, "YM")
My problem is that while most dates in my dataset come in the yyyymm
pattern:
e.g.
+----------+
| stredu~t |
|----------|
1. | . |
2. | 197104 |
3. | 197504 |
4. | 196504 |
5. | 196904 |
|----------|
several contain only yymm
e.g.
+-----------+
| edu_st~1 |
|-----------|
12338. | 8804 |
13265. | 8404 |
13666. | 6304 |
13831. | 8304 |
+-----------+
So when I run
gen edu1st = date(stredu1st, "YM")
all the yymm values in stredu1st are turned into missing values in
edu1st.
I could of course edit the values containing only yymm into yyyymm
pattern manually, but this feels imprecise and prone to error and I
would like to automate the process if at all possible.
Is there a way to make the date command recognise alternative patterns?
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/