Re: st: re: data creation for hazard regression

From   Austin Nichols
Subject   Re: st: re: data creation for hazard regression
Date   Fri, 8 Jun 2012 11:16:17 -0400

Kenisha Russell:

You will get better answers if you describe your data better--do you
have monthly observations on women?  What transitions are you
observing? Labor market? Education?  If you have data measured once
monthly, you are probably better off turning the data into
person-month observations and using a discrete-time hazard model; see

In any case, before you do any data work, you should replace
out-of-range dates with missing:
replace CMchild1=. if CMchild1==999999
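
A minimal sketch of the same recode applied to all birth-date variables at once, on toy data. The names CMchild2 and CMchild3 and the values shown are assumptions for illustration; only CMchild1 and the 999999 code come from the thread.

```stata
* Toy data: 999999 stands for "no such child" (values are made up)
clear
input CMchild1 CMchild2 CMchild3
201003 201201 999999
200805 999999 999999
end

* recode the sentinel to system missing in every listed variable
mvdecode CMchild1-CMchild3, mv(999999)

* confirm that no sentinel values remain
count if inlist(999999, CMchild1, CMchild2, CMchild3)
list
```

mvdecode does in one pass what a replace statement per variable would do, which helps avoid forgetting one of the three variables.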

Next consider what subtracting 7 from a date in the format 201003
might mean--is that what you mean by "century month format" perhaps?
200996 is not the answer you want, I assume!
But perhaps you have a proper date variable and you mean you have
applied a display format such as
format d %tm_CCYY_Mon
Just make sure you know what values are encoded in the variable, and
not just how they display.
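
To make the point concrete, here is a small sketch that assumes the births are stored as YYYYMM integers (the variable names yyyymm, naive, born, and conceived are hypothetical). It contrasts naive subtraction with subtraction on a Stata monthly date built with ym():

```stata
* Toy data: births stored as YYYYMM integers (an assumption about the data)
clear
input long yyyymm
201003
200912
end

* naive integer arithmetic breaks when it crosses a year boundary
gen long naive = yyyymm - 7

* convert to a Stata monthly date (months since 1960m1), then subtract
gen int born = ym(floor(yyyymm/100), mod(yyyymm, 100))
gen int conceived = born - 7
format born conceived %tm
list
```

For March 2010 the naive result is 200996, while the monthly-date result displays as 2009m8. If the variable instead holds true century month codes of the form (year-1900)*12 + month, then subtracting 7 is already valid month arithmetic, and subtracting 721 from such a code gives the corresponding Stata monthly date.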

If you arrange your data as person-month observations, and create a
date variable "now" measuring contemporaneous time, and 3 date
variables "born1,born2,born3" for months of birth, then you can
generate a pregnant dummy like so:

g pregnant=0
forv i=1/3 {
  * skip missing birth months: inrange() treats missing bounds as open-ended
  replace pregnant=1 if inrange(now,born`i'-7,born`i') & !missing(born`i')
}

bearing in mind there will be some considerable measurement error in
the pregnant variable.  Are you sure every child is a biological
child?  Are there women with more than 3 children in the data?  Do you
have any information on gestational age at birth?
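
A self-contained sketch of that approach on toy data, so the result can be inspected. The ids, the birth months, and the 48-month observation window are all made up for illustration; the !missing() guard is an addition here, included because inrange() returns 1 when both bounds are missing.

```stata
* Toy data: 2 women, birth months as Stata monthly dates (hypothetical values)
clear
input id born1 born2 born3
1 600 630 .
2 610 .   .
end

* one row per person-month over an assumed 48-month window starting at 590
expand 48
bysort id: gen now = 590 + _n - 1
format now born* %tm

* flag the birth month and the 7 months before it
gen byte pregnant = 0
forvalues i = 1/3 {
    replace pregnant = 1 if inrange(now, born`i'-7, born`i') & !missing(born`i')
}
tabulate id pregnant
```

Note that the interval is inclusive at both ends, so each birth flags 8 person-months. Without the guard, a woman with no third child would be flagged as pregnant in every month.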

If you rewrite your question, please take some time to make it
clearer; phrases like "the likelihood of pregnancy is also 3" just
confuse the reader and lower the probability of your getting a useful
answer.

On Fri, Jun 8, 2012 at 4:50 AM, Kenisha Russell wrote:
> Hi Statalisters,
> I am trying to create a data set for which I will use hazard regression (event history analysis to demographers).
> I am currently restructuring my data into person-period format, in order to use hazard regression to examine the propensity of an individual to transition from state x to state y.
> and one of the variables that I want to use is pregnancy.
> Because I have the day and month each child was born, after making this date into century month format, I have simply subtracted the 7 months previous to the birth of each child to obtain a variable called pregnancy. In this particular data set the highest recorded parity is 3. See the syntax I have used below.
> gen CMpregnancy1=.
> replace CMpregnancy1=CMchild1-7 if CMchild1!=999999
> CMchild is the birthdate of each child in century month format.
> After this I then split the data:
> stsplit pregnancy1, after(CMpregnancy1) at(0)
> /* We replace values for pregnancy1 so that 0 represents time before that
> the woman was pregnant and 1 for after the pregnancy*/
> replace pregnancy1= pregnancy1+1
> replace pregnancy1=0 if CMpregnancy1==.
> list  pid-_st CMpregnancy*  pregnancy* in 1/60
> This is repeated three times because given the fact that highest parity is = 3,  the likelihood of pregnancy is also 3 and all should be taken into account.
> Although I have written syntax here and have split the data, my issue is that I am not sure it is correct. Am I required to split the data at each pregnancy, i.e. to create a period before the event (the pregnancy)?
> If I do split at the event, and if my reasoning is correct, I assume I would need to end each pregnancy at the point where each child is born. Is that correct? If so, how would I do that?
> Best,
> Kenisha
