
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: expanding a large data set and merging with another data set


From   Nick Cox <[email protected]>
To   "[email protected]" <[email protected]>
Subject   Re: st: expanding a large data set and merging with another data set
Date   Wed, 24 Apr 2013 09:37:42 +0100

This raises questions on several quite different levels.

0. You show precisely no code and the problem report is vague,
morphing from an implication that you can't do it to one that it just
takes up too much time. What's too much time? What code did you try?

1. What's the idea, scientifically? The presumption that individuals
experience what a single place experiences over the course of a
lifetime seems very strong to me. People move around, often a lot.
Have you got pollution data measured consistently in the same way at
the same place for decades? That would be impressive. (Are you
splicing together pollution data from several places according to
people's moves? That would be even more impressive, but nothing is
said here about different places either.) On a more mundane level, few
people live at their nearest meteorological station, but taking that
seriously would mean no-one ever looking at pollution and health.

2. Have you done any back-of-the-envelope calculations here? Suppose a
typical individual has a lifetime of the order of 20,000 days. How
many individuals do you have? With thousands of individuals, you get
millions of observations, but I imagine many readers will be saying,
sure, and why is that difficult?
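
To make that arithmetic concrete, here is a rough sketch in Stata. The
5,000 individuals and the storage types are assumptions for
illustration only; neither figure comes from the original post.

```stata
* Hypothetical sizes: 5,000 individuals, about 20,000 days of life each
local n_persons = 5000
local n_days    = 20000
local n_obs     = `n_persons' * `n_days'
display %15.0fc `n_obs' " observations"

* Assumed storage: personid, dob, dod, time as long and pollution as
* float, i.e. 5 variables at 4 bytes each = 20 bytes per observation
display %6.1f (`n_obs' * 20 / 1e9) " GB"
```

On those assumptions the expanded data set has 100 million
observations and needs about 2 GB of memory.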

3. If the question is can you speed up the -merge- command, then I
imagine the answer is yes, if you open up the code and rewrite the
central parts in Mata (or more Mata; I've not looked inside recently),
but that would be a substantial project for Bill Gould, Stata's chief
developer, and not to be undertaken lightly by anyone else.
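
For concreteness, the expand-then-merge approach described in the
quoted post might look something like the sketch below. The file names
persons.dta and pollution.dta and the variable ndays are hypothetical.
The code assumes that dob, dod and time are already numeric daily
dates, that time uniquely identifies observations in the pollution
data, and that the -merge- syntax of Stata 11 or later is available.

```stata
* Assumed contents of persons.dta: personid dob dod
use persons, clear

* One copy of each person per day of life, birth and death days inclusive
gen long ndays = dod - dob + 1
expand ndays

* Number the copies within person to get the calendar date of each day
bysort personid: gen long time = dob + _n - 1
format time %td

* Look up each day's pollution; person-days without a reading are kept
merge m:1 time using pollution, keep(master match) nogenerate
drop ndays
```

This is only a sketch of the obvious approach, not a faster
replacement for -merge-. Timing each step on a subsample, say with
-timer-, would show where the time actually goes.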

Nick


On 24 April 2013 09:01, David Jose <[email protected]> wrote:
> Hi all,
>
> This is a slightly modified question I have posted before, with an
> attempted but failed solution.
>
> I have two data sets, one which contains daily pollution data, and
> another which contains a data set on individuals. The individual-level
> data has information on the date of birth and date of death, and I
> would like to merge these two data sets, so that the resulting data
> set is an individual-level data set, where for each individual I have
> pollution exposure for each day of life.
>
> Some details to give you a better idea of the structure of each data set:
>
> Data set 1 has a personid and that person's date of birth and date of death.
>
> For example, for persons 1 and 2:
>
> personid    dob         dod
>
> 1               1/1/00    1/1/01
>
> 2               5/1/05    8/5/09
>
> Data set 2 has a pollution measure for every day of the year.
>
> For example, for the month of January in 2000:
>
> time          pollution
>
> 1/1/00        50
> 1/2/00        49.5
> .
> .
> .
> 12/31/10        65
>
> I would like to merge these two data sets. The resulting merged data
> set would have, for each person, the pollution level for each day of
> life. That is, I'd like the merged data set to look like this:
>
> personid    dob         dod     time          pollution
>
> 1               1/1/00    1/1/01  1/1/00        50
> 1               1/1/00    1/1/01  1/2/00        49.5
> .
> .
> .
> 1               1/1/00    1/1/01  1/1/01        55
>
> 2               5/1/05    8/5/09  5/1/05        65
> 2               5/1/05    8/5/09  5/2/05        62
> .
> .
> .
> 2               5/1/05    8/5/09  8/5/09        69
>
> etc. etc.
>
> I have tried a solution which creates duplicate observations (using
> the expand command) in the individual-level data set, which is based
> on the difference (dod-dob+1). I was hoping to merge the (duplicated)
> individual-level data set, in which each duplicated observation
> corresponds to a different day of life, with the pollution data set.
> However, I am not able to go beyond the duplication step because I have a
> large number of individuals in my data set, and this operation is very
> time-intensive.
>
> Does anyone have an idea for a less time-intensive way of merging
> these two data sets?

