

From: Nick Cox <njcoxstata@gmail.com>
To: "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject: Re: st: expanding a large data set and merging with another data set
Date: Wed, 24 Apr 2013 09:37:42 +0100

This raises questions on several quite different levels.

0. You show precisely no code and the problem report is vague, morphing from an implication that you can't do it to one that it just takes up too much time. What's too much time? What code did you try?

1. What's the idea, scientifically? The presumption that individuals experience what a single place experiences over the course of a lifetime seems very strong to me. People move around, often a lot. Have you got pollution data measured consistently in the same way at the same place for decades? That would be impressive. (Are you splicing together pollution data from several places according to people's moves? That would be even more impressive, but nothing is said here about different places too.) On a more mundane level, few people live at their nearest meteorological station, but taking that seriously would mean no-one ever looking at pollution and health.

2. Have you done any back-of-the-envelope calculations here? Suppose a typical individual has a lifetime of the order of 20,000 days. How many individuals do you have? With thousands of individuals, you get millions of observations, but I imagine many readers will be saying, sure, and why is that difficult.

3. If the question is can you speed up the -merge- command, then I imagine the answer is yes, if you open up the code and rewrite the central parts in Mata (or more Mata; I've not looked inside recently), but that would be a substantial project for Bill Gould, Stata's chief developer, and not to be undertaken lightly by anyone else.

Nick
njcoxstata@gmail.com

On 24 April 2013 09:01, David Jose <davidjosework@gmail.com> wrote:
> Hi all,
>
> This is a slightly modified question I have posted before, with an
> attempted but failed solution.
>
> I have two data sets, one which contains daily pollution data, and
> another which contains a data set on individuals. The individual-level
> data has information on the date of birth and date of death, and I
> would like to merge these two data sets, so that the resulting data
> set is an individual-level data set, where for each individual I have
> pollution exposure for each day of life.
>
> Some details to give you a better idea of the structure of each data set:
>
> Data set 1 has a personid and his date of birth and date of death.
>
> For example, for persons 1 and 2:
>
> personid   dob      dod
> 1          1/1/00   1/1/01
> 2          5/1/05   8/5/09
>
> Data set 2 has a pollution measure for every day of the year.
>
> For example, for the month of January in 2000:
>
> time       pollution
> 1/1/00     50
> 1/2/00     49.5
> .
> .
> .
> 12/31/10   65
>
> I would like to merge these two data sets. The resulting merged data
> set would have, for each person, the pollution level for each day of
> life. That is, I'd like the merged data set to look like this:
>
> personid   dob      dod      time     pollution
> 1          1/1/00   1/1/01   1/1/00   50
> 1          1/1/00   1/1/01   1/2/00   49.5
> .
> .
> .
> 1          1/1/00   1/1/01   1/1/01   55
>
> 2          5/1/05   8/5/09   5/1/05   65
> 2          5/1/05   8/5/09   5/2/05   62
> .
> .
> .
> 2          5/1/05   8/5/09   8/5/09   69
>
> etc. etc.
>
> I have tried a solution which creates duplicate observations (using
> the expand command) in the individual-level data set, which is based
> on the difference (dod-dob+1). I was hoping to merge the (duplicated)
> individual-level data set, in which each duplicated observation
> corresponds to a different day of life, with the pollution data set.
> However, I am not able to go beyond duplication step because I have a
> large number of individuals in my data set, and this operation is very
> time-intensive.
>
> Does anyone have an idea for a less time-intensive way of merging
> these two data sets?
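
For reference, the expand-then-merge approach described in the quoted question, together with the row count suggested in point 2 of the reply, might look something like the sketch below. No code appears anywhere in the thread, so this is an illustration only. It assumes hypothetical file names individuals.dta and pollution.dta, that dob, dod and time are numeric Stata daily dates (not strings such as "1/1/00"), that personid is unique in the individual file, and that time is unique in the pollution file.

```stata
* Sketch only: file names and storage types are assumptions, not from the thread.
use individuals, clear                // hypothetical file: personid dob dod

* Days of life per person, counting both the birth and death dates
generate long ndays = dod - dob + 1

* Back-of-the-envelope check: how many observations will -expand- create?
summarize ndays, meanonly
display "observations after expand: " %15.0fc r(sum)

* One observation per person-day
expand ndays

* Within each person, number the copies and turn them into calendar dates
bysort personid: generate long time = dob + _n - 1
format time %td

* Attach the daily pollution value; many person-days share one calendar day
merge m:1 time using pollution, keep(master match) nogenerate
```

Running the lines up to -display- first shows how large the expanded data set would be before -expand- is attempted, which helps answer whether the problem is memory, time, or neither.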

