Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


From: Nick Cox <[email protected]>
To: [email protected]
Subject: Re: st: Draw a random sample of my data...
Date: Thu, 4 Oct 2012 08:53:01 +0100

Although presented as a reply to my post of 27 September, this is in
fact a repost of a question posted on 1 October

http://www.stata.com/statalist/archive/2012-10/msg00020.html

which got no replies. The FAQ has explicit advice on re-posting when
the original gets no reply; re-posting once is certainly allowed, but
it is better to think that your question was unclear and so should be
reworded. See

<http://www.stata.com/support/faqs/resources/statalist-faq/#noanswer>

In this case, I am unclear about what you want to do and why.

1. Taking account of business volume sounds reasonable, but having 29%
of the sample be US firms because US firms account for 29% of the
volume is neither necessary nor sufficient to do that. The proper
weighting would usually be that _individual_ firms are weighted by
their individual volume. But you still need to look ahead to the
analysis commands you plan to use, as they may not allow weighting. I
have no idea how differential weighting might work with panel
analysis, and presumably your economic research questions would affect
what you do.

2. As you have U.S. firms and non-U.S. firms, combining does not sound
at all like a -merge- problem, but more of an -append- problem and in
itself unproblematic.

Nick

On Thu, Oct 4, 2012 at 12:46 AM, <[email protected]> wrote:

> Thank you very much Nick for your answer. The "stable" option
> helped solve my problem. However, a new question emerged:
>
> I have a little problem with generating a new dataset. I first use
> the commands "sample" and "set seed" to generate a new dataset.
> But I still have problems with integrating my random sample dataset
> within the original panel data. The reason is that US firms account
> for more than 50% of the dataset, which affects the cross-country
> results very strongly. However, with respect to the worldwide
> industry business volume, US firms account for 29%. Therefore, I draw
> a random sample, in which I randomly select 29% of the US firms in
> the dataset.
>
> I have panel data with countryID, firmID and years. After running
> the random sample and setting the seed, I would like to merge the
> randomly generated dataset of US firms (with random firmID and random
> years) with my original panel data (with countryID, firmID and
> years). But how can I merge the dataset so that only the random
> sample of US firms is considered (for additional years within the
> original panel dataset) and the other US firms are dropped? How can
> I generate a variable with which I can say that only "the random" US
> firms are to be considered within the original panel dataset for all
> years?
>
> Please help. Thank you in advance.
>
> Mehmet Altun
>
> My commands look like:
>
> use all_data8;
>
> by firmID, sort: gen firms = _n;
> keep if firms==1;
>
> keep if countryID==244;  /* USA */
> sort firmID, stable;
> set seed 260581;
>
> sample 63;
> sort year;
> save usfirms_1, replace;
>
>> First note that
>>
>> sort countryID year
>>
>> does nothing useful because you undo it by
>>
>> by firmID, sort: gen firms = _n
>>
>> Now focus on that last command. It will sort your data by -firmID- but
>> precisely which observation comes first within -firmID- is not
>> reproducible with that syntax. So which observations are selected by
>>
>> keep if firms == 1
>>
>> may differ. Nothing that you do afterwards will undo that
>> indeterminacy. You can ensure consistency by e.g. -sort, stable-.
>>
>> Here is a demo:
>>
>> . sysuse auto, clear
>>
>> . bysort rep78 : gen which = _n == 1
>>
>> . levelsof make if which
>> `"AMC Spirit"' `"Cad. Deville"' `"Dodge St. Regis"' `"Pont. Firebird"'
>> `"Subaru"' `"VW Rabbit"'
>>
>> . sysuse auto, clear
>> (1978 Automobile Data)
>>
>> . bysort rep78 : gen which = _n == 1
>>
>> . levelsof make if which
>> `"Buick Century"' `"Chev. Monte Carlo"' `"Ford Fiesta"' `"Honda
>> Accord"' `"Pont. Firebird"' `"Pont. Phoenix"'
>>
>> Different -make-s come first.
>>
>> . sysuse auto, clear
>> (1978 Automobile Data)
>>
>> . sort rep78, stable
>>
>> . by rep78 : gen which = _n == 1
>>
>> . levelsof make if which
>> `"AMC Concord"' `"AMC Spirit"' `"Buick Electra"' `"Cad. Eldorado"'
>> `"Dodge Colt"' `"Olds Starfire"'
>>
>> . sysuse auto, clear
>> (1978 Automobile Data)
>>
>> . sort rep78, stable
>>
>> . by rep78 : gen which = _n == 1
>>
>> . levelsof make if which
>> `"AMC Concord"' `"AMC Spirit"' `"Buick Electra"' `"Cad. Eldorado"'
>> `"Dodge Colt"' `"Olds Starfire"'
>>
>>
>> Nick
>>
>> Mehmet Altun
>>
>>> I will code a subset of my data. I used the "sample" command.
>>> However, I would like to fix my random sample, so that I can
>>> generate the same sample again. For this I used the "set seed"
>>> command. However, if I rerun the do-file I get different samples in
>>> my random sample. Here is my do-file:
>>>
>>> clear;
>>> use all_data8;
>>> sort countryID year;
>>>
>>> by firmID, sort: gen firms = _n;
>>> keep if firms==1;
>>>
>>> by countryID, sort: egen countryfirms = total(firms);
>>>
>>> keep if countryID==244;
>>>
>>> set seed 260581;
>>>
>>> sample 63;
>>>
>>> save usfirms_1, replace;
>>>
>>>
>>>
>>> Is there a bug in Stata, or what is wrong? Please help.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/
