
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: Draw a random sample of my data...


From   Nick Cox <[email protected]>
To   [email protected]
Subject   Re: st: Draw a random sample of my data...
Date   Thu, 4 Oct 2012 08:53:01 +0100

Although presented as a reply to my post of 27 September, this is in
fact a repost of a question posted on 1 October

http://www.stata.com/statalist/archive/2012-10/msg00020.html

which got no replies. The FAQ has explicit advice on re-posting when
the original gets no reply; re-posting once is certainly allowed, but
it is better to think that your question was unclear and so should be
reworded.  See
<http://www.stata.com/support/faqs/resources/statalist-faq/#noanswer>

In this case, I am unclear about what you want to do and why.

1. Taking account of business volume sounds reasonable, but having 29%
of the sample be US firms because US firms account for 29% of the
volume is neither necessary nor sufficient to do that. The proper
weighting would usually be that _individual_ firms are weighted by
their individual volume. But you still need to look ahead to the
analysis commands you plan to use, as they may not allow weights. I
have no idea how differential weighting might work with panel
analysis, and presumably your economic research questions would affect
what you do.

2. As you have U.S. firms and non-U.S. firms, combining does not sound
at all like a -merge- problem, but more of an -append- problem and in
itself unproblematic.
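
One possible way to put this together, as a minimal sketch only: it
assumes that -firmID- identifies firms uniquely across countries, it
reuses the file name, country code, seed and sampling percentage from
the commands quoted below, and the tempfile names (picked, us) are
purely illustrative. It does not address whether sampling is the right
way to handle the imbalance.

```stata
* Sketch only: assumes firmID is unique across countries.
* Step 1: draw the random sample from a list of distinct US firms.
use all_data8, clear
keep if countryID == 244
keep firmID
duplicates drop            // one observation per firm
sort firmID                // unique key, so the order is reproducible
set seed 260581
sample 63                  // keep 63% of the US firms
tempfile picked
save `picked'

* Step 2: keep all years of the sampled US firms only.
use all_data8, clear
keep if countryID == 244
merge m:1 firmID using `picked', keep(match) nogenerate
tempfile us
save `us'

* Step 3: stack the non-US firms and the sampled US firms.
use all_data8, clear
drop if countryID == 244
append using `us'
```

Because the firm list holds each -firmID- once, sorting on it before
-set seed- leaves no ties to break, so the same firms should be drawn
on every run.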

Nick

On Thu, Oct 4, 2012 at 12:46 AM,  <[email protected]> wrote:
> Thank you very much, Nick, for your answer. The "stable" option
> solved my problem. However, a new question has emerged:
> I have a little problem with generating a new dataset. I first use the
> commands "sample" and "set seed" to generate a new dataset.
> But I still have problems integrating my random sample dataset with
> the original panel data. The reason is that US firms account for more than
> 50% of the dataset, which affects the cross-country results very strongly.
> However, with respect to worldwide industry business volume, US
> firms account for 29%. Therefore, I draw a random sample in which I randomly
> account for 29% of the US firms in the dataset. I have panel data with
> countryID, firmID and years. After setting the seed and drawing the random
> sample, I would like to merge the randomly generated dataset of US
> firms (with random firmID and random years) with my original panel data
> (with countryID, firmID and years). But how can I merge the datasets so
> that only the random sample of US firms is kept (for the additional
> years within the original panel dataset) and the other US firms are
> dropped? How can I generate a variable that says that only
> "the random" US firms are to be considered within the original
> panel dataset for all years?
> Please help. Thank you in advance.
> Mehmet Altun
>
> My commands look like:
> use all_data8;
>
> by firmID, sort: gen firms = _n;
> keep if firms==1;
>
> keep if countryID==244 /* USA */;
> sort firmID, stable;
> set seed 260581;
>
> sample 63;
> sort year;
> save usfirms_1, replace;
>
>> First note that
>>
>> sort countryID year
>>
>> does nothing useful because you undo it by
>>
>> by firmID, sort: gen firms = _n
>>
>> Now focus on that last command. It will sort your data by -firmID- but
>> precisely which observation comes first within -firmID- is not
>> reproducible with that syntax.  So which observations are selected by
>>
>> keep if firms == 1
>>
>> may differ. Nothing that you do afterwards will undo that
>> indeterminacy. You can ensure consistency by e.g. -sort, stable-.
>>
>> Here is a demo:
>>
>> . sysuse auto, clear
>>
>> . bysort rep78 : gen which = _n == 1
>>
>> . levelsof make if which
>> `"AMC Spirit"' `"Cad. Deville"' `"Dodge St. Regis"' `"Pont. Firebird"'
>> `"Subaru"' `"VW Rabbit"'
>>
>> . sysuse auto, clear
>> (1978 Automobile Data)
>>
>> . bysort rep78 : gen which = _n == 1
>>
>> . levelsof make if which
>> `"Buick Century"' `"Chev. Monte Carlo"' `"Ford Fiesta"' `"Honda
>> Accord"' `"Pont. Firebird"' `"Pont. Phoenix"'
>>
>> Different -make-s come first.
>>
>> . sysuse auto, clear
>> (1978 Automobile Data)
>>
>> . sort rep78, stable
>>
>> . by rep78 : gen which = _n == 1
>>
>> . levelsof make if which
>> `"AMC Concord"' `"AMC Spirit"' `"Buick Electra"' `"Cad. Eldorado"'
>> `"Dodge Colt"' `"Olds Starfire"'
>>
>> . sysuse auto, clear
>> (1978 Automobile Data)
>>
>> . sort rep78, stable
>>
>> . by rep78 : gen which = _n == 1
>>
>> . levelsof make if which
>> `"AMC Concord"' `"AMC Spirit"' `"Buick Electra"' `"Cad. Eldorado"'
>> `"Dodge Colt"' `"Olds Starfire"'
>>
>>
>> Nick
>>
>> Mehmet Altun
>>
>>> I want to draw a subset of my data. I used the "sample"
>>> command. However, I would like to fix my random sample, so that I can
>>> generate the same sample again. For this I used the "set seed" command.
>>> However, if I rerun the do-file I get a different random sample
>>> each time. Here is my do-file:
>>>
>>> clear;
>>> use all_data8;
>>> sort countryID year;
>>>
>>> by firmID, sort: gen firms = _n;
>>> keep if firms==1;
>>>
>>> by countryID, sort: egen countryfirms = total(firms);
>>>
>>> keep if countryID==244;
>>>
>>> set seed 260581;
>>>
>>> sample 63;
>>>
>>> save usfirms_1, replace;
>>>
>>>
>>>
>>> Is there a bug in Stata, or what is wrong? Please help.