.-
help for ^sample2^                                          (STB-37: dm46)
.-

Draw random sample
------------------

    ^sample2^ # [^if^ exp] [^in^ range] [^, by(^groupvars^)^ ^c^luster^(^varname^)^
              ^any^ ^all^ ^k^eep^(^varname^)^ ]


Description
-----------

^sample2^ draws a # percent pseudo-random sample of the data in memory, thus
discarding 100-# percent of the observations. Observations not meeting the
optional ^if^ and ^in^ criteria are kept (sampled at 100%).

Sampling here is defined as drawing observations without replacement; see
help @bsample@ for sampling with replacement.

If you are serious about drawing random samples, you must first set the
random number seed; see help @generate@.


Options
-------

^by(^groupvars^)^ specifies that a # percent sample is to be drawn within each
    set of values of groupvars, thus maintaining the proportion of each
    group.

^cluster(^varname^)^ specifies that an observation should be interpreted as a
    collection of records with the same value of the cluster variable.
    The cluster variable should be non-missing.

    If an ^if^ or ^in^ clause is given without ^any^ or ^all^, the if/in
    condition should select whole clusters, that is, all records of every
    cluster it selects.

    If ^by^ is combined with ^cluster^, the groupvars should be constant
    within clusters.

^any^ specifies that the sampling frame consists of all clusters for which
    at least one record was selected via if/in.

^all^ specifies that the sampling frame consists of all clusters for which
    all records were selected via if/in.

^keep(^varname^)^ specifies the name of a variable that specifies which
    observations are to be kept (value 1) or dropped (value 0). If ^keep^
    is not specified, the observations that would have been marked 0 are
    dropped automatically.


Examples
--------

    . ^sample2 10^                (draw 10 percent sample)

    . ^sample2 10 if race==0^     (keep all the ^race^~=0, but sample 10
                                   percent of ^race^==0)

    . ^sample2 10, by(race)^      (sample 10% within ^race^)

Suppose you analyze a data set of individuals in households. You want to
subsample households rather than individuals.
Thus, the subsample should include all individuals from the sampled
households. This can be obtained with

    . ^sample2 20, c(hrespnr)^

To sample 50% from all households in which at least one person has an
income higher than 100000,

    . ^sample2 50 if inc>100000, c(hrespnr) any^

If the sampling frame should consist of all households in which all persons
are older than 40 years, we would issue

    . ^sample2 50 if age>40, c(hrespnr) all^

Finally, ^cluster^ and ^by^ can be combined. To sample 50% of those
households while maintaining the regional distribution,

    . ^sample2 50 if age>40, c(hrespnr) all by(region)^


Author
------

    Jeroen Weesie
    Utrecht University
    Netherlands
    weesie@@weesie.fsw.ruu.nl


Also see
--------

    STB:  STB-37 dm46
 Manual:  [R] sample
On-line:  help for @bsample@, @generate@