Statalist The Stata Listserver



Re: st: alt to inlist?


From   Joseph Coveney <jcoveney@bigplanet.com>
To   Statalist <statalist@hsphsun2.harvard.edu>
Subject   Re: st: alt to inlist?
Date   Tue, 24 Jan 2006 14:23:45 +0900

Danielle H Ferry wrote:

As for whether there is a better way to do what I want ... probably.
I welcome any suggestions: I am reading in several datasets, each
containing different economic series at the metropolitan level, but I
want to keep data for only the top X metro areas. Metro areas are
defined differently for different series (e.g., Los Angeles could be
represented as LAS or LAX depending on whether we are talking about
MSA or CMSA). Rather than figure out which datasets use which
definition for each metro area, I want it to be flexible - keep if
msa==LAS | msa==LAX (except that I have about 145 MSA names in the
list). Is there a better way of doing this?

--------------------------------------------------------------------------------

I misunderstood your problem, thinking you were trying to do something based
upon the existence of variables called a, b, c, . . . in a dataset, i.e., a
too-literal reading of "varname = a or b or c or d or . . ."  (I've just
been doing conditional manipulation of datasets based upon the intersection
of the set of variable names in each dataset and a set of probe names, and
so was primed to think along that line when reading your original
description.)

I second Nick's suggestion to take a look at his FAQ's Section 3 and Kit
Baum's related FAQ in lieu of chained -inlist()-s.  You can put your list of
top-X MSAs and CMSAs in a probe dataset and -merge- your various datasets
against it, -keep-ing the observations -if _merge == 3- (inner joins).  If
you have trouble seeing how to do this from the FAQs, I've illustrated it
below with dummy datasets.

If you have *huge* (many observations) to-be-subsetted datasets,
where -sort-ing string variables might be time-consuming, then there's an
outside chance that it will be faster to create an index variable (see
Section 2 of Nick's FAQ), that is,

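// Replace the bracketed placeholder with your 145 MSA names,
// separated by spaces (no quotes, no commas)
local probe_list [list your 145 MSA names here]
generate byte keep = 0
foreach MSA of local probe_list {
   // Flag observations whose msa matches the current probe
   replace keep = 1 if msa == "`MSA'"
}
keep if keep
drop keep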
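
To make the placeholder concrete, here is a minimal sketch of the same loop on a toy dataset.  Only LAS and LAX come from your message; the other codes, the variable names value and wanted, and the data are made up for illustration, and the sketch assumes the MSA codes contain no embedded spaces.

```stata
clear
// Toy dataset standing in for one of the to-be-subsetted files
input str3 msa float value
"LAS" 1.5
"LAX" 2.5
"NYC" 3.5
"XYZ" 4.5
"BOS" 5.5
end

// Hypothetical three-name probe list in place of the real 145
local probe_list LAS LAX NYC

// One pass over the data per probe name
generate byte wanted = 0
foreach MSA of local probe_list {
    replace wanted = 1 if msa == "`MSA'"
}
keep if wanted
drop wanted
list
```

Each -replace- scans every observation, so 145 probes mean 145 passes over the data.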

You can try both to see which is quicker, but I strongly suspect that,
despite the need to -sort-, the -merge- method will be the more efficient
for moderate-sized datasets and 145 probe MSAs.  (In principle, sorting and
joining might always be more efficient in your case than the multitude of
string comparisons involved in creating an index variable for 145 probes,
because the loop makes one pass over the data per probe.)

Joseph Coveney

clear
set more off
set seed `=date("2005-01-24", "ymd")'
// List of top-five MSAs
tempfile probes
set obs 5
generate str1 msa = char(64 + _n)
sort msa
save `probes'
// One of the various to-be-subsetted datasets
clear
set obs 100
generate float desired_data = uniform()
generate str1 msa = char(65 + floor(26 * uniform()))
// Subsetting
sort msa
merge msa using `probes'
keep if _merge == 3
drop _merge
erase `probes'
exit
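
For readers on Stata 11 or later, where -merge- takes an explicit match type and sorts the data itself, the same inner join can be sketched as below.  This is an illustration under that version assumption, not part of the example above; apart from LAS and LAX the MSA codes are invented, and the data are typed in rather than random so that the result is easy to check by eye.

```stata
clear
tempfile probes
// Probe dataset: the MSA/CMSA codes to keep (hypothetical short list)
input str3 msa
"LAS"
"LAX"
"NYC"
end
save `probes'

// One of the to-be-subsetted datasets (made-up values)
clear
input str3 msa float desired_data
"LAS" 0.1
"BOS" 0.2
"LAX" 0.3
"LAS" 0.4
"XYZ" 0.5
end

// Many-to-one join; keep(match) retains only observations found in
// both datasets, and nogenerate suppresses the _merge variable
merge m:1 msa using `probes', keep(match) nogenerate
list
```

Here keep(match) does the work of -keep if _merge == 3-, so the two LAS rows and the LAX row remain, while the unmatched rows on either side are dropped.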



