Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: RE: match variable across two tables


From   Robert Picard <picard@netbox.com>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   Re: st: RE: match variable across two tables
Date   Sat, 21 Dec 2013 14:54:45 -0500

Well "331A" is not a valid NAISC code so you have to decide what to do
about that. The sample code I provided earlier requires that
naics1-naics8 be string. This can easily be done using

tostring naics*, replace

Robert

On Sat, Dec 21, 2013 at 1:56 PM, Rongrong Zhang <r05zhang@gmail.com> wrote:
> Thank you Sarah.
>
> NAICS1 does not contain all the naics code from the original data.
>
> I found out why stata import naics1 as str, because there are a few
> observations have letters embeded in NAICS1, e.g. 331A as a value of
> NAICS1, NAICS2-8 are only numeric .
>
> I am not proficient in writing a import program, that is why I use
> import wizard to import the txt file.
>
>
>
> On Sat, Dec 21, 2013 at 12:32 PM, Sarah Edgington <sedging@ucla.edu> wrote:
>> Rochelle,
>> At this point to determine what to do next you're actually going to have to
>> look carefully at your data, all of it, not just the first observation.
>> Then you'll have to make some decisions about how to get from the data you
>> have to the data you want.
>>
>> Are naics2-naics8 missing for ALL observations.  Stata doesn't make
>> decisions about what format to import variables based on only the first
>> observation so looking at the first observation is not going to be enough
>> information to tell you what happened.
>>
>> Then you'll want to look at naics1.  Does it contain all the naics codes
>> from your original table?  If naics1 contains all your values, separated by
>> spaces, and the rest of the naics variables are ALWAYS missing then you can,
>> as I suggested previously, just get rid of the extraneous naics variables
>> and use -split- as Robert suggested previously.
>>
>> If naic2-naics8 contain data for some of your observations then you'll have
>> to think harder about your next steps.
>>
>> For -reshape- to work you need a series of numbered variables that all have
>> the same storage format.
>> You should have all the tools you need to get to that point.  You just have
>> to look carefully at your data and figure out what steps you need to take.
>>
>> -Sarah
>>
>>
>> At 05:33 AM 12/21/2013, you wrote:
>>>
>>> Hi Sarah,
>>>
>>> after importing, naics1 was set to str, naics2-8 were set to long, as
>>> I said previously, I used File-Import-ASCII data created by
>>> spreadsheet, then stata imported my txt file for me, my first
>>> observation has non-missing data for naics1 and all missing for
>>> naics2-8, I guess that is why stata assigned different types to
>>> them.and the log shows command insheet was used.
>>>  insheet using "C:\Users\Questions\Stata list\I-O table__Cleaned.txt"
>>>
>>> Best,
>>> Rochelle
>>>
>>> On Fri, Dec 20, 2013 at 6:05 PM, Sarah Edgington <sedging@ucla.edu> wrote:
>>> > Rochelle,
>>> > The error message isn't because the naics variables are missing, it's
>>> > because naics2 (and presumably all of naics2-naics8?) are a different
>>> > variable type than naics1.  However, reshaping when all but 1 of the
>>> > variables being reshaped contain all missing values isn't going to get you
>>> > what you want.
>>> >
>>> > It sounds like something is going awry with your import process.  If I
>>> > understand you correctly you're saying that naics2-naics8 are missing for
>>> > all observations not just the first two that you show, right?
>>> > Are the codes all being read into the naics1 variable?  That is, is
>>> > naics1 a string variable containing multiple codes separated by spaces?  If
>>> > that's the case you'll want to drop naics2-naics8 and separate naics1 into
>>> > multiple variables before reshaping.
>>> > -Sarah
>>> >
>>> > -----Original Message-----
>>> > From: owner-statalist@hsphsun2.harvard.edu
>>> > [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Rongrong Zhang
>>> > Sent: Friday, December 20, 2013 2:06 PM
>>> > To: statalist@hsphsun2.harvard.edu
>>> > Subject: Re: st: RE: match variable across two tables
>>> >
>>> > THANKS! I use import wizard and get the data into stata .
>>> >
>>> > data looks like:
>>> > ionumber ioname naics1 naics2 naics3 naics4 naics5 naics6 naics7 naics8
>>> > 1110  Crop production 111 . . . . . . .
>>> > 1111A0 Oilseed farming    11111 . . . . . . .
>>> >
>>> >
>>> > that is missing for naics2 ~8.
>>> >
>>> >  insheet using "C:\Users\Questions\Stata list\I-O table__Cleaned.txt"
>>> > (10 vars, 564 obs)
>>> >
>>> > I got an error here:
>>> >
>>> > reshape long naics, i(ionumber) j(code) string
>>> > (note: j = 1 2 3 4 5 6 7 8)
>>> > naics2 type mismatch with other naics variables
>>> >
>>> > I did not have this error when I use your entire program, that is , when
>>> > I use your input, then split codelist,
>>> >
>>> > I wonder if my error is caused by missing values in naics2
>>> >
>>> > On Fri, Dec 20, 2013 at 4:40 PM, Robert Picard <picard@netbox.com>
>>> > wrote:
>>> >> I added double quotes so that your few lines of data could be read
>>> >> inline using -input- (since Statalist does not allow attachments). You
>>> >> most certainly do not need to input your data into Stata using the
>>> >> same command. See -help import- to find better ways to do it.
>>> >>
>>> >> Robert
>>> >>
>>> >> On Fri, Dec 20, 2013 at 4:31 PM, Rongrong Zhang <r05zhang@gmail.com>
>>> >> wrote:
>>> >>> Dear Roberts,
>>> >>>
>>> >>> Please excuse my late response.  Thanks so very much for your code !!!
>>> >>> Words can't express my gratitude.
>>> >>>
>>> >>> my original data has over 600 rows (the I-O table), I posted only a
>>> >>> few lines to save space. My question - to add quotes like in your
>>> >>> program
>>> >>>
>>> >>> "1110" "Crop production"
>>> >>>
>>> >>> is there a stata tool that does it automatically or do I need to
>>> >>> insert it manually for all 600 rows?
>>> >>>
>>> >>> Merry Christmas!
>>> >>>
>>> >>> Rochelle
>>> >>>
>>> >>> On Thu, Dec 19, 2013 at 12:22 PM, Robert Picard <picard@netbox.com>
>>> >>> wrote:
>>> >>>> No need to talk about "fuzzy" matching as NAISC codes are defined
>>> >>>> hierarchically. If you do not match at the 6-digit level, you can
>>> >>>> try again using 5-digit codes, and so on.
>>> >>>>
>>> >>>> Your first problem is to reshape Table 1 data from wide to long
>>> >>>> format. Your "I-O number codes" are clearly not valid NAISC codes so
>>> >>>> the target becomes creating a crosswalk between valid NAICS to "I-O
>>> >>>> number codes".
>>> >>>>
>>> >>>> Once you have the crosswalk, you can do an exact match using -merge-.
>>> >>>> For all NAICS code that did not find an exact match, you can do an
>>> >>>> update merge to find matching "I-O numbers" using 5-digit NAISC
>>> >>>> codes.
>>> >>>> You can then repeat down to 2-digit NAICS if you want to.
>>> >>>>
>>> >>>> Robert
>>> >>>>
>>> >>>> * ----------------- begin example ------------------------ clear
>>> >>>> input str6 ionumber str244 ioname str244 codelist "1110" "Crop
>>> >>>> production"
>>> >>>> "1111A0" "Oilseed farming" "11111 11112"
>>> >>>> "1111B0" "Grain farming" "11113 11114 11115 11116 11119"
>>> >>>> "111200" "Vegetable and melon farming" "1112"
>>> >>>> "111400" "Greenhouse and nursery production" "1114"
>>> >>>> "111910" "Tobacco farming" "11191"
>>> >>>> "111920" "Cotton farming" "11192"
>>> >>>> "1119A0" "Sugarcane and sugar beet" "11193 111991"
>>> >>>> "1119B0" "All other crop farming" "11194 111992 111998"
>>> >>>> end
>>> >>>> compress
>>> >>>>
>>> >>>> * split into separate codes and reshape long split codelist,
>>> >>>> gen(naics) reshape long naics, i(ionumber) j(code) string
>>> >>>>
>>> >>>> * drop obs with missing codes
>>> >>>> bysort ionumber (code): drop if mi(naics) & _n > 1 replace naics =
>>> >>>> ionumber if mi(naics)
>>> >>>>
>>> >>>> * remove trailing zeros
>>> >>>> replace naics = regexr(naics,"0+$","")
>>> >>>>
>>> >>>> * save naics to ionumber crosswalk
>>> >>>> isid naics, sort
>>> >>>> list, noobs sepby(ionumber)
>>> >>>> tempfile table1
>>> >>>> save "`table1'"
>>> >>>>
>>> >>>> clear
>>> >>>> input str6 naics
>>> >>>> "111"
>>> >>>> "1111"
>>> >>>> "111150"
>>> >>>> "111199"
>>> >>>> "111219"
>>> >>>> "111310"
>>> >>>> "111320"
>>> >>>> "111332"
>>> >>>> "111334"
>>> >>>> "111335"
>>> >>>> "111339"
>>> >>>> "1114"
>>> >>>> "111411"
>>> >>>> "111419"
>>> >>>> "111421"
>>> >>>> "111422"
>>> >>>> "111920"
>>> >>>> "111930"
>>> >>>> "111940"
>>> >>>> "111998"
>>> >>>> end
>>> >>>> gen table2id = _n
>>> >>>> replace naics = regexr(naics,"0+$","")
>>> >>>>
>>> >>>> * do an exact match using the crosswalk merge 1:1 naics using
>>> >>>> "`table1'", keepusing(ionumber) /// keep(master match) nogen
>>> >>>>
>>> >>>> * for obs that did not match, try again using 5 digits.
>>> >>>> clonevar naics6 = naics
>>> >>>> replace naics = substr(naics6,1,5)
>>> >>>> merge m:1 naics using "`table1'", keepusing(ionumber) /// update
>>> >>>> gen(merge5) drop if merge5 == 2
>>> >>>>
>>> >>>> * repeat for 4-digit naics
>>> >>>> replace naics = substr(naics6,1,4)
>>> >>>> merge m:1 naics using "`table1'", keepusing(ionumber) /// update
>>> >>>> gen(merge4) drop if merge4 == 2
>>> >>>>
>>> >>>> * repeat for 3-digit naics
>>> >>>> replace naics = substr(naics6,1,3)
>>> >>>> merge m:1 naics using "`table1'", keepusing(ionumber) /// update
>>> >>>> gen(merge3) drop if merge3 == 2
>>> >>>> * --------------------------- end example ---------------
>>> >>>>
>>> >>>>
>>> >>>> On Wed, Dec 18, 2013 at 9:52 PM, Rongrong Zhang <r05zhang@gmail.com>
>>> >>>> wrote:
>>> >>>>> Hi Sarah,
>>> >>>>>
>>> >>>>> Thanks so much for your questions.  Let me try to answer them in
>>> >>>>> the order they were posted.
>>> >>>>>
>>> >>>>> Yes, I plan to drop trailing zeros and take all the nonzero digits
>>> >>>>> as match criteria. In this case, you are correct in terms of - I
>>> >>>>> need processing the data first. - should I use trim ()?
>>> >>>>>
>>> >>>>> your next question: the structure of data in table 1: do I have a
>>> >>>>> single variable that has multiple codes in it. I assume you are
>>> >>>>> asking:
>>> >>>>>
>>> >>>>> e.g 1111B0    Grain farming    corresponds to 5 different NAICS code
>>> >>>>>  and they are    11113      11114      11115 11116      11119.
>>> >>>>>
>>> >>>>> suppose all these 5 NAICS codes are present in my Table 2, I would
>>> >>>>> like to have 5 rows in my final output table like this:
>>> >>>>>
>>> >>>>> 1111B0   11113
>>> >>>>> 1111B0   11114
>>> >>>>> 1111B0   11115
>>> >>>>> 1111B0   11116
>>> >>>>> 1111B0   11119
>>> >>>>>
>>> >>>>> next question : the rule that make an entry a match. If I require 5
>>> >>>>> or
>>> >>>>> 6 digit match, then these two tables may not produce many matches.
>>> >>>>> that is why I thought of 4 digit matches. Ideally I would like to
>>> >>>>> do both exact and "fuzzy" match e.g. using 4 digit, so I have the
>>> >>>>> flexibility to control my sample size.
>>> >>>>>
>>> >>>>> If you or others have questions or suggestions, please let me know.
>>> >>>>>
>>> >>>>> thanks,
>>> >>>>>
>>> >>>>> On Wed, Dec 18, 2013 at 3:05 PM, Sarah Edgington <sedging@ucla.edu>
>>> >>>>> wrote:
>>> >>>>>> Rochelle,
>>> >>>>>> This looks like it may be a pretty complicated problem.  I don't
>>> >>>>>> immediately have any suggestions because I'm not sure I understand either
>>> >>>>>> the exact structure of your data or the matching rules you want to follow.
>>> >>>>>>
>>> >>>>>> You say that if you use exact matching that you want I-O number
>>> >>>>>> 1111B0 to match with NAICS code 111150.  I take it that is an "exact match"
>>> >>>>>> because you want to drop the trailing zero in the NAICS code.  So, since
>>> >>>>>> 11115 appears in the list of NAICS codes for 1111B0, it would match to
>>> >>>>>> 111150 in table 2.  This is not to my mind an "exact match" because it
>>> >>>>>> requires first modifying the NAICS code in table 2 before you can match.  To
>>> >>>>>> do that successfully you need to be very clear about what the rule for
>>> >>>>>> modification is.
>>> >>>>>> Is the rule that if the NAICS code in table 2 has a zero at the end
>>> >>>>>> you always drop it?  Does it matter how many digits appear before the zero?
>>> >>>>>>
>>> >>>>>> The next question I have is about the structure of table 1 as it
>>> >>>>>> appears in Stata.  Do you have a single variable that has multiple codes in
>>> >>>>>> it?  If so, you're probably going to have to do some additional processing
>>> >>>>>> to that variable before trying to match the two tables.
>>> >>>>>>
>>> >>>>>> The final thing I was unclear on was what you want the final
>>> >>>>>> structure of your data to be after matching.  How do you want to deal with
>>> >>>>>> entries in table 1 that have multiple matches in table 2?  Do you want the
>>> >>>>>> resulting data to contain multiple observations, one for each of the NAICS
>>> >>>>>> codes that the I-O number matches to?
>>> >>>>>>
>>> >>>>>> Again for the four digit match, you'll want to be very clear on the
>>> >>>>>> rules that make an entry a match.  I'm not sure if you're asking for a match
>>> >>>>>> of the first four digits of the NAICS code in table 1 to only the codes in
>>> >>>>>> table 2 that are four digits long.  Alternatively perhaps you're looking to
>>> >>>>>> match observation in table 1 to ALL the entries in table 2 that share the
>>> >>>>>> same first four digits.
>>> >>>>>>
>>> >>>>>> If you can more precisely describe the structure of your data as it
>>> >>>>>> currently exists, the matching rules you want to follow, and the structure
>>> >>>>>> you want your final data to be in, you'll increase your chances of getting a
>>> >>>>>> helpful answer from the list.
>>> >>>>>>
>>> >>>>>> -S
>>> >>>>>>
>>> >>>>>> -----Original Message-----
>>> >>>>>> From: owner-statalist@hsphsun2.harvard.edu
>>> >>>>>> [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of
>>> >>>>>> Rongrong Zhang
>>> >>>>>> Sent: Wednesday, December 18, 2013 11:15 AM
>>> >>>>>> To: statalist@hsphsun2.harvard.edu
>>> >>>>>> Subject: st: match variable across two tables
>>> >>>>>>
>>> >>>>>> Dear STATALISTER,
>>> >>>>>>
>>> >>>>>> I have two tables:
>>> >>>>>>
>>> >>>>>> Table 1 has 3 variables  I-O number, I-O Name ,   Related 1997
>>> >>>>>> NAICS codes.
>>> >>>>>>
>>> >>>>>> Table 2 has 1 variable 1997 NAICS codes.
>>> >>>>>>
>>> >>>>>> I want to link these two tables based on NAICS code. However, the
>>> >>>>>> level of details on NAICS code does not match one-to-one because
>>> >>>>>> the tables come from different data source. My goal is to know
>>> >>>>>> which NAICS code correspond to which I-O number. I can’t use Table
>>> >>>>>> 1 only, because TABLE 2 is produced from Wharton Research Database
>>> >>>>>> which has company level financial data ­ I will use later on.
>>> >>>>>>
>>> >>>>>> By different details I mean : e.g.
>>> >>>>>>
>>> >>>>>> table 1:
>>> >>>>>>
>>> >>>>>> I-O number  I-O Name                  1997 NAICS codes
>>> >>>>>>
>>> >>>>>> 1110        Crop production
>>> >>>>>>
>>> >>>>>> 1111A0    Oilseed farming
>>> >>>>>>          11111      11112
>>> >>>>>>
>>> >>>>>> 1111B0    Grain farming             11113      11114      11115
>>> >>>>>> 11116      11119
>>> >>>>>>
>>> >>>>>> 111200    Vegetable and melon farming
>>> >>>>>>                    1112
>>> >>>>>> 111400 Greenhouse and nursery production
>>> >>>>>>                 1114
>>> >>>>>> 111910 Tobacco farming
>>> >>>>>>                 11191
>>> >>>>>> 111920 Cotton farming
>>> >>>>>>                 11192
>>> >>>>>> 1119A0 Sugarcane and sugar beet
>>> >>>>>>       11193 111991
>>> >>>>>> 1119B0 All other crop farming
>>> >>>>>> 11194 111992 111998
>>> >>>>>>
>>> >>>>>> in the above example, I present industry 1110 and its
>>> >>>>>> subindustries 1111A0, 1111B0, 111200, each of the subindustries
>>> >>>>>> correspond to a few (or a single) NAICS code (north america
>>> >>>>>> industry classification system).
>>> >>>>>>
>>> >>>>>> table 2:
>>> >>>>>> NAICS CODE
>>> >>>>>> 111
>>> >>>>>> 1111
>>> >>>>>> 111150
>>> >>>>>> 111199
>>> >>>>>> 111219
>>> >>>>>> 111310
>>> >>>>>> 111320
>>> >>>>>> 111332
>>> >>>>>> 111334
>>> >>>>>> 111335
>>> >>>>>> 111339
>>> >>>>>> 1114
>>> >>>>>> 111411
>>> >>>>>> 111419
>>> >>>>>> 111421
>>> >>>>>> 111422
>>> >>>>>> 111920
>>> >>>>>> 111930
>>> >>>>>> 111940
>>> >>>>>> 111998
>>> >>>>>>
>>> >>>>>> if I enforce exact match, then table 2  111150 matches with table 1
>>> >>>>>> 1111B0,    table 2 1112l9 may be matched with  111200 table 1 I-O.
>>> >>>>>>
>>> >>>>>> My question :
>>> >>>>>> 1.could you give a sample code/function to do exact match? note,
>>> >>>>>> if first 5digit match, and drop last 0 (naics), we consider that a
>>> >>>>>> match 2. if I want to increase match, how could I change the
>>> >>>>>> program to do 4 digit match
>>> >>>>>>
>>> >>>>>> thanks a bunch,
>>> >>>>>>
>>> >>>>>> --
>>> >>>>>> Best,
>>> >>>>>> Rochelle
>>> >>>>>>
>>> >>>>>> *
>>> >>>>>> *   For searches and help try:
>>> >>>>>> *   http://www.stata.com/help.cgi?search
>>> >>>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>> >>>>>> *   http://www.ats.ucla.edu/stat/stata/
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> *
>>> >>>>>> *   For searches and help try:
>>> >>>>>> *   http://www.stata.com/help.cgi?search
>>> >>>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>> >>>>>> *   http://www.ats.ucla.edu/stat/stata/
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> --
>>> >>>>> -Best,
>>> >>>>> R
>>> >>>>>
>>> >>>>> *
>>> >>>>> *   For searches and help try:
>>> >>>>> *   http://www.stata.com/help.cgi?search
>>> >>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>> >>>>> *   http://www.ats.ucla.edu/stat/stata/
>>> >>>>
>>> >>>> *
>>> >>>> *   For searches and help try:
>>> >>>> *   http://www.stata.com/help.cgi?search
>>> >>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>> >>>> *   http://www.ats.ucla.edu/stat/stata/
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> -Best,
>>> >>> R
>>> >>>
>>> >>> *
>>> >>> *   For searches and help try:
>>> >>> *   http://www.stata.com/help.cgi?search
>>> >>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>> >>> *   http://www.ats.ucla.edu/stat/stata/
>>> >>
>>> >> *
>>> >> *   For searches and help try:
>>> >> *   http://www.stata.com/help.cgi?search
>>> >> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>> >> *   http://www.ats.ucla.edu/stat/stata/
>>> >
>>> >
>>> >
>>> > --
>>> > -Best,
>>> > R
>>> >
>>> > *
>>> > *   For searches and help try:
>>> > *   http://www.stata.com/help.cgi?search
>>> > *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>> > *   http://www.ats.ucla.edu/stat/stata/
>>> >
>>> >
>>> > *
>>> > *   For searches and help try:
>>> > *   http://www.stata.com/help.cgi?search
>>> > *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>> > *   http://www.ats.ucla.edu/stat/stata/
>>>
>>>
>>>
>>> --
>>> -Best,
>>> R
>>>
>>> *
>>> *   For searches and help try:
>>> *   http://www.stata.com/help.cgi?search
>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>> *   http://www.ats.ucla.edu/stat/stata/
>>
>>
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index