Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: RE: match variable across two tables


From   Rongrong Zhang <[email protected]>
To   [email protected]
Subject   Re: st: RE: match variable across two tables
Date   Sat, 21 Dec 2013 13:56:16 -0500

Thank you Sarah.

NAICS1 does not contain all the naics code from the original data.

I found out why stata import naics1 as str, because there are a few
observations have letters embeded in NAICS1, e.g. 331A as a value of
NAICS1, NAICS2-8 are only numeric .

I am not proficient in writing a import program, that is why I use
import wizard to import the txt file.



On Sat, Dec 21, 2013 at 12:32 PM, Sarah Edgington <[email protected]> wrote:
> Rochelle,
> At this point to determine what to do next you're actually going to have to
> look carefully at your data, all of it, not just the first observation.
> Then you'll have to make some decisions about how to get from the data you
> have to the data you want.
>
> Are naics2-naics8 missing for ALL observations.  Stata doesn't make
> decisions about what format to import variables based on only the first
> observation so looking at the first observation is not going to be enough
> information to tell you what happened.
>
> Then you'll want to look at naics1.  Does it contain all the naics codes
> from your original table?  If naics1 contains all your values, separated by
> spaces, and the rest of the naics variables are ALWAYS missing then you can,
> as I suggested previously, just get rid of the extraneous naics variables
> and use -split- as Robert suggested previously.
>
> If naic2-naics8 contain data for some of your observations then you'll have
> to think harder about your next steps.
>
> For -reshape- to work you need a series of numbered variables that all have
> the same storage format.
> You should have all the tools you need to get to that point.  You just have
> to look carefully at your data and figure out what steps you need to take.
>
> -Sarah
>
>
> At 05:33 AM 12/21/2013, you wrote:
>>
>> Hi Sarah,
>>
>> after importing, naics1 was set to str, naics2-8 were set to long, as
>> I said previously, I used File-Import-ASCII data created by
>> spreadsheet, then stata imported my txt file for me, my first
>> observation has non-missing data for naics1 and all missing for
>> naics2-8, I guess that is why stata assigned different types to
>> them.and the log shows command insheet was used.
>>  insheet using "C:\Users\Questions\Stata list\I-O table__Cleaned.txt"
>>
>> Best,
>> Rochelle
>>
>> On Fri, Dec 20, 2013 at 6:05 PM, Sarah Edgington <[email protected]> wrote:
>> > Rochelle,
>> > The error message isn't because the naics variables are missing, it's
>> > because naics2 (and presumably all of naics2-naics8?) are a different
>> > variable type than naics1.  However, reshaping when all but 1 of the
>> > variables being reshaped contain all missing values isn't going to get you
>> > what you want.
>> >
>> > It sounds like something is going awry with your import process.  If I
>> > understand you correctly you're saying that naics2-naics8 are missing for
>> > all observations not just the first two that you show, right?
>> > Are the codes all being read into the naics1 variable?  That is, is
>> > naics1 a string variable containing multiple codes separated by spaces?  If
>> > that's the case you'll want to drop naics2-naics8 and separate naics1 into
>> > multiple variables before reshaping.
>> > -Sarah
>> >
>> > -----Original Message-----
>> > From: [email protected]
>> > [mailto:[email protected]] On Behalf Of Rongrong Zhang
>> > Sent: Friday, December 20, 2013 2:06 PM
>> > To: [email protected]
>> > Subject: Re: st: RE: match variable across two tables
>> >
>> > THANKS! I use import wizard and get the data into stata .
>> >
>> > data looks like:
>> > ionumber ioname naics1 naics2 naics3 naics4 naics5 naics6 naics7 naics8
>> > 1110  Crop production 111 . . . . . . .
>> > 1111A0 Oilseed farming    11111 . . . . . . .
>> >
>> >
>> > that is missing for naics2 ~8.
>> >
>> >  insheet using "C:\Users\Questions\Stata list\I-O table__Cleaned.txt"
>> > (10 vars, 564 obs)
>> >
>> > I got an error here:
>> >
>> > reshape long naics, i(ionumber) j(code) string
>> > (note: j = 1 2 3 4 5 6 7 8)
>> > naics2 type mismatch with other naics variables
>> >
>> > I did not have this error when I use your entire program, that is , when
>> > I use your input, then split codelist,
>> >
>> > I wonder if my error is caused by missing values in naics2
>> >
>> > On Fri, Dec 20, 2013 at 4:40 PM, Robert Picard <[email protected]>
>> > wrote:
>> >> I added double quotes so that your few lines of data could be read
>> >> inline using -input- (since Statalist does not allow attachments). You
>> >> most certainly do not need to input your data into Stata using the
>> >> same command. See -help import- to find better ways to do it.
>> >>
>> >> Robert
>> >>
>> >> On Fri, Dec 20, 2013 at 4:31 PM, Rongrong Zhang <[email protected]>
>> >> wrote:
>> >>> Dear Roberts,
>> >>>
>> >>> Please excuse my late response.  Thanks so very much for your code !!!
>> >>> Words can't express my gratitude.
>> >>>
>> >>> my original data has over 600 rows (the I-O table), I posted only a
>> >>> few lines to save space. My question - to add quotes like in your
>> >>> program
>> >>>
>> >>> "1110" "Crop production"
>> >>>
>> >>> is there a stata tool that does it automatically or do I need to
>> >>> insert it manually for all 600 rows?
>> >>>
>> >>> Merry Christmas!
>> >>>
>> >>> Rochelle
>> >>>
>> >>> On Thu, Dec 19, 2013 at 12:22 PM, Robert Picard <[email protected]>
>> >>> wrote:
>> >>>> No need to talk about "fuzzy" matching as NAISC codes are defined
>> >>>> hierarchically. If you do not match at the 6-digit level, you can
>> >>>> try again using 5-digit codes, and so on.
>> >>>>
>> >>>> Your first problem is to reshape Table 1 data from wide to long
>> >>>> format. Your "I-O number codes" are clearly not valid NAISC codes so
>> >>>> the target becomes creating a crosswalk between valid NAICS to "I-O
>> >>>> number codes".
>> >>>>
>> >>>> Once you have the crosswalk, you can do an exact match using -merge-.
>> >>>> For all NAICS code that did not find an exact match, you can do an
>> >>>> update merge to find matching "I-O numbers" using 5-digit NAISC
>> >>>> codes.
>> >>>> You can then repeat down to 2-digit NAICS if you want to.
>> >>>>
>> >>>> Robert
>> >>>>
>> >>>> * ----------------- begin example ------------------------ clear
>> >>>> input str6 ionumber str244 ioname str244 codelist "1110" "Crop
>> >>>> production"
>> >>>> "1111A0" "Oilseed farming" "11111 11112"
>> >>>> "1111B0" "Grain farming" "11113 11114 11115 11116 11119"
>> >>>> "111200" "Vegetable and melon farming" "1112"
>> >>>> "111400" "Greenhouse and nursery production" "1114"
>> >>>> "111910" "Tobacco farming" "11191"
>> >>>> "111920" "Cotton farming" "11192"
>> >>>> "1119A0" "Sugarcane and sugar beet" "11193 111991"
>> >>>> "1119B0" "All other crop farming" "11194 111992 111998"
>> >>>> end
>> >>>> compress
>> >>>>
>> >>>> * split into separate codes and reshape long split codelist,
>> >>>> gen(naics) reshape long naics, i(ionumber) j(code) string
>> >>>>
>> >>>> * drop obs with missing codes
>> >>>> bysort ionumber (code): drop if mi(naics) & _n > 1 replace naics =
>> >>>> ionumber if mi(naics)
>> >>>>
>> >>>> * remove trailing zeros
>> >>>> replace naics = regexr(naics,"0+$","")
>> >>>>
>> >>>> * save naics to ionumber crosswalk
>> >>>> isid naics, sort
>> >>>> list, noobs sepby(ionumber)
>> >>>> tempfile table1
>> >>>> save "`table1'"
>> >>>>
>> >>>> clear
>> >>>> input str6 naics
>> >>>> "111"
>> >>>> "1111"
>> >>>> "111150"
>> >>>> "111199"
>> >>>> "111219"
>> >>>> "111310"
>> >>>> "111320"
>> >>>> "111332"
>> >>>> "111334"
>> >>>> "111335"
>> >>>> "111339"
>> >>>> "1114"
>> >>>> "111411"
>> >>>> "111419"
>> >>>> "111421"
>> >>>> "111422"
>> >>>> "111920"
>> >>>> "111930"
>> >>>> "111940"
>> >>>> "111998"
>> >>>> end
>> >>>> gen table2id = _n
>> >>>> replace naics = regexr(naics,"0+$","")
>> >>>>
>> >>>> * do an exact match using the crosswalk merge 1:1 naics using
>> >>>> "`table1'", keepusing(ionumber) /// keep(master match) nogen
>> >>>>
>> >>>> * for obs that did not match, try again using 5 digits.
>> >>>> clonevar naics6 = naics
>> >>>> replace naics = substr(naics6,1,5)
>> >>>> merge m:1 naics using "`table1'", keepusing(ionumber) /// update
>> >>>> gen(merge5) drop if merge5 == 2
>> >>>>
>> >>>> * repeat for 4-digit naics
>> >>>> replace naics = substr(naics6,1,4)
>> >>>> merge m:1 naics using "`table1'", keepusing(ionumber) /// update
>> >>>> gen(merge4) drop if merge4 == 2
>> >>>>
>> >>>> * repeat for 3-digit naics
>> >>>> replace naics = substr(naics6,1,3)
>> >>>> merge m:1 naics using "`table1'", keepusing(ionumber) /// update
>> >>>> gen(merge3) drop if merge3 == 2
>> >>>> * --------------------------- end example ---------------
>> >>>>
>> >>>>
>> >>>> On Wed, Dec 18, 2013 at 9:52 PM, Rongrong Zhang <[email protected]>
>> >>>> wrote:
>> >>>>> Hi Sarah,
>> >>>>>
>> >>>>> Thanks so much for your questions.  Let me try to answer them in
>> >>>>> the order they were posted.
>> >>>>>
>> >>>>> Yes, I plan to drop trailing zeros and take all the nonzero digits
>> >>>>> as match criteria. In this case, you are correct in terms of - I
>> >>>>> need processing the data first. - should I use trim ()?
>> >>>>>
>> >>>>> your next question: the structure of data in table 1: do I have a
>> >>>>> single variable that has multiple codes in it. I assume you are
>> >>>>> asking:
>> >>>>>
>> >>>>> e.g 1111B0    Grain farming    corresponds to 5 different NAICS code
>> >>>>>  and they are    11113      11114      11115 11116      11119.
>> >>>>>
>> >>>>> suppose all these 5 NAICS codes are present in my Table 2, I would
>> >>>>> like to have 5 rows in my final output table like this:
>> >>>>>
>> >>>>> 1111B0   11113
>> >>>>> 1111B0   11114
>> >>>>> 1111B0   11115
>> >>>>> 1111B0   11116
>> >>>>> 1111B0   11119
>> >>>>>
>> >>>>> next question : the rule that make an entry a match. If I require 5
>> >>>>> or
>> >>>>> 6 digit match, then these two tables may not produce many matches.
>> >>>>> that is why I thought of 4 digit matches. Ideally I would like to
>> >>>>> do both exact and "fuzzy" match e.g. using 4 digit, so I have the
>> >>>>> flexibility to control my sample size.
>> >>>>>
>> >>>>> If you or others have questions or suggestions, please let me know.
>> >>>>>
>> >>>>> thanks,
>> >>>>>
>> >>>>> On Wed, Dec 18, 2013 at 3:05 PM, Sarah Edgington <[email protected]>
>> >>>>> wrote:
>> >>>>>> Rochelle,
>> >>>>>> This looks like it may be a pretty complicated problem.  I don't
>> >>>>>> immediately have any suggestions because I'm not sure I understand either
>> >>>>>> the exact structure of your data or the matching rules you want to follow.
>> >>>>>>
>> >>>>>> You say that if you use exact matching that you want I-O number
>> >>>>>> 1111B0 to match with NAICS code 111150.  I take it that is an "exact match"
>> >>>>>> because you want to drop the trailing zero in the NAICS code.  So, since
>> >>>>>> 11115 appears in the list of NAICS codes for 1111B0, it would match to
>> >>>>>> 111150 in table 2.  This is not to my mind an "exact match" because it
>> >>>>>> requires first modifying the NAICS code in table 2 before you can match.  To
>> >>>>>> do that successfully you need to be very clear about what the rule for
>> >>>>>> modification is.
>> >>>>>> Is the rule that if the NAICS code in table 2 has a zero at the end
>> >>>>>> you always drop it?  Does it matter how many digits appear before the zero?
>> >>>>>>
>> >>>>>> The next question I have is about the structure of table 1 as it
>> >>>>>> appears in Stata.  Do you have a single variable that has multiple codes in
>> >>>>>> it?  If so, you're probably going to have to do some additional processing
>> >>>>>> to that variable before trying to match the two tables.
>> >>>>>>
>> >>>>>> The final thing I was unclear on was what you want the final
>> >>>>>> structure of your data to be after matching.  How do you want to deal with
>> >>>>>> entries in table 1 that have multiple matches in table 2?  Do you want the
>> >>>>>> resulting data to contain multiple observations, one for each of the NAICS
>> >>>>>> codes that the I-O number matches to?
>> >>>>>>
>> >>>>>> Again for the four digit match, you'll want to be very clear on the
>> >>>>>> rules that make an entry a match.  I'm not sure if you're asking for a match
>> >>>>>> of the first four digits of the NAICS code in table 1 to only the codes in
>> >>>>>> table 2 that are four digits long.  Alternatively perhaps you're looking to
>> >>>>>> match observation in table 1 to ALL the entries in table 2 that share the
>> >>>>>> same first four digits.
>> >>>>>>
>> >>>>>> If you can more precisely describe the structure of your data as it
>> >>>>>> currently exists, the matching rules you want to follow, and the structure
>> >>>>>> you want your final data to be in, you'll increase your chances of getting a
>> >>>>>> helpful answer from the list.
>> >>>>>>
>> >>>>>> -S
>> >>>>>>
>> >>>>>> -----Original Message-----
>> >>>>>> From: [email protected]
>> >>>>>> [mailto:[email protected]] On Behalf Of
>> >>>>>> Rongrong Zhang
>> >>>>>> Sent: Wednesday, December 18, 2013 11:15 AM
>> >>>>>> To: [email protected]
>> >>>>>> Subject: st: match variable across two tables
>> >>>>>>
>> >>>>>> Dear STATALISTER,
>> >>>>>>
>> >>>>>> I have two tables:
>> >>>>>>
>> >>>>>> Table 1 has 3 variables  I-O number, I-O Name ,   Related 1997
>> >>>>>> NAICS codes.
>> >>>>>>
>> >>>>>> Table 2 has 1 variable 1997 NAICS codes.
>> >>>>>>
>> >>>>>> I want to link these two tables based on NAICS code. However, the
>> >>>>>> level of details on NAICS code does not match one-to-one because
>> >>>>>> the tables come from different data source. My goal is to know
>> >>>>>> which NAICS code correspond to which I-O number. I can’t use Table
>> >>>>>> 1 only, because TABLE 2 is produced from Wharton Research Database
>> >>>>>> which has company level financial data ­ I will use later on.
>> >>>>>>
>> >>>>>> By different details I mean : e.g.
>> >>>>>>
>> >>>>>> table 1:
>> >>>>>>
>> >>>>>> I-O number  I-O Name                  1997 NAICS codes
>> >>>>>>
>> >>>>>> 1110        Crop production
>> >>>>>>
>> >>>>>> 1111A0    Oilseed farming
>> >>>>>>          11111      11112
>> >>>>>>
>> >>>>>> 1111B0    Grain farming             11113      11114      11115
>> >>>>>> 11116      11119
>> >>>>>>
>> >>>>>> 111200    Vegetable and melon farming
>> >>>>>>                    1112
>> >>>>>> 111400 Greenhouse and nursery production
>> >>>>>>                 1114
>> >>>>>> 111910 Tobacco farming
>> >>>>>>                 11191
>> >>>>>> 111920 Cotton farming
>> >>>>>>                 11192
>> >>>>>> 1119A0 Sugarcane and sugar beet
>> >>>>>>       11193 111991
>> >>>>>> 1119B0 All other crop farming
>> >>>>>> 11194 111992 111998
>> >>>>>>
>> >>>>>> in the above example, I present industry 1110 and its
>> >>>>>> subindustries 1111A0, 1111B0, 111200, each of the subindustries
>> >>>>>> correspond to a few (or a single) NAICS code (north america
>> >>>>>> industry classification system).
>> >>>>>>
>> >>>>>> table 2:
>> >>>>>> NAICS CODE
>> >>>>>> 111
>> >>>>>> 1111
>> >>>>>> 111150
>> >>>>>> 111199
>> >>>>>> 111219
>> >>>>>> 111310
>> >>>>>> 111320
>> >>>>>> 111332
>> >>>>>> 111334
>> >>>>>> 111335
>> >>>>>> 111339
>> >>>>>> 1114
>> >>>>>> 111411
>> >>>>>> 111419
>> >>>>>> 111421
>> >>>>>> 111422
>> >>>>>> 111920
>> >>>>>> 111930
>> >>>>>> 111940
>> >>>>>> 111998
>> >>>>>>
>> >>>>>> if I enforce exact match, then table 2  111150 matches with table 1
>> >>>>>> 1111B0,    table 2 1112l9 may be matched with  111200 table 1 I-O.
>> >>>>>>
>> >>>>>> My question :
>> >>>>>> 1.could you give a sample code/function to do exact match? note,
>> >>>>>> if first 5digit match, and drop last 0 (naics), we consider that a
>> >>>>>> match 2. if I want to increase match, how could I change the
>> >>>>>> program to do 4 digit match
>> >>>>>>
>> >>>>>> thanks a bunch,
>> >>>>>>
>> >>>>>> --
>> >>>>>> Best,
>> >>>>>> Rochelle
>> >>>>>>
>> >>>>>> *
>> >>>>>> *   For searches and help try:
>> >>>>>> *   http://www.stata.com/help.cgi?search
>> >>>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> >>>>>> *   http://www.ats.ucla.edu/stat/stata/
>> >>>>>>
>> >>>>>>
>> >>>>>> *
>> >>>>>> *   For searches and help try:
>> >>>>>> *   http://www.stata.com/help.cgi?search
>> >>>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> >>>>>> *   http://www.ats.ucla.edu/stat/stata/
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> --
>> >>>>> -Best,
>> >>>>> R
>> >>>>>
>> >>>>> *
>> >>>>> *   For searches and help try:
>> >>>>> *   http://www.stata.com/help.cgi?search
>> >>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> >>>>> *   http://www.ats.ucla.edu/stat/stata/
>> >>>>
>> >>>> *
>> >>>> *   For searches and help try:
>> >>>> *   http://www.stata.com/help.cgi?search
>> >>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> >>>> *   http://www.ats.ucla.edu/stat/stata/
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> -Best,
>> >>> R
>> >>>
>> >>> *
>> >>> *   For searches and help try:
>> >>> *   http://www.stata.com/help.cgi?search
>> >>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> >>> *   http://www.ats.ucla.edu/stat/stata/
>> >>
>> >> *
>> >> *   For searches and help try:
>> >> *   http://www.stata.com/help.cgi?search
>> >> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> >> *   http://www.ats.ucla.edu/stat/stata/
>> >
>> >
>> >
>> > --
>> > -Best,
>> > R
>> >
>> > *
>> > *   For searches and help try:
>> > *   http://www.stata.com/help.cgi?search
>> > *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> > *   http://www.ats.ucla.edu/stat/stata/
>> >
>> >
>> > *
>> > *   For searches and help try:
>> > *   http://www.stata.com/help.cgi?search
>> > *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> > *   http://www.ats.ucla.edu/stat/stata/
>>
>>
>>
>> --
>> -Best,
>> R
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/
>
>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index