Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: RE: match variable across two tables

From	Sarah Edgington <[email protected]>
To	[email protected]
Subject	Re: st: RE: match variable across two tables
Date	Sat, 21 Dec 2013 09:32:34 -0800

Rochelle,

At this point to determine what to do next you'reactually going to have to look carefully at yourdata, all of it, not just the firstobservation. Then you'll have to make somedecisions about how to get from the data you have to the data you want.

Are naics2-naics8 missing for ALLobservations. Stata doesn't make decisions aboutwhat format to import variables based on only thefirst observation so looking at the firstobservation is not going to be enough information to tell you what happened.

Then you'll want to look at naics1. Does itcontain all the naics codes from your originaltable? If naics1 contains all your values,separated by spaces, and the rest of the naicsvariables are ALWAYS missing then you can, as Isuggested previously, just get rid of theextraneous naics variables and use -split- as Robert suggested previously.

If naic2-naics8 contain data for some of yourobservations then you'll have to think harder about your next steps.

For -reshape- to work you need a series ofnumbered variables that all have the same storage format.You should have all the tools you need to get tothat point. You just have to look carefully atyour data and figure out what steps you need to take.


-Sarah

At 05:33 AM 12/21/2013, you wrote:

Hi Sarah,

after importing, naics1 was set to str, naics2-8 were set to long, as
I said previously, I used File-Import-ASCII data created by
spreadsheet, then stata imported my txt file for me, my first
observation has non-missing data for naics1 and all missing for
naics2-8, I guess that is why stata assigned different types to
them.and the log shows command insheet was used.
 insheet using "C:\Users\Questions\Stata list\I-O table__Cleaned.txt"

Best,
Rochelle

On Fri, Dec 20, 2013 at 6:05 PM, Sarah Edgington <[email protected]> wrote:
> Rochelle,

> The error message isn't because the naicsvariables are missing, it's because naics2 (andpresumably all of naics2-naics8?) are adifferent variable type than naics1. However,reshaping when all but 1 of the variables beingreshaped contain all missing values isn't going to get you what you want.

> It sounds like something is going awry withyour import process. If I understand youcorrectly you're saying that naics2-naics8 aremissing for all observations not just the first two that you show, right?> Are the codes all being read into the naics1variable? That is, is naics1 a string variablecontaining multiple codes separated byspaces? If that's the case you'll want to dropnaics2-naics8 and separate naics1 into multiple variables before reshaping.

> -Sarah
>
> -----Original Message-----

> From: [email protected][mailto:[email protected]] On Behalf Of Rongrong Zhang

> Sent: Friday, December 20, 2013 2:06 PM
> To: [email protected]
> Subject: Re: st: RE: match variable across two tables
>
> THANKS! I use import wizard and get the data into stata .
>
> data looks like:
> ionumber ioname naics1 naics2 naics3 naics4 naics5 naics6 naics7 naics8
> 1110  Crop production 111 . . . . . . .
> 1111A0 Oilseed farming    11111 . . . . . . .
>
>
> that is missing for naics2 ~8.
>
>  insheet using "C:\Users\Questions\Stata list\I-O table__Cleaned.txt"
> (10 vars, 564 obs)
>
> I got an error here:
>
> reshape long naics, i(ionumber) j(code) string
> (note: j = 1 2 3 4 5 6 7 8)
> naics2 type mismatch with other naics variables
>

> I did not have this error when I use yourentire program, that is , when I use your input, then split codelist,

>
> I wonder if my error is caused by missing values in naics2
>
> On Fri, Dec 20, 2013 at 4:40 PM, Robert Picard <[email protected]> wrote:
>> I added double quotes so that your few lines of data could be read
>> inline using -input- (since Statalist does not allow attachments). You
>> most certainly do not need to input your data into Stata using the
>> same command. See -help import- to find better ways to do it.
>>
>> Robert
>>

>> On Fri, Dec 20, 2013 at 4:31 PM, RongrongZhang <[email protected]> wrote:

>>> Dear Roberts,
>>>
>>> Please excuse my late response.  Thanks so very much for your code !!!
>>> Words can't express my gratitude.
>>>
>>> my original data has over 600 rows (the I-O table), I posted only a
>>> few lines to save space. My question - to add quotes like in your
>>> program
>>>
>>> "1110" "Crop production"
>>>
>>> is there a stata tool that does it automatically or do I need to
>>> insert it manually for all 600 rows?
>>>
>>> Merry Christmas!
>>>
>>> Rochelle
>>>

>>> On Thu, Dec 19, 2013 at 12:22 PM, RobertPicard <[email protected]> wrote:

>>>> No need to talk about "fuzzy" matching as NAISC codes are defined
>>>> hierarchically. If you do not match at the 6-digit level, you can
>>>> try again using 5-digit codes, and so on.
>>>>
>>>> Your first problem is to reshape Table 1 data from wide to long
>>>> format. Your "I-O number codes" are clearly not valid NAISC codes so
>>>> the target becomes creating a crosswalk between valid NAICS to "I-O
>>>> number codes".
>>>>
>>>> Once you have the crosswalk, you can do an exact match using -merge-.
>>>> For all NAICS code that did not find an exact match, you can do an
>>>> update merge to find matching "I-O numbers" using 5-digit NAISC codes.
>>>> You can then repeat down to 2-digit NAICS if you want to.
>>>>
>>>> Robert
>>>>
>>>> * ----------------- begin example ------------------------ clear
>>>> input str6 ionumber str244 ioname str244 codelist "1110" "Crop
>>>> production"
>>>> "1111A0" "Oilseed farming" "11111 11112"
>>>> "1111B0" "Grain farming" "11113 11114 11115 11116 11119"
>>>> "111200" "Vegetable and melon farming" "1112"
>>>> "111400" "Greenhouse and nursery production" "1114"
>>>> "111910" "Tobacco farming" "11191"
>>>> "111920" "Cotton farming" "11192"
>>>> "1119A0" "Sugarcane and sugar beet" "11193 111991"
>>>> "1119B0" "All other crop farming" "11194 111992 111998"
>>>> end
>>>> compress
>>>>
>>>> * split into separate codes and reshape long split codelist,
>>>> gen(naics) reshape long naics, i(ionumber) j(code) string
>>>>
>>>> * drop obs with missing codes
>>>> bysort ionumber (code): drop if mi(naics) & _n > 1 replace naics =
>>>> ionumber if mi(naics)
>>>>
>>>> * remove trailing zeros
>>>> replace naics = regexr(naics,"0+$","")
>>>>
>>>> * save naics to ionumber crosswalk
>>>> isid naics, sort
>>>> list, noobs sepby(ionumber)
>>>> tempfile table1
>>>> save "`table1'"
>>>>
>>>> clear
>>>> input str6 naics
>>>> "111"
>>>> "1111"
>>>> "111150"
>>>> "111199"
>>>> "111219"
>>>> "111310"
>>>> "111320"
>>>> "111332"
>>>> "111334"
>>>> "111335"
>>>> "111339"
>>>> "1114"
>>>> "111411"
>>>> "111419"
>>>> "111421"
>>>> "111422"
>>>> "111920"
>>>> "111930"
>>>> "111940"
>>>> "111998"
>>>> end
>>>> gen table2id = _n
>>>> replace naics = regexr(naics,"0+$","")
>>>>
>>>> * do an exact match using the crosswalk merge 1:1 naics using
>>>> "`table1'", keepusing(ionumber) /// keep(master match) nogen
>>>>
>>>> * for obs that did not match, try again using 5 digits.
>>>> clonevar naics6 = naics
>>>> replace naics = substr(naics6,1,5)
>>>> merge m:1 naics using "`table1'", keepusing(ionumber) /// update
>>>> gen(merge5) drop if merge5 == 2
>>>>
>>>> * repeat for 4-digit naics
>>>> replace naics = substr(naics6,1,4)
>>>> merge m:1 naics using "`table1'", keepusing(ionumber) /// update
>>>> gen(merge4) drop if merge4 == 2
>>>>
>>>> * repeat for 3-digit naics
>>>> replace naics = substr(naics6,1,3)
>>>> merge m:1 naics using "`table1'", keepusing(ionumber) /// update
>>>> gen(merge3) drop if merge3 == 2
>>>> * --------------------------- end example ---------------
>>>>
>>>>

>>>> On Wed, Dec 18, 2013 at 9:52 PM, RongrongZhang <[email protected]> wrote:

>>>>> Hi Sarah,
>>>>>
>>>>> Thanks so much for your questions.  Let me try to answer them in
>>>>> the order they were posted.
>>>>>
>>>>> Yes, I plan to drop trailing zeros and take all the nonzero digits
>>>>> as match criteria. In this case, you are correct in terms of - I
>>>>> need processing the data first. - should I use trim ()?
>>>>>
>>>>> your next question: the structure of data in table 1: do I have a
>>>>> single variable that has multiple codes in it. I assume you are
>>>>> asking:
>>>>>
>>>>> e.g 1111B0    Grain farming    corresponds to 5 different NAICS code
>>>>>  and they are    11113      11114      11115 11116      11119.
>>>>>
>>>>> suppose all these 5 NAICS codes are present in my Table 2, I would
>>>>> like to have 5 rows in my final output table like this:
>>>>>
>>>>> 1111B0   11113
>>>>> 1111B0   11114
>>>>> 1111B0   11115
>>>>> 1111B0   11116
>>>>> 1111B0   11119
>>>>>
>>>>> next question : the rule that make an entry a match. If I require 5
>>>>> or
>>>>> 6 digit match, then these two tables may not produce many matches.
>>>>> that is why I thought of 4 digit matches. Ideally I would like to
>>>>> do both exact and "fuzzy" match e.g. using 4 digit, so I have the
>>>>> flexibility to control my sample size.
>>>>>
>>>>> If you or others have questions or suggestions, please let me know.
>>>>>
>>>>> thanks,
>>>>>

>>>>> On Wed, Dec 18, 2013 at 3:05 PM, SarahEdgington <[email protected]> wrote:

>>>>>> Rochelle,

>>>>>> This looks like it may be a prettycomplicated problem. I don't immediately haveany suggestions because I'm not sure Iunderstand either the exact structure of yourdata or the matching rules you want to follow.

>>>>>>

>>>>>> You say that if you use exact matchingthat you want I-O number 1111B0 to match withNAICS code 111150. I take it that is an "exactmatch" because you want to drop the trailingzero in the NAICS code. So, since 11115appears in the list of NAICS codes for 1111B0,it would match to 111150 in table 2. This isnot to my mind an "exact match" because itrequires first modifying the NAICS code intable 2 before you can match. To do thatsuccessfully you need to be very clear about what the rule for modification is.>>>>>> Is the rule that if the NAICS code intable 2 has a zero at the end you always dropit? Does it matter how many digits appear before the zero?

>>>>>>

>>>>>> The next question I have is about thestructure of table 1 as it appears inStata. Do you have a single variable that hasmultiple codes in it? If so, you're probablygoing to have to do some additional processingto that variable before trying to match the two tables.

>>>>>>

>>>>>> The final thing I was unclear on waswhat you want the final structure of your datato be after matching. How do you want to dealwith entries in table 1 that have multiplematches in table 2? Do you want the resultingdata to contain multiple observations, one foreach of the NAICS codes that the I-O number matches to?

>>>>>>

>>>>>> Again for the four digit match, you'llwant to be very clear on the rules that make anentry a match. I'm not sure if you're askingfor a match of the first four digits of theNAICS code in table 1 to only the codes intable 2 that are four digitslong. Alternatively perhaps you're looking tomatch observation in table 1 to ALL the entriesin table 2 that share the same first four digits.

>>>>>>

>>>>>> If you can more precisely describe thestructure of your data as it currently exists,the matching rules you want to follow, and thestructure you want your final data to be in,you'll increase your chances of getting a helpful answer from the list.

>>>>>>
>>>>>> -S
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: [email protected]
>>>>>> [mailto:[email protected]] On Behalf Of
>>>>>> Rongrong Zhang
>>>>>> Sent: Wednesday, December 18, 2013 11:15 AM
>>>>>> To: [email protected]
>>>>>> Subject: st: match variable across two tables
>>>>>>
>>>>>> Dear STATALISTER,
>>>>>>
>>>>>> I have two tables:
>>>>>>

>>>>>> Table 1 has 3 variables I-O number, I-OName , Related 1997 NAICS codes.

>>>>>>
>>>>>> Table 2 has 1 variable 1997 NAICS codes.
>>>>>>
>>>>>> I want to link these two tables based on NAICS code. However, the
>>>>>> level of details on NAICS code does not match one-to-one because
>>>>>> the tables come from different data source. My goal is to know
>>>>>> which NAICS code correspond to which I-O number. I canâ??t use Table
>>>>>> 1 only, because TABLE 2 is produced from Wharton Research Database
>>>>>> which has company level financial data  I will use later on.
>>>>>>
>>>>>> By different details I mean : e.g.
>>>>>>
>>>>>> table 1:
>>>>>>
>>>>>> I-O number  I-O Name                  1997 NAICS codes
>>>>>>
>>>>>> 1110        Crop production
>>>>>>
>>>>>> 1111A0    Oilseed farming
>>>>>>          11111      11112
>>>>>>
>>>>>> 1111B0    Grain farming             11113      11114      11115
>>>>>> 11116      11119
>>>>>>
>>>>>> 111200    Vegetable and melon farming
>>>>>>                    1112
>>>>>> 111400 Greenhouse and nursery production
>>>>>>                 1114
>>>>>> 111910 Tobacco farming
>>>>>>                 11191
>>>>>> 111920 Cotton farming
>>>>>>                 11192
>>>>>> 1119A0 Sugarcane and sugar beet
>>>>>>       11193 111991
>>>>>> 1119B0 All other crop farming
>>>>>> 11194 111992 111998
>>>>>>
>>>>>> in the above example, I present industry 1110 and its
>>>>>> subindustries 1111A0, 1111B0, 111200, each of the subindustries
>>>>>> correspond to a few (or a single) NAICS code (north america
>>>>>> industry classification system).
>>>>>>
>>>>>> table 2:
>>>>>> NAICS CODE
>>>>>> 111
>>>>>> 1111
>>>>>> 111150
>>>>>> 111199
>>>>>> 111219
>>>>>> 111310
>>>>>> 111320
>>>>>> 111332
>>>>>> 111334
>>>>>> 111335
>>>>>> 111339
>>>>>> 1114
>>>>>> 111411
>>>>>> 111419
>>>>>> 111421
>>>>>> 111422
>>>>>> 111920
>>>>>> 111930
>>>>>> 111940
>>>>>> 111998
>>>>>>
>>>>>> if I enforce exact match, then table 2  111150 matches with table 1
>>>>>> 1111B0,    table 2 1112l9 may be matched with  111200 table 1 I-O.
>>>>>>
>>>>>> My question :
>>>>>> 1.could you give a sample code/function to do exact match? note,
>>>>>> if first 5digit match, and drop last 0 (naics), we consider that a
>>>>>> match 2. if I want to increase match, how could I change the
>>>>>> program to do 4 digit match
>>>>>>
>>>>>> thanks a bunch,
>>>>>>
>>>>>> --
>>>>>> Best,
>>>>>> Rochelle
>>>>>>
>>>>>> *
>>>>>> *   For searches and help try:
>>>>>> *   http://www.stata.com/help.cgi?search
>>>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>>>>> *   http://www.ats.ucla.edu/stat/stata/
>>>>>>
>>>>>>
>>>>>> *
>>>>>> *   For searches and help try:
>>>>>> *   http://www.stata.com/help.cgi?search
>>>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>>>>> *   http://www.ats.ucla.edu/stat/stata/
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> -Best,
>>>>> R
>>>>>
>>>>> *
>>>>> *   For searches and help try:
>>>>> *   http://www.stata.com/help.cgi?search
>>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>>>> *   http://www.ats.ucla.edu/stat/stata/
>>>>
>>>> *
>>>> *   For searches and help try:
>>>> *   http://www.stata.com/help.cgi?search
>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>>> *   http://www.ats.ucla.edu/stat/stata/
>>>
>>>
>>>
>>> --
>>> -Best,
>>> R
>>>
>>> *
>>> *   For searches and help try:
>>> *   http://www.stata.com/help.cgi?search
>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>> *   http://www.ats.ucla.edu/stat/stata/
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/
>
>
>
> --
> -Best,
> R
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/
>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/



--
-Best,
R

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/



*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

Follow-Ups:
- Re: st: RE: match variable across two tables
  - From: Rongrong Zhang <[email protected]>

References:
- st: match variable across two tables
  - From: Rongrong Zhang <[email protected]>
- st: RE: match variable across two tables
  - From: "Sarah Edgington" <[email protected]>
- Re: st: RE: match variable across two tables
  - From: Rongrong Zhang <[email protected]>
- Re: st: RE: match variable across two tables
  - From: Robert Picard <[email protected]>
- Re: st: RE: match variable across two tables
  - From: Rongrong Zhang <[email protected]>
- Re: st: RE: match variable across two tables
  - From: Robert Picard <[email protected]>
- Re: st: RE: match variable across two tables
  - From: Rongrong Zhang <[email protected]>
- RE: st: RE: match variable across two tables
  - From: "Sarah Edgington" <[email protected]>
- Re: st: RE: match variable across two tables
  - From: Rongrong Zhang <[email protected]>

Prev by Date: Re: st: book in genetics analysis using Stata
Next by Date: st: Stata - efficiently appending 200+ files (my method takes hours)
Previous by thread: Re: st: RE: match variable across two tables
Next by thread: Re: st: RE: match variable across two tables
Index(es):
- Date
- Thread