Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: st: find the corresponding values between two variables

From	Joe Canner <[email protected]>
To	"[email protected]" <[email protected]>
Subject	RE: st: find the corresponding values between two variables
Date	Fri, 18 Apr 2014 00:53:55 +0000

Sara,

I understand the problem now; thanks for the extra explanation.  I should have realized there would be extra records at the end corresponding to unmatched pairs.

I think you can safely drop all the missings before the final reshape.  They don't have to be "matched" with anything as Stata will automatically fill in missing for those cases that have a c_est without a c_trab (13,35,40).  Unfortunately, I don't have access to Stata at the moment (nor to the program I used to test out my solution), so I can't confirm that this modification will work with the expanded test set you provided.  If you still can't get it to work I will able to do more with this tomorrow.

Regards,
Joe
________________________________________
From: [email protected] [[email protected]] on behalf of Sara Neto Machado [[email protected]]
Sent: Thursday, April 17, 2014 7:04 PM
To: [email protected]
Subject: Re: st: find the corresponding values between two variables

Joe,

Yes, I run your code exactly like you suggested me. I had no -id-
variable before the gen id=_n generated by your method. I think I
understand why there is an error with reshaping and is due to my
initial inquiry which is the reason why I am having so much trouble to
solve this issue.
In order to better explain my theory about the problem let me send
again the example that I provide in the beginning with more detail
information that may help  us with the reasoning issue.

thus, for the sake of illustration, let's consider the following:
c_est----w_est---c_trab----w_trab
10----------10-----------10-------11
11-----------3------------11-------3
13-----------4-------------17-------5
17------------5-------------18-------7
18---------10----------------23-----3
23-----------5----------------25-----6
25----------9-----------------37-----3
35---------5------------------"."-----"."
37----------6----------------"."------"."
40----------4-----------------"."-----"."

c_"something" represents a code of a company and w_"something"
represents the number of workers for that exactly company in a given
year. Thus, the values of c_Est and c_trab  when doing the comparison
should be equal and the value of w_est and w_trab should be equal as
well but that is not happening in reality..
So I need to understand where those differences appear as I notice
they come from the worker's side as well as from the companies side
since there are some companies missing in c_Trab. Stata is defining
missing value for those cases at the end of the database because the
data is mismatch along the database.

By taking into account this example, we have 3 missing values that are
due to c_est == 13 & c_est == 35 & c_est == 40 observations that do
not have a match in c_trab like their were supposed too.
If we apply your reshape syntax code in our example it will give a
result of 3 missing values for -id-. The problem is always the
mismatch situation with companies that I do not know how to
indentify....

In my real database I have 28 cases of mismatch and thus 28 missing
values for -id-. However, and since missing values always appear in
the end I do not know to which companies they belong.

If you have any more suggestions for this problem I will be much
appreciated. Thank you for your valuable input.

regards,
Sara

2014-04-17 15:22 GMT+01:00 Joe Canner <[email protected]>:
> Sara,
>
> Based on the results of -reshape error- it looks like you have a bunch of observations for which -id- is missing.  This, in turn, causes problems with duplicate -j- values.  Did you generate -id- using my suggestion (gen id=_n) or did you already have a variable called -id- in your dataset?  My method should not have resulted in  missing values for -id-.  If you are using your own -id- variable you will need to figure out why there are missing values and either fix them or get rid of them.
>
> Regards,
> Joe
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Sara Neto Machado
> Sent: Wednesday, April 16, 2014 8:28 PM
> To: [email protected]
> Subject: Re: st: find the corresponding values between two variables
>
> Dear Nick and Joe,
>
> For Nick:
> initially I did something similar of what you have proposed
> gen diff4 = w_est - w_trab if c_est != c_trab
> to understand how much was the difference between the cases that
> didn't match. However I found the result to be a lot bigger that was
> supposed to, therefore I notice a new problem: the mismatch that Joe
> correclty understood from my previous example sample. I know that I
> have 28 cases that do not match, however they are all over the
> database and I need to identify them.
>
> Taking into account Joe's solution:
> I read about the reshape syntax before but I though it woulnd't apply
> for my case because it doesn´t allow me to isolate the mismatch cases
> instead I need to eliminate those which I do not want to do. However,
> I test all your code and an error occur (see result below, please) due
> to the exactly 28 mismatch that I state previously and thus I continue
> to not know how to isolate them since the 28 missing values appear in
> the end which is not correct since they appear all over the
> database....
>
> there isn't any other syntax that just do some kind of sorting with
> -if- statements to allow to indentify this mismatch? I know that we
> have  - sort by - but also does no good for my problem.
>
> STATA RESULT:
> reshape wide c_ w_, i(id) j(j) string
> (note: j = emp est trab)
> j not unique within id;
> there are multiple observations at the same j within id.
> Type "reshape error" for a listing of the problem observations.
> . reshape error
> (note: j = emp est trab)
>
> i (id) indicates the top-level grouping such as subject id.
> j (j) indicates the subgrouping such as time.
> The data are in the long form;  j should be unique within i.
>
> There are multiple observations on the same j within id.
>
> The following 28 of 1049448 observations have repeated j values:
>
>          +-----------+
>          | id      j |
>          |-----------|
> 1049421. |  .   trab |
> 1049422. |  .   trab |
> 1049423. |  .   trab |
> 1049424. |  .   trab |
> 1049425. |  .   trab |
>          |-----------|
> 1049426. |  .   trab |
> 1049427. |  .   trab |
> 1049428. |  .   trab |
> 1049429. |  .   trab |
> 1049430. |  .   trab |
>          |-----------|
> 1049431. |  .   trab |
> 1049432. |  .   trab |
> 1049433. |  .   trab |
> 1049434. |  .   trab |
> 1049435. |  .   trab |
>          |-----------|
> 1049436. |  .   trab |
> 1049437. |  .   trab |
> 1049438. |  .   trab |
> 1049439. |  .   trab |
> 1049440. |  .   trab |
>          |-----------|
> 1049441. |  .   trab |
> 1049442. |  .   trab |
> 1049443. |  .   trab |
> 1049444. |  .   trab |
> 1049445. |  .   trab |
>          |-----------|
> 1049446. |  .   trab |
> 1049447. |  .   trab |
> 1049448. |  .   trab |
>          +-----------+
>
> (data now sorted by id j)
>
> thanks in advance!
> Kind regards,
> Sara
>
> 2014-04-16 20:00 GMT+01:00 Joe Canner <[email protected]>:
>> The problem as I understand it (which may not be correct) is that the entry with c_est==13 is not just a problem for that observation but its presence makes all of the rest of the observations mismatched as well.
>>
>> What might help here is to -reshape- the dataset so that all (c_est,c_trab) pairs can be identified and all unmatched cases eliminated.  Or, if you'd rather, just reshape again based on the c_ variables and make the mismatches have a missing partner. Something like:
>>
>> gen id=_n
>> reshape long w_ c_, i(id) j(j) string
>> replace id=c_
>> reshape wide c_ w_, i(id) j(j) string
>>
>> This assumes that your c_ variables are unique.  If not, you will have to modify this or do something else entirely.
>>
>> Regards,
>> Joe Canner
>> Johns Hopkins University School of Medicine
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Nick Cox
>> Sent: Wednesday, April 16, 2014 2:38 PM
>> To: [email protected]
>> Subject: Re: st: find the corresponding values between two variables
>>
>> Not clear what you want (or what you tried: you show no code) but does
>>
>> gen diff = w_est - w_trab if c_est == c_trab
>>
>> or
>>
>> gen OK = c_est == c_trab
>>
>> edit if OK
>>
>> help?
>> Nick
>> [email protected]
>>
>>
>> On 16 April 2014 19:31, Sara Neto Machado <[email protected]> wrote:
>>> Dear all,
>>>
>>> c_est----w_est---c_trab----w_trab
>>> 10----------10-----------10-------11
>>> 11-----------3------------11-------3
>>> 13-----------4-------------17-------5
>>> 17------------5-------------18-------7
>>> 18---------10----------------23-----3
>>> 23-----------5----------------25-----6
>>>
>>> my aim is to perform the differences between w_est and w_trab for the
>>> same values of c_est and c_trab. However, I have along the dataset as
>>> well as values that do not coincide in c_trab and c_est (Eg from the
>>> sample: 13) that "ruins" the sorting between those columms. I want the
>>> 13 to appear like missing values on the same line of c_trab. I am
>>> trying to search any syntax that suits my purpose and nothing relevant
>>> came up. Maybe there are other alternatives that I am not seeing now..
>>>
>>> Can anyone help me? much appreciated!
>>>
>>> regards,
>>> Sara
>>> *
>>> *   For searches and help try:
>>> *   http://www.stata.com/help.cgi?search
>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>> *   http://www.ats.ucla.edu/stat/stata/
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

Follow-Ups:
- Re: st: find the corresponding values between two variables
  - From: Sara Neto Machado <[email protected]>

References:
- st: find the corresponding values between two variables
  - From: Sara Neto Machado <[email protected]>
- Re: st: find the corresponding values between two variables
  - From: Nick Cox <[email protected]>
- RE: st: find the corresponding values between two variables
  - From: Joe Canner <[email protected]>
- Re: st: find the corresponding values between two variables
  - From: Sara Neto Machado <[email protected]>
- RE: st: find the corresponding values between two variables
  - From: Joe Canner <[email protected]>
- Re: st: find the corresponding values between two variables
  - From: Sara Neto Machado <[email protected]>

Prev by Date: st: Variable not in covariant ds
Next by Date: Re: st: Variable not in covariant ds
Previous by thread: Re: st: find the corresponding values between two variables
Next by thread: Re: st: find the corresponding values between two variables
Index(es):
- Date
- Thread