Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# Re: st: RE: Unique Clients with Multiple IDs that need to be integrated

 From Robert Picard To statalist@hsphsun2.harvard.edu Subject Re: st: RE: Unique Clients with Multiple IDs that need to be integrated Date Thu, 11 Oct 2012 18:36:37 -0700

```Sarah is correct in pointing out that the grouping identifier is
simply the lowest value of all the alternate ids. If the identifiers
are numeric, then the following may do what is requested.

*------------------- begin example -----------------------
clear
input long (row CLIENT cid1 altid1 altid2 altid3)
1      1          1         1          .          .
2      1          1         1          .          .
3      2          2         2      65958      62781
4      2          2         2      65958      62781
5      2          2         2      65958      62781
6      2          2         2      65958      62781
7      3          3    123171          3          .
8      3          3    123171          3          .
9      3          3    123171          3          .
10     3          3    123171          3          .
11     3          3    123171          3          .
12     3          3    123171          3          .
13     4         11        11   15262202          .
14     4         11        11   15262202          .
15     5         13     66234         13     172805
16     5         13     66234         13     172805
17     5         13     66234         13     172805
18     5         13     66234         13     172805
19     6         31  14997478         31     190690
20     6         31  14997478         31     190690
21     6         31  14997478         31     190690
22     6         31  14997478         31     190690
98    10         31        31         22          .
99    11         41        22         31          .
67821  14922   65958          2     65958      62781
67822  14922   65958          2     65958      62781
67823  14922   65958          2     65958      62781
67824  14922   65958          2     65958      62781
99998 123171  123171          3    123171          .
99999 123171  123171          3    123171          .
end

gen order = _n
egen lowid = rowmin(altid*)
sort lowid order
by lowid: gen newid = cid1[1]
order newid
list , noobs sepby(newid)
*--------------------- end example -----------------------

On Thu, Oct 11, 2012 at 5:05 PM, Sarah Edgington <sedging@ucla.edu> wrote:
> Emily,
> There are a few ways of going about this, I think.  Which one makes the most
> sense depends a bit on what pieces of information you have.  I'm not sure I
> completely understand what form your data is in.  So this is my best guess
> about something that might work based on what I think you have.
>
> Your example looks like it contains multiple records per person and that the
> cid1 variable is the id that was assigned to the patient at the time of that
> encounter not the unique client id number that you need to match them to
> other data sources.  It seems like some substantial effort has been made to
> figure out which IDs go together, but that doesn't really help you figure
> out which is right cid1 to use for all the encounters.  Do you happen to
> have a master list of data that contains the right cid and then the list of
> alt ids?  If you have a list of the final chosen client ids already
> associated with their alt ids, there may be other strategies that will be
> more efficient than what I'm going to suggest below.
>
> Otherwise, it looks to me like what you have for each encounter is a
> client's id at the time of that encounter and a list of IDs that client used
> during other encounters.  You know that one of those IDs is the correct
> final id.  But when looking at a single encounter you have no way of
> determining which one is the correct one without some other source of
> information.  The first important thing is to be sure you know exactly how
> to identify which is the real final id.  (I think, from your description,
> the rule is that the first cid1 to appear in your data is the real one,
> right?).
>
> The first thing I'd do is create an index variable so you know for sure you
> can recreate the original sort order of the data:
> -gen index=_n-
>
> Next I'd create a new variable that contains the lowest of the ids from the
> list of all the possible ids for that client.  I'm assuming your id
> variables are numeric.  If they aren't this will require some more thought
> to deal with strings, but logic is just that you want to have a rule that
> allows you to identify a single constant id from the list for all the
> encounters.  Don't worry if that isn't the "right" id.  We'll fix it in a
> moment.
>
> So now you have cid1, alt1-alt10 and newid for each encounter.  Remember
> newid is just the minimum of the non-missing id fields.
>
> If you sort by newid you should group all the records for a given client
> together, no matter what order the original id choices appeared in.
> Next step is to identify which of the records for the client appeared first
> in your data set.  You'll use the index variable you created above.  Once
> you identify the first record for a given client (based on the original sort
> order) you'll know that clients "real" id number.
>
> -sort newid index-
> -by newid: gen realid=cid1 if _n=1
>
> Now you just need to copy that realid value to the rest of the records for
> that client.  (There are probably other more elegant ways to do this; this
> is just my default)
> -by newid: egen finalid=sum(realid)
>
> The finalid variable should now contain the client id number that matches to
>
> Hope that helps.
>
> -Sarah
>
>
> -----Original Message-----
> From: owner-statalist@hsphsun2.harvard.edu
> [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Emily
> Putnam-Hornstein
> Sent: Thursday, October 11, 2012 3:20 PM
> To: statalist@hsphsun2.harvard.edu
> Subject: st: Unique Clients with Multiple IDs that need to be integrated
>
> I am working with a large substance abuse treatment dataset with roughly
> 900,000 observations for approximately 200,000 unique clients.
>
> In working with these data (these records were being linked to a second
> dataset), it was discovered that a number of clients had been inadvertently
> assigned a new id when returning for a follow-up treatment service. In other
> words, the agency thought it was linking/working with dataset of unique
> clients, only to discover that many records were clearly for the same
> individual.
>
> To complete the data linkages the decision was made to maintain one id/one
> record for each client, but to create up to 10 new variables with the
> alternative ids listed so that after the linkages were completed, data for
> these other ids could be pulled to construct a longitudinal record of
> service encounters for a given individual. They were attached to a client’s
> record as depicted in the data below. So, for example, the first client
> (cid1=1) was not found to have any duplicate ids, so fields altid1=cid1 and
> altid2-altid10 are all empty (I have only shown alt1-altid3 for simplicity).
> But for the second client (cid1=2) it was discovered that this client has
> information under two other ids “65958” and “62781”
>
> To be able to construct a longitudinal record of service encounters for each
> client I need to recode the records currently appearing as cid1=65958 and
> cid1=62781 to cid1=2 (and I need to do this for up to 10 ids for some
> clients). Additionally, the correct client id is listed as an alternative id
> for those records I want to recode. So for cid1=65958, altid1=2 and
> altid2=62781 (I have also shown this below, under “Later Data”).
>
> Finally, I mentioned the data linkages earlier because it is important that
> I maintain the first (as sorted) cid1 as that is the id that links to other
> data source – although it is not necessarily the id associated with a
>
> I will not burden the listserv with the completely inelegant code I have
> been working on for the last two days (consisting of attempts to reshape,
> merge subsets of records, etc. and has gotten me nowhere). Nor will I
> attempt to justify decisions that were made in creating this dataset that
> was given to me…but would greatly appreciate any and all guidance! Thank you
> and apologies in advance if I have explained this poorly. Emily
>
>
> +-----------------------------------------------------------------------
> +----
> -----------------+
>         | CLIENT   modality_type        adm1        dis1       cid1
>   altid1     altid2     altid3 |
>
> |-----------------------------------------------------------------------
> |----
> -----------------|
>      1. |      1       Admission   12 Dec 06           .          1
> 1          .          . |
>      2. |      1       Discharge   12 Dec 06   31mar2007          1
> 1          .          . |
>
> |-----------------------------------------------------------------------
> |----
> -----------------|
>      3. |      2       Admission   08 Mar 07           .          2
> 2      65958      62781 |
>      4. |      2       Discharge   08 Mar 07   17mar2007          2
> 2      65958      62781 |
>      5. |      2       Admission   14 Jun 07           .          2
> 2      65958      62781 |
>      6. |      2       Discharge   14 Jun 07   30jun2007          2
> 2      65958      62781 |
>
> |-----------------------------------------------------------------------
> |----
> -----------------|
>      7. |      3       Admission   14 Jul 06           .          3
> 123171          3          . |
>      8. |      3       Discharge   14 Jul 06   03aug2006          3
> 123171          3          . |
>      9. |      3       Admission   02 Oct 06           .          3
> 123171          3          . |
>     10. |      3       Discharge   02 Oct 06   22oct2006          3
> 123171          3          . |
>     11. |      3       Admission   30 Apr 07           .          3
> 123171          3          . |
>     12. |      3       Discharge   30 Apr 07   20may2007          3
> 123171          3          . |
>
> |-----------------------------------------------------------------------
> |----
> -----------------|
>     13. |      4       Admission   07 Mar 07           .         11
> 11   15262202          . |
>     14. |      4       Discharge   07 Mar 07   27mar2007         11
> 11   15262202          . |
>
> |-----------------------------------------------------------------------
> |----
> -----------------|
>     15. |      5       Admission   20 Dec 06           .         13
> 66234         13     172805 |
>     16. |      5       Discharge   20 Dec 06   07jan2007         13
> 66234         13     172805 |
>     17. |      5       Admission   01 Mar 07           .         13
> 66234         13     172805 |
>     18. |      5       Discharge   01 Mar 07   12mar2007         13
> 66234         13     172805 |
>
> |-----------------------------------------------------------------------
> |----
> -----------------|
>     19. |      6       Admission   13 Feb 07           .         31
> 14997478         31     190690 |
>     20. |      6       Discharge   13 Feb 07   06mar2007         31
> 14997478         31     190690 |
>     21. |      6       Admission   22 Jun 10           .         31
> 14997478         31     190690 |
>     22. |      6       Discharge   22 Jun 10   23sep2010         31
> 14997478         31     190690 |
>
> (LATER DATA)
>
>
> +-----------------------------------------------------------------------
> +----
> ----------+
>         | CLIENT   mo~y_type        adm1        dis1    cid1       altid1
> altid2     altid3 |
>
> |-----------------------------------------------------------------------
> |----
> ----------|
> 67821. |  14922   Admission   19 Dec 06           .   65958          2
> 65958      62781 |
> 67822. |  14922   Discharge   19 Dec 06   30jan2007   65958          2
> 65958      62781 |
> 67823. |  14922   Admission   08 Feb 07           .   65958          2
> 65958      62781 |
> 67824. |  14922   Discharge   08 Feb 07   12mar2007   65958          2
> 65958      62781 |
>
> +-----------------------------------------------------------------------
> +----
> ----------+
>
>
>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/
>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/
```