
Re: st: AW: Merging database

From   "Eric A. Booth" <>
Subject   Re: st: AW: Merging database
Date   Wed, 29 Apr 2009 13:14:00 -0500

On Apr 29, 2009, at 12:03 PM, Nick Cox wrote:

I have various comments on this code.

1. -foreach x in V*- won't work. Eric is probably thinking of
-foreach x of var V*- but in this case -foreach v in V1 V2 V3- takes no more thought.

-foreach x in V*- works fine for me. It processes all the variables V1-V3.
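
Both observations can be reconciled. With -in-, -foreach- treats V* as a single literal list element, so the loop body runs once with `x' holding V*; commands that accept a varlist (such as -recode- and -tostring-) then expand the wildcard themselves. With -of varlist-, the wildcard is expanded first and the body runs once per variable. A minimal sketch, in which the two-row dataset and the counter names n_in and n_of are made up purely for illustration:

```stata
clear
input V1 V2 V3
1 . 1
2 . 2
end

* -in- : V* is one literal element, so the body runs once
local n_in = 0
foreach x in V* {
	local ++n_in
}

* -of varlist- : V* is expanded to V1 V2 V3, so the body runs three times
local n_of = 0
foreach x of varlist V* {
	local ++n_of
}

display "iterations with -in-: `n_in'   with -of varlist-: `n_of'"
```

The difference matters as soon as the loop body uses `x' somewhere that a single variable name is required, for example in -generate new_`x' = ...-.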

2. Eric wants to -recode- missings and also concatenate the identifiers. If so, it is easier to go

egen V_combined = concat(V1 V2 V3), p(_)
replace V_combined = subinstr(V_combined, ".", "x", .)


Note that using 99, even temporarily, is dangerous unless one can be sure that 99 is not a legitimate identifier. In any case, why recode? A variable with values like "1 . ." is a satisfactory composite -- if that is what is needed.
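
For reference, a minimal sketch of this suggestion run against the example data from the original question. The -clear-, -input- and -list- lines are added here only to make it self-contained. It relies on -concat()- writing a numeric missing as ".", which is what lets -subinstr()- find it.

```stata
clear
input V1 V2 V3
1 . 1
2 . 2
3 3 3
4 4 .
. 5 5
6 . 6
end

* Join the three identifiers into one string, separated by underscores
egen V_combined = concat(V1 V2 V3), punct(_)

* Replace every "." with "x"; the final . asks for all occurrences
replace V_combined = subinstr(V_combined, ".", "x", .)

list, clean
```

This assumes the identifiers are integers: with non-integer values, -subinstr()- would also replace decimal points.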

Thanks, Nick...this is very helpful.
I wasn't aware of subinstr(), and I couldn't get regexr() to work with the missing value "." (hence the use of 99). You're right about that too: using 99 is dangerous. In the past I've used a much larger number that was clearly outside the range of the dataset, but using subinstr() without the recode is a far better approach.


Eric A. Booth
Public Policy Research Institute
Texas A&M University
Office: +979.845.6754
Fax: +979.845.0249

Eric A. Booth wrote:

To add to Jochen's comment:

If you were hoping to have a new 'ID' variable that keeps the
information from all the ID variables V1, V2, and V3, you could create
a string variable. Here is an example:


clear
input V1  V2  V3
1     .       1
2     .       2
3     3       3
4     4       .
.     5       5
6     .       6
end

foreach x in V* {
	recode `x' (.=99)  // <-- So that -regexr- isn't tripped up later
	tostring `x', replace
}
gen str10 v_combined = V1+"_"+V2+"_"+V3
// regexr() replaces only the first match in each string
gen v_combined2 = regexr(v_combined, "99", "x")
// -sencode- is user-written: ssc install sencode
sencode v_combined2, gene(uniqueID) gsort(+v_combined2) label(id)

On Apr 29, 2009, at 11:36 AM, Jochen Späth wrote:

Hello Sergio,

I'm not quite sure what your problem is; maybe it would help if
you were a little more precise.

Below, I assume that the example you gave is AFTER your three data
sets have been merged, with V1 coming from the first, V2 from the
second and V3 from the third, and with V1, V2 and V3 all denoting the
same ID. If this is the case you could

-replace V1 = V2 if V1 == . & V2 != .-
-replace V1 = V3 if V1 == . & V2 == . & V3 != .-
-count if V1 == .- /* should return 0, otherwise there are
observations in your data that are not identified by any
of your three ID variables.*/
-drop V2 V3- /* of course, only if you got all IDs caught in V1 */
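
The same idea as a self-contained sketch on the example data from the question. The names id and id_check are made up for illustration, and the sketch assumes, as above, that the three variables carry the same ID whenever more than one of them is nonmissing. It builds a new variable rather than overwriting V1 and cross-checks the result against -egen-'s rowfirst() function, which returns the first nonmissing value across a varlist.

```stata
clear
input V1 V2 V3
1 . 1
2 . 2
3 3 3
4 4 .
. 5 5
6 . 6
end

* Check the assumption: wherever two IDs are both present, they agree
assert V1 == V2 if !missing(V1, V2)
assert V1 == V3 if !missing(V1, V3)
assert V2 == V3 if !missing(V2, V3)

* Fill a single ID from V1, then V2, then V3
generate id = V1
replace id = V2 if missing(id)
replace id = V3 if missing(id)

* Equivalent one-liner: first nonmissing value across the three variables
egen id_check = rowfirst(V1 V2 V3)
assert id == id_check

* Should report 0: no observation is left without an ID
count if missing(id)
```

Keeping V1-V3 until the checks pass makes it easy to inspect any observation where the IDs disagree before dropping anything.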


-----Original Message-----
From: [] On behalf of "SERGIO M. AFCHA CHÁVEZ"
Sent: Wednesday, 29 April 2009 17:55
Subject: st: Merging database

Dear Statalisters,

I have a little problem merging a database. I have variables for 3
years showing an ID:

V1   V2   V3
1    .    1
2    .    2
3    3    3
4    4    .
.    5    5
6    .    6

I need only one ID variable. How can I obtain one column with all the
ID numbers?
