



st: RE: appending several files with different variable names


From   Nick Cox <n.j.cox@durham.ac.uk>
To   "'statalist@hsphsun2.harvard.edu'" <statalist@hsphsun2.harvard.edu>
Subject   st: RE: appending several files with different variable names
Date   Fri, 2 Mar 2012 11:08:40 +0000

Some of the details here are unclear, but I think the short answer is No. 

Let's backtrack and review some basics. Some of this will seem obvious but very likely not all of it. Others should be able to add to this. 

0. A golden rule is that you must always leave the original datasets as they are. 

1. This is a common and important problem but you shouldn't underestimate it. (Very recently I had the same problem with a colleague who usually works very consistently, but in putting together just four of his files the few exceptional inconsistencies took about an hour's work before we were sure that we had got it all right. Between us we have 42 years of Stata experience.) 

2. The desire to automate this is laudable but the idea that a program can somehow simulate your subject-matter knowledge and work out what should be the same variable despite different names sounds fantastical to me. More positively, if there are rules that define the inconsistencies between datasets then you can write a program to exploit them but your post does not make clear what they are. In practice, the best way to do this is by a single do-file that does all the work, which you keep revising as you discover fixes and changes that are needed until you have a script that does everything correctly. 

3. Always in appending keep track of where each block of data came from. I -generate- a variable in the first dataset (-datasetid-, say) that tells me where the data came from (the filename is often natural and convenient) and then -replace- as appropriate as each new file is -append-ed. Doing this does no harm and can save enormous frustration in trying to disentangle inconsistencies. 
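A minimal sketch of that bookkeeping, assuming three hypothetical files -survey1.dta-, -survey2.dta- and -survey3.dta- in the current directory (the file names and the -str32- width are illustrative only):

```stata
* Hypothetical file names; substitute your own.
local files survey1 survey2 survey3
local first 1
foreach f of local files {
    if `first' {
        * First file: load it and create the source identifier
        use "`f'.dta", clear
        generate str32 datasetid = "`f'"
        local first 0
    }
    else {
        * Later files: appended observations arrive with datasetid == ""
        append using "`f'.dta"
        replace datasetid = "`f'" if missing(datasetid)
    }
}
* Check that every observation is labelled with its source
tabulate datasetid, missing
```

If a numeric identifier is enough, the -generate()- option of -append- records the source of each observation without the explicit -replace-.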

4. The real nightmare is the same name being used for different things. Different names for the same thing is less of a problem. 

5. Renaming before -append-ing is one good strategy but not the only one and not necessarily the best. 
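For what it is worth, here is one way the renaming strategy might look in a do-file. It is a sketch only: the file names and the variable names -hhinc-, -income_hh- and -income- are invented for illustration. Each file is checked for a variable before renaming it, so a file that lacks a given name does not stop the loop, and everything is saved to a temporary file so that the original datasets are left as they are.

```stata
* Hypothetical names throughout: three files, and a variable that is
* called hhinc or income_hh in some files but should be income everywhere.
tempfile combined
local i 0
foreach f in survey1 survey2 survey3 {
    use "`f'.dta", clear
    foreach old in hhinc income_hh {
        * Rename only if this file really has a variable with that exact name
        capture confirm variable `old', exact
        if _rc == 0 {
            rename `old' income
        }
    }
    local ++i
    if `i' > 1 {
        * Add what has been combined so far to the file just loaded
        append using "`combined'"
    }
    * Only the temporary file is written; the originals are untouched
    save "`combined'", replace
}
```

If one file contains two of the old names, or already has -income-, the -rename- fails with an error. That is deliberate: such a file needs a human decision, not a silent overwrite.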

6. If -var1-, -var2- are the same thing under different names, and at most one of them is non-missing in any observation, then -max(var1, var2)- combines numeric variables and -var1 + var2- combines string variables, and the principle can be extended to more variables. -egen, rowtotal()- is a quick and dirty way of mapping several numeric variables with at most one non-missing value to one variable with that non-missing value; note that it returns 0, not missing, when all the variables are missing, unless you specify its -missing- option. -egen, concat()- does the same for several string variables. 
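As a sketch, suppose that after -append-ing, -income- and -hhinc- turn out to be the same numeric variable and -town- and -city- the same string variable (all four names are invented for illustration). The -assert- statements check the "at most one non-missing" condition before anything is combined:

```stata
* Hypothetical variable names; the dataset is assumed to be already appended.
* Stop with an error if any observation has both versions non-missing.
assert missing(income) | missing(hhinc)
assert missing(town) | missing(city)

* max() ignores missing arguments and returns missing only if all are missing
generate income_all = max(income, hhinc)

* Missing strings are empty, so concatenation leaves the non-missing one
generate town_all = town + city

* The -egen- equivalents, which extend easily to more than two variables;
* -missing- makes rowtotal() return missing when every argument is missing
egen income_all2 = rowtotal(income hhinc), missing
egen town_all2 = concat(town city)
```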

7. -describe ... using ...- can be very helpful: it reports on the variables in a dataset on disk without your having to load it. 
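One possible use, sketched here for two hypothetical files -survey1.dta- and -survey2.dta-, is to collect the variable names of each file with the -varlist- option of -describe- and compare them with macro list operators, so that you can see the inconsistencies before appending anything:

```stata
* Hypothetical file names; substitute your own.
foreach f in survey1 survey2 {
    * -varlist- leaves the file's variable names in r(varlist)
    quietly describe using "`f'.dta", varlist
    local vars_`f' `r(varlist)'
}

* Names found in one file but not the other, and names found in both
local only1 : list vars_survey1 - vars_survey2
local only2 : list vars_survey2 - vars_survey1
local both  : list vars_survey1 & vars_survey2

display "only in survey1: `only1'"
display "only in survey2: `only2'"
display "in both files:   `both'"
```

Names that appear in both files still need checking by eye: a shared name does not guarantee a shared meaning.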

8. In my most recent problem I found -nmissing- (SJ) and -distinct- (SJ) useful. 

Nick 
n.j.cox@durham.ac.uk 

Yogesh Uppal

I am appending multiple files, each having over 100 variables, with
the same variables having different names in some files. Some of these
variables share common strings in their names, but others do not.
Since the number of files and variables is large, I do not want to
manually identify the variables that are the same and give them a
common name. I was wondering if there is code that already exists to
take care of issues like this.

What I was leaning towards is creating a local macro for each file and
using the command findname to look for variables matching a pattern
like var*, which would be returned in r(varlist), and then running
another loop over each file to rename each variable in r(varlist) to a
specified name.

I was hoping this method would rename variables that have the common
string var*. (The problem I am facing here is that if some file does
not have a variable matching var*, my loop gives me an
error message.)

Is this the right way to go about the problem? If not, could you please
suggest something better?

