st: RE: Data manipulation question

From   "Nick Cox"
Subject   st: RE: Data manipulation question
Date   Fri, 14 Nov 2003 10:59:29 -0000

This example suggests various kinds of problems.

Whenever CarManufacturer is empty, you could
pull across the value from the second variable
like this:

replace CarManufacturer = CarModel if mi(CarManufacturer)

but that leaves e.g. "Ford Excursion" as both CarManufacturer
and CarModel, which replaces one problem by another.

I would try another way: concatenate all these into a single
variable, and then start again.

That is, either

gen Car = CarManufacturer + " " + CarModel + " " + CarEngine

or

egen Car = concat(CarManufacturer CarModel CarEngine), p(" ")

Then two simple clean-ups are to trim spaces

replace Car = trim(Car)

and perhaps to remove isolated periods

replace Car = subinstr(Car, " .", " ",.)
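As a rough sketch of the steps so far, the following types in a few
rows from the example and builds the single variable. It assumes all
three variables are strings; the -input- block is only illustrative
data. Trimming is done last here, so that removing " ." does not
leave a trailing space behind.

```stata
clear
* illustrative rows copied from the example in the question
input str12 CarManufacturer str20 CarModel str8 CarEngine
"Ford"        "Mustang"           ""
"Honda"       "Civic"             "1.4 I S"
""            "Ford Excursion"    "."
""            "BMW 520 I Touring" "520 I"
"Alfra Romeo" "Spider"            ""
end

* glue the three pieces together, separated by single spaces
gen Car = CarManufacturer + " " + CarModel + " " + CarEngine

* drop isolated periods, then strip leading and trailing spaces
replace Car = subinstr(Car, " .", " ", .)
replace Car = trim(Car)

list Car
```

Note that "1.4 I S" is untouched, because its period is not preceded
by a space.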

Now it starts getting serious. Two tools that might come
in handy are the -word()- function and the -split- command.

split Car

will -split- the variable into several, each containing
one "word".
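A small sketch of both tools, assuming Car has already been built and
cleaned as above (the rows typed in here are illustrative). -split-
creates Car1, Car2, and so on, while word(Car, 1) picks out just the
first word without creating the rest.

```stata
clear
* illustrative values of the combined variable
input str30 Car
"Ford Mustang"
"BMW 520 I Touring 520 I"
"318"
"Alfra Romeo Spider"
end

* one new variable per space-separated word: Car1, Car2, ...
split Car

* the first word only, via the word() function
gen first = word(Car, 1)

list Car1 Car2 first
```

Here first and Car1 hold the same values; word() is handy when only
one particular word is of interest.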

tab Car1

or

levels Car1

will expose problems like "318" in obs 8
and the inconsistency between "Alfa" and "Alfra".
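The question asked for a way to check records against the distinct
values of CarManufacturer. One possible sketch, not necessarily the
best route, uses -levelsof- (the successor to -levels- in later
versions of Stata) to put those values in a local macro and then
loops over them. It assumes a recent Stata with strpos(), and the
rows and the macro name makers are illustrative.

```stata
clear
* illustrative rows: three records have the make inside CarModel
input str12 CarManufacturer str20 CarModel
"Ford"  "Mustang"
"BMW"   "320 I"
""      "Ford Excursion"
""      "BMW 520 I Touring"
""      "318"
end

* distinct nonmissing manufacturers go into local macro makers
levelsof CarManufacturer, local(makers)

foreach m of local makers {
    * fill in the make when CarModel starts with it as a whole word
    replace CarManufacturer = "`m'" ///
        if mi(CarManufacturer) & strpos(CarModel + " ", "`m' ") == 1
}

list
```

Records such as "318", which match no known manufacturer, stay
missing and still need fixing by hand. Misspellings such as "Alfra"
are not caught by this check either.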

You are probably going to end up with a .do
file mixing all sorts of general and detailed
fixes.


I have discovered errors in my dataset, and it seems some of my data
are recorded in the wrong variable. The variable in which the data
should have been recorded is left missing. A few examples (missing
values marked as ""):

Record  CarManufacturer  CarModel           CarEngine
1       Ford             Mustang
2       Chevrolet        Starcraft
3       Ford             Galaxy
4       Honda            Civic              1.4 I S
5       Toyota           Avensis
6       ""               Ford Excursion     .
7       ""               BMW 520 I Touring  520 I
8       ""               318
9       BMW              320 I              320 I
10      Alfra Romeo      Spider
11      ""               Alfa Romeo

What I wish to do is to search each record for an expression that
also occurs as a distinct value of CarManufacturer, and then
copy it into CarManufacturer. I have failed both in creating tests
across records and in an attempt to fetch the unique values of
CarManufacturer into an object against which I can then perform
checks. But then again, I'm no seasoned veteran in this game.
Is there any way of pulling this off in Stata?
