Statalist The Stata Listserver


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: Changing 2 columns of repetitive ASCII data into many columns


From   wgould@stata.com (William Gould, Stata)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Changing 2 columns of repetitive ASCII data into many columns
Date   Fri, 12 May 2006 09:04:19 -0500

Simon Wheeler <simon.wheeler@kcl.ac.uk> has a data-input question:

> I have recently purchased a dataset which contains food codes in 7-bit 
> ASCII format [...]
>

> The data as they stand fall into 2 columns: varname|value.  NUMB is the 
> food code - i.e. the variable that I want to sort the data by.

>  would like to have all of the other variables along the top as 
> separate columns, so that I end up with individual food codes and their 
> nutrient values - see below.  With these data I can then do dietary 
> analysis.
>
> This format goes on for several thousand different food codes, over and 
> over.  Codes are separated by the line ***
> 
> How it is now:
> 
> RECTYPE       1
> NUMB  14002
> CONST 50852
> PREV  50852
> GROUP  FA
> NAME   Apples, cooking, raw, peeled
> EDPOR 1
> WATER 87.7
> TOTNIT      0.05
> PROT  0.3
> FAT   0.1
> CHO   8.9
> KCALS 35
> KJ    151
> STAR  Tr
> TOTSUG 8.9
> GLUC  2
> FRUCT 5.9
> SUCR  1
> MALT  0
> LACT  0
> SOUTHFIB  2.2
> ENGFIB  1.6
> CELL  0.6
> SOLNCP  0.6
> INSOLNCP  0.4
> LIGNIN   Tr
> NA  2
> ***
> RECTYPE      1
> NUMB  14003
> CONST 50853
> PREV  50853
> GROUP FA
> NAME  Apples, cooking, weighed with skin and core
> EDPOR 0.73
> WATER 63.1
>
> etc
> 
> What I would like to have:
> 
> Code | Description | Group | Water | Fat | CHO | Protein | Sodium | Potassium
> 
> 14001|  Apples     | Fruit |  26   | 5.1 | 4.2 |   3.6   | 0.1    |  0.04
> 14002|  Oranges    | Fruit |  etc

There are two appraches to problems like this -- one uses Stata and the other 
uses Mata -- but the underlying logic is the same either way.

Let's use Stata.  One begins by reading each line of the data into a string
variable.  We then give Stata commands to, step by step, move the data
into the form we want.  The wonderful thing about this approach is that 
we never have to form a grand plan.  We just look at what we've got, 
fix some problem and move a little closer, and repeat the process
until we arrive.

Simon ultimately wants the data in wide form.  Given the format of 
this data, however, it will easiest to get the data into long form and 
then turn that into wide form in the standard way.  So let's set the 
following goal for ourselves, to get the data into the form:

        code    description  grouip  category  value
        14001   Apples       Fruit   Water     26
        14001   Apples       Fruit   Fat       5.1
        14001   Apples       Fruit   CHO       4.2
        ...
        14001   Oranges      Fruit   etc.


Step 1.  Infile the data
------------------------

Here is what we do:

        . infix str line 1-80 using <filename>

        . compress
        line was str80 now str41

The second line just shortens variable line.  I just made up the 
"line was str80 not str41" part.  However, if -compress- produces no 
message, then the longest line was 80 characters, and that means the 
real maximum was probabliy longer than 80.  In that case, redo the -infix-
with a longer string.

Anyway, now we have a dataset that looks like, 


            line
	---------------------------------------------
        1.  RECTYPE       1
        2.  NUMB  14002
        3.  CONST 50852
        4.  PREV  50852
        5.  GROUP  FA
        6.  NAME   Apples, cooking, raw, peeled
        7.  EDPOR 1
        8.  WATER 87.7
        9.  etc.
	---------------------------------------------

If we list the data in Stata, the lines will look right justified, but 
that's because of the display format.  Let's change it.  We type 
-describe- and learn the dipslay format is %41s, so let's change it 
to a %-41s format:

	. describe
	  <output omitted>

	. format line %-41s

Now -list- will look better.  The data I have looks left-justified, 
but I'm worried some line has a leading blank.  I worry about trailing
blanks, too, so let's get rid of any that happen to be around 

	. replace line = strtrim(line)
        (0 real changes made)


Step 2:  Get a record id
------------------------

Type the following:

	. gen long recnum = _n

This is very important.  At this stage, the meaning of the data is 
dependent on order.  If we were accidently to sort the data, it would 
become meaningless.  Now we can get back to the original order.  


Step 3:  Break line into first word and rest
--------------------------------------------

Our obsrvations contain lines like "RECTYPE   1", "NUMB 14002", etc.
Let's break out the first word, so that we have 

            first         rest
	---------------------------------------------
        1.  RECTYPE       1
        2.  NUMB          14002
        3.  CONST         50852
        4.  PREV          50852
        5.  GROUP         FA
        6.  NAME          Apples, cooking, raw, peeled
        7.  EDPOR         1
        8.  WATER         87.7
        9.  etc.
	---------------------------------------------

Getting the first word is easy,  we could just code 

	. gen first = word(line, 1)

but getting the rest of the line, spaced exactly as it was originally, is more
difficult.  One way to proceed

	. gen blank = strpos(line, " ")

	. gen str first = strtrim(substr(line, 1, blank)) if blank 
	. replace first = line if blank==0

	. gen str rest = strtrim(substr(line, blank, .)) if blank

	. drop blank

Now let's list a little of the data, make sure we did that right, and then 

	. drop line


Step 4:  Clean up
----------------

Can we drop some lines?  Simon didn't tell us about RECTYPE lines.  I'm 
going to assume they are irrelevant and get rid of them.

	. drop if first=="RECTYPE"

Maybe Simon left somthing unsaid about RECTYPE and we'll need to get rid of
the RECYTPES other than 1, including recrods that follow them.  If so, skip
this step and do that later.  I just want to get rid of any single-line junk I
can right now.


Step 5:  Identifier variable
----------------------------




© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index