
Re: st: reading a txt file that loops


From   David Kantor <kantor.d@att.net>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: reading a txt file that loops
Date   Sat, 16 Apr 2011 23:31:10 -0400

At 08:35 AM 4/16/2011, Sears Generic wrote:
Are there any shortcuts to reading a data file that has the following format
other than to reorganize the data before importing?  The data file is for
population by year by geographic location (e.g. United States, Indiana, then
3 counties in Indiana).  "FIPS" is a unique identifier for each county.  The
problem is that the text file loops (i.e. only provides 4 decades of data
before starting over) on a new line.  In the example below I've reduced the
issue to the United States, Indiana, and 3 counties, but the full dataset
has every county for every state so the looping does not recur in a
consistent way.  Any suggestions would be appreciated.


FIPS        1990       1980       1970       1960
00000  248709873  226545805  203211926  179323175 United States

18000    5544159    5490224    5193669    4662498 Indiana
18001      31095      29619      26871      24643 Adams County
18003     300836     294335     280455     232196 Allen County
18005      63657      65088      57022      48198 Bartholomew County

FIPS        1950       1940       1930       1920
00000  151325798  132164569   12320262  106021537 United States

18000    3934224    3427796    3238503    2930390 Indiana
18001      22393      21254      19957      20503 Adams County
18003     183722     155084     146743     114303 Allen County
18005      36108      28276      24864      23887 Bartholomew County

It seems that you have data lines and header lines. You need to have a dictionary that accommodates both kinds of lines. Then as you read in the data, the variables for a data line will be meaningless for header lines, and vice-versa.
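
As a rough sketch of that idea, -infix- with inline column positions can read both kinds of lines into the same variables. The file name pop.txt, the variable names (fips, v1-v4, name), and the column positions are all my assumptions, guessed from the sample as posted; check them against the real file.

```stata
* Sketch only (run from a do-file): file name, variable names, and column
* positions are assumptions based on the posted sample.
* On header lines, fips holds "FIPS" and v1-v4 hold the years;
* on data lines, v1-v4 hold the population counts.
infix str fips 1-5 long v1 6-16 long v2 17-27 long v3 28-38 long v4 39-49 ///
      str name 51-90 using "pop.txt", clear

* Discard any blank lines that were read in as empty observations
drop if mi(fips)
```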

Find a way to determine which line-type each record is. Maybe you test whether the first variable is "FIPS" to signify a header line. Furthermore, there are several different types of header lines; I see two here, but maybe there are more. The two I see are...
1: for 1990, 1980, 1970, 1960
2: for 1950, 1940, 1930, 1920

Create a variable that indicates which type of header line is present. Then carry that value forward over the subsequent data lines. You can use
-replace headertype = headertype[_n-1] if mi(headertype) & ~mi(headertype[_n-1])-
assuming that headertype is initially missing for data lines.
You can also use -carryforward- from SSC.
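
A minimal sketch of this step, assuming the file is in memory with a string variable fips (containing "FIPS" on header lines) and a numeric variable v1 that holds the first year on header lines. Both names are hypothetical, and using the first year as the header type is just one convenient choice.

```stata
* Sketch only: -fips- and -v1- are assumed variable names.
* Use the first year on each header line (e.g. 1990 or 1950) as the type;
* headertype stays missing on data lines.
gen int headertype = v1 if fips == "FIPS"

* -replace- works down the observations in order, so each data line
* picks up the value from the nearest header line above it.
replace headertype = headertype[_n-1] if mi(headertype) & !mi(headertype[_n-1])
```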

Now save as a tempfile.

Loop through the headertypes; for each headertype,
 -use- the dataset (the tempfile), restricted to observations where headertype == the desired type;
 keep only the variables that pertain to data lines, plus headertype;
 based on headertype, rename the variables that contain the population to something meaningful such as pop1990, pop1980, etc. (the numeric suffixes are necessary if you are to do the reshape in the next step);
 here, you may want to -reshape long-;
 -save- under a tempfile name that you can reconstruct later (e.g., t1, t2 for headertype 1 & 2).
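
The loop might look something like the following. This is a sketch under several assumptions: the variable names fips, name, v1-v4 are hypothetical, headertype holds the first year of each block (e.g. 1990), every block lists four consecutive decades with the newest first, and fips is unique within a header type.

```stata
* Sketch only; run inside one do-file so the tempfiles survive.
* Assumed (hypothetical) variables: fips, name, v1-v4, headertype.
drop if fips == "FIPS" | mi(fips)      // keep data lines only
tempfile all
save `all'

levelsof headertype, local(types)
foreach t of local types {
    use if headertype == `t' using `all', clear
    keep fips name v1-v4 headertype
    forvalues j = 1/4 {
        * Assumes four consecutive decades per block, newest first
        local yr = `t' - 10*(`j' - 1)
        rename v`j' pop`yr'
    }
    drop headertype
    reshape long pop, i(fips) j(year)
    * One tempfile per header type, reachable later as `t1990', `t1950', ...
    tempfile t`t'
    save `t`t''
}
```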

Finally, pull these files together:
 -use- the first of the latter tempfiles (maybe `t1');
 then loop through the remaining tempfiles and
  -append- them if you did the -reshape long- step as mentioned above;
  -merge- them (on FIPS) otherwise.
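
For the -append- case, a sketch might be as follows. It assumes a local macro types listing the header types (e.g. "1950 1990") and tempfile macros named t followed by the type (t1950, t1990), all defined earlier in the same do-file; these names are hypothetical.

```stata
* Sketch only: -types- and the t<type> tempfile macros are assumed to
* exist already in this do-file, each tempfile holding long-format data.
local first : word 1 of `types'
use `t`first'', clear
foreach t of local types {
    if `t' != `first' {
        append using `t`t''
    }
}
sort fips year
```

If you skipped the reshape and kept the files wide, something like -merge 1:1 fips using ...- in place of the -append- would be the analogous step.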

HTH
--David
