Statalist The Stata Listserver



Re: st: how does insheet determine datatypes?


From   Phil Schumm <[email protected]>
To   [email protected]
Subject   Re: st: how does insheet determine datatypes?
Date   Sat, 6 Jan 2007 21:30:21 -0600

On Jan 6, 2007, at 11:51 AM, Jens Lauritsen wrote:
I edit the raw file manually by adding a record at the top, after the variable names, or (better) use Stata to add that line:
cpr v1 v2 v3
0xx1201956 1 2 2 // this record will force the first variable to string
0101201956 1 2 2
1101201954 1 2 1
etc .... rest of records
and then read as:
insheet using myfile
drop in 1 // ...and I have the cpr variable as a string without the "fake" record

If you want to avoid having to edit your data file(s), and assuming the first row contains variable names each of which contains at least one non-numeric character, you can also simply use the -nonames- option and then use something like the following:


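insheet using myfile, nonames
foreach var of varlist _all {
    * rename each variable to the text stored in its first observation
    ren `var' `=`var'[1]'
}
* the header row has served its purpose, so remove it from the data
drop in 1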


Of course, because the header row is now read as data, this will result in *all* your data being read as strings. However, this is sometimes desirable, as for example when you need to read and append several files and want to make certain that no data are lost due to appending a numeric variable onto a string variable (or vice versa). In cases where this is not desirable, a single call to -destring- is all that is necessary to restore one or more variables to numeric.
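As a rough sketch of that last step, assume a file read this way has left every column as a string. The small dataset below is made up for illustration and reuses the variable names from the quoted example; only the columns that should be numeric are passed to -destring-.

```stata
* Hypothetical data standing in for the all-string result described above
clear
input str10 cpr str1 v1 str1 v2 str1 v3
"0101201956" "1" "2" "2"
"1101201954" "1" "2" "1"
end

* Convert only the columns that should be numeric;
* cpr is left as a string, so its leading zero is kept
destring v1 v2 v3, replace

describe
```

Leaving cpr out of the -destring- varlist is what keeps it a string here. Running -destring- on _all would convert it as well, since it contains only digits.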

I mention this only because it raises two interesting questions (at least to me). First, note that the rename above will fail if the original variable name (as it appears in the first row of the data file) is not a valid Stata name. -insheet- takes care of this for you by automatically deleting spaces and/or special characters, dropping leading digits, etc. I wonder: Is the function -insheet- uses to do this exposed? It would be very handy to have such a function available, for example in the snippet above. And although such a function is easily written in Mata, it would be nice to be able to use the same one -insheet- uses in cases where you need consistency between the two.
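One possible workaround, offered as a sketch rather than as an answer to the question: current Stata releases document a strtoname() function that turns an arbitrary string into a legal name. It is not known from this thread to be what -insheet- uses, and it behaves differently from the description above, replacing invalid characters with underscores and prefixing an underscore to a leading digit instead of deleting them. The header strings below are hypothetical.

```stata
* Hypothetical all-string data whose first row holds awkward header text
clear
input str12 v1 str12 v2
"patient id" "2007 score"
"0101201956" "17"
end

foreach var of varlist _all {
    * strtoname() maps the header text to a valid Stata name
    local newname = strtoname(`var'[1])
    rename `var' `newname'
}
drop in 1

describe
```

With these inputs the variables should end up as patient_id and _2007_score. Two headers that map to the same name would still make the rename fail, so this does not remove the need for some checking.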

Let's suppose it's not exposed. If I were to write such a function myself, I might use a regular expression to match those elements that are not valid and remove them. The problem is that, since -regexr(s1,re,s2)- replaces only the first match of re in s1, you can't do this with a single function call. I wonder: Why does -regexr()- not take a fourth argument like -subinstr()-, indicating the maximum number of replacements to make (with . indicating replace all)?
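Until something like that exists, the single-replacement limit can be worked around by calling -regexr()- in a loop until -regexm()- no longer finds a match. The following is a minimal sketch under stated assumptions: the header text is made up, and the rule applied (delete anything other than letters, digits, and underscores, then drop leading digits) is only a guess at the cleaning described above, not a reproduction of what -insheet- does.

```stata
* Hypothetical raw header text, as it might appear in the first row of a file
local s "1st var-name (%)"

* regexr() replaces one match per call, so repeat until no match remains
while regexm("`s'", "[^A-Za-z0-9_]") {
    local s = regexr("`s'", "[^A-Za-z0-9_]", "")
}

* Drop leading digits, since a Stata name cannot begin with one
local s = regexr("`s'", "^[0-9]+", "")

display "`s'"
```

For the input above this leaves stvarname. Because the string is passed through macro expansion, input containing double quotes or macro characters would need more careful handling. Current Stata releases also provide ustrregexra(), which replaces every match in one call.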

Just some random musings on a Saturday night...


-- Phil
