Statalist The Stata Listserver



RE: st: trim spaces in postcode


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   RE: st: trim spaces in postcode
Date   Mon, 3 Apr 2006 10:19:09 +0100

Joseph's method of reading in data as one 
or more big string variables and then 
applying -split- and -destring- is one 
I often use myself for reproducing small example
datasets from Statalist postings. More precisely, 
I copy and paste into the Editor; the embedded spaces
usually present then cause the Editor to treat the pasted
material as a string variable. I'd say the method 
was useful so long as you knew exactly what you were doing. 
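
A minimal sketch of that workflow, with made-up data and hypothetical
variable names (raw, f1-f3); here -input- stands in for pasting into
the Editor:

```stata
clear
* Each record arrives as one long string; str244 was the longest
* string type available when this thread was written.
input str244 raw
"1 2.5 abc"
"4 5.5 def"
end

* Break each record into words at spaces: creates f1, f2, f3 (all string)
split raw, generate(f)

* Convert the columns that are really numeric
destring f1 f2, replace

* Shrink the remaining strings to the shortest type that holds them
compress
list f1 f2 f3, noobs
```

This only behaves as intended when no field contains an internal space.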

I'd like to see a write-up of Joseph's regexp application! 

But three small caveats: 

1. -split- and -destring- are commands defined 
by .ado files, so if this were to be used repeatedly 
-- not something Joseph is recommending -- 
then it would be very inefficient because of the 
overhead of interpretation. 

2. Creating a variable as -str244- and then 
-compress-ing it could of course make huge demands
on storage, albeit briefly. 

3. Most dangerous is the tacit assumption that 
column 1 + 244 * j (245, 489, etc.), where each new 
chunk starts, doesn't fall in the middle of a field. 
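
To see how caveat 3 can bite, here is a small sketch that uses a chunk
width of 8 as a stand-in for 244. The record and the variable names
(rec, v1, v2, w1, w2) are invented for illustration, and -substr()-
only approximates what -infix- would read:

```stata
clear
input str20 rec
"alpha beta gamma"
end

* Cut the record at a fixed width, roughly as
* -infix str v1 1-8 str v2 9-16- would
generate v1 = substr(rec, 1, 8)
generate v2 = substr(rec, 9, 8)

* The chunk boundary lands inside "beta", so splitting the first
* chunk yields "alpha" and a stray "be"; "ta" starts the second chunk
split v1, generate(w)
list v1 v2 w1 w2, noobs
```

Nothing in the output flags the broken field, which is why the
assumption is a dangerous one.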

Nick 
[email protected] 

Joseph Coveney
> 
> Ronnie Babigumira wrote:
> 
> Yes, I have used and continue to use -insheet- (mainly if I have
> tab-delimited data from Excel), and specifying the string length is
> not a problem with -insheet-. That said, there are situations where
> -infile- is the appropriate command, and of course -input- is
> invaluable when I want to input a few entries. In this case I have to
> specify the length of string variables.
> 
> Ada Ma wrote:
> 
> have you tried -insheet-??
> 
> <snip>
> Is there something in newer versions of Stata that would save me from
> guessing the length of strings when using -infile- and -input-?
> </snip>
> 
> ----------------------------------------------------------------------
> 
> For -infile-:
> 
> If your input file is space-delimited (that is, spaces aren't used to
> represent missing values and there aren't internal spaces in strings),
> then you can use -split- after -infix str v1 1-244 using <filename>-
> for record lengths up to 244 bytes.  You can then -destring- to
> restore numeric variables.
> 
> In cases of multispace-delimited files (typically used, for example,
> where there are internal spaces in strings), I believe that you can
> specify a multispace parsing string with -split-.  (See first example
> below.)  Be aware that -infix- strips leading spaces at the beginning
> of the record; -filefilter- can help to remedy that beforehand if
> needed.
> 
> In cases where the input file is messy, you can use Stata's
> conventional string functions and new regular expression functions
> after -infix str v1 1-244 using <filename>-.  I've just finished such
> a project (the data were embedded in prettily formatted .pdf files),
> and Stata's regular expression functions were a godsend.
> 
> If the record length is longer than 244, then I believe that you can
> -infix str v1 1-244 str v2 245-488 . . . using <filename>-, and
> proceed as above.
> 
> For -input-:
> 
> You don't actually need to guess string length in order to use
> -input-.  Just specify the maximum and away you go.  (See second
> example below.)
> 
> Joseph Coveney
> 
> 
> . set obs 2
> obs was 0, now 2
> 
> . input str244 a
> 
>              a
>   1. "a  b  c d"
>   2. "e  f g  h i"
> 
> . split a, generate(b) parse("  ")
> variables created as string:
> b1  b2  b3
> 
> . list b*, noobs
> 
>   +----------------+
>   | b1    b2    b3 |
>   |----------------|
>   |  a     b   c d |
>   |  e   f g   h i |
>   +----------------+
> 
> . clear
> 
> . input str244 a byte b str244 c int d str244 e float f
> 
>              a         b          c         d          e          f
>   1. abc 3 def 200 ghi 1001.1
>   2. lmn -1 opq .m "" 10000
>   3. end
> 
> . compress
> a was str244 now str3
> c was str244 now str3
> e was str244 now str3
> 
> . list, noobs
> 
>   +-------------------------------------+
>   |   a    b     c     d     e        f |
>   |-------------------------------------|
>   | abc    3   def   200   ghi   1001.1 |
>   | lmn   -1   opq    .m          10000 |
>   +-------------------------------------+
> 



