Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Looping through entries in csv file

From	David Elliott <[email protected]>
To	[email protected]
Subject	Re: st: Looping through entries in csv file
Date	Mon, 13 Sep 2010 17:45:42 -0300

I've taken an additional look at the UK postal code format and found
that it can come in a number of forms with variable lengths and Letter
Number sequencing - oh bother!

UK postal code patterns:

In the notation below 'L' is letter, 'N' is number + means joined
together and a space is an *expected* space.

L+N   NLL
L+NN  NLL
L+NL  NLL
L+LN  NLL
L+LNN NLL
L+LNL NLL

Fortunately these sequences, while complicated, still provide the
necessary anchors for using a regular expression since they
always begin with a letter L and end in NLL.

The variable first part of the code can be captured with:
([a-zA-Z]([0-9]|[0-9][0-9]|[0-9][a-zA-Z]|[a-zA-Z][0-9]|[a-zA-Z][0-9][0-9]|[a-zA-Z][0-9][a-zA-Z]))
Parsing out the above you have:
([a-zA-Z] = L (any single letter , lower or upper case) followed by
one of the following alternative sequences
([0-9] = N (any single number) (Note the alternatives in brackets with
bars between)
|[0-9][0-9] = NN
|[0-9][a-zA-Z] = NL
|[a-zA-Z][0-9] = LN
|[a-zA-Z][0-9][0-9] = LNN
|[a-zA-Z][0-9][a-zA-Z]) = LNL
one then can have 0 to several spaces:
[ ]*
and the final sequence is fixed:
([0-9][a-zA-Z][a-zA-Z])

The full code would be:
*==========================
capture drop post*
gen str8 postcode = ""
gen str4 postdistrict = ""
quietly forvalues i = 1/`c(N)' {
	if regexm(aadd1[`i']+add2[`i']+add3[`i']+add4[`i']+add5[`i']+add6[`i'],
!! ".*([a-zA-Z]([0-9]|[0-9][0-9]|[0-9][a-zA-Z]|[a-zA-Z][0-9]|[a-zA-Z][0-9][0-9]|[a-zA-Z][0-9][a-zA-Z]))
!! [ ]*([0-9][a-zA-Z][a-zA-Z]).*") != 0 {
		replace postcode = upper(regexs(1))+" "+upper(regexs(2)) in `i'
		replace postdistrict = upper(regexs(1)) in `i'
	}
}
*==========================
(Note: long line wrapped at !!    )

There is no problem with the postal code being on any line as the
first part of the expression simply appends all the address fields
together and then scans them for a postalcode flavoured string.:
add1[`i']+add2[`i']+add3[`i']+add4[`i']+add5[`i']+add6[`i']

The two parts of the code are captured by wrapping them in brackets:
".*([a-zA-Z]([0-9]|...[a-zA-Z]))[ ]*([0-9][a-zA-Z][a-zA-Z]).*"
regexs(1) returns the first matching sequence if regexm(...) found a
postal code pattern and
regexs(2) returns the second part.

For Gillian's purposes, the first part is also the postal district and
becomes available to her for merging purposes.

DC Elliott

I should point out that the process of comparing codes against a
string is time consuming and is doomed to fail if codes are not in
"perfect" format.  The regular expression route will extract and
transform the code to its "perfect" form in separate variables for
merging purposes doing extraction and cleaning in a single process.

On 13 September 2010 16:52, Ada Ma <[email protected]> wrote:
> David - I think Gillian need to create a list of postcode sector to
> identify the postcodes, so that they can be extracted from the address
> strings variables.
>
> UK postcodes - IIRC - can go from 5 to 7 letters long (8 if including
> the space in the middle).  e.g. first part of centre of Birmingham
> postcode is B1...  but you also have first sector which is 4 letters
> long e.g. Aberdeen Foresterhill AB25, Newcastle city centre is NE1.
> Your suggestion about pattern should work.  Only thing is that
> sometimes people are sloppy with the number of spaces in the middle,
> sometimes they put them in, sometimes they forget it, sometimes they
> have two to three spaces in to make it pretty (so that B1 would align
> with AB24).
>
> As for county boundaries - I am not sure.  But as counties are
> amalgamation of Local Authority Districts so they are pretty big, I'd
> imagine any mismatch shouldn't be massive.
>
>
> Gillian - If you can't do the transpose in one go, I guess you must be
> using older versions of MS Office - are you using 2003?  The latest
> version of Excel has no limit on column numbers... so if you can find
> a friend using MS Office 2007 your transpose problem can be solved.
>
> Yet another alternative is to use the -reshape- command in Stata.
> Create two variables:
>
> gen id = 1
> gen year=_n
> keep postcode id year
> reshape postcode, i(id) j(year)
>
> Now browse  your data and copy and paste the only line of data you
> have into a do file.   Fiddly solution but one needs not learn
> programming for it!
>
>
>
>
> On Mon, Sep 13, 2010 at 5:53 PM, David Elliott <[email protected]> wrote:
>> Once one has the postal district extracted from the address fields,
>> why not just use it to merge with a file containing the districts and
>> counties?  I'm not catching the reason to place everything in a
>> global.
>>
>> I also need to ask a question about UK postal codes.  Are UK codes
>> fully circumscribed by county boundaries?  In Canada the fist part of
>> the postal code is called a "Forward sortation area" or FSA - The FSA
>> where I work, for example is "B3J"  One FSA can serve a fairly large
>> region and at least for us, that region can extend across county lines
>> which plays holy hell with our use of them to place persons within a
>> county boundary for statistical purposes.  It may be prudent to test
>> your apparent assumption that codes don't cross county lines.
>>
>> DC Elliott
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/statalist/faq
>> *   http://www.ats.ucla.edu/stat/stata/
>>
>
>
>
> --
> Ada Ma
> Research Fellow
> Health Economics Research Unit
> University of Aberdeen, UK.
> http://www.abdn.ac.uk/heru/
> Tel: +44 (0) 1224 555189
> Fax: +44 (0) 1224 550926
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>



-- 
David Elliott

Everything is theoretically impossible, until it is done.
Progress is made by lazy men looking for easier ways to do things.
 -- Robert A. Heinlein (American science-fiction Writer, 1907-1988)

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

Follow-Ups:
- Re: st: Looping through entries in csv file
  - From: [email protected]

References:
- Re: st: Looping through entries in csv file
  - From: Ada Ma <[email protected]>
- Re: st: Looping through entries in csv file
  - From: [email protected]
- Re: st: Looping through entries in csv file
  - From: Ada Ma <[email protected]>
- Re: st: Looping through entries in csv file
  - From: Neil Shephard <[email protected]>
- Re: st: Looping through entries in csv file
  - From: David Elliott <[email protected]>
- Re: st: Looping through entries in csv file
  - From: David Elliott <[email protected]>
- Re: st: Looping through entries in csv file
  - From: Ada Ma <[email protected]>

Prev by Date: st: estimate supply and demand with short panel
Next by Date: st: Merging question for a panel dataset
Previous by thread: Re: st: Looping through entries in csv file
Next by thread: Re: st: Looping through entries in csv file
Index(es):
- Date
- Thread