Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
David Elliott <dcelliott@gmail.com> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: Looping through entries in csv file |

Date |
Mon, 13 Sep 2010 17:45:42 -0300 |

I've taken an additional look at the UK postal code format and found that it can come in a number of forms with variable lengths and Letter Number sequencing - oh bother! UK postal code patterns: In the notation below 'L' is letter, 'N' is number + means joined together and a space is an *expected* space. L+N NLL L+NN NLL L+NL NLL L+LN NLL L+LNN NLL L+LNL NLL Fortunately these sequences, while complicated, still provide the necessary anchors for using a regular expression since they always begin with a letter L and end in NLL. The variable first part of the code can be captured with: ([a-zA-Z]([0-9]|[0-9][0-9]|[0-9][a-zA-Z]|[a-zA-Z][0-9]|[a-zA-Z][0-9][0-9]|[a-zA-Z][0-9][a-zA-Z])) Parsing out the above you have: ([a-zA-Z] = L (any single letter , lower or upper case) followed by one of the following alternative sequences ([0-9] = N (any single number) (Note the alternatives in brackets with bars between) |[0-9][0-9] = NN |[0-9][a-zA-Z] = NL |[a-zA-Z][0-9] = LN |[a-zA-Z][0-9][0-9] = LNN |[a-zA-Z][0-9][a-zA-Z]) = LNL one then can have 0 to several spaces: [ ]* and the final sequence is fixed: ([0-9][a-zA-Z][a-zA-Z]) The full code would be: *========================== capture drop post* gen str8 postcode = "" gen str4 postdistrict = "" quietly forvalues i = 1/`c(N)' { if regexm(aadd1[`i']+add2[`i']+add3[`i']+add4[`i']+add5[`i']+add6[`i'], !! ".*([a-zA-Z]([0-9]|[0-9][0-9]|[0-9][a-zA-Z]|[a-zA-Z][0-9]|[a-zA-Z][0-9][0-9]|[a-zA-Z][0-9][a-zA-Z])) !! [ ]*([0-9][a-zA-Z][a-zA-Z]).*") != 0 { replace postcode = upper(regexs(1))+" "+upper(regexs(2)) in `i' replace postdistrict = upper(regexs(1)) in `i' } } *========================== (Note: long line wrapped at !! ) There is no problem with the postal code being on any line as the first part of the expression simply appends all the address fields together and then scans them for a postalcode flavoured string.: add1[`i']+add2[`i']+add3[`i']+add4[`i']+add5[`i']+add6[`i'] The two parts of the code are captured by wrapping them in brackets: ".*([a-zA-Z]([0-9]|...[a-zA-Z]))[ ]*([0-9][a-zA-Z][a-zA-Z]).*" regexs(1) returns the first matching sequence if regexm(...) found a postal code pattern and regexs(2) returns the second part. For Gillian's purposes, the first part is also the postal district and becomes available to her for merging purposes. DC Elliott I should point out that the process of comparing codes against a string is time consuming and is doomed to fail if codes are not in "perfect" format. The regular expression route will extract and transform the code to its "perfect" form in separate variables for merging purposes doing extraction and cleaning in a single process. On 13 September 2010 16:52, Ada Ma <heu034@googlemail.com> wrote: > David - I think Gillian need to create a list of postcode sector to > identify the postcodes, so that they can be extracted from the address > strings variables. > > UK postcodes - IIRC - can go from 5 to 7 letters long (8 if including > the space in the middle). e.g. first part of centre of Birmingham > postcode is B1... but you also have first sector which is 4 letters > long e.g. Aberdeen Foresterhill AB25, Newcastle city centre is NE1. > Your suggestion about pattern should work. Only thing is that > sometimes people are sloppy with the number of spaces in the middle, > sometimes they put them in, sometimes they forget it, sometimes they > have two to three spaces in to make it pretty (so that B1 would align > with AB24). > > As for county boundaries - I am not sure. But as counties are > amalgamation of Local Authority Districts so they are pretty big, I'd > imagine any mismatch shouldn't be massive. > > > Gillian - If you can't do the transpose in one go, I guess you must be > using older versions of MS Office - are you using 2003? The latest > version of Excel has no limit on column numbers... so if you can find > a friend using MS Office 2007 your transpose problem can be solved. > > Yet another alternative is to use the -reshape- command in Stata. > Create two variables: > > gen id = 1 > gen year=_n > keep postcode id year > reshape postcode, i(id) j(year) > > Now browse your data and copy and paste the only line of data you > have into a do file. Fiddly solution but one needs not learn > programming for it! > > > > > On Mon, Sep 13, 2010 at 5:53 PM, David Elliott <dcelliott@gmail.com> wrote: >> Once one has the postal district extracted from the address fields, >> why not just use it to merge with a file containing the districts and >> counties? I'm not catching the reason to place everything in a >> global. >> >> I also need to ask a question about UK postal codes. Are UK codes >> fully circumscribed by county boundaries? In Canada the fist part of >> the postal code is called a "Forward sortation area" or FSA - The FSA >> where I work, for example is "B3J" One FSA can serve a fairly large >> region and at least for us, that region can extend across county lines >> which plays holy hell with our use of them to place persons within a >> county boundary for statistical purposes. It may be prudent to test >> your apparent assumption that codes don't cross county lines. >> >> DC Elliott >> * >> * For searches and help try: >> * http://www.stata.com/help.cgi?search >> * http://www.stata.com/support/statalist/faq >> * http://www.ats.ucla.edu/stat/stata/ >> > > > > -- > Ada Ma > Research Fellow > Health Economics Research Unit > University of Aberdeen, UK. > http://www.abdn.ac.uk/heru/ > Tel: +44 (0) 1224 555189 > Fax: +44 (0) 1224 550926 > > * > * For searches and help try: > * http://www.stata.com/help.cgi?search > * http://www.stata.com/support/statalist/faq > * http://www.ats.ucla.edu/stat/stata/ > -- David Elliott Everything is theoretically impossible, until it is done. Progress is made by lazy men looking for easier ways to do things. -- Robert A. Heinlein (American science-fiction Writer, 1907-1988) * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:**Re: st: Looping through entries in csv file***From:*Gillian.Frost@hsl.gov.uk

**References**:**Re: st: Looping through entries in csv file***From:*Ada Ma <heu034@googlemail.com>

**Re: st: Looping through entries in csv file***From:*Gillian.Frost@hsl.gov.uk

**Re: st: Looping through entries in csv file***From:*Ada Ma <heu034@googlemail.com>

**Re: st: Looping through entries in csv file***From:*Neil Shephard <nshephard@gmail.com>

**Re: st: Looping through entries in csv file***From:*David Elliott <dcelliott@gmail.com>

**Re: st: Looping through entries in csv file***From:*David Elliott <dcelliott@gmail.com>

**Re: st: Looping through entries in csv file***From:*Ada Ma <heu034@googlemail.com>

- Prev by Date:
**st: estimate supply and demand with short panel** - Next by Date:
**st: Merging question for a panel dataset** - Previous by thread:
**Re: st: Looping through entries in csv file** - Next by thread:
**Re: st: Looping through entries in csv file** - Index(es):