Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down on April 23, and its replacement, **statalist.org** is already up and running.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
Gillian.Frost@hsl.gov.uk |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: Looping through entries in csv file |

Date |
Tue, 14 Sep 2010 09:37:12 +0100 |

All, Thank you so much for your help! David, the below code works like a dream. I had to remove the ".*" from the start of the expression ".*([a-zA-Z]([0-9]|[0-9][0-9]|[0-9][a-zA-Z]|[a-zA-Z][0-9]|[a-zA-Z][0-9][0-9]|[a-zA-Z][0-9][a-zA-Z])) ...", for some reason it wasn't picking up the first letter of the postcode district, but it works with this removed. I was not aware of these regular expressions, and can see how useful they could be for future problems. Ada was correct in that I needed to create a list of postcode districts to identify and extract the postcode from the address variables, but your method means that this is no longer necessary and I can now just merge with the postcode data file. David, in answer to your question, UK postcodes are not fully within county boundaries. What we have done is taken the number of households in each postcode district and looked at the proportion that is in each county. We have then assigned the postcode district to the county where the greatest proportion of households appear. For example, 74% of households in the postcode district SK17 are in the East Midlands region, so this is the county we have assigned to SK17. There is scope for some misclassification using this method. However, for the actual analysis, we only need to know whether individuals are in England and Wales, or Scotland, and so the effect of any misclassification should be minimal. The more detailed information about counties will be used mainly for descriptive purposes. Thank you once again for all your help. I would never had got to this solution myself. Many thanks, Gillian David Elliott <dcelliott@gmail.com> Sent by: owner-statalist@hsphsun2.harvard.edu 13/09/2010 21:45 Please respond to statalist@hsphsun2.harvard.edu To statalist@hsphsun2.harvard.edu cc Subject Re: st: Looping through entries in csv file I've taken an additional look at the UK postal code format and found that it can come in a number of forms with variable lengths and Letter Number sequencing - oh bother! UK postal code patterns: In the notation below 'L' is letter, 'N' is number + means joined together and a space is an *expected* space. L+N NLL L+NN NLL L+NL NLL L+LN NLL L+LNN NLL L+LNL NLL Fortunately these sequences, while complicated, still provide the necessary anchors for using a regular expression since they always begin with a letter L and end in NLL. The variable first part of the code can be captured with: ([a-zA-Z]([0-9]|[0-9][0-9]|[0-9][a-zA-Z]|[a-zA-Z][0-9]|[a-zA-Z][0-9][0-9]|[a-zA-Z][0-9][a-zA-Z])) Parsing out the above you have: ([a-zA-Z] = L (any single letter , lower or upper case) followed by one of the following alternative sequences ([0-9] = N (any single number) (Note the alternatives in brackets with bars between) |[0-9][0-9] = NN |[0-9][a-zA-Z] = NL |[a-zA-Z][0-9] = LN |[a-zA-Z][0-9][0-9] = LNN |[a-zA-Z][0-9][a-zA-Z]) = LNL one then can have 0 to several spaces: [ ]* and the final sequence is fixed: ([0-9][a-zA-Z][a-zA-Z]) The full code would be: *========================== capture drop post* gen str8 postcode = "" gen str4 postdistrict = "" quietly forvalues i = 1/`c(N)' { if regexm(aadd1[`i']+add2[`i']+add3[`i']+add4[`i']+add5[`i']+add6[`i'], !! ".*([a-zA-Z]([0-9]|[0-9][0-9]|[0-9][a-zA-Z]|[a-zA-Z][0-9]|[a-zA-Z][0-9][0-9]|[a-zA-Z][0-9][a-zA-Z])) !! [ ]*([0-9][a-zA-Z][a-zA-Z]).*") != 0 { replace postcode = upper(regexs(1))+" "+upper(regexs(2)) in `i' replace postdistrict = upper(regexs(1)) in `i' } } *========================== (Note: long line wrapped at !! ) There is no problem with the postal code being on any line as the first part of the expression simply appends all the address fields together and then scans them for a postalcode flavoured string.: add1[`i']+add2[`i']+add3[`i']+add4[`i']+add5[`i']+add6[`i'] The two parts of the code are captured by wrapping them in brackets: ".*([a-zA-Z]([0-9]|...[a-zA-Z]))[ ]*([0-9][a-zA-Z][a-zA-Z]).*" regexs(1) returns the first matching sequence if regexm(...) found a postal code pattern and regexs(2) returns the second part. For Gillian's purposes, the first part is also the postal district and becomes available to her for merging purposes. DC Elliott I should point out that the process of comparing codes against a string is time consuming and is doomed to fail if codes are not in "perfect" format. The regular expression route will extract and transform the code to its "perfect" form in separate variables for merging purposes doing extraction and cleaning in a single process. On 13 September 2010 16:52, Ada Ma <heu034@googlemail.com> wrote: > David - I think Gillian need to create a list of postcode sector to > identify the postcodes, so that they can be extracted from the address > strings variables. > > UK postcodes - IIRC - can go from 5 to 7 letters long (8 if including > the space in the middle). e.g. first part of centre of Birmingham > postcode is B1... but you also have first sector which is 4 letters > long e.g. Aberdeen Foresterhill AB25, Newcastle city centre is NE1. > Your suggestion about pattern should work. Only thing is that > sometimes people are sloppy with the number of spaces in the middle, > sometimes they put them in, sometimes they forget it, sometimes they > have two to three spaces in to make it pretty (so that B1 would align > with AB24). > > As for county boundaries - I am not sure. But as counties are > amalgamation of Local Authority Districts so they are pretty big, I'd > imagine any mismatch shouldn't be massive. > > > Gillian - If you can't do the transpose in one go, I guess you must be > using older versions of MS Office - are you using 2003? The latest > version of Excel has no limit on column numbers... so if you can find > a friend using MS Office 2007 your transpose problem can be solved. > > Yet another alternative is to use the -reshape- command in Stata. > Create two variables: > > gen id = 1 > gen year=_n > keep postcode id year > reshape postcode, i(id) j(year) > > Now browse your data and copy and paste the only line of data you > have into a do file. Fiddly solution but one needs not learn > programming for it! > > > > > On Mon, Sep 13, 2010 at 5:53 PM, David Elliott <dcelliott@gmail.com> wrote: >> Once one has the postal district extracted from the address fields, >> why not just use it to merge with a file containing the districts and >> counties? I'm not catching the reason to place everything in a >> global. >> >> I also need to ask a question about UK postal codes. Are UK codes >> fully circumscribed by county boundaries? In Canada the fist part of >> the postal code is called a "Forward sortation area" or FSA - The FSA >> where I work, for example is "B3J" One FSA can serve a fairly large >> region and at least for us, that region can extend across county lines >> which plays holy hell with our use of them to place persons within a >> county boundary for statistical purposes. It may be prudent to test >> your apparent assumption that codes don't cross county lines. >> >> DC Elliott >> * >> * For searches and help try: >> * http://www.stata.com/help.cgi?search >> * http://www.stata.com/support/statalist/faq >> * http://www.ats.ucla.edu/stat/stata/ >> > > > > -- > Ada Ma > Research Fellow > Health Economics Research Unit > University of Aberdeen, UK. > http://www.abdn.ac.uk/heru/ > Tel: +44 (0) 1224 555189 > Fax: +44 (0) 1224 550926 > > * > * For searches and help try: > * http://www.stata.com/help.cgi?search > * http://www.stata.com/support/statalist/faq > * http://www.ats.ucla.edu/stat/stata/ > -- David Elliott Everything is theoretically impossible, until it is done. Progress is made by lazy men looking for easier ways to do things. -- Robert A. Heinlein (American science-fiction Writer, 1907-1988) * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/ ------------------------------------------------------------------------ ATTENTION: This message contains privileged and confidential information intended for the addressee(s) only. If this message was sent to you in error, you must not disseminate, copy or take any action in reliance on it and we request that you notify the sender immediately by return email. Opinions expressed in this message and any attachments are not necessarily those held by the Health and Safety Laboratory or any person connected with the organisation, save those by whom the opinions were expressed. Please note that any messages sent or received by the Health and Safety Laboratory email system may be monitored and stored in an information retrieval system. ------------------------------------------------------------------------ Think before you print - do you really need to print this email? ------------------------------------------------------------------------ ------------------------------------------------------------------------ Scanned by MailMarshal - Marshal's comprehensive email content security solution. Download a free evaluation of MailMarshal at www.marshal.com ------------------------------------------------------------------------ * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:**Re: st: Looping through entries in csv file***From:*David Elliott <dcelliott@gmail.com>

**References**:**Re: st: Looping through entries in csv file***From:*David Elliott <dcelliott@gmail.com>

- Prev by Date:
**Re: st: RE: st: RE : st: Insheet Issue** - Next by Date:
**st: 2-stages cluster analysis** - Previous by thread:
**Re: st: Looping through entries in csv file** - Next by thread:
**Re: st: Looping through entries in csv file** - Index(es):