Bookmark and Share

Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down on April 23, and its replacement, statalist.org is already up and running.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Looping through entries in csv file


From   Gillian.Frost@hsl.gov.uk
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Looping through entries in csv file
Date   Tue, 14 Sep 2010 09:37:12 +0100

All,

Thank you so much for your help!

David, the below code works like a dream.  I had to remove the ".*" from 
the start of the expression 
".*([a-zA-Z]([0-9]|[0-9][0-9]|[0-9][a-zA-Z]|[a-zA-Z][0-9]|[a-zA-Z][0-9][0-9]|[a-zA-Z][0-9][a-zA-Z]))
...", for some reason it wasn't picking up the first letter of the 
postcode district, but it works with this removed.  I was not aware of 
these regular expressions, and can see how useful they could be for future 
problems.

Ada was correct in that I needed to create a list of postcode districts to 
identify and extract the postcode from the address variables, but your 
method means that this is no longer necessary and I can now just merge 
with the postcode data file.

David, in answer to your question, UK postcodes are not fully within 
county boundaries.  What we have done is taken the number of households in 
each postcode district and looked at the proportion that is in each 
county.  We have then assigned the postcode district to the county where 
the greatest proportion of households appear.  For example, 74% of 
households in the postcode district SK17 are in the East Midlands region, 
so this is the county we have assigned to SK17.  There is scope for some 
misclassification using this method.  However, for the actual analysis, we 
only need to know whether individuals are in England and Wales, or 
Scotland, and so the effect of any misclassification should be minimal. 
The more detailed information about counties will be used mainly for 
descriptive purposes.

Thank you once again for all your help.  I would never had got to this 
solution myself.

Many thanks,

Gillian





David Elliott <dcelliott@gmail.com> 
Sent by: owner-statalist@hsphsun2.harvard.edu
13/09/2010 21:45
Please respond to
statalist@hsphsun2.harvard.edu


To
statalist@hsphsun2.harvard.edu
cc

Subject
Re: st: Looping through entries in csv file






I've taken an additional look at the UK postal code format and found
that it can come in a number of forms with variable lengths and Letter
Number sequencing - oh bother!

UK postal code patterns:

In the notation below 'L' is letter, 'N' is number + means joined
together and a space is an *expected* space.

L+N   NLL
L+NN  NLL
L+NL  NLL
L+LN  NLL
L+LNN NLL
L+LNL NLL

Fortunately these sequences, while complicated, still provide the
necessary anchors for using a regular expression since they
always begin with a letter L and end in NLL.

The variable first part of the code can be captured with:
([a-zA-Z]([0-9]|[0-9][0-9]|[0-9][a-zA-Z]|[a-zA-Z][0-9]|[a-zA-Z][0-9][0-9]|[a-zA-Z][0-9][a-zA-Z]))
Parsing out the above you have:
([a-zA-Z] = L (any single letter , lower or upper case) followed by
one of the following alternative sequences
([0-9] = N (any single number) (Note the alternatives in brackets with
bars between)
|[0-9][0-9] = NN
|[0-9][a-zA-Z] = NL
|[a-zA-Z][0-9] = LN
|[a-zA-Z][0-9][0-9] = LNN
|[a-zA-Z][0-9][a-zA-Z]) = LNL
one then can have 0 to several spaces:
[ ]*
and the final sequence is fixed:
([0-9][a-zA-Z][a-zA-Z])

The full code would be:
*==========================
capture drop post*
gen str8 postcode = ""
gen str4 postdistrict = ""
quietly forvalues i = 1/`c(N)' {
                 if 
regexm(aadd1[`i']+add2[`i']+add3[`i']+add4[`i']+add5[`i']+add6[`i'],
!! 
".*([a-zA-Z]([0-9]|[0-9][0-9]|[0-9][a-zA-Z]|[a-zA-Z][0-9]|[a-zA-Z][0-9][0-9]|[a-zA-Z][0-9][a-zA-Z]))
!! [ ]*([0-9][a-zA-Z][a-zA-Z]).*") != 0 {
                                 replace postcode = upper(regexs(1))+" 
"+upper(regexs(2)) in `i'
                                 replace postdistrict = upper(regexs(1)) 
in `i'
                 }
}
*==========================
(Note: long line wrapped at !!    )

There is no problem with the postal code being on any line as the
first part of the expression simply appends all the address fields
together and then scans them for a postalcode flavoured string.:
add1[`i']+add2[`i']+add3[`i']+add4[`i']+add5[`i']+add6[`i']

The two parts of the code are captured by wrapping them in brackets:
".*([a-zA-Z]([0-9]|...[a-zA-Z]))[ ]*([0-9][a-zA-Z][a-zA-Z]).*"
regexs(1) returns the first matching sequence if regexm(...) found a
postal code pattern and
regexs(2) returns the second part.

For Gillian's purposes, the first part is also the postal district and
becomes available to her for merging purposes.

DC Elliott

I should point out that the process of comparing codes against a
string is time consuming and is doomed to fail if codes are not in
"perfect" format.  The regular expression route will extract and
transform the code to its "perfect" form in separate variables for
merging purposes doing extraction and cleaning in a single process.

On 13 September 2010 16:52, Ada Ma <heu034@googlemail.com> wrote:
> David - I think Gillian need to create a list of postcode sector to
> identify the postcodes, so that they can be extracted from the address
> strings variables.
>
> UK postcodes - IIRC - can go from 5 to 7 letters long (8 if including
> the space in the middle).  e.g. first part of centre of Birmingham
> postcode is B1...  but you also have first sector which is 4 letters
> long e.g. Aberdeen Foresterhill AB25, Newcastle city centre is NE1.
> Your suggestion about pattern should work.  Only thing is that
> sometimes people are sloppy with the number of spaces in the middle,
> sometimes they put them in, sometimes they forget it, sometimes they
> have two to three spaces in to make it pretty (so that B1 would align
> with AB24).
>
> As for county boundaries - I am not sure.  But as counties are
> amalgamation of Local Authority Districts so they are pretty big, I'd
> imagine any mismatch shouldn't be massive.
>
>
> Gillian - If you can't do the transpose in one go, I guess you must be
> using older versions of MS Office - are you using 2003?  The latest
> version of Excel has no limit on column numbers... so if you can find
> a friend using MS Office 2007 your transpose problem can be solved.
>
> Yet another alternative is to use the -reshape- command in Stata.
> Create two variables:
>
> gen id = 1
> gen year=_n
> keep postcode id year
> reshape postcode, i(id) j(year)
>
> Now browse  your data and copy and paste the only line of data you
> have into a do file.   Fiddly solution but one needs not learn
> programming for it!
>
>
>
>
> On Mon, Sep 13, 2010 at 5:53 PM, David Elliott <dcelliott@gmail.com> 
wrote:
>> Once one has the postal district extracted from the address fields,
>> why not just use it to merge with a file containing the districts and
>> counties?  I'm not catching the reason to place everything in a
>> global.
>>
>> I also need to ask a question about UK postal codes.  Are UK codes
>> fully circumscribed by county boundaries?  In Canada the fist part of
>> the postal code is called a "Forward sortation area" or FSA - The FSA
>> where I work, for example is "B3J"  One FSA can serve a fairly large
>> region and at least for us, that region can extend across county lines
>> which plays holy hell with our use of them to place persons within a
>> county boundary for statistical purposes.  It may be prudent to test
>> your apparent assumption that codes don't cross county lines.
>>
>> DC Elliott
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/statalist/faq
>> *   http://www.ats.ucla.edu/stat/stata/
>>
>
>
>
> --
> Ada Ma
> Research Fellow
> Health Economics Research Unit
> University of Aberdeen, UK.
> http://www.abdn.ac.uk/heru/
> Tel: +44 (0) 1224 555189
> Fax: +44 (0) 1224 550926
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>



-- 
David Elliott

Everything is theoretically impossible, until it is done.
Progress is made by lazy men looking for easier ways to do things.
 -- Robert A. Heinlein (American science-fiction Writer, 1907-1988)

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



------------------------------------------------------------------------
ATTENTION:

This message contains privileged and confidential information intended
for the addressee(s) only. If this message was sent to you in error,
you must not disseminate, copy or take any action in reliance on it and
we request that you notify the sender immediately by return email.

Opinions expressed in this message and any attachments are not
necessarily those held by the Health and Safety Laboratory or any person
connected with the organisation, save those by whom the opinions were
expressed.

Please note that any messages sent or received by the Health and Safety
Laboratory email system may be monitored and stored in an information
retrieval system.
------------------------------------------------------------------------
Think before you print - do you really need to print this email?
------------------------------------------------------------------------

------------------------------------------------------------------------
Scanned by MailMarshal - Marshal's comprehensive email content security
solution. Download a free evaluation of MailMarshal at www.marshal.com
------------------------------------------------------------------------
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   Site index