
st: RE: Substring extraction based on punctuation


From   "David E Moore" <davem@hartman-group.com>
To   statalist@hsphsun2.harvard.edu
Subject   st: RE: Substring extraction based on punctuation
Date   Wed, 29 Jun 2005 16:43:17 -0700

Have you considered -tokenize- using "," as the parse character?
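
A minimal sketch of that idea, assuming `varname' holds a single variable name (as in the original post) and that each label contains exactly two commas.  With parse(","), -tokenize- returns the commas as tokens of their own, so the three pieces should land in `1', `3' and `5'.  trim() is applied on the assumption that the blanks following each comma stay attached to the tokens; it is harmless if they do not.

	* fetch the label, then split it on commas only (blanks are not separators here)
	local varlbl : variable label `varname'
	tokenize `"`varlbl'"', parse(",")
	* tokens 1, 3 and 5 hold the pieces; tokens 2 and 4 hold the commas
	local varlbl1 = trim(`"`1'"')
	local varlbl2 = trim(`"`3'"')
	local varlbl3 = trim(`"`5'"')

Note that -tokenize- overwrites the positional macros `1', `2', ..., so inside a program this would have to come after any arguments have been saved elsewhere.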

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu]On Behalf Of Michael S.
Hanson
Sent: Wednesday, June 29, 2005 4:30 PM
To: statalist@hsphsun2.harvard.edu
Subject: st: Substring extraction based on punctuation


I have a (large) set of variables with labels of the (general) form:

	Some text, some more text, still more text
	Also some text, lots and lots more text, text
	(etc.)

The commas are the separators of interest to me:  I would like to 
extract the substrings before, between and after the commas (excluding 
the commas and the spaces that follow them) into three local macros 
for further use.  The number of words in each part of the label 
varies, as does the total number of words;  hence the -word # of 
`varname'- extended macro function does not appear to apply here.  The 
closest I have come with extended macro functions is:

	local varlbl : variable label `varname'
	local varlbl1 : piece 1 20 of "`varlbl'"
	local varlbl2 : piece 2 20 of "`varlbl'"
	local varlbl3 : piece 3 20 of "`varlbl'"

but this doesn't reliably return the desired substrings (given the 
variation in words (and in word length) between commas) -- 20 here is 
simply an approximate value that works for a particular subset of 
labels.  Same with the -nobreak- option.  (This code also does not 
strip off the commas.)

So instead of extended macros, I've tried using string functions.  I 
suspect that if I knew and understood regular expression syntax, I 
could make use of -regexm- and -regexs- on `varlbl' -- but I don't.  
Instead, the following "works":

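	local varlbl : variable label `varname'
	local l = length("`varlbl'")
	* c1: position of the first comma, counted from the left
	local c1 = strpos("`varlbl'",",")
	* c2: position of the last comma, counted from the right
	local c2 = strpos(reverse("`varlbl'"),",")
	* assumes exactly two commas, each followed by a single space
	local varlbl1 = substr("`varlbl'",1,`c1'-1)
	local varlbl2 = substr("`varlbl'",`c1'+2,`l'-`c1'-`c2'-1)
	local varlbl3 = substr("`varlbl'",`l'-`c2'+3,`l')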

... but I'm really hoping to find some alternative code that is 
"cleaner" and more transparent.  Any such suggestions are welcome.  
Thanks in advance.

                                         -- Mike

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


