Re: st: Turning text pages into indicators

From   Nick Cox
Subject   Re: st: Turning text pages into indicators
Date   Wed, 8 Aug 2012 17:03:16 +0100


gen indicator = strpos(string, "word1") | strpos(string, "word2") | strpos(string, "word3")
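
With many words, the same idea can be written as a loop. What follows is only a sketch: the local macro name -words- is made up for illustration, and word1, word2, word3 stand in for whatever single words are actually wanted.

```stata
* Placeholder word list (hypothetical); replace with the words of interest
local words word1 word2 word3

* Start at 0, then switch to 1 whenever any listed word is found
gen byte indicator = 0
foreach w of local words {
    replace indicator = 1 if strpos(string, "`w'")
}

* Possible alternative: regexm() accepts alternation inside the pattern,
* so the | goes within the quotes, not between separate strings
gen byte indicator2 = regexm(string, "word1|word2|word3")
```

Both forms are case-sensitive and match substrings, so "word1" would also be found inside "sword1".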


Mata offers support for longer strings. Otherwise, I'd think in terms
of lines = observations, pages = blocks of observations.

Suppose you hold the text in a structure of three variables:

page   line   text
1      1      "Once upon a time there was a cat who liked statistics, "
1      2      "and her favourite program was called Stata. She just loved"
1      3      "Stata and thought it was purrfect."

2      1      "The cat knew a big bad wolf who didn't like Stata."
2      2      "The wolf used SAS, Scary Animal Software."

Then you can go

gen lineindicator = strpos(text, "Stata") | strpos(text, "SAS")

egen  pageindicator = max(lineindicator), by(page)
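
For anyone who wants to try this, here is a self-contained sketch that types in the toy data above and runs the two commands. The storage types (byte, str80) and the -list- call are my own choices for illustration, not part of the original suggestion.

```stata
clear
* Toy data from the example: one observation per line of text
input byte page byte line str80 text
1 1 "Once upon a time there was a cat who liked statistics, "
1 2 "and her favourite program was called Stata. She just loved"
1 3 "Stata and thought it was purrfect."
2 1 "The cat knew a big bad wolf who didn't like Stata."
2 2 "The wolf used SAS, Scary Animal Software."
end

* 1 if the line mentions either word, 0 otherwise
gen byte lineindicator = strpos(text, "Stata") | strpos(text, "SAS")

* 1 if any line on the page mentions either word
egen byte pageindicator = max(lineindicator), by(page)

list page line lineindicator pageindicator, sepby(page)
```

Line 1 of page 1 contains neither word, so its lineindicator is 0, but pageindicator is still 1 for the whole page because lines 2 and 3 mention Stata.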


On Wed, Aug 8, 2012 at 2:01 PM, Jen Zhen wrote:

> (1) I'd like to create a list of indicators to cover whether a string
> variable contains at least one out of several words.
> I know I can check whether it contains one specific word with - gen
> indicator=regexm(string,"word1") - but can I also cover several words
> in one command line with this?
> I tried - gen indicator=regexm(string,"word1" "word2") - and  gen
> indicator=regexm(string,"word1" | "word2") - and these wouldn't work,
> but maybe there's another way to do this?
> I know I can as well generate a separate indicator for each word and
> then just sum them up, but since I have many words and many strings to
> cover that would be inefficient.
> (2) I'm starting with long texts, think half a page or a full page, so
> I presumably can't read the entire page into a single string variable
> on which I can then perform (1) above.
> Do I need to initially split the text in say Excel, or is there a way
> to still read all text in in Stata and then split it into as many
> variables as necessary (but no more)?