
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at


Re: st: neighbourhood size

From   "James Beard" <>
Subject   Re: st: neighbourhood size
Date   Thu, 25 Jul 2013 02:31:03 -0000

If there is some compelling reason for doing this in Stata, have a 
look at -strgroup- and -levenshtein- (-findit strgroup-).
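
For readers unfamiliar with the measure behind those commands, here is a minimal sketch of Levenshtein (edit) distance. It is written in Python purely for illustration and is not the code used by -levenshtein- or -strgroup-; the function name is my own. Note that edit distance also counts insertions and deletions, so "distance 1" is a slightly wider notion than "one letter changed".

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    # previous[j] holds the distance between a[:i-1] and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("opsilon", "epsilon"))
    print(levenshtein("opsilon", "upsilon"))
```

Python strings are sequences of Unicode code points, so the same function should accept Persian text directly, provided each letter is a single code point.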

Of course, you may still have the Unicode problem. If the actual 
content of the text is unimportant, and the text is all in the same 
Unicode block, you may be able to pre-process your text to turn it 
into (potentially meaningless) 8-bit characters.
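
One way such a pre-processing step might look is sketched below, in Python and outside Stata. The assumptions are mine: each distinct non-space character is given an arbitrary printable ASCII symbol (a narrower pool than the full 8-bit range, chosen so the output stays printable), and the names `build_map` and `remap` are hypothetical. The recoded words are meaningless to read, but two words differ in the same positions before and after recoding.

```python
# Pool of replacement symbols: '!' .. '~' (94 printable ASCII characters).
PRINTABLE = [chr(c) for c in range(33, 127)]


def build_map(texts):
    """Assign each distinct non-space character a printable ASCII symbol,
    in order of first appearance across all texts."""
    mapping = {}
    for text in texts:
        for ch in text:
            if ch.isspace() or ch in mapping:
                continue
            if len(mapping) >= len(PRINTABLE):
                raise ValueError("too many distinct characters for the pool")
            mapping[ch] = PRINTABLE[len(mapping)]
    return mapping


def remap(text, mapping):
    """Recode text character by character; whitespace is kept as is."""
    return "".join(ch if ch.isspace() else mapping[ch] for ch in text)


if __name__ == "__main__":
    # Two illustrative Persian words that differ in one letter.
    words = ["\u06a9\u062a\u0627\u0628", "\u06a9\u0628\u0627\u0628"]
    table = build_map(words)
    for w in words:
        print(remap(w, table))
```

The same mapping has to be built once from the word list and the corpus together, otherwise the two recodings will not be comparable.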


On 24 Jul 2013 at 22:12, Sergiy Radyakin wrote:

Date sent:      	Wed, 24 Jul 2013 22:12:47 -0400
Subject:        	Re: st: neighbourhood size
From:           	Sergiy Radyakin <>

Does not sound like a big deal, except that Stata does not work with
Unicode. However, even in English you will need to decide how to deal
with ambiguities in the text. Suppose your dictionary is the Greek
letters: alpha, beta, ... and you encounter 'opsilon' in the text. Do
you increment the frequency of 'epsilon'? 'upsilon'? Both (according
to your definition)? Or neither (it is not a valid word but a typo)?

Once you resolve that: for i=1 { for j=1 {...}}. A couple of loops
should suffice. Now that can be slow, so then you investigate what is
special about your word list, what is special about your text, and
what is acceptable in terms of performance. A lot depends on the size
of the corpus. If you say it is a page of Google search results, we
are OK. If it is the contents of JSTOR for the last 20 years, we may
be in trouble. What is the size of the word list? Is it two, three, or
ten keywords? Or is it the contents of a novel?
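
The "couple of loops" could be sketched as follows. This is an illustration in Python, not Stata, and it rests on assumptions that the thread leaves open: a neighbour is taken to be a corpus word of the same length that differs in exactly one position, an ambiguous form such as 'opsilon' counts towards every word it neighbours (the "both" answer), and the function names and the `count_tokens` switch are hypothetical.

```python
from collections import Counter


def is_neighbour(a, b):
    """True if a and b have the same length and differ in exactly one position."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) == 1


def neighbourhood_sizes(word_list, corpus_tokens, count_tokens=False):
    """For each word, count its neighbours among the corpus words.

    By default each distinct corpus word counts once; with
    count_tokens=True every occurrence in the corpus counts.
    """
    freq = Counter(corpus_tokens)  # collapse the corpus to distinct words
    sizes = {}
    for word in word_list:              # outer loop: the word list
        total = 0
        for token, n in freq.items():   # inner loop: distinct corpus words
            if is_neighbour(word, token):
                total += n if count_tokens else 1
        sizes[word] = total
    return sizes


if __name__ == "__main__":
    words = ["epsilon", "upsilon", "alpha"]
    corpus = ["opsilon", "opsilon", "epsilon", "beta", "alphi"]
    print(neighbourhood_sizes(words, corpus))
    print(neighbourhood_sizes(words, corpus, count_tokens=True))
```

Collapsing the corpus to distinct words first means the inner loop runs over the vocabulary rather than over every token, which is where most of the time is saved on a large corpus.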

Why was Stata picked as the tool for solving this problem, I wonder?


On Wed, Jul 24, 2013 at 8:45 PM, Mehdi Bakhtiar <> wrote:
>>> Dear Experts,
>>> I have a question about how to use Stata to calculate
>>> neighbourhood size for a list of my words. Basically, I have my
>>> own word list and a corpus. I need to tell Stata to count, in my
>>> corpus, the number of neighbours (words differing by one letter)
>>> of each word in my word list. I should also mention that my words
>>> are in Persian script.
>>> Many thanks in advance for any attention and support,
>>> Kind regards,
>>> Mehdi Bakhtiar
