Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: neighbourhood size

From	"James Beard" <[email protected]>
To	[email protected]
Subject	Re: st: neighbourhood size
Date	Thu, 25 Jul 2013 02:31:03 -0000

If there is some compelling reason for doing this in Stata, have a 
look at -strgroup- and -levenshtein- (-findit strgroup-).
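For readers who want to see what an edit-distance comparison involves, here is a minimal Python sketch of the standard Levenshtein distance. It is not the code behind the Stata commands named above, and the function name is only illustrative. Note that Levenshtein distance also counts insertions and deletions, so a distance of 1 is a broader notion of "neighbour" than a single substituted letter.

```python
def levenshtein(a, b):
    """Return the Levenshtein (edit) distance between strings a and b."""
    # previous[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]
```

Python strings are sequences of Unicode code points, so the same function can be applied to words in Persian script without any pre-processing.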

Of course, you may still have the Unicode problem. If the actual 
content of the text is unimportant, and the text is all in the same 
Unicode block, you may be able to pre-process your text to turn it 
into (potentially meaningless) 8-bit characters.
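A minimal Python sketch of that pre-processing idea follows. It assumes every character lies in the Arabic block (U+0600 to U+06FF), which covers the basic Persian letters; the function name and the choice of block are assumptions for illustration, not something specified in this thread. Characters outside the block (spaces, punctuation, the zero-width non-joiner often used in Persian text) would have to be handled separately.

```python
ARABIC_BLOCK_START = 0x0600  # assumed block; it spans exactly 256 code points


def to_8bit(word, block_start=ARABIC_BLOCK_START):
    """Map each character to one byte: its offset from the block start."""
    codes = []
    for ch in word:
        offset = ord(ch) - block_start
        if not 0 <= offset <= 0xFF:
            raise ValueError(f"character {ch!r} is outside the assumed block")
        codes.append(offset)
    return bytes(codes)
```

Because the mapping is one-to-one per character, two words that differ in one letter still differ in exactly one byte after conversion, so neighbour counts are unaffected even though the bytes themselves are meaningless.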

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On 24 Jul 2013 at 22:12, Sergiy Radyakin wrote:

Date sent:      	Wed, 24 Jul 2013 22:12:47 -0400
Subject:        	Re: st: neighbourhood size
From:           	Sergiy Radyakin <[email protected]>
To:             	"[email protected]" 
<[email protected]>
Send reply to:  	[email protected]

Does not sound like a big deal, except that Stata does not work with
Unicode. However, even in English you will need to decide how to deal
with ambiguities in the text. Suppose your dictionary is the Greek
letters: alpha, beta, ... You encounter 'opsilon' in the text: do you
increment the frequency of 'epsilon'? Of 'upsilon'? Of both (according
to your definition)? Or of neither (it is not a valid word but a
typo)? Once you resolve that: for i=1 { for j=1 {...}} A couple of
loops should suffice. Now that can be slow, so then you investigate
what is special about your word list, what is special about your text,
and what is acceptable in terms of performance. A lot depends on the
size of the corpus. If you say it is a page of Google search results,
we are OK. If it is the contents of JSTOR for the last 20 years, we
might be in trouble. What is the size of the word list? Is it two,
three, ten keywords? Or is it the contents of a novel?
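The "couple of loops" can be sketched in Python as below. This is only one reading of the question: a neighbour is assumed to be a word of the same length that differs in exactly one position, and each distinct corpus word is counted once regardless of how often it occurs. Both choices, and the function names, are assumptions rather than anything settled in this thread.

```python
def differs_by_one_letter(a, b):
    """True if a and b have equal length and differ in exactly one position."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) == 1


def neighbourhood_sizes(wordlist, corpus_words):
    """Count, for each word in wordlist, its one-letter neighbours in the corpus."""
    vocabulary = set(corpus_words)  # distinct corpus words (types, not tokens)
    sizes = {}
    for word in wordlist:              # outer loop: the word list
        count = 0
        for candidate in vocabulary:   # inner loop: the corpus vocabulary
            if differs_by_one_letter(word, candidate):
                count += 1
        sizes[word] = count
    return sizes
```

With the Greek-letter example, `neighbourhood_sizes(["opsilon"], ["epsilon", "upsilon", "alpha"])` counts both 'epsilon' and 'upsilon' as neighbours of 'opsilon'. The cost grows with the size of the word list times the size of the corpus vocabulary, which is why the sizes asked about above matter.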

Why is Stata picked as a tool for solving this problem I wonder?
http://stackoverflow.com/questions/4520876/counting-the-frequency-of-specific-words-in-text-file

Sergiy

On Wed, Jul 24, 2013 at 8:45 PM, Mehdi Bakhtiar <[email protected]> wrote:
>>> Dear Experts,
>>> I have a question about how to use Stata to calculate neighbourhood
>>> size for a list of my words. Basically, I have my own word list and
>>> a corpus. I need to tell Stata to count the number of neighbours of
>>> each word in my word list (words with one letter variation) in my
>>> corpus. Also, I need to mention that my words are in Persian script.
>>> In advance many thanks for any attention and support,
>>> Kind regards,
>>> Mehdi Bakhtiar
>>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/


Follow-Ups:
- Re: st: neighbourhood size
  - From: Sergiy Radyakin <[email protected]>

References:
- st: neighbourhood size
  - From: Mehdi Bakhtiar <[email protected]>
- Re: st: neighbourhood size
  - From: Sergiy Radyakin <[email protected]>
