Re: Re: st: creating random groups of observations


From   Luca Campanelli <l.campanelli@yahoo.it>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   Re: Re: st: creating random groups of observations
Date   Fri, 7 Dec 2012 17:26:26 +0000 (GMT)

Thank you very much for your answer, Clyde. I appreciate it.
I have not been very specific about the criterion for constraining the groups of words because there is some flexibility, and the final decision will be data driven. In other words, the final criterion should allow me to get the 1000 groups (as you said, if the criterion is too strict I cannot get 1000 groups), and at the same time I'd like the groups of words to be as homogeneous as possible in terms of total number of characters (the sum of the number of characters of the four words that form a group).

This is the distribution of the words in terms of number of characters (CHA = number of characters):

 CHA |  Percent
-----+---------
   2 |      0.2
   4 |      3.7
   5 |     14.6
   6 |     28.1
   7 |     24.1
   8 |     18.7
   9 |      7.9
  10 |      2.7

A possible starting criterion could be that each group must have a total of, say, between 35 and 50 characters.
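
As a rough feasibility check on any such range, the expected total for a randomly drawn group is four times the mean word length. The lines below are only a sketch: they take the rounded percentages in the table above at face value and convert them to proportions.

```stata
* mean word length implied by the table (percentages written as proportions)
display 2*0.002 + 4*0.037 + 5*0.146 + 6*0.281 + 7*0.241 + 8*0.187 + 9*0.079 + 10*0.027

* expected total number of characters in a random group of four words
display 4 * (2*0.002 + 4*0.037 + 5*0.146 + 6*0.281 + 7*0.241 + 8*0.187 + 9*0.079 + 10*0.027)
```

These work out to roughly 6.7 characters per word and 27 per group. Since no word has more than 10 characters, no group can total more than 40, and with 4000 words averaging about 6.7 characters the 1000 groups cannot all reach 35. If the table is accurate, the lower bound would have to come down for 1000 groups to be attainable.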

Clyde suggests that this could be done in C++. I'm not familiar with that programming language, and I would rather not go into it unless necessary. But I recognize that it's a good piece of advice.

What I have in mind, which seems to me possible in Stata, is a sort of loop that creates a group of randomly selected words, tests it against my criterion, and then keeps the group if it meets the criterion and drops it otherwise.
This is what I've done, which seems to work, but I don't know if there are better ways to do it. Note that: rdm = variable with random numbers; chr = number of characters of each word; group = variable with the group number (set only if the group meets the criterion); totchr = total number of characters of the group:

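forvalues i=1/5000 {
    * draw fresh random numbers for the words not yet assigned to a group
    replace rdm = .
    replace rdm = runiform() if group == .
    sort rdm
    * the four ungrouped words with the smallest random numbers form the candidate group
    replace group = `i' if _n <= 4 & group == .
    egen totchr = total(chr) if group == `i'
    * reject the candidate if its total is outside 35-50; the parentheses matter because & binds tighter than |
    replace group = . if (totchr < 35 | totchr > 50) & group == `i'
    drop totchr
}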
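
One possible variant of the same keep-or-reject idea is sketched below. It is not a tested solution: it assumes one word per observation and the variable chr as above, it uses an arbitrary seed, and the cap of 5000 draws is simply carried over from the loop above. The differences are that it stops as soon as 1000 groups have been accepted, numbers the accepted groups consecutively, and reports how many draws were needed.

```stata
* Sketch only: assumes one word per observation and a numeric variable chr
set seed 12345                     // arbitrary seed, only for reproducibility
capture drop rdm
capture drop group
generate double rdm = .
generate int group = .
local accepted = 0                 // groups kept so far
local tries = 0                    // candidate groups drawn so far
while `accepted' < 1000 & `tries' < 5000 {
    local ++tries
    quietly count if missing(group)
    if r(N) < 4 {
        continue, break            // too few ungrouped words left
    }
    quietly replace rdm = .
    quietly replace rdm = runiform() if missing(group)
    sort rdm                       // ungrouped words first, in random order
    quietly summarize chr in 1/4   // r(sum) = total characters of the candidate
    if inrange(r(sum), 35, 50) {
        local ++accepted
        quietly replace group = `accepted' in 1/4
    }
}
display "accepted `accepted' groups after `tries' draws"
```

If the final display reports fewer than 1000 groups, the range is too strict for the data (or the leftover words cannot be combined into acceptable groups), which is the situation Clyde warns about.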

If anybody sees any bugs or knows ways to make this loop better please let me know. Any help is appreciated. 
Thanks a lot. 

Luca



On Thu, 6 Dec 2012 19:24, Clyde B Schechter clyde.schechter@einstein.yu.edu wrote: 

Luca Campanelli wants to randomly assort 4000 words into 1000 groups with 4 words each, and he wants to ensure that each group has a satisfactory mix of long and short words.  He doesn't specify exactly what criterion defines a satisfactory mix, so it is hard to be concrete.  But here are a few thoughts.

First, depending on the frequency distribution of long and short words (and even what is meant by long and short in this context), it may not even be possible.  For example, if there are only 100 "short" words in the data set, then clearly the goal cannot be achieved.

Assuming that long and short words are all prevalent in sufficient numbers, then creating 1000 groups of 2 long words and 1000 groups of 2 short words, then combining each long word group with its correspondingly numbered short word group might do it, again depending on exactly what you have in mind.
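
The pairing idea in the previous paragraph might be sketched in Stata as follows. This is an illustration under assumptions that are not in the original messages: a numeric variable chr holds each word's character count, "long" is taken to mean the upper half of the words by length, and the names u, islong, group and totchr are hypothetical.

```stata
* Sketch only: splits the words at the median length and pairs the halves
set seed 12345                             // arbitrary seed
generate double u = runiform()
sort chr u                                 // shortest words first, ties in random order
generate byte islong = _n > _N/2           // upper half by length counts as "long"
replace u = runiform()                     // fresh draw for the shuffle within halves
sort islong u
by islong: generate int group = ceil(_n/2) // two short and two long words per group
egen totchr = total(chr), by(group)        // total characters per group, for inspection
tabulate totchr
```

With 4000 words this gives groups 1 to 1000, each holding two words from the shorter half and two from the longer half. The tabulate line shows how widely the group totals still vary.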

If Luca has in mind some more complex criterion such as constraints on the mean and variance of the number of characters in each group's words, that is something I would not try to accomplish in Stata.  It could be done in C++ or a similar programming language using a branch-and-bound algorithm.  But expect it to take a long time to run even on a fast machine: you are trying to tame a combinatorial explosion by imposing a few constraints.  And, again, be prepared for the possibility that the actual distribution of word lengths precludes the existence of any solution at all--which you would only find out after a very long time.

Best of luck.

Clyde Schechter
Dept. of Family & Social Medicine
Albert Einstein College of Medicine
Bronx, NY, USA
