

From: Luca Campanelli <l.campanelli@yahoo.it>
To: "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject: Re: Re: st: creating random groups of observations
Date: Fri, 7 Dec 2012 17:26:26 +0000 (GMT)

Thank you very much for your answer, Clyde. I appreciate it.

I have not been very specific about the criterion for constraining the groups of words because there is some flexibility, and the final decision will be data driven. In other words, the final criterion should allow me to get the 1000 groups (as you said, if the criterion is too strict I cannot get 1000 groups), and at the same time I’d like the groups of words to be as homogeneous as possible in terms of total number of characters (the sum of the number of characters of the four words that form a group).

This is the distribution of the words in terms of number of characters (CHA = number of characters):

     CHA | Percent
    -----+---------
       2 |     0.2
       4 |     3.7
       5 |    14.6
       6 |    28.1
       7 |    24.1
       8 |    18.7
       9 |     7.9
      10 |     2.7

A possible starting criterion could be that each group has to have, let’s say, a total of between 35 and 50 characters.

Clyde suggests that this could be done in C++. I’m not familiar with that programming language, and I would rather not go into it unless necessary. But I recognize that it’s a good piece of advice.

What I have in mind, which seems to me possible in Stata, is a sort of loop that creates a group of randomly selected words, tests it against my criterion, and then keeps the group if it meets the criterion and drops it otherwise. This is what I’ve done. It seems to work, but I don’t know if there are better ways to do it. Note that rdm = variable with random numbers; group = variable with the group number (set only if the group meets the criterion); totchr = total number of characters in the group:

    forvalues i = 1/5000 {
        // new random order for the words that are not yet in a group
        replace rdm = .
        replace rdm = runiform() if group == .
        sort rdm
        // the first four ungrouped words become candidate group `i'
        replace group = `i' if _n <= 4 & group == .
        egen totchr = total(chr) if group == `i'
        // release the four words again if the total is out of range
        replace group = . if (totchr < 35 | totchr > 50) & group == `i'
        drop totchr
    }

If anybody sees any bugs or knows ways to make this loop better, please let me know. Any help is appreciated.

Thanks a lot.
Luca

On Thu, 6 Dec 2012 19:24, Clyde B Schechter clyde.schechter@einstein.yu.edu wrote:

Luca Campanelli wants to randomly assort 4000 words into 1000 groups with 4 words each, and he wants to ensure that each group has a satisfactory mix of long and short words. He doesn't specify exactly what criterion defines a satisfactory mix, so it is hard to be concrete. But here are a few thoughts.

First, depending on the frequency distribution of long and short words (and even what is meant by long and short in this context), it may not even be possible. For example, if there are only 100 "short" words in the data set, then clearly the goal cannot be achieved.

Assuming that long and short words are both prevalent in sufficient numbers, creating 1000 groups of 2 long words and 1000 groups of 2 short words, and then combining each long word group with its correspondingly numbered short word group, might do it, again depending on exactly what you have in mind.

If Luca has in mind some more complex criterion, such as constraints on the mean and variance of the number of characters in each group's words, that is something I would not try to accomplish in Stata. It could be done in C++ or a similar programming language using a branch-and-bound algorithm. But expect it to take a long time to run even on a fast machine: you are trying to tame a combinatorial explosion by imposing a few constraints. And, again, be prepared for the possibility that the actual distribution of word lengths precludes the existence of any solution at all, which you would only find out after a very long time.

Best of luck.

Clyde Schechter
Dept. of Family & Social Medicine
Albert Einstein College of Medicine
Bronx, NY, USA

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/
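
For what it is worth, the pairing idea in Clyde's message (1000 pairs of short words, 1000 pairs of long words, matched by number) could be sketched in Stata roughly as below. This is only an illustration under assumptions that are not in the thread: "long" and "short" are defined by a median split on chr, the names u, islong, group and totchr are made up for the example, and the first block invents toy word lengths so that the code runs on its own (with the real data in memory, skip that block). It does not enforce any bound on the group totals; it only balances long and short words.

```stata
* --- toy data, for illustration only: 4000 invented word lengths ---
clear
set obs 4000
set seed 2012
gen byte chr = 4 + floor(7*runiform())    // placeholder lengths 4..10

* --- median split: shorter half vs. longer half (an assumption) ---
gen double u = runiform()
sort chr u                                // ties in length broken at random
gen byte islong = _n > _N/2               // 0 = shorter half, 1 = longer half

* --- random pairs within each half, numbered 1..1000 ---
replace u = runiform()
bysort islong (u): gen int group = ceil(_n/2)

* --- same group number = 2 short words + 2 long words ---
bysort group: egen totchr = total(chr)
summarize totchr                          // inspect how homogeneous the totals are
```

Each group then contains two words from the shorter half and two from the longer half. Whether the resulting totals are homogeneous enough has to be judged from the summarize output; if they are not, a rejection step like the loop above would still be needed.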
