Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down on April 23, and its replacement, **statalist.org** is already up and running.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
Nick Cox <njcoxstata@gmail.com> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: Random simulations of flow matrix with constraints |

Date |
Fri, 22 Feb 2013 02:11:45 +0000 |

For the manual material see http://www.stata.com/help.cgi?quotes i.e. [U] 18.3.1. On Fri, Feb 22, 2013 at 2:10 AM, Nick Cox <njcoxstata@gmail.com> wrote: > Local macro references start and end with different single quotation > marks. The pattern is `macname'. The first is char(96). This is > explained in the manuals and e.g. at > > http://www.stata.com/support/faqs/programming/correct-delimiter-for-local-macros/ > > I have not tried to understand all your code, but I fear that it will > be very slow. The major problem is repeated loops, some 50 billion of > them. As this is interpreted code, you may need Mata instead. > > Testing expressions such as > > total_out=X_out_clone > > for equality will fail because Stata uses == not = for that purpose. > > You could get small speed-ups by replacing > > gen constrained_out = 0 > replace constrained_out = 1 if total_out=X_out_clone > gen constrained_in = 0 > replace constrained_in = 1 if total_in=X_in_clone > gen constrained = max(constrained_out, constrained_in) > > with one line > > gen constrained = max(total_out == X_out_clone, total_in==X_in_clone) > > so that there is no need for the variables -constrained_in- and > -constrained_out-. In fact that variable -constrained- can be omitted > too. > > Copying -X`j'- to -Y- is also unnecessary. > > -if _n == 1- will be faster as -in 1- for a large dataset. > > So -- a quick and unchecked hack -- > > forvalues j = 1/100 { > gen X`j' = 0 > forvalues i = 1/533361588 { > sort startstationid > egen X_out=total(X`j'), by(startstationid) > sort endstationid > egen X_in=total(X`j'), by(endstationid) > generate double u = runiform() if max(total_out == > X_out_clone, total_in==X_in_clone) !=1 > sort u > replace X`j' = X`j' + 1 in 1 > drop X_out X_in X_out_clone X_in_clone u > } > } > > However, after that it still looks buggy. At the end of the inner > loop, you drop your -*clone- variables, so that they are not there > second time round. So this will fail. > > What else can we do? > > if max(total_out == X_out_clone, total_in==X_in_clone) !=1 > > looks like > > if total_out != X_out_clone & total_in != X_in_clone > > which may be faster. > > Repeated -egen- calls will be slower than you want. There is a lot of > interpreted code inside -egen- that is irrelevant to your purpose. You > can cut out all the -egen- calls by replacing > > sort startstationid > egen X_out=total(X`j'), by(startstationid) > sort endstationid > egen X_in=total(X`j'), by(endstationid) > > with > > bysort startstationid : gen X_out = sum(X`j') > by startstationid : replace X_out = X_out[_N] > bysort endstationid : gen X_in = sum(X`j') > by endstationid : replace X_in = X_in[_N] > > That may not look much different, but it will be much faster. > > What remains as obvious problems are overall speed and the bug > mentioned above. I've only looked at this code, not tried it. Above > all, I didn't read the paper you cited. > > Nick > > On Fri, Feb 22, 2013 at 1:26 AM, Gordon Lee <gordon@gordonlee.me> wrote: > >> I have data on traffic flows amongst a set of stations. >> >> What I would like to do on Stata is to create 100 random simulations >> that preserve the total number of incoming and outgoing links for each >> station. This is in the footsteps of Roth et al (2011) under "the null >> model". Link: http://www.plosone.org/article/info:doi/10.1371/journal.pone.0015923 >> >> In other words, this is a matrix with constrained row sums and column sums. >> >> ---My data set (illustrative only)--- >> >> startstationid endstationid total_out total_in >> --------------------------------------------------------------------------- >> A B oA dB >> A C oA dC >> A D oA dD >> B A oB dA >> B C oB dC >> B D oB dD >> C A oC dA >> C B oC dB >> C D oC dD >> D A oD dA >> D B oD dB >> D C oD dC >> >> ...where total_out is the sum of traffic flows from the startstation, >> and total_in is the sum of traffic flows into the endstation. All >> items take on integer values. >> >> >> **** >> dofile: https://www.dropbox.com/s/c9s4gmqtd6ycmnz/do%20file.txt >> dataset: https://www.dropbox.com/s/zx9oufzekusnsuj/dataset.dta >> >> I tried the following commands, but stata tells me "'invalid name >> r(198);". I can't seem to be able to fix this. >> >> (note: 533361588 is the sum of all flows in the data) >> >> forvalues j = 1/100 { >> gen X'j' = 0 >> forvalues i = 1/533361588 { >> gen Y = X'j' >> sort startstationid >> egen X_out=total(X'j'), by(startstationid) >> sort endstationid >> egen X_in=total(X'j'), by(endstationid) >> >> gen constrained_out = 0 >> replace constrained_out = 1 if total_out=X_out_clone >> gen constrained_in = 0 >> replace constrained_in = 1 if total_in=X_in_clone >> >> gen constrained = max(constrained_out, constrained_in) >> >> generate double u = runiform() if constrained!=1 >> sort u >> >> replace X'j' = Y + 1 if _n == 1 >> >> drop Y X_out X_in X_out_clone X_in_clone constrained_out >> constrained_in constrained u >> } >> } >> * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/

**References**:**st: Random simulations of flow matrix with constraints***From:*Gordon Lee <gordon@gordonlee.me>

**Re: st: Random simulations of flow matrix with constraints***From:*Nick Cox <njcoxstata@gmail.com>

- Prev by Date:
**Re: st: Random simulations of flow matrix with constraints** - Next by Date:
**RE: st: RE: How to perform a Wald test between the fixed effect...** - Previous by thread:
**Re: st: Random simulations of flow matrix with constraints** - Index(es):