Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
Michael Goodwin <michael.goodwin@endeavor.org> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: Social Network Analysis shortest path centrality |

Date |
Mon, 7 Apr 2014 18:00:48 -0400 |

This is a follow-on question to one I had a few months ago. The network data I am working with has grown substantially, and we now have ~1300 nodes with ~2000 connections among them. I am using closeness centrality (specifically, harmonic centrality) to analyze the network. In Stata, I first determine all paths between any two nodes, and then determine the shortest path between those two nodes. The process of identifying all paths using joinby no longer works for a network of this size/complexity. Currently, Stata 13 SE 64 bit with 2GB of memory allocated to it crashes when carrying this analysis out. I'd like to be able to create a dataset containing the length of the shortest path between any two nodes, as well as the centrality score. I know that I can get the lengths of every shortest path by successively squaring the adjacency matrix for my network—and can use these numbers to get centrality scores for every node—but I'm unsure how I might arrive at the shortest paths themselves. I've considered Djikstra's algorithm and Seidel's algorithm, but I need help implementing them in Stata. Another idea I've had is to use the shortest-path information from squaring (and cubing, etc.) the adjacency matrix to find the paths themselves, but I'm not sure whether the time complexity of this approach would represent an improvement. Below is the code I am currently using. Section "Determine all paths" at the line joinby sourceID using "`main'", unmatched(master); is where Stata crashes (after expanding to ~9,000,000 observations). Any thoughts on how to skip this portion of the analysis would be much appreciated. * Use the relevant file; use ".\dta\edgeAllCleanComplete.dta"; keep sourceID targetID; * Determine all paths; tempfile main; save "`main'"; isid sourceID targetID; rename sourceID company; local i 0; local more 1; local isDone company; while `more' {; local ++i; rename targetID sourceID; joinby sourceID using "`main'", unmatched(master); drop _merge; rename sourceID level`i'; replace targetID = . if inlist(targetID,`isDone'); local isDone `isDone', level`i'; sort company level`i'; by company level`i': gen one = _n == 1 & !mi(level`i'); by company: egen connect`i' = total(one); drop one; count if !mi(targetID); local more = r(N); }; drop targetID; sort company level*; order company level* connect*; qui compress; save ".\dta\centralityAllPaths.dta", replace; * Shortest path, convert to long from; tempfile results; save "`results'"; forvalues j=1/`i' {; use "`results'", clear; keep company level`j'; rename level`j' level; gen distance = `j'; drop if mi(level); tempfile long`j'; save "`long`j''"; }; use "`long1'", clear; forvalues j=2/`i' {; append using "`long`j''"; }; sort company level distance; by company level: keep if _n == 1; * Centrality; gen inverse = 1/distance; bysort company: egen centrality = sum(inverse); qui compress; save ".\dta\centralityShortestPaths.dta", replace; On Wed, Sep 4, 2013 at 1:16 PM, Robert Picard <picard@netbox.com> wrote: > To find the shortest path, the easiest is to convert the > results to long form. There's perhaps a way to do this > using -reshape- but here's how I think about it > > * --------------------- begin example --------------------- > clear > input source target > 1 2 > 1 5 > 1 6 > 1 9 > 2 3 > 2 5 > 2 7 > 5 8 > 5 6 > 8 9 > 8 1 > end > tempfile main > save "`main'" > * make sure the data has no duplicates > isid source target > > rename source co > local i 0 > local more 1 > local isdone co > while `more' { > local ++i > rename target source > joinby source using "`main'", unmatched(master) > drop _merge > rename source level`i' > replace target = . if inlist(target,`isdone') > local isdone `isdone', level`i' > sort co level`i' > by co level`i': gen one = _n == 1 & !mi(level`i') > by co: egen connect`i' = total(one) > drop one > count if !mi(target) > local more = r(N) > } > drop target > sort co level* > order co level* connect* > list, sepby(co) noobs > > * shortest path, convert to long from > tempfile results > save "`results'" > forvalues j=1/`i' { > use "`results'", clear > keep co level`j' > rename level`j' level > gen distance = `j' > drop if mi(level) > tempfile long`j' > save "`long`j''" > } > use "`long1'", clear > forvalues j=2/`i' { > append using "`long`j''" > } > sort co level distance > by co level: keep if _n == 1 > list, sepby(co) noobs > * --------------------- end example ----------------------- > > > On Wed, Sep 4, 2013 at 12:57 PM, Michael Goodwin > <michael.goodwin@endeavor.org> wrote: >> (prematurely sent) >> >> I'm trying an approach to loop over observations within a given co and >> to make blank any values that match any previous connection values. >> For some reason, this block of code beginning at forv totalCompanies = >> 1/$maxSource { doesn't seem to work, despite the fact that when I >> manually walk through the key line of the code (replace level`i' = . >> if level`i'==level`count'[`totalConnections'] & level`i'!=.), the line >> does what I want, replacing the furthest level with a . if the >> connection has already been made by a previous level. If anyone is >> able to point out why this isn't working, or a different approach, >> that would be a big help. >> >> On Wed, Sep 4, 2013 at 12:54 PM, Michael Goodwin >> <michael.goodwin@endeavor.org> wrote: >>> I'm trying an approach to loop over observations within a given co and >>> to make blank any values that match any previous connection values. >>> For some reason, this block of code beginning at forv totalCompanies = >>> 1/$maxSource { doesn't seem to work, despite the fact that when I >>> >>> >>> clear >>> input source target >>> 1 2 >>> 1 5 >>> 1 6 >>> 1 9 >>> 2 3 >>> 2 5 >>> 2 7 >>> 3 8 >>> 3 6 >>> 4 9 >>> 4 1 >>> end >>> >>> >>> * Create macros with total number of companies >>> preserve >>> duplicates drop source, force >>> global maxSource = _N >>> restore >>> tempfile main >>> save "`main'" >>> >>> * Check for duplicates >>> sort source target >>> isid source target >>> >>> * Create dataset containing all connections >>> rename source co >>> gen level0 = co >>> local i 0 >>> local more 1 >>> local isdone co >>> while `more' { >>> local ++i >>> rename target source >>> joinby source using `main', unmatched(master) >>> drop _merge >>> rename source level`i' >>> replace target = . if inlist(target,`isdone') >>> local isdone `isdone', level`i' >>> forv totalCompanies = 1/$maxSource { >>> preserve >>> keep if co==`totalCompanies' >>> local maxConnections = _N >>> forv num = 1/`i' { >>> local count = `i'-`num' >>> forv totalConnections = 1/`maxConnections' { >>> set trace on >>> replace level`i' = . if level`i'==level`count'[`totalConnections'] & level`i'!=. >>> set trace off >>> } >>> } >>> restore >>> } >>> sort co level`i' >>> by co level`i': gen one = _n == 1 & !mi(level`i') >>> by co: egen connect`i' = total(one) >>> drop one >>> count if !mi(target) >>> local more = r(N) >>> qui compress >>> } >>> >>> drop target >>> sort co level* >>> order co level* connect* >>> >>> On Wed, Sep 4, 2013 at 11:10 AM, Michael Goodwin >>> <michael.goodwin@endeavor.org> wrote: >>>> This is very useful and more efficient than what I had put together. >>>> The final piece of the puzzle is identifying which observations >>>> contain the shortest path between the "co" and a given node. In the >>>> example above, in the third observation, co 1 is connected to 9 in >>>> level4. However, in the tenth observation, co 1 is connected to 9 in >>>> level1. Is there an efficient way to replace level with a missing >>>> value if that combination of co and level already exists and is a >>>> shorter path? I'm working on this final piece now, but would love to >>>> hear any thoughts. Thanks in advance for your help. >>>> >>>> On Wed, Sep 4, 2013 at 8:54 AM, Robert Picard <picard@netbox.com> wrote: >>>>> You can stop a list of connections from looping back onto >>>>> itself by replacing target with a missing value if it >>>>> identifies a company that is already part of the list for >>>>> that observation. This can easily be done using -inlist()-. >>>>> Since -inlist()- does not handle a regular varlist, you can >>>>> use a local macro to build-up the comma separated variable >>>>> list as you go. >>>>> >>>>> Note that it really helps to create a toy dataset to sort >>>>> out these problems. Your proposed code does not stop looping >>>>> when used with the test data I'm using. >>>>> >>>>> * --------------------- begin example --------------------- >>>>> clear >>>>> input source target >>>>> 1 2 >>>>> 1 5 >>>>> 1 6 >>>>> 1 9 >>>>> 2 3 >>>>> 2 5 >>>>> 2 7 >>>>> 5 8 >>>>> 5 6 >>>>> 8 9 >>>>> 8 1 >>>>> end >>>>> tempfile main >>>>> save "`main'" >>>>> * make sure the data has no duplicates >>>>> isid source target >>>>> >>>>> rename source co >>>>> local i 0 >>>>> local more 1 >>>>> local isdone co >>>>> while `more' { >>>>> local ++i >>>>> rename target source >>>>> joinby source using "`main'", unmatched(master) >>>>> drop _merge >>>>> rename source level`i' >>>>> replace target = . if inlist(target,`isdone') >>>>> local isdone `isdone', level`i' >>>>> sort co level`i' >>>>> by co level`i': gen one = _n == 1 & !mi(level`i') >>>>> by co: egen connect`i' = total(one) >>>>> drop one >>>>> count if !mi(target) >>>>> local more = r(N) >>>>> } >>>>> drop target >>>>> sort co level* >>>>> order co level* connect* >>>>> list, sepby(co) noobs >>>>> * --------------------- end example ----------------------- >>>>> >>>>> On Tue, Sep 3, 2013 at 6:15 PM, Michael Goodwin >>>>> <michael.goodwin@endeavor.org> wrote: >>>>>> I think I've figured out the issue. Using this approach, when dropping >>>>>> values that are equal, you need to ensure that you aren't dropping >>>>>> null values. Below is the code to drop only matching non-null values. >>>>>> Following the loop are the final few lines to calculate centrality >>>>>> using the approach I mentioned in my initial post. Thanks for your >>>>>> help! >>>>>> >>>>>> * Create dataset containing all connections; >>>>>> rename Source company; >>>>>> local i 0; >>>>>> local more 1; >>>>>> while `more' {; >>>>>> local ++i; >>>>>> rename Target Source; >>>>>> joinby Source using `main', unmatched(master); >>>>>> drop _merge; >>>>>> rename Source level`i'; >>>>>> sort company level`i'; >>>>>> forv num = 1/`i' {; >>>>>> local count = `i'-`num'; >>>>>> cap drop if level`i'==level`count' & level`i'!=""; >>>>>> cap drop if company==level`count'; >>>>>> }; >>>>>> by company level`i': gen one = _n == 1 & !mi(level`i'); >>>>>> by company: egen connect`i' = total(one); >>>>>> drop one; >>>>>> count if !mi(Target); >>>>>> local more = r(N); >>>>>> qui compress; >>>>>> }; >>>>>> >>>>>> * Centrality; >>>>>> egen totalPath = rownonmiss(connect*), strok; >>>>>> local maxPath = totalPath; >>>>>> forv num = 1/`maxPath' {; >>>>>> gen connectCentrality`num' = connect`num'/`num'; >>>>>> }; >>>>>> egen centrality = rowtotal(connectCentrality*); >>>>>> >>>>>> On Tue, Sep 3, 2013 at 5:10 PM, Michael Goodwin >>>>>> <michael.goodwin@endeavor.org> wrote: >>>>>>> Robert, this is very helpful and actually not so far from what I had >>>>>>> originally coded out. >>>>>>> >>>>>>> The only issue I am encountering now is that some of the networks >>>>>>> actually double back on themselves, so that the loop you've written >>>>>>> out continues on infinitely (or would if Stata didn't have observation >>>>>>> limits). >>>>>>> >>>>>>> I think the solution is to write code that drops any observations in >>>>>>> which the level`i' variable is equal to any of the preceding >>>>>>> level[`i'-1], level[`i'-2], etc. variables. Do you have any thoughts >>>>>>> on how to best accomplish that? My thinking was to create a loop that >>>>>>> compares the current level with each previous level and drops the >>>>>>> observation if any two values match. I have to use capture because >>>>>>> there is no level0. This doesn't seem to be working using my dataset. >>>>>>> >>>>>>> >>>>>>> * --------------------- begin example --------------------- >>>>>>> clear >>>>>>> input Source Target >>>>>>> 1 2 >>>>>>> 1 5 >>>>>>> 1 6 >>>>>>> 1 9 >>>>>>> 2 3 >>>>>>> 2 5 >>>>>>> 2 7 >>>>>>> 5 8 >>>>>>> 5 6 >>>>>>> 8 9 >>>>>>> end >>>>>>> tempfile main >>>>>>> save "`main'" >>>>>>> * make sure the data has no duplicates >>>>>>> isid Source Target >>>>>>> >>>>>>> rename Source co; >>>>>>> local i 0; >>>>>>> local more 1; >>>>>>> while `more' {; >>>>>>> local ++i; >>>>>>> rename Target Source; >>>>>>> joinby Source using `main', unmatched(master); >>>>>>> drop _merge; >>>>>>> rename Source level`i'; >>>>>>> sort co level`i'; >>>>>>> forv num = 1/`i' {; >>>>>>> local count = `i'-`num'; >>>>>>> cap drop if level`i'==level`count'; >>>>>>> cap drop if co==level`count'; >>>>>>> }; >>>>>>> by co level`i': gen one = _n == 1 & !mi(level`i'); >>>>>>> by co: egen connect`i' = total(one); >>>>>>> drop one; >>>>>>> count if !mi(Target); >>>>>>> local more = r(N); >>>>>>> qui compress; >>>>>>> }; >>>>>>> drop Target; >>>>>>> sort co level*; >>>>>>> order co level* connect*; >>>>>>> list, sepby(co) noobs; >>>>>>> * --------------------- end example ----------------------- >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Sun, Sep 1, 2013 at 11:44 AM, Robert Picard <picard@netbox.com> wrote: >>>>>>>> There is indeed a problem in merging the list with itself >>>>>>>> as it leads to many-to-many merges and I have yet to see >>>>>>>> one case where an m:m merge is useful. You can however use >>>>>>>> -joinby- to perform what you intuitively would want -merge- >>>>>>>> to do in this case. >>>>>>>> >>>>>>>> * --------------------- begin example --------------------- >>>>>>>> clear >>>>>>>> input source target >>>>>>>> 1 2 >>>>>>>> 1 5 >>>>>>>> 1 6 >>>>>>>> 1 9 >>>>>>>> 2 3 >>>>>>>> 2 5 >>>>>>>> 2 7 >>>>>>>> 5 8 >>>>>>>> 5 6 >>>>>>>> 8 9 >>>>>>>> end >>>>>>>> tempfile main >>>>>>>> save "`main'" >>>>>>>> * make sure the data has no duplicates >>>>>>>> isid source target >>>>>>>> >>>>>>>> rename source co >>>>>>>> local i 0 >>>>>>>> local more 1 >>>>>>>> while `more' { >>>>>>>> local ++i >>>>>>>> rename target source >>>>>>>> joinby source using "`main'", unmatched(master) >>>>>>>> drop _merge >>>>>>>> rename source level`i' >>>>>>>> sort co level`i' >>>>>>>> by co level`i': gen one = _n == 1 & !mi(level`i') >>>>>>>> by co: egen connect`i' = total(one) >>>>>>>> drop one >>>>>>>> count if !mi(target) >>>>>>>> local more = r(N) >>>>>>>> } >>>>>>>> drop target >>>>>>>> sort co level* >>>>>>>> order co level* connect* >>>>>>>> list, sepby(co) noobs >>>>>>>> * --------------------- end example ----------------------- >>>>>>>> >>>>>>>> Original message follows: >>>>>>>> >>>>>>>> st: Social Network Analysis shortest path centrality >>>>>>>> >>>>>>>> From Michael Goodwin <michael.goodwin@endeavor.org> >>>>>>>> To statalist@hsphsun2.harvard.edu >>>>>>>> Subject st: Social Network Analysis shortest path centrality >>>>>>>> Date Fri, 30 Aug 2013 13:53:16 -0400 >>>>>>>> Hi, >>>>>>>> >>>>>>>> I am trying to do some light social network analysis on a dataset >>>>>>>> containing a list of edges. I have the dataset organized such that >>>>>>>> there are two variables, Source and Target. Bot the Source and Target >>>>>>>> are companies, and the connection between indicates that an employee >>>>>>>> from Source went on to found Target. The relationship between these >>>>>>>> two variables is indeterminate (i.e. m:m) and although the variables >>>>>>>> start as strings, I've converted them to numeric values using encode >>>>>>>> (and ensured >>>>>>>> that the numeric values in both Target and Source are equal to one another). >>>>>>>> >>>>>>>> I am attempting to determine the number of first, second, third,...,n >>>>>>>> degree connections that each Source has. For example if an employees >>>>>>>> from Company A went on to found Company B and then employees from >>>>>>>> Company B went on to found Companies C and D, Company A would have 1 >>>>>>>> first degree connection and 2 second degree connections. >>>>>>>> >>>>>>>> My goal is to create something similar to a shortest path measurement >>>>>>>> whereby a first degree connection is equal to 1, a second degree >>>>>>>> connection 1/2, a third degree connection 1/3, and so forth. In the >>>>>>>> above example, Company A's score would be (1/1)+(2/2) or 2. I believe >>>>>>>> this is a closeness/shortest path centrality approach, but I may be >>>>>>>> mistaken (and would love to be corrected!). >>>>>>>> >>>>>>>> After making the connections symmetric (i.e. all pairs are present as >>>>>>>> both inbound and outbound connections), I've attempted three >>>>>>>> approaches, all without success: >>>>>>>> >>>>>>>> 1. Use netsis and netsummarize. Neither the adjacency nor closeness >>>>>>>> calculations seems to get me to the right answer. I don't have >>>>>>>> experience using mata, but it appears that the matrix generate by >>>>>>>> netsis doesn't reflect the appropriate connections (i.e. a connection >>>>>>>> in the original edge list is not represented by a 1 in the matrix) >>>>>>>> >>>>>>>> netsis Source Target, measure(adjacency) name(A, replace); >>>>>>>> netsummarize A/(rows(A)-1), generate(degree) statistic(rowsum); >>>>>>>> >>>>>>>> netsis Source Target, measure(distance) name(D, replace) >>>>>>>> netsummarize (rows(D)-1):/rowsum(D), generate(closeness) statistic(rowsum) >>>>>>>> >>>>>>>> 2. Create a matrix data structure in Stata and use centpow. I keep >>>>>>>> receiving an error noting that the matrix is not symmetrical. I've >>>>>>>> checked and made sure that the dataset is a perfect square (it has 707 >>>>>>>> observations and 707 variables) and that a connection between Company >>>>>>>> A and Company B is also represented by a connection between Company B >>>>>>>> and Company A. Does centpow require the data to actually be in a mata >>>>>>>> matrix? >>>>>>>> >>>>>>>> use ".\dta\\${connection}_connectionIDSymmetric${typeInt}.dta"; >>>>>>>> contract targetID sourceID; >>>>>>>> reshape wide _freq , i(targetID) j(sourceID); >>>>>>>> qui foreach v of var _freq* {; >>>>>>>> replace `v' = 0 if mi(`v'); >>>>>>>> }; >>>>>>>> drop targetID; >>>>>>>> save ".\dta\\${connection}_adjacencyMatrix${typeInt}.dta", replace; >>>>>>>> centpow ".\dta\\${connection}_adjacencyMatrix${typeInt}.dta"; >>>>>>>> >>>>>>>> 3. Start with the edgelist, and merge it with itself, changing the >>>>>>>> Target and Source variable names such that Target becomes Source for >>>>>>>> the second degree connection and so forth (I think this is >>>>>>>> demonstrably not the solution, so I won't elaborate further). >>>>>>>> >>>>>>>> I think this either has a simple solution that I can't think of >>>>>>>> involving the edge list, or will involve a more intensive solution >>>>>>>> using mata. If anyone has experience or could point me in the >>>>>>>> direction of content (Statalist has limited SNA resources), that would >>>>>>>> be a huge help. >>>>>>>> >>>>>>>> Here are some of the resources I've already reviewed: >>>>>>>> http://www.rensecorten.org/index.php/research/social-network-analysis-with-stata/ >>>>>>>> https://sites.google.com/site/statagraphlibrary/netgen111 >>>>>>>> http://www.ats.ucla.edu/stat/sna/sna_stata.htm >>>>>>>> >>>>>>>> Thanks in advance. >>>>>>>> >>>>>>>> Best, >>>>>>>> >>>>>>>> Mike >>>>>>>> * >>>>>>>> * For searches and help try: >>>>>>>> * http://www.stata.com/help.cgi?search >>>>>>>> * http://www.stata.com/support/faqs/resources/statalist-faq/ >>>>>>>> * http://www.ats.ucla.edu/stat/stata/ >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Mike Goodwin >>>>>>> Senior Associate, Endeavor Insight >>>>>>> 900 Broadway | Suite 301 | New York, NY 10003 >>>>>>> +1 646 368 6354 | Skype: michael.p.goodwin >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Mike Goodwin >>>>>> Senior Associate, Endeavor Insight >>>>>> 900 Broadway | Suite 301 | New York, NY 10003 >>>>>> +1 646 368 6354 | Skype: michael.p.goodwin >>>>>> * >>>>>> * For searches and help try: >>>>>> * http://www.stata.com/help.cgi?search >>>>>> * http://www.stata.com/support/faqs/resources/statalist-faq/ >>>>>> * http://www.ats.ucla.edu/stat/stata/ >>>>> * >>>>> * For searches and help try: >>>>> * http://www.stata.com/help.cgi?search >>>>> * http://www.stata.com/support/faqs/resources/statalist-faq/ >>>>> * http://www.ats.ucla.edu/stat/stata/ >>>> >>>> >>>> >>>> -- >>>> Mike Goodwin >>>> Senior Associate, Endeavor Insight >>>> 900 Broadway | Suite 301 | New York, NY 10003 >>>> +1 646 368 6354 | Skype: michael.p.goodwin >>> >>> >>> >>> -- >>> Mike Goodwin >>> Senior Associate, Endeavor Insight >>> 900 Broadway | Suite 301 | New York, NY 10003 >>> +1 646 368 6354 | Skype: michael.p.goodwin >> >> >> >> -- >> Mike Goodwin >> Senior Associate, Endeavor Insight >> 900 Broadway | Suite 301 | New York, NY 10003 >> +1 646 368 6354 | Skype: michael.p.goodwin >> * >> * For searches and help try: >> * http://www.stata.com/help.cgi?search >> * http://www.stata.com/support/faqs/resources/statalist-faq/ >> * http://www.ats.ucla.edu/stat/stata/ > * > * For searches and help try: > * http://www.stata.com/help.cgi?search > * http://www.stata.com/support/faqs/resources/statalist-faq/ > * http://www.ats.ucla.edu/stat/stata/ -- MIKE GOODWIN Project Leader, Endeavor Insight 900 Broadway, Suite 301 New York, NY 10003 www.endeavor.org Tel: 646-368-6354 Skype: michael.p.goodwin * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/

- Prev by Date:
**st: complete data import** - Next by Date:
**Re: st: complete data import** - Previous by thread:
**st: complete data import** - Next by thread:
**FW: st: problem with command xtoverid** - Index(es):