Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Social Network Analysis shortest path centrality


From   Robert Picard <[email protected]>
To   "[email protected]" <[email protected]>
Subject   Re: st: Social Network Analysis shortest path centrality
Date   Wed, 4 Sep 2013 13:16:14 -0400

To find the shortest path, the easiest is to convert the
results to long form. There's perhaps a way to do this
using -reshape- but here's how I think about it

* --------------------- begin example ---------------------
clear
input source target
1 2
1 5
1 6
1 9
2 3
2 5
2 7
5 8
5 6
8 9
8 1
end
tempfile main
save "`main'"
* make sure the data has no duplicates
isid source target

rename source co
local i 0
local more 1
local isdone co
while `more' {
local ++i
rename target source
joinby source using "`main'", unmatched(master)
drop _merge
rename source level`i'
replace target = . if inlist(target,`isdone')
local isdone `isdone', level`i'
sort co level`i'
by co level`i': gen one = _n == 1 & !mi(level`i')
by co: egen connect`i' = total(one)
drop one
count if !mi(target)
local more = r(N)
}
drop target
sort co level*
order co level* connect*
list, sepby(co) noobs

* shortest path, convert to long from
tempfile results
save "`results'"
forvalues j=1/`i' {
use "`results'", clear
keep co level`j'
rename level`j' level
gen distance = `j'
drop if mi(level)
tempfile long`j'
save "`long`j''"
}
use "`long1'", clear
forvalues j=2/`i' {
append using "`long`j''"
}
sort co level distance
by co level: keep if _n == 1
list, sepby(co) noobs
* --------------------- end example -----------------------


On Wed, Sep 4, 2013 at 12:57 PM, Michael Goodwin
<[email protected]> wrote:
> (prematurely sent)
>
> I'm trying an approach to loop over observations within a given co and
> to make blank any values that match any previous connection values.
> For some reason, this block of code beginning at forv totalCompanies =
> 1/$maxSource { doesn't seem to work, despite the fact that when I
> manually walk through the key line of the code (replace level`i' = .
> if level`i'==level`count'[`totalConnections'] & level`i'!=.), the line
> does what I want, replacing the furthest level with a . if the
> connection has already been made by a previous level. If anyone is
> able to point out why this isn't working, or a different approach,
> that would be a big help.
>
> On Wed, Sep 4, 2013 at 12:54 PM, Michael Goodwin
> <[email protected]> wrote:
>> I'm trying an approach to loop over observations within a given co and
>> to make blank any values that match any previous connection values.
>> For some reason, this block of code beginning at forv totalCompanies =
>> 1/$maxSource { doesn't seem to work, despite the fact that when I
>>
>>
>> clear
>> input source target
>> 1 2
>> 1 5
>> 1 6
>> 1 9
>> 2 3
>> 2 5
>> 2 7
>> 3 8
>> 3 6
>> 4 9
>> 4 1
>> end
>>
>>
>> * Create macros with total number of companies
>> preserve
>> duplicates drop source, force
>> global maxSource = _N
>> restore
>> tempfile main
>> save "`main'"
>>
>> * Check for duplicates
>> sort source target
>> isid source target
>>
>> * Create dataset containing all connections
>> rename source co
>> gen level0 = co
>> local i 0
>> local more 1
>> local isdone co
>> while `more' {
>> local ++i
>> rename target source
>> joinby source using `main', unmatched(master)
>> drop _merge
>> rename source level`i'
>> replace target = . if inlist(target,`isdone')
>> local isdone `isdone', level`i'
>> forv totalCompanies = 1/$maxSource {
>> preserve
>> keep if co==`totalCompanies'
>> local maxConnections = _N
>> forv num = 1/`i' {
>> local count = `i'-`num'
>> forv totalConnections = 1/`maxConnections' {
>> set trace on
>> replace level`i' = . if level`i'==level`count'[`totalConnections'] & level`i'!=.
>> set trace off
>> }
>> }
>> restore
>> }
>> sort co level`i'
>> by co level`i': gen one = _n == 1 & !mi(level`i')
>> by co: egen connect`i' = total(one)
>> drop one
>> count if !mi(target)
>> local more = r(N)
>> qui compress
>> }
>>
>> drop target
>> sort co level*
>> order co level* connect*
>>
>> On Wed, Sep 4, 2013 at 11:10 AM, Michael Goodwin
>> <[email protected]> wrote:
>>> This is very useful and more efficient than what I had put together.
>>> The final piece of the puzzle is identifying which observations
>>> contain the shortest path between the "co" and a given node. In the
>>> example above, in the third observation, co 1 is connected to 9 in
>>> level4. However, in the tenth observation, co 1 is connected to 9 in
>>> level1. Is there an efficient way to replace level with a missing
>>> value if that combination of co and level already exists and is a
>>> shorter path? I'm working on this final piece now, but would love to
>>> hear any thoughts. Thanks in advance for your help.
>>>
>>> On Wed, Sep 4, 2013 at 8:54 AM, Robert Picard <[email protected]> wrote:
>>>> You can stop a list of connections from looping back onto
>>>> itself by replacing target with a missing value if it
>>>> identifies a company that is already part of the list for
>>>> that observation. This can easily be done using -inlist()-.
>>>> Since -inlist()- does not handle a regular varlist, you can
>>>> use a local macro to build-up the comma separated variable
>>>> list as you go.
>>>>
>>>> Note that it really helps to create a toy dataset to sort
>>>> out these problems. Your proposed code does not stop looping
>>>> when used with the test data I'm using.
>>>>
>>>> * --------------------- begin example ---------------------
>>>> clear
>>>> input source target
>>>> 1 2
>>>> 1 5
>>>> 1 6
>>>> 1 9
>>>> 2 3
>>>> 2 5
>>>> 2 7
>>>> 5 8
>>>> 5 6
>>>> 8 9
>>>> 8 1
>>>> end
>>>> tempfile main
>>>> save "`main'"
>>>> * make sure the data has no duplicates
>>>> isid source target
>>>>
>>>> rename source co
>>>> local i 0
>>>> local more 1
>>>> local isdone co
>>>> while `more' {
>>>> local ++i
>>>> rename target source
>>>> joinby source using "`main'", unmatched(master)
>>>> drop _merge
>>>> rename source level`i'
>>>> replace target = . if inlist(target,`isdone')
>>>> local isdone `isdone', level`i'
>>>> sort co level`i'
>>>> by co level`i': gen one = _n == 1 & !mi(level`i')
>>>> by co: egen connect`i' = total(one)
>>>> drop one
>>>> count if !mi(target)
>>>> local more = r(N)
>>>> }
>>>> drop target
>>>> sort co level*
>>>> order co level* connect*
>>>> list, sepby(co) noobs
>>>> * --------------------- end example -----------------------
>>>>
>>>> On Tue, Sep 3, 2013 at 6:15 PM, Michael Goodwin
>>>> <[email protected]> wrote:
>>>>> I think I've figured out the issue. Using this approach, when dropping
>>>>> values that are equal, you need to ensure that you aren't dropping
>>>>> null values. Below is the code to drop only matching non-null values.
>>>>> Following the loop are the final few lines to calculate centrality
>>>>> using the approach I mentioned in my initial post. Thanks for your
>>>>> help!
>>>>>
>>>>> * Create dataset containing all connections;
>>>>> rename Source company;
>>>>> local i 0;
>>>>> local more 1;
>>>>> while `more' {;
>>>>> local ++i;
>>>>> rename Target Source;
>>>>> joinby Source using `main', unmatched(master);
>>>>> drop _merge;
>>>>> rename Source level`i';
>>>>> sort company level`i';
>>>>> forv num = 1/`i' {;
>>>>> local count = `i'-`num';
>>>>> cap drop if level`i'==level`count' & level`i'!="";
>>>>> cap drop if company==level`count';
>>>>> };
>>>>> by company level`i': gen one = _n == 1 & !mi(level`i');
>>>>> by company: egen connect`i' = total(one);
>>>>> drop one;
>>>>> count if !mi(Target);
>>>>> local more = r(N);
>>>>> qui compress;
>>>>> };
>>>>>
>>>>> * Centrality;
>>>>> egen totalPath = rownonmiss(connect*), strok;
>>>>> local maxPath = totalPath;
>>>>> forv num = 1/`maxPath' {;
>>>>> gen connectCentrality`num' = connect`num'/`num';
>>>>> };
>>>>> egen centrality = rowtotal(connectCentrality*);
>>>>>
>>>>> On Tue, Sep 3, 2013 at 5:10 PM, Michael Goodwin
>>>>> <[email protected]> wrote:
>>>>>> Robert, this is very helpful and actually not so far from what I had
>>>>>> originally coded out.
>>>>>>
>>>>>> The only issue I am encountering now is that some of the networks
>>>>>> actually double back on themselves, so that the loop you've written
>>>>>> out continues on infinitely (or would if Stata didn't have observation
>>>>>> limits).
>>>>>>
>>>>>> I think the solution is to write code that drops any observations in
>>>>>> which the level`i' variable is equal to any of the preceding
>>>>>> level[`i'-1], level[`i'-2], etc. variables. Do you have any thoughts
>>>>>> on how to best accomplish that? My thinking was to create a loop that
>>>>>> compares the current level with each previous level and drops the
>>>>>> observation if any two values match. I have to use capture because
>>>>>> there is no level0. This doesn't seem to be working using my dataset.
>>>>>>
>>>>>>
>>>>>> * --------------------- begin example ---------------------
>>>>>> clear
>>>>>> input Source Target
>>>>>> 1 2
>>>>>> 1 5
>>>>>> 1 6
>>>>>> 1 9
>>>>>> 2 3
>>>>>> 2 5
>>>>>> 2 7
>>>>>> 5 8
>>>>>> 5 6
>>>>>> 8 9
>>>>>> end
>>>>>> tempfile main
>>>>>> save "`main'"
>>>>>> * make sure the data has no duplicates
>>>>>> isid Source Target
>>>>>>
>>>>>> rename Source co;
>>>>>> local i 0;
>>>>>> local more 1;
>>>>>> while `more' {;
>>>>>> local ++i;
>>>>>> rename Target Source;
>>>>>> joinby Source using `main', unmatched(master);
>>>>>> drop _merge;
>>>>>> rename Source level`i';
>>>>>> sort co level`i';
>>>>>> forv num = 1/`i' {;
>>>>>> local count = `i'-`num';
>>>>>> cap drop if level`i'==level`count';
>>>>>> cap drop if co==level`count';
>>>>>> };
>>>>>> by co level`i': gen one = _n == 1 & !mi(level`i');
>>>>>> by co: egen connect`i' = total(one);
>>>>>> drop one;
>>>>>> count if !mi(Target);
>>>>>> local more = r(N);
>>>>>> qui compress;
>>>>>> };
>>>>>> drop Target;
>>>>>> sort co level*;
>>>>>> order co level* connect*;
>>>>>> list, sepby(co) noobs;
>>>>>> * --------------------- end example -----------------------
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, Sep 1, 2013 at 11:44 AM, Robert Picard <[email protected]> wrote:
>>>>>>> There is indeed a problem in merging the list with itself
>>>>>>> as it leads to many-to-many merges and I have yet to see
>>>>>>> one case where an m:m merge is useful. You can however use
>>>>>>> -joinby- to perform what you intuitively would want -merge-
>>>>>>> to do in this case.
>>>>>>>
>>>>>>> * --------------------- begin example ---------------------
>>>>>>> clear
>>>>>>> input source target
>>>>>>> 1 2
>>>>>>> 1 5
>>>>>>> 1 6
>>>>>>> 1 9
>>>>>>> 2 3
>>>>>>> 2 5
>>>>>>> 2 7
>>>>>>> 5 8
>>>>>>> 5 6
>>>>>>> 8 9
>>>>>>> end
>>>>>>> tempfile main
>>>>>>> save "`main'"
>>>>>>> * make sure the data has no duplicates
>>>>>>> isid source target
>>>>>>>
>>>>>>> rename source co
>>>>>>> local i 0
>>>>>>> local more 1
>>>>>>> while `more' {
>>>>>>> local ++i
>>>>>>> rename target source
>>>>>>> joinby source using "`main'", unmatched(master)
>>>>>>> drop _merge
>>>>>>> rename source level`i'
>>>>>>> sort co level`i'
>>>>>>> by co level`i': gen one = _n == 1 & !mi(level`i')
>>>>>>> by co: egen connect`i' = total(one)
>>>>>>> drop one
>>>>>>> count if !mi(target)
>>>>>>> local more = r(N)
>>>>>>> }
>>>>>>> drop target
>>>>>>> sort co level*
>>>>>>> order co level* connect*
>>>>>>> list, sepby(co) noobs
>>>>>>> * --------------------- end example -----------------------
>>>>>>>
>>>>>>> Original message follows:
>>>>>>>
>>>>>>> st: Social Network Analysis shortest path centrality
>>>>>>>
>>>>>>> From  Michael Goodwin <[email protected]>
>>>>>>> To  [email protected]
>>>>>>> Subject  st: Social Network Analysis shortest path centrality
>>>>>>> Date  Fri, 30 Aug 2013 13:53:16 -0400
>>>>>>>  Hi,
>>>>>>>
>>>>>>> I am trying to do some light social network analysis on a dataset
>>>>>>> containing a list of edges. I have the dataset organized such that
>>>>>>> there are two variables, Source and Target. Bot the Source and Target
>>>>>>> are companies, and the connection between indicates that an employee
>>>>>>> from Source went on to found Target. The relationship between these
>>>>>>> two variables is indeterminate (i.e. m:m) and although the variables
>>>>>>> start as strings, I've converted them to numeric values using encode
>>>>>>> (and ensured
>>>>>>> that the numeric values in both Target and Source are equal to one another).
>>>>>>>
>>>>>>> I am attempting to determine the number of first, second, third,...,n
>>>>>>> degree connections that each Source has. For example if an employees
>>>>>>> from Company A went on to found Company B and then employees from
>>>>>>> Company B went on to found Companies C and D, Company A would have 1
>>>>>>> first degree connection and 2 second degree connections.
>>>>>>>
>>>>>>> My goal is to create something similar to a shortest path measurement
>>>>>>> whereby a first degree connection is equal to 1, a second degree
>>>>>>> connection 1/2, a third degree connection 1/3, and so forth. In the
>>>>>>> above example, Company A's score would be (1/1)+(2/2) or 2. I believe
>>>>>>> this is a closeness/shortest path centrality approach, but I may be
>>>>>>> mistaken (and would love to be corrected!).
>>>>>>>
>>>>>>> After making the connections symmetric (i.e. all pairs are present as
>>>>>>> both inbound and outbound connections), I've attempted three
>>>>>>> approaches, all without success:
>>>>>>>
>>>>>>> 1. Use netsis and netsummarize. Neither the adjacency nor closeness
>>>>>>> calculations seems to get me to the right answer. I don't have
>>>>>>> experience using mata, but it appears that the matrix generate by
>>>>>>> netsis doesn't reflect the appropriate connections (i.e. a connection
>>>>>>> in the original edge list is not represented by a 1 in the matrix)
>>>>>>>
>>>>>>> netsis Source Target, measure(adjacency) name(A, replace);
>>>>>>> netsummarize A/(rows(A)-1), generate(degree) statistic(rowsum);
>>>>>>>
>>>>>>> netsis Source Target, measure(distance) name(D, replace)
>>>>>>> netsummarize (rows(D)-1):/rowsum(D), generate(closeness) statistic(rowsum)
>>>>>>>
>>>>>>> 2. Create a matrix data structure in Stata and use centpow. I keep
>>>>>>> receiving an error noting that the matrix is not symmetrical. I've
>>>>>>> checked and made sure that the dataset is a perfect square (it has 707
>>>>>>> observations and 707 variables) and that a connection between Company
>>>>>>> A and Company B is also represented by a connection between Company B
>>>>>>> and Company A. Does centpow require the data to actually be in a mata
>>>>>>> matrix?
>>>>>>>
>>>>>>> use ".\dta\\${connection}_connectionIDSymmetric${typeInt}.dta";
>>>>>>> contract targetID sourceID;
>>>>>>> reshape wide _freq , i(targetID) j(sourceID);
>>>>>>> qui foreach v of var _freq* {;
>>>>>>> replace `v' = 0 if mi(`v');
>>>>>>> };
>>>>>>> drop targetID;
>>>>>>> save ".\dta\\${connection}_adjacencyMatrix${typeInt}.dta", replace;
>>>>>>> centpow ".\dta\\${connection}_adjacencyMatrix${typeInt}.dta";
>>>>>>>
>>>>>>> 3. Start with the edgelist, and merge it with itself, changing the
>>>>>>> Target and Source variable names such that Target becomes Source for
>>>>>>> the second degree connection and so forth (I think this is
>>>>>>> demonstrably not the solution, so I won't elaborate further).
>>>>>>>
>>>>>>> I think this either has a simple solution that I can't think of
>>>>>>> involving the edge list, or will involve a more intensive solution
>>>>>>> using mata. If anyone has experience or could point me in the
>>>>>>> direction of content (Statalist has limited SNA resources), that would
>>>>>>> be a huge help.
>>>>>>>
>>>>>>> Here are some of the resources I've already reviewed:
>>>>>>> http://www.rensecorten.org/index.php/research/social-network-analysis-with-stata/
>>>>>>> https://sites.google.com/site/statagraphlibrary/netgen111
>>>>>>> http://www.ats.ucla.edu/stat/sna/sna_stata.htm
>>>>>>>
>>>>>>> Thanks in advance.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Mike
>>>>>>> *
>>>>>>> *   For searches and help try:
>>>>>>> *   http://www.stata.com/help.cgi?search
>>>>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>>>>>> *   http://www.ats.ucla.edu/stat/stata/
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Mike Goodwin
>>>>>> Senior Associate, Endeavor Insight
>>>>>> 900 Broadway | Suite 301 | New York, NY 10003
>>>>>> +1 646 368 6354 | Skype: michael.p.goodwin
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Mike Goodwin
>>>>> Senior Associate, Endeavor Insight
>>>>> 900 Broadway | Suite 301 | New York, NY 10003
>>>>> +1 646 368 6354 | Skype: michael.p.goodwin
>>>>> *
>>>>> *   For searches and help try:
>>>>> *   http://www.stata.com/help.cgi?search
>>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>>>> *   http://www.ats.ucla.edu/stat/stata/
>>>> *
>>>> *   For searches and help try:
>>>> *   http://www.stata.com/help.cgi?search
>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>>> *   http://www.ats.ucla.edu/stat/stata/
>>>
>>>
>>>
>>> --
>>> Mike Goodwin
>>> Senior Associate, Endeavor Insight
>>> 900 Broadway | Suite 301 | New York, NY 10003
>>> +1 646 368 6354 | Skype: michael.p.goodwin
>>
>>
>>
>> --
>> Mike Goodwin
>> Senior Associate, Endeavor Insight
>> 900 Broadway | Suite 301 | New York, NY 10003
>> +1 646 368 6354 | Skype: michael.p.goodwin
>
>
>
> --
> Mike Goodwin
> Senior Associate, Endeavor Insight
> 900 Broadway | Suite 301 | New York, NY 10003
> +1 646 368 6354 | Skype: michael.p.goodwin
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index