The one-hop neighborhood of Paul Erdös
Souvik Bhattacherjee
Introduction
Paul Erdös (26 March 1913 – 20 September 1996) was a prolific Hungarian mathematician of the 20th
century who spent a significant portion of his life living out of a suitcase, writing papers with those
colleagues willing to give him room and board. He published more papers than any other mathematician
in history.
The Erdös number was created by his fellow mathematicians as a humorous tribute to his
enormous output as one of the most prolific modern writers of mathematical papers [1]. An Erdös
number of 1 is assigned to anyone who has published at least one mathematical paper with Erdös.
Similarly, a joint publication with someone who has an Erdös number of 1 yields an Erdös number
of 2. In general, a person's Erdös number is one more than the smallest Erdös number among their
coauthors. Erdös himself has the number 0. The Erdös number has gained prominence in scientific
circles as an informal measure of a mathematician's standing.
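The definition above amounts to a shortest-path distance in the coauthorship graph, so it can be computed with a breadth-first search. The following is a minimal sketch on a hypothetical toy graph; the author labels and the function name erdos_numbers are illustrative assumptions, not part of the Erdös Number Project data.

```python
from collections import deque

def erdos_numbers(coauthors, source="Erdos"):
    """Breadth-first search over a coauthorship graph.

    coauthors maps each author to the set of authors they have published with.
    Returns author -> Erdös number for every author reachable from source.
    """
    numbers = {source: 0}
    queue = deque([source])
    while queue:
        author = queue.popleft()
        for other in coauthors.get(author, ()):
            if other not in numbers:  # first visit gives the shortest distance
                numbers[other] = numbers[author] + 1
                queue.append(other)
    return numbers

# Hypothetical toy graph: A and B wrote with Erdös, C wrote with A, D with C.
toy = {
    "Erdos": {"A", "B"},
    "A": {"Erdos", "C"},
    "B": {"Erdos"},
    "C": {"A", "D"},
    "D": {"C"},
    "E": set(),  # never published with anyone in the graph
}
numbers = erdos_numbers(toy)
```

Authors who cannot be reached from Erdös, such as E in the toy graph, simply do not appear in the result.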
In this project, we try to understand the collaboration network of authors having an Erdös number of 1
using NodeXL, a popular visualization tool for network analysis.
Dataset & Preprocessing
We obtained the data for this project from the Erdös Number Project [2]. Two datasets were used,
as described below:
1. Erdos0 - This dataset lists all authors who have written a joint paper with Paul Erdös (i.e., who
have Erdös number 1). It is in alphabetical order and shows the date of first collaboration, as
well as the number of papers that each person has written with Erdös. There are currently 511
names on this list [3].
2. Erdos1graph - This dataset contains the adjacency lists for the induced subgraph of the
collaboration graph on all Erdös coauthors, as of 2007. In other words, its vertices are people
with Erdös number 1, and two vertices are joined by an edge if they have published a joint paper
(with or without other collaborators). Paul Erdös himself and people with Erdös number 2 are not
included. It also gives, for each author in the list, the number of Erdös number 2 authors that
the author has collaborated with [4].
We preprocessed the Erdos0 list by inserting a 1 wherever the entry for the number of joint
publications between an author and Erdös was missing. For Erdos1graph, the adjacency lists had to
be converted to an undirected edge list with no duplicate edges (equivalent to keeping only the
lower triangle of the adjacency matrix). The author names and related information in each row also
had to be parsed separately and joined with the author list from the Erdos0 dataset.
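The two preprocessing steps can be sketched as follows. This is an illustration only: it assumes the files have already been parsed into in-memory Python structures, and the author labels, counts and function names are hypothetical rather than taken from the actual Erdos0 or Erdos1graph files.

```python
def to_edge_set(adjacency):
    """Collapse adjacency lists into a set of undirected edges.

    Each edge is stored once as a sorted (a, b) tuple, which plays the same
    role as keeping only the lower triangle of the adjacency matrix.
    """
    edges = set()
    for author, neighbours in adjacency.items():
        for other in neighbours:
            if other != author:  # ignore self-loops
                edges.add(tuple(sorted((author, other))))
    return edges

def fill_missing_counts(records, default=1):
    """Replace a missing joint-paper count (None) with the default of 1."""
    return [(name, default if count is None else count)
            for name, count in records]

# Hypothetical parsed input: every edge appears in both endpoints' lists.
adjacency = {
    "AUTHOR A": ["AUTHOR B", "AUTHOR C"],
    "AUTHOR B": ["AUTHOR A"],
    "AUTHOR C": ["AUTHOR A"],
    "AUTHOR D": [],
}
edges = to_edge_set(adjacency)
counts = fill_missing_counts([("AUTHOR A", 4), ("AUTHOR D", None)])
```

Storing each edge as a sorted tuple in a set removes the duplicate that arises when an edge is listed under both of its endpoints.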
Results and Analysis
We analyze the edge list of the Erdös #1 coauthors. There are 511 vertices (excluding Erdös) and
3208 edges in total. In some analyses we also include Erdös where we felt it was necessary to
enhance the visualization.
Headline 1: Erdös's coauthors fall into two types, those who also wrote papers
with each other and those who did not
Figure 1: Graph of Erdös #1 coauthors (grouped by connected components)
We group the Erdös #1 coauthors by connected component; the result is shown in Figure 1. There are
42 connected components, the largest of which has 466 authors, or 91.19% of the authors in this
network. The remaining authors, as Figure 1 shows, are either isolated or form a component of size
at most 2. The diameter of the largest connected component is 10. The large number of isolated
authors prompts us to explore their properties further (presented later).
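NodeXL performs this grouping internally. As a rough illustration of what the grouping computes, the sketch below finds connected components with a depth-first search on a hypothetical toy graph and then checks the percentage quoted above; the node labels and function name are assumptions made for the example.

```python
def connected_components(nodes, edges):
    """Return the connected components of an undirected graph, largest first."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = []
        while stack:  # iterative depth-first search
            node = stack.pop()
            component.append(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Hypothetical toy graph: one large component, one pair, one isolated node.
nodes = ["a", "b", "c", "d", "e", "f", "g"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "f")]
components = connected_components(nodes, edges)
largest_share = 100 * len(components[0]) / len(nodes)

# The share reported in the text: 466 of 511 authors.
reported_share = round(100 * 466 / 511, 2)
```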
We analyze the graph further by grouping the vertices into clique motifs of size 4 or more. All of
these cliques must lie within the large connected component, because every other component has
fewer than 4 vertices. We found that the largest clique is a 7-clique. In addition, this component
contains one 6-clique, four 5-cliques and nineteen 4-cliques.
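How NodeXL detects and counts clique motifs is not described here, so the following is only a sketch of one standard way to enumerate cliques: the Bron-Kerbosch algorithm with pivoting, which lists maximal cliques. The toy graph is hypothetical, and counts produced this way need not match NodeXL's motif grouping.

```python
from collections import Counter

def maximal_cliques(adj):
    """Yield every maximal clique (Bron-Kerbosch with pivoting).

    adj maps each vertex to the set of its neighbours (no self-loops).
    """
    def expand(r, p, x):
        if not p and not x:
            yield r  # r cannot be extended, so it is maximal
            return
        # Pick the pivot with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    yield from expand(set(), set(adj), set())

# Hypothetical toy graph: a 4-clique, a triangle and a pendant edge.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"),
         ("c", "d"), ("d", "e"), ("d", "f"), ("e", "f"), ("f", "g")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

cliques = list(maximal_cliques(adj))
size_counts = Counter(len(c) for c in cliques)
```

Restricting size_counts to sizes of 4 or more gives the kind of tally reported above.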
Figure 2: Graph of Erdös #1 coauthors (grouped by clique motif of size 4 or more)
Headline 2: a) Frank Harary* and Noga Alon coauthored actively with both
Erdös #1 and Erdös #2 authors, whereas Peter Salamon coauthored only with
Erdös #2 authors
We lay out the graph of Erdös #1 coauthors in Figure 3, with the X-axis representing the number of
Erdös #1 coauthors and the Y-axis representing the number of Erdös #2 coauthors. The two authors at
the extreme right of the X-axis (circled) are also the ones positioned highest along the Y-axis.
They are Frank Harary* (44 Erdös #1 coauthors and 271 Erdös #2 coauthors) and Noga Alon (51 Erdös #1
coauthors and 228 Erdös #2 coauthors). We also notice Peter Salamon (circled) at the extreme left of
the X-axis, who stands out among the other authors in that region with 113 Erdös #2 coauthors.
b) Lee Albert Rubel* plays a central role in this network with a comparatively
lower number of Erdös #1 coauthors
In the same graph, we size the vertices by degree and color them by betweenness centrality. A blue
node (circled in red) near the middle of the X-axis catches our attention. To observe the node in
detail, we use dynamic filtering to retain the 10 nodes with the highest betweenness centrality
(Figure 4). This node is particularly interesting because it combines a comparatively low degree
with a high betweenness centrality. On closer inspection, we found that this author has the lowest
degree among the top-10 authors but ranks 4th in betweenness centrality. He coauthored with 3 of the
most prolific Erdös #1 authors: Ernst Gabor Straus (rank 3), Carl Bernard Pomerance (rank 5) and
Zoltan Furedi (rank 8), who in turn collaborated with the top coauthors in this graph. Thus, even
with a comparatively low degree of 12 (the next-lowest degree among the top 10 is 26), this node
plays a central role in the coauthor network.
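To illustrate how a low-degree vertex can still have high betweenness centrality, the sketch below implements Brandes' algorithm for unweighted, undirected graphs and applies it to a hypothetical toy graph. It is not NodeXL's implementation, and the scores are unnormalized; NodeXL's values may be scaled differently.

```python
from collections import deque

def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes' algorithm)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, farthest vertices first.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: c / 2 for v, c in score.items()}

# Hypothetical toy graph: two triangles joined through the bridge vertex "d".
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("e", "g"), ("f", "g")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

centrality = betweenness(adj)
degree = {v: len(nb) for v, nb in adj.items()}
```

In this toy graph the bridge vertex d has degree 2, lower than c and e, yet every shortest path between the two triangles passes through it, so it has the highest betweenness.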
Figure 3: Graph of Erdös #1 coauthors (X-axis: # of Erdös #1 coauthors, Y-axis: # of Erdös #2
coauthors)
Figure 4: Graph of Erdös #1 coauthors (Top-10 ordered by betweenness centrality)
Headline 3: Most of the Erdös #1 authors who did not collaborate with any
other Erdös #1 author also collaborated less with Erdös #2 authors
Figure 5: Graph of Erdös and the Erdös #1 coauthors who did not coauthor any paper with another
Erdös #1 author
We construct the graph of isolated authors in the Erdös #1 collaboration graph, keeping Erdös in
this case so that the graph has edges (Figure 5). The vertices (representing authors) are labeled
by name, and each edge is labeled by the year in which the corresponding author first published a
paper with Erdös. The edge width is determined by the total number of publications that the author
has with Erdös, with 3 as the maximum edge weight in this graph. The size of a vertex represents
the number of Erdös #2 coauthors that the author has collaborated with. The color of a vertex
indicates whether the author is living (blue) or deceased (orange). We also order the authors
(manually) by the year in which they first published a paper with Erdös, with the year increasing
in a clockwise fashion.
We observe from Figure 5 that most of the vertices are very small, indicating that these isolated
authors also collaborated little with Erdös #2 coauthors. The notable exceptions are Peter Salamon,
Solomon Marcus and Alfred Tarski*, who collaborated with 113, 33 and 26 Erdös #2 authors,
respectively. This visualization also helps us to identify the earliest collaborator of Erdös in
this graph who is still living: Joseph Lehner.
Headline 4: Birds of a feather flock together: The top Erdös #1 collaborators
also collaborated highly among themselves
Figure 6: Graph of Erdös #1 coauthors having 30 or more collaborators (among Erdös #1 authors)
The collaboration graph of Erdös #1 coauthors having 30 or more collaborators is presented in
Figure 6. We observe that this graph is densely connected, indicating that the authors in this
graph also collaborated highly with each other. Figure 7 clusters these 14 authors using the
Girvan-Newman clustering algorithm. We observe a 6-clique and a 4-clique, which further
establishes the high connectivity of this network.
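The selection behind Figure 6 can be sketched as a degree threshold followed by an induced subgraph, with edge density as one simple way to quantify how tightly the selected authors are connected. The toy graph, the threshold of 4 and the function names below are hypothetical, and the sketch does not implement the Girvan-Newman clustering used for Figure 7.

```python
def induced_subgraph(adj, keep):
    """Restrict an adjacency map to the vertices in keep."""
    keep = set(keep)
    return {v: adj[v] & keep for v in keep}

def density(adj):
    """Fraction of possible undirected edges that are present."""
    n = len(adj)
    if n < 2:
        return 0.0
    m = sum(len(nb) for nb in adj.values()) / 2
    return m / (n * (n - 1) / 2)

# Hypothetical toy graph: three hubs that all know each other,
# each with two extra collaborators of its own.
edges = [("H1", "H2"), ("H1", "H3"), ("H2", "H3"),
         ("H1", "a"), ("H1", "b"), ("H2", "c"),
         ("H2", "d"), ("H3", "e"), ("H3", "f")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

threshold = 4  # stands in for the "30 or more collaborators" cut-off
top = [v for v, nb in adj.items() if len(nb) >= threshold]
core = induced_subgraph(adj, top)

overall_density = density(adj)
core_density = density(core)
```

A core density well above the overall density is the numerical counterpart of the observation that the top collaborators also collaborate with each other.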
Figure 7: Graph of Erdös #1 coauthors having 30 or more collaborators (clustered using Girvan
Newman clustering algorithm)
Headline 5: Erdös #1 authors having high collaborations with Erdös #2 authors
did not collaborate highly among themselves
We construct the graph of Erdös #1 authors who have collaborated with 100 or more Erdös #2 authors
(Figure 8). The size of a node represents the number of Erdös #2 coauthors that the author has. The
coloring is based on the actual degree of the node in the Erdös #1 collaboration graph (Figure 1).
The observation here is that the graph in Figure 8 is much less densely connected than the one in
Figure 6. This implies that these authors do not collaborate highly among themselves. In fact,
Saharon Shelah has no collaborations within this graph, although he has collaborated with 15 other
Erdös #1 authors. Peter Salamon did not collaborate with any of the Erdös #1 coauthors, so his
isolation here is no surprise.
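The distinction drawn above, between authors who are isolated only within the filtered graph and authors who are isolated in the full graph, can be sketched as follows. The labels, counts and the threshold applied to them are hypothetical stand-ins for the real vertex attributes.

```python
def induced_subgraph(adj, keep):
    """Restrict an adjacency map to the vertices in keep."""
    keep = set(keep)
    return {v: adj[v] & keep for v in keep}

# Hypothetical full collaboration graph among Erdös #1 authors.
adj = {
    "P": {"Q", "T"},
    "Q": {"P", "T"},
    "R": {"T", "U"},
    "S": set(),
    "T": {"P", "Q", "R"},
    "U": {"R"},
}
# Hypothetical vertex attribute: number of Erdös #2 coauthors.
erdos2_count = {"P": 250, "Q": 180, "R": 120, "S": 110, "T": 5, "U": 3}

selected = [v for v, c in erdos2_count.items() if c >= 100]
filtered = induced_subgraph(adj, selected)

# Vertices with no edge inside the filtered graph, mapped to their
# degree in the full graph.
isolated = {v: len(adj[v]) for v, nb in filtered.items() if not nb}
```

Here R is isolated only after filtering (its degree in the full graph is 2), while S has no collaborators in the full graph either.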
Figure 8: Graph of Erdös #1 coauthors having 100 or more Erdös #2 collaborators
NodeXL Critiques
NodeXL is a great tool for handling graphs, especially because it is integrated with Microsoft
Excel. I have used Pajek (another network analysis tool) before but did not find it to be as
flexible as NodeXL. The features of NodeXL that interest me the most are the Grouping options,
especially the clusters and the motifs. The Graph Metrics and Autofill options were equally useful.
However, there are a few things that I feel need more attention (as I found out while using
NodeXL); they are listed below:
1. The user needs to handle isolated nodes manually for displaying them. If there are a lot of
isolated nodes in the graph it becomes problematic.
2. The legends occupy a large portion of the screen below the graph display (see Figure 6 and
Figure 8), which wastes space.
3. The dynamic filter does not update the attributes of the graph dynamically. Consider, for
example, a large graph that is filtered on some vertex attribute (say, betweenness centrality)
and in which the vertex size depends on the degree of the vertex. The vertices in the filtered
graph might now have low degrees, but their sizes still reflect their original degrees.
In some cases the original degree is what is wanted, but the user could be offered an option
to have the vertex properties update dynamically as well.
4. It would be useful if the edges in a graph could be laid out in some order in a Star layout
(which is one of the most common layouts). Although the Fruchterman-Reingold layout does give
the graph a Star shape, it does not offer the option of ordering the edges. This idea comes from
the clock glyph designs studied earlier in this course. In this case, I had to lay out the edges
manually (Figure 5).
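The ordered Star layout suggested in item 4 could be computed along the following lines. This is a sketch of the idea only, not a NodeXL feature: the function name, the author labels and the years are hypothetical, and it assumes a coordinate system with the y-axis pointing up and the central vertex at the origin.

```python
import math

def clock_layout(first_year, radius=1.0):
    """Place vertices on a circle, clockwise from 12 o'clock, ordered by year.

    first_year maps each vertex to the year of its first joint paper.
    Returns vertex -> (x, y); the centre (0, 0) is left for the hub vertex.
    """
    ordered = sorted(first_year, key=lambda name: (first_year[name], name))
    n = len(ordered)
    positions = {}
    for i, name in enumerate(ordered):
        # Start at the top (pi / 2) and subtract to move clockwise.
        angle = math.pi / 2 - 2 * math.pi * i / n
        positions[name] = (radius * math.cos(angle), radius * math.sin(angle))
    return positions

# Hypothetical first-collaboration years for four isolated coauthors.
first_year = {"W": 1980, "X": 1950, "Y": 1970, "Z": 1960}
positions = clock_layout(first_year)
```

With four vertices, the earliest year lands at 12 o'clock and the later ones follow at 3, 6 and 9 o'clock. Many screen coordinate systems have the y-axis pointing down, in which case the sign of the y component would need to be flipped.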
References
1. Erdös Number. http://en.wikipedia.org/wiki/Erd%C5%91s_number
2. The Erdös Number Project. http://www.oakland.edu/enp/
3. Erdos0 dataset. https://files.oakland.edu/users/grossman/enp/Erdos0.html
4. Erdos1graph dataset. https://files.oakland.edu/users/grossman/enp/erdos1graph.html