Friday, 16 September 2011

Using Gephi to Visualise our Sites and Users

Top50 Network Chart


We took as our starting point the talk by Tony Hirst on visualisation, in particular the slides produced using Gephi.

The problem with this approach was that all the Gephi examples mapped a network of a single kind of item – e.g. a network of Twitter users. We wanted to visualise a network of sites and their users – a bipartite network. I found a reference to another tool with some capabilities for displaying and analysing bipartite networks: ORA (developed at Carnegie Mellon, part of the CASOS (sic) package), but the comment also said that 'the visualizations are simply not on par with those of Gephi'. A quick look at ORA did not make me feel that it would be useful for our needs.

Instead I decided to try using the number of common users to link individual sites and to produce a Gephi visualisation where the sites were the nodes with their size representing the number of active users of the site. The thickness of the connecting lines between the sites would be related to the number of common users they had.

To do this, I developed a Perl script to determine the number of shared users between each of a list of sites and produce the nodes and edges data files that could be imported into Gephi. The detailed steps taken to produce network visualisations using this data will be reported in a separate blog post.
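As a rough illustration of the idea (not the Perl script itself), the shared-user calculation can be sketched in Python. The input format (a list of site/user pairs), the function names, the sample data and the "Users" column are all assumptions made for this example; the Id/Label and Source/Target/Type/Weight headers are the column names Gephi's spreadsheet import normally looks for.

```python
import csv
from collections import defaultdict
from itertools import combinations


def shared_user_network(visits):
    """Build node and edge rows from (site, user) pairs."""
    users_by_site = defaultdict(set)
    for site, user in visits:
        users_by_site[site].add(user)

    # One node per site; the third column is its number of distinct users.
    nodes = [(site, site, len(users))
             for site, users in sorted(users_by_site.items())]

    # One undirected edge per pair of sites with at least one common user,
    # weighted by the number of users they share.
    edges = []
    for a, b in combinations(sorted(users_by_site), 2):
        shared = len(users_by_site[a] & users_by_site[b])
        if shared:
            edges.append((a, b, "Undirected", shared))
    return nodes, edges


def write_gephi_csv(nodes, edges, nodes_path="nodes.csv", edges_path="edges.csv"):
    """Write the two tables as CSV files (file names are placeholders)."""
    with open(nodes_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Label", "Users"])
        writer.writerows(nodes)
    with open(edges_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Source", "Target", "Type", "Weight"])
        writer.writerows(edges)


# Hypothetical sample data: (site, user) pairs.
visits = [
    ("college", "u1"), ("college", "u2"), ("college", "u3"),
    ("computing", "u1"), ("computing", "u2"),
    ("library", "u3"),
    ("catering", "u4"),
]
nodes, edges = shared_user_network(visits)
```

Calling write_gephi_csv(nodes, edges) would then produce the two files. In this sample, "catering" shares no users with any other site, so it appears as a node with no edges.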

College Chart

The visualisation above shows the sites used within one of our colleges, linked by the number of users they share. The size of each circle increases with the number of users of that site. The biggest site is the College main site, and the next biggest is Computing, which has many shared users.

The Modularity Algorithm within Gephi was used to partition the sites into clusters where the sites are tightly associated with each other and more loosely linked to other sites. We can see that the green coloured sites could be categorised as 'college services' (IT, catering etc), the blue sites are mostly admissions related and the red sites include many subject related sites.

A similar process was applied to one of our larger departments (Illustration 1). This showed clustering and shared usage around their first year sites as well as heavy usage of their library/resource site. We will be able to use this visualisation as a resource to help us quickly identify important sites within this department when we upgrade their VLE.

This procedure can be followed for either a list of users or a list of sites to produce similar visualisations, which we believe will help us in future to understand more about how our sites are related and linked.

1 comment:

  1. Yeah - Gephi in the wild;-)

    If you want to plot a bipartite network, it's easy enough... (an example of this in the twitter context would be to show list membership, where the two node types are Twitter user and Twitter list. For example: )

    I think the route you have taken is much cleaner though, using the common user count to weight the edges between nodes (this also has the effect of "anonymising" the data in terms of users; however, it loses out in terms of showing which sites individual users visited).

    To generate the network, in the simplest case you can just create a simple two column CSV file with columns: siteID, userID

    I think you can just load this in from the file menu and then you'll be presented with a preview view of the network...

    There are more elaborate file formats possible for getting data into Gephi. One I use is .gdf, which separates nodes and edges (if a node identifier used in an edge description has not been declared as a node explicitly, Gephi will create it). The nodes can be defined with name (the node identifier) and label (the label that gets displayed by default) attributes, followed by an arbitrary number of other attributes.
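A minimal sketch of what writing such a .gdf file for the bipartite site/user case might look like, in Python. The function name, the "site_"/"user_" identifier prefixes and the extra "kind" attribute are assumptions for illustration; the nodedef>/edgedef> header lines follow the GDF layout described above. Identifiers containing commas or spaces would need quoting, which this sketch does not attempt.

```python
def to_gdf(visits):
    """Return GDF text for a bipartite network built from (site, user) pairs."""
    sites = sorted({site for site, _ in visits})
    users = sorted({user for _, user in visits})

    # Node section: name (identifier), label, then one extra attribute
    # ("kind", a hypothetical name) recording which side of the network
    # the node belongs to.
    lines = ["nodedef>name VARCHAR,label VARCHAR,kind VARCHAR"]
    lines += [f"site_{s},{s},site" for s in sites]
    lines += [f"user_{u},{u},user" for u in users]

    # Edge section: one edge per distinct (site, user) pair.
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR")
    lines += [f"site_{s},user_{u}" for s, u in sorted(set(visits))]
    return "\n".join(lines) + "\n"


# Hypothetical sample data; the duplicate pair is collapsed to one edge.
gdf_text = to_gdf([("college", "u1"), ("computing", "u1"), ("college", "u1")])
```

The resulting text can be saved with a .gdf extension and opened in Gephi; the "kind" column could then be used to colour or partition sites and users separately.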

    One of the things that Gephi doesn't do is handle multiple edges between two nodes, even in the case of directed edges. If you want to graph traffic going from node A to node B, and separately traffic from B to A, a workaround I've used in the past is to define interstitial nodes (e.g. AtoB and BtoA) and then define a new graph where directed edges from A to B are represented as A-AtoB-B and those from B to A as B-BtoA-A. (In the layout, it makes sense to size the interstitial nodes as size 0.) But it's a faff and it messes with network stats...
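The interstitial-node workaround described above could be sketched like this in Python. The function name, the (source, target, weight) tuple layout and the size values of 1 and 0 are assumptions for the example; it also assumes no real node already has a name of the form "AtoB".

```python
def add_interstitial_nodes(directed_edges):
    """Rewrite each directed edge (src, dst, weight) as src -> srctodst -> dst."""
    nodes = {}   # node id -> display size
    edges = []   # (source, target, weight) triples in the new graph
    for src, dst, weight in directed_edges:
        mid = f"{src}to{dst}"
        nodes.setdefault(src, 1)   # real nodes keep a visible size
        nodes.setdefault(dst, 1)
        nodes[mid] = 0             # interstitial nodes are sized 0
        edges.append((src, mid, weight))
        edges.append((mid, dst, weight))
    return nodes, edges


# Traffic from A to B and, separately, from B to A.
new_nodes, new_edges = add_interstitial_nodes([("A", "B", 5), ("B", "A", 2)])
```

Each original edge becomes two edges carrying the same weight, which is why statistics such as degree and path length no longer describe the original network.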

    Finally, if you need to work with graphs programmatically, I use the NetworkX Python library. (On the to-do list is to look at the network handling libraries in R.)