Tuesday 1 November 2011

'Final' project post

This is the final official post. You may well hear more from us, but this puts a ribbon around the funded project. Below we answer some of JISC's prompts (in green):


Description of the things you have produced and links to each item.

The blog posts contain many small items. We could list them, but they would distract from the big item, which is the dataset itself, available here (approx 4GB).

Next steps

One of the fundamental difficulties we encountered was that statisticians were difficult to recruit. We understood how to analyse data to discover whether a particular hypothesis was supported by it, but we had also anticipated being able to detect unexpected usage patterns and to determine what had driven them. We managed to establish that there is reason to suppose suitable statistical techniques exist, but we did not find suitably experienced statisticians with an interest in our data within the timescale of the project. BTW the statistical field we identified was credit card fraud detection, in which the interest is the inverse of ours - they identify patterns in spending behaviour in order to spot outliers, whereas we are interested in analysing the dominant pattern. I think it would be good to encourage data mining centres to explore this further, or to fund an investigation of this issue over a longer time frame.

What can other institutions do to benefit from your work

We set out to focus on analysing data, but our most useful lessons came from our success in releasing open data:
  • Put good privacy terms on every site that collects personal data and ensure that possible publication of anonymised data is anticipated in those terms
  • Take a well prepared anonymisation case to legal services
  • Choose the licence first and determine what data can be released under that licence
  • Don’t forget your dataset may have IPR in its format

Most significant lessons (no more than 3, in bullet point form)

On analysing data:
  • Looking for patterns you suspect and confirming or disproving them is much easier than asking what information the data contains
  • The visualisation technique is the servant of the story the data tells. In other words, visualisation tools don’t reveal stories, they tell stories
  • Visualisations tend to be dominated by data collection artefacts for many early iterations of visualising (e.g. most of New York is in New Jersey, so NY population can look small)
  • Complex data is hard to analyse; allow time and take a phased approach

Addressing the big issues:

I would say we had a smooth path to releasing data compared to accounts from other projects. I attribute this to preparation. That is, we went to the legal office with answers - asking for confirmation that we had done enough - not questions asking what we should do. We were also lucky that the privacy statement we put in place 5 years ago was adequate for this purpose. Getting privacy statements right is important and we would have been better placed if we had collected acknowledgement/acceptance of terms.

More details on how you've addressed anonymisation

The principal technique for anonymisation was hashing of sensitive fields using a ‘salted’ SHA-1 algorithm. SHA-1 turns a data item such as a CRSID into a string such as ‘ef479946c02076c9c25b34a38a80cf22d0ecb9cb‘. On its own, this technique would be vulnerable to a ‘dictionary attack’, where someone with a list of CRSIDs could run them through the published SHA-1 algorithm and match the resulting strings to our dataset. To prevent this we also fed a secret key into the algorithm and destroyed the key afterwards. This ensured that nobody could recreate the hashed strings and decode the data.
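
For illustration, here is a minimal Perl sketch of this kind of salted hashing. The key value and the function name are invented for the example; the real anonymisation script differs.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Digest::SHA qw(sha1_hex);

    # The secret key ('salt') is generated once, mixed into every hash,
    # and destroyed after the anonymisation run.
    my $secret_key = 'replace-with-a-long-random-secret';   # placeholder value

    # Turn a sensitive field (e.g. a CRSID) into an irreversible 40-character token.
    sub anonymise {
        my ($value) = @_;
        return sha1_hex($secret_key . $value);
    }

    print anonymise('spqr1'), "\n";   # always the same token for the same input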

What algorithms or processing techniques have you used
  • Salted SHA-1
  • Perl code for ‘mixing in’ reference data to make analysis easier (e.g. a student’s identity is not revealed, but their year of study can be) – see the sketch after this list
  • Pivot Tables, Graphing tools and Gephi for analysis
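
As a rough sketch of the ‘mixing in’ idea, here is a cut-down Perl example. The file names and column layouts are invented for illustration and do not match the real tool.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical layouts: year_of_study.csv is "hashed_user,year",
    # events.csv is "hashed_user,event_type,timestamp".
    my %year_of_study;
    open my $ref_fh, '<', 'year_of_study.csv' or die "year_of_study.csv: $!";
    while (<$ref_fh>) {
        chomp;
        my ($user_hash, $year) = split /,/;
        $year_of_study{$user_hash} = $year;
    }
    close $ref_fh;

    # Annotate each event with the year of study rather than any identity.
    open my $ev_fh, '<', 'events.csv' or die "events.csv: $!";
    while (<$ev_fh>) {
        chomp;
        my ($user_hash, $event_type, $timestamp) = split /,/;
        my $year = $year_of_study{$user_hash} // 'unknown';
        print join(',', $event_type, $timestamp, $year), "\n";
    }
    close $ev_fh;
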

Friday 28 October 2011

Legal clearance for releasing personal data


I was involved in getting the legal office to approve our release of data. This post summarises some of the key aspects of the conversation:

Q.  Was a data collection notice used which indicated this possible use of the data?  
A. Users of CamTools have always [since the service first went online] had the option to open the privacy policy, which says "Logs are used to create summary statistics which may be made publicly available. Summary statistics do not include personal data." Although it would have been courteous to email users telling them that we are going to anonymise and release the data (users are not required to read the privacy policy, which is not prominently displayed), it was felt we did not have to do this.
The personal information concerned includes CRSids but no sensitive personal data. It was therefore deemed that we did not need consent to anonymise it (i.e. to process the data). It is not completely clear whether anonymising counts as processing, but it is safest to assume it does. The personal information was not provided on a confidential basis, so there should be no issue on that score.

Conclusion. Be rigorous about having a privacy policy for systems that collect personal data and anticipate public use of the data. We will probably move to actively collecting agreement to terms and conditions (including privacy policy) with click-through pages in future.

Q. Is the anonymisation secure enough?
A. We had taken advice from an expert in the Computer Lab about anonymisation of data, but we were also advised to cross-check with the University Computing Service that: (1) they were happy that the anonymisation process used, including the identifiers we might insert in relation to some of the data (for example identifying the school to which users belong), was adequate to protect the identity of individuals, and (2) since the open data licence contemplates commercial use, the UCS was confident that this is within the JANET acceptable use policy.
It was pointed out that if we use the PDDL open data licence, we are releasing the material irrevocably and with no restraints. Both commercialisation and patenting at a future date are precluded, in practice and by the terms of the licence.

Conclusion. This is a tricky area and it is good to have as many opinions as possible confirming that the anonymisation is adequate. We also need to track (if we can) uses made of the data and any indication the anonymisation is inadequate. However, the terms of the PDDL mean there is little we can do about it once the data is released, which just increases the pressure to ensure we have good anonymisation in the first place.

Q. Would this breach anyone’s intellectual property rights?
A. The data was created for the administrative purposes of the University, so there should be no problem under the IP Ordinance (the University would own the data). Release of the data also releases the database table structures. Since the software that produced the log tables was licensed under the ECL2 open source licence, which allows onward distribution on an open source basis, there should be no problem here either. We were advised to check that there is no conflict between the ECL2 and the PDDL licences.

Conclusion. I hadn’t thought about this aspect of releasing data. This question represents an unexpected benefit of running open source code and a potential minor headache for anyone releasing data from a commercial system.


Friday 16 September 2011

What can our Perl analysis tool do?

Introduction

We have developed a Perl script that allows us to extract information from the event logs and enrich this with data from other sources.

We have used this extracted data in several ways – for example looking at when the peak usage occurs for different events and visualising a network of sites linked by their users.

This article gives an overview of the information that can be extracted using this script and of the additional data that is needed.

Institutional Unique Logins

Input Data:

  • Event Logs broken down by Academic Year

  • Users and the Institutions (eg colleges, departments) they belong to

Request:

  • Academic years to query

  • Period over which to accumulate counts (year, month, week).

Output:

  • Report of how many users belong to only 1 institution and how many to multiple (2,3 or 4) institutions.

    inst_membership_count.csv

  • For each institution, produces a count of unique logins over the requested time unit. These are produced separately for each unit of time and also joined into one overall file for the requested period. This can be pivoted in a spreadsheet to produce overall total unique logins per year/month/week for each institution (see the sketch at the end of this subsection).

    unique_insts_year_wk or month no.csv for each week or month no

    unique_insts_by_week.csv or unique_insts_by_month.csv

Warning: Many undergraduates are not assigned to a Department but only to a college during their initial years of study.
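
For those curious about the counting itself, here is a stripped-down sketch of the unique-login aggregation. The input column layout (user, institution, week) is an assumption; the real script also handles academic years and users with multiple institutions.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Assumed input columns: user,institution,week (one row per login event).
    my (%seen, %count);
    while (<>) {
        chomp;
        my ($user, $inst, $week) = split /,/;
        next if $seen{"$user|$inst|$week"}++;   # count each user once per institution per week
        $count{$inst}{$week}++;
    }

    for my $inst (sort keys %count) {
        for my $week (sort { $a <=> $b } keys %{ $count{$inst} }) {
            print "$inst,$week,$count{$inst}{$week}\n";
        }
    }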

Counts of Events

Input Data:

  • Event Logs broken down by Academic Year

Request:

  • Academic years to query

  • Period over which to accumulate counts (month, week).

Output:

  • Produces a separate file for each requested time unit containing total counts for each type of event. These are produced separately for each unit of time and also joined into one overall file for the requested period. This can be pivoted in a spreadsheet to produce overall total counts for each event per month or week.

    total_events_year_wk or month no.csv for each week or month no

    total_events_by_week.csv or total_events_by_month.csv

This was used to analyse when peaks in activity occurred for different events.


Detailed and Summary reports for a list of Users

Input Data:

  • Event Logs broken down by Academic Year

  • Sakai Site file for Site Title

Parameters:

  • List of user identifiers – can just be one user

  • Academic years to query

Output:

For each user:

  • Detailed breakdown of each session (see later).

    events_for_user_year.csv


Illustration 1: Selected columns for one User session showing Sites and Content Type

  • Sites visited by this user with counts for each week

    sites_for_user_year.csv



Illustration 2: Sites visited by a User with Weekly Counts


  • Events and sites visited by this user with counts for each week

    events_for_user_year_weekly_event&site.csv

This would look like the illustration below (Illustration 3), but with weekly counts added.

  • Summary events and sites (no weekly counts)

    events_for_user_year_summary_event&site.csv


Illustration 3: Event Types, Sites and Counts for a User


  • Total counts for each event across all sites

    events_for_user_year_summary_event_all_sites.csv


Illustration 4: Totals for each Event Type



  • Overall list of sites visited by these users

    sites_for_user_list_.csv



Detailed and Summary reports for a list of Sites

Input Data:

  • Event Logs broken down by Academic Year

  • Users and the Institutions (eg colleges, departments) they belong to

  • Sakai Site file for Site Title for other sites visited.

Parameters:

  • List of Site Titles – can just be one site (will also take a Site Id).

  • Academic years to query

  • Look for admin users/non admin users or both

Output:

For each site:

  • Detailed breakdown of each session (see later).

    usertype_events_for_site_year.csv

  • List of users who have logged events on this site.

    active_users_for_site_year.csv

    These can be joined across sites and pivoted to give a list of all active users that have visited these sites. This data was used for Gephi Network visualisation.

  • Events and sites visited by users of this site, with counts for each week: all sites visited within a session in which the requested site is visited.

    usertype_events_for_site_year_weekly_event&site.csv

  • Events and sites visited by users of this site, with counts for each week: just this site.

    usertype_events_for_site_year_weekly_event_this_site.csv

  • Summary of events and sites (no weekly counts): all sites visited within a session in which the requested site is visited.

    usertype_events_for_site_year_summary_event&site.csv


    Illustration 5: Sample Rows for this report


  • Summary of events and sites (no weekly counts): just this site

    usertype_events_for_site_year_summary_event_this_site.csv


    Illustration 6: Sample rows for this report (just events for this site)


Detailed Session Breakdown

This is pretty much the same whether we are looking at a list of users or sites. Much has been done to extract and derive further information but the tool needs to be developed further (or spreadsheet pivoting used) to summarise this into useful reports/visualisations.

Warning:

Site and Tool information is embedded in the Event Ref field; non-Cambridge installations of the Sakai VLE will need customised code to extract it.

Session Information (Only produced for a Site request)

A session is normally a chain of events that takes place between a user logging in and logging out (or being automatically logged out).

The detailed breakdown of each session includes the session length, an indication of whether logout was forced by the system after a period of inactivity, and the time of day (am, pm, eve or night) when the session occurred. This data could be used to analyse how the session length or time of day varies across sites and times of year.
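
As a sketch of how the session length and time-of-day bucket might be derived: the epoch-seconds input and the hour boundaries for am/pm/eve/night below are assumptions, and the real script works from the event stream rather than a pair of timestamps.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Given login/logout times as Unix epoch seconds, derive the session length
    # in minutes and a time-of-day bucket for when the session started.
    sub session_summary {
        my ($login, $logout) = @_;
        my $length_mins = int(($logout - $login) / 60);
        my $hour = (localtime $login)[2];
        my $bucket = $hour < 6  ? 'night'
                   : $hour < 12 ? 'am'
                   : $hour < 18 ? 'pm'
                   :              'eve';
        return ($length_mins, $bucket);
    }

    my ($mins, $when) = session_summary(1_300_000_000, 1_300_002_700);
    print "$mins minutes, $when\n";   # 45 minutes, plus a bucket that depends on your time zone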


Illustration 7: Selected columns for a Session for a Site showing Session Information


Event Information

'Routine' events such as login and logout, searching for announcements and pres.begin have been excluded from this report (they can be turned on again within the code if required).

Institution Details

The first two institutions have been added for each user where available. This also includes an indication of whether the user is an admin user, a regular user, or both. This is based on a flag set in the Site file for Admin users.


Event Time

As well as a time stamp for each event, we include the calendar year and week number.

Event, Site and Tool

This shows the event type (eg content.read). The Site is extracted from the Event Ref using knowledge of the content of the Event Ref data for each event type. This routine will most likely need changing for different institutional implementations of the Sakai VLE. The Sakai Site file is used to convert a Site Number to a Site Title. Where there is no match in the Sakai Site file '*Site not found*' will be shown.
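
To illustrate the kind of parsing involved, here is a sketch that assumes a content event ref of the form /content/group/<site id>/...; other event types and other Sakai installations will need different patterns, and the example site id and title are invented.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Loaded from the Sakai Site file in practice; hard-coded here for the example.
    my %site_title = ( 'abc-123' => 'Example Site' );

    # Pull a site id out of a content event ref, then look up its title.
    sub site_for_ref {
        my ($ref) = @_;
        return '*Site not found*' unless $ref =~ m{^/content/group/([^/]+)/};
        return $site_title{$1} // '*Site not found*';
    }

    print site_for_ref('/content/group/abc-123/handouts/week1.pdf'), "\n";   # Example Site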

The Tool used is derived from the Event Type and the Event Ref. For some events such as those dealing with content, the exact tool used has to be determined from further information in the reference. This may indicate assignments, announcements, calendar, course outline, forums and the mail tool.

The reports that the Perl tool produces are based on events, not the underlying tools, and further work would be needed to report by tool.

Content and content type

The content can be extracted from the Event Ref for many events, including content.read. There is a rudimentary analysis of what the content contains (based on searching for terms like “Examination”, “Exercise”, “Timetable” and “Syllabus”). This is printed in the detailed event report, but further work would be needed to look at which sites display what type of content or when students access this content.

The length of the content address string and its depth in terms of files have also been extracted to allow for future analysis here.
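
A sketch of the rudimentary classification and the length/depth measures: the keyword list mirrors the terms above, while the sample path and the depth-as-slash-count measure are assumptions.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Classify a content reference by keyword and record its length and folder depth.
    sub classify_content {
        my ($ref) = @_;
        my ($type) = $ref =~ /(Examination|Exercise|Timetable|Syllabus)/i;
        $type //= 'other';
        my $depth = () = $ref =~ m{/}g;   # crude depth: number of '/' separators
        return ($type, length($ref), $depth);
    }

    my ($type, $len, $depth) = classify_content('/content/group/abc-123/Exams/Examination2011.pdf');
    print "$type,$len,$depth\n";   # Examination,48,5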

Created By, Created On, Modified By and Modified On

(Currently only printed in reports produced for Users)

Records who created and last modified this site and when.

Counts

Various counts are printed to be used in spreadsheet pivoting.

The following are currently only printed in reports produced for Sites:

Search For

This could be used to investigate what terms users search for and, in conjunction with the next site they go to, to understand more about search issues.

Assessment Title

For Samigo Tests and Quizzes the tool can determine the title of the Assessment provided it has access to the following files:

sam_assessmentgrading_t

sam_publishedassessment_t

Evaluation Owner and Title

For the Swift Tool, the Perl tool can determine the Evaluation owner and title provided it has access to the following files:

eval_response

eval_adhoc_group


How to work with Gephi to visualise a network of Sites connected by Users


Top50 Network Chart
Purpose

Instructions on how to import data from a list of sites and their active users and visualise the network using Gephi.

Background

We have developed tools that can be run to extract a given list of sites and their active users from the VLE logging data. This gives us users that have actually visited the site as opposed to those who have been given access to the site.

Ingredients

  • Perl statistics tool developed for this project (this may need customising to fit your version of the VLE logging data).

  • Gephi application

  • Basic knowledge of spreadsheets – OpenOffice was used here.

Assumptions

The user has downloaded the current version of the Gephi application and worked through at least the basic on-line tutorial.

Method

You will first prepare the data for import into Gephi. After importing it, you will be able to explore the network visually in Gephi.

Individual steps

Prepare Data for Gephi:

  • Create a list of sites and active users (eg all_active_users_for_sites.csv):

    This can be done by inputting a list of sites to the Perl stats.pl script and running this to produce a list of active users for each site. E.g. active_users_for_.csv

Join these files together: cat act* > all_active_users_for_sites.csv

  • Use a spreadsheet sort to order the rows by Site and remove all the heading lines (Site, User etc).

Check and correct the user ids at this point – I found some (eg thu21) which had been mangled by the spreadsheet into date format.

  • Run the gephi.pl Perl script against this to produce lists of nodes and edges for import into Gephi.

eg perl gephi.pl -f all_active_users_for_sites.csv -w

This determines the number of active users for each site and the number of shared users between any 2 sites. It will produce 'nodes.csv' and 'edges.csv' in the input file location.
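
For reference, here is a condensed sketch of the kind of processing gephi.pl performs. The Site,User input layout and the Id/Label/Source/Target/Weight headings are assumptions, chosen to match what Gephi's spreadsheet import and the steps below expect; the real script differs.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Build Gephi node and edge lists from Site,User rows (e.g. all_active_users_for_sites.csv).
    my %users_of_site;
    while (<>) {
        chomp;
        my ($site, $user) = split /,/;
        $users_of_site{$site}{$user} = 1;
    }

    my @sites = sort keys %users_of_site;

    open my $nodes, '>', 'nodes.csv' or die "nodes.csv: $!";
    print {$nodes} "Id,Label,Total Users\n";
    print {$nodes} "$_,$_," . scalar(keys %{ $users_of_site{$_} }) . "\n" for @sites;
    close $nodes;

    open my $edges, '>', 'edges.csv' or die "edges.csv: $!";
    print {$edges} "Source,Target,Weight\n";
    for my $i (0 .. $#sites) {
        for my $j ($i + 1 .. $#sites) {
            # Shared users = users of site i who also appear under site j.
            my $shared = grep { $users_of_site{ $sites[$j] }{$_} } keys %{ $users_of_site{ $sites[$i] } };
            print {$edges} "$sites[$i],$sites[$j],$shared\n" if $shared;
        }
    }
    close $edges;

Note that the pairwise comparison grows quadratically with the number of sites, which is fine for a Top 50 list but would be slow for thousands of sites.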

Importing Data into Gephi:

  • Open up Gephi and select File > New Project

  • Click on the Data Laboratory Tab

  • Click on Import Spreadsheet and browse to the location of the nodes file that you created above.

  • Make sure 'As Table' is set to Nodes Table

  • Click on Next

  • Set the Total Users field to Integer – choose from the drop down list (you may need to go up through the list of options)

  • Click on Finish

  • Nodes table should now be imported

  • Repeat for Edges table

  • Go to Overview tab

  • You should now see a network!


Working with Gephi:

(Please follow the Gephi Getting Started tutorial to learn the details for these steps)

Setting the node size to Number of Users:

  • Select the Ranking tab at the top right of the screen

  • Choose Total Users as the Ranking Parameter

  • Click on the colour wheel button

  • Set the Color range

  • Click on Apply

  • The nodes will now have colour intensity linked to the number of users

  • Click on the diamond icon

  • Set the Min Size (eg 10) and the Max Size (eg 100). The range should already be set to the max and min no of users.

  • Click on Apply

  • The node sizes should now reflect the number of users.


Changing the Network Layout:

  • From the layout window on the bottom left of the page select Force Atlas

  • Set Repulsion strength 10000, tick Adjust by sizes.

  • Click on Run

  • Click Stop when it has stopped changing


Showing the Labels

  • Click on the black 'T' icon at the bottom of the Graph display

  • You should see the node labels

  • You can use the rightmost slider at the bottom of the Graph display to change the text size.

To stop the Labels overwriting each other:

  • From the Layout tab – bottom of left hand side of page:

  • Choose Label Adjust

  • Click on Run

To highlight clusters of nodes:

  • Click on the Statistics tab on the left hand panel.

  • Click to run Modularity – this looks for clusters of nodes that are tightly associated with each other and more loosely linked to other nodes.

  • Click on the Partition tab on the right hand panel.

  • Choose Modularity Class and click on Run

  • You should now see tightly associated nodes have the same colours.

Hiding Nodes

  • Go to the Filters menu

  • Click on Topology then drag 'Degree range' down to the Query section.

  • Use the slider to change the Degree range settings

  • Press filter

  • You should see that the less connected nodes have been hidden.


Interaction with Gephi

  • If you click on one of the nodes in Overview mode you can see the linked nodes highlighted.


Producing an output view

  • Click on the Preview tab.

  • Click on Refresh

  • If you don't see anything try clicking on the Reset Zoom button

  • Use the Default preset with the following settings:

    • Show labels ticked

    • Increase label font if needed

    • Undirected Curved set

    • Rescale weight ticked (this makes sure the connecting lines are not really thick)

    • Thickness – maybe set to 4 – this thickens up the lines reduced above

    • Proportional Label size unticked

    • Click on Refresh

  • You can use the Preview ratio slider to filter out less active nodes.

  • Export to SVG/PDF to save a copy


Using Gephi to Visualise our Sites and Users


Top50 Network Chart

Discussion

We took as our starting point the talk by Tony Hirst on visualisation http://blog.activitydata.org/2011/07/online-exchange-4-event-recording-21.html, in particular the slides produced using Gephi.

The problem with this approach was that all the examples using Gephi were using it to map a network of the same items – eg a network of Twitter users. We wanted to visualise a network of sites and their users – a bipartite network. I found a reference to another tool that has some capabilities for the display and analysis of bipartite networks: ORA (developed at Carnegie Mellon, part of the CASOS (sic) package: http://www.casos.cs.cmu.edu/projects/ora/), but the comment also said that 'the visualizations are simply not on par with those of Gephi'. A quick look at this did not make me feel that it would be useful for our needs.

Instead I decided to try using the number of common users to link individual sites and to produce a Gephi visualisation where the sites were the nodes with their size representing the number of active users of the site. The thickness of the connecting lines between the sites would be related to the number of common users they had.

To do this, I developed a Perl script to determine the number of shared users between each of a list of sites and produce the nodes and edges data files that could be imported into Gephi. The detailed steps taken to produce network visualisations using this data will be reported in a separate blog post.


College Chart

The visualisation above shows the Sites used within one of our colleges and links them by the number of shared users. The size of the circle increases with the number of users of that site. So we see that the biggest site is the College main site, with the next biggest being Computing, which has many shared users.

The Modularity Algorithm within Gephi was used to partition the sites into clusters where the sites are tightly associated with each other and more loosely linked to other sites. We can see that the green coloured sites could be categorised as 'college services' (IT, catering etc), the blue sites are mostly admissions related and the red sites include many subject related sites.

A similar process was applied to one of our larger departments (Illustration 1). This showed clustering and shared usage around their first year sites as well as heavy usage of their library/resource site. We will be able to use this visualisation as a resource to help us quickly identify important sites within this department when we upgrade their VLE.

This procedure can be followed for either a list of users or a list of sites to produce similar visualisations, which we believe will be helpful in future for understanding how our sites are related and linked.