Friday 16 September 2011

What can our Perl analysis tool do?

Introduction

We have developed a Perl script that allows us to extract information from the event logs and enrich this with data from other sources.

We have used this extracted data in several ways – for example, looking at when peak usage occurs for different events and visualising a network of sites linked by their users.

This article gives an overview of the information that can be extracted using this script and of the additional data that is needed.

Institutional Unique Logins

Input Data:

  • Event Logs broken down by Academic Year

  • Users and the Institutions (eg colleges, departments) they belong to

Request:

  • Academic years to query

  • Period over which to accumulate counts (year, month, week).

Output:

  • Report of how many users belong to only 1 institution and how many to multiple (2, 3 or 4) institutions.

    inst_membership_count.csv

  • For each institution, produces a count of unique logins over the requested time unit. These are produced separately for each unit of time and also joined into one overall file for the requested period. This can be pivoted in a spreadsheet to produce overall total unique logins per year/month/week for each institution (a minimal counting sketch is given after the warning below).

    unique_insts_year_wk (or month no).csv – one file for each week or month

    unique_insts_by_week.csv or unique_insts_by_month.csv

Warning: Many undergraduates are not assigned to a Department but only to a college during their initial years of study.
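
As an illustration of the counting involved, here is a minimal Perl sketch (not the actual tool) that tallies unique logins per institution per week. The file names, the column layout and the event name ('user.login') are assumptions for illustration only:

    #!/usr/bin/perl
    # Sketch only: unique logins per institution per week.
    # Assumes a hypothetical events CSV (user_id,event,week_no) and a
    # user->institution CSV (user_id,institution); the real inputs differ.
    use strict;
    use warnings;

    my %inst_of;    # user_id => institution
    open my $users, '<', 'users_institutions.csv' or die $!;
    while (<$users>) {
        chomp;
        my ($user, $inst) = split /,/;
        $inst_of{$user} = $inst;
    }
    close $users;

    my %seen;       # "inst,week,user" => 1, so each user counts once per week
    my %logins;     # institution => week => unique login count
    open my $events, '<', 'events_2010.csv' or die $!;
    while (<$events>) {
        chomp;
        my ($user, $event, $week) = split /,/;
        next unless $event eq 'user.login';
        my $inst = $inst_of{$user} or next;
        next if $seen{"$inst,$week,$user"}++;
        $logins{$inst}{$week}++;
    }
    close $events;

    for my $inst (sort keys %logins) {
        for my $week (sort { $a <=> $b } keys %{ $logins{$inst} }) {
            print "$inst,$week,$logins{$inst}{$week}\n";
        }
    }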

Counts of Events

Input Data:

  • Event Logs broken down by Academic Year

Request:

  • Academic years to query

  • Period over which to accumulate counts (month, week).

Output:

  • Produces a separate file for each requested time unit containing total counts for each type of event. These are also joined into one overall file for the requested period, which can be pivoted in a spreadsheet to produce overall total counts for each event per month or week.

    total_events_year_wk (or month no).csv – one file for each week or month

    total_events_by_week.csv or total_events_by_month.csv

This was used to analyse when peaks in activity occurred for different events.


Detailed and Summary reports for a list of Users

Input Data:

  • Event Logs broken down by Academic Year

  • Sakai Site file for Site Title

Parameters:

  • List of user identifiers – can just be one user

  • Academic years to query

Output:

For each user:

  • Detailed breakdown of each session (see later).

    events_for_user_year.csv


Illustration 1: Selected columns for one User session showing Sites and Content Type

  • Sites visited by this user with counts for each week.

    sites_for_user_year.csv



Illustration 2: Sites visited by a User with Weekly Counts


  • Events and sites visited by this user with counts for each week

    events_for_user_year_weekly_event&site.csv

This output resembles Illustration 3 below, but with weekly counts included.

  • Summary events and sites (no weekly counts)

    events_for_user_year_summary_event&site.csv


Illustration 3: Event Types, Sites and Counts for a User


  • Total counts for each event across all sites

    events_for_user_year_summary_event_all_sites.csv


Illustration 4: Totals for each Event Type



  • Overall list of sites visited by these users.

    sites_for_user_list_.csv



Detailed and Summary reports for a list of Sites

Input Data:

  • Event Logs broken down by Academic Year

  • Users and the Institutions (eg colleges, departments) they belong to

  • Sakai Site file for Site Title for other sites visited.

Parameters:

  • List of Site Titles – can just be one site (will also take a Site Id).

  • Academic years to query

  • Look for admin users, non-admin users, or both

Output:

For each site:

  • Detailed breakdown of each session (see later).

    usertype_events_for_site_year.csv

  • List of users who have logged events on this site.

    active_users_for_site_year.csv

    These can be joined across sites and pivoted to give a list of all active users that have visited these sites. This data was used for Gephi Network visualisation.

  • Events and sites visited, with counts for each week, covering all sites visited within a session in which the requested site is visited.

    usertype_events_for_site_year_weekly_event&site.csv

  • Events, with counts for each week, for just this site.

    usertype_events_for_site_year_weekly_event_this_site.csv

  • Summary of events and sites (no weekly counts), covering all sites visited within a session in which the requested site is visited.

    usertype_events_for_site_year_summary_event&site.csv


Illustration 5: Sample Rows for this report


  • Summary of events (no weekly counts) for just this site.

    usertype_events_for_site_year_summary_event_this_site.csv


Illustration 6: Sample rows for this report (just events for this site)


Detailed Session Breakdown

This is much the same whether we are looking at a list of users or a list of sites. Much has been done to extract and derive further information, but the tool needs to be developed further (or spreadsheet pivoting used) to summarise this into useful reports and visualisations.

Warning:

Site and Tool information is embedded in the Event Ref field, and customised code will be needed to extract it from implementations of the Sakai VLE outside Cambridge University.

Session Information (Only produced for a Site request)

A session is normally a chain of events that takes place between a user logging in and logging out (or being automatically logged out).

The detailed breakdown of each session includes the session length, an indication of whether logout was forced by the system after a period of inactivity, and the time of day (am, pm, eve or night) when the session occurred. This data could be used to analyse how session length or time of day varies across sites and across the year.
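As a rough sketch of this derivation (not the tool's actual code), the following computes a session length and a time-of-day slot from a pair of epoch timestamps; the slot boundaries used here are assumptions for illustration:

    # Sketch only: session length and time-of-day slot from hypothetical
    # login/logout epoch timestamps (slot boundaries are assumed).
    use strict;
    use warnings;

    sub session_summary {
        my ($login_ts, $logout_ts) = @_;              # epoch seconds
        my $length_mins = int(($logout_ts - $login_ts) / 60);
        my $hour = (localtime $login_ts)[2];
        my $slot = $hour < 6  ? 'night'
                 : $hour < 12 ? 'am'
                 : $hour < 18 ? 'pm'
                 :              'eve';
        return ($length_mins, $slot);
    }

    my ($mins, $slot) = session_summary(1286182800, 1286186400);
    print "length=${mins}m, slot=$slot\n";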


Illustration 7: Selected columns for a Session for a Site showing Session Information


Event Information

'Routine' events such as login and logout, searching for announcements, and pres.begin have been omitted from this report (they can be turned back on within the code if required).

Institution Details

The first two institutions are added for each user where available, together with an indication of whether the user is an admin user, a regular user, or both. This is based on a flag set in the Site file for Admin users.


Event Time

As well as a time stamp for each event, we include the calendar year and week number.

Event, Site and Tool

This shows the event type (eg content.read). The Site is extracted from the Event Ref using knowledge of the content of the Event Ref data for each event type. This routine will most likely need changing for different institutional implementations of the Sakai VLE. The Sakai Site file is used to convert a Site Number to a Site Title. Where there is no match in the Sakai Site file, '*Site not found*' will be shown.
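
A minimal sketch of this kind of extraction is shown below. The ref formats matched here (/content/group/<site-id>/... and /site/<site-id>/...) are illustrative assumptions and would need adapting to your own event data:

    # Sketch only: pull a site id out of an Event Ref and look up its title.
    # The ref formats and the sample title table are assumptions.
    use strict;
    use warnings;

    my %site_title = ( 'abc123' => 'Example Course Site' );  # from the Sakai Site file

    sub site_for_ref {
        my ($ref) = @_;
        my ($site) = $ref =~ m{^/(?:content/group|site)/([^/]+)};
        return '*Site not found*'
            unless defined $site && exists $site_title{$site};
        return $site_title{$site};
    }

    print site_for_ref('/content/group/abc123/week1/notes.pdf'), "\n";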

The Tool used is derived from the Event Type and the Event Ref. For some events such as those dealing with content, the exact tool used has to be determined from further information in the reference. This may indicate assignments, announcements, calendar, course outline, forums and the mail tool.

The reports that the Perl tool produces are based on events not the underlying tools and further work would be needed to be able to report by tools.

Content and content type

The content can be extracted from the Event Ref for many events including content.read. There is a rudimentary analysis of what the content contains (based on searching for terms like “Examination”, “Exercise”, “Timetable” and “Syllabus”). This is printed in the detailed event report, but further work would be needed to look at which sites display what type of content or when students access this content.

The length of the content address string and its depth in terms of files have also been extracted to allow for future analysis here.
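
A minimal sketch of this kind of keyword classification and depth measurement (not the tool's actual code, and the keyword list is only the examples mentioned above):

    # Sketch only: rudimentary content classification plus path length/depth,
    # assuming the content path has already been extracted from the Event Ref.
    use strict;
    use warnings;

    sub classify_content {
        my ($path) = @_;
        my $type = $path =~ /examination/i ? 'Examination'
                 : $path =~ /exercise/i    ? 'Exercise'
                 : $path =~ /timetable/i   ? 'Timetable'
                 : $path =~ /syllabus/i    ? 'Syllabus'
                 :                           'Other';
        my $depth  = () = $path =~ m{/}g;   # count of '/' as a depth proxy
        my $length = length $path;
        return ($type, $depth, $length);
    }

    my ($type, $depth, $len) = classify_content('/group/abc123/Exercises/sheet1.pdf');
    print "type=$type depth=$depth length=$len\n";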

Created By, Created On, Modified By and Modified On

(Currently only printed in reports produced for Users)

Records who created and last modified this site and when.

Counts

Various counts are printed to be used in spreadsheet pivoting.

The following are currently only printed in reports produced for Sites:

Search For

Could be used to investigate what terms users search for and, in conjunction with the next site they go to, to understand more about search issues.

Assessment Title

For Samigo Tests and Quizzes the tool can determine the title of the Assessment provided it has access to the following files:

sam_assessmentgrading_t

sam_publishedassessment_t

Evaluation Owner and Title

For the Swift Tool, the Perl tool can determine the Evaluation owner and title provided it has access to the following files:

eval_response

eval_adhoc_group







How to work with Gephi to visualise a network of Sites connected by Users


Top50 Network Chart
Purpose

Instructions on how to import data from a list of sites and their active users and visualise the network using Gephi.

Background

We have developed tools that can be run to extract a given list of sites and their active users from the VLE logging data. This gives us users who have actually visited the site, as opposed to those who have merely been given access to it.

Ingredients

  • Perl statistics tool developed for this project (this may need customising to fit your version of the VLE logging data).

  • Gephi application

  • Basic knowledge of spreadsheets – OpenOffice was used here.

Assumptions

The user has downloaded the current version of the Gephi application and worked through at least the basic on-line tutorial.

Method

You will first prepare the data for import into Gephi. After importing it, you will be able to explore the network visually using Gephi.

Individual steps

Prepare Data for Gephi:

  • Create a list of sites and active users (eg all_active_users_for_sites.csv):

    This can be done by inputting a list of sites to the Perl stats.pl script and running it to produce a list of active users for each site, eg active_users_for_.csv

Join these files together: cat act* > all_active_users_for_sites.csv

  • Use spreadsheet sort to sort in order of Site and remove all the heading lines (Site, User etc).

Check and correct the user ids at this point – I found some (eg thu21) which had been mangled by the spreadsheet into date format.

  • Run the gephi Perl script against this to produce lists of nodes and edges for import into Gephi.

eg perl gephi.pl -f all_active_users_for_sites.csv -w

This determines the number of active users for each site and the number of shared users between any 2 sites. It will produce 'nodes.csv' and 'edges.csv' in the input file location.
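
For illustration, here is a minimal Perl sketch in the spirit of gephi.pl. It is a hypothetical reimplementation, not the script itself; the column headings assume Gephi's usual Id/Label and Source/Target/Weight import conventions, and the "Total Users" column matches the field used in the import steps below:

    #!/usr/bin/perl
    # Sketch only: derive Gephi nodes and edges from (site,user) rows.
    use strict;
    use warnings;

    my %users_of;    # site => { user => 1 }
    open my $in, '<', 'all_active_users_for_sites.csv' or die $!;
    while (<$in>) {
        chomp;
        my ($site, $user) = split /,/;
        $users_of{$site}{$user} = 1;
    }
    close $in;

    # Nodes: one row per site, with its number of active users.
    open my $nodes, '>', 'nodes.csv' or die $!;
    print $nodes "Id,Label,Total Users\n";
    for my $site (sort keys %users_of) {
        my $n = keys %{ $users_of{$site} };
        print $nodes "$site,$site,$n\n";
    }
    close $nodes;

    # Edges: one row per pair of sites, weighted by shared users.
    open my $edges, '>', 'edges.csv' or die $!;
    print $edges "Source,Target,Weight\n";
    my @sites = sort keys %users_of;
    for my $i (0 .. $#sites) {
        for my $j ($i + 1 .. $#sites) {
            my $shared = grep { $users_of{ $sites[$j] }{$_} }
                         keys %{ $users_of{ $sites[$i] } };
            print $edges "$sites[$i],$sites[$j],$shared\n" if $shared;
        }
    }
    close $edges;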

Importing Data into Gephi:

  • Open up Gephi and select File > New Project

  • Click on the Data Laboratory Tab

  • Click on Import Spreadsheet and browse to the location of the nodes file that you created above.

  • Make sure 'As Table' is set to Nodes Table

  • Click on Next

  • Set the Total Users field to Integer – choose from the drop-down list (you may need to go up through the list of options)

  • Click on Finish

  • Nodes table should now be imported

  • Repeat for Edges table

  • Go to Overview tab

  • You should now see a network!


Working with Gephi:

(Please follow the Gephi Getting Started tutorial to learn the details for these steps)

Setting the node size to Number of Users:

  • Select the Ranking tab at the top right of the screen

  • Choose Total Users as the Ranking Parameter

  • Click on the colour wheel button

  • Set the Color range

  • Click on Apply

  • The nodes will now have colour intensity linked to the number of users

  • Click on the diamond icon

  • Set the Min Size (eg 10) and the Max Size (eg 100). The range should already be set to the max and min no of users.

  • Click on Apply

  • The node sizes should now reflect the number of users.


Changing the Network Layout:

  • From the layout window at the bottom left of the page, select Force Atlas

  • Set Repulsion strength to 10000 and tick Adjust by sizes.

  • Click on Run

  • Click Stop when it has stopped changing


Showing the Labels

  • Click on the black 'T' icon at the bottom of the Graph display

  • You should see the node labels

  • You can use the rightmost slider at the bottom of the Graph display to change the text size.

To stop the Labels overwriting each other:

  • From the Layout tab – bottom of left hand side of page:

  • Choose Label Adjust

  • Click on Run

To highlight clusters of nodes:

  • Click on the Statistics tab on the left hand panel.

  • Click to run Modularity – this looks for clusters of nodes that are tightly associated with each other and more loosely linked to other nodes.

  • Click on the Partition tab on the right hand panel.

  • Choose Modularity Class and click on Run

  • You should now see tightly associated nodes have the same colours.

Hiding Nodes

  • Go to the Filters menu

  • Click on Topology then drag 'Degree range' down to the Query section.

  • Use the slider to change the Degree range settings

  • Press Filter

  • You should see that the less connected nodes have been hidden.


Interaction with Gephi

  • If you click on one of the nodes in Overview mode you can see the linked nodes highlighted.


Producing an output view

  • Click on the Preview tab.

  • Click on Refresh

  • If you don't see anything try clicking on the Reset Zoom button

  • Use the Default preset with the following settings:

    • Show labels ticked

    • Increase label font if needed

    • Undirected Curved set

    • Rescale weight ticked (this makes sure the connecting lines are not really thick)

    • Thickness – maybe set to 4 – this thickens up the lines reduced above

    • Proportional Label size unticked

    • Click on Refresh

  • You can use the Preview ratio slider to filter out less active nodes.

  • Export to SVG/PDF to save a copy


Using Gephi to Visualise our Sites and Users


Top50 Network Chart

Discussion

We took as our starting point the talk by Tony Hirst on visualisation (http://blog.activitydata.org/2011/07/online-exchange-4-event-recording-21.html), in particular the slides produced using Gephi.

The problem with this approach was that all the examples using Gephi were using it to map a network of the same kind of items – eg a network of Twitter users. We wanted to visualise a network of sites and their users – a bipartite network. I found a reference to another tool that had some capabilities for the display and analysis of bipartite networks: ORA (developed at Carnegie Mellon, part of the CASOS package: http://www.casos.cs.cmu.edu/projects/ora/), but the comment also said that 'the visualizations are simply not on par with those of Gephi'. A quick look at this did not make me feel that it would be useful for our needs.

Instead I decided to try using the number of common users to link individual sites and to produce a Gephi visualisation where the sites were the nodes with their size representing the number of active users of the site. The thickness of the connecting lines between the sites would be related to the number of common users they had.

To do this, I developed a Perl script to determine the number of shared users between each of a list of sites and produce the nodes and edges data files that could be imported into Gephi. The detailed steps taken to produce network visualisations using this data will be reported in a separate blog post.


College Chart

The visualisation above shows the Sites used within one of our colleges and links them by the number of shared users. The size of a circle increases with the number of users of that site. We can see that the biggest site is the College main site, with the next biggest being Computing, which shares many users with it.

The Modularity Algorithm within Gephi was used to partition the sites into clusters where the sites are tightly associated with each other and more loosely linked to other sites. We can see that the green coloured sites could be categorised as 'college services' (IT, catering etc), the blue sites are mostly admissions related and the red sites include many subject related sites.

A similar process was applied to one of our larger departments (Illustration 1). This showed clustering and shared usage around their first year sites as well as heavy usage of their library/resource site. We will be able to use this visualisation as a resource to help us quickly identify important sites within this department when we upgrade their VLE.

This procedure can be followed for either a list of users or a list of sites to produce similar visualisations which we believe will be helpful in understanding more about how our sites are related and linked in the future.







Tuesday 13 September 2011

Analysis of when peak VLE activity has occurred

Purpose

To look at the total counts from events logging and determine when peak activity occurs.

Background

Caret has developed a tool that can be run to extract total event counts by month or week from the VLE logging data. This document will explore some of the trends we have found from this data.

Warnings

The event logging data we are working with was collected until February 2011 so does not include full data for the last academic year (2010-11).

The week numbers were added to the underlying events file by using a function within MySQL. This is not ideal in that it leaves part-weeks at the beginning and end of the year, and these points needed some manual fixing within the spreadsheet. It may also have helped to number weeks from the beginning of the academic year.
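
For example, weeks could instead be numbered from the start of the academic year with a small Perl helper along these lines (the 1 October start date is an assumption for illustration):

    # Sketch only: number weeks from the start of the academic year
    # (1 October assumed) rather than the calendar year.
    use strict;
    use warnings;
    use Time::Local qw(timelocal);

    sub academic_week {
        my ($epoch) = @_;
        my ($day, $mon, $year) = (localtime $epoch)[3, 4, 5];
        my $start_year = $mon >= 9 ? $year : $year - 1;     # Oct (month 9) onwards
        my $start = timelocal(0, 0, 0, 1, 9, $start_year);  # 1 October
        return 1 + int(($epoch - $start) / (7 * 24 * 3600));
    }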

On initial analysis, a very large proportion of event activity was found to come from a pair of events (pres.begin and pres.end) for which logging was later switched off. These events have been excluded from our counts.

Method

A script was run that produced monthly event counts from the VLE activity data over the last 5 years – this ran until Feb 2011.

The count for each event type was compared with the total count to produce a percentage. These results were then filtered to produce charts according to the scale of usage.
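
The percentage calculation itself is straightforward; a minimal sketch with made-up counts:

    # Sketch only: express each event type's count as a percentage of the
    # total (the counts here are invented for illustration).
    use strict;
    use warnings;

    my %count = ( 'content.read' => 4400, 'user.login' => 2200, 'other' => 3400 );
    my $total = 0;
    $total += $_ for values %count;
    printf "%-14s %5.1f%%\n", $_, 100 * $count{$_} / $total for sort keys %count;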

Results

The top 2 events were found to be reading content (green - 44% of events logged) and user logins (dark green - 22% of events logged). The chart below shows monthly counts for these events.

Content read and logins Chart
Whilst content read shows a regular peak in Oct/Nov with a secondary peak around May, the logins do not show such a clear pattern. Both show a clear gradual underlying increase over the years. More investigation would be needed to find out exactly what is causing the peaks in logins, especially since they bear little relation to the number of user logouts recorded (see later graph).


The following shows a weekly count for content read, giving more granularity on exactly when the peaks occur. The main peaks recur in late October, with secondary peaks in February and mid-June.

Weekly Chart

This considers weekly counts over the last year (Feb 2010 - Feb 2011).

Term dates were:

12th January - 12th March 2010. Easter (5th April 2010) shows a slight dip.

20th April - 11th June 2010. A gradual increase with a peak at the end of May.

This coincides with the examination period. There is a long dip over the summer vacation period.

5th October - 3rd December 2010. The first peak is in early October, with the highest peak at the end of October. This peak at the end of October might not be expected, and it would be interesting to find out more about what is going on here. There is a large dip over the Christmas period.

18th January - 18th March 2011. There is a wide peak during January and early February.

Content Read Year Chart

Although content read does have a peak at the start of the Academic Year in October, there is a higher peak in November. It would be interesting to investigate if there is an explanation for this peak half way through the term – such as a large number of students starting assignments which rely on Camtools.

New Content Chart

This shows the next most active events in terms of counts. Note the scale is roughly 10% of that for viewing content and logging in. However, the overall peaks for site visits, search and user logout very much mirror those for viewing content and logins, showing consistent periods of peak activity corresponding to term dates.

As one might expect, lots of new content (blue line) has been added each October, at the start of the academic year, but there is a more recent and much higher peak in January of this year.

The following series of charts shows analysis of the next highest peaks broken down by event type. Note the scale is several times smaller than that for reading content.

The calendar event is logged when users create (red line) or revise (yellow line) site calendars, which may be used to display lecture times, assignment submission dates and so on. There is a clear October peak for both creating and revising calendars.

Calendars Chart


Swift Chart

This tool has been added in the last few years and is used for student surveys. The green update peaks in February would correspond to lecturers preparing surveys at this point, which are returned by students roughly a month later (blue line).



Wiki Chart
The counts for Wiki activity show a peak in new material just prior to the October start of year with high revision activity at the start of October.


Site Maintenance Chart


Site Maintenance demonstrates a clear October peak with a secondary peak in Jan/Feb. There is clearly a gradual increase in the number of sites being updated (red line), whereas the number of new sites (blue line) has levelled off.


Permissions Chart

Permissions changes reflect people being added or updated on sites. Again peak activity is in October.

Forums Chart

Forums again show a peak in October, but their usage looks like it has tailed off.

Chat Chart

The chat tool also shows peaks in October and, for the last year, in November too.


Roster Chart

Again the roster tool, which is used to view other members of a site, shows peak usage in October.

One area that I was particularly asked to look at was the Tests and Quizzes (T&Q) tool. The scale for this is again of a much smaller order of magnitude but again shows October peaks:

T&Q Chart

This final chart shows the overall totals for each month (blue) and the monthly totals excluding logins and content read (orange). This again shows a peak in October and other peaks corresponding to the terms. There is not much growth, in terms of event counts, over recent years in these other activities.


Overall Chart

Conclusions

Many activities show a clear usage peak corresponding with the start of the Academic year in October. However reading content, which accounts for over 44% of the counts, has a higher peak in November than in October. It would be interesting to learn exactly what is causing this November peak.

It also seems likely, given the different pattern in logouts, that the high spikes in login activity seen in the first chart are caused by spurious data and further investigation is needed here.





Friday 9 September 2011

Releasing anonymised data - Stage 1

We are now in a position to make an initial release of our VLE logging data. This blog details the process we went through to get to this stage and the decisions that we made along the way.

We were helped in this process by David Evans from Cambridge Computer Lab (http://www.cl.cam.ac.uk/~de239/) who is an expert in data privacy issues. We also considered the Privacy Policy for the site and consulted with our Legal Services team on releasing anonymous logging data under an Open Licence.

There are 3 files that we will be releasing:
Sakai Event File
– Individual events such as a site visit which occur within a Session
(held as a separate file for each Academic Year)

Sakai Session File
– Session details which include browser information and session times

Sakai Site File
– Details of the VLE Site being accessed

A tarball file can be obtained from here - however this is a 4GB file that will expand to over 12GB when uncompressed.


Our first step was to provide David Evans with the database schema for these files so he could consider, based on discussions with us, which fields might contain sensitive data. We then discussed the implications of making these fields public and what information they might expose. This was of course a balancing act between releasing data from which others could draw value and not giving away details of what a known individual might have used the VLE system for.

We decided on a first data release which would use a cautious approach to data privacy. Should other institutions find this data of interest, we can work with them to reveal more information in the area of interest to them in a manner that does not compromise individuals.

This cautious approach meant hiding any data that would identify an individual user, site names and anything that might link a session back to a particular user. We settled on a hashing algorithm to obscure any such items of data, yielding a string that is determined uniquely by the value; we also used a salt to prevent inversion of the hash through exhaustive search of short inputs.
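
A minimal sketch of this approach is shown below, using SHA-256 via Perl's Digest::SHA. The algorithm we actually used and how the salt is held are not specified here; the environment variable is purely illustrative:

    # Sketch only: consistent salted hashing of identifying fields.
    # The choice of SHA-256 and the ANON_SALT variable are assumptions.
    use strict;
    use warnings;
    use Digest::SHA qw(sha256_hex);

    my $salt = $ENV{ANON_SALT} // die "set ANON_SALT\n";   # kept secret

    sub anonymise {
        my ($value) = @_;
        return sha256_hex($salt . $value);   # same input => same output,
    }                                        # so tables can still be joined

    print anonymise('example-user-id'), "\n";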

At this stage, we also looked at some sample data to reinforce our decisions.
The decision on what to hash was straightforward in many cases, such as concealing any field with Site, User Name, URL or Content in it. Some things were less clear cut. For instance, the skin around a site could be used to identify a Department. The Session Ids that we looked at appeared to be numeric, and we decided there was little risk in leaving them in their raw state. However, later testing revealed that, in some instances and at some points in time, they have included a user identifier, so we agreed to hash them. It is worth remembering that the hashing algorithm is consistent, so even though the value of the Session Id has been changed, it can still be used to link the Event and Session tables.

The Session Server holds both the Host Name and software module Id. We decided to hash the Host Name, in case this might reveal something about the network's internal structure, but leave the numeric part (software module id) alone, as it reveals nothing other than which particular instance of software processed that session. We discovered that the format of this field had changed over time, so we needed a mildly complex function to detect and extract from both formats of the Session Server.
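
A sketch of such a function is shown below; the two formats matched here are made-up examples, as the real formats are specific to our data:

    # Sketch only: split Session Server into host name and module id for two
    # hypothetical formats, "host" and "host-12345"; adapt to your own data.
    use strict;
    use warnings;

    sub split_server {
        my ($server) = @_;
        if ($server =~ /^(.+)-(\d+)$/) {   # newer format: host-moduleid
            return ($1, $2);
        }
        return ($server, '');              # older format: host only
    }

    my ($host, $module) = split_server('vle-app-3-98765');
    print "host=$host module=$module\n";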

The Session User Agent may provide the Browser Name and Version and the Operating System and Version of the computer used for the session. However, this is a free-form field which may be changed by the user, and there was a danger that it could identify a given user. A visual inspection showed at least one College name, some company names and some school names within this field, which could present a risk. Ideally we would extract and expose data such as the Browser Name, but as this is a free-form field this is non-trivial. We therefore took the decision to hash this field.

As a final sanity check, we revisited some sample data from each of the tables once they had been hashed to satisfy ourselves that there was no raw data left that might possibly contain a user identifier.