Monday, 13 June 2011

More about our data

A lot of our time so far has gone into marshalling our data (as explained in our Data blog post) and getting it into a workable form, properly indexed and so forth. This means it should now be comparatively quick to process the data into chart form - no more setting a job going and leaving it overnight!

The Sakai event table, which accounts for most of our data, has given us 240 million rows for the last five years (a little over, actually). Only 70 million of those turned out to be useful, as the rest were mostly things like the 'presence' event. We turned that event off in 2007 because it was causing unacceptable server load. Basically, our VLE is made up of a series of sites, and the presence event was a row written to the database every five seconds for every logged-in user, saying which site they were currently looking at, so that an up-to-date list of who was in the site could be displayed. (Facebook's indication of who's around in Chat at the moment probably does something similar.) So you can guess that this presence event generated an awful lot of data until we turned it off.
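
To give a flavour of the kind of filtering involved, here's a minimal sketch. The table and event names are illustrative (Sakai's presence events are conventionally prefixed `pres.`, but the exact schema here is made up for the example):

```python
import sqlite3

# Build a tiny in-memory stand-in for the Sakai event table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sakai_event (event_id INTEGER, event TEXT, event_date TEXT)")
conn.executemany(
    "INSERT INTO sakai_event VALUES (?, ?, ?)",
    [
        (1, "pres.begin", "2007-01-01"),
        (2, "content.read", "2007-01-01"),
        (3, "pres.end", "2007-01-01"),
        (4, "site.visit", "2007-01-02"),
    ],
)

# Keep only the rows that aren't presence events.
useful = conn.execute(
    "SELECT event_id, event FROM sakai_event WHERE event NOT LIKE 'pres.%'"
).fetchall()
print(useful)  # [(2, 'content.read'), (4, 'site.visit')]
```

On the real table the same `WHERE` clause is what takes 240 million rows down to the 70 million we actually use.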

We also have 500 million rows of Apache web log data, telling us who visited which pages using which web browser and operating system. This isn't something we're looking at much yet (beyond a quick analysis of which web browsers we need to do our most thorough testing against), but it will be most useful when we come to look at which of our help resources are most visited.
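
The browser analysis is straightforward once you pull the User-Agent field out of each log line. A minimal sketch, using made-up lines in Apache's standard "combined" format (where the User-Agent is the last quoted field) and a deliberately crude classifier:

```python
import re
from collections import Counter

# Hypothetical sample lines in Apache "combined" log format.
LOG_LINES = [
    '1.2.3.4 - - [13/Jun/2011:10:00:00 +0100] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 6.1) Firefox/4.0"',
    '1.2.3.5 - - [13/Jun/2011:10:00:01 +0100] "GET /help HTTP/1.1" 200 1024 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1)"',
    '1.2.3.6 - - [13/Jun/2011:10:00:02 +0100] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 6.1) Firefox/4.0"',
]

def browser_family(user_agent):
    """Very rough browser classification from a User-Agent string."""
    if "Firefox" in user_agent:
        return "Firefox"
    if "MSIE" in user_agent:
        return "Internet Explorer"
    return "Other"

counts = Counter()
for line in LOG_LINES:
    ua = re.findall(r'"([^"]*)"', line)[-1]  # last quoted field is the User-Agent
    counts[browser_family(ua)] += 1

print(counts.most_common())  # [('Firefox', 2), ('Internet Explorer', 1)]
```

Run over 500 million rows you'd want to stream the file rather than hold it in memory, but the counting logic is the same.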

We've been breaking our data up by academic year (well, academic financial year - the year end is 31st July) and by week, so that we can see the fluctuation in usage. (We're starting to break it up by day as well, but that takes a long time to index.)
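
The bucketing itself is simple date arithmetic. A sketch (assuming, as above, that the academic year runs 1 August to 31 July, and labelling each year by the calendar year it starts in):

```python
from datetime import date

def academic_year(d):
    """Academic (financial) year containing date d, year end 31 July.
    Returned as the calendar year in which that academic year starts."""
    return d.year if d.month >= 8 else d.year - 1

def academic_week(d):
    """Week number within the academic year, counting week 1 from 1 August."""
    start = date(academic_year(d), 8, 1)
    return (d - start).days // 7 + 1

print(academic_year(date(2011, 6, 13)))  # 2010, i.e. the 2010/11 year
print(academic_week(date(2010, 8, 1)))   # 1
```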

Monday, 6 June 2011

The story so far...

Sorry about the quietness here over the past couple of weeks: you must be wondering what we've been up to.
  • We've been extracting the data from Sakai, which was more difficult than it sounds. Sakai stores its events in one massive SQL table, one after the other, so it runs to tens of millions of rows before very long at all. That meant merging tables, fixing corrupt old data, that kind of thing. Anyway, all done now.
  • We're investigating tools to help us analyse the data. Pentaho looks very promising.
But all this is just detail (albeit time-consuming, irritating detail) around the core issue of what data have we got and what can we do with it. To that end we've had a few internal workshops, sent out a few emails, bent some ears, and so on.

Though none of this should be treated as doctrine, and we're still definitely open to ideas, we thought it was time to do some initial data investigations, now that we have it. The key structuring concept for me is:

Who will be interested in our data, and what would they like to know?
Some easy-to-imagine, though not entirely comprehensive, scenarios are these.
  • If someone else were running the VLE, what would we want to know about it?
  • If we could get secret, spy-style access to our deadliest rival institution (identity an exercise for the reader) what would we want to find out to make our VLE more awe-inspiring than theirs?
  • If a charismatic leader were to rouse academics or students to come to our door bearing pitchforks and burning torches, demanding VLE data, what would be the rhetoric -- what would they be demanding?
If we bear these (and similar) questions in mind when we are steering, we shouldn't go far wrong. Let's not get caught producing a series of odd, disconnected charts; they need to inspire thought and change. We need charts, data and stats that connect with the machinery of change.

In terms of the data, what we have is:
who does what
So to do a meaningful analysis we have two axes: Who and What. While we'll give away as much raw data as possible, we need to provide supporting mappings. Who is dps10? What is site 85? We also need to make sure, when we anonymise, that we don't lose the aspects that enable external people to ask questions.
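
One standard way to square those two requirements is keyed pseudonymisation: the same real ID always maps to the same pseudonym, so external analysts can still link a user's activity across events, but the real ID can't be recovered without the key. A minimal sketch (the key and the `u`-prefix scheme are made up for the example):

```python
import hashlib
import hmac

SECRET = b"keep-this-key-private"  # hypothetical secret, never published

def pseudonymise(user_id):
    """Stable pseudonym for a real user ID. Same input -> same output,
    so activity stays linkable, but reversing it needs the secret key."""
    digest = hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()
    return "u" + digest[:8]

# The same user always gets the same pseudonym...
assert pseudonymise("dps10") == pseudonymise("dps10")
# ...and different users get different ones.
assert pseudonymise("dps10") != pseudonymise("abc42")
```

The same trick works for site IDs; the supporting mappings (role, department, site type) then attach to the pseudonyms rather than the real identifiers.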

We're working out how we should take a first stab at Who and What, and are looking at finding sources. I imagine that when we've done this first round of analysis we'll discover the world doesn't divide up how we imagine. That seems to be the near-universal experience of user-experience analysis; certainly we learnt in our JISC Academic Networking project that the world of networking isn't divided up quite the way we imagined. As we discover the same from the activity data, we will iterate, trying again and again.

It might even be worth applying Bayesian clustering or entropy-based tree building to see how a machine would cluster behaviour. All very exciting (to me, anyway!). See pages 15-21 of this PowerPoint by Allan Neymark at SJSU for all of this simply explained in terms of Simpsons characters.
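
The core idea behind the entropy-based approach is easy to sketch: a candidate grouping of users is good when each group is dominated by one kind of behaviour, i.e. when the entropy within each group is low. A toy illustration (the event labels are made up; this is the entropy measure, not a full tree-building algorithm):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels.
    Low entropy within a cluster means the grouping captures real structure."""
    counts = Counter(labels)
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in counts.values())

# Hypothetical example: which VLE tool each of a user's events came from.
print(entropy(["forum", "forum", "forum", "forum"]))  # 0.0 - a pure cluster
print(entropy(["forum", "quiz", "files", "chat"]))    # 2.0 - maximally mixed
```

A tree builder would then repeatedly pick the split that most reduces this quantity; Bayesian clustering optimises a different criterion but pursues the same goal of internally coherent groups.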

Exciting times. At the same time, extremely tedious for the guys doing the database extraction and normalisation. Personally, I seem to have escaped that bit for this project. Phew!