Monday 6 June 2011

The story so far...

Sorry about the quietness here over the past couple of weeks: you must be wondering what we've been up to.
  • We've been extracting the data from Sakai, which was more difficult than it sounds. Sakai stores its events in one massive SQL table, appending row after row, so it runs to tens of millions of rows before very long at all. Merging tables, fixing corrupt old data, that kind of thing (there's a rough sketch of the batched extraction just after this list). Anyway, all done now.
  • We're investigating tools to help us analyse the data. Pentaho looks very promising.
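
To give a feel for what "extracting the data" involves, here is a rough Python sketch of the batched approach: walk the event table in key-ordered chunks rather than running one giant SELECT. The table and column names (SAKAI_EVENT, EVENT_ID and so on) are assumptions based on a typical Sakai schema, and sqlite3 simply stands in for whichever DB-API driver your database needs; this is an illustration, not our actual extraction script.

    # Hypothetical sketch: pull a Sakai-style event table out in EVENT_ID-keyed
    # batches, so tens of millions of rows never have to sit in memory at once.
    # Table/column names are assumptions; adjust for your own Sakai instance.
    import csv
    import sqlite3  # stand-in for any DB-API 2.0 driver (MySQL, Oracle, ...)

    BATCH = 50000  # rows per chunk; tune to taste

    def extract_events(conn, out_path="sakai_events.csv"):
        """Walk the event table in EVENT_ID order, writing each chunk to CSV."""
        last_id = 0
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["event_id", "event_date", "event", "ref", "session_id"])
            while True:
                rows = conn.execute(
                    "SELECT EVENT_ID, EVENT_DATE, EVENT, REF, SESSION_ID "
                    "FROM SAKAI_EVENT WHERE EVENT_ID > ? "
                    "ORDER BY EVENT_ID LIMIT ?",  # placeholder style varies by driver
                    (last_id, BATCH),
                ).fetchall()
                if not rows:
                    break
                writer.writerows(rows)
                last_id = rows[-1][0]  # resume after the last key we saw

    # usage (sqlite shown purely as an example):
    # extract_events(sqlite3.connect("sakai_copy.db"))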
But all this is just detail (albeit time-consuming, irritating detail) around the core issue: what data have we got, and what can we do with it? To that end we've had a few internal workshops, sent out a few emails, bent some ears, and so on.

Though none of this should be treated as doctrine, and we're still definitely open to ideas, we thought it was time to do some initial investigations now that we have the data. The key structuring concept for me is:

Who will be interested in our data, and what would they like to know?
Some easy-to-imagine, though not entirely comprehensive, hypothetical situations are these.
  • If someone else were running the VLE, what would we want to know about it?
  • If we could get secret, spy-style access to our deadliest rival institution (identity left as an exercise for the reader), what would we want to find out to make our VLE more awe-inspiring than theirs?
  • If a charismatic leader were to rouse academics or students to come to our door bearing pitchforks and burning torches, demanding VLE data, what would be the rhetoric -- what would they be demanding?
If we bear these (and similar) questions in mind when we are steering, we shouldn't go far wrong. Let's not get caught producing a series of odd, disconnected charts; they need to inspire thought and change. We need charts, data and stats that connect with the machinery of change.

In terms of the data, what we have is:
who does what
So to do a meaningful analysis we have two axes: Who and What. While we'll give away as much raw data as possible, we need to provide supporting mappings. Who is dps10? What is site 85? We also need to make sure, when we anonymise, that we don't lose the aspects that enable external people to ask questions.
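
To make that concrete, here's a minimal sketch of the kind of consistent pseudonymisation we have in mind (the key, function names and token format are all invented for illustration): the same real ID always maps to the same opaque token, so the "who does what" patterns survive anonymisation, while the keyed lookup stays with us and never ships with the published data.

    # Hypothetical sketch: deterministic, keyed pseudonyms so identities are
    # hidden but a given person or site is still the *same* person or site
    # across every row of the released data.
    import hashlib
    import hmac

    SECRET_KEY = b"kept-well-away-from-the-published-data"  # placeholder secret

    def pseudonym(real_id, prefix):
        """Keyed hash, so a given ID gets the same token in every extract."""
        digest = hmac.new(SECRET_KEY, real_id.encode(), hashlib.sha256).hexdigest()
        return prefix + "_" + digest[:10]

    # The examples from the paragraph above:
    print(pseudonym("dps10", "user"))   # something like user_xxxxxxxxxx (stable but opaque)
    print(pseudonym("site85", "site"))  # something like site_xxxxxxxxxx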

We're working out how we should take a first stab at Who and What, and are looking for sources. I imagine that when we've done this first round of analysis we'll discover the world doesn't divide up how we imagine. That seems to be the near-universal experience of user experience analysis; certainly we learnt in our JISC Academic Networking project that the world of networking isn't divided up in quite the way we imagined. As we discover this from the activity data, we'll iterate around, trying again and again.

It might even be worth applying Bayesian Clustering or Entropy-Based Tree Building to see how a machine would cluster behaviour. All very exciting (to me, anyway!). Pages 15-21 of this powerpoint by Allan Neymark at SJSU explain all this simply in terms of Simpsons characters.
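
As a toy illustration of what "a machine clustering behaviour" might look like, here's a short sketch using scikit-learn's BayesianGaussianMixture as a stand-in for the Bayesian clustering idea; the event types and counts are entirely made up.

    # Toy sketch: cluster users by the mix of event types they generate.
    # BayesianGaussianMixture stands in for the Bayesian clustering /
    # entropy-based approaches mentioned above; the counts are invented.
    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    # Rows are users; columns are counts of three made-up event types,
    # e.g. (content.read, asn.submit, forum.post).
    activity = np.array([
        [120,  2,  1],   # mostly reads content
        [ 90,  1,  0],
        [ 30, 25, 40],   # lives in the forums
        [ 25, 30, 35],
        [ 10, 60,  2],   # mostly submits assignments
        [ 15, 55,  3],
    ], dtype=float)

    # Normalise to proportions so sheer volume doesn't swamp the "shape" of behaviour.
    profiles = activity / activity.sum(axis=1, keepdims=True)

    gmm = BayesianGaussianMixture(n_components=3, covariance_type="diag",
                                  random_state=0).fit(profiles)
    print(gmm.predict(profiles))  # a cluster label per user, e.g. [0 0 1 1 2 2]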

Exciting times. At the same time, extremely tedious for the guys doing the database extraction and normalisation. Personally, I seem to have escaped that bit for this project. Phew!
