Thursday 21 April 2011

Data visualisation

We're looking at a couple of tools here: BIRT and Pentaho, both of which offer free business intelligence and visualisation packages. We're hoping that they can offer us more than you can get from Excel pivot tables, and be easier to set up than a bespoke solution involving some PHP and graphing software.

This isn't as straightforward as you might imagine. Raad's been working on setting up a Pentaho instance for most of the last week, and hasn't yet managed to get a significant improvement on what Excel provides, despite considerable effort. Pentaho requires various modules to be installed, but its documentation is rather incomplete, especially the documentation for creating aggregate tables. Aggregate tables are essential when dealing with large volumes of data - we have over 10m rows of Sakai event data, so without aggregate tables, every time we try to look at a large section of the dataset, we run out of resources. Thus far, our suggestion would be that if you want business intelligence software, you may be better off paying for a commercial product.
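
To illustrate why aggregate tables matter, here's a minimal sketch - not our actual setup. The table and column names are invented, and SQLite stands in for whatever database the reporting tool sits on. Rather than scanning every raw event row for each chart, you pre-compute counts at a coarser grain and point the reports at those:

    import sqlite3

    # Illustrative schema only: assume raw events live in a table
    # sakai_event(event_date, event_type, site_id). These names are
    # invented for this sketch, not our production schema.
    conn = sqlite3.connect("events.db")

    # Build the aggregate once: one row per (day, event type, site)
    # instead of one row per raw event.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS event_daily_agg AS
        SELECT event_date, event_type, site_id, COUNT(*) AS n_events
        FROM sakai_event
        GROUP BY event_date, event_type, site_id
    """)
    conn.commit()

    # Reports now query the much smaller aggregate table.
    for event_type, total in conn.execute("""
            SELECT event_type, SUM(n_events) FROM event_daily_agg
            WHERE event_date BETWEEN '2010-10-01' AND '2010-12-31'
            GROUP BY event_type ORDER BY 2 DESC
            """):
        print(total, event_type)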

Saturday 2 April 2011

The Data

We’ve just started work on our JISC project on Exposing VLE Activity Data. First, we’ve had to get our data (first, catch your rabbit...), covering the period from when we started using CamTools (our current institutional VLE) to 31 December 2010. This involved retrieving archived data, which didn’t go as smoothly as we’d hoped. We had to do some restoration of corrupted data, and we’re missing about two weeks of data as well. This just illustrates the problems of dealing with data that’s collected but not looked at very often.


The kinds of data we’ve collected are all the events from the Sakai event table. Sakai is the underlying software that powers our VLE (Virtual Learning Environment). Its event table contains details of software ‘events’ - things that have happened. Typical events are ‘content.read’ (someone’s read some content), ‘content.update’ (someone’s updated some content) or ‘search’ (probably easy to work out!). We’ve also collected data about who’s visited which web pages inside Sakai, when they did it, and which web browser they were using at the time - more typical access log data for web pages.
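
As a flavour of what a first pass over this data looks like, here's a hedged sketch. It assumes the event table has been exported to a CSV file with an event_type column - the file name and column name are assumptions, not necessarily how we actually do it:

    import csv
    from collections import Counter

    # Count how often each event type occurs in the exported log.
    counts = Counter()
    with open("sakai_events.csv", newline="") as f:
        for row in csv.DictReader(f):
            counts[row["event_type"]] += 1

    # Expect things like content.read, content.update, search, ...
    for event_type, n in counts.most_common(10):
        print(f"{n:>10}  {event_type}")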

Now that we’ve got all this data from our logs, we need to make sure it’s in a format we can process, so we can answer some of our questions about how the VLE is used. However, we may also want to collect other, ‘softer’ data, such as what each area of the VLE is used for (teaching, research, admin, or something else), and why it’s used. This will require more human input, whether by examining individual sub-sites of the VLE, or through questionnaires or interviews.
General observations on the limitations of the data

As mentioned above, we mostly can’t determine what a site is used for, other than by human inspection. The exception to this is sites designed to support lecture or degree courses, for which we maintain a list. So while we may be able to track usage patterns for an individual site, we can’t easily do so for a set of related sites, unless we define the relation manually.
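
Where we do hold the relation - the course-support sites on our list - rolling usage up to the course level is simple. A sketch, with invented identifiers and counts:

    # Manually maintained mapping from a course to its sites, plus
    # per-site event counts. All names and numbers are invented.
    course_sites = {
        "example-lecture-course": ["site-001", "site-002"],
    }
    site_event_counts = {"site-001": 5230, "site-002": 4810, "site-099": 77}

    course_usage = {
        course: sum(site_event_counts.get(s, 0) for s in sites)
        for course, sites in course_sites.items()
    }
    print(course_usage)  # {'example-lecture-course': 10040}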

We’ve observed sites being used for teaching, research, administration, social activities and testing (using sites as a sandbox to try things out before updating a site that’s already being used by students or researchers). More specifically, we’ve seen sites used for teaching lecture courses, whole degree programmes, small-group teaching, and language learning. We’ve seen sites used to organise research projects, from PhD theses up to large international collaborations. CamTools has been used to administer part of the college applications process, for university and college societies, and to organise conferences. But unless a human looks at a site, we’ve got no way of deducing this from the data (we don’t capture extensive metadata on site creation).

So, how do we categorise a site?
Currently, sites which are associated with a specific lecture course or a degree course are tagged with metadata on creation. This is a relatively new procedure, so only sites active from October 2010 are tagged. However, signalling that a site is no longer in active use for teaching (because a new site has been created for the new academic year, for example) is harder. The case just mentioned can be handled by editing the metadata, because we will know that there is a new site; but if a lecture course has been discontinued, we currently have no way to record that.
For other sites, we have to rely on manual inspection. What is the site called? How many people are in it? What documents are stored there? Which tools does it use? From this information, we can usually deduce what the site is used for.
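
Some of these inspection questions could be roughed out in code as a first filter, though a human still makes the final call. A sketch - the field names, keywords and thresholds here are all invented for illustration:

    # Very rough first-pass triage of a site's purpose, based on the
    # questions above. Everything here is an illustrative guess.
    def guess_site_purpose(site):
        name = site["name"].lower()
        if "test" in name or "sandbox" in name:
            return "testing"
        if any(kw in name for kw in ("lecture", "course")):
            return "teaching"
        if site["member_count"] > 50 and "assignments" in site["tools"]:
            return "teaching"
        if site["member_count"] <= 15 and "wiki" in site["tools"]:
            return "research (possibly)"
        return "unknown - needs human inspection"

    example = {"name": "Example Lecture Course 2010",
               "member_count": 120,
               "tools": ["resources", "assignments", "forum"]}
    print(guess_site_purpose(example))  # teaching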

Does a site’s purpose change?
There are two aspects to this question: does a site, for example a small-group teaching site, turn into something else - perhaps a research site, or a site for that teacher’s lecture course? Or does someone set up a site expecting it to be used in one way (putting in certain tools), and then find that it’s used in another?
The former is difficult to determine. All we can do is examine a site and say that it was being used in a particular way at a particular time, unless we can identify particular ‘signatures’ which denote the type of a site (at the moment, we don’t know whether sites would have distinctive signatures). The latter may be more amenable to analysis, in two ways. One, we can look at tool usage: tool X was added in 2008 but was never used; tool Y was added in May 2010 and has some hits. Two, we can conduct interviews with site owners, and ask them what they thought they were going to do, and what they actually found. (This does have the problem that people’s memories may be unreliable, but we can check what they say against the data we hold about their site.)
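
The first check - tools added but never used - is mechanical enough to sketch. This assumes we can reduce the logs to (site, tool, timestamp, action) records; the action names are invented stand-ins, not real Sakai event names:

    # Flag tools that were added to a site but never used afterwards.
    # Event records and action names are illustrative only.
    def unused_tools(events):
        added = {}
        used = set()
        # Process in time order, so a 'use' only counts if the tool
        # had already been added.
        for site, tool, ts, action in sorted(events, key=lambda e: e[2]):
            if action == "tool.add":
                added[(site, tool)] = ts
            elif action == "tool.use" and (site, tool) in added:
                used.add((site, tool))
        return [(site, tool, ts) for (site, tool), ts in added.items()
                if (site, tool) not in used]

    events = [
        ("site-001", "chat", "2008-02-01", "tool.add"),
        ("site-001", "wiki", "2010-05-12", "tool.add"),
        ("site-001", "wiki", "2010-05-14", "tool.use"),
    ]
    print(unused_tools(events))  # [('site-001', 'chat', '2008-02-01')]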

These kinds of approaches allow us to augment the automatically collected data from the past four years of running the VLE.