Monday 13 June 2011

More about our data

A lot of our time so far has gone into marshalling our data (as explained in our Data blog post) and getting it into a workable form, properly indexed and so forth. As a result, it should now be comparatively quick to process the data into chart form: no more setting a job going and leaving it overnight!

The Sakai event table, which holds most of our data, has given us 240 million rows covering a little over the last 5 years. Only 70 million of those turned out to be useful; the rest were mostly things like the 'presence' event, which we turned off in 2007 because it was causing unacceptable server load. Our VLE is made up of a series of sites, and the presence event wrote a row to the database every 5 seconds for every logged-in user, recording which site they were currently looking at, so that each site could display an up-to-date list of who was in it. (Facebook's indication of who's around in Chat at the moment probably does something similar.) So you can guess that the presence event generated an awful lot of data until we turned it off.
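To give a feel for why a row every 5 seconds per logged-in user adds up so quickly, here is a rough back-of-envelope sketch. The function name and the user count in the usage note are made up for illustration; only the 5-second interval comes from our setup.

```python
def presence_rows(concurrent_users: int, duration_seconds: int,
                  interval_seconds: int = 5) -> int:
    """Estimate rows written by a presence event.

    Assumes every logged-in user writes exactly one row per interval
    for the whole duration, which is a simplification.
    """
    rows_per_user = duration_seconds // interval_seconds
    return concurrent_users * rows_per_user


# Hypothetical figure: 1,000 users logged in for one hour.
rows_in_an_hour = presence_rows(1000, 60 * 60)
```

One user generates 720 rows an hour at this rate, so the hypothetical 1,000 users above would write 720,000 rows in a single hour.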

We also have 500 million rows of Apache web log data, telling us who visited which pages using which web browser and operating system. We're not looking at this much at the moment (beyond a quick analysis of which web browsers we need to do our most thorough testing against), but it will be most useful when we come to look at which of our help resources have been most visited.
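A quick browser tally of this kind could look something like the sketch below. It is not our actual script: it assumes the logs are in Apache's standard "combined" format, and the marker list and function names are invented for illustration. The substring matching is deliberately crude (Chrome is checked before Safari because Chrome's user agent string also mentions Safari), and lines that don't match the pattern are simply skipped.

```python
import re
from collections import Counter

# Apache "combined" log format (an assumption about how the logs look):
# host ident user [time] "request" status size "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Order matters: Chrome user agents also contain "Safari/".
BROWSER_MARKERS = [
    ("Chrome", "Chrome/"),
    ("Firefox", "Firefox/"),
    ("Internet Explorer", "MSIE "),
    ("Safari", "Safari/"),
]


def browser_family(agent: str) -> str:
    """Map a user agent string to a rough browser family."""
    for name, marker in BROWSER_MARKERS:
        if marker in agent:
            return name
    return "Other"


def count_browsers(lines) -> Counter:
    """Count browser families over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:
            continue  # not in combined format; ignore it
        counts[browser_family(match.group("agent"))] += 1
    return counts
```

Because `count_browsers` takes any iterable of lines, an open file can be passed straight in, so a large log never has to be loaded into memory in one go.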

We've been breaking our data sets up by academic year (well, academic financial year - the year end is 31st July) and by week, so that we can see the fluctuation in usage. (We're starting to break them up by day as well, but this takes a long time to index.)
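As an illustration of the bucketing, here is a minimal sketch that assigns a date to an academic year and a week within it. The 31st July year end is as described above, so each year is taken to start on 1st August. Counting weeks in 7-day blocks from 1st August is an assumption made for this example (weeks could just as well be aligned to Mondays), and the function names are hypothetical.

```python
from datetime import date


def academic_year_start(d: date) -> int:
    """Calendar year in which the academic year containing d begins.

    The academic year runs from 1 August to 31 July.
    """
    return d.year if d.month >= 8 else d.year - 1


def academic_week(d: date) -> int:
    """1-based week number, counted in 7-day blocks from 1 August.

    This week definition is an assumption for the example.
    """
    start = date(academic_year_start(d), 8, 1)
    return (d - start).days // 7 + 1


def bucket(d: date) -> tuple:
    """Return (academic year start, week number) for grouping events."""
    return (academic_year_start(d), academic_week(d))
```

With this scheme, 31st July 2011 falls in the academic year starting in 2010, while 1st August 2011 is week 1 of the year starting in 2011.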
