The blog posts contain many small items. We could list them here, but they would distract from the big item: the dataset itself (approx. 4 GB).
One of the fundamental difficulties we encountered was that statisticians were difficult to recruit. We understood how to analyse data to discover whether a particular hypothesis was supported by the data, but we had also anticipated being able to detect unexpected usage patterns and to determine what had driven those patterns. We managed to establish that there is reason to suppose suitable statistical techniques exist, but we did not find suitably experienced statisticians with an interest in our data within the timescale of the project. Incidentally, the statistical field we identified was credit card fraud detection, in which the interest is the inverse of ours: fraud analysts identify patterns in spending behaviour in order to spot outliers, whereas we are interested in analysing the dominant pattern. I think it would be good to encourage data mining centres to explore this further, or to fund an investigation of this issue over a longer time frame.
What can other institutions do to benefit from your work?
We set out to focus on analysing data, but our most useful lessons came from our success in releasing open data:
- Put good privacy terms on every site that collects personal data and ensure possible publication of anonymised data is anticipated in those terms
- Take a well prepared anonymisation case to legal services
- Choose the licence first and determine what data can be released under that licence
- Don’t forget your dataset may have IPR in its format
On analysing data:
- Looking for patterns you suspect and confirming or disproving them is much easier than asking what information the data contains
- The visualisation technique is the servant of the story the data tells. In other words, visualisation tools don’t reveal stories, they tell stories
- Visualisations tend to be dominated by data collection artefacts for many early iterations of visualising (e.g. most of New York is in New Jersey, so NY population can look small)
- Complex data is hard to analyse; allow time and take a phased approach
I would say we had a smooth path to releasing data compared to accounts from other projects. I attribute this to preparation. That is, we went to the legal office with answers, asking for confirmation that we had done enough, rather than with questions about what we should do. We were also lucky that the privacy statement we put in place 5 years ago was adequate for this purpose. Getting privacy statements right is important, and we would have been better placed if we had collected acknowledgement/acceptance of terms.
More details on how you've addressed anonymisation
The principal technique for anonymisation was hashing of sensitive fields using a ‘salted’ SHA-1 algorithm. SHA-1 turns a data item such as a CRSID into a string such as ‘ef479946c02076c9c25b34a38a80cf22d0ecb9cb’. On its own, this technique would be vulnerable to a ‘dictionary attack’: someone with a list of CRSIDs could run them through the published SHA-1 algorithm and match the resulting strings to our dataset. To prevent this we also fed a secret key into the algorithm and destroyed the key afterwards. This ensured that nobody could recreate the hashed strings and decode the data.
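The keyed-hashing step above can be sketched as follows. This is only an illustration of the idea, not the project’s actual code (which was written in Perl); the field value and key handling here are assumptions:

```python
import hashlib
import secrets

# Generate a random secret key for this anonymisation run. In the real
# process the key was destroyed afterwards, so the hashes cannot be
# recreated even by someone who knows the list of CRSIDs.
secret_key = secrets.token_bytes(32)

def anonymise(value: str) -> str:
    """Return a salted SHA-1 hex digest of a sensitive field value."""
    return hashlib.sha1(secret_key + value.encode("utf-8")).hexdigest()

# Example with a made-up CRSID: the same input always maps to the same
# 40-character hash within a run, so joins across records still work.
hashed = anonymise("abc123")
```

Note that prefixing the key like this is a simplification; HMAC-SHA1 (Python’s `hmac` module) is the more standard construction for keyed hashing, if that matters for your use.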
What algorithms or processing techniques have you used?
- Salted SHA-1
- Perl code for ‘mixing in’ reference data to make analysis easier (e.g. a student’s identity may not be revealed, but their year of study can be)
- Pivot Tables, Graphing tools and Gephi for analysis
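The ‘mixing in’ step in the list above can be sketched as a join performed before the secret key is destroyed: the raw identifier is replaced by its hash, while attributes that are safe to release (such as year of study) are attached from a reference table. A minimal Python sketch, with made-up field names and data (the project’s real code was Perl):

```python
import hashlib

# Assumed key; in the real process it was destroyed after the run.
SECRET = b"example-key"

def anonymise(value: str) -> str:
    """Salted SHA-1 hash of a sensitive field, as described above."""
    return hashlib.sha1(SECRET + value.encode("utf-8")).hexdigest()

# Hypothetical reference table mapping raw IDs to releasable attributes.
reference = {"abc123": {"year_of_study": 2}}

def enrich(record: dict) -> dict:
    """Replace the raw ID with its hash and mix in safe reference data."""
    raw_id = record.pop("crsid")
    return {"id_hash": anonymise(raw_id),
            **reference.get(raw_id, {}),
            **record}

# A published row keeps the event data and year of study, but only a
# hashed identifier.
row = enrich({"crsid": "abc123", "page": "/timetable"})
```

The point of doing the join before key destruction is that afterwards the reference table can no longer be linked back to the published hashes.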