We are now in a position to make an initial release of our VLE logging data. This blog details the process we went through to get to this stage and the decisions that we made along the way.
We were helped in this process by David Evans from Cambridge Computer Lab (http://www.cl.cam.ac.uk/~de239/) who is an expert in data privacy issues. We also considered the Privacy Policy for the site and consulted with our Legal Services team on releasing anonymous logging data under an Open Licence.
There are 3 files that we will be releasing:
Sakai Event File
– Individual events such as a site visit which occur within a Session
Sakai Session File
– Session details which include browser information and session times
Sakai Site File
– Details of the VLE Site being accessed
A tarball file can be obtained from here - however this is a 4GB file that will expand to over 12GB when uncompressed.
Our first step was to provide David Evans with the database schema for these files so he could consider, based on discussions with us, which fields might contain sensitive data. We then discussed the implications of making these fields public and what information they might expose. This was of course a balancing act between releasing data from which others could draw value and not giving away details of what a known individual might have used the VLE system for.
We decided on a first data release which would use a cautious approach to data privacy. Should other institutions find this data of interest, we can work with them to reveal more information in the area of interest to them in a manner that does not compromise individuals.
This cautious approach meant hiding any data that would identify an individual user, site names and anything that might link a session back to a particular user. We settled on a hashing algorithm to use to obscure any such items of data yielding a string that can be determined uniquely from the value; we also used a salt to prevent inversion of the hash through exhaustive search of short inputs.
At this stage, we also looked at some sample data to reinforce our decisions.
The decision on what to hash was straightforward in many cases such as concealing any field with Site, User Name, URL or Content in it. Some things were less clear cut. For instance, the skin around a site could be used to identify a Department. The Session Ids that we looked at appeared to be numeric and we decided there was little risk in leaving this in its raw state. However, later testing revealed that, in some instances and points of time, this has included a user identifier so we agreed to hash this. It is worth remembering that the hashing algorithm is consistent so even though the value of the Session Id has been changed, it can still be used to link the Event and Session tables.
The Session Server holds both the Host Name and software module Id. We decided to hash the Host Name, in case this might reveal something about the network's internal structure, but leave the numeric part (software module id) alone as it reveals nothing other than which particular instance of software processed that session. We discovered that the format of this field had changed over time so we needed a mildly complex function to detect and extract from both formats of the Session Server.
The Session User Agent may provide the Browser Name and Version and the Operating System and Version of the computer used for the session. However this is a free-form field which may be changed by the user. There was a danger that this could identify a given user. A visual inspection showed at least one College Name, some company names and some school names within this field which could present a risk. Ideally we would extract and expose data such as the Browser Name but as this is a free-form field this in non-trivial. We therefore took the decision to hash this field.
As a final sanity check, we revisited some sample data from each of the tables once they had been hashed to satisfy ourselves that there was no raw data left that might possibly contain a user identifier.
We were helped in this process by David Evans from Cambridge Computer Lab (http://www.cl.cam.ac.uk/~de239/) who is an expert in data privacy issues. We also considered the Privacy Policy for the site and consulted with our Legal Services team on releasing anonymous logging data under an Open Licence.
There are 3 files that we will be releasing:
Sakai Event File
– Individual events such as a site visit which occur within a Session
(held as a separate file for each Academic Year)
Sakai Session File
– Session details which include browser information and session times
Sakai Site File
– Details of the VLE Site being accessed
A tarball file can be obtained from here - however this is a 4GB file that will expand to over 12GB when uncompressed.
Our first step was to provide David Evans with the database schema for these files so he could consider, based on discussions with us, which fields might contain sensitive data. We then discussed the implications of making these fields public and what information they might expose. This was of course a balancing act between releasing data from which others could draw value and not giving away details of what a known individual might have used the VLE system for.
We decided on a first data release which would use a cautious approach to data privacy. Should other institutions find this data of interest, we can work with them to reveal more information in the area of interest to them in a manner that does not compromise individuals.
This cautious approach meant hiding any data that would identify an individual user, site names and anything that might link a session back to a particular user. We settled on a hashing algorithm to use to obscure any such items of data yielding a string that can be determined uniquely from the value; we also used a salt to prevent inversion of the hash through exhaustive search of short inputs.
At this stage, we also looked at some sample data to reinforce our decisions.
The decision on what to hash was straightforward in many cases such as concealing any field with Site, User Name, URL or Content in it. Some things were less clear cut. For instance, the skin around a site could be used to identify a Department. The Session Ids that we looked at appeared to be numeric and we decided there was little risk in leaving this in its raw state. However, later testing revealed that, in some instances and points of time, this has included a user identifier so we agreed to hash this. It is worth remembering that the hashing algorithm is consistent so even though the value of the Session Id has been changed, it can still be used to link the Event and Session tables.
The Session Server holds both the Host Name and software module Id. We decided to hash the Host Name, in case this might reveal something about the network's internal structure, but leave the numeric part (software module id) alone as it reveals nothing other than which particular instance of software processed that session. We discovered that the format of this field had changed over time so we needed a mildly complex function to detect and extract from both formats of the Session Server.
The Session User Agent may provide the Browser Name and Version and the Operating System and Version of the computer used for the session. However this is a free-form field which may be changed by the user. There was a danger that this could identify a given user. A visual inspection showed at least one College Name, some company names and some school names within this field which could present a risk. Ideally we would extract and expose data such as the Browser Name but as this is a free-form field this in non-trivial. We therefore took the decision to hash this field.
As a final sanity check, we revisited some sample data from each of the tables once they had been hashed to satisfy ourselves that there was no raw data left that might possibly contain a user identifier.
I am William..
ReplyDeleteI just browsing through some blogs and came across yours!
Excellent blog, good to see someone actually uses for quality posts.
Your site kept me on for a few minutes unlike the rest :)
Keep up the good work!Thanks for sharing a important information on datastage