Data-sharing wisdom
(2014 update: Hubway's new data release is better anonymized, yay! This essay refers to the 2012 data release.)
I love Hubway and data, but was it really a good idea to release everyone's trip data? This is past, somewhat-anonymized data, for a contest to visualize how people are using the bike-sharing system. I can history-stalk everyone in Southborough who has a Hubway membership now :-(
[Quick note: zip codes are US-centric and "%" lines are shell-script computations; the rest of this post should be accessible.]
% cat trips.csv | cut -d, -f10,11,12 | sort | uniq -c | grep 01772
5 01772,1956,Male
42 01772,1957,Male
52 01772,1971,Male
58 01772,1974,Male
Each line is the number of trips made by riders matching a (zip code, birth year, gender) triple, filtered to only 01772 (Southborough) people.
Then, say,
% cat trips.csv | grep ,01772,1957,Male
and look up the begin and end Hubway station numbers to find out where they were using Hubway bikes at what times.
I don't know who it is, but I daresay that triple (among Hubway registered users) uniquely identifies a person! Sorry, 55-ish bicyclist dude! I support your bicycling & only want to meet you if we run into each other in a non-creepy way... Also, you are a pretty much randomly chosen one of the one or two thousand people who can, with high probability, be identified this way.
People whose permanent address is closer to the city, luckily, share their zip code, age, and gender with more other people. That's not perfect, but it helps a lot.
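One rough way to look at this in the data is to count how many distinct triples show up in each zip code. This is only a sketch: it assumes, like the commands above, that fields 10-12 of trips.csv are zip code, birth year, and gender, and the function name triples_per_zip is made up for this example. It counts distinct triples, not people, so it can only hint at which zip codes are sparse.

```shell
# Count distinct (zip, birth year, gender) triples per zip code.
# Reads the trips CSV on stdin. Assumption: fields 10-12 are
# zip code, birth year, and gender, as in the commands above.
triples_per_zip() {
  cut -d, -f10,11,12 |   # keep only the triple
    sort -u |            # one line per distinct triple
    cut -d, -f1 |        # keep only the zip code
    sort | uniq -c |     # number of distinct triples per zip code
    sort -n              # sparsest zip codes first
}

# Usage (hypothetical): triples_per_zip < trips.csv | head
```

Zip codes near the top of that list have the fewest distinct triples, so each triple there is more likely to point at a single rider.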
How could this have been mitigated? Age, location, and gender are interesting to analyze. Analysis wouldn't be hurt much by, say, rounding age to multiples of 5 years (thus removing just over 2 bits of identifying information from the 'age' field). I am still kind of dubious that rounding alone would be enough. Any triples that refer to only one (or two? or more?) people could have been removed from the data set or anonymized further. I'm not sure how much information that would lose (I can't know from this data, because it wisely doesn't say how many people correspond to a given triple).
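Here is a minimal sketch of both ideas together, from the point of view of whoever holds the membership list. Everything about the input is an assumption: members.csv is a hypothetical file with one line per member (zip,birth_year,gender, no header), the function name is invented, and k is a threshold the publisher would have to choose. It rounds birth year (rather than age) down to a multiple of 5, which collapses five adjacent years into one value; log2(5) is about 2.32, hence "just over 2 bits". It then drops any generalized triple shared by fewer than k members.

```shell
# Hypothetical input file: one line per member, "zip,birth_year,gender".
# Output: "zip,bucketed_birth_year,gender,member_count" for each group
# that has at least k members; smaller groups are suppressed.
generalize_and_suppress() {
  k=$1      # minimum group size to publish (an assumption to tune)
  file=$2   # path to the hypothetical members file
  awk -F, -v k="$k" '
    {
      bucket = $2 - ($2 % 5)        # round birth year down to a multiple of 5
      n[$1 "," bucket "," $3]++     # members per generalized triple
    }
    END {
      for (t in n)
        if (n[t] >= k)              # keep only groups with at least k members
          print t "," n[t]
    }' "$file" | sort
}

# Usage (hypothetical): generalize_and_suppress 5 members.csv
```

Comparing the output with the full list of triples would show how much information the suppression step loses, which is the part I can't estimate from the trip data alone.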
Blargh.
(Do you know a good introductory blog post that I could link to instead of amateurishly describing this myself? If you don't know already, there are
- people whose physical safety is in danger from ways they can be tracked
- people who face other sorts of risk
- people who like being tracked
- (the majority, including me) people who don't realize all the ways that massive data mining can be used by private individuals against them, by accident and/or intention.
I'd like to think that all these people could use a bike-sharing system without fear. Also, I look forward to the charts and graphs, but with dread at the social cost that may have been incurred to produce them. Maybe everyone will be unhurt. I hope so!)