UPDATE 2012-02-16: raver1975 released a SQL database w/35M Google Profiles as .torrent on The Pirate Bay.
UPDATE 2011-06-10: The central question in the Google discussion is whether mass-aggregation of profile data by unknown third parties is considered acceptable. We should neither exaggerate NOR DENY the possibilities that public profile data offers to adversaries. We should THINK about them. How will YOUR LinkedIn + Facebook + Twitter + Google Profile + (…) make you look when I combine them and subject you to longitudinal study? I seriously doubt that such activities will turn out all good and harmless.
To quote from Tali Sharot’s piece on The Optimism Bias in Time Magazine, June 2011: “The question then is, How can we remain hopeful — benefiting from the fruits of [techno-]optimism — while at the same time guarding ourselves from its pitfalls?” Like her, I believe knowledge is key in that.
====== START OF ORIGINAL BLOGPOST FROM 2011-05-24 ======
This is a follow-up to my previous blogpost on this topic.
In February 2011 it proved trivial to create a database containing ALL ~35,000,000 Google Profiles, without Google throttling, blocking, CAPTCHAing or otherwise hindering the mass-downloading attempts. It took only 1 month to retrieve the data, convert it to SQL using spidermonkey and some custom Javascript code, and import it into a database. The database contains Twitter conversations (also stored in the OZ_initData variable), person names, aliases/nicknames, multiple past educations (institute, study, start/end date), multiple past work experiences (employer, function, start/end date), links to Picasa photo albums, … and, in ~15,000,000 cases, also the username and therefore the @gmail.com address. In summary: 1 month + 1 connection = 1 database containing 35,000,000 Google Profiles.
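To put "1 month + 1 connection" in perspective, the average retrieval rate can be worked out with simple arithmetic. The sketch below assumes a 30-day month; since the month also covered converting and importing the data, the actual download rate was presumably somewhat higher than this average.

```python
# Back-of-the-envelope rate implied by "35,000,000 profiles in 1 month".
PROFILES = 35_000_000                   # figure from the post
SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # assumption: a 30-day month


def average_rate(profiles: int, seconds: int) -> float:
    """Average number of profiles handled per second."""
    return profiles / seconds


rate = average_rate(PROFILES, SECONDS_PER_MONTH)
print(f"{rate:.1f} profiles per second on average")
```

Under that assumption the average comes to roughly 13.5 profiles per second, sustained for a whole month over a single connection.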
My activities are directed at feeding debate about privacy (not to create distrust, but to achieve realistic trust) and about the meaning of “informed consent”, which, when signing up for online services like Google Profile, amounts to checking a box. How can a user possibly be considered “informed” when they’re not made aware of 1) the fact that it does not seem to bother Google that profiles can be mass-downloaded (Dutch) and 2) the misuse value (or hopefully the lack of it) of their social data to criminals and certain types of marketeers? Does this enable mass spear phishing attacks and other types of social engineering, or is that risk negligible, e.g. because criminals use other methods of attack and/or have other, better sources of personal data? Absence of ANY protection against mass-downloading is the status quo at Google Profile. Strictly speaking I did not even violate Google policy in retrieving the profiles, because http://www.google.com/robots.txt explicitly ALLOWS indexing of Google Profiles and my code is part of a personal experimental search engine project. At the time of this writing, the robots.txt file contains:
Allow: /profiles
Allow: /s2/profiles
Allow: /s2/photos
Allow: /s2/static
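For readers who want to see how a crawler interprets such rules, here is a minimal sketch using Python's standard urllib.robotparser module. The four Allow lines are the ones quoted above. The "User-agent: *" header, the "Disallow: /s2/" line, the agent name "ExampleBot" and the example URLs are hypothetical additions, made only so that the snippet parses and has something to reject. It is not a copy of Google's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The four Allow lines are quoted from the post. The User-agent header and
# the final Disallow line are HYPOTHETICAL, added purely for illustration.
ROBOTS_TXT = """\
User-agent: *
Allow: /profiles
Allow: /s2/profiles
Allow: /s2/photos
Allow: /s2/static
Disallow: /s2/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string, without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
AGENT = "ExampleBot"  # hypothetical crawler name

# A profile URL matches "Allow: /profiles", so fetching is permitted.
allowed = parser.can_fetch(AGENT, "http://www.google.com/profiles/example")
# This path only matches the hypothetical "Disallow: /s2/" rule.
blocked = parser.can_fetch(AGENT, "http://www.google.com/s2/other")

print(allowed, blocked)
```

Note that robots.txt is purely advisory: it tells well-behaved crawlers what they may index, but it does nothing to throttle or block anyone, which is exactly the point made above.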
I’m curious whether there are any implications to the fact that it is completely trivial for a single individual to do this. Possibly there aren’t, and that would be worth knowing too. I’m also curious whether Google will take measures to protect against mass downloading of profile data, or whether this is a non-issue for them as well. In my opinion, the misuse value of personal data on social networks ought to be made explicit before people publish that data under a false perception of ‘informed’ consent.
My activities were performed as part of my research on anonymity/privacy at the University of Amsterdam. I’m writing a research paper about the above. Repeating from my previous post: this blog runs on Google Blogger. I sincerely hope my account “mrkoot” and blog.cyberwar.nl will not be blocked or banned: I did not publish the database and did not violate any Google policy.