Tuesday, May 24, 2011

1 Database Containing 35.000.000 Google Profiles. Implications?

UPDATE 2012-02-16: raver1975 released a SQL database with 35M Google Profiles as a .torrent on The Pirate Bay.

UPDATE 2011-06-10: The central question in the Google discussion is whether mass aggregation of profile data by unknown third parties is considered acceptable. We should neither exaggerate NOR DENY the possibilities that public profile data offers to adversaries. We should THINK about them. How will YOUR LinkedIn + Facebook + Twitter + Google Profile + (...) make you look when I combine them and subject you to longitudinal study? I seriously doubt that such activities will turn out all good and harmless.

To quote from Tali Sharot's piece on The Optimism Bias in Time Magazine, June 2011: "The question then is, How can we remain hopeful -- benefiting from the fruits of [techno-]optimism -- while at the same time guarding ourselves from its pitfalls?" Like her, I too believe knowledge is key in that.

====== START OF ORIGINAL BLOGPOST FROM 2011-05-24 ======
This is a follow-up to my previous blogpost on this topic.

In February 2011 it proved trivial to create a database containing ALL ~35.000.000 Google Profiles without Google throttling, blocking, CAPTCHAing or otherwise hindering mass-downloading attempts. It took only 1 month to retrieve the data, convert it to SQL using spidermonkey and some custom Javascript code, and import it into a database. The database contains Twitter conversations (also stored in the OZ_initData variable), person names, aliases/nicknames, multiple past educations (institute, study, start/end date), multiple past work experiences (employer, function, start/end date), links to Picasa photo albums, ... -- and in ~15.000.000 cases, also the username and therefore the @gmail.com address. In summary: 1 month + 1 connection = 1 database containing 35.000.000 Google Profiles.
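To put "1 month + 1 connection" in perspective, here is a rough back-of-the-envelope estimate of the average request rate. It assumes a 30-day month and exactly one request per profile; neither figure is stated above, so treat the result as an order-of-magnitude estimate only.

```python
# Back-of-the-envelope estimate of the average download rate.
# Assumptions (not from the post): a 30-day month, one request per profile.
PROFILES = 35_000_000
DAYS = 30

seconds = DAYS * 24 * 60 * 60          # 2,592,000 seconds in 30 days
rate_per_second = PROFILES / seconds   # average profiles fetched per second
rate_per_day = PROFILES / DAYS         # average profiles fetched per day

print(f"{rate_per_second:.1f} profiles/second on average")
print(f"{rate_per_day:,.0f} profiles/day on average")
```

Under these assumptions the average works out to about 13.5 profiles per second, sustained for a month.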

My activities are aimed at stirring up debate about privacy -- NOT to create DISTRUST but to achieve REALISTIC trust -- and about the meaning of "informed consent", which, when signing up for online services like Google Profile, amounts to checking a box. How can a user possibly be considered "informed" when they're not made aware 1) of the fact that it does not seem to bother Google that profiles can be mass-downloaded (Dutch) and 2) of the misuse value (or, hopefully, the lack of it) of their social data to criminals and certain types of marketeers? Does this enable mass spear phishing attacks and other types of social engineering, or is that risk negligible, e.g. because criminals use other methods of attack and/or have other, better sources of personal data? Absence of ANY protection against mass-downloading is the status quo at Google Profile. Strictly speaking I did not even violate Google policy in retrieving the profiles, because http://www.google.com/robots.txt explicitly ALLOWS indexing of Google Profiles and my code is part of a personal experimental search engine project. At the time of this writing, the robots.txt file contains:

Allow: /profiles
Allow: /s2/profiles
Allow: /s2/photos
Allow: /s2/static
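As an illustration of how a crawler would interpret these rules, the sketch below feeds the four Allow lines to Python's standard urllib.robotparser. The "User-agent: *" line, the "Disallow: /private-example" line and the example URLs are assumptions added purely so the parser has a complete record and a contrasting case; they are not quoted from Google's file.

```python
from urllib.robotparser import RobotFileParser

# The four Allow lines are the ones quoted above.
# ASSUMPTIONS for illustration only: the "User-agent: *" header and the
# hypothetical "Disallow: /private-example" rule used as a contrast.
rules = """\
User-agent: *
Allow: /profiles
Allow: /s2/profiles
Allow: /s2/photos
Allow: /s2/static
Disallow: /private-example
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # parse in-memory lines; no network access needed

# Hypothetical URLs: one under an allowed prefix, one under the made-up Disallow.
profile_allowed = parser.can_fetch("*", "http://www.google.com/profiles/example")
other_allowed = parser.can_fetch("*", "http://www.google.com/private-example/page")

print("profile URL allowed:", profile_allowed)
print("contrast URL allowed:", other_allowed)
```

Note that robots.txt only expresses the site operator's wishes to well-behaved crawlers; it is not an access control mechanism.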

I'm curious whether there are any implications to the fact that it is completely trivial for a single individual to do this -- possibly there aren't. That's something worth knowing too. I'm also curious whether Google will take measures to protect against mass downloading of profile data, or whether this is a non-issue for them too. In my opinion the misuse value of personal data on social networks ought to be made explicit before that data is published under a false perception of 'informed' consent.

My activities were performed as part of my research on anonymity/privacy at the University of Amsterdam. I'm writing a research paper about the above. Repeating from my previous post: this blog runs on Google Blogger. I sincerely hope my account "mrkoot" and blog.cyberwar.nl will not be blocked or banned - I did NOT publish the database and did NOT violate any Google policy.

Contact me by e-mail(*): kootNO_SPAM_PLEASE@uva.nl  (remove "NO_SPAM_PLEASE")
Contact me on Twitter: http://twitter.com/mrkoot.

(*)I prefer insults to be sent to mrkoot@gmail.com, as gmail has superior filters.


  1. very nice article


  2. Well, they ARE openly listed in a sitemap.xml and they ARE _public_ profiles. As much as Google should make a better effort of informing people that whatever they enter on the web will be visible to everyone, it IS the profile owners' fault.

  3. Nicely done. (The irony that I use a Google account to comment on this is not lost on me).

  4. It's just like torrents. With a decent internet connection and enough time you can damn near do anything.

    This is also a tactic of intelligence gathering: gather small bits of information from many sources and put them all together (basically a mash-up) to form intelligent data.

  5. Interesting and topical in the UK, as the privacy laws that enable individuals to stop the press and media outlets from publishing information about them have been challenged by Twitter. Something like 75,000 users could (theoretically) be charged and imprisoned because they shared information deemed to be private.

    If that had happened in the pub, no one would have cared, as the scope is small. Instead it is on a website available nearly worldwide.

    I think my generation's concept of privacy is going to be very different from the next generation's concepts! I'm 33, so I've no idea what those 20 years older than me must be thinking.

  6. And so? You retrieved public info, via a public account... your point is what?

  7. This is not spectacularly new information. Xing also has a sitemap of all public profiles. ....

  8. By the way: people search engines still use this information: http://en.yourtraces.com

  9. @Blog-Manager: you are right in stating that this is not "new" information. What surprises most people is the scale relative to the resources required.

    @daman: IMHO, too little research has been done on possibly undesired uses of public profile information. In [1], subtitled "How password recovery threatens banking security", Ben Smyth points out that credential recovery mechanisms for (bank) customers who forgot their passwords "are insecure due to their reliance on publicly available information". Too few (proper!) statistics are available about what various people reveal about themselves. With a huge database, a criminal is not limited to 1000 queries a day (as he would be when using Google Search). He can also try to link this public data to other data, attempt (but that needs research) to use it to de-anonymize data, et cetera. My claim is not relevant to criminals targeting a pre-chosen, specific individual. My claim *is*, however, relevant to criminals who are looking for ANY target. Knowing that bank X has said insecure credential recovery that only requires knowing Y and Z, the criminal wants to enumerate as many persons as possible who 1) are a customer of bank X AND 2) reveal Y and Z on their profile. There are probably MANY ways to attempt to figure that out -- let's do research on such scenarios.

    In general, profile holders are justly considered to be rational actors capable of making informed judgements about what to publish about themselves and what not to publish. But "informed", IMHO, also requires knowledge about possible misuses, and that knowledge is simply not out there yet - at least not publicly and not widely. Perhaps it will turn out that there is not a single convincing example of misuse. But I seriously doubt that and think it needs research (and many are on my side with that, but that appeal is no valid argument).

    [1] Smyth, B. (2010) "Forgotten your responsibilities? How password recovery threatens banking security." Technical Report CSR-10-13, University of Birmingham, UK.
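The kind of statistic called for above (how many profile holders reveal a given combination of attributes) can be sketched on synthetic data. Everything in this example is hypothetical: the table layout, the column names and the five made-up rows are assumptions for illustration and do not reflect the structure or contents of any real database.

```python
import sqlite3

# Hypothetical schema and synthetic rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE profiles ("
    " id INTEGER PRIMARY KEY,"
    " bank TEXT,"            # bank named on the profile, if any
    " reveals_y INTEGER,"    # 1 if attribute Y is visible, else 0
    " reveals_z INTEGER)"    # 1 if attribute Z is visible, else 0
)
conn.executemany(
    "INSERT INTO profiles VALUES (?, ?, ?, ?)",
    [(1, "X", 1, 1), (2, "X", 1, 0), (3, "W", 1, 1), (4, "X", 1, 1), (5, None, 0, 1)],
)

# Count profiles that name bank X AND expose both Y and Z.
(exposed,) = conn.execute(
    "SELECT COUNT(*) FROM profiles"
    " WHERE bank = 'X' AND reveals_y = 1 AND reveals_z = 1"
).fetchone()
(total,) = conn.execute("SELECT COUNT(*) FROM profiles").fetchone()

print(f"{exposed} of {total} synthetic profiles expose the full combination")
```

On a local copy there is no per-day query limit, which is exactly why aggregate figures like this deserve study before anyone can call consent "informed".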

  10. @mrkoot

    You are spot on with your point about the difficulty of informed consent. How can individuals possibly grasp what could be done with their data?

    I just wrote an article for O'Reilly's Radar blog about unexpected privacy implications on smartphones ( http://oreil.ly/mRBYHg ) and I think that there is a huge gap between what people expect their data to be used for, and what it ends up being used for.

    The more you push these boundaries, the more likely there will be an informed debate about privacy.

    Looking forward to you mashing your database together with something like fluidinfo ;-)

    All the best


  11. Good work Matthijs,

    I expect at least all iGoogle users to benefit from your work. Hope Google acts appropriately.

    Thanx again and success with your PhD.

    Wim Vriend

  12. Man, that's a really good and scary article. Please answer me, how can I protect myself from those hackers?
    Congrats for the job.

  13. Thanks for posting this article. I would like to add something about a marketing technique known as Social Graphing.

    That is, the collection of available personal information, collated for use in an automated process designed to categorise people into types, groups, common units, etc.

    It has been described as "the global mapping of everybody and how they're related".

    mrkoot has demonstrated that the data available for analysis is only limited by the technology that we use to search and collect it; and we can continue to refine these until we find what we want.

    Then readily available tools can assist us in the search for meaning in the data. This means that an analyst with reasonable skills can quickly sort a long list of possible relationships into a much shorter list of probable ones.

    With the development of these techniques, privacy erosion seems to be creeping up slowly as a sort of by-product, i.e. the gradual improvement in availability and linkages between datasets is continually creating more possibilities for aggregating marketing information. This can also be used to build complex profiles of an individual and their activities, and the primary limit to this is the imagination and talent of the analyst.

    Social Graphing also uses the concept of synergy. i.e. the whole is more than the sum of its parts; it makes the development of new information relationships possible and these information relationships are formed without the subjects’ permission or knowledge. We may have contributed information about ourselves freely to a number of separate databases but it is unlikely that we will have anticipated the effect of merging the data, especially where that merger might reveal something more about us.

    Put simply, when there are more connections between more datasets we can build a more comprehensive picture, either in aggregate for a population, or in detail for an individual.

    Marketeers should be aware that in creating marketing connections they could reduce potential customers' privacy as a by-product of their work. In turn this could generate enough mistrust to undermine their long-term customer relationships.

  14. @adamblackie You are absolutely right! The same concern was expressed in 1994 in the influential EU report "Europe and the global information society" ("the Bangemann Report") [1]:

    "The Group believes that without the legal security of a Union-wide approach, lack of consumer confidence will certainly undermine the rapid development of the information society. Given the importance and sensitivity of the privacy issue, a fast decision from Member States is required on the Commission's proposed Directive setting out general principles of data protection."

    People's (reasonable) expectations of privacy may change over time and between cultures, but it is clear that privacy and solitude neither recently emerged [2] nor can reasonably be expected to "fade away" as we move toward ubiquitous human–computer interaction.

    Thanks for your elaborate contribution!

    [1] http://ec.europa.eu/archives/ISPO/infosoc/backg/bangeman.html
    [2] http://www.history.ac.uk/reviews/review/650

  15. I've released this database as a bittorrent on PirateBay.

    1. Nice one! I tweeted about it and informed some of the journalists who picked it up last year: Dan Goodin from el Reg, Mathew J. Schwartz from InformationWeek and Graham Cluley from NakedSecurity/Sophos.