Please also read the follow-up post I published on May 24th 2011. It contains a better description of the motivation, and fewer technical details.
— Matthijs R. Koot, 2011-06-01
UPDATE 2011-05-23 #1: I’m currently writing a paper about the topic discussed below. The activities are performed as part of my research on anonymity/privacy in the System & Network Engineering research group at the University of Amsterdam. A tweet on May 20th 2011 by Mikko Hypponen, as described here, urged me to post a bit prematurely. Google has been informed.
====== START OF ORIGINAL BLOGPOST ======
The existence of Google’s profiles-sitemap.xml has been known outside Google since at least 2008. The XML file, last updated March 16th 2011, points to 7000+ sitemap-NNN(N).txt files that each contain 5000 hyperlinks to Google profiles; 35M links in total. Snippet from sitemap-000.txt:
https://profiles.google.com/117135902571938793602
https://profiles.google.com/112006952710949332145
https://profiles.google.com/105382462492606983441
https://profiles.google.com/109299750146769054739
https://profiles.google.com/104555562341640123846
https://profiles.google.com/112956845518767535694
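To make the sitemap structure concrete, here is a minimal Python sketch that pulls the numeric profile identifiers out of sitemap text and reproduces the "35M links" arithmetic. It works on a few lines copied from the snippet above and makes no network requests. The names `profile_ids` and `estimated_links` are my own, and I use the round figure of 7000 files although the real count was somewhat higher.

```python
import re

# Three lines copied from the sitemap-000.txt snippet in this post.
SAMPLE = """\
https://profiles.google.com/117135902571938793602
https://profiles.google.com/112006952710949332145
https://profiles.google.com/105382462492606983441
"""

# Assumption: every sitemap entry is a profiles.google.com URL ending in a numeric ID.
PROFILE_URL = re.compile(r"^https://profiles\.google\.com/(\d+)$")


def profile_ids(text):
    """Return the numeric profile identifiers found in whitespace-separated sitemap text."""
    ids = []
    for entry in text.split():
        match = PROFILE_URL.match(entry)
        if match:
            ids.append(match.group(1))
    return ids


def estimated_links(files, links_per_file):
    """Rough total: number of sitemap files times links per file."""
    return files * links_per_file


print(profile_ids(SAMPLE))
print(estimated_links(7000, 5000))
```

Splitting on whitespace means the sketch handles both one URL per line and several URLs on a single line.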
Google Profile allows users to choose whether they want to use their username in the Google Profile URL, to make the profile easier to find and remember.
The text explicitly warns the user about possible exposure (bold emphasis added):
“To make it easier for people to find your profile, you can customize your URL with your Google email username. (Note this can make your Google email address publicly discoverable.)”
Selecting the second option gives a URL like https://profiles.google.com/USERNAME. Accessing profiles using the identifiers found in the sitemaps indeed reveals the Google username, and therefore the @gmail.com address. For example, for my own profile, with username “mrkoot”:
irbaboon:be monkey$ curl -i -X HEAD http://www.google.com/profiles/115572197788225218471
HTTP/1.1 301 Moved Permanently
Content-Type: text/html; charset=UTF-8
Date: Mon, 23 May 2011 14:00:31 GMT
Expires: Mon, 23 May 2011 14:00:31 GMT
Cache-Control: private, max-age=0
X-XSS-Protection: 1; mode=block
Note that the HTTP 301 redirect discloses the username before any HTML is requested. During February 2011 I checked all 35 million links (my connection did NOT get blocked, no matter how many connections I made) and found that ~40% of the Google Profiles expose their owner’s username, and hence @gmail.com address, in this way. That totals ~15 MILLION exposed usernames / @gmail.com addresses(*).
With no apparent download restriction in place for connections to https://profiles.google.com, and with Google users disclosing their profession, employer, education, location, and links to their Twitter accounts, Picasa photo albums, LinkedIn accounts et cetera, this seems like a large-scale spear phishing attack waiting to happen.(**) But hey, the users have been warned.
(*) I can provide proof if necessary.
(**) Pardon the alarmist tone.
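For readers who want to see the redirect check spelled out, here is a minimal offline sketch in Python. It parses a raw response head and reports whether the redirect target ends in something other than a numeric profile ID. It is not the script I used. The function names are hypothetical, the `Location` value in `EXAMPLE` is invented for illustration using the USERNAME placeholder from the URL pattern above, and the "all digits means numeric ID" test is an assumption that would misclassify a username consisting only of digits.

```python
from urllib.parse import urlsplit


def redirect_target(raw_head):
    """Return the Location header value from a raw HTTP response head, or None."""
    # Skip the status line, then look for a header named "Location" (case-insensitive).
    for line in raw_head.splitlines()[1:]:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "location":
            return value.strip()
    return None


def exposed_username(location):
    """Return the last path segment of the target unless it is a purely numeric ID."""
    if not location:
        return None
    segment = urlsplit(location).path.rstrip("/").rsplit("/", 1)[-1]
    if not segment or segment.isdigit():
        return None
    return segment


# Hypothetical response head: the Location value is illustrative, not captured output.
EXAMPLE = (
    "HTTP/1.1 301 Moved Permanently\r\n"
    "Location: https://profiles.google.com/USERNAME\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
)

print(exposed_username(redirect_target(EXAMPLE)))
```

A response with no `Location` header, or one that redirects to another numeric identifier, yields `None`, which corresponds to a profile that does not expose a username this way.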