Month: May 2011

Database Containing 35.000.000 Google Profiles. Implications?

UPDATE 2012-02-16: raver1975 released a SQL database w/35M Google Profiles as a .torrent on The Pirate Bay. 
UPDATE 2011-06-10: The central question in the Google discussion is whether mass-aggregation of profile data by unknown third parties is considered acceptable. We should neither exaggerate NOR DENY the possibilities that public profile data offers to adversaries. We should THINK about them. How will YOUR LinkedIn + Facebook + Twitter + Google Profile + (…) make you look when I combine them and subject you to longitudinal study? I seriously doubt that such activities will turn out all good and harmless. 
To quote from Tali Sharot’s piece on The Optimism Bias in Time Magazine (June 2011): “The question then is, How can we remain hopeful — benefiting from the fruits of [techno-]optimism — while at the same time guarding ourselves from its pitfalls?” Like her, I too believe knowledge is key in that.

====== START OF ORIGINAL BLOGPOST FROM 2011-05-24 ======
This is a follow-up to my previous blogpost on this topic.

In February 2011 it proved trivial to create a database containing ALL ~35.000.000 Google Profiles without Google throttling, blocking, CAPTCHAing or otherwise making mass-downloading attempts more difficult. It took only 1 month to retrieve the data, convert it to SQL using spidermonkey and some custom Javascript code, and import it into a database. The database contains Twitter conversations (also stored in the OZ_initData variable), person names, aliases/nicknames, multiple past educations (institute, study, start/end date), multiple past work experiences (employer, function, start/end date), links to Picasa photoalbums, … — and in ~15.000.000 cases, also the username and therefore the Gmail address. In summary: 1 month + 1 connection = 1 database containing 35.000.000 Google Profiles.
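For a sense of what “import it into a database” involves, here is a hedged sqlite3 sketch. The column names are my guess from the field list above, not the actual schema used, and the inserted row is entirely made up:

```python
import sqlite3

# Illustrative schema only: columns are inferred from the fields listed above
# (names, nickname, optional vanity username); the real table layout and the
# profile id are placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE profile (
        profile_id TEXT PRIMARY KEY,
        full_name  TEXT,
        nickname   TEXT,
        username   TEXT   -- NULL for profiles without a vanity URL
    )""")
conn.execute("INSERT INTO profile VALUES (?, ?, ?, ?)",
             ("placeholder-id", "Example Person", "example", "example"))
count = conn.execute("SELECT COUNT(*) FROM profile").fetchone()[0]
print(count)  # 1
```

One table per repeated field (educations, work experiences) would complete the picture; the single-table sketch just shows the shape of the data.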

My activities are directed at feeding debate about privacy — not to create distrust but to achieve realistic trust — and about the meaning of “informed consent”. Which, when signing up for online services like Google Profile, amounts to checking a box. How can a user possibly be considered “informed” when they’re not made aware 1) of the fact that it does not seem to bother Google that profiles can be mass-downloaded (Dutch) and 2) of the misuse value –or hopefully the lack of it– of their social data to criminals and certain types of marketers? Does this enable mass spear phishing attacks and other types of social engineering, or is that risk negligible, e.g. because criminals use other methods of attack and/or have other, better sources of personal data? Absence of ANY protection against mass-downloading is the status quo at Google Profile. Strictly speaking I did not even violate Google policy in retrieving the profiles, because Google’s robots.txt explicitly ALLOWS indexing of Google Profiles and my code is part of a personal experimental search engine project. At the time of this writing, the robots.txt file contains:

Allow: /profiles
Allow: /s2/profiles
Allow: /s2/photos
Allow: /s2/static
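The quoted rules can be checked mechanically with Python’s stdlib robots.txt parser; a small sketch feeding in exactly the four Allow lines above (note the absence of any Disallow rule, so nothing is off-limits):

```python
from urllib import robotparser

# Feed the four Allow rules quoted above straight into the stdlib parser;
# parse() accepts the file's lines, so no network access is involved.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /profiles",
    "Allow: /s2/profiles",
    "Allow: /s2/photos",
    "Allow: /s2/static",
])
print(rp.can_fetch("*", "/profiles/mrkoot"))  # True: crawling profiles is permitted
```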

I’m curious whether there are any implications to the fact that it is completely trivial for a single individual to do this — possibly there aren’t. That’s something worth knowing too. I’m also curious whether Google will apply measures to protect against mass downloading of profile data, or whether this is a non-issue for them too. In my opinion the misuse value of personal data on social networks ought to be elicited before publishing it under a false perception of ‘informed’ consent.

My activities were performed as part of my research on anonymity/privacy at the University of Amsterdam. I’m writing a research paper about the above. Repeating from my previous post: this blog runs at Google Blogger. I sincerely hope my account “mrkoot” and this blog will not be blocked or banned – I did not publish the database and did not violate any Google policy.

Ziliak/McCloskey’s “Statement on the proprieties of Substantive Significance”

UPDATE 2015-03-13: interesting article in PLOS Biology: The Extent and Consequences of P-Hacking in Science (2015, Head, Holman, Lanfear, Kahn & Jennions) + press release.

Two days ago I wrote a blogpost on the book The Cult of Statistical Significance written by Stephen T. Ziliak and Deirdre N. McCloskey. It seems that the “Statement on the proprieties of Substantive Significance”(*) proposed by the authors on p.249/p.250, chapter “What to Do”, is absent from the internet (Google Books being an exception) — while it is valuable enough to be shared online. So I did some data entry. Typos are mine, emphasis is original.

——– 8< ——– 8< ——– 8< ——– 8< ——– 8< ——– 8< ——– 8< ——–

  1. Sampling variance is sometimes interesting, but a low value of it is not the same thing as scientific importance. Economic significance is the chief scientific issue in economic science; clinical significance is the chief issue in medical and psychiatric and pharmacological science; epidemiological significance is the chief issue in infectious disease science; and substantive significance is the chief issue in any science, from agronomy to zoology. No amount of sampling significance can substitute for it.
  2. In any case, scientists should prefer Neyman’s confidence intervals, Rothman’s p-value functions, Zellner’s random prior odds, Rossi’s real Type I error, Leamer’s extreme bound analysis, and above all Gosset’s real error bars (Student 1927) to the Fisher-circumscribed method of reporting sampling variance (Leamer 1982; Leamer and Leonard 1983; Zellner 2008). No uniform minimum level of Type I error should be specified or enforced by journals, governments, or professional associations.
  3. Scientists should prefer power functions and operating characteristic functions to vague talk about alternative hypotheses, unspecified. Freiman et al. (1978), Rossi (1990), and similar large-scale surveys of power against medium and large effect sizes should serve as minimum standards for small and moderate sample size investigations. Lack of power–say, less than 65 percent for medium-sized effects and 85 percent for large effects–should be highlighted. How the balance should be struck in any given case depends on the issues at stake.
  4. Competing hypotheses should be tested against explicit economic or clinical or other substantively significant standards. For example, in studies of treatments of breast cancer a range of the size and timing of expected net benefits should be stated and argued explicitly. In economics the approximate employment and earnings rates of workers following enactment of a welfare reform bill should be explicitly articulated. Is the Weibull distribution parameter of the cancer patient data substantively different from 1.0, suggesting greatly diminishing chances of patient survival? How greatly? What does one mean by the claim that welfare reform is “working”? In a labor supply regression does β = “about -0.05” on the public assistance variable meet a defensible minimum standard of oomph? In what units? At what level of power? Science needs discernible Jeffreys’ d’s (minimum important effect sizes)–showing differences of oomph. It does not need unadorned yet “significant” t’s.
  5. Hypothesis testing–examining probabilistic, experimental, and other warrants for believing one hypothesis more than the alternative hypotheses–should be sharply distinguished from significance testing, which in Fisher’s procedures assumes a true null [hypothesis, MRK]. It is an elementary point of logic that “If H, then O” is not the same as “If O, then H“. Good statistical science requires genuine hypothesis testing. As Jeffreys observed, a p-value allows one to make at best a precise statement about a narrow event that has not occurred.
  6. Scientists should estimate, not testimate. Quantitative measures of oomph such as Jeffreys’ d, Wald’s “loss function”, Savage’s “admissibility”, Wald and Savage’s “minimax”, Neyman-Pearson’s “decision”, and above all Gosset’s “net pecuniary advantage” should be brought back to the center of statistical inquiry.
  7. Fit is not a good all-purpose measure of scientific validity, and should be deemphasized in favor of inquiry into other measures of error and importance. 

——– 8< ——– 8< ——– 8< ——– 8< ——– 8< ——– 8< ——– 8< ——–
I will add bibliography and hyperlinks to this post later this week.

(*) Which the authors abbreviate as SpSS — I wonder whether that is an intended pun/reference to SPSS 🙂

Google Profiles Exposes Millions of Usernames, Gmails

Please also read the follow-up post I published on May 24th 2011. It contains a better description of my motivation and less technical details.
— Matthijs R. Koot, 2011-06-01

UPDATE 2011-05-23 #1: I’m currently writing a paper about the topic discussed below. The activities are performed as part of my research on anonymity/privacy in the System & Network Engineering research group at the University of Amsterdam. A tweet on May 20th 2011 by Mikko Hypponen, as described here, urged me to post a bit prematurely. Google has been informed.

UPDATE 2011-05-23 #2: here is code that can convert most of the data in your Google Profile into a single SQL statement. When accessing a profile in a browser, the profile data (names, profession, education, …) is stored in a single multidimensional Javascript array named OZ_initData[][][…]. Install spidermonkey for its C-based Javascript engine js, download your own profile and save it as e.g. mrkoot.html. Then execute something like sed -n '/var OZ_initData = /,/^;window/{ s/.*var OZ_initData = /var OZ_initData = /g; s/^;window.*//g; p; }' mrkoot.html | tee tmpjs | js -f tmpjs -e 'print(OZ_initData[5]);' | js -f tmpjs -f GProfile2SQL.js to get an INSERT statement. Optimizations are left as an exercise to the reader; you can figure out the table structure from the Javascript code and extend everything as you wish.
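The sed/spidermonkey step above can also be mimicked in a few lines of Python. A sketch; the 'var OZ_initData = ' and ';window' markers mirror the sed expression and are assumptions about the 2011-era page source:

```python
import re

def extract_oz_initdata(html: str) -> str:
    """Pull the raw OZ_initData Javascript literal out of a saved profile page.

    Mirrors the sed expression above: grab everything between
    'var OZ_initData = ' and the ';window' that follows it."""
    m = re.search(r"var OZ_initData = (.*?);window", html, re.DOTALL)
    if m is None:
        raise ValueError("OZ_initData not found")
    return m.group(1)

# Tiny stand-in for a saved profile page (real pages are far larger):
page = 'junk var OZ_initData = [[5,"mrkoot"]];window.more = 1;'
print(extract_oz_initdata(page))  # [[5,"mrkoot"]]
```

Evaluating the extracted literal still requires a Javascript engine (as the js pipeline does); the regex only isolates it.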

====== START OF ORIGINAL BLOGPOST FROM 2011-05-23 ======
The existence of Google’s profiles-sitemap.xml has been known outside Google since at least 2008. The XML file, last updated March 16th 2011, points to 7000+ sitemap-NNN(N).txt files that each contain 5000 hyperlinks to Google profiles; 35M links in total. Snippet from sitemap-000.txt:
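Walking the sitemap index described above is mechanical; a hedged Python sketch (the fetch callable is a stand-in for HTTP retrieval, and the sitemap-NNN.txt naming follows the text):

```python
# Sketch of enumerating profile links from the numbered sitemap files:
# 7000+ plain-text files, each a list of profile hyperlinks.
def profile_links(fetch, n_sitemaps):
    """Yield every profile URL listed across the numbered sitemap files."""
    for i in range(n_sitemaps):
        for line in fetch(f"sitemap-{i:03d}.txt").splitlines():
            line = line.strip()
            if line:
                yield line

# Fake fetcher standing in for the real download:
fake = lambda name: "http://example.invalid/profiles/a\nhttp://example.invalid/profiles/b\n"
links = list(profile_links(fake, n_sitemaps=2))
print(len(links))  # 4
```

At 5000 links per file and 7000+ files, the same loop covers the full ~35M links.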

Google Profile allows users to choose whether they want to use their username in the Google Profile URL to make it easier to find and remember:

The text explicitly warns the user about possible exposure (bold emphasis added):

“To make it easier for people to find your profile, you can customize your URL with your Google email username. (Note this can make your Google email address publicly discoverable.)”

Selecting the second option gives a URL containing the chosen username. Accessing profiles using the identifiers found in the sitemaps indeed reveals the Google username — and therefore the Gmail address. E.g. for me w/username “mrkoot”:

irbaboon:be monkey$ curl -i -X HEAD
HTTP/1.1 301 Moved Permanently
Location: /profiles/mrkoot
Content-Type: text/html; charset=UTF-8
Date: Mon, 23 May 2011 14:00:31 GMT
Expires: Mon, 23 May 2011 14:00:31 GMT
Cache-Control: private, max-age=0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Server: GSE
Transfer-Encoding: chunked

Note that the HTTP 301 redirect discloses the username before any HTML is requested. During February 2011 I checked all 35 million links –my connection did NOT get blocked after any amount of connections– and found that ~40% of the Google Profiles expose their owner’s username and hence Gmail address in this way. That totals to ~15 MILLION exposed usernames/addresses(*). With no apparent download restriction in place, and Google users disclosing their profession, employer, education, location, links to their Twitter account, Picasa photoalbums, LinkedIn accounts et cetera, doesn’t this seem like a large-scale spear phishing attack waiting to happen?(**) But hey, the users have been warned.

(*) I can provide proof if necessary.
(**) Pardon the alarmist tone.
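The disclosure in the curl transcript above boils down to reading one header. A minimal Python sketch of the check, parsing only (no HTTP is issued here; the /profiles/ prefix follows the Location header shown above):

```python
# The 301's Location header alone reveals the vanity username, before any
# HTML is fetched. Issuing the HEAD request and not following the redirect
# is left out; this is just the header parsing.
def username_from_location(status: int, location: str):
    """Return the username disclosed by a profile redirect, else None."""
    prefix = "/profiles/"
    if status == 301 and location.startswith(prefix):
        return location[len(prefix):]
    return None

# The transcript above (HTTP 301, 'Location: /profiles/mrkoot') parses to:
print(username_from_location(301, "/profiles/mrkoot"))  # mrkoot
print(username_from_location(200, "/profiles/mrkoot"))  # None (no redirect)
```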

Statistical Significance != Scientific Significance

UPDATE 2015-03-13: interesting article in PLOS Biology: The Extent and Consequences of P-Hacking in Science (2015, Head, Holman, Lanfear, Kahn & Jennions) + press release.

UPDATE 2011-08-24: here is a good post on “publication count vs making impact on society” — not exactly the same topic as my post here, but also makes the case for focusing on the “oomph”-factor, the “qualitative size” of result rather than on a metric alone.

In The Cult of Statistical Significance, economists Stephen T. Ziliak and Deirdre N. McCloskey consider various empirical sciences and remind us that statistical significance, by itself, does NOT equal scientific significance (.pdf). The authors criticize the ideas of R.A. Fisher and restore the ideas of W.S. Gosset to honor while explaining their point.

“X has at the .05 level a significant effect on Y, therefore X is important for explaining Y”. So what? HOW important is X for explaining Y? How does this finding help the world DECIDE AMONG POSSIBLE COURSES OF ACTION? What is the potential IMPACT of the claimed effect, e.g. measured in units of HEALTH, MONEY and OTHER ‘HUMAN’ VALUES? How LARGE is the impact in your field of science — the clinical significance, biological significance, psycho-pharmacological significance, …? Seemingly evident questions, but the authors convincingly demonstrate, using concrete examples, that these questions are often not answered (or even asked?) in real-world scientific practice.

The authors correctly state that a finding with LESS statistical significance may have MORE scientific significance, and argue against using a rather arbitrary threshold of statistical significance, e.g. p<0.05 (why not p<0.06 or p<0.15?), as a fixed, non-negotiable demarcation of science or ‘scientific proof’. The authors assert that a minimax strategy or other loss function needs to be employed in addition to the p-value, R², Student’s t, etc. Their point is summarized in this graph from their book (*):
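The complaint is easy to reproduce numerically. The following sketch uses invented numbers and a one-sample z-test (normal approximation) to show a negligible effect beating a substantial one on p-value alone:

```python
import math

# Toy one-sample z-test: "sampling significance" vs "oomph". The numbers are
# invented for illustration; nothing here comes from Ziliak & McCloskey's data.
def one_sample_z(sample_mean, pop_sd, n):
    """Two-sided p-value for H0: true mean = 0 (normal approximation)."""
    z = sample_mean * math.sqrt(n) / pop_sd
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Tiny effect + huge sample: overwhelmingly "significant", arguably no oomph.
z_big, p_big = one_sample_z(sample_mean=0.01, pop_sd=1.0, n=1_000_000)
# Large effect + small sample: a larger p-value, yet far more substantive.
z_small, p_small = one_sample_z(sample_mean=0.60, pop_sd=1.0, n=20)
assert p_big < p_small  # the p-value ranks the findings backwards
```

Which of the two findings matters more is exactly the question the p-value cannot answer: that takes units, a loss function, and a judgment about oomph.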

Judging by the text it is clear that the book scratches a STRONG personal itch of the authors – I’d almost speculate they wrote this book as an assignment in anger management. I strongly recommend this book nonetheless 🙂

(*) Well, actually this version is a screen capture, but it’s the same graph from the same authors.

U.S. DoD Counterintelligence Awareness and Reporting (CIAR)

UPDATE 2015-03-07: from US DoD PERSEREC: Counterintelligence Reporting Essentials (CORE) – A Practical Guide for Reporting Counterintelligence and Security Indicators (.pdf, 2005; mirror) and Reporting of Counterintelligence and Security Indicators by Supervisors and Coworkers (.pdf, 2005; mirror).

UPDATE 2014-10-30: US DoD DSS released a 24-page handout (.pdf; mirror) entitled “Counterintelligence – Best Practices for Cleared Industry”.

The below is an excerpt from the United States DoD Directive 5240.06, May 17, 2011, on Counterintelligence Awareness and Reporting (CIAR): “All DoD personnel shall receive CIAR training in accordance with this Directive (…) within 90 days of initial assignment or employment (…) and every 12 months thereafter“. Table 3 is specifically related to cyber threats.

Tables 1 through 3 contain reportable contacts, activities, indicators, behaviors, and cyber threats associated with FIEs.

a. Table 1. Personnel who fail to report the contacts, activities, indicators, and behaviors in items 1 through 22 are subject to punitive action in accordance with section 2 of this enclosure. The activities in items 23 and 24 are reportable, but failure to report these activities may not alone serve as the basis for punitive action.

Table 1. Reportable Foreign Intelligence Contacts, Activities, Indicators, and Behaviors

  1. When not related to official duties, contact with anyone known or believed to have information of planned, attempted, actual, or suspected espionage, sabotage, subversion, or other intelligence activities against DoD facilities, organizations, personnel, or information systems. This includes contact through SNS that is not related to official duties. 
  2. Contact with an individual who is known or suspected of being associated with a foreign intelligence or security organization. 
  3. Visits to foreign diplomatic facilities that are unexplained or inconsistent with an individual’s official duties. 
  4. Acquiring, or permitting others to acquire, unauthorized access to classified or sensitive information systems. 
  5. Attempts to obtain classified or sensitive information by an individual not authorized to receive such information.
  6. Persons attempting to obtain access to sensitive information inconsistent with their duty requirements. 
  7. Attempting to expand access to classified information by volunteering for assignments or duties beyond the normal scope of responsibilities. 
  8. Discovery of suspected listening or surveillance devices in classified or secure areas. 
  9. Unauthorized possession or operation of cameras, recording devices, computers, and communication devices where classified information is handled or stored. 
  10. Discussions of classified information over a non-secure communication device. 
  11. Reading or discussing classified or sensitive information in a location where such activity is not permitted. 
  12. Transmitting or transporting classified information by unsecured or unauthorized means. 
  13. Removing or sending classified or sensitive material out of secured areas without proper authorization. 
  14. Unauthorized storage of classified material, regardless of medium or location, to include unauthorized storage of classified material at home. 
  15. Unauthorized copying, printing, faxing, e-mailing, or transmitting classified material. 
  16. Improperly removing classification markings from documents or improperly changing classification markings on documents. 
  17. Unwarranted work outside of normal duty hours.
  18. Attempts to entice co-workers into criminal situations that could lead to blackmail or extortion. 
  19. Attempts to entice DoD personnel or contractors into situations that could place them in a compromising position.
  20. Attempts to place DoD personnel or contractors under obligation through special treatment, favors, gifts, or money. 
  21. Requests for witness signatures certifying the destruction of classified information when the witness did not observe the destruction.
  22. Requests for DoD information that make an individual suspicious, to include suspicious or questionable requests over the internet or SNS.
  23. Trips to foreign countries that are: a. Short trips inconsistent with logical vacation travel or not part of official duties. b. Trips inconsistent with an individual’s financial ability and official duties.
  24. Unexplained or undue affluence. a. Expensive purchases an individual’s income does not logically support. b. Attempts to explain wealth by reference to an inheritance, luck in gambling, or a successful business venture. c. Sudden reversal of a bad financial situation or repayment of large debts.

b. Table 2. Personnel who fail to report the contacts, activities, indicators, and behaviors in items 1 through 9 are subject to punitive action in accordance with section 2 of this enclosure. The activity in item 10 is reportable, but failure to report this activity may not alone serve as the basis for punitive action.

Table 2. Reportable International Terrorism Contacts, Activities, Indicators, and Behaviors

  1. Advocating violence, the threat of violence, or the use of force to achieve goals on behalf of a known or suspected international terrorist organization. 
  2. Advocating support for a known or suspected international terrorist organization or its objectives.
  3. Providing financial or other material support to a known or suspected international terrorist organization or to someone suspected of being an international terrorist.
  4. Procuring supplies and equipment, to include purchasing bomb making materials or obtaining information about the construction of explosives, on behalf of a known or suspected international terrorist organization. 
  5. Contact, association, or connections to known or suspected international terrorists, including online, e-mail, and social networking contacts.
  6. Expressing an obligation to engage in violence in support of known or suspected international terrorism or inciting others to do the same. 
  7. Any attempt to recruit personnel on behalf of a known or suspected international terrorist organization or for terrorist activities.
  8. Collecting intelligence, including information regarding installation security, on behalf of a known or suspected international terrorist organization.
  9. Familial ties, or other close associations, to known or suspected international terrorists or terrorist supporters. 
  10. Repeated browsing or visiting known or suspected international terrorist websites that promote or advocate violence directed against the United States or U.S. forces, or that promote international terrorism or terrorist themes, without official sanction in the performance of duty.

c. Table 3. Personnel who fail to report the contacts, activities, indicators, and behaviors in items 1 through 10 are subject to punitive action in accordance with section 2 of this enclosure. The indicators in items 11 through 19 are reportable, but failure to report these indicators may not alone serve as the basis for punitive action.

Table 3. Reportable FIE-Associated Cyberspace Contacts, Activities, Indicators, and Behaviors

  1. Actual or attempted unauthorized access into U.S. automated information systems and unauthorized transmissions of classified or controlled unclassified information. 
  2. Password cracking, key logging, encryption, steganography, privilege escalation, and account masquerading. 
  3. Network spillage incidents or information compromise.
  4. Use of DoD account credentials by unauthorized parties. 
  5. Tampering with or introducing unauthorized elements into information systems. 
  6. Unauthorized downloads or uploads of sensitive data.
  7. Unauthorized use of Universal Serial Bus, removable media, or other transfer devices.
  8. Downloading or installing non-approved computer applications. 
  9. Unauthorized network access.
  10. Unauthorized e-mail traffic to foreign destinations.
  11. Denial of service attacks or suspicious network communications failures. 
  12. Excessive and abnormal intranet browsing, beyond the individual’s duties and responsibilities, of internal file servers or other networked system contents. 
  13. Any credible anomaly, finding, observation, or indicator associated with other activity or behavior that may also be an indicator of terrorism or espionage.
  14. Data exfiltrated to unauthorized domains.
  15. Unexplained storage of encrypted data.
  16. Unexplained user accounts. 
  17. Hacking or cracking activities. 
  18. Social engineering, electronic elicitation, e-mail spoofing or spear phishing.
  19. Malicious codes or blended threats such as viruses, worms, trojans, logic bombs, malware, spyware, or browser hijackers, especially those used for clandestine data exfiltration.