Database Containing 35.000.000 Google Profiles. Implications?

UPDATE 2012-02-16: raver1975 released a SQL database w/35M Google Profiles as .torrent on The Pirate Bay. 
UPDATE 2011-06-10: The central question in the Google discussion is whether mass-aggregation of profile data by unknown third parties is considered acceptable. We should neither exaggerate NOR DENY the possibilities that public profile data offers to adversaries. We should THINK about them. How will YOUR LinkedIn + Facebook + Twitter + Google Profile + (…) make you look when I combine them and subject you to longitudinal study? I seriously doubt that such activities will turn out all good and harmless. 
To quote from Tali Sharot’s piece on The Optimism Bias in Time Magazine (June 2011): “The question then is, How can we remain hopeful — benefiting from the fruits of [techno-]optimism — while at the same time guarding ourselves from its pitfalls?” Like her, I too believe knowledge is key in that.

====== START OF ORIGINAL BLOGPOST FROM 2011-05-24 ======
This is a follow-up to my previous blogpost on this topic.

In February 2011 it proved trivial to create a database containing ALL ~35.000.000 Google Profiles without Google throttling, blocking, CAPTCHAing or otherwise hindering mass-downloading attempts. It took only 1 month to retrieve the data, convert it to SQL using spidermonkey and some custom Javascript code, and import it into a database. The database contains Twitter conversations (also stored in the OZ_initData variable), person names, aliases/nicknames, multiple past educations (institute, study, start/end date), multiple past work experiences (employer, function, start/end date), links to Picasa photoalbums, … — and in ~15.000.000 cases, also the username and therefore the @gmail.com address. In summary: 1 month + 1 connection = 1 database containing 35.000.000 Google Profiles.

My activities are directed at feeding debate about privacy — not to create distrust but to achieve realistic trust — and at the meaning of “informed consent”. Which, when signing up for online services like Google Profile, amounts to checking a box. How can a user possibly be considered “informed” when they’re not made aware 1) of the fact that it does not seem to bother Google that profiles can be mass-downloaded (Dutch) and 2) of the misuse value –or hopefully the lack of it– of their social data to criminals and certain types of marketeers? Does this enable mass spear phishing attacks and other types of social engineering, or is that risk negligible, e.g. because criminals use other methods of attack and/or have other, better sources of personal data? Absence of ANY protection against mass-downloading is the status quo at Google Profile. Strictly speaking I did not even violate Google policy in retrieving the profiles, because http://www.google.com/robots.txt explicitly ALLOWS indexing of Google Profiles and my code is part of a personal experimental search engine project. At the time of this writing, the robots.txt file contains:

Allow: /profiles
Allow: /s2/profiles
Allow: /s2/photos
Allow: /s2/static
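Whether a given path is permitted by those rules can be checked offline with Python's stdlib robots.txt parser. The Allow lines below are copied from the excerpt above, with a User-agent line added for illustration; the full robots.txt of course contains many more rules:

```python
# Offline check that the quoted robots.txt rules indeed allow profile URLs.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Allow: /profiles
Allow: /s2/profiles
Allow: /s2/photos
Allow: /s2/static
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A generic crawler is allowed to fetch profile pages under these rules.
print(rp.can_fetch("*", "http://www.google.com/profiles/mrkoot"))  # → True
```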

I’m curious whether there are any implications to the fact that it is completely trivial for a single individual to do this — possibly there aren’t. That’s worth knowing too. I’m curious whether Google will apply measures to protect against mass downloading of profile data, or whether this is a non-issue for them too. In my opinion the misuse value of personal data on social networks ought to be elicited before it is published under a false perception of ‘informed’ consent.

My activities were performed as part of my research on anonymity/privacy at the University of Amsterdam. I’m writing a research paper about the above. Repeating from my previous post: this blog runs at Google Blogger. I sincerely hope my account “mrkoot” and blog2.cyberwar.nl/ will not be blocked or banned – I did not publish the database and did not violate any Google policy.

Ziliak/McCloskey’s “Statement on the proprieties of Substantive Significance”

UPDATE 2015-03-13: interesting article in PLOS Biology: The Extent and Consequences of P-Hacking in Science (2015, Head, Holman, Lanfear, Kahn & Jennions) + press release.

Two days ago I wrote a blogpost on the book The Cult of Statistical Significance written by Stephen T. Ziliak and Deirdre N. McCloskey. It seems that the “Statement on the proprieties of Substantive Significance” (*) proposed by the authors on p.249/p.250, chapter “What to Do”, is absent from the internet (Google Books being an exception) — while it is valuable enough to be shared online. So I did some data entry. Typos are mine, emphasis is original.

——– 8< ——– 8< ——– 8< ——– 8< ——– 8< ——– 8< ——– 8< ——–

  1. Sampling variance is sometimes interesting, but a low value of it is not the same thing as scientific importance. Economic significance is the chief scientific issue in economic science; clinical significance is the chief issue in medical and psychiatric and pharmacological science; epidemiological significance is the chief issue in infectious disease science; and substantive significance is the chief issue in any science, from agronomy to zoology. No amount of sampling significance can substitute for it.
  2. In any case, scientists should prefer Neyman’s confidence intervals, Rothman’s p-value functions, Zellner’s random prior odds, Rossi’s real Type I error, Leamer’s extreme bound analysis, and above all Gosset’s real error bars (Student 1927) to the Fisher-circumscribed method of reporting sampling variance (Leamer 1982; Leamer and Leonard 1983; Zellner 2008). No uniform minimum level of Type I error should be specified or enforced by journals, governments, or professional associations.
  3. Scientists should prefer power functions and operating characteristic functions to vague talk about alternative hypotheses, unspecified. Freiman et al. (1978), Rossi (1990), and similar large-scale surveys of power against medium and large effect sizes should serve as minimum standards for small and moderate sample size investigations. Lack of power–say, less than 65 percent for medium-sized effects and 85 percent for large effects–should be highlighted. How the balance should be struck in any given case depends on the issues at stake.
  4. Competing hypotheses should be tested against explicit economic or clinical or other substantively significant standards. For example, in studies of treatments of breast cancer a range of the size and timing of expected net benefits should be stated and argued explicitly. In economics the approximate employment and earnings rates of workers following enactment of a welfare reform bill should be explicitly articulated. Is the Weibull distribution parameter of the cancer patient data substantively different from 1.0, suggesting greatly diminishing chances of patient survival? How greatly? What does one mean by the claim that welfare reform is “working”? In a labor supply regression does β = “about -0.05” on the public assistance variable meet a defensible minimum standard of oomph? In what units? At what level of power? Science needs discernible Jeffreys’ d‘s (minimum important effect sizes)–showing differences of oomph. It does not need unadorned yet “significant” t‘s.
  5. Hypothesis testing–examining probabilistic, experimental, and other warrants for believing one hypothesis more than the alternative hypotheses–should be sharply distinguished from significance testing, which in Fisher’s procedures assumes a true null [hypothesis, MRK]. It is an elementary point of logic that “If H, then O” is not the same as “If O, then H“. Good statistical science requires genuine hypothesis testing. As Jeffreys observed, a p-value allows one to make at best a precise statement about a narrow event that has not occurred.
  6. Scientists should estimate, not testimate. Quantitative measures of oomph such as Jeffreys’ d, Wald’s “loss function”, Savage’s “admissibility”, Wald and Savage’s “minimax”, Neyman-Pearson’s “decision”, and above all Gosset’s “net pecuniary advantage” should be brought back to the center of statistical inquiry.
  7. Fit is not a good all-purpose measure of scientific validity, and should be deemphasized in favor of inquiry into other measures of error and importance. 

——– 8< ——– 8< ——– 8< ——– 8< ——– 8< ——– 8< ——– 8< ——–
 
I will add bibliography and hyperlinks to this post later this week.

(*) Which the authors abbreviate as SpSS — I wonder whether that is an intended pun/reference to SPSS 🙂

Google Profiles Exposes Millions of Usernames, Gmails

Please also read the follow-up post I published on May 24th 2011. It contains a better description of the motivation, and fewer technical details.
— Matthijs R. Koot, 2011-06-01


UPDATE 2011-05-23 #1: I’m currently writing a paper about the topic discussed below. The activities are performed as part of my research on anonymity/privacy in the System & Network Engineering research group at the University of Amsterdam. A tweet on May 20th 2011 by Mikko Hypponen, as described here, urged me to post a bit prematurely. Google has been informed.

UPDATE 2011-05-23 #2: here is code that can convert most of the data in your Google Profile into a single SQL statement: https://cyberwar.nl/GProfile2SQL.js . When accessing a profile in a browser, the profile data (names, profession, education, …) is stored in a single multidimensional Javascript array named OZ_initData[][][…]. Install spidermonkey for its C-based Javascript engine js, download your own profile and save it as e.g. mrkoot.html. Then execute something like sed -n '/var OZ_initData = /,/^;window/{ s/.*var OZ_initData = /var OZ_initData = /g; s/^;window.*//g; p; }' mrkoot.html | tee tmpjs | js -f tmpjs -e 'print(OZ_initData[5]);' | js -f tmpjs -f GProfile2SQL.js to get an INSERT statement. Optimizations are left as an exercise to the reader; you can figure out the table structure from the Javascript code and extend everything as you wish.
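For readers without spidermonkey, the same extraction can be sketched in Python. The HTML sample below is a hypothetical miniature of a real profile page; real OZ_initData arrays are Javascript literals that may not all parse as JSON, which is why the original approach used a real JS engine:

```python
# A rough Python analogue of the sed pipeline above: pull the
# "var OZ_initData = ..." assignment out of a saved profile page.
import json
import re

sample_html = """
<script>var OZ_initData = [[],[],[],[],[],["Matthijs","mrkoot","Amsterdam"]]
;window.OZ_loaded = true;</script>
"""

m = re.search(r"var OZ_initData = (.*?)\n;window", sample_html, re.DOTALL)
# This toy array literal happens to be valid JSON; real pages may not be.
data = json.loads(m.group(1))
print(data[5])  # → ['Matthijs', 'mrkoot', 'Amsterdam']
```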

====== START OF ORIGINAL BLOGPOST ======
The existence of Google’s profiles-sitemap.xml has been known outside Google since at least 2008. The XML file, last updated March 16th 2011, points to 7000+ sitemap-NNN(N).txt files that each contain 5000 hyperlinks to Google profiles; 35M links in total. Snippet from sitemap-000.txt:

https://profiles.google.com/117135902571938793602
https://profiles.google.com/112006952710949332145
https://profiles.google.com/105382462492606983441
https://profiles.google.com/109299750146769054739
https://profiles.google.com/104555562341640123846
https://profiles.google.com/112956845518767535694
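Turning those sitemap files into a crawl list is a one-liner. The sketch below parses the sample lines quoted above; fetching and concatenating all ~7000 real files is omitted, since plain HTTP GETs sufficed at the time:

```python
# Extract the numeric profile identifiers from sitemap-NNN.txt contents.
sample_sitemap = """\
https://profiles.google.com/117135902571938793602
https://profiles.google.com/112006952710949332145
https://profiles.google.com/105382462492606983441
"""

profile_ids = [line.rsplit("/", 1)[1] for line in sample_sitemap.splitlines() if line]
print(len(profile_ids), profile_ids[0])  # → 3 117135902571938793602
```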

Google Profile allows users to choose whether they want to use their username in the Google Profile URL, to make it easier to find and remember:

The text explicitly warns the user about possible exposure (bold emphasis added):

“To make it easier for people to find your profile, you can customize your URL with your Google email username. (Note this can make your Google email address publicly discoverable.)”

Selecting the second option gives a URL like https://profiles.google.com/USERNAME. Accessing profiles using the identifiers found in the sitemaps indeed reveals the Google username — and therefore the @gmail.com address. E.g. for me, w/username “mrkoot“:

irbaboon:be monkey$ curl -i -X HEAD http://www.google.com/profiles/115572197788225218471
HTTP/1.1 301 Moved Permanently
Location: /profiles/mrkoot
Content-Type: text/html; charset=UTF-8
Date: Mon, 23 May 2011 14:00:31 GMT
Expires: Mon, 23 May 2011 14:00:31 GMT
Cache-Control: private, max-age=0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Server: GSE
Transfer-Encoding: chunked

Note that the HTTP 301 redirect discloses the username before any HTML is requested. During February 2011 I checked all 35 million links –my connection did NOT get blocked after any amount of connections– and found that ~40% of the Google Profiles expose their owner’s username and hence @gmail.com address in this way. That totals ~15 MILLION exposed usernames / @gmail.com addresses(*). With no apparent download restriction in place for connections to https://profiles.google.com, and with Google users disclosing their profession, employer, education, location, links to their Twitter account, Picasa photoalbums, LinkedIn accounts et cetera, this seems like a large-scale spear phishing attack waiting to happen?(**) But hey, the users have been warned.

(*) I can provide proof if necessary.
(**) Pardon the alarmist tone.
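To make the mechanism concrete: the leak is decidable from the Location header alone, no HTML needed. A hypothetical helper that classifies a redirect target, mirroring the curl output above:

```python
# If the 301 target ends in a non-numeric path segment, the profile owner
# opted into a vanity URL and the Google username (gmail address) is exposed.
def username_from_location(location: str):
    """Return the vanity username from a Location header path, else None."""
    tail = location.rstrip("/").rsplit("/", 1)[-1]
    return None if tail.isdigit() else tail

print(username_from_location("/profiles/mrkoot"))                 # → mrkoot
print(username_from_location("/profiles/115572197788225218471"))  # → None
```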

Statistical Significance != Scientific Significance

UPDATE 2019-03-20: Scientists rise up against statistical significance (Valentin Amrhein, Sander Greenland & Blake McShane; comment in Nature, 3 Mar 2019) + comments (Hacker News).

UPDATE 2018-01-31: Dispense with redundant P values (Joachim Goedhart; comment piece in Nature, 31 Jan 2018).

UPDATE 2017-07-03: Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence (.pdf, McShane & Gal, 2016) + comments (Hacker News).

UPDATE 2017-xx-xx: Response to the ASA’s Statement on p-Values: Context, Process, and Purpose (Edward L. Ionides, Alexander Giessing, Yaacov Ritov & Scott E. Page, in:  The American Statistician 71:1, 2017; paywalled, but also available here.)

UPDATE 2016-xx-xx: The [American Statistical Association’s] Statement on p-Values: Context, Process, and Purpose (Ronald L. Wasserstein & Nicole A. Lazar, in: The American Statistician 70:2, 2016; Open Access).

UPDATE 2015-03-13: The Extent and Consequences of P-Hacking in Science (Head, Holman, Lanfear, Kahn & Jennions; in: PLOS Biology, 2015; Open Access) + press release.

UPDATE 2011-08-24: When am I going to get my money back? <– a good post on “publication count vs making impact on society” — not exactly the same topic as my post below, but it also makes the case for focusing on the “oomph” factor, i.e., the “qualitative size” of results, rather than on a single metric alone.

In The Cult of Statistical Significance, economists Stephen T. Ziliak and Deirdre N. McCloskey consider various empirical sciences and remind us that statistical significance, by itself, does NOT equal scientific significance (.pdf). The authors criticize the ideas of R.A. Fisher and restore the ideas of W.S. Gosset to honor while explaining their point.

“X has at the .05 level a significant effect on Y, therefore X is important for explaining Y”. So what? HOW important is X for explaining Y? How does this finding help the world DECIDE AMONG POSSIBLE COURSES OF ACTION? What is the potential IMPACT of the claimed effect, e.g. measured in units of HEALTH, MONEY and OTHER ‘HUMAN’ VALUES? How LARGE is the impact in your field of science — the clinical significance, biological significance, psycho-pharmacological significance, …? Seemingly evident questions, but the authors convincingly demonstrate, using concrete examples, that these questions are often not answered (or even asked?) in real-world scientific practice.
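A toy calculation illustrates the gap between the two notions of significance. With invented numbers: a mean shift of 0.01 standard deviation (negligible "oomph" in almost any applied setting) becomes overwhelmingly "statistically significant" once the sample is large enough:

```python
# One-sample z-test on a tiny standardized effect with a huge n.
import math

n = 1_000_000     # sample size
effect = 0.01     # mean shift in standard-deviation units: tiny oomph
z = effect * math.sqrt(n)           # z ≈ 10: overwhelming "significance"
p = math.erfc(z / math.sqrt(2))     # two-sided p-value, astronomically small
print(f"z = {z:.1f}, p = {p:.2e}")
```

The p-value says nothing about whether a 0.01-standard-deviation shift matters in units of health or money, which is exactly the authors' complaint.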

The authors correctly state that a finding with LESS statistical significance may have MORE scientific significance, and argue against using a rather arbitrary threshold of statistical significance, e.g. p<0.05 (why not p<0.06 or p<0.15?), as a fixed, non-negotiable demarcation of science or ‘scientific proof’. The authors assert that a minimax strategy or other loss function needs to be employed in addition to the p-value, R2, Student’s t, etc. Their point is summarized in this graph (*):

Judging by the text, it is clear that the book scratches a serious personal itch of the authors – I’d almost speculate they wrote this book as an assignment in anger management. Either way, I strongly recommend this book.

(*) This version is a screencapture from http://www.deirdremccloskey.com/docs/jsm.pdf, but it’s the same graph from the same authors.

U.S. DoD Counterintelligence Awareness and Reporting (CIAR)

UPDATE 2015-03-07: from US DoD PERSEREC: Counterintelligence Reporting Essentials (CORE) – A Practical Guide for Reporting Counterintelligence and Security Indicators (.pdf, 2005; mirror) and Reporting of Counterintelligence and Security Indicators by Supervisors and Coworkers (.pdf, 2005; mirror).

UPDATE 2014-10-30: US DoD DSS released a 24-page handout (.pdf; mirror) entitled “Counterintelligence – Best Practices for Cleared Industry”.

The below is an excerpt from the United States DoD Directive 5240.06 on Counterintelligence Awareness and Reporting (CIAR) (.pdf, 2011; updated 2017; mirror) as published at www.esd.whs.mil (previously at www.dtic.mil): “All DoD personnel shall receive CIAR training in accordance with this Directive (…) within 90 days of initial assignment or employment (…) and every 12 months thereafter“. Table 3 is specifically related to cyber threats.

5. REPORTABLE CONTACTS, ACTIVITIES, INDICATORS, AND BEHAVIORS.

Tables 1 through 3 contain reportable contacts, activities, indicators, behaviors, and cyber threats associated with FIEs.

a. Table 1. Personnel who fail to report the contacts, activities, indicators, and behaviors in items 1 through 22 are subject to punitive action in accordance with section 2 of this enclosure. The activities in items 23 and 24 are reportable, but failure to report these activities may not alone serve as the basis for punitive action.

Table 1. Reportable Foreign Intelligence Contacts, Activities, Indicators, and Behaviors

  1. When not related to official duties, contact with anyone known or believed to have information of planned, attempted, actual, or suspected espionage, sabotage, subversion, or other intelligence activities against DoD facilities, organizations, personnel, or information systems. This includes contact through SNS that is not related to official duties.
  2. Contact with an individual who is known or suspected of being associated with a foreign intelligence or security organization.
  3. Visits to foreign diplomatic facilities that are unexplained or inconsistent with an individual’s official duties.
  4. Acquiring, or permitting others to acquire, unauthorized access to classified or sensitive information systems.
  5. Attempts to obtain classified or sensitive information by an individual not authorized to receive such information.
  6. Persons attempting to obtain access to sensitive information inconsistent with their duty requirements.
  7. Attempting to expand access to classified information by volunteering for assignments or duties beyond the normal scope of responsibilities.
  8. Discovery of suspected listening or surveillance devices in classified or secure areas.
  9. Unauthorized possession or operation of cameras, recording devices, computers, and communication devices where classified information is handled or stored.
  10. Discussions of classified information over a non-secure communication device.
  11. Reading or discussing classified or sensitive information in a location where such activity is not permitted.
  12. Transmitting or transporting classified information by unsecured or unauthorized means.
  13. Removing or sending classified or sensitive material out of secured areas without proper authorization.
  14. Unauthorized storage of classified material, regardless of medium or location, to include unauthorized storage of classified material at home.
  15. Unauthorized copying, printing, faxing, e-mailing, or transmitting classified material.
  16. Improperly removing classification markings from documents or improperly changing classification markings on documents.
  17. Unwarranted work outside of normal duty hours.
  18. Attempts to entice co-workers into criminal situations that could lead to blackmail or extortion.
  19. Attempts to entice DoD personnel or contractors into situations that could place them in a compromising position.
  20. Attempts to place DoD personnel or contractors under obligation through special treatment, favors, gifts, or money.
  21. Requests for witness signatures certifying the destruction of classified information when the witness did not observe the destruction.
  22. Requests for DoD information that make an individual suspicious, to include suspicious or questionable requests over the internet or SNS.
  23. Trips to foreign countries that are: a. Short trips inconsistent with logical vacation travel or not part of official duties. b. Trips inconsistent with an individual’s financial ability and official duties.
  24. Unexplained or undue affluence. a. Expensive purchases an individual’s income does not logically support. b. Attempts to explain wealth by reference to an inheritance, luck in gambling, or a successful business venture. c. Sudden reversal of a bad financial situation or repayment of large debts.

b. Table 2. Personnel who fail to report the contacts, activities, indicators, and behaviors in items 1 through 9 are subject to punitive action in accordance with section 2 of this enclosure. The activity in item 10 is reportable, but failure to report this activity may not alone serve as the basis for punitive action.

Table 2. Reportable International Terrorism Contacts, Activities, Indicators, and Behaviors

  1. Advocating violence, the threat of violence, or the use of force to achieve goals on behalf of a known or suspected international terrorist organization.
  2. Advocating support for a known or suspected international terrorist organization or its objectives.
  3. Providing financial or other material support to a known or suspected international terrorist organization or to someone suspected of being an international terrorist.
  4. Procuring supplies and equipment, to include purchasing bomb making materials or obtaining information about the construction of explosives, on behalf of a known or suspected international terrorist organization.
  5. Contact, association, or connections to known or suspected international terrorists, including online, e-mail, and social networking contacts.
  6. Expressing an obligation to engage in violence in support of known or suspected international terrorism or inciting others to do the same.
  7. Any attempt to recruit personnel on behalf of a known or suspected international terrorist organization or for terrorist activities.
  8. Collecting intelligence, including information regarding installation security, on behalf of a known or suspected international terrorist organization.
  9. Familial ties, or other close associations, to known or suspected international terrorists or terrorist supporters.
  10. Repeated browsing or visiting known or suspected international terrorist websites that promote or advocate violence directed against the United States or U.S. forces, or that promote international terrorism or terrorist themes, without official sanction in the performance of duty.

c. Table 3. Personnel who fail to report the contacts, activities, indicators, and behaviors in items 1 through 10 are subject to punitive action in accordance with section 2 of this enclosure. The indicators in items 11 through 19 are reportable, but failure to report these indicators may not alone serve as the basis for punitive action.

Table 3. Reportable FIE-Associated Cyberspace Contacts, Activities, Indicators, and Behaviors

  1. Actual or attempted unauthorized access into U.S. automated information systems and unauthorized transmissions of classified or controlled unclassified information.
  2. Password cracking, key logging, encryption, steganography, privilege escalation, and account masquerading.
  3. Network spillage incidents or information compromise.
  4. Use of DoD account credentials by unauthorized parties.
  5. Tampering with or introducing unauthorized elements into information systems.
  6. Unauthorized downloads or uploads of sensitive data.
  7. Unauthorized use of Universal Serial Bus, removable media, or other transfer devices.
  8. Downloading or installing non-approved computer applications.
  9. Unauthorized network access.
  10. Unauthorized e-mail traffic to foreign destinations.
  11. Denial of service attacks or suspicious network communications failures.
  12. Excessive and abnormal intranet browsing, beyond the individual’s duties and responsibilities, of internal file servers or other networked system contents.
  13. Any credible anomaly, finding, observation, or indicator associated with other activity or behavior that may also be an indicator of terrorism or espionage.
  14. Data exfiltrated to unauthorized domains.
  15. Unexplained storage of encrypted data.
  16. Unexplained user accounts.
  17. Hacking or cracking activities.
  18. Social engineering, electronic elicitation, e-mail spoofing or spear phishing.
  19. Malicious codes or blended threats such as viruses, worms, trojans, logic bombs, malware, spyware, or browser hijackers, especially those used for clandestine data exfiltration.

Source: http://www.esd.whs.mil/Portals/54/Documents/DD/issuances/dodd/524006p.pdf

WANTED: Journal of Negative Results in Security, Privacy and Surveillance

Biomedicine has a Journal of Negative Results in Biomedicine, edited by Bjorn Olsen from Harvard. Could a Journal of Negative Results in Security and Privacy be viable? Perhaps it’s quixotism, considering the persistent lack of reliable metrics to measure even positive outcomes in these domains. But the absence of “it should do X” criteria does not imply that “it should NOT do -X” or “it should not do Y” criteria cannot be established. Marked for further deliberation.

Study Materials on Cyberwar, Intelligence and Security Services

Here’s a list of (mostly) books about cyberwar, intelligence and security services.

Materials about Netherlands

Not specifically about Netherlands

Cyberwar-related
(thx Niels Groeneveld)

What additional study materials do you recommend? Please comment!

    Meta-Data in Public Documents, Cont’d

    For fun, I extracted metadata from most of the documents publicly available at these websites:

    aivd.nl
    belastingdienst.nl
    cia.gov
    ctivd.nl
    defensie.nl
    eerstekamer.nl
    europol.eu
    fbi.gov
    gchq.gov.uk
    minbuza.nl
    mindef.nl
    nsa.gov
    officielebekendmakingen.nl
    om.nl
    overheid.nl
    politie.nl 
    rijksbegroting.nl
    rijksoverheid.nl
    sis.gov.uk
    tno.nl
    tweedekamer.nl

    Here is a count of e-mail addresses I found in Tag_AuthorEmail and Bytes:

    1    accor.com
    1    aesn.fr
    1    agentschapnl.nl
    1    atech-acoustictechnologies.com
    1    bda.amsterdam.nl
    1    bieleveldvanhoek.nl
    1    bletchleypark.org.uk
    1    brgm.fr
    2    cbs.nl
    2    cesg.gsi.gov.uk
    1    coe.int
    1    CvT.nl
    1    diplomatie.gouv.fr
    3    ec.europa.eu
    2    ecologie.gouv.fr
    4    eerstekamer.nl
    1    europolhq.net
    7    fbinet.fbi  <— internal FBI mail
    1    gakushikai.jp
    2    gchq.gsi.gov.uk
    1    gmail.com
    2    hotmail.com
    1    hydro.nl
    6    ic.fbi.gov
    1    inro.tno.nl
    1    isc-cie.com
    1    iwiweb.nl
    1    kabinets-formatie.nl
    1    klpd.politie.nl
    7    leo.gov
    1    let.ru.nl
    1    mail.ing.unibo.it
    3    militairefondsen.nl
    2    minbuza.nl
    14    minbzk.nl
    1    mindef.nl
    7    minfin.nl
    22    minjus.nl
    3    minlnv.nl
    8    MINSZW.NL
    1    minvws.nl
    1    mma.es
    1    mrw.wallonie.be
    2    noord-holland.nl
    1    oieau.fr
    1    olemiss.edu
    1    prv.gelderland.nl
    1    ross.nl
    1    sdu.nl
    1    SMOJMD.USDOJ.gov
    5    sp.nl
    1    sp.se
    1    steunfondsofficieren.nl
    2    tg.nl
    3    tk.parlement.nl
    1    tmleuven.be
    2    tno.nl
    146    tweedekamer.nl
    2    unesco.org
    2    uwv.nl
    2    wereldschool.nl
    1    wwi.minbzk.nl
    1    wxs.nl
    1    xs4all.nl
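A tally like the above is essentially a domain frequency count over the extracted addresses. A minimal sketch, with hypothetical sample addresses standing in for the real Tag_AuthorEmail/Bytes values:

```python
# Count e-mail domains, most frequent first.
from collections import Counter

addresses = [  # hypothetical values for illustration
    "jan@tweedekamer.nl", "piet@tweedekamer.nl", "agent@ic.fbi.gov",
]
counts = Counter(addr.split("@", 1)[1] for addr in addresses)
for domain, n in counts.most_common():
    print(f"{n}\t{domain}")
```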

    Furthermore, these are some network/directory paths found in Title and Hyperlinks tags:

    http://cd0.bistro.ro.minjus/cgi1frnt.exe
    Sggv12fkdbbTemplates GMODCDC-kl+DB 2.jpg
    VAF0002groups03$COAlHDPDPLMAPMPmm100 Projecten190 Business Proces Redesign296Fase 11a-04-Digitaliseren formulierenPi Digitale formulierenkastLogo defensie.gif
    tante-eshome$LienekeSdatapdfHeffingsverordening Marktgelden Zeeburg 2009, tabel 2009.d…
    sk1ntdata03homedir$MRoosDesktopTekening deel 1.xps
    U:wp51wp51verlof tbsgesteldeverlof tbs gestelde 7-7-2010.wpd
    N:HDP AIMPF GO4 Processen99. Financiële werkinstructiesWerkmap Gerard3TekentjesLogo defensie.gif
    T:_PPentaBP Badge.jpg
    F:dataProjectenCivTecGroenBomenbeleidsplanBomenbeleidsplan 25-10-2010 Totaal (1)
    G:Realisatie en BeheerTeam Vastgoed en ProjectenStedenbouwKimAlgemeenkomgrenzenkomgrenzen Layout1 (1)
    sfgvp12FEBCOBiaProjectenZeusZeuswerkODPPwerkChrisLogo´sdefensie.wmf, sfgvp12FEBCOBiaProjectenZeusZeuswerkODPPwerkChrisLogo´sdefensie.wmf, sfgvp12FEBCOBiaProjectenZeusZeuswerkODPPwerkChrisLogo´sdefensie.wmf
    V:SHAREDNICS SHAREDEDASDRAFTSKisnerOPS 2008FINAL OPs 2008Copy of 2008 NICS OPERATIONS REPORT PDF.wpd
    V:SHAREDNICS SHAREDEDASDRAFTSKisnerOPS 20072007 Operations Report PDF.wpd
    H:AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA2006 NICS Operations Report PDF.wpd
    N:AutocadMedewerker TVDBStrooiroutes Berkenwoude (1)
    /cgi-bin/pdcsns.cgi?user=%26dir=/9202000/g/%26filename=/9202000/t5.sns%26via=direct%26v01=25910
    C:WpdocWORD97LogoConsLogoCons.jpg
    C:Documents and Settingsu0072s1Local SettingsTemporary Internet Filesu0072s1Local Settingsu00a110Local SettingsLocal Settingsu00g5m0Local SettingsTemporary Internet FilesOLK12wetsuwior1tog00.htm

    Public Figures and Their Personal Data

    Politicians are public figures and therefore have reduced reasonable expectations of privacy. The Dutch House of Representatives provides information about all 150 representatives in a single XML file: http://www.tweedekamer.nl/xml/kamerleden.xml (mirror of today’s copy; also in Google-cache, but not archive.org). Some of the personal information it contains (not all values are present for all representatives):

    1. full name
    2. gender
    3. date of birth
    4. place of birth
    5. home town
    6. education
    7. work experience
    8. work e-mail (@tweedekamer.nl)
    9. travels 
    10. personal website
    11. personal statement
    12. (past) affiliations w/foundations, associations
    13. political affiliation
    14. photo 
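Reading such a file is straightforward with the stdlib XML parser. Note that the element and tag names below are guesses for illustration only; the real schema of kamerleden.xml should be checked against the file itself:

```python
# Parse a (hypothetical, simplified) kamerleden.xml fragment.
import xml.etree.ElementTree as ET

sample = """<kamerleden>
  <kamerlid>
    <naam>J. Jansen</naam>
    <geboortedatum>1970-01-01</geboortedatum>
    <woonplaats>Den Haag</woonplaats>
  </kamerlid>
</kamerleden>"""

root = ET.fromstring(sample)
for lid in root.findall("kamerlid"):
    print(lid.findtext("naam"), lid.findtext("geboortedatum"), lid.findtext("woonplaats"))
```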

    When I stumbled upon that file, the following thoughts came to mind:

    • I hope these public figures don’t use that information as password or answer to security question in their private life.
    • With personal data being readily available, these high-profile targets surely must already have been victims (though perhaps unaware of it) of password-guessing and social engineering attacks?
    • If they aren’t, is that…
      • …because nobody cared to target them?
      • …because this particular knowledge does not pose a threat?
        • …because their personal subscriptions/service-usage is unknown?
          • E.g. you don’t know they use Gmail, which bank, insurance, webshops.
        • …because their personal logins/names are unknown?
          • E.g. you know they are customer/employee/student at X but you don’t know their username for logging in to X
        • …because this personal info was not used as password or answer to a security question?
          • E.g. you know <username>@gmail.com but can’t guess the password
        • …because this personal info is, by itself, insufficient to compromise accounts?
          • E.g. more information is needed (SSN, bank account number), or multifactor authentication requires possession of token
      • …because of something else?

    In a sense, our representatives function as guinea pigs for testing assumptions about the risk associated with disclosing personal data — or rather, at least with disclosing this particular personal data. Disclosing SSN, bank account numbers, credit card numbers and DigiD credentials probably remains a bad idea.

    UPDATE 2011-04-23: I suddenly realize that A Study on the Re-Identifiability of Dutch Citizens (.pdf) presented at HotPETS 2010 is relevant here. Guido van ‘t Noordende, Cees de Laat and I studied registry office (GBA) data of 2.7 million Dutch citizens (~16% of the total population) to explore their identifiability by various quasi-identifiers consisting of partial or full postal code, partial or full date of birth and gender. We also included this one (tables 2 and 3 in the paper):

    QID = { town + date-of-birth + gender }

    The median anonymity set size was 2, meaning that half of the combinations of town + date of birth + gender in our data set either unambiguously identified an individual (Dutch citizen) or identified a group of only 2 individuals. The numbers vary depending on town size, but for ~37% of Dutch citizens in our set that QID identifies a group of 5 or fewer individuals. As you see in the above list, the disclosed personal information possibly includes quasi-identifier value + real identity for the representatives. Just thought this is worth mentioning.
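The anonymity-set computation behind these numbers can be sketched as follows: group records by the quasi-identifier and inspect the group sizes. The records below are fabricated for illustration, not GBA data:

```python
# Anonymity set size per quasi-identifier value (town, date-of-birth, gender).
from collections import Counter
from statistics import median

records = [  # fabricated (town, date_of_birth, gender) tuples
    ("Amsterdam", "1970-01-01", "M"),
    ("Amsterdam", "1970-01-01", "M"),
    ("Sneek",     "1980-05-05", "F"),
]
set_sizes = Counter(records)                  # QID value -> anonymity set size
sizes_per_person = [set_sizes[r] for r in records]
print(median(set_sizes.values()))             # median anonymity set size
print(sum(s <= 5 for s in sizes_per_person) / len(records))  # share in sets of <= 5
```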

    Since the data is publicly available anyway: here is the list of all representatives and their quasi-identifier value.

    U.S.-Owned Trackers on Dutch Govt Websites

    I used Firebug and manual code inspection to puzzle out which Dutch govt websites have which (ad)trackers, like Google Analytics and comScore (which bought Nedstat in Q3/2010). Some reflection is desirable, IMHO, on whether or not to disclose which (Dutch) IP address accessed what (Dutch govt) content to foreign-owned companies whose government may require/force them to hand that data over. Note: I only looked at the homepage of each site.
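The manual inspection can be approximated by scanning each homepage's HTML for known tracker hostnames; the hostname list and sample HTML below are illustrative only:

```python
# Naive tracker detection: substring scan for known tracker hostnames.
TRACKER_HOSTS = ("google-analytics.com", "nedstat.net", "scorecardresearch.com")

def trackers_in(html: str):
    """Return the known tracker hostnames that appear in a page's HTML."""
    return sorted(h for h in TRACKER_HOSTS if h in html)

sample = '<script src="http://www.google-analytics.com/ga.js"></script>'
print(trackers_in(sample))  # → ['google-analytics.com']
```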

    First the good (tracking-free –> kudos!):

    Then the bad:

    I don’t know what data is collected / is not collected by the various trackers, and lack the time to carry out that analysis. If you feel like it, please do so; I will be more than happy to link to your results or post them on this blog on your behalf.