UPDATE 2015-03-13: interesting article in PLOS Biology: The Extent and Consequences of P-Hacking in Science (2015, Head, Holman, Lanfear, Kahn & Jennions) + press release.
Two days ago I wrote a blogpost on the book The Cult of Statistical Significance by Stephen T. Ziliak and Deirdre N. McCloskey. It seems that the “Statement on the Proprieties of Substantive Significance“(*) proposed by the authors on pp. 249–250, in the chapter “What to Do”, is absent from the internet (Google Books being an exception) — while it is well worth sharing online. So I did some data entry. Typos are mine, emphasis is original.
——– 8< ——– 8< ——– 8< ——– 8< ——– 8< ——– 8< ——– 8< ——–
- Sampling variance is sometimes interesting, but a low value of it is not the same thing as scientific importance. Economic significance is the chief scientific issue in economic science; clinical significance is the chief issue in medical and psychiatric and pharmacological science; epidemiological significance is the chief issue in infectious disease science; and substantive significance is the chief issue in any science, from agronomy to zoology. No amount of sampling significance can substitute for it.
- In any case, scientists should prefer Neyman’s confidence intervals, Rothman’s p-value functions, Zellner’s random prior odds, Rossi’s real Type I error, Leamer’s extreme bound analysis, and above all Gosset’s real error bars (Student 1927) to the Fisher-circumscribed method of reporting sampling variance (Leamer 1982; Leamer and Leonard 1983; Zellner 2008). No uniform minimum level of Type I error should be specified or enforced by journals, governments, or professional associations.
- Scientists should prefer power functions and operating characteristic functions to vague talk about alternative hypotheses, unspecified. Freiman et al. (1978), Rossi (1990), and similar large-scale surveys of power against medium and large effect sizes should serve as minimum standards for small and moderate sample size investigations. Lack of power–say, less than 65 percent for medium-sized effects and 85 percent for large effects–should be highlighted. How the balance should be struck in any given case depends on the issues at stake.
- Competing hypotheses should be tested against explicit economic or clinical or other substantively significant standards. For example, in studies of treatments of breast cancer a range of the size and timing of expected net benefits should be stated and argued explicitly. In economics the approximate employment and earnings rates of workers following enactment of a welfare reform bill should be explicitly articulated. Is the Weibull distribution parameter of the cancer patient data substantively different from 1.0, suggesting greatly diminishing chances of patient survival? How greatly? What does one mean by the claim that welfare reform is “working”? In a labor supply regression does β = “about -0.05” on the public assistance variable meet a defensible minimum standard of oomph? In what units? At what level of power? Science needs discernible Jeffreys’ d‘s (minimum important effect sizes)–showing differences of oomph. It does not need unadorned yet “significant” t‘s.
- Hypothesis testing–examining probabilistic, experimental, and other warrants for believing one hypothesis more than the alternative hypotheses–should be sharply distinguished from significance testing, which in Fisher’s procedures assumes a true null [hypothesis, MRK]. It is an elementary point of logic that “If H, then O” is not the same as “If O, then H“. Good statistical science requires genuine hypothesis testing. As Jeffreys observed, a p-value allows one to make at best a precise statement about a narrow event that has not occurred.
- Scientists should estimate, not testimate. Quantitative measures of oomph such as Jeffreys’ d, Wald’s “loss function”, Savage’s “admissibility”, Wald and Savage’s “minimax”, Neyman-Pearson’s “decision”, and above all Gosset’s “net pecuniary advantage” should be brought back to the center of statistical inquiry.
- Fit is not a good all-purpose measure of scientific validity, and should be deemphasized in favor of inquiry into other measures of error and importance.
——– 8< ——– 8< ——– 8< ——– 8< ——– 8< ——– 8< ——– 8< ——–
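The statement above keeps urging two concrete habits: report an interval of effect sizes rather than a bare p-value, and state the power of your design against medium and large effects. As a minimal illustrative sketch (mine, not the authors’; the normal-approximation formulas and the d = 0.5, n = 64 numbers are assumptions chosen for illustration), here is how the “β = about -0.05” labor-supply example might be reported with an interval and a power figure:

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def confidence_interval(estimate, se, z=1.96):
    """Normal-approximation 95% CI for a coefficient estimate."""
    return (estimate - z * se, estimate + z * se)

def power_two_sample_z(effect_size, n_per_group, z_alpha=1.96):
    """Approximate power of a two-sided, two-sample z-test
    (equal group sizes) against a standardized effect size."""
    ncp = effect_size * math.sqrt(n_per_group / 2.0)  # noncentrality
    return norm_cdf(ncp - z_alpha) + norm_cdf(-ncp - z_alpha)

# Hypothetical regression coefficient on the public assistance
# variable: "significant", but is the whole interval oomph-worthy?
lo, hi = confidence_interval(-0.05, 0.02)
print(f"95% CI for beta: ({lo:.3f}, {hi:.3f})")

# Power against a medium (d = 0.5) effect with 64 subjects per group.
print(f"power: {power_two_sample_z(0.5, 64):.2f}")
```

The interval (-0.089, -0.011) excludes zero, so the coefficient is “significant” — yet the reader can now judge whether effects anywhere in that range matter substantively, which is the oomph question. And n = 64 per group gives roughly 80 percent power against a medium effect, the kind of figure the statement asks authors to highlight rather than leave implicit.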
I will add bibliography and hyperlinks to this post later this week.
(*) Which the authors abbreviate as SpSS — I wonder whether that is an intended pun/reference to SPSS 🙂