P-values, statistics, and historical science
I stumbled across a series of articles about P-values on fivethirteight.com that explore how P-values are used and interpreted by scientists. One of the articles includes reporting on a recently published statement by the American Statistical Association on P-values, which states: “Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.” I’ve sketched a small simulation of this definition after the list below. The full statement goes on to list a number of qualifying principles (given that it was written by a board of ~20 academics, it’s a miracle that there are only six items on this list).
- P-values can indicate how incompatible the data are with a specified statistical model.
- P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
- Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
- Proper inference requires full reporting and transparency.
- A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
- By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
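To make the informal definition above concrete, here is a minimal sketch in Python. Everything in it is my own assumption for illustration, not anything from the ASA statement or the articles: the data values are made up, and I’ve chosen a pooled-normal null as the “specified statistical model.” It simulates the sampling distribution of a mean difference under that model and counts how often the simulated statistic is at least as extreme as the observed one.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical measurements for two groups (illustrative values only).
group_a = np.array([5.1, 4.8, 5.6, 5.0, 5.4, 4.9, 5.2, 5.3])
group_b = np.array([4.6, 4.9, 4.4, 4.7, 5.0, 4.5, 4.8, 4.6])
observed = group_a.mean() - group_b.mean()

# Specified statistical model (an assumption for this sketch): both groups
# are draws from one normal distribution, with parameters estimated from
# the pooled data.
pooled = np.concatenate([group_a, group_b])
mu, sigma = pooled.mean(), pooled.std(ddof=1)

# Simulate the sampling distribution of the mean difference under that model.
n_sims = 100_000
sim_a = rng.normal(mu, sigma, size=(n_sims, group_a.size)).mean(axis=1)
sim_b = rng.normal(mu, sigma, size=(n_sims, group_b.size)).mean(axis=1)
sim_diffs = sim_a - sim_b

# Two-sided p-value: the fraction of simulated differences equal to or more
# extreme than the observed one -- the ASA's informal definition.
p_value = np.mean(np.abs(sim_diffs) >= abs(observed))
print(f"observed difference: {observed:.3f}, simulated p-value: {p_value:.4f}")
```

Note that nothing in this sketch says anything about whether the studied hypothesis is true; it only measures how surprising the observed statistic would be if the specified model held, which is exactly the distinction the principles above are drawing.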
The full report is well worth reading, but many of the scientific disciplines I am familiar with appear to be moving away from P-values in favor of Bayesian or information-theoretic approaches. That makes particular sense in a historical science such as phylogeography, where frequentist hypothesis testing typically relies on parametric simulation, which may be difficult given the considerable uncertainty around important parameters. My favorite introduction to these topics is the book Model Based Inference in the Life Sciences by David Anderson.
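For contrast, here is an equally hedged sketch of the kind of information-theoretic comparison Anderson’s book covers: scoring a one-mean and a two-mean model of the same hypothetical data with AIC and Akaike weights, rather than testing against a significance threshold. The data and the two candidate models are again illustrative assumptions of mine.

```python
import numpy as np
from scipy import stats

# The same hypothetical groups as in the earlier sketch.
group_a = np.array([5.1, 4.8, 5.6, 5.0, 5.4, 4.9, 5.2, 5.3])
group_b = np.array([4.6, 4.9, 4.4, 4.7, 5.0, 4.5, 4.8, 4.6])
pooled = np.concatenate([group_a, group_b])

def normal_loglik(x, mu, sigma):
    """Log-likelihood of data x under a normal(mu, sigma) model."""
    return stats.norm.logpdf(x, mu, sigma).sum()

# Model 1: one shared mean (k = 2 parameters: mu, sigma).
# np.std with the default ddof=0 is the maximum-likelihood estimate.
ll1 = normal_loglik(pooled, pooled.mean(), pooled.std())

# Model 2: separate group means, shared sigma (k = 3 parameters).
resid = np.concatenate([group_a - group_a.mean(), group_b - group_b.mean()])
sigma2 = resid.std()
ll2 = (normal_loglik(group_a, group_a.mean(), sigma2)
       + normal_loglik(group_b, group_b.mean(), sigma2))

# AIC = 2k - 2 * log-likelihood; lower is better. Akaike weights turn the
# AIC differences into relative support for each model.
aic = np.array([2 * 2 - 2 * ll1, 2 * 3 - 2 * ll2])
delta = aic - aic.min()
weights = np.exp(-0.5 * delta) / np.exp(-0.5 * delta).sum()
for name, a, w in zip(["one mean", "two means"], aic, weights):
    print(f"{name}: AIC = {a:.2f}, Akaike weight = {w:.3f}")
```

The appeal of this framing for historical questions is that it yields relative evidence for each candidate model instead of a binary pass/fail verdict, and it doesn’t require simulating data under parameters we can’t pin down.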