The reign of the p-value is over: what alternative analyses could we employ to fill the power vacuum?
1. Introduction
The reified position of the p-value in statistical analyses went largely unchallenged for decades, despite criticism from statisticians and other scientists (e.g. [1–4]). In recent years, however, this unrest has intensified, with a plethora of new papers either driving home previous arguments against p or raising additional critiques (e.g. [5–11]). Catalysed by the part that the p-value has played in science's reproducibility crisis, this criticism has brought us to the brink of an uprising against p's reign.
Consequently, an analysis power vacuum is forming, with a range of alternative approaches vying to fill the space. Commentaries that criticize the p-value often suggest alternative paradigms of statistical analysis, and a number of options have now taken seed in the field of biology. New statistical methods typically involve concepts that are counterintuitive to our p-based training; they represent radically different ways of interrogating data that involve disparate approaches to generating evidence, different software packages and a host of new assumptions to understand and justify. The steep curves for learning new methods could stifle further expansion of their use in lieu of p-centred statistical analyses in the biological sciences.
To provide clarity and confidence for biologists seeking to expand and diversify their analytical approaches, this article summarizes some tractable alternatives to p-value centricity. But first, here is a brief overview of the limits of the p-value and why, on its own, it is rarely sufficient to interpret our hard-earned data. Along with many other eminent statisticians, Jacob Cohen and John Tukey have written cogently about their concerns with the fundamental concept of null hypothesis significance testing. Because the p-value is predicated on the null hypothesis being true, it does not give us any information about the alternative hypothesis, the hypothesis we are usually most interested in. Compounding this problem, if our p-value is high and so does not reject the null hypothesis, this cannot be interpreted as the null being true; rather, we are left with an 'open verdict' [2]. Moreover, with a large enough sample size, the null hypothesis will inevitably be rejected; perversely, a p-value based statistical result is as informative about our sample as it is about our hypothesis [12,13].
Recently, further concerns have been documented about p, linking the p-value to problems with experimental replication [5]. Cumming [7] and Halsey et al. [6] demonstrated that p is 'fickle', in that it can vary greatly between replicates even when statistical power is high, and argued that this makes interpretation of the p-value untenable unless p is extremely small. Colquhoun [8,14] has argued that significant p-values just below 0.05 are extremely weak evidence against the null hypothesis because there is roughly a one in three chance that the significant result is a false positive (a type 1 error). Interpreting p dichotomously as 'significant' or 'not significant' is particularly egregious for many reasons, but most pertinent here is that this approach encourages failed experiment replication. Studies are often designed to have 80% statistical power, meaning that there is an 80% chance that an effect present in the data will be detected. As Wasserstein & Lazar [9] explain, the probability of two identical studies statistically powered to 80% both returning p ≤ 0.05 is at best 80% × 80% = 64%, while the probability of one of these studies returning p ≤ 0.05 and the other not is 2 × 80% × 20% = 32%. Together, these papers and calculations demonstrate that the p-value is typically highly imprecise about the amount of evidence against the null hypothesis, and thus p should be considered as providing only loose, first-pass evidence about the phenomenon being studied [6,15,16].
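To make the fickleness of p concrete, here is a minimal simulation of my own (not taken from the cited papers) of replicate two-sample experiments run at roughly 80% power; the sample size and effect size are arbitrary choices made only for this sketch.

```r
# Hypothetical replicate experiments at ~80% power: how much does p vary?
set.seed(1)
n <- 23     # per-group sample size (chosen to give ~80% power for d = 0.85)
d <- 0.85   # assumed standardized effect size
p_vals <- replicate(10000,
  t.test(rnorm(n, mean = d), rnorm(n, mean = 0))$p.value)
mean(p_vals <= 0.05)            # empirical power, close to 0.8
quantile(p_vals, c(0.1, 0.9))   # middle 80% of p spans several orders of magnitude
```

Even at this level of power, the middle 80% of replicate p-values stretches roughly from below 0.0001 to above 0.05, which is the variability that Cumming [7] and Halsey et al. [6] describe.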
With the broadening realization among biologists that p-values provide only tentative evidence about our data, and indeed that exactly what this evidence tells us is easy to misinterpret, it is important that we equip ourselves with a broad understanding of the statistical options available to clarify, or even supplant, p. While it will be hard to extricate ourselves from our indoctrinated approach of interpreting every statistical analysis through the prism of significance or non-significance, we can be motivated by the knowledge that there really are other ways, and indeed more intuitive ways, to investigate our data. Below, I provide a quick-and-easy guide to some simple yet powerful statistical options currently available to biologists conducting standard study designs. Each distinct statistical approach interrogates the data through a different lens, i.e. by asking a fundamentally different scientific question; this is reflected in the subsection headings that follow. We shall start with the option that departs least from the p-value paradigm: augmenting p with information about its variability.
2. p-Value: how much evidence is there against the null hypothesis?
p provides unintuitive information about your data. However, perhaps it can best be interpreted as characterizing the evidence in the data against the null hypothesis [10,17]. And despite its limitations, the p-value has attractive qualities. It is a single number from which an objective interpretation about data can be made. Moreover, arguably that interpretation is context independent; p-values can be compared across different types of studies and statistical tests [18] (though see [10]). Huber [19] argues that focusing on the p-value is a suitable first step for screening of multiple hypotheses, as occurs in 'high throughput biology' such as gene expression analysis and genome-wide association studies.
However, p is let down by the considerable variability it exhibits between study samples, variability that is disguised by the reporting of p as a single value to several decimal places. Arguably, then, if you want to keep calculating p as part of your analyses of single tests, you ought to provide some additional information about this variability, to inform the reader about the uncertainty of this statistic. One way to achieve this is to provide a value somewhat akin to the confidence interval around an effect size, which characterizes the uncertainty of your study p-value and is termed the p-value prediction interval [7]. Another option is to calculate the prediction interval that characterizes the uncertainty of the p-value of a future replicate study. Lazzeroni et al. [18] provide a simple online calculator for both. Based on this calculator, if the p-value from your experiment is, for example, 0.01, it will have a 95% prediction interval of 5.7 × 10⁻⁶ to 0.54. Clearly, this would provide us with little confidence that p is replicable under this experimental scenario. A p-value of 0.0001 has a 95% prediction interval of 0–0.05. In this second scenario, the 95% prediction interval of a future replicate study is 0–0.26. Vsevolozhskaya et al. [20] argue that the prediction interval around p calculated by this method underestimates both the lower and upper bounds. Regardless, the width of the prediction interval, however calculated, will be surprisingly large to those of us accustomed to seeing the p-value as a naked single value reported to great precision.
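The sketch below is my own rough reconstruction, assuming a normal-theory two-sided test, of the kind of interval the Lazzeroni et al. calculator [18] reports; it is not their published code, but it approximately reproduces the numbers quoted above.

```r
# Approximate 95% prediction interval for a p-value (same study or a replicate),
# assuming the test statistic is normally distributed with standard error 1.
p_interval <- function(p, level = 0.95, replicate = FALSE) {
  z  <- qnorm(1 - p / 2)                  # z-score implied by the observed p
  se <- if (replicate) sqrt(2) else 1     # a future replicate adds sampling noise
  half_width <- qnorm(1 - (1 - level) / 2) * se
  c(lower = 2 * pnorm(-(z + half_width)),
    upper = min(1, 2 * pnorm(-(z - half_width))))
}
p_interval(0.01)                      # roughly 5.7e-06 to 0.54
p_interval(0.0001)                    # roughly 0 to 0.05
p_interval(0.0001, replicate = TRUE)  # roughly 0 to 0.26
```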
If you have calculated the planned power of your study and are prepared to quantify the level of belief you had, before conducting the experiment, that the null hypothesis is true, you can augment p with an estimate of the probability that a significant p-value is falsely rejecting the null hypothesis. This is termed the estimated false positive (discovery) risk, and can be easily estimated from a simple Bayesian framework (see later) ([9] and the comment by Altman appended to [9]):

false positive risk = (p × π0) / (p × π0 + (1 − β) × (1 − π0)),

where p is the p-value of your study, π0 is the probability that the null hypothesis is true based on prior evidence, and (1 − β) is the study power.
For example, if you have powered your study to 80% and before you conduct your study you think there is a 30% possibility that your perturbation will have an effect (thus π0 = 0.7), then if your analysis returns p = 0.05, the estimated false positive risk is 13%. That is, replicates of this experiment that indicated a statistically significant effect of the perturbation would be incorrect in doing so about 13% of the time. Bear in mind, however, that given the aforementioned fickleness of p, this estimate of false positive risk could be equally capricious. This concern can be circumvented for high throughput studies by replacing p in the equation above with α (the significance threshold of the statistical test) and estimating π0 from the observed p-values [9,21].
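As a worked example of the formula above, here is a small helper function of my own (the function and argument names are mine, not from the cited sources):

```r
# Estimated false positive risk given p, study power (1 - beta) and the prior
# probability pi0 that the null hypothesis is true.
false_positive_risk <- function(p, power, pi0) {
  (p * pi0) / (p * pi0 + power * (1 - pi0))
}
false_positive_risk(p = 0.05, power = 0.8, pi0 = 0.7)  # ~0.13, i.e. 13%
```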
For those not conducting high throughput studies and who do not like the idea of subjectively quantifying their a priori expectations about the veracity of their experimental perturbation, the calculations can be flipped such that your p-value is accompanied by a calculation of the prior expectation that would be needed to produce a specified risk (e.g. 5%) of a significant p-value being a false positive ([8]; the author provides an easy-to-use web calculator for this purpose: http://fpr-calc.ucl.ac.uk/). This provides an alternative way of assessing the likelihood that a significant p-value is a true positive. If, for example, your p-value is 0.03 for a study powered to about 70%, to limit the risk of a false positive to 5% your prior expectation that the perturbation will have an effect would need to be 77% (based on the 'p-equals' case; [8]).
3. Effect size and confidence interval: how much and how accurate?
A statistically significant result tells us relatively little about the phenomenon we are studying: only that the null hypothesis of no 'effect' in our data (which we already knew wasn't true to some level of precision; [13]) has been rejected [22]. Instead of the p-value scientific question 'is there or isn't there an effect?', considerably more information is garnered by asking 'how strong is the effect in our sample?' coupled with the question 'how accurate is that value as an estimate of how strong the population effect is?'.
The most straightforward way to analyse your data in order to answer these two questions is to calculate the effect size in the sample along with the 95% confidence interval around that estimate [6,7,23–26]. Fortunately, the effect size is often easy to calculate or extract from statistical outputs, since it is typically the mean difference between two groups or the strength of the correlation between two variables. And while the definition of a confidence interval is convoluted, Cumming & Calin-Jageman [27] compellingly argue that it is reasonable to interpret a confidence interval as an indication of the accuracy of the effect size estimate; it indicates the likely estimation error.
The calculations of confidence intervals and p-values share the same mathematical framework [28,29], but this does not detract from the fact that focusing interpretation of data on effect sizes and their confidence intervals is a fundamentally different approach from focusing interpretation on whether or not to reject the null hypothesis [11]. These two procedures ask very different questions about the data and elicit distinct answers [30]. For instance, a study on the effects of two different ambient temperatures on paramecium diameter returning an effect size of 20 µm and a p-value of 0.1, if centred on p-value interpretation, would conclude 'no effect' of temperature, despite the best-supported effect size being 20 µm, not 0. An interpretation based on effect size and confidence intervals could, for example, state: 'Our results suggest that paramecium kept at the lower temperature will be on average 20 µm larger in size, though a difference in size ranging between −4 and 50 µm is also reasonably likely'. As Amrhein et al. [11] point out, the latter approach acknowledges the uncertainty in the estimated effect size while also ensuring that you do not make a false claim either of no effect if p > 0.05, or an overly confident claim. And if all the values within the confidence interval are biologically unimportant, then a statement that your results indicate no important effect can also be made [11]. (This is an example of where focusing on effect size and uncertainty also allows clear yes/no interpretations if desired; see also [31].)
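As a minimal sketch of this reporting style, the following uses hypothetical paramecium diameters; the numbers are invented for illustration and chosen only to resemble the example above.

```r
# Report the mean difference and its 95% CI; p is kept only as supporting detail.
set.seed(2)
cold <- rnorm(15, mean = 520, sd = 35)  # diameters (µm) at the lower temperature
warm <- rnorm(15, mean = 500, sd = 35)  # diameters (µm) at the higher temperature
fit <- t.test(cold, warm)
unname(fit$estimate[1] - fit$estimate[2])  # effect size: mean difference (µm)
fit$conf.int                               # 95% confidence interval
fit$p.value                                # p-value, reported but not the focus
```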
The approach of focusing on effect size estimation is usually accompanied by an emphasis on visualization of the data to support their evaluation. A strong graphical format that achieves this involves a main panel showing the raw data and side panels helping to illustrate the estimated effect size [32]. Refer to the electronic supplementary material for an example plot (figure S1). Such plots, while intuitive, are not typically available in statistical packages and are not easy to code in programming languages. However, Ho and colleagues [32] have recently developed 'Data Analysis with Bootstrap-coupled ESTimation' (DABEST), available in versions for Matlab, Python and R, as well as a webpage https://www.estimationstats.com/#/. All versions have convenient, rote instructions to produce graphs that allow full exploration of your data.
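For readers who want the flavour of bootstrap-coupled estimation without installing anything, here is a base-R sketch of the underlying idea; it is my own simplification, and the DABEST packages themselves produce the full estimation plots described by Ho et al. [32].

```r
# Bootstrap distribution of the mean difference for two hypothetical groups.
set.seed(3)
cold <- rnorm(15, mean = 520, sd = 35)  # hypothetical diameters (µm)
warm <- rnorm(15, mean = 500, sd = 35)
boot_diffs <- replicate(5000,
  mean(sample(cold, replace = TRUE)) - mean(sample(warm, replace = TRUE)))
quantile(boot_diffs, c(0.025, 0.975))   # bootstrap 95% CI of the difference
hist(boot_diffs, xlab = "Difference in diameter (µm)",
     main = "Bootstrap distribution of the mean difference")
```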
Scientific research seeks to home in on 'answers', and estimated effect sizes and their confidence intervals are central to this goal. In biology at least, homing in on an answer almost inevitably requires multiple studies, which then need to be analysed together, through meta-analysis. Effect sizes and confidence intervals are the vital information for this process (e.g. [33]), providing another good argument for their thorough reporting in papers. Typically, the confidence intervals around an effect size calculated from a meta-analysis are much smaller than those of the individual studies [34], thus giving a much clearer picture about the true, population-level effect size (figure 1). However, meta-analyses can be deeply compromised by the 'file drawer phenomenon', where non-significant results are not published [36], either because researchers do not submit them or because journals will not accept them [37]. Fortunately, the attitudes of science funders, publishers and researchers are starting to change about the value and importance of reporting non-significant results; this momentum needs to continue.
4. Bayes factor: what is the evidence for one hypothesis compared to another?
In contrast to the p-value, which is calculated by assuming the null hypothesis is true, the Bayes factor directly addresses both the null and the alternative hypotheses. The Bayes factor quantifies the relative evidence in the data you have collected about whether those data are better predicted by the null hypothesis or by the alternative hypothesis (an effect of stated magnitude). For example, a Bayes factor of 5 indicates that the strength of evidence is five times greater for the alternative hypothesis than for the null hypothesis; a Bayes factor of 1/5 indicates the reverse.
The Bayes factor is a simple and intuitive way of undertaking the Bayesian version of null hypothesis significance testing. Only recently have Bayes factors been made tractable for the practising biologist, and they are now easily calculable for a range of standard study designs. The Bayes factors for many designs can be run on web-based calculators (e.g. http://pcl.missouri.edu/bayesfactor) and are also available in an R package called BayesFactor [38].
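A brief sketch of how such an analysis might look with the BayesFactor package [38] for a two-group comparison; the data are simulated and the package's default ('medium') prior on effect size is assumed.

```r
library(BayesFactor)
set.seed(4)
control <- rnorm(20, mean = 0)
treated <- rnorm(20, mean = 0.8)
bf <- ttestBF(x = treated, y = control)  # Bayes factor: alternative vs null
bf
1 / bf                                   # the same evidence, expressed for the null
```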
A controversy of the Bayesian approach is the need for you to specify your strength of belief in the effect being studied before the experiment takes place (the prior distribution of the alternative hypothesis) [39]. Thus, your somewhat subjective choice of 'prior' influences the outcome of the analysis. Schönbrodt et al. [40] argue that this criticism of Bayesian statistics is often exaggerated because the influence of the prior is limited when a reasonable prior distribution is used. You can assess the influence of the prior with a simple sensitivity analysis whereby the analysis is run using a bounded range of realistic prior probabilities [41]. There is also a default prior that you can apply in the common situation when you have little pre-study evidence for the expected effect size.
However, undertaking Bayesian analyses is more involved than null hypothesis significance testing, and specifying the prior undoubtedly adds some degree of subjectivity. Fortunately, there is a single, simple formula that you can use to convert a p-value to a form of the Bayes factor without any other information. This simplified Bayes factor, termed the upper bound, states the maximum amount by which the data favour the alternative hypothesis over the null hypothesis, across any reasonable prior distribution (comment by Benjamin and Berger appended to [9]; see also Goodman [42]):

Bayes factor upper bound = 1 / (−e × p × ln(p)),

where p is the p-value and e is the base of natural logarithms (the bound applies for p < 1/e).
For example, if your data generate a p-value of 0.07 (sometimes termed a 'trend'), the Bayes factor upper bound is 1.98 and you can conclude that the alternative hypothesis is at most twice as likely as the null hypothesis. A p-value of 0.01 indicates the alternative hypothesis is at most 8 times as likely as the null. Benjamin and Berger argue that this approach is an easily interpretable alternative to p, which should satisfy both practitioners of Bayesian statistics and practitioners of null hypothesis significance testing (comment by Benjamin and Berger appended to [9]).
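The upper bound is a one-line calculation; here is a small helper of my own reproducing the two examples just given:

```r
# Bayes factor upper bound from a p-value (valid for p < 1/e).
bf_upper_bound <- function(p) 1 / (-exp(1) * p * log(p))
bf_upper_bound(0.07)  # ~1.98
bf_upper_bound(0.01)  # ~8.0
```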
Schönbrodt et al. [40] make the case that the Bayes factor can be used to inform when a study has secured a sufficient sample size and can be halted. Effective stopping rules in research can be invaluable for limiting time and financial costs while increasing study replicability, and are ethically important for certain animal studies or intrusive human studies; the use of subjects should be minimized while ensuring that the experiments are robust and reproducible (https://www.nc3rs.org.uk/the-3rs; [43]). Arguably, stopping rules should be used much more than they currently are, and can be a far more effective method for targeting a suitable sample size than power analysis. A big mistake often made, however, is to implement the p-value in the stopping rule: the study is stopped when the data collected thus far return a statistically significant p-value. The underlying assumption is that increasing the sample size further would probably decrease p further. A simple model demonstrates this thinking to be spurious and thus that it drives very bad practice (figure 2). For those of us basing our study on the p-value, it is far preferable to continue a study until a pre-determined sample size is reached that has been decided by a priori power analysis [45]. However, this approach is greatly influenced by the associated a priori effect size estimate we have provided, and there can be a strong temptation to increase sample size beyond the pre-determined number; researchers longing for a statistically significant effect can easily succumb to the temptation of collecting extra data points when their p-value is 0.06 or 0.07 [46].
The Bayes factor is much more appropriate here. It provides evidence for the null, and with a large enough sample the Bayes factor will converge on 0 (the null is true) or infinity (the alternative is true). If the Bayes factor of your data reaches 10 or 1/10, this almost certainly represents the true state of affairs and your study can stop. Alternatively, if your study must be stopped for logistical reasons then the final Bayes factor can still be interpreted; for example, a Bayes factor of 1/7 would indicate moderate evidence for the null hypothesis. Moreover, you are entitled to continue sampling if you feel the data are not conclusive enough; if the results are unclear, collect more data. All such decisions do not affect interpretation of the Bayes factor [40]. A final big motivation for employing the Bayes factor over the p-value in stopping procedures is that in the long run, the former uses a smaller sample while at the same time generating fewer interpretation errors. A general consensus has not yet been reached about the most suitable priors for each situation, and tractable Bayes factor procedures have thus far only been produced for some experimental designs. But do not let this put you off. Instead of the Bayes factor, the Bayes factor upper bound, as described above, can be used.
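The following is an illustrative sketch, on simulated data, of a sequential design that stops once the Bayes factor reaches 10 or 1/10 (or a logistical maximum sample size is hit); it is a toy version of the kind of procedure described by Schönbrodt et al. [40], not their implementation, and the effect size and sample limits are arbitrary.

```r
library(BayesFactor)
set.seed(5)
n_min <- 10; n_max <- 100; threshold <- 10
control <- rnorm(n_min)
treated <- rnorm(n_min, mean = 0.6)
repeat {
  bf <- extractBF(ttestBF(x = treated, y = control))$bf
  if (bf >= threshold || bf <= 1 / threshold || length(control) >= n_max) break
  control <- c(control, rnorm(1))              # add one observation per group...
  treated <- c(treated, rnorm(1, mean = 0.6))  # ...and re-evaluate the evidence
}
c(n_per_group = length(control), bayes_factor = round(bf, 2))
```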
5. Akaike information criterion: what is the best understanding of the phenomenon being studied?
If your study involves measuring an outcome variable and multiple potential explanatory variables, then you have many possible models you could build to explain the variance in your data. Stepwise procedures of model building often focus on p-values, by holding onto only those explanatory variables associated with a low p. Aside from the general concerns about p, specific criticisms of p-value-based model building include the inflated risk of type 1 errors [47,48]. An alternative approach to model assessment is the Akaike information criterion (AIC), which can be easily calculated in statistical software packages, and in R using AIC() [49]. The AIC provides you with an estimate of how close your model is to representing full reality [50], or in other words its predictive accuracy [51]. Couched within the principle of simplicity and parsimony, a fundamental aspect of the AIC is that it trades off a model's goodness of fit against that model's complexity to insure against over-fitting [52].
Let's imagine you have generated three models, returning AICs of 443 (model 1), 445 (model 2) and 448 (model 3). Your preferred model in terms of relative quality will be the one that returns the minimum AIC. But you should not necessarily discard the other models. With the AIC calculated for multiple models, you can easily compute the relative likelihood that each of those models is the best of all presented models given your data, i.e. the relative evidence for each of them. For example, the preferred model will always have a relative evidence of 1, and in the current case the second best model, model 2, has relative evidence 0.37, and model 3 has 0.08. Finally, you can then compute an evidence ratio between any pair of models; following the above example, the evidence for model 1 over model 2 is 1/0.37 = 2.7, i.e. the evidence for model 1 is 2.7 times as strong. In this scenario, although model 1 has the absolute lowest AIC, the evidence that model 1 rather than model 2 is the best of those generated is not strong, and with some explanatory variables present in only one of the models, the most suitable response could be to base your inferences on both models [50]. The AIC approach encourages you to think hard about alternative models and thus hypotheses, in contrast to p-value interpretation, which encourages rejecting the null when p is small and supporting the alternative hypothesis by default [53]. More broadly, the AIC paradigm involves dropping hypotheses judged implausible, refining remaining hypotheses and adding new hypotheses, a scientific strategy that Burnham et al. [50] argue promotes fast and deep learning about the phenomenon being studied.
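The relative-evidence arithmetic above is simple to reproduce; this short sketch of my own uses the quoted AIC values and the standard relation, relative likelihood = exp((AICmin − AICi)/2):

```r
aic <- c(model1 = 443, model2 = 445, model3 = 448)
rel_evidence <- exp((min(aic) - aic) / 2)        # relative evidence for each model
round(rel_evidence, 2)                           # 1.00, 0.37, 0.08
rel_evidence["model1"] / rel_evidence["model2"]  # evidence ratio, ~2.7
```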
Although the AIC is mathematically related to the p-value (they are different transformations of the likelihood ratio; [29]), the former is far more flexible in the models it can compare. The AIC is a strong option for choosing between multiple models that you have generated to explain your data, i.e. for choosing which model represents your best understanding of the phenomenon you have measured, particularly when the observed data are complex and poorly understood and you do not expect your models to have particularly strong predictive power [54].
A key limitation of the AIC is that it provides a relative, not absolute, test of model quality. It is easy to fall into the trap of assuming that the best model is also a good model for your data; this may be the case, or instead the best model may have only half an eye on the variance in your data while all other models are blind to it. Quantifying the absolute quality of your best model(s) requires calculation of the effect size, as discussed earlier (in the case of models, typically R² is suitable).
6. Conclusion
Good science generates robust data ripe for interpretation. There are several broad approaches to the statistical analysis of data, each interrogating the collected variables through a distinct line of questioning. Popper [55] argued that science is defined by the falsifying of its theories. Taking this approach to science, p-values might be the rightful centrepiece of your statistical analysis since they provide evidence against the null hypothesis [10,17]. Building on this paradigm, you can easily enhance interpretation of the p-value by augmenting p with a prediction interval and/or an estimate of the false positive risk: information about p's reliability. A counter-argument, however, is that because the p-value does not test the null hypothesis or the alternative hypothesis, you can never use it to actually falsify a theory [56]. Converting the p-value into a Bayes factor attends to this concern, providing relative evidence for one hypothesis or the other. But many have argued that hypothesis testing by any approach is superseded by focusing on the effect in the data, specifically both its magnitude and accuracy, because your best estimate of the magnitude of the phenomenon you are studying is ultimately what you want to know. And if you conduct multi-variate analysis, particularly when the phenomenon under study is poorly understood, you can be well served by the AIC, which encourages consideration of multiple hypotheses and their gradual refinement.
It is important to emphasize that these manifold approaches are not all mutually exclusive; for example, many would argue that effect size estimates are an essential component of most analyses. Indeed, Goodman et al. [57] go so far as to recommend the use of a hybrid criterion for decision making that requires a low p-value coupled with an effect size above an a priori determined minimum considered relevant/important, in order to reject the null hypothesis. p-values can also be presented alongside Bayes factors for each statistical test conducted ('a B for every p'). Continuing to present p-values as part of your statistical output while diluting their interpretive power by including other statistical approaches should ensure your submission is not jeopardized, and indeed this approach is probably the best way to nudge reviewers and editors towards accepting, even encouraging, the application of alternative inferential paradigms (see Box 2 in [43]). Whatever your chosen statistical approach, it is important that this has been determined before data collection. Arming oneself with more statistical options could risk the temptation of trying different approaches until an exciting result is achieved; this must be resisted.
Regardless of the statistical paradigm you employ to investigate patterns in your data, many have recommended that the outputs from statistical tests should always be considered as secondary interrogations. Primarily, the argument goes, you should prioritize interpretation of graphical plots of your data, where possible, and treat statistical analyses as supporting or confirmatory information [25,58–60]. A plot that does not appear to support the findings of your statistical analysis should not be automatically explained away as a demonstration that your analysis has uncovered patterns deeper than can be visualized.
Finally, while I hope that this review might help readers feel a little more informed, and confident, about some of the additional and alternative statistical options to the p-value, it is worth reminding ourselves of Sir Ronald Fisher's pertinent words from his Presidential Address to the First Indian Statistical Congress in 1938 [61]: 'To call in a statistician after the experiment is done may be no more than asking him to perform a post mortem examination: he may be able to say what the experiment died of.' Without a good dataset, none of the statistical tools mentioned here will be effective. Moreover, even a good dataset represents just a single study, and it must not be forgotten that a single study provides limited information. Ultimately, replication is key to refining, and having confidence in, our understanding of the biological world.
Ethics
Consent was not required for this review.
Data accessibility
All data were generated by R code; the code will be made available on request.
Competing interests
I have no competing interests.
Funding
This study was not supported by funding.
Acknowledgements
The Laboratory Animal Care seminar in 2022 entitled '3R seminar: Study design', organised by the University of Oulu Graduate School, provided the catalyst for writing this article. I appreciate the feedback that I received on drafts of this article from Michael Pedersen, Dr Louise Soanes, Dr Mircea Iliescu and Professor Stuart Semple.
Footnotes
Electronic supplementary material is available online at https://dx.doi.org/10.6084/m9.figshare.c.4498919.
References
- 1. Cumming G. 2014 The new statistics: why and how. Psychol. Sci. 25, 7-29. (doi:10.1177/0956797613504966)
- 2. Cohen J. 1994 The Earth is round (p < 0.05). Am. Psychol. 49, 997-1003. (doi:10.1037/0003-066X.49.12.997)
- 3. Bakan D. 1966 The test of significance in psychological research. Psychol. Bull. 66, 423. (doi:10.1037/h0020412)
- 4. Berkson J. 1942 Tests of significance considered as evidence. J. Am. Stat. Assoc. 37, 325-335. (doi:10.1080/01621459.1942.10501760)
- 5. Nuzzo R. 2014 Statistical errors. Nature 506, 150-152. (doi:10.1038/506150a)
- 6. Halsey L, Curran-Everett D, Vowler S, Drummond G. 2015 The fickle P value generates irreproducible results. Nat. Methods 12, 179-185. (doi:10.1038/nmeth.3288)
- 7. Cumming G. 2008 Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspect. Psychol. Sci. 3, 286-300. (doi:10.1111/j.1745-6924.2008.00079.x)
- 8. Colquhoun D. 2017 The reproducibility of research and the misinterpretation of p-values. R. Soc. open sci. 4, 171085. (doi:10.1098/rsos.171085)
- 9. Wasserstein RL, Lazar NA. 2016 The ASA's statement on p-values: context, process, and purpose. Am. Stat. 70, 129-133. (doi:10.1080/00031305.2016.1154108)
- 10. Lew M. 2012 Bad statistical practice in pharmacology (and other basic biomedical disciplines): you probably don't know P. Br. J. Pharmacol. 166, 1559-1567. (doi:10.1111/j.1476-5381.2012.01931.x)
- 11. Amrhein V, Greenland S, McShane B. 2019 Retire statistical significance. Nature 567, 305-307. (doi:10.1038/d41586-019-00857-9)
- 12. Cohen J. 1990 Things I have learned (so far). Am. Psychol. 45, 1304. (doi:10.1037/0003-066X.45.12.1304)
- 13. Tukey JW. 1991 The philosophy of multiple comparisons. Stat. Sci. 6, 100-116. (doi:10.1214/ss/1177011945)
- 14. Colquhoun D. 2014 An investigation of the false discovery rate and the misinterpretation of p-values. R. Soc. open sci. 1, 140216. (doi:10.1098/rsos.140216)
- 15. Fisher R. 1959 Statistical methods and scientific inference, 2nd edn. New York, NY: Hafner Publishing.
- 16. Boos D, Stefanski L. 2011 P-value precision and reproducibility. Am. Stat. 65, 213-221. (doi:10.1198/tas.2011.10129)
- 17. Lew MJ. 2013 To P or not to P: on the evidential nature of P-values and their place in scientific inference. arXiv 1311.0081.
- 18. Lazzeroni LC, Lu Y, Belitskaya-Levy I. 2016 Solutions for quantifying P-value uncertainty and replication power. Nat. Methods 13, 107-108. (doi:10.1038/nmeth.3741)
- 19. Huber W. 2016 A clash of cultures in discussions of the P value. Nat. Methods 13, 607. (doi:10.1038/nmeth.3934)
- 20. Vsevolozhskaya O, Ruiz G, Zaykin D. 2017 Bayesian prediction intervals for assessing P-value variability in prospective replication studies. Transl. Psychiatry 7, 1271. (doi:10.1038/s41398-017-0024-3)
- 21. Altman N, Krzywinski M. 2017 Points of significance: Interpreting P values. Nat. Methods 14, 213-214. (doi:10.1038/nmeth.4210)
- 22. Tukey JW. 1969 Analyzing data: Sanctification or detective work? Am. Psychol. 24, 83-91. (doi:10.1037/h0027108)
- 23. Johnson D. 1999 The insignificance of statistical significance testing. J. Wildl. Manage. 63, 763-772. (doi:10.2307/3802789)
- 24. Nakagawa S, Cuthill I. 2007 Effect size, confidence interval and statistical significance: a practical guide for biologists. Biol. Rev. 82, 591-605. (doi:10.1111/j.1469-185X.2007.00027.x)
- 25. Loftus GR. 1993 A picture is worth a thousand p values: on the irrelevance of hypothesis testing in the microcomputer age. Behav. Res. Methods Instrum. Comput. 25, 250-256. (doi:10.3758/bf03204506)
- 26. Lavine M. 2014 Comment on Murtaugh. Ecology 95, 642-645. (doi:10.1890/13-1112.1)
- 27. Cumming G, Calin-Jageman R. 2016 Introduction to the new statistics: estimation, open science, and beyond. New York, NY: Routledge.
- 28. Cumming G, Fidler F, Vaux D. 2007 Error bars in experimental biology. J. Cell Biol. 177, 7-11. (doi:10.1083/jcb.200611141)
- 29. Murtaugh P. 2014 In defense of P values. Ecology 95, 611-617. (doi:10.1890/13-0590.1)
- 30. Spanos A. 2014 Recurring controversies about P values and confidence intervals revisited. Ecology 95, 645-651. (doi:10.1890/13-1291.1)
- 31. Calin-Jageman RJ, Cumming G. 2019 The new statistics for better science: ask how much, how uncertain, and what else is known. Am. Stat. 73 (suppl. 1), 271-280. (doi:10.1080/00031305.2018.1518266)
- 32. Ho J, Tumkaya T, Aryal S, Choi H, Claridge-Chang A. 2018 Moving beyond P values: Everyday data analysis with estimation plots. bioRxiv, 377978.
- 33. Sena ES, Briscoe CL, Howells DW, Donnan GA, Sandercock PA, Macleod MR. 2010 Factors affecting the apparent efficacy and safety of tissue plasminogen activator in thrombotic occlusion models of stroke: systematic review and meta-analysis. J. Cereb. Blood Flow Metab. 30, 1905-1913. (doi:10.1038/jcbfm.2010.116)
- 34. Cohn LD, Becker BJ. 2003 How meta-analysis increases statistical power. Psychol. Methods 8, 243. (doi:10.1037/1082-989X.8.3.243)
- 35. Ioannidis JP, Lau J. 1999 State of the evidence: current status and prospects of meta-analysis in infectious diseases. Clin. Infect. Dis. 29, 1178-1185. (doi:10.1086/313443)
- 36. Rosenthal R. 1979 The file drawer problem and tolerance for null results. Psychol. Bull. 86, 638. (doi:10.1037/0033-2909.86.3.638)
- 37. Lane A, Luminet O, Nave G, Mikolajczak M. 2016 Is there a publication bias in behavioural intranasal oxytocin research on humans? Opening the file drawer of one laboratory. J. Neuroendocrinol. 28. (doi:10.1111/jne.12384)
- 38. Morey RD. 2015 BayesFactor: Computation of Bayes factors for common designs. See https://cran.r-project.org/web/packages/BayesFactor/index.html.
- 39. Sinharay S, Stern HS. 2002 On the sensitivity of Bayes factors to the prior distributions. Am. Stat. 56, 196-201. (doi:10.1198/000313002137)
- 40. Schönbrodt FD, Wagenmakers E-J, Zehetleitner M, Perugini M. 2017 Sequential hypothesis testing with Bayes factors: efficiently testing mean differences. Psychol. Methods 22, 322. (doi:10.1037/met0000061)
- 41. Spiegelhalter D, Rice K. 2009 Bayesian statistics. Scholarpedia 4, 5230. (doi:10.4249/scholarpedia.5230)
- 42. Goodman SN. 2001 Of P-values and Bayes: a modest proposal. Epidemiology 12, 295-297. (doi:10.1097/00001648-200105000-00006)
- 43. Sneddon LU, Halsey LG, Bury NR. 2017 Considering aspects of the 3Rs principles within experimental animal biology. J. Exp. Biol. 220, 3007-3016. (doi:10.1242/jeb.147058)
- 44. O'Keefe D. 2007 Post hoc power, observed power, a priori power, retrospective power, prospective power, achieved power: sorting out appropriate uses of statistical power analyses. Commun. Methods Meas. 1, 291-299. (doi:10.1080/19312450701641375)
- 45. Cohen J. 1988 Statistical power analysis for the behavioural sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.
- 46. John LK, Loewenstein G, Prelec D. 2012 Measuring the prevalence of questionable research practices with incentives for truth telling. Psychol. Sci. 23, 524-532. (doi:10.1177/0956797611430953)
- 47. Mundry R, Nunn C. 2009 Stepwise model fitting and statistical inference: turning noise into signal pollution. Am. Nat. 173, 119-123. (doi:10.1086/593303)
- 48. Krzywinski M, Altman N. 2014 Points of significance: Comparing samples—part II. Nat. Methods 11, 355-356. (doi:10.1038/nmeth.2900)
- 49. Sakamoto Y, Ishiguro M, Kitagawa G. 1986 Akaike information criterion statistics. Dordrecht, The Netherlands: D. Reidel Publishing Company.
- 50. Burnham KP, Anderson D, Huyvaert K. 2011 AIC model selection and multimodel inference in behavioral ecology: some background, observations, and comparisons. Behav. Ecol. Sociobiol. 65, 23-35. (doi:10.1007/s00265-010-1029-6)
- 51. Gelman A, Hwang J, Vehtari A. 2014 Understanding predictive information criteria for Bayesian models. Stat. Comput. 24, 997-1016. (doi:10.1007/s11222-013-9416-2)
- 52. Burnham KP, Anderson DR. 2001 Kullback-Leibler information as a basis for strong inference in ecological studies. Wildl. Res. 28, 111-119. (doi:10.1071/WR99107)
- 53. Steidl RJ. 2006 Model selection, hypothesis testing, and risks of condemning analytical tools. J. Wildl. Manage. 70, 1497-1498. (doi:10.2193/0022-541X(2006)70[1497:MSHTAR]2.0.CO;2)
- 54. Ellison A, Gotelli N, Inouye B, Strong D. 2014 P values, hypothesis testing, and model selection: it's déjà vu all over again. Ecology 95, 609-610. (doi:10.1890/13-1911.1)
- 55. Popper K. 1963 Conjectures and refutations: the growth of scientific knowledge. London, UK: Routledge.
- 56. Gallistel C. 2009 The importance of proving the null. Psychol. Rev. 116, 439. (doi:10.1037/a0015251)
- 57. Goodman WM, Spruill SE, Komaroff E. 2019 A proposed hybrid effect size plus p-value criterion: empirical evidence supporting its use. Am. Stat. 73 (suppl. 1), 168-185. (doi:10.1080/00031305.2018.1564697)
- 58. Murtaugh P. 2014 Rejoinder. Ecology 95, 651-653. (doi:10.1890/13-1858.1)
- 59. Drummond G, Vowler S. 2011 Show the data, don't conceal them. J. Physiol. 589.8, 1861-1863. (doi:10.1113/jphysiol.2011.205062)
- 60. Masson M, Loftus GR. 2003 Using confidence intervals for graphically based data interpretation. Can. J. Exp. Psychol. 57, 203-220. (doi:10.1037/h0087426)
- 61. Fisher RA. 1938 Presidential address to the First Indian Statistical Congress. Sankhya 4, 14-17.
Source: https://royalsocietypublishing.org/doi/10.1098/rsbl.2019.0174