p-values in Critical Appraisal

p-values are often revered in research, sometimes to a fault. They are frequently cited in critical appraisals and appear regularly in exams, making it essential to understand their true meaning. But what exactly are p-values?

Welcome to the St Emlyn’s Podcast! Today we delve into the enigmatic world of p-values, a cornerstone of research that can make or break academic careers. As emergency medicine clinicians, it’s crucial to grasp not only the definition of p-values but also how to interpret them meaningfully in the context of clinical research. So, let’s embark on this journey to demystify p-values and enhance our critical appraisal skills.

Listening Time – 10:29

What are p-values?

In simple terms, a p-value is the probability of seeing a difference at least as large as the one observed, purely through random chance, if the null hypothesis were true. The null hypothesis typically states that there is no difference between two treatments or interventions. Thus, a p-value helps us judge whether the observed data are consistent with the null hypothesis.

The Null Hypothesis and Significance Testing

To fully comprehend p-values, we must start with the null hypothesis. In any trial, we begin with the premise that there is no difference between the treatments being tested. The goal is to test this null hypothesis and, ideally, to disprove it. This process is known as significance testing.

When we calculate a p-value, we are expressing the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. For instance, a p-value of 0.05 means that, if there were truly no difference, a result this extreme or more extreme would turn up about 5% of the time (1 in 20) through random variation alone. It is not the probability that the null hypothesis is true.
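To make that definition concrete, here is a minimal sketch in Python of a permutation test, one simple way of estimating a p-value. It is not the method used in any particular trial. The pain scores, the group sizes and the function name permutation_p_value are all invented for illustration.

```python
import random

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=1):
    """Two-sided permutation p-value for a difference in means.

    Under the null hypothesis the group labels are interchangeable, so we
    shuffle the labels many times and count how often the shuffled
    difference is at least as extreme as the one actually observed.
    """
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(group_b)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        # Small tolerance so that ties are counted despite rounding error.
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles

# Hypothetical pain scores (0 to 10) in two small groups.
treatment = [3, 4, 2, 5, 3, 4, 2, 3]
control = [5, 6, 4, 6, 5, 7, 4, 5]
p = permutation_p_value(treatment, control)
print(f"p = {p:.4f}")
```

The printed value is the fraction of label shuffles that produced a difference at least as large as the real one, which is exactly the "as extreme or more extreme, if the null hypothesis is true" idea described above.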

The Magic of 0.05

The threshold of 0.05 has become somewhat magical in the world of research. A p-value below this threshold is often considered statistically significant, while one above is not. However, this binary approach oversimplifies the complexity of statistical analysis. The figure 0.05 is arbitrary and does not imply that results just above or below this threshold are drastically different in terms of practical significance.

Clinical vs. Statistical Significance

One critical aspect of interpreting p-values is distinguishing between statistical significance and clinical significance. A statistically significant result with a very small P-value may not always translate into clinical importance. For example, a large study might find that a new treatment reduces blood pressure by 0.5 millimetres of mercury with a p-value of 0.001. While statistically significant, such a small reduction may not be clinically relevant.

Conversely, a clinically significant finding might not reach the strict threshold of statistical significance, particularly in smaller studies. Therefore, it’s essential to consider both the magnitude of the effect and its practical implications in clinical practice.
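The interplay between effect size, sample size and the p-value can be sketched with a simple two-sample z-test. This is an illustration only: the standard deviation of 15 mmHg, the group sizes and the function name two_sided_p are assumptions, not figures from any study.

```python
import math

def two_sided_p(mean_difference, sd, n_per_group):
    """Two-sided p-value from a z-test comparing two group means.

    Assumes both groups share a known standard deviation `sd` and have
    `n_per_group` patients each (a simplification for illustration).
    """
    standard_error = sd * math.sqrt(2 / n_per_group)
    z = abs(mean_difference) / standard_error
    # erfc(z / sqrt(2)) equals the two-sided tail area of a standard normal.
    return math.erfc(z / math.sqrt(2))

SD_MMHG = 15  # assumed between-patient SD of blood pressure

# (difference in mmHg, patients per group)
scenarios = [(0.5, 100), (0.5, 20_000), (10, 5)]
for difference, n in scenarios:
    p = two_sided_p(difference, SD_MMHG, n)
    print(f"difference {difference} mmHg, {n} per group: p = {p:.4f}")
```

Under these assumptions the trivial 0.5 mmHg difference is nowhere near significant with 100 patients per group but becomes highly significant with 20,000 per group, while a much larger 10 mmHg difference fails to reach p < 0.05 with only 5 patients per group.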

The Fragility Index

The fragility index is an alternative measure that addresses some limitations of p-values. It counts the number of patients whose outcome would need to change to turn the study’s result from statistically significant to non-significant. This index provides insight into the robustness of the findings. Surprisingly, even large trials can have a low fragility index, indicating that their results hinge on a small number of events.
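One common way of calculating a fragility index is to switch patients in the group with fewer events from "no event" to "event", one at a time, and recompute Fisher's exact test until p reaches 0.05. The sketch below follows that recipe using only the Python standard library. The trial numbers and the function names are hypothetical, and published analyses may differ in detail.

```python
from math import comb

def fisher_exact_p(events_a, n_a, events_b, n_b):
    """Two-sided Fisher exact p-value for a 2x2 table of events by group."""
    total_events = events_a + events_b
    denom = comb(n_a + n_b, total_events)

    def prob(k):
        # Probability that group A has exactly k of the events.
        return comb(n_a, k) * comb(n_b, total_events - k) / denom

    observed = prob(events_a)
    lowest = max(0, total_events - n_b)
    highest = min(n_a, total_events)
    # Sum every possible table that is no more likely than the observed one.
    return sum(prob(k) for k in range(lowest, highest + 1)
               if prob(k) <= observed * (1 + 1e-9))

def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Number of outcome flips needed to lose statistical significance.

    Returns 0 if the result is not significant to begin with.
    """
    if events_a > events_b:  # make group A the one with fewer events
        events_a, n_a, events_b, n_b = events_b, n_b, events_a, n_a
    flips = 0
    # Stop if significance is lost or there is nobody left to flip.
    while (events_a + flips < n_a
           and fisher_exact_p(events_a + flips, n_a, events_b, n_b) < alpha):
        flips += 1
    return flips

# Hypothetical trial: 20/200 deaths with treatment, 38/200 with control.
p_value = fisher_exact_p(20, 200, 38, 200)
fi = fragility_index(20, 200, 38, 200)
print(f"p = {p_value:.4f}, fragility index = {fi}")
```

In this made-up 400-patient trial only a handful of patients would need a different outcome for the result to stop being significant.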

Moving Beyond 0.05

Recognising the limitations of the 0.05 threshold, some researchers advocate for more stringent criteria, such as a p-value of 0.02, particularly in large randomized controlled trials (RCTs). This approach aims to reduce the likelihood of false-positive results and improve the reliability of findings. However, it also raises the bar for demonstrating the efficacy of new treatments, which can be a double-edged sword.
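The cost of a stricter threshold shows up in the sample size calculation. The sketch below uses the standard normal-approximation formula for comparing two proportions. The event rates (30% falling to 25%), the 90% power and the function name patients_per_group are assumptions chosen for illustration.

```python
from math import ceil
from statistics import NormalDist

def patients_per_group(p_control, p_treatment, alpha=0.05, power=0.9):
    """Approximate sample size per arm for comparing two proportions.

    Normal-approximation formula for a two-sided test:
    n = (z_alpha + z_power)^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_control - p_treatment
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)

# Hypothetical trial: event rate falls from 30% to 25%.
n_05 = patients_per_group(0.30, 0.25, alpha=0.05)
n_02 = patients_per_group(0.30, 0.25, alpha=0.02)
print(f"alpha = 0.05: about {n_05} patients per group")
print(f"alpha = 0.02: about {n_02} patients per group")
```

With these inputs, moving from 0.05 to 0.02 increases the required sample size by roughly a quarter, which is the practical price of the higher level of proof.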

Multiple Testing and Bonferroni Adjustment

A significant challenge in research is multiple testing. Conducting numerous statistical tests increases the probability of finding at least one significant result purely by chance. This issue is particularly relevant in exploratory studies where multiple outcomes are assessed.

One method to address this problem is the Bonferroni adjustment, which adjusts the significance threshold based on the number of tests performed. While this approach helps control the risk of false positives, it can be overly conservative and reduce the power to detect true effects. Therefore, it should be used judiciously.
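The arithmetic behind both paragraphs is short enough to sketch directly. The calculation assumes the tests are independent, which real trial outcomes often are not, and the function names are illustrative.

```python
def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Probability of at least one p < alpha when every null hypothesis is
    true, assuming the tests are independent."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold after Bonferroni adjustment."""
    return alpha / n_tests

for n_tests in (1, 5, 10, 20):
    risk = chance_of_any_false_positive(n_tests)
    threshold = bonferroni_threshold(n_tests)
    adjusted_risk = chance_of_any_false_positive(n_tests, threshold)
    print(f"{n_tests:2d} tests: unadjusted risk {risk:.2f}, "
          f"Bonferroni threshold {threshold:.4f}, "
          f"adjusted risk {adjusted_risk:.3f}")
```

With 20 independent tests the chance of at least one spurious "significant" result is about 64%. Dividing the threshold by the number of tests brings the overall risk back to just under 5%, at the cost of demanding p < 0.0025 from each individual test.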

Interim Analysis in Clinical Trials

Interim analysis is a crucial aspect of clinical trials, allowing researchers to assess the effectiveness or harm of an intervention before the study’s completion. However, performing multiple interim analyses can increase the risk of false-positive findings. To mitigate this risk, researchers use techniques like p-value spending functions (more commonly called alpha spending functions), which set a more stringent significance threshold for each interim analysis.

Additionally, the number of interim analyses should be limited and pre-specified in the study protocol. This ensures that decisions to stop a trial early are based on robust evidence and not on arbitrary or opportunistic analyses.
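A small simulation gives a feel for how unadjusted interim looks inflate the false-positive rate. This is a sketch under simplifying assumptions: there is no true difference between the arms, outcomes are normally distributed with a known SD of 1, there are five equally spaced looks, and every look uses an unadjusted threshold of p < 0.05. The function name and all numbers are invented for illustration.

```python
import math
import random

def false_positive_rates(n_trials=2000, looks=(20, 40, 60, 80, 100), seed=7):
    """Simulate null trials that are analysed repeatedly without adjustment.

    Returns (rate significant at ANY look, rate significant at the FINAL look).
    `looks` gives the number of patients per arm at each analysis.
    """
    rng = random.Random(seed)
    z_critical = 1.96  # two-sided p < 0.05
    any_look = 0
    final_look = 0
    for _ in range(n_trials):
        sum_a = sum_b = 0.0
        enrolled = 0
        crossed = False
        z = 0.0
        for n in looks:
            while enrolled < n:  # recruit up to the next analysis point
                sum_a += rng.gauss(0, 1)
                sum_b += rng.gauss(0, 1)
                enrolled += 1
            # z-statistic for the difference in means with known SD of 1
            z = (sum_a - sum_b) / math.sqrt(2 * n)
            if abs(z) > z_critical:
                crossed = True
        any_look += crossed
        final_look += abs(z) > z_critical  # z now holds the final analysis
    return any_look / n_trials, final_look / n_trials

any_rate, final_rate = false_positive_rates()
print(f"significant at any of 5 looks: {any_rate:.3f}")
print(f"significant at the final look only: {final_rate:.3f}")
```

Analysing only at the end keeps the false-positive rate near the nominal 5%, whereas stopping at the first "significant" look pushes it well above that, which is why spending functions demand a stricter threshold at each interim analysis.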

Effect Size and Confidence Intervals

p-values alone do not provide a complete picture of the study results. It’s equally important to consider the effect size, which measures the magnitude of the difference between treatments. A small p-value might indicate statistical significance, but without a substantial effect size, the clinical relevance of the finding remains questionable.

Confidence intervals (CIs) complement p-values by providing a range within which the true effect size is likely to lie. A 95% CI means that if the study were repeated multiple times, 95% of the calculated intervals would contain the true effect size. CIs offer valuable context for interpreting p-values and understanding the precision of the estimated effect.
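The "repeated many times" interpretation can be checked by simulation. The sketch below assumes a normally distributed outcome with a known SD, so that a simple z-interval applies. The true effect of 5, the SD of 10, the study size of 50 and the function names are all invented for illustration.

```python
import math
import random

def ci_95(sample, known_sd):
    """95% confidence interval for a mean, assuming a known SD (z-interval)."""
    mean = sum(sample) / len(sample)
    margin = 1.96 * known_sd / math.sqrt(len(sample))
    return mean - margin, mean + margin

def coverage(true_effect=5.0, sd=10.0, n=50, n_studies=2000, seed=3):
    """Fraction of simulated studies whose 95% CI contains the true effect."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        sample = [rng.gauss(true_effect, sd) for _ in range(n)]
        low, high = ci_95(sample, sd)
        hits += low <= true_effect <= high
    return hits / n_studies

covered = coverage()
print(f"{covered:.1%} of simulated intervals contained the true effect")
```

Each individual interval either contains the true effect or it does not. The 95% describes how often the procedure captures it across many repeated studies.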

Practical Tips for Interpreting p-values

Understand the Null Hypothesis: Always start with a clear understanding of the null hypothesis and what the study aims to test.
Look Beyond the p-value: Consider the effect size, confidence intervals, and clinical significance of the findings.
Be Cautious with Multiple Testing: Recognize the increased risk of false positives with multiple comparisons and apply appropriate adjustments.
Assess the Fragility Index: Use the fragility index to gauge the robustness of the study’s findings.
Consider Interim Analysis: Ensure that interim analyses are pre-planned and interpreted with caution to avoid bias.
Question the Threshold: Remember that the 0.05 threshold is not a magic number. Interpret p-values in the context of the study design, sample size, and practical implications.

Conclusion

p-values are a fundamental aspect of medical research, but their interpretation requires a nuanced understanding. By considering the null hypothesis, clinical significance, effect size, and confidence intervals, we can make more informed decisions based on the data. As emergency medicine clinicians, our goal is to apply research findings judiciously to improve patient care.

We hope this deep dive into P-values has clarified their role and limitations in research. Remember, the journey to mastering statistical concepts is ongoing, and continuous learning is key. If you have any questions or thoughts, please share them in the comments below. Happy appraising, and stay curious!

Podcast Transcription

Hello and welcome to this St Emlyn’s podcast. I’m Simon Carley, and I’m Rick Bodey. What are we doing today? We’re going to talk about P values. Now, P values are godlike in the world of research. The whole world is about P values, isn’t it? There’s that kind of joke about the definition of P value, the worst thing in the world for a researcher is a P value of 0.06, but success if the P value is 0.04. So, it’s important to get these things right. It can make or break your academic career based on that P value.

I know, but it’s not right though, is it? No, it’s not right, but it’s really important because it appears in all the different papers, so you’re bound to find it when you do critical appraisals. It often comes up in the FRCA exams, critical appraisal exams, where they ask you to define what you mean by P value. So, it’s kind of important that we get a good idea about what it actually is, but also how we, as clinicians, can use that information when we’re looking at papers.

You can go and look at a textbook definition of P values, and things like that, but they can be a bit confusing. Do you have any ways of thinking about them and conceptualising what we mean by P values? When we were just talking about this concept between ourselves, you made a really important point about the null hypothesis. I think if we start with the null hypothesis and understand what we’re testing, then it will really help us to understand what the P value actually means because it’s all about that null hypothesis.

P values are probabilities in a way, and they’re a way of doing what we call significance testing. So, going back to the null hypothesis, say we’ve got a trial, we usually start with something like the purpose of this trial is to find out if there’s a difference, but it’s not really constructed that way. It’s constructed around the null hypothesis. The null hypothesis will say that there’s no difference between two approaches or two treatments, and we aim then to disprove that null hypothesis. So, when we’re doing P value testing, actually, what we’re expressing with the P value is the probability that if the null hypothesis was true, then we’d get a result that’s at least as extreme as the one that we’ve seen.

Yeah, so I’ll say it again because I think it’s really important. If the null hypothesis is true, the chances of finding a result of this difference, or more unlikely than this, is equivalent to whatever the P value is. So, a P value of 0.1 would be 10% of the time, P value of 0.05, the famous one, 5% of the time, or 1 in 20 times, worth noting at this point that 1 in 20 is not that unusual.

Yeah, that’s right. It’s a very magical figure, this 0.05, but there is, of course, no magic about it whatsoever, and it is quite a fragile thing, so we have to be a little bit intelligent in interpreting the P value because there are lots of limitations to it. One of them, of course, being the difference between clinical significance and statistical significance. We can find a highly significant P value and a very, very small difference. It could be something which is not that important, or we could, with a very, very large study, detect very small differences, which we think actually, I don’t care. So you might find in a big study that you can reduce blood pressure by 0.5 millimetres of mercury. Really? I mean, is that important? You might be able to prove it in numbers, but it’s not the same as clinical importance. So that’s always going to be a question we ask, and that’s also going to be a particularly interesting question when we talk about this 0.05 or 1 in 20, when we’re looking at what really means a difference, there are some things where that’s not going to be enough proof.

Yeah, that’s right. So there’s one way of measuring that, which is the fragility index, and that tells us a little bit about how resilient that figure is to changes if we’ve got slightly different results. Yeah, so the fragility index has been offered as an alternative to P values, and basically what that says is that you do a trial, and you may find out that something is statistically significant with a P value of say 0.04, and then the fragility index asks how many people in that trial would have to flip their results, so have an alternative outcome, for it to become non-significant. It’s quite interesting, you look at even big trials, and particularly whether you’ve been raised or whatever it is you’re looking at, it’s quite small, the fragility index can be something like a handful of patients, and that’s really, really interesting when you look at particularly some of the larger critical care trials.

But go back to this 0.05 thing, it has become magical, but actually there’s a lot of trials now which are funded by large organisations, the big randomised control trials, where when they’re calculating the number of people that they need to have in the trial, the power calculation, or the sample size calculation more accurately, they’re saying that actually we want a better level of proof of this, they’re actually going for often 0.02, so one in 50 chance of getting a result this unlikely or more unlikely as a positive result, and I think that’s a good thing.

Yeah, it’s tricky, there’s got to be a balance, hasn’t there, because we don’t want to throw away a good treatment that might benefit patients, just because we haven’t hit really strict statistical criteria for success, but at the same time we don’t want to be subjecting patients to a treatment that’s actually doing them no good, it’s just a statistical quirk.

When we’re looking at trials in particular, we might look at lots of different outcomes actually, so we might, and I think most of the trials you look at now, there’ll be a number of different results in those results sections, each will have their own little P value. Is that a problem with us doing lots of analyses? Well, it can be. So if you imagine that one in 20 chance of getting a P value of less than 0.05, you know, then, and we do multiple tests, now the chances that we’re going to find one of those P values of less than 0.05, just by chance alone, is really going up.

Yeah, it starts to ramp up quite quickly, in fact. Yeah, so this is a problem that we have with multiple testing, and that actually it biases you to finding something positive, when actually there was nothing positive to find, you were just fishing. So there’s some adjustments that we can make for that, like the Bonferroni adjustment, for example. Bonferroni is a bit crude, but it is used, and basically I think every time you increase the number of times you’ve done a test, then you adjust to basically divide the significance by the number of times you do the analysis, and that can be used, but it is a bit of a blunt tool, and sometimes it can feel a bit draconian, particularly when you’re looking at studies where you’re quite exploratory about what’s going on. So you might do a study where you plan to do lots of analyses, because you really don’t know what’s going to go on, and then use that data, have a look what is potentially significant, and then take that through to a greater depth of analysis in a future study. So not achieving a 0.05 sometimes is also important, and just because something doesn’t reach statistical significance, you still may want to have a look at that data.

Yeah, and you raise the really important point about interim analysis there, because we just talked about the danger of doing multiple testing, and one of them just by chance will be positive, and the same applies with doing interim analysis. So if you’re running a randomised controlled trial, we don’t want to be subjecting patients to interventional treatments for too long. As soon as we know it’s successful, we want all our patients to be getting this successful treatment, and if we know it’s harmful, we want to stop the trial, so that they don’t get any more of the potentially harmful treatment. So you could say, well, why don’t we just keep on testing? And if we reach that P value of 0.05 and show success, then we can stop the trial because we’ve shown success, but of course you’ve got into the problem of multiple testing. So there are ways of doing it, you can have a P value spending function, so that every time you do the test, you accept a more stringent P value to represent statistical significance, but also really importantly, we have to limit the number of interim analyses that we do, because otherwise it is literally just fishing. I can’t run the exact statistics, but I do know that if, say, you’re doing a large randomised controlled trial and you decide to do an analysis after every patient, then you’ll definitely get a positive result, even if there’s no difference between the two groups, pretty much on the basis of chance. Mathematicians will tell me that that’s not actually strictly true, but the point is that the chances of you finding a positive result, just by chance, by doing lots of interims, is terrible.
So I think there’s something there about when you’re designing and looking at a critical appraisal of a trial, if there have been interim analyses, then you want to see that they’ve been pre-planned, they’ve been understood, and actually in a lot of large studies now, and I’m on a couple of data monitoring committees, the analysis is done by somebody else, and they will only stop the trial early if there’s very significant evidence that actually continuing would be the wrong thing to do. So the marker and the bar for stopping a trial early is actually higher than we would normally look for in the outcomes of critical appraisal.

Absolutely. Important concepts there on multiple testing, interim analysis, we’ve talked about the difference between clinical significance and statistical significance. Maybe we should touch on effect size as well. That overlaps with clinical and statistical significance, doesn’t it? But one of the common criticisms of a P value is you can look very impressive. P less than 0.001, but that difference can be small, so we need to take into account some measure of effect size as well. P value is not going to tell us the full story, and that’s where your 95% confidence intervals come in. And that’s really important to tell us the full picture. We shouldn’t just be relying on P values to tell us whether or not our intervention is successful.

I agree, let’s just quickly whip through that. P values start with the null hypothesis. We’re trying to disprove the null hypothesis, and a P value is a measure of the probability of finding a result this unlikely or more unlikely, assuming that there’s no difference between the two interventions, or however many interventions you’re looking at. Keep that in mind, and P values actually become a lot clearer and a lot easier to understand.

Yep, we hope that’s made things a bit clearer for you. Okay, so with a high probability that you’ve enjoyed this podcast, that’s rubbish. That was rubbish actually, that’s a terrible joke. With high probability that you enjoy this podcast just for the second time, enjoy your emergency medicine and whatever it is you’re up to. Happy appraising!
