An introduction to sample size calculations. St.Emlyn's

This blog post is based on a paper I put together with Steve Jones and Magnus Harrison way back in 2003. In that paper we covered an area of critical appraisal that strikes fear into many a reader: the issues of sample size and power calculations. What we hoped to do back then (and now) is to try and make a potentially difficult topic a little easier through explanation and example. I think it’s stood the test of time and now that the regulations on publishing your own work on blogs have been made clear we can share as #FOAMed. We hope it helps.

If you want to see the original paper then you can download a copy here

This article has been accepted for publication in the EMJ following peer review. The definitive copyedited, typeset version is available online at http://emj.bmj.com/content/20/5/453.full.pdf+html

See licence to publish details from BMJ here.

Also the great SHERPA/ROMEO information here.

You can also watch a video podcast here

Objectives

Understand power and sample size estimation.
Understand why power is an important part of both study design and analysis.
Understand the differences between sample size calculations in comparative and diagnostic studies.
Learn how to perform a sample size calculation.
1. For continuous data
2. For non-continuous data
3. In diagnostic tests

Power and sample size estimation

Power and sample size estimations are measures of how many patients are needed in a study. Nearly all clinical studies involve studying a sample of patients with a particular characteristic rather than the whole population. We then use this sample to draw inferences about the whole population.

Statistical analysis allows us to determine if the results we have found in a study are likely to be true or possibly just due to chance alone. Clearly we can reduce the possibility of our results coming from chance by eliminating bias in the study design using techniques such as randomisation, blinding etc. However, another factor influences the possibility of our results coming from chance alone: the number of patients studied. Intuitively we assume that the greater the proportion of the whole population studied, the closer we will get to the true answer for that population. But how many do we need to study in order to get as close as we need to the right answer?

An example is the case of thrombolysis in acute myocardial infarction (AMI). For many years clinicians felt that this treatment would be of benefit given the proposed aetiology of AMI, however successive studies failed to prove the case. It was not until the adequately powered “mega-trials” that the small but important benefit of thrombolysis was proven.

Generally these trials compared thrombolysis with placebo and often had a primary outcome measure of mortality at a certain number of days. The basic hypothesis for the studies may have compared, for example, the day 21 mortality of thrombolysis versus placebo. There are two hypotheses then that we need to consider:

The null hypothesis is that there is no difference between the treatments in terms of mortality.
The alternative hypothesis is that there is a difference between the treatments in terms of mortality.

In trying to determine whether the two groups are the same (accepting the null hypothesis) or they are different (accepting the alternative hypothesis) we can potentially make two kinds of error. These are called a type I error and a type II error.

A type I error is said to have occurred when we reject the null hypothesis incorrectly (i.e. it is true and there is no difference between the two groups) and report a difference between the two groups being studied.

A type II error is said to occur when we accept the null hypothesis incorrectly (i.e. it is false and there is a difference between the two groups which is the alternative hypothesis) and report that there is no difference between the two groups.

They can be expressed as a two by two table:

Figure. The familiar 2×2 table

Sample size calculations tell us how many patients are required in order to avoid a type I or a type II error.

You may find this an easier way to remember….

@theplaguedoc

You're pregnant.

Congrats! pic.twitter.com/0iatGqUNLF

— Dr Aidan Baron (@Aidan_Baron) May 16, 2014

The term power is commonly used with reference to all sample size estimations in research. Strictly speaking “power” refers to the number of patients required to avoid a type II error in a comparative study. Sample size estimation is a more encompassing term that looks at more than just the type II error and is applicable to all types of studies. In common parlance the terms are used interchangeably.
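The two kinds of error can be illustrated with a small simulation. What follows is a minimal sketch under stated assumptions, not part of the original paper: it assumes normally distributed outcomes with a known standard deviation, 20 patients per group, and a simple two-sided z-test. The function names are ours and purely illustrative.

```python
import random
from statistics import NormalDist, mean

def z_test_rejects(group_a, group_b, sd, alpha=0.05):
    """Two-sided z-test for a difference in means, assuming a known common SD."""
    standard_error = sd * (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (mean(group_a) - mean(group_b)) / standard_error
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(z) > critical

def rejection_rate(true_difference, n_per_group, sd=1.0, trials=2000, seed=1):
    """Fraction of simulated trials in which the null hypothesis is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        treated = [rng.gauss(true_difference, sd) for _ in range(n_per_group)]
        if z_test_rejects(control, treated, sd):
            rejections += 1
    return rejections / trials

# No true difference: every rejection is a type I error.
type_1_rate = rejection_rate(true_difference=0.0, n_per_group=20)
# True difference of half a standard deviation: every non-rejection is a type II error.
type_2_rate = 1 - rejection_rate(true_difference=0.5, n_per_group=20)

print(f"type I error rate  ~ {type_1_rate:.2f}")
print(f"type II error rate ~ {type_2_rate:.2f}")
```

With these assumed settings the type I error rate should sit close to the chosen 5%, while the type II error rate is well over half, which is what an underpowered study looks like.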

What affects the power of a study?

There are several factors that can affect the power of a study. These should be considered early on in the development of a study. Some of the factors we have control over, others we do not.

(i) Precision

Why might a study not find a difference if there truly is one? For any given result from a sample of patients we can only determine a probability distribution around that value that will suggest where the true population value lies. The best-known example of this would be a 95% confidence interval. The width of the confidence interval is inversely proportional to the square root of the number of subjects studied. So the more people we study the more precise we can be about where the true population value lies.

Figure 1 shows that for a single measurement, the more subjects studied the narrower the probability distribution becomes.

Figure 1

Fig 1. Change in confidence interval width with increasing numbers of subjects
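The narrowing of the confidence interval can be put into numbers. This is a rough sketch using the normal approximation for the mean of a single measurement. The standard deviation of 20 and the sample sizes are illustrative assumptions, and `ci_half_width` is our own helper name.

```python
from statistics import NormalDist

def ci_half_width(sd, n, confidence=0.95):
    """Half-width of a normal-approximation confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sd / n ** 0.5

# Assumed standard deviation of 20 units; each step quadruples the sample size.
for n in (25, 100, 400):
    print(n, round(ci_half_width(sd=20, n=n), 2))
```

Each fourfold increase in the number of subjects halves the width of the interval, so precision is bought at an increasingly high price in patients.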

The probability distribution of where the true value lies is an integral part of most statistical tests for comparisons between groups (e.g. t-tests). A study with a small sample size will have large confidence intervals and will only show up as statistically significant if there is a large difference between the two groups. Figure 2 demonstrates how increasing the number of subjects can give a more precise estimate of differences.

Figure 2

Effect of confidence interval reduction to demonstrate a true difference in means. This example shows that the initial comparison between groups 1 and 3 showed no statistical difference as the confidence intervals overlapped. In groups 3 and 4 the number of patients is doubled (although the mean remains the same). We see that the confidence intervals no longer overlap indicating that the difference in means is unlikely to have occurred by chance.

(ii) The difference we are looking for.

If we are trying to detect very small differences between treatments, very precise estimates of the true population value are required. This is because we need to find the true population value very precisely for each treatment group. Conversely, if we find, or are looking for, a large difference a fairly wide probability distribution may be acceptable.

In other words if we are looking for a big difference between treatments we might be able to accept a wide probability distribution, if we want to detect a small difference we will need great precision and small probability distributions. Since the width of probability distributions is largely determined by how many subjects we study it is clear that the difference sought affects sample size calculations.

When comparing two or more samples we usually have little control over the size of the effect. However, we need to make sure that the difference is worth detecting. For example it may be possible to design a study that would demonstrate a reduction in the onset time of local anaesthesia from 60 seconds to 59 seconds, but such a small difference would be of no clinical importance. Conversely a study demonstrating a difference of 60 seconds to 10 minutes clearly would. Stating what the “clinically important difference” is forms a key component of a sample size calculation.

(iii) How important is a type I or type II error for the study in question?

We can specify how concerned we would be to avoid a type I or type II error.

A type I error is said to have occurred when we reject the null hypothesis incorrectly. Conventionally we choose a probability of <0.05 for a type I error. This means that if we find a positive result, the chance of finding this (or a greater) difference when no true difference exists is less than 5%. This figure, or significance level, is designated as pa and is usually preset by us early in the planning of a study, when performing a sample size calculation. By convention, rather than design, we more often than not choose 0.05. The lower the significance level the lower the power, so using 0.01 will reduce our power accordingly.

(In other words, pa is the level at which we avoid a type I error: if we find a positive result, the chance of finding this, or a greater difference, when no true difference exists is less than pa × 100%.)

A type II error is said to occur when we accept the null hypothesis incorrectly and report that there is no difference between the two groups. If there truly is a difference between the interventions, we express how likely we are to find it. This figure is referred to as pb; it is the power of the study, that is, one minus the probability of a type II error. There is less convention as to the accepted level of pb, but figures of 0.8-0.9 are common (i.e. if a difference truly exists between interventions then we will find it on 80-90% of occasions).

The avoidance of a type II error is the essence of power calculations. The power of a study is the probability that the study will detect a predetermined effect on the measurement between the two groups, if it truly exists, given a preset value of pa and a sample size, N.
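The relationship between significance level, sample size and power can be sketched in a few lines. This is an approximation only: it assumes two equal groups, normally distributed data and a two-sided test, and uses the normal distribution rather than the t distribution. The effect is expressed as the difference in means divided by the standard deviation, and `power_two_groups` is a hypothetical helper name, not a function from any statistics package.

```python
from statistics import NormalDist

def power_two_groups(standardised_difference, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha / 2)
    # Expected size of the test statistic when the difference is real.
    shift = standardised_difference * (n_per_group / 2) ** 0.5
    # Probability of landing beyond either critical value.
    return std_normal.cdf(shift - critical) + std_normal.cdf(-shift - critical)

# Assumed example: effect of half a standard deviation, 64 patients per group.
print(round(power_two_groups(0.5, 64, alpha=0.05), 2))
print(round(power_two_groups(0.5, 64, alpha=0.01), 2))
```

Tightening the significance level from 0.05 to 0.01 with the same number of patients lowers the power noticeably, which is the trade-off described above.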

(iv) The type of statistical test we are performing.

Sample size calculations reflect how statistical tests are going to perform. Therefore, it is no surprise that the type of test used affects how the sample size is calculated. For example, parametric tests are better at finding differences between groups than non-parametric tests (which is why we often try to convert basic data to normal distributions). Consequently, an analysis reliant upon a non-parametric test (e.g. Mann-Whitney U) will need more patients than one based on a parametric test (e.g. Student’s t-test).

Factors affecting a power calculation

Precision
The difference we are looking for
How certain we want to be to avoid type I or II errors
The type of statistical test we are performing

Should sample size calculations be performed before or after the study?

The answer is before, during and after.

In designing a study we want to make sure that the work that we do is worthwhile so that we get the correct answer and we get it in the most efficient way. This is so that we can recruit enough patients to give our results adequate power but not so many that we waste time getting more data than we need. Unfortunately, when designing the study we may have to make assumptions about desired effect size and variance within the data. Once the study is underway, analysis of the results obtained should periodically be used to perform further power calculations, and adjustments made to the sample size accordingly.

When we are assessing results from trials with negative results it is particularly important to question the sample size of the study. It may well be that the study was underpowered and that we have incorrectly accepted the null hypothesis, a type II error. If the study had had more subjects, then a difference may well have been detected. In an ideal world this should never happen because a sample size calculation should appear in the methods section of all papers, but reality shows us that this is not the case. As consumers of research we should be able to estimate the power of a study from the given results.

Retrospective sample size calculations are not covered in this article. Several calculators for retrospective sample size are available on the Internet(1;2).

What type of study should have a power calculation performed?

Nearly all quantitative studies can be subjected to a sample size calculation. However, they may be of little value in early exploratory studies where little data is available on which to base the calculations (though this may be addressed by performing a pilot study first and using the data from that).

Clearly sample size calculations are a key component of clinical trials since the emphasis in most of these studies is in finding the magnitude of difference between therapies. All clinical trials should have an assessment of sample size.

In other study types sample size estimation should be performed to improve the precision of our final results. For example, the principal outcome measures for many diagnostic studies will be the sensitivity and specificity for a particular test, typically reported with confidence intervals for these values. As with comparative studies, the greater the number of patients studied the more likely the sample finding is to reflect the true population value. By performing a sample size calculation for a diagnostic study we can specify the precision with which we would like to report the confidence intervals for the sensitivity and specificity.

As clinical trials and diagnostic studies are likely to form the core of research work in emergency medicine we have concentrated on these in this article.

Power in comparative studies.

1. Studies reporting continuous normally distributed data

Suppose that Egbert Everard had become involved in a clinical trial involving hypertensive patients. A new antihypertensive drug, Jabba Juice, was being compared to bendrofluazide as a new first line treatment for hypertension.

Egbert writes down some things that he thinks are important for the calculation.

What is the null hypothesis?	That Jabba Juice will be no more effective than bendrofluazide in treating new presentations of hypertension.
What level do we want to avoid a type I error at? (pa)	We set this to 0.05
What level do we want to avoid a type II error at? (pb)	We set this to 0.8
What is the “clinically important difference” we want to detect?	For this study we want to detect a minimum 10mmHg difference between treatments.
What type of data and analysis is likely?	Continuous normally distributed data. To be analysed using a t-test
What is the standard deviation of blood pressure in this group of patients?	From other studies we know that the standard deviation is 20mmHg.

As you can see the figures for pa and pb are somewhat typical. These are usually set by convention, rather than changing between one study and another, although as we see below they can change.

A key requirement is the “clinically important difference” we want to detect between the treatment groups. As discussed above this needs to be a difference that is clinically important as, if it is very small, it may not be worth knowing about.

Another figure that we need to know is the standard deviation of the variable within the study population. Blood pressure measurements are a form of normally distributed continuous data and as such will have a standard deviation, which Egbert has found from other studies looking at similar groups of people.

Once we know these last two figures we can work out the standardised difference and then use a table to give us an idea of the number of patients required.

Standardised difference = difference between means/population standard deviation

The difference between the means is the clinically important difference, i.e. it represents the difference between the mean blood pressure of the bendrofluazide group and the mean blood pressure of the new treatment group.

From Egbert’s scribblings:

Standardised difference = 10mmHg/20mmHg

Therefore the standardised difference is 0.5 (the units cancel, so it is a plain number rather than a value in mmHg)

Using table 1 below we can see that with a standardised difference of 0.5 and a power level (pb) of 0.8 the number of patients required is 64. This is the number needed in each treatment group for a two-sided test, so with two groups we will need a minimum of 64 x 2=128 patients.

Table 1. How power changes with standardised difference.

Power level (pb)
Sdiff*	0.99	0.95	0.90	0.80
0.10	3676	2600	2103	1571
0.20	920	651	527	394
0.30	410	290	235	176
0.40	231	164	133	100
0.50	148	105	86	64
0.60	104	74	60	45
0.70	76	54	44	33
0.80	59	42	34	26
0.90	47	34	27	21
1.00	38	27	22	17
1.10	32	23	19	14
1.20	27	20	16	12
1.30	23	17	14	11
1.40	20	15	12	9
1.50	18	13	11	8

*SDiff=standardised difference
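The figures in Table 1 can be approximated with a short calculation. This is a sketch under assumptions rather than the method used to build the table: it takes the usual normal-approximation formula for two equal groups with a two-sided pa of 0.05, n per group = 2(Zα + Zβ)²/Sdiff². The function name `n_per_group` is ours.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(standardised_difference, power=0.80, alpha=0.05):
    """Approximate patients needed in EACH group (normal approximation)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # 1.96 for a two-sided 0.05
    z_power = std_normal.inv_cdf(power)          # 0.84 for a power of 0.80
    return ceil(2 * (z_alpha + z_power) ** 2 / standardised_difference ** 2)

# Egbert's trial: 10 mmHg clinically important difference, SD of 20 mmHg.
sdiff = 10 / 20
print(sdiff, n_per_group(sdiff))
```

The approximation gives a figure a patient or so below the table entry of 64. We assume the small gap arises because the table allows for the t distribution, so the table value is the safer one to use.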

Another method of setting the sample size is to use the nomogram developed by Gore and Altman(3) as shown in Figure 3.

Figure 3

Gore SM, Altman DG. How large a sample. In: Statistics in practice. London: BMJ Publishing, 2001:6–8.

From this we can use a straight edge to join the standardised difference to the power required for the study. Where the edge crosses the middle variable gives an indication as to the number, N, required.

The nomogram can also be used to calculate power for a two-sample comparison of a continuous measurement with the same number of patients in each group.

If the data is not normally distributed the nomogram is unreliable and formal statistical help should be sought.

2. Studies reporting categorical data

Suppose that Egbert Everard, in his constant quest to improve care for his patients suffering from myocardial infarction, had been persuaded by a pharmaceutical representative to help conduct a study into the new post-thrombolysis drug, Jedi Flow. He knew from previous studies that large numbers would be needed so performed a sample size calculation to determine just how daunting the task would be.

What is the null hypothesis?	That adding Jedi Flow will be no more effective than thrombolysis alone in improving the mortality rate in acute MI.
What level do we want to avoid a type I error at? (pa)	We set this to 0.05
What level do we want to avoid a type II error at? (pb)	We set this to 0.8
What is the “clinically important difference” we want to detect?	3%
What is the mortality rate using thrombolysis alone?	12%

Once again the figures for pa and pb are standard, and we have set the level for a clinically important difference.

Unlike continuous data, the sample size calculation for categorical data is based on proportions. However, similar to continuous data we still need to calculate a standardised difference. This enables us to use the nomogram to work out how many patients are needed.

p₁ = proportional mortality in thrombolysis group = 12% or 0.12

p₂ = proportional mortality in Jedi Flow group = 9% or 0.09 (This is the 3% clinically important difference in mortality we want to show).

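For proportions the standardised difference is (p₁ − p₂)/√[p̄(1 − p̄)], where p̄ = (p₁ + p₂)/2 is the mean of the two proportions. Here p̄ = 0.105, so the standardised difference is 0.03/√(0.105 × 0.895), which is approximately 0.1. If we use the nomogram, and draw a line from 0.1 to the power axis at 0.8, we can see from the intersect with the central axis, at 0.05 pa level, we need 3000 patients in the study. This means we need 1500 patients in the Jedi Flow group and 1500 in the thrombolysis group.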
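The calculation for proportions can be sketched as follows. It assumes the usual standardised difference for two proportions, the difference divided by √[p̄(1 − p̄)] with p̄ the mean of the two proportions, followed by a normal-approximation sample size for two equal groups. Both function names are illustrative, and the result is an estimate rather than a replacement for the nomogram.

```python
from math import ceil, sqrt
from statistics import NormalDist

def standardised_difference_proportions(p1, p2):
    """Standardised difference between two proportions."""
    p_bar = (p1 + p2) / 2
    return abs(p1 - p2) / sqrt(p_bar * (1 - p_bar))

def n_per_group(standardised_difference, power=0.80, alpha=0.05):
    """Approximate patients needed in EACH group (normal approximation)."""
    std_normal = NormalDist()
    z_total = std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)
    return ceil(2 * z_total ** 2 / standardised_difference ** 2)

# Jedi Flow example: mortality of 12% on thrombolysis alone versus a hoped-for 9%.
sdiff = standardised_difference_proportions(0.12, 0.09)
print(round(sdiff, 3), n_per_group(sdiff))
```

Without rounding, the standardised difference comes out slightly below 0.1, so the estimate is a little higher than the 1500 per group read from the nomogram. Either way the message is the same: small differences in mortality need thousands of patients.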

Power in diagnostic tests.

Power calculations are rarely reported in diagnostic studies and in our experience few people are aware of them. They are of particular relevance to emergency medicine practice because of the nature of our work. The methods described here are taken from the work by Buderer(4).

The method described is used to calculate the sample size required to estimate an expected level of sensitivity or specificity with a predefined degree of precision. If the researcher wishes to ensure that a particular test has a sensitivity or specificity higher than a predetermined level then an alternative method should be used (5). The method described should not be used if there are fewer than five subjects in any of the cells of the 2×2 table.

Dr Egbert Everard decides that the diagnosis of ankle fractures may be improved by the use of a new hand held ultrasound device in the emergency department at Death Star General. The DefRay device is used to examine the ankle and gives a read out of whether the ankle is fractured or not. Dr Everard thinks this new device may reduce the need for patients having to wait hours in x-ray thereby avoiding all the earache from patients when they come back. He thinks that the DefRay may be used as a screening tool: only those patients with a positive DefRay test would be sent to x-ray to demonstrate the exact nature of the injury.

He designs a diagnostic study where all patients with suspected ankle fracture are examined in the emergency department using the DefRay. This result is recorded and then the patients are sent around for an x-ray regardless of the result of the DefRay test. Dr Everard and a colleague will then compare the results of the DefRay against the standard x-ray.

Missed ankle fractures cost Dr Everard’s department a lot of money last year and so it is very important that the DefRay performs well if it is to be accepted as a screening test. Egbert wonders how many patients he will need. He writes down the following:

What is the null hypothesis?	That the DefRay will not be more than 90% sensitive and 70% specific for detecting ankle fractures
What is the lowest expected sensitivity that is acceptable?	95% (call it SN)
What is the lowest expected specificity that is acceptable?	80% (call it SP)
What do you want the confidence intervals to be?	5% for sensitivity (Call it W)
How many patients in the study will have the target disorder? (In this case ankle fractures in Egbert’s population of patients)	30% (Call it P)

For purposes of calculation W, SN, SP and P are expressed as numbers between 0 and 1, rather than as percentages.

For a diagnostic study we calculate the power required to achieve either an adequate sensitivity or an adequate specificity. The calculations work around the standard 2×2 way of reporting diagnostic data as shown below.

To calculate the number needed for adequate sensitivity.

First calculate the total number of patients with disease. This is the number of true positives and false negatives (TP+FN). Z is the value from a standard statistical text representing pa. Usually pa=0.05 and therefore Z=1.96 for 95% 2 tailed confidence intervals.

TP+FN = Z² × SN(1 − SN)/W²

In this case TP+FN = 1.96² × 0.95 × 0.05/0.05² = 72.99. We then divide this number by P, where P is the estimate of disease in the population (here 0.3).

The final number of patients needed by Egbert is therefore 243 patients.

To calculate the number needed for adequate specificity

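For specificity we first calculate the number of false positives and true negatives (FP+TN). Again Z is the value from a standard statistical text representing pa. Usually pa=0.05 and therefore Z=1.96 for 95% 2 tailed confidence intervals.

FP+TN = Z² × SP(1 − SP)/W²

In this case FP+TN = 1.96² × 0.80 × 0.20/0.05² = 245.86. We then divide this number by (1 − P), the proportion of patients without the disease (here 0.7).

So for adequate specificity Egbert needs 351 patients.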

If Egbert were equally interested in having a test with both a high specificity and a high sensitivity we would take the greater of the two, but he is not. He is most interested in making sure the test has a high sensitivity in order to rule out ankle fractures. He therefore takes the figure for sensitivity, 243 patients.
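Both diagnostic calculations can be checked with a short script. This is a sketch assuming the formulas usually attributed to Buderer: Z² × SN(1 − SN)/W² diseased patients divided by the prevalence for sensitivity, and Z² × SP(1 − SP)/W² non-diseased patients divided by one minus the prevalence for specificity. The function names are ours, and the same width W of 0.05 is assumed for both.

```python
from statistics import NormalDist

def n_for_sensitivity(sn, w, prevalence, alpha=0.05):
    """Total patients needed to estimate sensitivity `sn` to within +/- `w`."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diseased_needed = z ** 2 * sn * (1 - sn) / w ** 2  # TP + FN
    return diseased_needed / prevalence

def n_for_specificity(sp, w, prevalence, alpha=0.05):
    """Total patients needed to estimate specificity `sp` to within +/- `w`."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    non_diseased_needed = z ** 2 * sp * (1 - sp) / w ** 2  # FP + TN
    return non_diseased_needed / (1 - prevalence)

# Egbert's DefRay study: SN 0.95, SP 0.80, W 0.05, prevalence 0.30.
print(round(n_for_sensitivity(0.95, 0.05, 0.30)))
print(round(n_for_specificity(0.80, 0.05, 0.30)))
```

The script rounds to the nearest whole patient to match the figures in the text. Rounding up instead would be the more cautious choice when planning recruitment.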

Conclusion

Sample size estimation is a key component of research. An understanding of the concepts of power, sample size, and type I & II errors will help the researcher and the critical reader of the medical literature.

Quiz

[DDET (1) What factors affect a power calculation for a trial of therapy?]

Precision
The difference we are looking for
How certain we want to be to avoid type I or II errors
The type of statistical test we are performing [/DDET]

[DDET (2) Dr Egbert Everard wants to test a new blood test (Boozerfind) for the diagnosis of pancreatitis. He wants the test to have a sensitivity of at least 70% and a specificity of 90% with 5% confidence levels. Disease prevalence in this population is 10%. (i) How many patients does Egbert need to be 95% sure his test is more than 70% sensitive? (ii) How many patients does Egbert need to be 95% sure that his test is more than 90% specific?]

(2) (i) 2881 patients (ii) 81 patients [/DDET]

[DDET (3) If Dr Everard was to trial a new treatment for light sabre burns that was hoped would reduce mortality from 55% to 45%. He sets the pa to 0.05 and pb to 0.99 but finds that he needs lots of patients, so to make his life easier he changes the power to 0.80. (i) How many patients did he need with the pa to 0.05 and pb to 0.80? (ii) How many more patients did he need with the higher power?]

(3) (i) 394 patients (ii) 526 patients [/DDET]

Reference List

[1] UCLA Power calculators http://calculators.stat.ucla.edu/powercalc/

[2] Interactive statistical pages. http://www.statistics.com/content/javastat.html

[3] Gore SM, Altman DG. How large a sample. In: Gore SM, Altman DG, editors. Statistics in practice. London: BMJ Publishing, 2001:6–8.

[4] Buderer NM. Statistical methodology: I. Incorporating the prevalence of disease into the sample size calculation for sensitivity and specificity. Academic Emergency Medicine 1996; 3(9):895-900.

[5] Arkin CF, Wachtel MS. How many patients are necessary to assess test performance? JAMA 1990;3:895–900.