Overview of the case studies
The case studies cover a range of trial designs and differ in the type of primary outcome, the availability of evidence to inform the target difference and the level of complexity. A short description of each is provided in Table 1.
Case study 1: the MAPS trial
Radical prostatectomy is carried out for men suffering from early prostate cancer. The operation is usually carried out through an open incision in the abdomen, which may damage the urinary bladder sphincter, its nerve supply and other pelvic structures. Urinary incontinence occurs in around 90% of men initially, but the long-term prognosis varies from 2% to 60%, depending on how incontinence is measured and time after surgery. Successive Cochrane systematic reviews101 have shown that, although conservative treatment based on pelvic floor muscle training may be offered to men with urinary incontinence after prostate surgery, there is insufficient evidence to evaluate its effectiveness and cost-effectiveness. Men After Prostate Surgery (MAPS)100 was a multicentre RCT that aimed to assess the clinical effectiveness (primarily by looking at the presence of urinary incontinence post treatment) and cost-effectiveness of active conservative treatment delivered by a specialist continence physiotherapist or a specialist continence nurse, compared with standard management, in men receiving a radical prostatectomy at 12 months after surgery.
The primary outcome was the presence of urinary continence. No other outcomes were considered. The sample size was based on a target difference of 15% absolute difference (85% specialist treatment vs. 70% control). A Cochrane systematic review101 suggested that the current control group proportion was 70% (average across relevant control groups). This magnitude of target difference was determined to be both a realistic and an important difference from discussion between clinicians and the project management group, and from inspection of the proportion of patients with urinary continence in the trials included in the Cochrane systematic review.101 Setting the statistical significance to the two-sided 5% level and seeking 90% power, 174 participants per group were required, giving a total of 348 participants prior to considering missing data. Allowing for just under 15% missing data increased the overall sample size to 400. The power (77%) should the control group response turn out to be 40% (i.e. using 55% for the treatment and 40% for the control) was calculated as a sensitivity analysis. As the power was still reasonably high and this was considered a less plausible scenario, the overall sample size was not changed (see Box 4).
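The figures quoted above can be retraced with a short calculation. The sketch below is illustrative only: it assumes the usual normal-approximation formula for comparing two proportions with a Fleiss-type continuity correction, which happens to reproduce the 174 per group and 77% quoted here. The text does not state which formula the trial team used, and all function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def n_two_proportions(p1, p2, alpha=0.05, power=0.90):
    """Uncorrected per-group n for comparing two proportions (equal allocation)."""
    delta = abs(p1 - p2)
    p_bar = (p1 + p2) / 2
    z_a = Z.inv_cdf(1 - alpha / 2)
    z_b = Z.inv_cdf(power)
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return (num / delta) ** 2


def with_continuity_correction(n, delta):
    """Fleiss-type continuity correction for equal group sizes (assumed here)."""
    return n / 4 * (1 + sqrt(1 + 4 / (n * delta))) ** 2


def power_two_proportions(n_cc, p1, p2, alpha=0.05):
    """Approximate power for a continuity-corrected per-group size n_cc."""
    delta = abs(p1 - p2)
    n = n_cc * (1 - 1 / (n_cc * delta)) ** 2  # algebraic inverse of the correction
    p_bar = (p1 + p2) / 2
    z_a = Z.inv_cdf(1 - alpha / 2)
    z = ((delta * sqrt(n) - z_a * sqrt(2 * p_bar * (1 - p_bar)))
         / sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return Z.cdf(z)


# Main scenario: 85% vs. 70%, two-sided 5% level, 90% power
n_group = ceil(with_continuity_correction(n_two_proportions(0.85, 0.70), 0.15))
total = 2 * n_group

# Sensitivity scenario: 55% vs. 40% with the same per-group size
sensitivity_power = power_two_proportions(n_group, 0.55, 0.40)

# Share of the 400 recruited that may be lost while retaining the required total
allowed_missing = 1 - total / 400

print(n_group, total, round(sensitivity_power, 2), round(allowed_missing, 2))
```

Under these assumptions the script prints 174 per group, 348 in total, a power of about 0.77 for the sensitivity scenario and an allowed loss of 0.13, in line with the "just under 15%" described above.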
Case study 2: the ACL-SNNAP trial
Anterior cruciate ligament (ACL) rupture is a common injury, mainly affecting young, active individuals. ACL injury can have a profound effect on knee kinematics (knee movement and forces), with recurrent knee instability (giving way) the main problem. In the UK, a surgical management strategy has become the preferred treatment for ACL-injured individuals. However, the preference for surgical management (reconstruction) of the ACL-deficient knee had been questioned by a Scandinavian trial,115 which suggested that rehabilitation can reduce the proportion of acute patients requiring surgery by up to 50%.
A two-arm RCT – ACL Surgery Necessity in Non Acute Patients (ACL-SNNAP)115 – was planned to compare a strategy of non-surgical management, with the option of surgery if required (the rehabilitation group), with a strategy of surgical management only (the reconstruction group) in the UK NHS setting treating non-acute patients. The main outcome of interest was the Knee injury and Osteoarthritis Outcome Score (KOOS)-4, which excludes the activities of daily living component of the full KOOS. This decision reflected belief about the impact of ACL rupture and the aim of treatment.115,116 KOOS-4 seemed to be the most appropriate of the available condition-relevant quality-of-life measures.
Limited work had assessed what would be a minimum important difference (MID) in the overall KOOS and the KOOS-4 variant. The KOOS user guide recommended 8–10 points as the (current) best estimate of a minimum important change.117 This was based on an anchor method approach, using clinical judgement about the recovery timescale applied to a small cohort of ACL reconstruction patients.118 Differences that occurred within the recovery period were ≤ 7 points, whereas those that occurred afterwards were ≥ 8 points for three of the four KOOS-4 domain scores. Given the limited data on what would constitute an important difference, estimates from a distribution-based approach [minimum clinically detectable change (MCDC)] were also considered. The MCDC was around 6–12 for individual domains.119 A value of 8 points was taken as a reasonable value for the MID in the KOOS-4 overall score.
A standard sample size calculation for comparing two means, using a target difference of 8 points and a SD of 19, gave a required sample size of 120 in each group, for 90% power at a two-sided 5% significance level. This is how many patients would be required for an individually randomised trial in the absence of any clustering of outcome. The impact of clustering of outcome by the main intervention deliverer (surgeon and/or physiotherapist) was also considered. Given the time of outcome measurement (quality of life at 6 months), previous evidence suggested any clustering effect to be low: intracluster correlation coefficient (ICC) estimates from a database of previous surgical trials were circa 0–0.06.121 Clustering was assumed to occur to the same degree in both arms. Two surgeons from each of at least 13 sites were anticipated, whereas a priori more physiotherapists (at least 50% more, i.e. around 40) were anticipated to be involved in the study. Credible SDs for the cluster sizes were informally assessed using mock scenarios. Equal allocation was planned.
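The unclustered calculation can be sketched as follows. This is an illustration under stated assumptions rather than the trial team's own code: it uses the normal-approximation formula for two means and, optionally, Guenther's small additive adjustment for the use of a t-test. The function name and the `t_correction` switch are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_two_means(delta, sd, alpha=0.05, power=0.90, t_correction=False):
    """Per-group sample size for comparing two means with equal allocation."""
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)
    z_b = z.inv_cdf(power)
    n = 2 * (z_a + z_b) ** 2 * (sd / delta) ** 2
    if t_correction:
        n += z_a ** 2 / 4  # Guenther's approximate adjustment for a t-test
    return ceil(n)


# Target difference of 8 points on the KOOS-4, SD of 19
print(n_two_means(8, 19))                     # normal approximation
print(n_two_means(8, 19, t_correction=True))  # with the t-test adjustment
```

The normal approximation gives 119 per group and the adjusted version gives 120, so the 120 quoted above is consistent with a t-based calculation.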
The sample size was estimated to be 130 patients per group to achieve just over 80% power, based on assuming an ICC of 0.06. With 26 surgeons, the number of patients per surgeon in the surgery management arm was expected to be five on average. With 40 physiotherapists, the number of patients per physiotherapist was expected to be three on average. Some allowance for variance in the number per health professional was also made. Given the anticipated challenges in recruiting to the study, keeping the sample size as small as possible was considered critical. As clustering was not certain, the sample size was increased to ensure at least 80% power if clustering occurred. In the absence of clustering, the power would be > 90%.
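The effect of clustering on power can be explored with a design effect. The sketch below is a simplified illustration, not the calculation used in the trial: it assumes the standard design effect 1 + (m − 1) × ICC, with an optional coefficient of variation of cluster size (`cv`, a hypothetical parameter, because the allowance actually made for variable cluster sizes is not given here), and a normal approximation for power.

```python
from math import sqrt
from statistics import NormalDist


def design_effect(mean_cluster_size, icc, cv=0.0):
    """1 + ((cv^2 + 1) * m - 1) * icc; equals 1 + (m - 1) * icc when cv = 0."""
    return 1 + ((cv ** 2 + 1) * mean_cluster_size - 1) * icc


def power_two_means(n1, n2, delta, sd, alpha=0.05):
    """Normal-approximation power for a two-sided comparison of two means."""
    z = NormalDist()
    se = sd * sqrt(1 / n1 + 1 / n2)
    return z.cdf(delta / se - z.inv_cdf(1 - alpha / 2))


n = 130      # patients per group
icc = 0.06   # upper end of the plausible ICC range

# Effective sample sizes: about 5 patients per surgeon, 3 per physiotherapist
eff_surgery = n / design_effect(5, icc)
eff_rehab = n / design_effect(3, icc)

power_no_clustering = power_two_means(n, n, delta=8, sd=19)
power_clustered = power_two_means(eff_surgery, eff_rehab, delta=8, sd=19)

print(round(power_no_clustering, 2), round(power_clustered, 2))
```

With equal cluster sizes (`cv = 0`) the power without clustering is above 90%, as stated above, and the power with clustering stays above 80%. The simplified figure is higher than the "just over 80%" quoted because this sketch ignores the variation in cluster size that the trial team allowed for. Raising `cv` lowers the clustered power.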
To allow for missing data (approximately 15% loss to follow-up), the total required sample size was set at 320 patients. As the funding agency requested an interim check on the degree of clustering, a single planned interim check was set once data for 100 patients had been collected. This planned interim assessment would assess only the ICC magnitude and other sample size assumptions, such as cluster size. A formal interim analysis comparing treatments was not planned. Box 7 provides the corresponding sample size explanation in the trial protocol.
Case study 3: the OPTION-DM trial
A common comorbidity for patients with diabetes mellitus is neuropathic pain. Although there are some pharmacological treatments for this pain, it is unclear which is best. As the first-line treatment often does not work, patients may get second-line treatments as part of a care pathway. In the Optimal Pathway for TreatIng neurOpathic paiN in Diabetes Mellitus (OPTION-DM) trial,110 three care pathways were to be compared in a three-period crossover study. All patients would receive all three care pathways. Each care pathway reflected a form of clinical practice that a patient might receive for their neuropathic pain. The main candidate primary outcome was the 7-day-average 24-hour pain after 16 weeks of treatment, measured on a numeric rating scale.
There was some experience of using such a pain score within the study team and in the published literature:
- A recent placebo-controlled crossover trial observed a 0.5-point average difference between the active comparator and placebo.122
- Patients in this population on the active treatment were expected to improve from baseline by, on average, 2 points.
- A 1-point improvement within an individual patient was viewed as a clinically important difference, based on an existing study that used an opinion-seeking approach.123
These considerations were used to inform the choice of a clinically important difference. The wish was to increase the proportion of patients improving by ≥ 1 point. The proportion of individuals improving can be calculated given the assumed reduction and difference between the groups. We expected a mean improvement of 2 points from baseline. Assuming that the change from baseline followed a normal distribution, 66% of patients were anticipated to improve by ≥ 1 point (relevant values are in bold text in Table 2).
If, for example, a clinically important mean difference of 0.5 points between treatments was the target (see bold text in Table 2 for relevant values), this would equate to a mean change from baseline of 2.5 points and 74% of patients showing a clinical improvement of ≥ 1 point in the active group. These calculations suggested a clinically important mean difference of 0.5 points, which can be equated to an increase from 66% to 74% in the proportion of patients showing an individual clinical improvement of ≥ 1 point.
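The link between a mean difference and the proportion of patients improving can be sketched as below. The SD of the change from baseline is not given in this section, so `ASSUMED_SD = 2.37` is a hypothetical value chosen only because it lands close to the 66% and 74% quoted. The values in Table 2 may rest on a different SD.

```python
from statistics import NormalDist

ASSUMED_SD = 2.37  # hypothetical SD of the change from baseline (not from the text)


def proportion_improving(mean_change, sd=ASSUMED_SD, threshold=1.0):
    """P(improvement >= threshold) if the change from baseline is normal."""
    return 1 - NormalDist(mean_change, sd).cdf(threshold)


p_reference = proportion_improving(2.0)  # expected mean improvement of 2 points
p_target = proportion_improving(2.5)     # a further 0.5-point mean difference

print(round(p_reference, 2), round(p_target, 2), round(p_target - p_reference, 2))
```

Under this assumed SD, a 0.5-point shift in the mean corresponds to a difference of roughly seven to eight percentage points in the proportion improving by at least 1 point.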
The calculation was then adjusted for multiplicity. Each care pathway was planned to be compared with each of the other pathways at the end of the trial. As three formal comparisons were planned, the Bonferroni adjustment was used to adjust the significance level to maintain an overall two-sided, 5% significance level. The sample size was calculated for 90% statistical power. See Box 8 for the corresponding sample size explanation presented in the protocol.
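The effect of the Bonferroni adjustment on the sample size can be illustrated as follows. This is a minimal sketch assuming a normal-approximation sample size formula, in which the required number scales with the square of the sum of the two standard normal quantiles. The function names are hypothetical.

```python
from statistics import NormalDist


def bonferroni_alpha(overall_alpha, n_comparisons):
    """Per-comparison significance level under the Bonferroni adjustment."""
    return overall_alpha / n_comparisons


def inflation_factor(adjusted_alpha, overall_alpha=0.05, power=0.90):
    """Relative increase in sample size caused by the stricter alpha."""
    z = NormalDist()
    z_b = z.inv_cdf(power)
    adjusted = (z.inv_cdf(1 - adjusted_alpha / 2) + z_b) ** 2
    unadjusted = (z.inv_cdf(1 - overall_alpha / 2) + z_b) ** 2
    return adjusted / unadjusted


alpha_adj = bonferroni_alpha(0.05, 3)  # three pairwise pathway comparisons
print(round(alpha_adj, 4), round(inflation_factor(alpha_adj), 2))
```

With three comparisons the per-comparison two-sided level is about 1.67%, and under this approximation the sample size at 90% power rises by a little under 30% compared with an unadjusted 5% level.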
If the proportion of patients with an improvement of ≥ 1 point had been used as the primary outcome (i.e. dichotomising the pain score) and analysed accordingly as a binary outcome, for an effect of 8%, a much larger sample size of 884 analysable patients would have been required (n = 1179, allowing for dropout).
Case study 4: the SUSPEND trial
Ureteric colic describes episodic severe abdominal pain from sustained contraction of ureteric smooth muscle as a kidney stone passes down the ureter into the bladder. It is a common reason for people to seek emergency health care. Treatments that increase the likelihood of stone passage would benefit patients with ureteric colic, as they will reduce the need for an interventional procedure.
At the time of planning, two smooth muscle relaxant drugs, tamsulosin (an alpha-adrenoceptor antagonist, or alpha-blocker) and nifedipine (a calcium channel blocker), known as medical expulsive therapy (MET), were considered potentially beneficial treatments. The Spontaneous Urinary Stone Passage ENabled by Drugs (SUSPEND) trial111 was designed to inform the treatment choice. A three-arm RCT was planned to compare tamsulosin and nifedipine with a placebo control to facilitate spontaneous stone passage.
A head-to-head comparison of the two MET agents, nifedipine and tamsulosin, was considered vital. A comparison of the two active arms (combined) with the placebo arm (MET vs. placebo) was also planned, due to uncertainty about the strength of the existing evidence of clinical efficacy. The key outcome of interest was the presence or absence of a stone at 28 days. It was defined as the lack of any further intervention (or planned intervention) to resolve the index ureteric stone.
A review of the evidence base approach was used. Data were available from two systematic reviews125,126 that included RCTs comparing alpha-blockers, calcium channel blockers and a variety of controls (placebo, treatment as usual or prescribed painkillers).127–129 Only three RCTs compared tamsulosin and nifedipine directly, although there were a number of other trials that compared them to another treatment or a placebo. RCT data from both reviews were combined in a network meta-analysis to maximise the available data to inform the sample size calculation.
The estimated RR effects are shown in Table 3. For simplicity, the uncertainty around the estimates is not shown. The RRs of being stone free, comparing nifedipine and tamsulosin to the mixed control group, were estimated to be 1.50 and 1.70, respectively. Of particular note, the RR of being stone free for tamsulosin compared with nifedipine was estimated to be 1.15.
An estimate of the anticipated control (placebo) group event rate for being stone free was needed before the sample size could be calculated. This was estimated to be 50%, using a random effects estimate of the pooled proportion of the control arms of the RCTs from the two systematic reviews. This was then used as the placebo control group response in the sample size calculation, in lieu of better evidence that might be more relevant to the anticipated population. Using this and applying the corresponding RRs from the network meta-analysis, the stone-free level was anticipated to be 75% and 85% in the nifedipine and tamsulosin groups, respectively.
The study sample size was based on the comparison of the nifedipine and tamsulosin treatments. A standard sample size (for a two-sided, 5% significance level and 90% power with a continuity correction) for comparing two proportions gave a required number of 354 in each group. This sample size was inflated to 400 per group to account for an approximate 10% loss to follow-up. The total required sample size was 1200 (applying this size to the placebo group as well).
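The chain from control group rate and RRs to the sample size can be retraced as follows. The sketch is illustrative: it assumes the normal-approximation formula for two proportions with a Fleiss-type continuity correction (the text confirms that a continuity correction was used, but not the exact formula), and the function name is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

control_rate = 0.50  # pooled placebo/control stone-free proportion
relative_risk = {"nifedipine": 1.50, "tamsulosin": 1.70}

# Anticipated stone-free proportions: control rate multiplied by each RR
anticipated = {drug: control_rate * rr for drug, rr in relative_risk.items()}


def n_per_group_cc(p1, p2, alpha=0.05, power=0.90):
    """Per-group n for two proportions with a continuity correction."""
    z = NormalDist()
    delta = abs(p1 - p2)
    p_bar = (p1 + p2) / 2
    num = (z.inv_cdf(1 - alpha / 2) * sqrt(2 * p_bar * (1 - p_bar))
           + z.inv_cdf(power) * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    n = (num / delta) ** 2
    return ceil(n / 4 * (1 + sqrt(1 + 4 / (n * delta))) ** 2)


n_group = n_per_group_cc(anticipated["tamsulosin"], anticipated["nifedipine"])
n_inflated = ceil(n_group / 0.90)  # allow for 10% loss to follow-up

print(anticipated, n_group, n_inflated)
```

Under these assumptions the calculation gives 354 per group, and dividing by 0.90 gives 394, which is consistent with the rounded figure of 400 per group (1200 across the three arms) used in the trial.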
The placebo control group size was kept at 400 for the planned comparison with any MET (nifedipine and tamsulosin combined), which provided > 90% power. The size of the placebo group could have been reduced using an uneven allocation ratio, but was instead kept the same size as the two active treatment arms. The funding agency strongly supported the inclusion of a placebo arm, given concerns about the potential risk of bias and the relatively small size of the existing placebo-controlled trials. No adjustment was made to the alpha level for multiple treatment comparisons (and therefore no inflation to the standard sample size), as the different comparisons were considered independent research questions: (1) MET compared with placebo control; and (2) nifedipine compared with tamsulosin. See Box 9 for the corresponding sample size explanation from the protocol.111
Case study 5: the MACRO trial
Chronic rhinosinusitis (CRS) is a common condition, affecting around 10% of the UK adult population, that can lead to chronic respiratory disease or impaired quality of life. Initial management of CRS in the UK is in the family doctor setting, followed by referral to a hospital setting for medical treatment. Initial management fails to deliver sufficient relief for around one in three patients who attend hospital ear, nose and throat clinics.130,131 The role of antibiotics for CRS is unclear, although they are commonly used in clinical practice. Endoscopic sinus surgery is a commonly conducted operation. Its use varies from centre to centre due to an insufficient evidence base. Two Cochrane systematic reviews132,133 of treatment of CRS with medical and surgical treatments highlighted the need for new randomised trials. Two main research questions related to treatment for patients with CRS were apparent to the investigators:
- the relative benefits of surgical compared with medical treatment
- the role of antibiotics.
Given the lack of clarity about current practice in the UK, two possible trial designs were considered potentially appropriate:
- A two-stage trial incorporating two linked randomised comparisons:
  - stage 1 – antibiotic compared with placebo for 3 months
  - stage 2 – proceeding to receive endoscopic sinus surgery or continued medical therapy for those without significant benefit.
- A three-arm randomised trial comparing antibiotic, placebo and endoscopic sinus surgery.
The relative merits of the study designs are not considered here. Instead, the focus is on specifying the target difference. The Management for Adults with Chronic RhinOsinusitis (MACRO) trial112,113 was designed to have a sample size sufficient for whichever of the two designs was ultimately chosen. An expert panel subsequently chose a variant of the three-arm design.
The primary outcome was the Sinonasal Outcome Test-22 items (SNOT-22), a validated disease-specific quality-of-life instrument.134 An anchor approach was used to estimate the MID. A ‘medium’ SES (according to Cohen) was also calculated and used, as there is evidence to suggest that 0.5 SDs would be a reasonable estimate of the MID for this type of outcome.135,136
Data from an existing study were used to infer what might be realistic to observe.134 Limited work had assessed what would be a MID in the SNOT-22 score. Based on a large existing study of around 2000 patients receiving surgery for CRS with/without nasal polyps, which used the SNOT-22, a SD of 20 seemed plausible (group change score SDs were in the range of 19–20). An analysis adjusting for baseline was planned. A 10-point difference in the SNOT-22 (0.5 SD, using the SD of 20.0 from that study134) could be considered an important difference to detect.
The anchor method study suggested that the MID could be slightly smaller, at 8.9 points. This estimate was derived by calculating the average difference between those who stated post treatment that they were ‘a little better’ (9.5-point reduction) and those who stated they were ‘about the same’ (0.6-point reduction): 9.5 – 0.6 = 8.9 points mean difference. In the aforementioned study134 of surgical patients, the overall mean change score was 16.2, substantially larger than the aforementioned MID estimates. It is not realistic to expect all of this to be observed in a comparison of surgery with another treatment. If 25% (arbitrary, but based on the judgement of the team) of the response were attributed to regression to the mean or the process of receiving treatment of some kind, this would suggest that a difference of 12.2 might be plausible for surgery compared with an essentially non-effective treatment. A similar value for 15% of the effect was also considered (13.8).
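The candidate target differences can be laid out as simple arithmetic. The sketch below uses decimal arithmetic with half-up rounding so that the one-decimal-place figures are not affected by binary floating point. The variable names are hypothetical, and the 25% and 15% attributions are the team's judgement-based values quoted above.

```python
from decimal import Decimal, ROUND_HALF_UP


def one_dp(value):
    """Round a Decimal to one decimal place, rounding halves up."""
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Anchor-based MID: 'a little better' minus 'about the same'
anchor_mid = Decimal("9.5") - Decimal("0.6")

# Distribution-based MID: half of the SD of 20
half_sd_mid = one_dp(Decimal("0.5") * Decimal("20.0"))

# Realistic differences: mean change after discounting a non-specific share
mean_change = Decimal("16.2")
realistic_25 = one_dp(mean_change * (1 - Decimal("0.25")))
realistic_15 = one_dp(mean_change * (1 - Decimal("0.15")))

print(anchor_mid, half_sd_mid, realistic_25, realistic_15)
```

This reproduces the four values carried into the sample size calculations: 8.9, 10.0, 12.2 and 13.8 points.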
The four mean values (8.9, 10.0, 12.2 and 13.8 points) reflected what might be clinically important differences (8.9–10) and realistic target differences (12.2–13.8) to use in the sample size calculations to look at various potential sample sizes under the two designs. A range of standard sample size calculations for a two-arm trial was produced, looking for 80% or 90% statistical power at the two-sided 5% level, using a pooled SD of 20.0 and assuming around 10% missing data (which was plausible based on two previous studies137,138 of patients in this area).
Three-arm design
For 90% power and an 8.9-point target difference, 107 per group would be required. Applied to the three-arm design and allowing for 10% missing data, this would lead to 120 per group and 360 overall. In the presence of clustering by surgeon in only the surgery arm, this would still be sufficient to achieve just under 80% power (additionally assuming an ICC of 0.05, 10 clusters of cluster size 12 and similar levels of missing data) for the relevant comparisons. No allowance for unequal cluster sizes was made. However, the actual number of clusters that such a trial would use was thought likely to be somewhat higher, offsetting any potential loss due to uneven cluster sizes.
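The range of standard two-arm calculations can be tabulated with a few lines of code. The sketch assumes the normal-approximation formula for two means with a pooled SD of 20, and it ignores clustering. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sd=20.0, alpha=0.05, power=0.90):
    """Normal-approximation per-group n for comparing two means."""
    z = NormalDist()
    quantiles = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(2 * quantiles ** 2 * (sd / delta) ** 2)


# Candidate target differences on the SNOT-22, at 80% and 90% power
for delta in (8.9, 10.0, 12.2, 13.8):
    for power in (0.80, 0.90):
        n = n_per_group(delta, power=power)
        n_missing = ceil(n / 0.90)  # allow for around 10% missing data
        print(f"delta={delta:>4} power={power:.2f} n={n:>3} with missing={n_missing}")
```

For an 8.9-point difference at 90% power this gives 107 per group, rising to 119 after allowing for 10% missing data, consistent with the rounded 120 per group above.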
Two-stage design
What could be achieved with this sample size (360) was then considered for the two-stage design. The full sample would be available for stage 1, barring any missing data. In the absence of clustering, this would be more than sufficient to detect the 8.9-point difference (power of 98%). However, the stage 2 comparison drives the overall sample size calculation in a two-stage design, as the sample size must be inflated to deal with a loss of randomised participants after stage 1. This loss was assumed to be 50%, based on limited prior evidence and erring towards higher estimates. Assuming too low a loss between stages would have a huge impact on the precision of the stage 2 comparison.
An overall sample size of 360 would lead to 180 available at stage 2 (90 per group). Using a target difference of 10.0, a sample size of 360 would achieve 90% power. In the presence of clustering, a larger target difference is needed to have 80% or 90% power. The sample size would be sufficient if a target difference of 12.2 was used, after allowing for clustering in the same manner as in the three-arm design. See Box 10 for the corresponding sample size explanation presented in the grant application. The final sample size was inflated at the request of the funder to allow the subgroups with and without nasal polyps to be analysed. For simplicity, this is not considered here.
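The power available at each stage, ignoring clustering, can be approximated as follows. This is a sketch under stated assumptions: a normal approximation, a SD of 20, a 50% loss between stages and 10% missing data at analysis (`RETAINED = 0.90`). Applying the 10% allowance at both stages is an assumption made here for illustration, not a detail given in the text.

```python
from math import sqrt
from statistics import NormalDist


def power_two_means(n_per_group, delta, sd=20.0, alpha=0.05):
    """Normal-approximation power for a two-sided comparison of two means."""
    z = NormalDist()
    se = sd * sqrt(2 / n_per_group)
    return z.cdf(delta / se - z.inv_cdf(1 - alpha / 2))


TOTAL = 360
RETAINED = 0.90  # assumed proportion with outcome data (10% missing)

stage1_n = TOTAL / 2 * RETAINED        # analysable per group at stage 1
stage2_n = TOTAL * 0.5 / 2 * RETAINED  # after a 50% loss between stages

power_stage1 = power_two_means(stage1_n, delta=8.9)
power_stage2 = power_two_means(stage2_n, delta=10.0)
power_stage2_complete = power_two_means(TOTAL * 0.5 / 2, delta=10.0)

print(round(power_stage1, 2), round(power_stage2, 2), round(power_stage2_complete, 2))
```

Under these assumptions stage 1 has about 98% power for the 8.9-point difference. Stage 2 has roughly 89% to 92% power for a 10-point difference, depending on whether missing data are allowed for, which is in line with the approximately 90% described above.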
Case study 6: the RAPiD trial
Increased use of antibiotics is a major contributor to the spread of antimicrobial resistance. Dentists are responsible for approximately 10% of all antibiotics dispensed in UK community pharmacies. Despite clear clinical guidance, evidence demonstrates that dentists often prescribe antibiotics inappropriately in the absence of clinical need. The effectiveness of strategies to change the behaviour of health professionals is variable, but audit and feedback (A&F) has been shown to lead to small but important improvements in behaviour across a range of contexts and settings. The Reducing Antibiotic Prescribing in Dentistry (RAPiD) trial114 randomised all dental practices with responsibility for prescribing in Scotland (n = 795), using routinely collected Scottish NHS dental prescribing and treatment claim data [available through PRISMS (Prescribing Information System for Scotland)], to compare the effectiveness of different individualised (to dentists with practices) A&F interventions for the translation into practice of national guidance recommendations on antibiotic prescribing.
A total of 795 practices were randomly allocated to an intervention or the control (no A&F). Six hundred and thirty-two intervention group dental practices were subsequently evenly allocated to one of eight A&F groups in a 2 × 2 × 2 factorial design. The three factors were (1) receiving feedback with or without a written behaviour change message, (2) providing the graph of monthly practice prescribing levels with or without health board prescribing levels in the graph and (3) receiving feedback reports twice (0 and 6 months) or three times (0, 6 and 9 months). This led to a total of eight equal-sized intervention groups of 79 practices. The remaining 163 practices in Scotland formed the no intervention control group. The addition of this independent no intervention control group led to a ‘partially’ factorial design rather than fully factorial.
The RAPiD trial sample size calculation was unusual, as the population of sample units was fixed by the size of the country [i.e. every dental practice with responsibility for prescribing (n = 795) in Scotland was expected to take part, as part of a national policy to participate in dental service delivery research]. The sample size calculation was therefore based on identifying whether or not adequate statistical power could be achieved for the primary comparisons for target differences that were considered theoretically plausible (realistic) for a fixed size. The cost implications of a larger sample size were nominal and therefore the full population was always going to be used. The analysis was intended to be at the dentist level, adjusted for dental practice. However, the sample size calculation was carried out at the practice-aggregated level and was therefore conservative.
A systematic review139 demonstrated that the interquartile range of effects of A&F across different settings was 0.5–16%. The study team therefore determined that a 10% reduction (or less) would be both plausible and important. The routine prescribing data indicated that the mean number of antibiotic items prescribed per list was 141.1, with a SD of 140.9. Given that past prescribing behaviour is highly predictive of (i.e. correlated with) future prescribing, both theoretically and empirically (a correlation of 0.91 was observed for the two most recent pre-intervention years), correction for the anticipated baseline correlation was used to improve the precision.140 A baseline-prescribing data-adjusted analysis was correspondingly planned.
With the sample size calculation for the A&F comparisons estimated, the study sample size for the comparison of A&F compared with no A&F was fixed by the number of dental practices left in Scotland that could be randomised to no A&F intervention. The detectable difference was 12%, which was still considered both plausible (realistic) and important, should it be observed. Given that the intervention group was being modelled twice within the two main hypotheses (intervention vs. no A&F and intervention factors vs. no intervention factors), Bonferroni’s adjustment was used to adjust the significance level to 2.5%, to maintain an overall two-sided 5% significance level. See Box 11 for the sample size explanation presented in the trial results paper.
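Because the number of practices was fixed, the calculation runs in reverse: from the available sample size to the detectable difference. The sketch below is an illustration under stated assumptions, not the trial team's calculation. It applies the usual variance reduction factor (1 − correlation²) for a baseline-adjusted analysis, a normal approximation and the Bonferroni-adjusted 2.5% level. The power level is not stated in this section, so 80% and 90% are both tried.

```python
from math import sqrt
from statistics import NormalDist


def detectable_difference(n1, n2, sd, baseline_corr, alpha, power):
    """Smallest mean difference detectable with the given power."""
    z = NormalDist()
    sd_adjusted = sd * sqrt(1 - baseline_corr ** 2)  # baseline-adjusted SD
    se = sd_adjusted * sqrt(1 / n1 + 1 / n2)
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * se


MEAN_ITEMS = 141.1  # mean antibiotic items prescribed
SD_ITEMS = 140.9
alpha = 0.05 / 2    # Bonferroni adjustment for the two main hypotheses

results = {}
for power in (0.80, 0.90):  # assumed power levels
    diff = detectable_difference(632, 163, SD_ITEMS, 0.91, alpha, power)
    results[power] = 100 * diff / MEAN_ITEMS  # as a percentage of the mean
    print(power, round(diff, 1), round(results[power], 1))
```

Under these assumptions the detectable difference for A&F compared with no A&F falls between roughly 11% and 13% of mean prescribing, bracketing the 12% quoted above.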
Publication Details
Publisher
NIHR Journals Library, Southampton (UK)
NLM Citation
Cook JA, Julious SA, Sones W, et al. Practical help for specifying the target difference in sample size calculations for RCTs: the DELTA2 five-stage study, including a workshop. Southampton (UK): NIHR Journals Library; 2019 Oct. (Health Technology Assessment, No. 23.60.) Chapter 5, Case studies of sample size calculations.