
Raftery J, Young A, Stanton L, et al. Clinical trial metadata: defining and extracting metadata on the design, conduct, results and costs of 125 randomised clinical trials funded by the National Institute for Health Research Health Technology Assessment programme. Southampton (UK): NIHR Journals Library; 2015 Feb. (Health Technology Assessment, No. 19.11.)

Chapter 7
Theme 4: were the statistical analyses appropriate and as planned?

This chapter considers questions surrounding the appropriateness of the statistical analyses. After a brief review of the relevant literature, 19 questions were explored. The results are summarised and discussed.

Introduction

Outcome reporting bias has been widely reported.10,11,27,80–87 However, only a few papers have reported on whether or not researchers adequately specify planned analyses in the protocol and, subsequently, whether or not they follow the prespecified analysis.88,89 This matters because failure to follow the prespecified analysis can result in bias. One study suggested that protocols were not sufficiently precise to identify deviations from planned analyses.89 Two reviewed whether or not sample size calculations were adequately specified.88,90 Another recently questioned whether or not the current method of sample size calculation is appropriate.91 These studies are summarised below.

The primary outcomes in protocols were compared with those in published reports for 102 trials approved by the scientific ethics committees for Copenhagen and Frederiksberg, Denmark, between 1994 and 1995.10 Selective reporting was revealed, with 62% of trials reviewed having at least one primary outcome added, omitted or changed.

A similar review of 48 trials funded by the Canadian Institutes of Health Research81 found that in 33% of trials the outcome listed as primary in the publication differed from that in the protocol. It also found that outcome results were incompletely reported.

A pilot study conducted in 2000 reviewed 15 applications received by a single local research ethics committee in the 1990s and compared the outcomes, analysis and sample size in each protocol with those presented in the final study report.89 The authors found that six protocols (40%) stated the primary outcome and, of these, four (67%) matched that in the published report. Eight mentioned an analysis plan but only one (12%) followed its prescribed plan. The study concluded that selective reporting may be substantial but that bias could be identified only broadly because protocols were not sufficiently precise.

In 2008, Chan et al.88 compared the statistical analysis and sample size calculations specified in the protocol with those specified in the published paper. They found evidence of discrepancies in the sample size calculations (18/34 trials), the methods of handling protocol deviations (18/34 trials), methods of handling missing data (39/49 trials), primary outcome analyses (25/42 trials), subgroup analyses (25/25 trials) and adjusted analyses (23/28 trials). These discrepancies could affect the reliability of results, introduce bias and indicate selective reporting. They concluded that the reliability of trial reports cannot be assessed without access to the protocol.

A 2008 comparison of the sample size calculation specified in the protocol with that in the publication found that only 11 of the 62 trials reviewed adequately described the sample size calculation in both the protocol and published report.88

Charles et al.,90 in a review of the reporting of sample size calculations in 215 trials published between 2005 and 2006, found that 43% did not report all the required sample size calculation parameters.

A study of 18 trials that reported on traumatic brain injury reviewed the covariates adjusted for and the subgroup analyses performed.92 Protocols could be obtained for 6 of the 18 trials, and all six reported subgroup effects that differed from those specified in their protocols.

In collaboration with journal editors, triallists, methodologists and ethicists, Chan et al.93,94 have launched the Standard Protocol Items: Recommendations for Interventional Trials (SPIRIT) initiative to establish evidence-based recommendations for the key content of trial protocols.

The above studies may not reflect current practice because either the number of trials reviewed was small or the studies reviewed were relatively old (1994–5 for Chan et al.88 and similar for Hahn et al.89). Practice may have improved since then, following the introduction of CONSORT and other guidelines.

Our objective was to repeat these analyses on the cohort of all HTA published RCTs, assessing the extent of these discrepancies and whether or not they improved over time.

Questions addressed

The aim was to review the appropriateness of the statistical analyses for all published HTA clinical trials, including the sufficiency of the proposed statistical plan, handling of missing data and whether or not there were discrepancies between what was proposed and what was actually reported in the published monograph.

The questions posed (Box 6) fall under the following six subheadings:

  1. Did the protocol specify the planned method of analysis for the primary outcome in sufficient detail?
  2. Was the analysis planned in the proposal/protocol for the primary outcome carried out?
  3. How was the sample size estimated?
  4. How adequate was the reporting of planned and actual subgroup analysis?
  5. Other information: what graphical presentation of data was reported in HTA trials?
  6. Were conclusions justified given the analysis results?

Methods

Nineteen questions were piloted as shown in Box 6. Four questions were considered but not proceeded with, regarding:

  1. the number of statistical tests and number of primary statistical tests
  2. whether or not authors measured more outcomes than they reported
  3. adequate reporting of subgroup analyses
  4. whether or not the conclusions were justified given the analysis results.

Difficulties arose with each of these questions. Firstly, results were not presented in a standard format in the monographs. Secondly, as the monographs were lengthy, data extraction meant searching and reading through many pages. Thirdly, as the HTA trials are pragmatic, they include a large number of outcomes measured at multiple time points, which increased the number of tables/amount of text to be reviewed. Fourthly, extracting information on subgroup analyses planned and carried out was difficult because authors seldom labelled analyses as subgroup analyses. Lastly, we found it difficult to specify data that could answer the question regarding the conclusions being justified by the analyses.

For the 19 questions explored, the methods used in the literature reviewed above were used as a framework to detail the questions. For example, the paper by Chan et al.88 provided the key components of data that needed to be extracted on the sample size calculation. Data on these components were expanded to include other types of outcome measures and study designs (e.g. time-to-event data, non-inferiority and cluster randomised trials). We extracted these data from the protocol or project proposal (if a protocol was not available) and monograph, and analysed the data in a similar way.

Denominators

All trials were included (n = 125). The unit of analysis for questions T4.1–T4.15 was each trial’s primary outcome with complete analysis (n = 164 planned and n = 161 reported). The unit of analysis for T4.16–T4.18 was the individual trial.

Results

Questions T4.1–T4.10: did the protocol specify the planned method of analysis for the primary outcome in sufficient detail?

Question T4.1: how many specified a method of analysis for the primary outcome?

The 125 trials included 206 planned primary outcomes and reported on 232 primary outcomes. Of these, 164 and 161, respectively, were ‘complete for analysis’ (these are the denominators for questions T4.1–T4.10).

The method of analysis was prespecified for 111 out of 164 planned primary outcomes (67.7%), with little difference between those that did and did not have protocols (65.9%, 54/82 from the proposal and 69.5%, 57/82 from the protocol) (Table 32).

Question T4.2: has this improved over time?

There is a slight indication that the specification of the primary outcome analyses has improved over time. This could be due to the increasing number of protocols available (Table 33 and Figure 6) but the low numbers preclude strong conclusions.

Question T4.3: statistical test applied

Of the 111 planned primary outcomes with a prespecified method, the proposed statistical test/choice of model was described in 107 (96.4%). The most frequently reported planned methods of analysis were logistic regression (23.4%, 26/111) and analysis of covariance (ANCOVA)/linear regression (17.1%, 19/111), followed by the t-test (14.4%, 16/111) (Table 34).

Question T4.4: significance level

Of the 111 primary outcomes with a specified method of analysis, the significance level/confidence interval level to be used was specified in 39 (35.1%). Table 34 shows that the most commonly used level of statistical significance was 5%.

Question T4.5: hypothesis testing

The majority did not specify whether one-sided or two-sided analysis would be performed (87.4%, 97/111) (see Table 34).

Question T4.6: adjustment for covariates

Sixty-eight of the 111 (61.3%) planned primary outcomes specified the covariates that they planned to adjust for in the final analysis.

Question T4.7: analysis population

The planned population for the primary analysis was not specified for 41.4% (46/111). This appears to have improved over time (apart from anomalies in 1998 and 2003), with a marked increase in 1996, the year in which CONSORT was published.

Question T4.8: adjustment for multiple testing

Almost all studies failed to specify a method of adjustment for multiple testing (93.7%, 104/111). As HTA trials are pragmatic rather than licensing trials, examining a range of outcomes over short- and long-term periods, adjustment for multiple testing may matter less than transparency.
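To illustrate what such an adjustment involves, the sketch below applies the simplest method, the Bonferroni correction, which divides the overall significance level by the number of tests. This is a minimal Python illustration; the outcome names and p-values are hypothetical.

    # Bonferroni correction: test each of m hypotheses at alpha/m so that
    # the family-wise error rate stays at or below alpha.
    p_values = {"pain score": 0.012, "mobility": 0.030, "quality of life": 0.200}  # hypothetical
    alpha = 0.05
    adjusted_alpha = alpha / len(p_values)  # 0.05 / 3 tests = 0.0167
    for outcome, p in p_values.items():
        verdict = "significant" if p < adjusted_alpha else "not significant"
        print(f"{outcome}: p = {p:.3f} -> {verdict} at adjusted alpha {adjusted_alpha:.4f}")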

Question T4.9: missing data

Most studies did not specify a method for handling missing data (74.8%, 83/111). Of those that did, the methods used varied (see Table 34).
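To make the most commonly planned single-imputation method in Table 34 concrete, the sketch below implements last observation carried forward (LOCF) for one participant's repeated measurements. The data are hypothetical and the sketch is illustrative only; LOCF can bias results when dropout is related to outcome.

    # Last observation carried forward (LOCF): replace each missing visit
    # with the most recent observed value for that participant.
    def locf(values):
        """Fill None entries with the last observed value; leading gaps stay None."""
        filled, last = [], None
        for value in values:
            if value is not None:
                last = value
            filled.append(last)
        return filled

    visits = [24.0, 21.5, None, None, 18.0, None]  # hypothetical outcome scores
    print(locf(visits))  # [24.0, 21.5, 21.5, 21.5, 18.0, 18.0]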

Question T4.10: is sufficient detail including all of the above seven elements recorded in the protocol?

The number of protocols meeting all seven criteria was low, at 1.8% (2/111). When we limited the criteria to three (model/test, significance level and analysis population), of the 111 primary outcomes for which a method of analysis was specified in the protocol/proposal, 30 primary outcomes qualified (27%, 30/111). This increased slightly over time, from 22.7% before 1998 to 35.6% after.

Questions T4.11–T4.15: was the analysis planned in the protocol/proposal for the primary outcome carried out?

Question T4.11: statistical test/model used

Of the 82 trials whose primary outcome was as planned, the authors changed the planned method of statistical analysis (model/test) in 20 (24.4%). Some changed to more complex methods (a t-test was changed to linear regression in five instances) and others to simpler methods (in three instances, a chi-squared test was carried out instead of logistic regression, linear regression was used instead of a mixed model and Fisher's exact test was used instead of Cox proportional hazards). These could be legitimate changes or selective reporting depending on the results, something we did not explore (examples are given in Box 7).

Question T4.12: significance level

All but six trials used the 5% significance level. Of the six discrepancies between the significance level stated in the protocol/proposal and that used in the monograph, one led to an increase in the significance level used, but this seems to be an error (trial ID42 stated in the protocol that ‘Differences will be judged significant at the 2.5% level to take account of two primary comparisons being drawn’; the monograph stated that 95% confidence intervals would be calculated, although a 2.5% significance level was stated in the sample size calculation in the protocol).

Question T4.13: analysis population

Of those trials that stated the planned analysis population for the primary outcome analysis in the protocol/proposal, 90% (56/62) followed the plan. Most carried out what they described as an ‘intention-to-treat’ analysis. In two trials, the triallists stated in the protocol/proposal that they would carry out both an intention-to-treat and a per-protocol analysis but reported only the per-protocol analysis. Both were from trial ID109, where ‘The data were analysed per protocol. As planned, no intention-to-treat analyses were conducted, as < 10% of subjects would have been classified differently in such an analysis.’ This change of analysis population was therefore justified, because the authors had specified in the protocol a rule that was used to decide which population to use.

Question T4.14: missing data

Of the 28 trials for which a method of handling missing data was specified in the protocol, the method used was different in 12 (42.9%).

Question T4.15: covariates adjusted for in the analysis

For 68 of the 111 primary outcomes (61.3%), the planned adjustment for covariates was outlined, and for 31 (27.9%) it was unclear (Table 35). Some trials did not specify in the protocol which covariates they would adjust for or, if they did, failed to specify exactly which, for example ‘adjusting for baseline variables’ or ‘taking into account any statistically important imbalances’. This made it difficult to compare planned covariates with the covariates actually adjusted for in many trials.

In summary, the analyses planned in the proposal/protocol for the primary outcome were carried out in 62 of the 82 trials (76%) and changed in 20 (24%) (considering the statistical test/model only). The method of handling missing data specified in the protocol/proposal did not match what was carried out 43% of the time. The analysis population and significance level each changed in around 10% of trials. More detailed examination suggests that some of the changes were legitimate. Without knowing whether a statistical analysis plan was drawn up before the analysis and subsequently followed, one cannot conclude that there were departures from proper practice.

Questions T4.16–T4.18: how was the sample size estimated (power, confidence interval, etc.)?

We followed the methods and tables used by Chan et al.,88 expanded to incorporate the different types of sample size calculation observed in the HTA trials (e.g. width of confidence interval calculations, time-to-event data, standardised effect size, non-inferiority, equivalence).

Question T4.16: was sufficient information on the sample size calculation provided?

The results of classifying the trials by the five components suggested by Chan et al.88 are shown in Table 36. Of the 125 trials, 75 proposals/protocols (60%) and 66 monographs (52.8%) reported all the required sample size components. Individual components were reported in 60.7–100% of proposals/protocols and 49.6–100% of monographs. The required sample size was reported in the proposal/protocol in 93% of trials (116/125), in the monograph in 90% (112/125) and in both in 89%. The result from the sample size calculation was presented in the proposal/protocol in 57% of trials (71/125), in the monograph in 46% (58/125) and in both in 42% (e.g. the sample size calculation showed that the trial would have to recruit 326 participants; taking account of the participant dropout rate, this increased the number needed per arm to 350). Forty-two per cent of trials (52/125) reported all the required components of the sample size calculation in both the proposal/protocol and the monograph.
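As an illustration of the completeness check applied here, the sketch below tests whether a trial record reports the components required for a conventional power-based calculation on a continuous outcome, following Chan et al.88 The field names are our own hypothetical choices; a real check would branch for binary, time-to-event and confidence-interval-width calculations.

    # Completeness check for a power-based sample size calculation on a
    # continuous outcome (after Chan et al.): outcome measure, alpha, power,
    # minimum clinically important difference (delta) with its SD, and the
    # total sample size required.
    REQUIRED = ("outcome_measure", "alpha", "power", "delta", "sd", "total_required")

    def missing_components(record):
        """Return the names of required components absent from the record."""
        return [field for field in REQUIRED if record.get(field) is None]

    protocol = {"outcome_measure": "6-month pain score", "alpha": 0.05, "power": 0.90,
                "delta": 5.0, "sd": None, "total_required": 350}  # hypothetical record
    print(missing_components(protocol))  # ['sd'] -> not fully reported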

Question T4.17: does the sample size calculation in the protocol match the sample size calculation shown in the monograph? What discrepancies were found?

Of the 117 trials reporting a sample size calculation in both the proposal/protocol and the monograph, we observed discrepancies between what was planned and what was reported in 45 trials (38.5%). A component of the sample size calculation was reported in the monograph but not in the protocol/proposal in 18 trials. In 39 trials, there was a discrepancy in at least one component reported in both the protocol/proposal and the monograph. These discrepancies were not acknowledged in the monograph. Where a discrepancy was observed in the number of patients the trial planned to recruit, this was twice as likely to be because the number specified in the monograph was smaller than that in the protocol/proposal than vice versa (19 trials vs. 10 trials). Where a discrepancy existed in the minimum clinically important effect size, this was also almost twice as likely to be due to the effect size being reported as larger in the monograph than in the protocol (Table 37). These discrepancies could be due to reductions in the planned sample size after the study started that were not reported in the monograph, or to attempts to justify the smaller number of patients actually recruited.

Question T4.18: what values of alpha, power and dropout were used in the sample size calculation?

In the proposal/protocol, a 5% significance level was used in the sample size calculation 94.4% of the time (102/108). Eighty per cent power was specified in half of the protocols (52.2%, 59/113) and 90% power was specified in over one-third (37.1%, 42/113). The triallists inflated the sample size for participant loss to follow-up 61.5% of the time (72/122) in the protocol/proposal and 48.3% of the time (58/120) in the monograph (Table 38).
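Inflating a calculated sample size for expected loss to follow-up is simple arithmetic: divide by the proportion expected to remain and round up. A worked sketch with hypothetical numbers:

    import math

    # n_required = n_calculated / (1 - expected dropout rate), rounded up.
    def inflate_for_dropout(n_calculated, dropout_rate):
        return math.ceil(n_calculated / (1.0 - dropout_rate))

    print(inflate_for_dropout(326, 0.10))  # 363: 326 recruits become 363 if 10% drop out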

Question T4.19: other information – what graphical presentation of data was reported in Health Technology Assessment trials?

We reviewed each HTA monograph and assessed whether it included a repeated measures plot, a Kaplan–Meier plot or a forest plot, as these were the most frequently reported figure types in Pocock et al.95 (accounting for 92% of the figures published in the 77 RCT reports that they reviewed in five general medical journals). A repeated measures plot was presented in the HTA monograph for 38.4% of the trials (48/125), followed in frequency by a Kaplan–Meier plot (20%, 25/125) and a forest plot (16.8%, 21/125) (Table 39). A repeated measures plot was observed more frequently in the HTA monographs than in Pocock et al.’s95 sample, and a Kaplan–Meier plot less often. This could be due to differences in the types of trials reviewed, with HTA trials more likely to involve longer follow-up at multiple time points and less likely to include survival outcomes.

Analysis

The planned method of analysis for the primary outcome was not specified in the protocol/proposal in one-third of the 125 trials. Of those that specified a method of analysis, only two (1.8%) fully specified the method using all seven core criteria. Twenty-seven per cent met three criteria (statistical test/model, significance level and analysis population). Improvements occurred over time, from 22.7% before 1998 to 35.6% thereafter. There did not appear to be differences in the level of detail reported in the protocol compared with the proposal, but this could be due to small numbers or to confounding (with the year the commissioning brief was advertised).

Out of the 125 trials reviewed, only 52 (41.6%) reported all the required components of the sample size calculation in both the proposal/protocol and the monograph. Of these, the information in the proposal/protocol matched the information in the monograph in only 43 trials (34% of the 125) (see Tables 36 and 37). Where discrepancies were observed, they were twice as likely to show a smaller planned sample size in the monograph than in the protocol.

Discussion

We were able to extract data to answer a number of questions on the planned and actual method of statistical analysis and sample size calculation. The degree to which this study was successful varied by the three broad sets of questions:

  • Questions T4.1–T4.10: did the protocol specify the planned method of analysis for the primary outcome in sufficient detail? The study indicated that this set of questions could be answered and indicated some cause for concern as around one-third of trials provided insufficient detail, particularly on planned statistical analysis.
  • Questions T4.11–T4.15: was the analysis planned in the proposal/protocol for the primary outcome carried out? We showed that it was difficult to complete this set of questions owing to lack of data.
  • Questions T4.16 and T4.17: was sufficient information on the sample size calculation provided? And does the sample size calculation in the protocol match the sample size calculation shown in the monograph? What discrepancies were found? The study showed that it was difficult to complete this set of questions owing to lack of data.

One general finding from this study relates to the limitations of retrospective analysis. Standards changed over time, and we were unable to discuss details with those responsible for the analyses in the trials. In particular, we had no way of knowing whether statistical analysis plans had been drawn up separately from the protocol. We understand that such plans are common practice but that they are often not drawn up until the trial is close to completion. The key issue is that such plans are specified in advance, before the data are examined; we have no way of knowing whether this happened.

This is the first study we are aware of that has reviewed whether or not the method of statistical analysis was recorded in sufficient detail in the protocol, as defined by a minimum set of criteria.

Sample size calculation is a vitally important aspect of any clinical trial to ensure that the number of patients included is large enough to answer the question reliably and as few patients as possible are exposed to a potentially inferior treatment. It is important that all parameters used in the sample size calculation(s) are clearly and accurately reported in both the grant proposal/protocol and final trial publication. The level of detail reported should enable another statistician to replicate the sample size calculation if necessary. The sample size calculation reported in the final trial protocol and final publication should match and any changes to the sample size that were made after the trial had started should be reported.
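As an illustration of the replication the authors have in mind, the sketch below reproduces the standard two-arm, two-sided calculation for a continuous outcome from the reported parameters. The parameter values are hypothetical; trials with binary or time-to-event outcomes need different formulae.

    from math import ceil
    from statistics import NormalDist

    def n_per_arm(alpha, power, delta, sd):
        """Sample size per arm for comparing two means (normal approximation):
        n = 2 * ((z_{1-alpha/2} + z_{power}) * sd / delta)^2."""
        z = NormalDist().inv_cdf
        return ceil(2 * ((z(1 - alpha / 2) + z(power)) * sd / delta) ** 2)

    # Hypothetical parameters: 5% two-sided alpha, 90% power, minimum clinically
    # important difference of 5 points on a scale with SD 12.
    print(n_per_arm(alpha=0.05, power=0.90, delta=5.0, sd=12.0))  # 122 per arm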

We found that sample size calculation information was not being recorded in sufficient detail in both the protocol and publication for RCTs. Where the information was recorded, the level of unexplained discrepancies was surprisingly high. Changes to the sample size calculation after a trial has started are allowed for much the same reasons as listed in relation to changes to the statistical analysis plan [e.g. advances in knowledge, change in trial design or better understanding of the trial data (SD or control group event rate)], but should be minimised as much as possible.

We observed fewer discrepancies than other studies with regard to the method of statistical analysis and whether or not the authors followed the protocol or the sample size calculations. The discrepancies observed could be legitimate changes not reported in the monograph; they could be hiding unacknowledged reductions in the sample size made after the trial started because of recruitment problems (reported in Chapter 6); they could be evidence of selective reporting bias presenting the results as more clinically meaningful than they were (e.g. by increasing the clinically meaningful difference specified); or they could be typographical mistakes. Of these, given the large number of trials that failed to recruit the original planned sample size, as reported in Chapter 6, we think the most likely explanations are the first two.

Questions T4.11–T4.15 explored potential selective reporting. We found that potential selective reporting bias exists in sample size calculation information and in methods of analysis. This is perhaps not so serious, as a previous review of a subset of the RCTs in this cohort found that only 24% of primary outcome results were statistically significant.5 If there were selective reporting bias, we might expect this percentage to be higher.

Chan et al.88 found that the statistical test for primary outcome measures differed between the protocol and the publication in 60% of trials; we found a smaller percentage (25%) in our cohort. This could be because we had access to the final version of the protocol, whereas Chan et al.88 had access to the protocol submitted to an ethics committee. In addition, Chan et al.88 studied protocols from the 1994–5 period, before CONSORT had been developed (in 1996). Chan et al.88 observed that 32.6% of protocols described the planned method of handling missing data, higher than our finding of 25.2%.

Chan et al.88 found that 11 out of 62 trials (17.7%) fully and consistently reported all of the requisite components of the sample size calculation in both the protocol and publication. The corresponding figure in our sample was 34%; this is twice as large as in Chan et al.88 but is still much lower than expected.

We found a similar proportion of trials reporting all the required sample size calculation parameters to that reported by Charles et al.90 They found that 57% of 206 trials reported all the required sample size calculation parameters; we found that 56.4% of our trials did so.

The figures in the paper by Hahn et al.89 are similar to ours, although the studies they reviewed were few and dated.

Strengths and weaknesses of the study

The biggest strength of this study was that we had access to a protocol/proposal for all the trials. This is the largest cohort study that we are aware of to have compared the method of analysis and sample size calculation planned in the protocol with that reported in a publication. This is also the first such study of UK-funded RCTs. Further, previous studies comparing protocols with publications may not reflect current practice because either the number of trials reviewed was small or the studies reviewed were relatively old (1994–5 for Chan et al.88 and similar for Hahn et al.89).

A limitation of our work was that we only analysed the first sample size calculation reported and compared that with the monograph.

We were surprised at the lack of detail in the statistical analysis plans reported in the protocol/proposal and at how few met our criteria. However, as statisticians often create statistical analysis plans separately from the protocol before the final analysis, these may well provide more detail.

Key questions for the HTA programme concern whether or not it requires audit of planned analyses and, if so, how and at what level of detail. Our study shows the limits of retrospective audit based on the protocol/application form and the monograph. More generally, the HTA programme should consider requiring information to be recorded on the statistical test/model planned for use in the analysis, the significance level/confidence interval level to be used and the analysis population.

Recommendations for future work

Should the database be continued, we recommend that the questions on statistical analysis are reviewed alongside the SPIRIT checklist.94 Any further data extraction should include 13 questions: four should remain as they are (T4.1, T4.2, T4.18 and T4.19) and nine should be amended (T4.3, T4.4, T4.7, T4.10, T4.11, T4.12, T4.13, T4.16 and T4.17).

We observed that if trials funded by the HTA programme are to continue to qualify as one of the four cohorts of trials included in Djulbegovic et al.,96 then data will have to be extracted on the relevant fields.5 Dent and Raftery5 assessed treatment success and whether or not the results were compatible with equipoise using six categories: (1) statistically significant in favour of the new treatment; (2) statistically significant in favour of the control treatment; (3) true negative; (4) truly inconclusive; (5) inconclusive in favour of the new treatment; or (6) inconclusive in favour of the control treatment. Trials were classified by comparing the 95% confidence interval for the difference in primary outcome with the difference specified in the sample size calculation. The recent Cochrane Review used data extracted for this project and combined them with the only three other similar cohorts.96
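A sketch of that classification rule in Python, assuming that positive differences favour the new treatment and that the 95% confidence interval is for the difference new minus control; the thresholds are our own illustrative reading of the six categories, not the authors' published algorithm.

    def classify(lower, upper, delta):
        """Compare the 95% CI for the treatment difference with the target
        difference (delta) from the sample size calculation."""
        if lower > 0:
            return "1. Statistically significant in favour of the new treatment"
        if upper < 0:
            return "2. Statistically significant in favour of the control treatment"
        # The CI includes zero: which clinically important differences remain plausible?
        new_plausible = upper >= delta
        control_plausible = lower <= -delta
        if not new_plausible and not control_plausible:
            return "3. True negative (a clinically important difference is ruled out)"
        if new_plausible and control_plausible:
            return "4. Truly inconclusive"
        if new_plausible:
            return "5. Inconclusive in favour of the new treatment"
        return "6. Inconclusive in favour of the control treatment"

    print(classify(lower=-1.0, upper=7.0, delta=5.0))  # hypothetical CI -> category 5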

Unanswered questions and future research

We analysed whether or not the planned analyses were carried out. We did not attempt to investigate whether or not the planned analyses were appropriate.

We compared individual components of the planned method of analysis with individual components reported in the monograph but did not calculate how often all of the components of the analysis plan matched those presented in the monograph. Again, this could be the subject of further work.

Small numbers constrained our analysis of trends in time. If the work continues, time trend analyses should be repeated and extended.

Further work could explore whether or not the amount of detail provided in the protocol on planned analyses is affected by the seniority of the statistician involved, including whether or not he/she was a co-applicant.

Figures

FIGURE 6 Proportion of trials with a protocol available by year of commissioning brief.

Tables

TABLE 32

Planned primary outcome analysis specified in the protocol/proposal by whether or not a protocol was available

Planned primary analysis            Protocol available, n (%)   No protocol available, n (%)   Total, n (%)
Yes                                 57 (69.5)                   54 (65.9)                      111 (67.7)
No                                  14 (17.1)                   13 (15.9)                      27 (16.5)
Not clear                           1 (1.2)                     1 (1.2)                        2 (1.2)
Not applicable                      1 (1.2)                     0                              1 (0.6)
No information available            9 (11.0)                    14 (17.0)                      23 (14.0)
Total number of primary outcomes    82 (100.0)                  82 (100.0)                     164 (100.0)
Total number of trials              65 (52.0)                   60 (48.0)                      125 (100.0)

TABLE 33

Planned primary outcome analysis specified in protocol/proposal by year

Year of commissioning brief   Yes, n (%)   No,a n (%)   Not clear, n (%)   Total, n (%)
1993                          12 (70.6)    5 (29.4)     0                  17 (100.0)
1994                          13 (52.0)    11 (44.0)    1 (4.0)            25 (100.0)
1995                          19 (73.0)    7 (27.0)     0                  26 (100.0)
1996                          16 (55.2)    13 (44.8)    0                  29 (100.0)
1997                          6 (50.0)     5 (42.0)     1 (8.0)            12 (100.0)
1998                          3 (75.0)     1 (25.0)     0                  4 (100.0)
1999                          10 (83.3)    2 (16.7)     0                  12 (100.0)
2001                          20 (87.0)    3 (13.0)     0                  23 (100.0)
2002                          3 (75.0)     1 (25.0)     0                  4 (100.0)
2003                          6 (75.0)     2 (25.0)     0                  8 (100.0)
2005b                         1 (100.0)    0            0                  1 (100.0)
2009b                         2 (66.7)     1 (33.3)     0                  3 (100.0)
Total                         111 (67.8)   51 (33.5)    2 (1.2)            164 (100.0)
a The categories ‘no information available’ and ‘no’ were merged as they were essentially the same.

b There were no trials included in the database with commissioning briefs advertised in 2006, 2007 and 2008 because all of the trials advertised by the HTA programme had yet to publish their results at the time the metadata database was closed in July 2011. The two trials with a commissioning brief advertised in 2009 were trials conducted as a result of the flu call, which had to report within a short time frame.

TABLE 34

Components of the analysis of the primary outcome reported in the protocol/proposal and monograph

Description of planned statistical analyses   Planned from protocol/proposal, n (%)   Reported in the monograph, n (%)

Planned statistical test
t-test                              16 (14.4)     14 (9.4)
Chi-squared test                    8 (7.2)       20 (13.4)
ANOVA                               6 (5.4)       0
ANCOVA/linear regression            19 (17.1)     48 (32.2)
Logistic regression                 26 (23.4)     21 (14.1)
Mixed model                         5 (4.5)       18 (12.1)
Poisson regression                  3 (2.7)       2 (1.3)
Cox proportional hazards            7 (6.3)       8 (5.4)
Log-rank test                       1 (0.9)       4 (2.7)
Mann–Whitney                        1 (0.9)       1 (0.7)
Non-parametric analyses             1 (0.9)       0
Confidence interval                 11 (9.9)      9 (6.0)
Other                               3 (2.7)       4 (2.7)
Not specified                       4 (3.6)       0
Total                               111 (100.0)   149 (100.0)

Significance level
1%                                  0             13 (8.7)
2.5%                                2 (1.8)       1 (0.7)
5%                                  17 (15.3)     22 (14.8)
95% confidence interval specified   20 (18.0)     47 (31.5)
Not specified                       72 (64.9)     66 (44.3)
Total                               111 (100.0)   149 (100.0)

Hypothesis testing
One-sided                           3 (2.7)       3 (2.0)
Two-sided                           11 (9.9)      28 (18.8)
Not specified                       97 (87.4)     118 (79.2)
Total                               111 (100.0)   149 (100.0)

Planned covariates to adjust for
Yes                                 68 (61.3)     0
No                                  9 (8.1)       0
Not clear                           3 (2.7)       0
No information available            31 (27.9)     0
Total                               111 (100.0)   0

Analysis population
ITT analysis                        60 (55.0)     117 (78.5)
PP analysis                         0             3 (2.0)
AT analysis                         0             0
ITT and PP analysis                 5 (3.6)       14 (9.4)
ITT and AT analysis                 0             0
PP and AT analysis                  0             0
No available information            46 (41.4)     15 (10.1)
Total                               111 (100.0)   149 (100.0)

Adjustment for multiple comparisons
Bonferroni correction               4 (3.6)       8 (5.4)
Bonferroni–Dunn                     1 (0.9)       3 (2.0)
Other                               2 (1.8)       5 (3.4)
None specified                      104 (93.7)    133 (89.3)
Total                               111 (100.0)   149 (100.0)

Method of handling missing data
Complete case analysis                                    3 (2.7)       20 (13.4)
LOCF – single imputation method                           4 (3.6)       14 (9.4)
WCI – single imputation method                            0             2 (1.3)
HDI – single imputation method                            0             1 (0.7)
RM – single imputation method                             0             5 (3.4)
Multiple imputation                                       0             7 (4.7)
Mixed model                                               3 (2.7)       6 (4.0)
Generalised estimating equation                           1 (0.9)       1 (0.7)
Survival analysis                                         7 (6.3)       11 (7.4)
Mean – single imputation method                           1 (0.9)       3 (2.0)
More than one method was used to deal with missing data   2 (1.8)       8 (5.4)
Sensitivity analysis                                      7 (6.3)       9 (6.0)
None/no available information                             83 (74.8)     62 (41.6)
Total                                                     111 (100.0)   149 (100.0)

ANOVA, analysis of variance; AT, as treated; HDI, hot deck imputation; ITT, intention to treat; LOCF, last observation carried forward; PP, per protocol; RM, regression methods; WCI, worst-case imputation.

TABLE 35

Examples of discrepancies between covariates which trials planned to adjust for and those actually adjusted for as specified in the monograph

Trial ID 65
  Planned: controlling for baseline HRSD, treatment centre, age and sex; duration of index depressive episode, degree of treatment resistance, psychosis, antidepressant medication equivalents and cognitive impairment.
  Actual: prerandomisation baseline HRSD scores were included as a covariate, as were NHS trusts to adjust for centre effects.

Trial ID 59
  Planned: no information.
  Actual: adjusted for age, sex, surgical status, major presumptive clinical syndrome, SOFA score at time of randomisation and APACHE II score at ICU admission.

Trial ID 74
  Planned: adjusting for group differences at baseline if necessary.
  Actual: with baseline HADS depression score and stratification categories (urban/rural location; horizontal/vertical kinship) as covariates.

Trial ID 78
  Planned: age, sex, time to treatment and stroke type; presence or absence of dysphagia.
  Actual: time to treatment.

Trial ID 86
  Planned: individual-level covariates, e.g. age of mother, parity, and health visitor confounders such as age.
  Actual: after adjusting for covariates such as 6-week EPDS score, living alone, previous history of PND and any life events experienced.

Trial ID 90
  Planned: severity at initial presentation, age and sex.
  Actual: none specified.

Trial ID 94
  Planned: two stratification variables (centre and size of ulcer) were to be adjusted for in the analyses, as were ulcer type, duration of episodes, weight of patient, ankle mobility and a binary variable for the presence/absence of infection at baseline; the authors were to present an unadjusted analysis, but the adjusted analysis would have primacy.
  Actual: a Cox proportional hazards model was used to adjust the analysis for the randomisation stratification factors (centre, baseline ulcer area), as well as duration and ulcer type; actual baseline area (as measured from the tracings) and duration of ulcer were used.

Trial ID 103
  Planned: group, time, group by time, model using a linear trend over time and a quadratic trend if necessary (group by time interaction).
  Actual: adjusted for baseline HbA1c based on those who completed their 12-month HbA1c measurement.

APACHE II, Acute Physiology and Chronic Health Evaluation II; EPDS, Edinburgh Postnatal Depression Scale; HADS, Hospital Anxiety and Depression Scale; HbA1c, glycated haemoglobin; HRSD, Hamilton Rating Scale for Depression; ICU, intensive care unit; PND, postnatal depression; SOFA, Sequential Organ Failure Assessment.

TABLE 36

Reporting of sample size calculation components in the proposal/protocol and monograph

Number of trials reporting each component (n = 117)a

Component of sample size calculation                                  Protocol, n/N (%)   Monograph, n/N (%)   Both,b n/N (%)
1. Name of outcome measure                                            113/117 (96.6)      113/117 (96.6)       111/117 (94.9)
2. Alpha (type 1 error rate)                                          108/117 (92.3)      109/117 (93.2)       104/117 (88.9)
3. (a) Method of calculation: powerc                                  113/116 (97.4)      113/116 (97.4)       110/115 (95.7)
Continuous outcome
  Minimum clinically important effect size (delta)d and               43/49 (87.8)        46/58 (79.3)         39/47 (83.0)
  SD for deltad or                                                    33/49 (67.3)        32/58 (55.2)         29/47 (61.7)
  Standardised effect size                                            12/12 (100.0)       7/7 (100.0)          7/7 (100.0)
Binary outcome
  Estimated event rate in each arme                                   41/53 (77.4)        37/48 (77.1)         36/48 (75.0)
Time-to-event outcome
  Time-to-event dataf                                                 2/2 (100.0)         2/2 (100.0)          2/2 (100.0)
Type of outcome not specified
  No components for sample size calculation specified                 N/A                 1/1 (100.0)          1/1 (100.0)
3. (b) Method of calculation: width of confidence interval            1/1 (100.0)         1/1 (100.0)          1/1 (100.0)
  Binary outcome: event rate in each arm and precision/width
  of confidence interval required                                     1/1 (100.0)         1/1 (100.0)          1/1 (100.0)
  Continuous outcome: SD and precision/width of confidence
  interval required                                                   0                   0                    0
4. Calculated sample size
4. (a) Included result from sample size calculation on number
required to recruit                                                   71/117 (60.7)       58/117 (49.6)        53/117 (45.3)
4. (b) Presented total number of participants required to recruit     116/117 (99.1)      112/117 (95.7)       111/117 (94.9)
5. All components required                                            75/117 (64.1)       66/117 (56.4)        52/117 (44.4)

N/A, not applicable.

a The figures for ‘both’ from 3(a) and 3(b) do not add up to 117 because 11 trials changed the type of primary outcome used from the protocol to the monograph, so sample size calculations could not be compared (five from binary to continuous, one from continuous to not specified, one from effect size to continuous).

b Including only trials that reported a sample size calculation in the protocol/proposal.

c Excluding trials that used width of confidence interval to estimate the sample size, which does not require that power is specified.

d For trials reporting a sample size calculation with a continuous outcome.

e For trials reporting a sample size calculation using a binary outcome.

f The components required for sample size calculations based on time-to-event data were either the proportion in each arm at a particular time point, the median survival in each group, or the median survival in one group and the hazard ratio for the comparison.

TABLE 37

Discrepancies in sample size calculation information reported in the proposal/protocol and monograph

Number of trials reporting each component (n = 117). For each component: total with a discrepancy, n/N; not prespecified,a n; different from protocol description, n.

1. Name of outcome measure: 18/113 total; 2 not prespecified; 16 different from protocol description.
2. Alpha (type 1 error rate): 7/109 total; 5 not prespecified; 2 different.
3. (a) Method of calculation, power: 18/113 total; 3 not prespecified; 15 different (nine larger in monograph; six larger in protocol/proposal).
Continuous outcome
  Minimum clinically important effect size (delta): 19/46 total; 6/46 not prespecified; 13 different (five larger in monograph, three larger in protocol/proposal and five not comparable as the primary outcomes in protocol and monograph are different).
  SD for delta: 5/32 total; 3/32 not prespecified; 2 different (one larger in protocol and one not comparable as the primary outcomes in protocol and monograph are different).
  Standardised effect size: 1/7 total; 0 not prespecified; 1 different (one larger in protocol).
Binary outcome
  Estimated event rate in each arm: 4/37 total; 1 not prespecified; 3 different (in one the values reported were higher in the monograph and one not comparable as the primary outcomes in protocol and monograph are different).
Time-to-event outcome
  Time-to-event data: 0/2 total; 0 not prespecified; 0 different.
Type of outcome not specified
  No components for sample size calculation specified in publication: 1 total; 0 not prespecified; 1 different (values specified in protocol for minimum difference aimed to detect (delta), SD for delta, alpha and power).
3. (b) Method of calculation, width of confidence interval: 1 total.
  Binary outcome (event rate in each arm and precision/width of confidence interval required): 0/1 total; 0 not prespecified; 0 different.
  Continuous outcome (SD and precision/width of confidence interval required): 1/1 total; 0 not prespecified; not comparable as the primary outcome in protocol and monograph are different.
4. Calculated sample size
4. (a) Included result from sample size calculation on number required to recruit: 20/58 total; 5 not prespecified; 15 different (10 larger in protocol and five larger in monograph; these figures include six trials in which the primary outcome used for the sample size calculation differs between protocol/proposal and monograph).
4. (b) Presented total number of participants required to recruit: 30/112 total; 1 not prespecified; 29 different (10 larger in monograph and 19 larger in protocol/proposal; this includes five trials in which the primary outcome used is different, so not comparable).
5. Any component: 45 total; 18 not prespecified; 39 different.
a Reported in the publication but not mentioned in the protocol.

TABLE 38

Sample size calculations reported in the protocol/proposal and monograph

Component of sample size calculation   Protocol, n (%)   Publication, n (%)
Alpha
  5%                           102 (94.4)   100 (91.7)
  1%                           3 (2.8)      3 (2.8)
  Other                        3 (2.8)      6 (5.5)
  Total                        108          109
Power
  < 80%                        1 (0.9)      1 (0.9)
  80%                          59 (52.2)    63 (55.8)
  81–84%                       3 (2.7)      1 (0.9)
  85%                          1 (0.9)      4 (3.5)
  86–89%                       2 (1.8)      1 (0.9)
  90%                          42 (37.1)    40 (35.3)
  > 90%                        5 (4.4)      3 (2.7)
  Total                        113          113
Did they consider dropout?
  Yes                          72 (61.5)    58 (48.3)
  No                           12 (10.3)    27 (22.5)
  Not clear                    1 (0.8)      0
  No information available     37 (31.6)    35 (29.2)
  Total                        122          120

TABLE 39

Graphical presentation of data in HTA monographs compared with reports reviewed in the study by Pocock et al.95

Description of data                         HTA monographs, n (%)   Pocock et al.,95 n (%)
What type of figures was used to illustrate results?
  Kaplan–Meier plot                         25 (20.0)               32 (41.6)
  Repeated measures plot                    48 (38.4)               20 (26.0)
  Forest plot                               21 (16.8)               21 (27.3)
  None of the above                         31 (24.9)               N/A
Total                                       125                     73

N/A, not applicable.

Boxes

BOX 6

The actual research questions answered under this theme

Did the protocol specify the planned method of analysis for the primary outcome in sufficient detail? In relation to:

  • T4.1. how many specified a method of analysis for the primary outcome.
  • T4.2. whether or not this improved over time.
  • T4.3. statistical test applied.
  • T4.4. significance level.
  • T4.5. hypothesis testing.
  • T4.6. adjustment for covariates.
  • T4.7. analysis population.
  • T4.8. adjustment for multiple testing.
  • T4.9. missing data.
  • T4.10. sufficient detail including all of the above seven elements recorded in the protocol.

Was the analysis planned in the proposal/protocol for the primary outcome carried out? In relation to:

  • T4.11. statistical test/model used.
  • T4.12. significance level.
  • T4.13. analysis population.
  • T4.14. missing data.
  • T4.15. covariates adjusted for in the analysis.

How was the sample size estimated (power, confidence intervals, etc.)?

  • T4.16. Was sufficient information on the sample size calculation provided?
  • T4.17. Does the sample size calculation in the protocol match the sample size calculation shown in the monograph? What discrepancies were found?
  • T4.18. What values of alpha, power and dropout were used in the sample size calculation?
  • T4.19. Other information: what graphical presentation of data was reported in HTA trials?

BOX 7

Examples of discrepancies between statistical test/model planned in the protocol/proposal and used in the monograph

  1. In three trials (ID131, ID132 and ID133) reported in one monograph, the authors stated in the protocol that they would analyse the primary outcome score data using logistic regression. They actually analysed the continuous score data using ordinal regression (which they classified as linear regression).
  2. Trial ID65 planned in the proposal to analyse the second primary outcome as follows: ‘Six month follow up data, relapse rates will be analysed by comparing relapse rates between the groups by survival analysis using cox’s regression controlling for baseline depression, age, sex and centre.’ They actually compared the percentage relapsing in each group at the end of treatment using Fisher’s exact test, which yielded a significant result (p < 0.005).
Copyright © Queen’s Printer and Controller of HMSO 2015. This work was produced by Raftery et al. under the terms of a commissioning contract issued by the Secretary of State for Health. This issue may be freely reproduced for the purposes of private research and study and extracts (or indeed, the full report) may be included in professional journals provided that suitable acknowledgement is made and the reproduction is not associated with any form of advertising. Applications for commercial reproduction should be addressed to: NIHR Journals Library, National Institute for Health Research, Evaluation, Trials and Studies Coordinating Centre, Alpha House, University of Southampton Science Park, Southampton SO16 7NS, UK.

