Included under terms of UK Non-commercial Government License.
NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.
Raftery J, Young A, Stanton L, et al. Clinical trial metadata: defining and extracting metadata on the design, conduct, results and costs of 125 randomised clinical trials funded by the National Institute for Health Research Health Technology Assessment programme. Southampton (UK): NIHR Journals Library; 2015 Feb. (Health Technology Assessment, No. 19.11.)
This chapter considers questions surrounding the appropriateness of the statistical analyses. After a brief review of the relevant literature, 19 questions were explored. The results are summarised and discussed.
Introduction
Outcome reporting bias has been widely reported.10,11,27,80–87 However, only a few papers have reported on whether or not researchers adequately specify planned analyses in the protocol and, subsequently, whether or not they follow the prespecified analysis.88,89 This matters because failure to follow the prespecified analysis can result in bias. One study suggested that protocols were not sufficiently precise to identify deviations from planned analyses.89 Others reviewed whether or not sample size calculations were adequately specified.88–90 Another recently questioned whether or not the current method of sample size calculations was appropriate.91 These are summarised below.
The primary outcomes in protocols were compared with those in published reports for 102 trials approved by the scientific ethics committees for Copenhagen and Frederiksberg, Denmark, between 1994 and 1995.10 Selective reporting was revealed, with 62% of trials reviewed having at least one primary outcome added, omitted or changed.
A similar review of 48 trials funded by the Canadian Institutes for Health Research81 found that in 33% of trials, the outcome listed as primary in the publication differed from that in the protocol. They also found that outcome results were incompletely reported.
A pilot study conducted in 2000 reviewed 15 applications received by a single local research ethics committee in the 1990s and compared the outcomes, analysis and sample size in the protocol with that presented in the final study report.89 The authors found that six protocols (40%) stated the primary outcome and, of these, four (67%) matched that in the published report. Eight mentioned an analysis plan but only one (12%) followed its prescribed plan. The study concluded that selective reporting may be substantial but that bias could only be broadly identified as protocols were not sufficiently precise.
In 2008, Chan et al.88 compared the statistical analysis and sample size calculations specified in the protocol with those specified in the published paper. They found evidence of discrepancies in the sample size calculations (18/34 trials), the methods of handling protocol deviations (18/34 trials), methods of handling missing data (39/49 trials), primary outcome analyses (25/42 trials), subgroup analyses (25/25 trials) and adjusted analyses (23/28 trials). These discrepancies could affect the reliability of results, introduce bias and indicate selective reporting. They concluded that the reliability of trial reports cannot be assessed without access to the protocol.
A 2008 comparison of the sample size calculation specified in the protocol with that in the publication found that only 11 of the 62 trials reviewed adequately described the sample size calculation in both the protocol and published report.88
Charles et al.,90 in a review of the reporting of sample size calculations in 215 trials published between 2005 and 2006, found that 43% did not report all the required sample size calculation parameters.
A study of 18 trials that reported on traumatic brain injury reviewed the covariates adjusted for and the subgroup analyses performed.92 Protocols could be obtained for 6 of the 18 trials; all six reported subgroup effects that differed from those specified in their protocols.
In collaboration with journal editors, triallists, methodologists and ethicists, Chan et al.93,94 have launched the Standard Protocol Items for Randomised Trials (SPIRIT) initiative to establish evidence-based recommendations for the key content of trial protocols.
The above studies may not reflect current practice because either the number of trials reviewed was small or the studies reviewed were relatively old (1994–5 for Chan et al.88 and similar for Hahn et al.89). Practice may have improved since, following the introduction of CONSORT and other guidelines.
Our objective was to repeat these analyses on the cohort of all HTA published RCTs, assessing the extent of these discrepancies and whether or not they improved over time.
Questions addressed
The aim was to review the appropriateness of the statistical analyses for all published HTA clinical trials, including the sufficiency of the proposed statistical plan, handling of missing data and whether or not there were discrepancies between what was proposed and what was actually reported in the published monograph.
The questions posed (Box 6) fall under the following six subheadings:
- Did the protocol specify the planned method of analysis for the primary outcome in sufficient detail?
- Was the analysis planned in the proposal/protocol for the primary outcome carried out?
- How was the sample size estimated?
- How adequate was the reporting of planned and actual subgroup analysis?
- Other information: what graphical presentation of data was reported in HTA trials?
- Were conclusions justified given the analysis results?
Methods
Nineteen questions were piloted as shown in Box 6. Four questions were considered but not proceeded with, regarding:
- the number of statistical tests and number of primary statistical tests
- whether or not authors measured more outcomes than they reported
- adequate reporting of subgroup analyses
- whether or not the conclusions were justified given the analysis results.
Difficulties arose with each of these questions. Firstly, results were not presented in a standard format in the monographs. Secondly, as the monographs were lengthy, data extraction meant searching and reading through many pages. Thirdly, as the HTA trials are pragmatic, they include a large number of outcomes measured at multiple time points, which increased the number of tables and the amount of text to be reviewed. Fourthly, extracting information on the subgroup analyses planned and carried out was difficult because authors seldom labelled analyses as subgroup analyses. Lastly, we found it difficult to specify data that could answer the question of whether or not the conclusions were justified by the analyses.
For the 19 questions explored, the methods used in the literature reviewed above were used as a framework to detail the questions. For example, the paper by Chan et al.88 provided the key components of data that needed to be extracted on the sample size calculation. Data on these components were expanded to include other types of outcome measures and study designs (e.g. time-to-event data, non-inferiority and cluster randomised trials). We extracted these data from the protocol or project proposal (if a protocol was not available) and monograph, and analysed the data in a similar way.
Denominators
All trials were included (n = 125). The unit of analysis for questions T4.1–T4.15 was each trial’s primary outcome with complete analysis (n = 164 planned and n = 161 reported). The unit of analysis for T4.16–T4.18 was the individual trial.
Results
Questions T4.1–T4.10: did the protocol specify the planned method of analysis for the primary outcome in sufficient detail?
Question T4.1: how many specified a method of analysis for the primary outcome?
The 125 trials included 206 planned primary outcomes and reported on 232 primary outcomes. Of these, 164 and 161, respectively, were ‘complete for analysis’ (these are the denominators for questions T4.1–T4.10).
The method of analysis was prespecified for 111 out of 164 planned primary outcomes (67.7%), with little difference between those that did and did not have protocols (65.9%, 54/82 from the proposal and 69.5%, 57/82 from the protocol) (Table 32).
Question T4.2: has this improved over time?
There is a slight indication that the specification of the primary outcome analyses has improved over time. This could be due to the increasing number of protocols available (Table 33 and Figure 6) but the low numbers preclude strong conclusions.
Question T4.3: statistical test applied
Of the 111 planned primary outcomes with a prespecified method, the proposed statistical test/choice of model was described in 107 (96.4%). The most frequently reported planned methods of analysis were logistic regression (23.4%, 26/111) and analysis of covariance (ANCOVA)/linear regression (17.1%, 19/111), followed by t-test (14.4%, 16/111) (Table 34).
Question T4.4: significance level
Of the 111 primary outcomes with a specified method of analysis, the significance level/confidence interval level to be used was specified in 39 (35.1%). Table 34 shows that the most commonly used level of statistical significance was 5%.
Question T4.5: hypothesis testing
The majority did not specify whether one-sided or two-sided analysis would be performed (87.4%, 97/111) (see Table 34).
Question T4.6: adjustment for covariates
Sixty-eight of the 111 (61.3%) planned primary outcomes specified the covariates that they planned to adjust for in the final analysis.
Question T4.7: analysis population
The planned population for the primary analysis was not specified by 41.4% (46/111). This appears to have improved over time (apart from anomalies in 1998 and 2003), with a marked increase in 1996, the year in which CONSORT was published.
Question T4.8: adjustment for multiple testing
Almost all studies failed to specify a method of adjustment for multiple testing (93.7%, 104/111). As HTA trials are pragmatic rather than licensing trials, examining a range of outcomes over short- and long-term periods, adjustment for multiple testing may matter less than transparency.
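Where adjustment was specified, the commonest method was the Bonferroni correction (see Table 34). A minimal sketch, purely illustrative and not drawn from any trial reviewed here:

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: test each of m hypotheses at alpha/m,
    which controls the family-wise error rate at alpha."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# Three primary comparisons tested at 0.05/3 ≈ 0.0167:
bonferroni([0.001, 0.02, 0.04])  # → [True, False, False]
```

The correction is conservative, which is one reason pragmatic trials reporting many outcomes may prefer transparent unadjusted reporting.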
Question T4.9: missing data
Most studies did not specify a method for handling missing data (74.8%, 83/111). Of those that did, the methods used varied (see Table 34).
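To make one of the single imputation methods tabulated in Table 34 concrete, here is a minimal sketch of last observation carried forward (LOCF) for one participant's repeated measurements; the function is illustrative and not drawn from any of the trials reviewed.

```python
def locf(values):
    """Last observation carried forward: fill each missing value (None)
    with the most recent observed value in one participant's series.
    Leading missing values stay None, as there is nothing to carry."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

# A participant measured at four visits, missing visits 2 and 4:
locf([3, None, 5, None])  # → [3, 3, 5, 5]
```

LOCF is simple but can bias treatment estimates; Table 34 shows that monographs also reported model-based approaches such as multiple imputation and mixed models.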
Question T4.10: is sufficient detail including all of the above seven elements recorded in the protocol?
The number of protocols meeting all seven criteria was low, at 1.8% (2/111). When we limited the criteria to three (model/test, significance level and analysis population), of the 111 primary outcomes for which a method of analysis was specified in the protocol/proposal, 30 primary outcomes qualified (27%, 30/111). This increased slightly over time, from 22.7% before 1998 to 35.6% after.
Questions T4.11–T4.15: was the analysis planned in the protocol/proposal for the primary outcome carried out?
Question T4.11: statistical test/model used
Of the 82 trials whose primary outcome was as planned, the authors changed the planned method of statistical analysis (model/test) in 20 (24.4%). Some changed to more complex methods (a t-test was changed to linear regression in five instances) and others to simpler ones (in three instances: a chi-squared test instead of logistic regression, linear regression instead of a mixed model, and Fisher’s exact test instead of Cox proportional hazards). These could be legitimate changes or selective reporting depending on the results, something we did not explore (examples are given in Box 7).
Question T4.12: significance level
All but six trials used the 5% significance level. Of the six discrepancies between the significance level stated in the protocol/proposal and that used in the monograph, one led to an increase in the significance level used, but this appears to be an error (trial ID42’s protocol stated that ‘Differences will be judged significant at the 2.5% level to take account of two primary comparisons being drawn’; the monograph stated that 95% confidence intervals would be calculated, although a 2.5% significance level was stated in the sample size calculation in the protocol).
Question T4.13: analysis population
Of those trials that stated the planned analysis population for the primary outcome analysis in the protocol/proposal, 90% (56/62) followed the plan. Most carried out what they described as an ‘intention-to-treat’ analysis. In two trials, the triallists stated in the protocol/proposal that they would carry out both an intention-to-treat and per-protocol analysis but reported only on the per-protocol analysis. Both of these were from trial ID109, where ‘The data were analysed per protocol. As planned, no intention-to-treat analyses were conducted, as < 10% of subjects would have been classified differently in such an analysis.’ Therefore, this change of analysis population was justified because the authors had specified in the protocol a rule which was used to decide which population to use.
Question T4.14: missing data
Of the 28 trials for which a method of handling missing data was specified in the protocol, the method used was different in 12 (42.9%).
Question T4.15: covariates adjusted for in the analysis
Sixty-eight of the 111 trials (61.3%) outlined their planned analysis of covariates and for 31 (27.9%) it was unclear (Table 35). Some trials did not specify in the protocol which covariates they would adjust for or, if they did, failed to specify exactly which, for example ‘adjusting for baseline variables’ or ‘taking into account any statistically important imbalances’. This made it difficult to compare planned covariates with those actually adjusted for in many trials.
In summary, the analyses planned in the proposal/protocol for the primary outcome were carried out in 62 of the 82 trials (76%) and changed in 20 (24%) (considering statistical test/model only). The method of handling missing data specified in the protocol/proposal did not match what was carried out 43% of the time. The analysis population and significance level changed 10% of the time in trials. More detailed examination suggests that some of the changes were legitimate. Without knowing whether or not a statistical analysis plan was drawn up before the analysis and subsequently followed, one cannot conclude that these changes represent departures from proper practice.
Questions T4.16–T4.18: how was the sample size estimated (power, confidence interval, etc.)?
We followed the methods and tables used by Chan et al.,88 expanded to incorporate the different types of sample size calculation observed in the HTA trials (e.g. width of confidence interval calculations, time-to-event data, standardised effect size, non-inferiority, equivalence).
Question T4.16: was sufficient information on the sample size calculation provided?
The results of classifying the trials by the five components suggested by Chan et al.88 are shown in Table 36. Of the 125 trials, 75 proposals/protocols (60%) and 66 monographs (52.8%) reported all the required sample size components. Individual components were reported in 60.7–100% of proposals/protocols and 49.6–100% of monographs. The required sample size was reported in the proposal/protocol in 93% of trials (116/125), in the monograph in 90% (112/125) and in both in 89%. The result from the sample size calculation was presented in the proposal/protocol in 57% of trials (111/125), in the monograph in 46% (58/125) and in both in 42% (e.g. the sample size calculation showed that the trial would have to recruit 326 participants; taking account of the participant dropout rate increased the number needed per arm to 350). Forty-two per cent of trials (52/125) reported all the required components of the sample size calculation in both the proposal/protocol and monograph.
Question T4.17: does the sample size calculation in the protocol match the sample size calculation shown in the monograph? What discrepancies were found?
Of the 117 trials reporting a sample size calculation in both the proposal/protocol and the monograph, we observed discrepancies between that planned and that reported in 45 trials (38.5%). A component of the sample size calculation was reported in the monograph but not in the protocol/proposal in 18 trials. In 39 trials, there was a discrepancy in at least one component reported in both the protocol/proposal and the monograph. These discrepancies were not acknowledged in the monograph. Where a discrepancy was observed between the number of patients the trial planned to recruit and the number actually recruited, this was twice as likely to be because the number specified in the monograph was smaller than that in the protocol/proposal than vice versa (19 trials vs. 10 trials). Where a discrepancy existed in the minimum clinically important effect size, this was also almost twice as likely to be due to the effect size being reported as larger in the monograph than in the protocol (Table 37). These discrepancies could be due to reductions in the planned sample size after the study started that were not reported in the monograph, or to attempts to justify the smaller number of patients actually recruited.
Question T4.18: what values of alpha, power and dropout were used in the sample size calculation?
In the proposal/protocol, a 5% significance level was used in the sample size calculation 94.4% of the time (102/108). Eighty per cent power was specified in half of the protocols (52.2%, 59/113) and 90% power was specified in over one-third (37.1%, 42/113). The triallists inflated the sample size for participant loss to follow-up 61.5% of the time (72/122) in the protocol/proposal and 48.3% of the time (58/120) in the monograph (Table 38).
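The interplay of these parameters can be sketched with a conventional two-arm comparison of means. This is a generic textbook formula, not the calculation used by any specific HTA trial, and the function name and defaults are illustrative.

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(delta, sd, alpha=0.05, power=0.80, dropout=0.0):
    """Approximate patients per arm for a two-sided comparison of two
    means (normal approximation), inflated for expected loss to follow-up."""
    z = NormalDist().inv_cdf
    n = 2 * ((z(1 - alpha / 2) + z(power)) * sd / delta) ** 2
    # Inflate so that roughly n patients per arm remain after dropout.
    return ceil(n / (1 - dropout))

# Detect a 0.5 SD difference at the 5% significance level with 80% power:
n_per_arm(delta=0.5, sd=1.0)               # → 63 per arm
n_per_arm(delta=0.5, sd=1.0, dropout=0.1)  # → 70 per arm after inflation
```

Reporting the target difference, SD, alpha, power and the dropout assumption, as Chan et al.88 recommend, would let another statistician reproduce the final figure exactly.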
Question T4.19: other information – what graphical presentation of data was reported in Health Technology Assessment trials?
We reviewed each HTA monograph and assessed whether it included a repeated measures plot, a Kaplan–Meier plot or a forest plot, as these were the top reported figures in Pocock et al.95 (accounting for 92% of figures published in the 77 RCT reports that they reviewed in five general medical journals). A repeated measures plot was presented in the HTA monograph for 38.4% of the trials (48/125), followed in frequency by a Kaplan–Meier plot (20%, 25/125) and a forest plot (16.8%, 21/125) (Table 39). A repeated measures plot was observed more frequently in the HTA monographs than in Pocock et al.’s95 sample, and a Kaplan–Meier plot less often. This could be due to differences in the types of trials reviewed, with HTA trials more likely to involve a longer follow-up at multiple time points and less likely to include survival outcomes.
Analysis
The planned method of analysis for the primary outcome was not specified in the protocol/proposal in one-third of the 125 trials. Of those that specified a method of analysis, only two (1.8%) fully specified the method of analysis using the seven core criteria. Twenty-seven per cent met three criteria (statistical test/model, significance level and analysis population). Improvements occurred over time, from 22.7% before 1998 to 35.6% thereafter. There did not appear to be differences in the level of detail reported in the protocol compared with the proposal, but this could be due to small numbers or confounding (with the year the commissioning brief was advertised).
Out of the 125 trials reviewed, only 52 (41.6%) reported all the required components of the sample size calculation in both the proposal/protocol and monograph. Of these, the information in the proposal/protocol matched the information in the monograph in only 43 trials (34%) (see Tables 36 and 37). Where discrepancies were observed, they were twice as likely to indicate a smaller sample size planned in the monograph than stated as planned in the protocol.
Discussion
We were able to extract data to answer a number of questions on the planned and actual method of statistical analysis and sample size calculation. The degree to which this study was successful varied by the three broad sets of questions:
- Questions T4.1–T4.10: did the protocol specify the planned method of analysis for the primary outcome in sufficient detail? The study indicated that this set of questions could be answered and indicated some cause for concern as around one-third of trials provided insufficient detail, particularly on planned statistical analysis.
- Questions T4.11–T4.15: was the analysis planned in the proposal/protocol for the primary outcome carried out? We showed that it was difficult to complete this set of questions owing to lack of data.
- Questions T4.16 and T4.17: was sufficient information on the sample size calculation provided? And does the sample size calculation in the protocol match the sample size calculation shown in the monograph? What discrepancies were found? The study showed that it was difficult to complete this set of questions owing to lack of data.
One general finding from this study relates to the limitation of retrospective analysis. Standards changed over time. We were unable to discuss details with those responsible for the analyses in the trials. In particular, we had no way of knowing if statistical analysis plans had been drawn up separately from the protocol. We understand that such plans are common practice but often not until the trial is close to completion. The key issue is that such plans are specified in advance before the data are examined. We have no way of knowing if this happened.
This is the first study we are aware of that has reviewed whether or not the method of statistical analysis was recorded in sufficient detail in the protocol, as defined by a minimum set of criteria.
Sample size calculation is a vitally important aspect of any clinical trial to ensure that the number of patients included is large enough to answer the question reliably and as few patients as possible are exposed to a potentially inferior treatment. It is important that all parameters used in the sample size calculation(s) are clearly and accurately reported in both the grant proposal/protocol and final trial publication. The level of detail reported should enable another statistician to replicate the sample size calculation if necessary. The sample size calculation reported in the final trial protocol and final publication should match and any changes to the sample size that were made after the trial had started should be reported.
We found that sample size calculation information was not being recorded in sufficient detail in both the protocol and publication for RCTs. Where the information was recorded, the level of unexplained discrepancies was surprisingly high. Changes to the sample size calculation after a trial has started are allowed for much the same reasons as listed in relation to changes to the statistical analysis plan [e.g. advances in knowledge, change in trial design or better understanding of the trial data (SD or control group event rate)], but should be minimised as much as possible.
We observed fewer discrepancies than other studies with regard to the method of statistical analyses and whether or not the authors followed the protocol or the sample size calculations. The discrepancies observed could be legitimate changes not reported in the monograph; unacknowledged reductions in the sample size made after the trial started because of recruitment problems (reported in Chapter 6); evidence of selective reporting bias making the results appear more clinically meaningful than they were (e.g. by increasing the clinically meaningful difference specified); or typographical mistakes. Given the large number of trials that failed to recruit the original planned sample size, as reported in Chapter 6, we think the first two of these explanations are the most likely.
Questions T4.11–T4.15 explored potential selective reporting. We found that potential selective reporting bias in sample size calculation information and in methods of analysis exists. This is perhaps not so serious, as a previous review of a subset of the RCTs in this cohort found that only 24% of primary outcome results were statistically significant.5 If there was selective reporting bias we might expect this percentage to be higher.
Chan et al.88 found that the statistical test for primary outcome measures differed between the protocol and publication in 60% of trials; we found a smaller percentage (25%) in our cohort. This could be because we had access to the final version of the protocol whereas Chan et al.88 had access to the protocol submitted to an ethics committee. In addition, Chan et al.88 studied protocols from the 1994–5 period, before CONSORT had been developed (in 1996). Chan et al.88 observed that 32.6% of protocols described the planned method of handling missing data, higher than our finding of 25.2%.
Chan et al.88 found that 11 out of 62 trials (17.7%) fully and consistently reported all of the requisite components of the sample size calculation in both the protocol and publication. The corresponding figure in our sample was 34%; this is twice as large as in Chan et al.88 but is still much lower than expected.
We found a similar proportion of trials reporting all the required sample size calculation parameters as Charles et al.,90 who found that 57% of 206 trials did so; the corresponding figure in our cohort was 56.4%.
The figures in the paper by Hahn et al.89 are similar to ours, although their studies were few and dated.
Strengths and weaknesses of the study
The biggest strength of this study was that we had access to a protocol/proposal for all the trials. This is the largest cohort study that we are aware of to have compared the method of analysis and sample size calculation planned in the protocol with that reported in a publication. This is also the first such study of UK-funded RCTs. Further, previous studies comparing protocols with publications may not reflect current practice because either the number of trials reviewed was small or the studies reviewed were relatively old (1994–5 for Chan et al.88 and similar for Hahn et al.89).
A limitation of our work was that we only analysed the first sample size calculation reported and compared that with the monograph.
We were surprised at the lack of detail in statistical analysis plans reported in the protocol/proposal and how few met our criteria. However, as statisticians often create statistical analysis plans separate from the protocol prior to final analysis, these may well provide more detail.
Key questions for the HTA programme concern whether or not it requires audit of planned analyses and, if so, how and at what level of detail. Our study shows the limits of retrospective audit based on the protocol/application form and the monograph. More generally, the HTA programme should consider requiring information to be recorded on the statistical test/model planned for use in the analysis, the significance level/confidence interval level to be used and the analysis population.
Recommendations for future work
Should the database be continued, we recommend that the questions on statistical analysis are reviewed alongside the SPIRIT checklist.94 Any further data extraction should include 13 questions: four should remain as they are (T4.1, T4.2, T4.18 and T4.19) and nine should be amended (T4.3, T4.4, T4.7, T4.10, T4.11, T4.12, T4.13, T4.16 and T4.17).
We observed that if trials funded by the HTA programme are to continue to qualify as one of the four cohorts of trials included in Djulbegovic et al.,96 then data will have to be extracted on the relevant fields.5 Dent and Raftery5 assessed treatment success and whether or not the results were compatible with equipoise using six categories: (1) statistically significant in favour of the new treatment; (2) statistically significant in favour of the control treatment; (3) true negative; (4) truly inconclusive; (5) inconclusive in favour of the new treatment; or (6) inconclusive in favour of the control treatment. Trials were classified by comparing the 95% confidence interval for the difference in primary outcome with the difference specified in the sample size calculation. The recent Cochrane Review used data extracted for this project and combined them with the only three other similar cohorts.96
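The six-category classification can be expressed as a simple rule over the 95% confidence interval. This sketch is one plausible reading of Dent and Raftery’s5 scheme, assuming differences are coded so that positive values favour the new treatment and `delta` is the target difference from the sample size calculation; the exact boundary conventions are an assumption, not taken verbatim from their paper.

```python
def classify(ci_low, ci_high, delta):
    """Classify a trial result from the 95% CI for the difference in
    primary outcome and the target difference delta (> 0 favours new)."""
    if ci_low > 0:
        return "statistically significant, favours new treatment"
    if ci_high < 0:
        return "statistically significant, favours control"
    # CI straddles zero: compare against the target difference delta.
    if ci_high < delta and ci_low > -delta:
        return "true negative"  # target effect ruled out in both directions
    if ci_high >= delta and ci_low > -delta:
        return "inconclusive, favours new treatment"
    if ci_low <= -delta and ci_high < delta:
        return "inconclusive, favours control"
    return "truly inconclusive"  # CI too wide to rule anything out

classify(-0.2, 0.3, 1.0)  # → 'true negative'
classify(-0.2, 1.5, 1.0)  # → 'inconclusive, favours new treatment'
```

The appeal of this scheme is that it grades inconclusive results by what the confidence interval can still rule out, rather than collapsing everything non-significant into one category.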
Unanswered questions and future research
We analysed whether or not the planned analyses were carried out. We did not attempt to investigate whether or not the planned analyses were appropriate.
We compared individual components of the planned method of analysis with individual components reported in the monograph but did not calculate how often all of the components of the analysis plan matched those presented in the monograph. Again, this could be the subject of further work.
Small numbers constrained our analysis of trends in time. If the work continues, time trend analyses should be repeated and extended.
Further work could explore whether or not the amount of detail provided in the protocol on planned analyses is affected by the seniority of the statistician involved, including whether or not he or she was a co-applicant.
Figures
FIGURE 6
Tables
TABLE 32
| Planned primary analysis | Protocol available: yes, n (%) | Protocol available: no, n (%) | Total, n (%) |
|---|---|---|---|
| Yes | 57 (69.5) | 54 (65.9) | 111 (67.7) |
| No | 14 (17.1) | 13 (15.9) | 27 (16.5) |
| Not clear | 1 (1.2) | 1 (1.2) | 2 (1.2) |
| Not applicable | 1 (1.2) | 0 | 1 (0.6) |
| No information available | 9 (11.0) | 14 (17.0) | 23 (14.0) |
| Total number of primary outcomes | 82 (100.0) | 82 (100.0) | 164 (100.0) |
| Total number of trials | 65 (52.0) | 60 (48.0) | 125 (100.0) |
TABLE 33
| Year of commissioning brief | Yes, n (%) | No,a n (%) | Not clear, n (%) | Total, n (%) |
|---|---|---|---|---|
| 1993 | 12 (70.6) | 5 (29.4) | 0 | 17 (100.0) |
| 1994 | 13 (52.0) | 11 (44.0) | 1 (4.0) | 25 (100.0) |
| 1995 | 19 (73.0) | 7 (27.0) | 0 | 26 (100.0) |
| 1996 | 16 (55.2) | 13 (44.8) | 0 | 29 (100.0) |
| 1997 | 6 (50.0) | 5 (42.0) | 1 (8.0) | 12 (100.0) |
| 1998 | 3 (75.0) | 1 (25.0) | 0 | 4 (100.0) |
| 1999 | 10 (83.3) | 2 (16.7) | 0 | 12 (100.0) |
| 2001 | 20 (87.0) | 3 (13.0) | 0 | 23 (100.0) |
| 2002 | 3 (75.0) | 1 (25.0) | 0 | 4 (100.0) |
| 2003 | 6 (75.0) | 2 (25.0) | 0 | 8 (100.0) |
| 2005b | 1 (100.0) | 0 | 0 | 1 (100.0) |
| 2009b | 2 (66.7) | 1 (33.3) | 0 | 3 (100.0) |
| Total | 111 (67.8) | 51 (33.5) | 2 (1.2) | 164 (100.0) |

a The categories ‘no information available’ and ‘no’ were merged as they were essentially the same.

b There were no trials included in the database with commissioning briefs advertised in 2006, 2007 and 2008 because all of the trials advertised by the HTA programme had yet to publish their results at the time the metadata database was closed in July 2011. The two trials with a commissioning brief advertised in 2009 were conducted as a result of the flu call, which had to report within a short time frame.
TABLE 34
| Description of planned statistical analyses | Planned from protocol/proposal, n (%) | Reported in the monograph, n (%) |
|---|---|---|
| **Planned statistical test** | | |
| t-test | 16 (14.4) | 14 (9.4) |
| Chi-squared test | 8 (7.2) | 20 (13.4) |
| ANOVA | 6 (5.4) | 0 |
| ANCOVA/linear regression | 19 (17.1) | 48 (32.2) |
| Logistic regression | 26 (23.4) | 21 (14.1) |
| Mixed model | 5 (4.5) | 18 (12.1) |
| Poisson regression | 3 (2.7) | 2 (1.3) |
| Cox proportional hazards | 7 (6.3) | 8 (5.4) |
| Log-rank test | 1 (0.9) | 4 (2.7) |
| Mann–Whitney | 1 (0.9) | 1 (0.7) |
| Non-parametric analyses | 1 (0.9) | 0 |
| Confidence interval | 11 (9.9) | 9 (6.0) |
| Other | 3 (2.7) | 4 (2.7) |
| Not specified | 4 (3.6) | 0 |
| Total | 111 (100.0) | 149 (100.0) |
| **Significance level** | | |
| 1% | 0 | 13 (8.7) |
| 2.5% | 2 (1.8) | 1 (0.7) |
| 5% | 17 (15.3) | 22 (14.8) |
| 95% confidence interval specified | 20 (18.0) | 47 (31.5) |
| Not specified | 72 (64.9) | 66 (44.3) |
| Total | 111 (100.0) | 149 (100.0) |
| **Hypothesis testing** | | |
| One-sided | 3 (2.7) | 3 (2.0) |
| Two-sided | 11 (9.9) | 28 (18.8) |
| Not specified | 97 (87.4) | 118 (79.2) |
| Total | 111 (100.0) | 149 (100.0) |
| **Planned covariates to adjust for** | | |
| Yes | 68 (61.3) | 0 |
| No | 9 (8.1) | 0 |
| Not clear | 3 (2.7) | 0 |
| No information available | 31 (27.9) | 0 |
| Total | 111 (100.0) | 0 |
| **Analysis population** | | |
| ITT analysis | 60 (55.0) | 117 (78.5) |
| PP analysis | 0 | 3 (2.0) |
| AT analysis | 0 | 0 |
| ITT and PP analysis | 5 (3.6) | 14 (9.4) |
| ITT and AT analysis | 0 | 0 |
| PP and AT analysis | 0 | 0 |
| No available information | 46 (41.4) | 15 (10.1) |
| Total | 111 (100.0) | 149 (100.0) |
| **Adjustment for multiple comparisons** | | |
| Bonferroni correction | 4 (3.6) | 8 (5.4) |
| Bonferroni–Dunn | 1 (0.9) | 3 (2.0) |
| Other | 2 (1.8) | 5 (3.4) |
| None specified | 104 (93.7) | 133 (89.3) |
| Total | 111 (100.0) | 149 (100.0) |
| **Method of handling missing data** | | |
| Complete case analysis | 3 (2.7) | 20 (13.4) |
| LOCF – single imputation method | 4 (3.6) | 14 (9.4) |
| WCI – single imputation method | 0 | 2 (1.3) |
| HDI – single imputation method | 0 | 1 (0.7) |
| RM – single imputation method | 0 | 5 (3.4) |
| Multiple imputation | 0 | 7 (4.7) |
| Mixed model | 3 (2.7) | 6 (4.0) |
| Generalised estimating equation | 1 (0.9) | 1 (0.7) |
| Survival analysis | 7 (6.3) | 11 (7.4) |
| Mean – single imputation method | 1 (0.9) | 3 (2.0) |
| More than one method was used to deal with missing data | 2 (1.8) | 8 (5.4) |
| Sensitivity analysis | 7 (6.3) | 9 (6.0) |
| None/no available information | 83 (74.8) | 62 (41.6) |
| Total | 111 (100.0) | 149 (100.0) |
ANOVA, analysis of variance; AT, as treated; HDI, hot deck imputation; ITT, intention to treat; LOCF, last observation carried forward; PP, per protocol; RM, regression methods; WCI, worst-case imputation.
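Table 34 shows that an adjustment for multiple comparisons was specified in only a small minority of trials, most often a Bonferroni correction. As a generic illustration of that correction (not code from the report; the function name is ours), each p-value is multiplied by the number of tests, capped at 1, before being compared with the nominal significance level:

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction for multiple comparisons.

    Multiplies each p-value by the number of tests (capped at 1.0)
    and flags which adjusted p-values remain below the nominal alpha.
    Returns (adjusted p-values, rejection decisions).
    """
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    reject = [p_adj <= alpha for p_adj in adjusted]
    return adjusted, reject


# Example: three tests at a nominal 5% level
adj, rej = bonferroni([0.01, 0.04, 0.30])
# adjusted ~ [0.03, 0.12, 0.90]; reject = [True, False, False]
```

The correction controls the family-wise error rate at the cost of power, which is one reason trialists often prefer a single prespecified primary outcome to formal multiplicity adjustment.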
TABLE 35
Trial ID | Covariates which trial planned to adjust for | Actual covariates adjusted for |
---|---|---|
65 | Controlling for baseline HRSD, treatment centre, age and sex. Duration of index depressive episode, degree of treatment resistance, psychosis, antidepressant medication equivalents and cognitive impairment | Prerandomisation baseline HRSD scores were included as a covariate, as were NHS trusts to adjust for centre effects |
59 | No information | Adjusted for age, sex, surgical status, major presumptive clinical syndrome, SOFA score at time of randomisation and APACHE II score at ICU admission |
74 | Adjusting for group differences at baseline if necessary | With baseline HADS depression score and stratification categories (urban/rural location; horizontal/vertical kinship) as covariates |
78 | Age, sex, time to treatment and stroke type. Presence or absence of dysphagia | Time to treatment |
86 | Individual-level covariates, e.g. age of mother, parity, and health visitor confounders such as age | After adjusting for covariates such as 6-week EPDS score, living alone, previous history of PND and any life events experienced |
90 | Severity at initial presentation, age and sex | None specified |
94 | Two stratification variables – centre and size of ulcer – were to be adjusted for in the analyses, as were ulcer type, duration of episodes, weight of patient, ankle mobility and a binary variable for the presence/absence of infection at baseline. Authors were to present an unadjusted analysis, but the adjusted analysis would have primacy | A Cox proportional hazards model was used to adjust the analysis for the randomisation stratification factors (centre, baseline ulcer area), as well as duration and ulcer type. Actual baseline area (as measured from the tracings) and duration of ulcer were used |
103 | Group, time, group by time, model using a linear trend over time and a quadratic trend if necessary (group by time interaction) | Adjusted for baseline HbA1c based on those who completed their 12-month HbA1c measurement |
APACHE II, Acute Physiology and Chronic Health Evaluation II; EPDS, Edinburgh Postnatal Depression Scale; HADS, Hospital Anxiety and Depression Scale; HbA1c, glycated haemoglobin; HRSD, Hamilton Rating Scale for Depression; ICU, intensive care unit; PND, postnatal depression; SOFA, Sequential Organ Failure Assessment.
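The adjusted analyses contrasted in Table 35 are typically analysis of covariance (ANCOVA) models: the outcome is regressed on a treatment indicator plus the planned covariates, and the treatment coefficient is the covariate-adjusted effect. A minimal pure-Python sketch with a single baseline covariate (an illustrative implementation, not the report's code; the function names are ours):

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def ancova(group, baseline, outcome):
    """Least-squares fit of: outcome ~ intercept + group + baseline.

    Returns (intercept, adjusted treatment effect, baseline slope),
    obtained by solving the normal equations (X'X)b = X'y.
    """
    X = [[1.0, g, x] for g, x in zip(group, baseline)]
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(3)] for a in range(3)]
    Xty = [sum(row[a] * y for row, y in zip(X, outcome)) for a in range(3)]
    return solve(XtX, Xty)
```

Adjusting for the baseline value of the outcome, as in trial 65's use of prerandomisation HRSD scores, both removes chance baseline imbalance and usually tightens the confidence interval around the treatment effect.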
TABLE 36
Number of trials reporting each component (n = 117)a
Component of sample size calculation | Protocol, n/N (%) | Monograph, n/N (%) | Both,b n/N (%) |
---|---|---|---|
1. Name of outcome measure | 113/117 (96.6) | 113/117 (96.6) | 111/117 (94.9) |
2. Alpha (type 1 error rate) | 108/117 (92.3) | 109/117 (93.2) | 104/117 (88.9) |
3. (a) Method of calculation: powerc | 113/116 (97.4) | 113/116 (97.4) | 110/115 (95.7) |
Continuous outcome | |||
Minimum clinically important effect size (delta)d and | 43/49 (87.8) | 46/58 (79.3) | 39/47 (83.0) |
SD for deltad or | 33/49 (67.3) | 32/58 (55.2) | 29/47 (61.7) |
Standardised effect size | 12/12 (100.0) | 7/7 (100.0) | 7/7 (100.0) |
Binary outcome | |||
Estimated event rate in each arme | 41/53 (77.4) | 37/48 (77.1) | 36/48 (75.0) |
Time-to-event outcome | |||
Time-to-event dataf | 2/2 (100.0) | 2/2 (100.0) | 2/2 (100.0) |
Type of outcome not specified | |||
No components for sample size calculation specified | N/A | 1/1 (100.0) | 1/1 (100.0) |
3. (b) Method of calculation: width of confidence interval | 1/1 (100.0) | 1/1 (100.0) | 1/1 (100.0) |
Binary outcome: event rate in each arm and precision/width of confidence interval required | 1/1 (100.0) | 1/1 (100.0) | 1/1 (100.0) |
Continuous outcome: SD and precision/width of confidence interval required | 0 | 0 | 0 |
4. Calculated sample size | |||
4. (a) Included result from sample size calculation on number required to recruit | 71/117 (60.7) | 58/117 (49.6) | 53/117 (45.3) |
4. (b) Presented total number of participants required to recruit | 116/117 (99.1) | 112/117 (95.7) | 111/117 (94.9) |
5. All components required | 75/117 (64.1) | 66/117 (56.4) | 52/117 (44.4) |
N/A, not applicable.
- a
The figures for both from 3(a) and 3(b) do not add up to 117 because 11 changed the type of primary outcome used from the protocol to the monograph, so sample size calculations could not be compared (five from binary to continuous, one from continuous to not specified, one from effect size to continuous).
- b
Including only trials that reported a sample size calculation in the protocol/proposal.
- c
Excluding trials that used the width of the confidence interval to estimate the sample size, a method that does not require power to be specified.
- d
For trials reporting a sample size calculation with a continuous outcome.
- e
For trials reporting a sample size calculation using a binary outcome.
- f
The components required for sample size calculations based on time-to-event data were either the proportion in each arm at a particular time point, the median survival in each group, or the median survival in one group together with the hazard ratio for the comparison.
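Components 1–4 of Table 36 (outcome, alpha, power, minimum clinically important difference and its SD) are exactly the inputs to the textbook two-arm sample size formula for a continuous outcome. As a sketch of that standard formula (not the programme's own calculation; the function name and defaults are ours):

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(delta, sd, alpha=0.05, power=0.80, dropout=0.0):
    """Sample size per arm for a two-arm trial, continuous outcome,
    two-sided test at level alpha:

        n = 2 * (z_{1 - alpha/2} + z_{power})**2 * (sd / delta)**2

    optionally inflated for the expected dropout rate.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    n = 2 * (z_alpha + z_beta) ** 2 * (sd / delta) ** 2
    if dropout:
        n /= 1 - dropout  # recruit extra to offset expected attrition
    return ceil(n)


# Minimum important difference of 5 points with SD 10 (effect size 0.5):
# n_per_arm(delta=5, sd=10) -> 63 per arm; allowing 20% dropout -> 79 per arm
```

The dropout inflation in the last step corresponds to the 'Did they consider dropout?' rows of Table 38; omitting it leaves the trial underpowered once attrition occurs.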
TABLE 37
Number of trials reporting each component (n = 117)
Component of sample size calculation | Total, n/N | Not prespecified,a n | Different from protocol description, n |
---|---|---|---|
1. Name of outcome measure | 18/113 | 2 | 16 |
2. Alpha (type 1 error rate) | 7/109 | 5 | 2 |
3. (a) Method of calculation: power | 18/113 | 3 | 15: nine larger in monograph; six larger in protocol/proposal |
Continuous outcome | |||
Minimum clinically important effect size (delta) and | 19/46 | 6/46 | 13: five larger in monograph, three larger in protocol/proposal and five not comparable as primary outcomes in protocol and monograph are different |
SD for delta or | 5/32 | 3/32 | 2: one larger in protocol and one not comparable as primary outcomes in protocol and monograph are different |
Standardised effect size | 1/7 | 0 | 1: one larger in protocol |
Binary outcome | |||
Estimated event rate in each arm | 4/37 | 1 | 3: in one values reported were higher in the monograph and one not comparable as primary outcomes in protocol and monograph are different |
Time-to-event outcome | |||
Time-to-event data | 0/2 | 0 | 0 |
Type of outcome not specified | |||
No components for sample size calculation specified in publication | 1 | 0 | 1: values specified in protocol for minimum difference aim to detect (delta), SD for delta, alpha and power |
3. (b) Method of calculation: width of confidence interval | 1 | ||
Binary outcome: event rate in each arm and precision/width of confidence interval required | 0/1 | 0 | 0 |
Continuous outcome: SD and precision/width of confidence interval required | 1/1 | 0 | Not comparable as primary outcome in protocol and monograph are different |
4. Calculated sample size | |||
4. (a) Included result from sample size calculation on number required to recruit | 20/58 | 5 | 15: 10 larger in protocol and five larger in monograph (note these figures include six trials where the primary outcome used for sample size calculation is different in protocol/proposal and monograph) |
4. (b) Presented total number of participants required to recruit | 30/112 | 1 | 29: 10 larger in monograph and 19 larger in protocol/proposal (note this includes five trials where the primary outcome used is different so not comparable) |
5. Any component | 45 | 18 | 39 |
- a
Reported in the publication but not mentioned in the protocol.
TABLE 38
Component of sample size calculation | Protocol, n (%) | Publication, n (%) |
---|---|---|
Alpha | ||
5% | 102 (94.4) | 100 (91.7) |
1% | 3 (2.8) | 3 (2.8) |
Other | 3 (2.8) | 6 (5.5) |
Total | 108 | 109 |
Power | ||
< 80% | 1 (0.9) | 1 (0.9) |
80% | 59 (52.2) | 63 (55.8) |
81–84% | 3 (2.7) | 1 (0.9) |
85% | 1 (0.9) | 4 (3.5) |
86–89% | 2 (1.8) | 1 (0.9) |
90% | 42 (37.1) | 40 (35.3) |
> 90% | 5 (4.4) | 3 (2.7) |
Total | 113 | 113 |
Did they consider dropout? | ||
Yes | 72 (61.5) | 58 (48.3) |
No | 12 (10.3) | 27 (22.5) |
Not clear | 1 (0.8) | 0 |
No information available | 37 (31.6) | 35 (29.2) |
Total | 122 | 120 |
TABLE 39
Description of data | n (%) | n (%) from Pocock et al.95 |
---|---|---|
What type of figures was used to illustrate results? | ||
Kaplan–Meier plot | 25 (20.0) | 32 (41.6) |
Repeated measures plot | 48 (38.4) | 20 (26.0) |
Forest plot | 21 (16.8) | 21 (27.3) |
None of the above | 31 (24.8) | N/A |
Total | 125 | 73 |
N/A, not applicable.
Boxes
BOX 6
The actual research questions answered under this theme
Did the protocol specify the planned method of analysis for the primary outcome in sufficient detail? In relation to:
- T4.1. how many specified a method of analysis for the primary outcome.
- T4.2. whether or not this improved over time.
- T4.3. statistical test applied.
- T4.4. significance level.
- T4.5. hypothesis testing.
- T4.6. adjustment for covariates.
- T4.7. analysis population.
- T4.8. adjustment for multiple testing.
- T4.9. missing data.
- T4.10. sufficient detail including all of the above seven elements recorded in the protocol.
Was the analysis planned in the proposal/protocol for the primary outcome carried out? In relation to:
- T4.11. statistical test/model used.
- T4.12. significance level.
- T4.13. analysis population.
- T4.14. missing data.
- T4.15. covariates adjusted for in the analysis.
How was the sample size estimated (power, confidence intervals, etc.)?
- T4.16. Was sufficient information on the sample size calculation provided?
- T4.17. Does the sample size calculation in the protocol match the sample size calculation shown in the monograph? What discrepancies were found?
- T4.18. What values of alpha, power and drop out were used in the sample size calculation?
- T4.19. Other information: what graphical presentation of data was reported in HTA trials?
BOX 7
Examples of discrepancies between statistical test/model planned in the protocol/proposal and used in the monograph
- In three trials (ID131, ID132 and ID133) reported in one monograph, the authors stated in the protocol that they would analyse the primary outcome score data using logistic regression. They actually analysed the continuous score data using ordinal regression (which they classified as linear regression).
- Trial ID65 planned in the proposal to analyse the second primary outcome as follows: ‘Six month follow up data, relapse rates will be analysed by comparing relapse rates between the groups by survival analysis using cox’s regression controlling for baseline depression, age, sex and centre.’ They actually compared the percentage relapsing in each group at the end of treatment using Fisher’s exact test, which yielded a significant result (p < 0.005).