Discussion

Robert C Stein; Janet A Dunn; John MS Bartlett; Amy F Campbell; Andrea Marshall; Peter Hall; Leila Rooshenas; Adrienne Morgan; Christopher Poole; Sarah E Pinder; David A Cameron; Nigel Stallard; Jenny L Donovan; Christopher McCabe; Luke Hughes-Davies; Andreas Makris; on behalf of the OPTIMA Trial Management Group

NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Stein RC, Dunn JA, Bartlett JMS, et al.; on behalf of the OPTIMA Trial Management Group. OPTIMA prelim: a randomised feasibility study of personalised care in the treatment of women with early breast cancer. Southampton (UK): NIHR Journals Library; 2016 Feb. (Health Technology Assessment, No. 20.10.)

Chapter 6 Discussion

Recruitment and study conduct

Recruitment and study design

OPTIMA prelim opened to recruitment in late September 2012. The first patient consented to join the study in October 2012. By 3 June 2014, 350 patients had been registered, of whom 313 had been randomised from 35 recruiting centres. The OPTIMA TMG, and the NIHR HTA programme as study funder, always considered that patient recruitment might prove to be difficult. Previous clinical trials involving an experimental arm in which less treatment is given than is standard clinical practice, such as the ProtecT trial,¹⁴⁰ have experienced difficulties in recruitment.¹⁸⁵,¹⁸⁶ Prespecified success targets were designed to demonstrate that recruitment into the proposed main trial was feasible once the number of recruiting centres was scaled up. Most of these applied to the final 6 months of recruitment to allow time to open the requisite number of sites and for research staff to learn how to recruit potential participants. Site selection for the OPTIMA prelim was on the basis of individual invitation rather than an open call. Five main geographical clusters were established, namely North London, East Anglia, the South West, the West Midlands and Scotland. Sites within each cluster included a mixture of cancer centres and district general hospitals. This policy was designed to ensure that feasibility could be demonstrated in a representative selection of UK centres rather than in those that are best described as enthusiasts.

All predefined feasibility conditions have been met, demonstrating that a large-scale effectiveness trial of the use of multiparameter assays as a decision tool for patients with primary ER-positive HER2-negative breast cancer in the UK is indeed achievable. This is despite approximately 80% of patients having lymph node involvement, which places them at comparatively high recurrence risk.

Two of the three other international trials of test-directed chemotherapy decisions have met their recruitment targets.¹²¹,¹²⁴ The MINDACT study estimated patient risk using information from a clinical risk prediction nomogram, Adjuvant! Online, and a biological test, MammaPrint.¹²²,¹⁸⁶ Patients with discordant risk assessments were randomised to a decision based on one of the two. Eligible patients could have up to three involved axillary lymph nodes but there was no restriction on receptor status. Patients with high-risk disease, determined on the basis of test concordance or by randomisation, were treated with chemotherapy and offered an optional further randomisation between one of two chemotherapy regimens. Similarly, those who had ER-positive disease, whether or not they received chemotherapy, were eligible for optional randomisation between two endocrine treatment regimens. A formal feasibility analysis was published after 800 patients (including a small number from the UK) had been registered, at which time 25% of registered patients had discordant test results.¹⁸⁷ The MINDACT design is very complex, which inevitably created difficulties in explaining the study to patients. There is no published information on the acceptability of the study to patients, although the use of a risk score provided by Adjuvant! Online is likely to have been reassuring. The study recruited approximately 6600 patients. The first analysis is expected to be published in 2016.

The TAILORx trial was conducted in the USA. Eligible patients had ER-positive, HER2-negative tumours no larger than 5 cm in diameter and without lymph node involvement. Registered patients underwent Oncotype DX testing.¹²¹,¹²⁴ Those with a RS in the range 11–25 were eligible for randomisation between chemotherapy followed by endocrine therapy versus endocrine therapy alone. Non-randomised patients with low or high scores were followed up in registry arms. The study, which has a non-inferiority end point, aimed to enrol 11,248 patients and to randomise approximately 4500; no information is available on the number of eligible patients who accepted randomisation. Most of the patients randomised in the study would be considered at too low a risk to be offered chemotherapy in routine UK clinical practice. The primary analysis is planned to take place in December 2017.

The RxPONDER study is similar in design to TAILORx but includes patients with one to three involved lymph nodes; those with an Oncotype DX RS of ≤ 25 are eligible for randomisation.¹²³,¹⁸⁸ In potential study participants, tumour testing may be performed either through trial registration or independently. An estimated 9400 patients will need to be screened to achieve the randomisation target of 4000. Cost-effectiveness research is an integral component of RxPONDER, which is largely funded by US private health insurers. The primary analysis of the study, which began recruitment in 2011, is expected in 2022. No information has been published on the acceptability of the study to patients.

The OPTIMA study has significant design differences to these three studies. Patients with up to nine involved axillary nodes are eligible for randomisation, which by virtue of tumour stage is a higher-risk group than the other studies have allowed. The study is partially blinded to minimise the likelihood of bias in clinician behaviour and to reduce the risk of non-acceptance of treatment allocation; thus, participants allocated to chemotherapy are not informed if they have been randomised to the control arm or have a high-score Oncotype DX test result. Although patients allocated to no chemotherapy are aware that they have a low-score Oncotype DX result, the actual RS is not disclosed to the site or patient. Patients who are premenopausal at diagnosis are routinely treated with ovarian suppression regardless of whether or not they receive chemotherapy, as otherwise a chemotherapy-induced menopause, which is difficult to identify reliably, would be a potential source of inequality between the treatment arms. All of these features have been identified in screening logs as reasons for potential participants choosing not to join the study. Despite this, overall patient acceptance identified from the screening log was 47%, which exceeds the protocol-defined target of 40%. Of those patients who went on to be randomised, 20% of patients had node-negative disease while 15% had four or more involved nodes. A total of 31% of participants were reported to be pre- or perimenopausal at the time of randomisation. Thus, the OPTIMA prelim has demonstrated that it is possible to conduct a study with design features that will maximise the chances that the study will deliver an unbiased result despite their difficulties for potential participants.

The reasons for the choice of an Oncotype DX RS cut-off point of ≤ 25 vs. > 25 for chemotherapy have been explained in Chapter 2, Methods. To put this into context, a RS of 25 equates to a 16% risk of developing metastatic disease over 10 years in a node-negative breast cancer population treated with tamoxifen for 5 years but not chemotherapy,⁴¹ often termed residual risk. Any estimation of chemotherapy benefit at this level of risk, however, must take into account that approximately one-third of this risk relates to chemotherapy-insensitive late (beyond 5 years) relapse.⁹
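As a rough worked illustration of the figures in the preceding paragraph, the sketch below splits the 16% residual risk at a RS of 25 into its late and earlier components. The variable names are hypothetical, and treating 'approximately one-third' as exactly 1/3 is a simplifying assumption rather than a figure taken from the cited analyses.

```python
# Illustrative arithmetic only; not a clinical risk model.
residual_risk_10y = 0.16   # 10-year distant recurrence risk at RS 25 (from the text)
late_fraction = 1 / 3      # share attributed to late relapse (assumed to be exactly 1/3)

# Portion of the risk arising beyond 5 years, described as chemotherapy-insensitive.
late_risk = residual_risk_10y * late_fraction

# Remaining portion, arising within the first 5 years.
early_risk = residual_risk_10y - late_risk

print(f"late component:  {late_risk:.1%}")
print(f"early component: {early_risk:.1%}")
```

On these assumptions, only about two-thirds of the quoted residual risk falls in the period in which chemotherapy could plausibly act, which is why the headline 16% overstates the risk that treatment could modify.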

Clinical decisions on chemotherapy use are influenced by residual risk and the predicted likelihood of chemotherapy benefit, meaning the prevention of distant recurrence and death, which is related to residual risk. There is no universally accepted level of predicted benefit that informs practice. Perception of clinical risk and what constitutes a meaningful benefit from medical treatment is very much a matter of individual judgement for both patients and clinicians. In the UK, most oncologists will discuss breast cancer chemotherapy with patients who have a 3–5% predicted chance of benefit and recommend its use for those with > 5% chance of benefit. At this level, although there is a significant population-wide treatment benefit, the probability that an individual patient will benefit is modest.

Risk of recurrence is influenced by clinical stage as well as by tumour biology, and adding stage information to multiparameter assay output improves the prediction of residual risk (e.g. see Tang et al.⁶⁴,⁷⁰). Should the hypothesis underlying the OPTIMA study prove correct, however, then patients with tumours with a low multiparameter risk score are unlikely to benefit to a meaningful degree from chemotherapy despite a potentially high risk resulting from adverse stage, particularly lymph node involvement. This is the conclusion from the retrospective analysis of the NSABP B-20 trial.⁶⁴,⁷⁰ It is difficult to make a direct estimate of the likely level of chemotherapy benefit in the largely node-positive OPTIMA population at the RS threshold of 25. Estimates based on NSABP B-20⁶¹ and SWOG88-14⁶² studies performed in node-negative and node-positive populations, respectively, suggest a reduction in 10-year breast cancer mortality risk in the order of 5%. If a RS of 25 is indeed close to the score at which the Oncotype DX predicts significant chemotherapy sensitivity, then nodal status and other stage information are unlikely to have much influence on this estimate. In contrast, for patients at risk by virtue of adverse stage and whose tumours have higher scores predicting chemotherapy sensitivity in addition to risk, chemotherapy is likely to offer a very substantial benefit. Similar considerations should apply to other multiparameter assays.

Patient focus groups

Three patient focus groups were held to assess the acceptability of the study and the decision-making of women who might have been offered randomisation into the OPTIMA prelim. These groups also reviewed the patient information leaflets and consent form. The study was acceptable to the majority of participants. Some had a clear preference for chemotherapy as that is what they had been offered, and some discussed how they felt about not being offered chemotherapy. The trial design using a ‘test’ to decide treatment was acceptable and most felt that a personalised approach using these multiparameter tests would be a preferred option. The results of the focus groups provided some triangulation on the QRS for the TMG.

Qualitative recruitment study

Key findings from the qualitative recruitment study

The QRS was designed as an integral part of the OPTIMA prelim and identified a number of key challenges to recruitment. It devised and delivered interventions to address these difficulties in collaboration with the TMG and CfI. The most prominent identified themes that recurred across interviews and audio-recorded consultations related to difficulties in eligibility processes and issues of clinician–patient communication. These challenges had the potential to limit the number of patients approached about the trial, and/or ran the risk of affecting the trial’s acceptability to patients.

In terms of difficulties in eligibility processes, some clinicians showed discomfort with the OPTIMA prelim’s eligibility criteria, which deviated from embedded clinical practice. For instance, some clinicians felt patients with extensive lymph node involvement or higher tumour grade required chemotherapy, leading to reluctance to enter these individuals into a trial where they may not receive this treatment. Equipoise issues in some cases therefore had implications for whether or not certain high-risk patients were offered the trial at all. Discomfort surrounding eligibility criteria also had potential to influence how clinicians communicated with patients. For example, audio-recordings revealed that the prospect of chemotherapy was sometimes (possibly inadvertently) presented as the preferred choice by clinicians. Recordings of consultations also revealed examples of patients holding treatment expectations based on advice they had received from other clinicians they had encountered in their care pathway. Clinicians’ discretion played a role in determining whether or not patients would be approached about the trial; these judgments were sometimes based on subjective criteria, such as clinician assessments of whether or not patients could cope with additional information. As patient advocates, clinicians clearly felt they were protecting patients’ interests. However, these reported practices implied that patients themselves did not always make decisions about trial participation.

A number of recruitment challenges uncovered by phase 1 of the QRS related to matters of communication, such as the quality and clarity of explanations provided to patients. Recruiters experienced difficulties in explaining the OPTIMA prelim trial design – particularly the unconventional way in which the ‘test-directed treatment’ arm split into two further arms. Audio-recordings of consultations showed occasional use of problematic terminology to describe arms, and challenges in explaining trial-specific processes such as ‘randomisation’ and ‘blinding’. Concepts such as ‘uncertainty’ and ‘risk’ were also communicated in a variety of ways, to different levels of detail. Interviews uncovered recruiter discomfort in exploring patients’ decisions to decline the trial for fear of jeopardising relationships with patients, or being accused of coercive behaviour. While these scenarios should be handled sensitively, recruiters found it easier to avoid exploring decisions about trial participation altogether. However, analysis of audio-recordings provided some evidence to suggest that this avoidance could lead to missed opportunities to address patients’ misconceptions and concerns.

Identification of the above challenges led to a series of QRS interventions, designed and delivered in collaboration with the TMG and CfI. Key interventions included promotion of clinician-to-clinician discussions to address concerns about eligibility criteria; circulation of generic ‘tips and guidance’ sheets to help recruiters explain trial-specific processes within the context of the OPTIMA prelim; organisation of group feedback meetings to share QRS findings and address concerns collaboratively; and delivery of individual feedback, to provide tailored advice and consider solutions to challenges experienced by individual recruiters.

Recruitment into OPTIMA prelim continued to improve as the study progressed. However, it is difficult to assess the impact of the QRS, as claims of causality are problematic considering the multiple variables that could have influenced recruitment at any given time. Nonetheless, we found some qualitative evidence to suggest QRS interventions had an impact on some clinicians’ practices, although these findings were based on a small sample of audio-recorded consultations.

Strengths and limitations

There were a number of limitations to the QRS methods. In terms of sampling, the full range of the OPTIMA prelim centres was not available owing to some centres’ decisions to decline participation in the optional QRS. The QRS consisted of mainly mid-range recruiting centres, not those with the highest and lowest recruitment rates. Inclusion of centres with more extreme recruitment figures would have been beneficial for comparison purposes, and may have contributed to further advice and recommendations to optimise information provision and recruitment. Furthermore, questions should be raised about possible differences between centres that opted for and against QRS participation (e.g. differences in commitment to the trial, differences in resource availability).

Despite encouragement from the QRS team and TMG, recording of consultations did not occur as a matter of routine, and the full range of interactions with patients was not captured. This is particularly true for second oncology consultations (where patients sometimes gave their decisions about participation). Incomplete recordings of each patient’s pathway made it difficult to track events and made deciphering patients’ reasons for declining the trial problematic. In addition, limited numbers of audio-recordings for each individual recruiter restricted opportunities to assess the impact of QRS interventions through ‘before/after’ comparisons. Reluctance to routinely record consultations may have been an indirect consequence of recruiters perceiving the QRS as an optional additional task. The main trial, if funded, will need to frame the QRS as a fully integrated component of the trial.

Although not a limitation of the qualitative methodology, questions remain with regard to the impact of QRS interventions on recruitment rates and recruiters’ practices. We saw some evidence of recruiters changing their practices and can be certain that recruitment improved over time. However, we cannot make claims of causality between QRS interventions and recruitment rates. Further research by the QRS team will consider this in more detail by exploring innovative ways in which this type of work can be evaluated in the context of RCTs.

The above limitations are being addressed in future applications of integrated qualitative work by the QRS team, and will be in the main trial, if funded. However, the practical (rather than theoretical) applications of the QRS to the OPTIMA study need to be considered foremost; it is this applied nature of the QRS that distinguishes it from traditional qualitative studies.

A clear strength of the QRS was its mixed-method, flexible approach to investigation. This enabled the team to understand recruitment difficulties during enrolment in the trial, leading to identification of challenges that were specific to the OPTIMA prelim (e.g. eligibility concerns), as well as difficulties that have been previously reported in most trials (e.g. difficulties explaining trial-specific processes). The opportunity to feed back findings quickly to change participants’ practices was a key strength of this research, highlighting its direct practical applications. This process of sharing findings with participants also allowed for further exploration of ‘challenges’, thus reinforcing the iterative nature of the work undertaken and providing an informal form of participant validation.

The QRS work undertaken was a collaborative process that was designed to build on the support and expertise of the TMG to help design and deliver interventions. This element of the QRS is an important strength, as interventions were refined through input from a truly multidisciplinary team.

Lessons learnt from the qualitative research study: implications for a main trial

In the light of the above challenges and interventions in the OPTIMA prelim, there are a number of key lessons that could be carried forward to the main trial. As new centres are likely to open, issues of equipoise and discomfort surrounding eligibility criteria are likely to present again. The TMG could pre-empt these difficulties by providing time for clinicians to discuss eligibility criteria at site initiation meetings. One option may be to present clinical vignettes to prompt discussion about eligibility concerns – a technique adopted in some of the group feedback meetings in the OPTIMA prelim.

A clear lesson from the QRS is the need for continued effort to use qualitative methods to optimise the quality of information provision and address recruitment challenges in the main study. Experiences of the QRS within OPTIMA prelim will help inform ways in which the QRS can be better integrated to ensure it is used to its maximum potential. First, a more purposeful approach to sampling centres will be used, with a view to including a maximum range of centres that vary by size, geographical location and recruitment rates (based on screening log analysis). As suggested above, this process will work best if the QRS is presented as an integrated component of the main study. Recruiters will be encouraged to record routinely all interactions with patients who have given consent; this may call for the QRS team to offer further support with the recording process (e.g. by making site visits where needed) and for recruiters to prioritise face-to-face (rather than telephone) consultations whenever possible. Even if telephone consultations are a necessity, additional equipment can be provided to ensure these discussions are also captured.

Finally, a key consideration of the QRS within a main trial will be to compare recruiters’ practices before and after feedback. This will help in the development of strategies to evaluate the impact of QRS interventions. Promotion of routine audio-recording of consultations will be a cornerstone of this work.

Pathology and health economics

Characteristics of the OPTIMA prelim population

An important consideration is whether or not the population recruited represents those to whom the research question is relevant. Although the prevalence of high-grade tumours in the OPTIMA-eligible population is unknown, there seems to be an excess of histological grade 2 lesions compared with grade 3 tumours in the OPTIMA prelim; 6%, 67% and 27% of lesions were reported to be grade 1, 2 and 3, respectively. For reference, in an historic and completely unselected series of patients presenting with symptomatic early breast cancer the ratio of grade 1, 2 and 3 lesions was reported to be 2 : 3 : 5.¹⁸⁹ In mixed screening and symptomatic series of ER-positive tumours, lower proportions of high-grade tumours (e.g. 3 : 4 : 3¹⁹⁰) have been described. It is likely that the proportion of high-grade tumours in the OPTIMA-eligible population lies between these two proportions. Certainly the 18% of tumours with an Oncotype DX score of ≥ 25 was lower than the predicted 30% in our population. There appears, however, to be little difference from that expected in tumour type distribution, with 70% of cancers being of no special type/ductal in large series,¹⁹⁰ compared with 71% in the OPTIMA prelim. One possible explanation for the apparently lower than expected recruitment of patients with more aggressive tumours is an element of patient selection by recruiting clinicians. It is important to ensure that the study is not skewed towards patients with very low-biological-risk tumours if it is to be relevant to the entire population. Recruiting clinicians therefore need to be reassured that multiparameter assays reliably classify tumour risk irrespective of histological grade; going forwards to the main trial, this will require education.

Central review of receptor status

We undertook routine central retesting of receptor status in the OPTIMA prelim because of concern about a potential for bias if patients with either ER-negative or HER2-positive disease based on local histopathology testing were included in the study. There are many possible causes for this scenario, including a well-recognised low incidence of HER2 gene amplification among tumours that are considered HER2-negative by immunohistochemistry in the first step of the standard two-stage HER2 testing procedure recommended in UK guidelines.¹⁹¹ Any participants with ER-negative or HER2-positive disease would be disadvantaged if allocated to endocrine therapy only, and such imbalance in treatment allocation would be a potential source of confounding. To minimise this risk, central review to confirm the referring pathology laboratory results for ER positivity and HER2 negativity was undertaken with immunohistochemistry for ER, and with FISH for HER2 status. The former was scored using the Allred system (% staining and average intensity; range 0–8) with a cut-off point of ≥ 3 defined as positive. HER2 positivity with FISH was defined as a ratio of HER2 to chromosome 17 centromeric probe copy numbers of > 2.0 (2.0–2.2 defined as borderline but amplified). This confirmed the eligibility of all but 12 of 325 patients (96.3%) tested on central review of receptor status.
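The central review thresholds described above can be written as a simple decision rule. The following is a minimal sketch, not the study's laboratory procedure: the function names are hypothetical, the handling of a ratio of exactly 2.0 is an assumption (the text gives both '> 2.0' and a borderline band of 2.0–2.2), and tumour heterogeneity, which also led to ineligibility, is not modelled.

```python
def er_positive(allred_score: int) -> bool:
    """ER status from an Allred score (range 0-8); >= 3 is positive."""
    if not 0 <= allred_score <= 8:
        raise ValueError("Allred score must lie between 0 and 8")
    return allred_score >= 3


def her2_fish_category(ratio: float) -> str:
    """Category from the HER2 : chromosome 17 centromere copy number ratio."""
    if ratio > 2.2:
        return "amplified"
    if ratio >= 2.0:  # assumption: exactly 2.0 is placed in the borderline band
        return "borderline but amplified"
    return "not amplified"


def centrally_eligible(allred_score: int, her2_ratio: float) -> bool:
    """Eligible only if centrally ER positive and without HER2 amplification."""
    return er_positive(allred_score) and her2_fish_category(her2_ratio) == "not amplified"


print(centrally_eligible(7, 1.3))
print(centrally_eligible(2, 1.3))
print(her2_fish_category(2.1))
```

Keeping the ER and HER2 rules in separate functions mirrors the way the two receptors are reported separately in the results.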

Reassuringly, in the OPTIMA prelim, only 1.2% (n = 4) of patients who locally were thought to have ER-positive disease were found centrally to have ER-negative tumours. A total of 2.2% (n = 7) had tumours with at least borderline HER2 amplification (three borderline but amplified, four amplified). Of these, one woman had a tumour that was both ER negative and HER2 positive centrally. Two additional tumours were heterogeneous, with discrete subpopulations of ER-positive and ER-negative cells and were, for this reason, deemed ineligible (< 1%).

This level of discrepancy is lower than reported in the majority of clinical trials of early breast cancer, although many of these have documented central review of HER2-positive (rather than HER2-negative) disease. For example, among cases regarded locally as ER-positive, 4% and 16%, respectively, were centrally redefined as ER-negative in two reviewing centres in Italy and the USA in the BIG 2-06/NCCTG N063D trial.¹⁹² Similarly, in a centrally reviewed consecutive series of node-negative breast cancer from the Netherlands, 4% and 5% disagreement was seen for ER and HER2, respectively, between local and central tests (n = 694).¹⁹³ The low level of disagreement in receptor status between local and central laboratories that we report in this study reflects improvements in standardisation of receptor testing and the high levels of quality assurance in the UK including, for example, mandatory participation in National External Quality Assessment Service for Immunohistochemistry.¹⁹⁴

The overall 3.7% (95% CI 1.7% to 5.8%) incidence of discrepancy between local and central laboratories is reassuringly low. Although none of these cases underwent Oncotype DX testing, it is likely that such cases would generate high-risk RS and consequently be allocated chemotherapy. Of the seven (2.2%) patients who were found to have tumours that were HER2 amplified, three were classified as borderline while four were clearly amplified. Such patients would ordinarily be treated with adjuvant trastuzumab but received this treatment only by virtue of agreeing to join the OPTIMA prelim. The evidence of benefit for adjuvant trastuzumab treatment for cases with equivocal HER2 amplification¹⁹⁵ is controversial.
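The counts reported in this section can be reconciled, and the overall discrepancy rate reproduced, with a few lines of arithmetic. The sketch below assumes a normal-approximation (Wald) confidence interval; the report does not state which interval method was used, so the limits it produces are close to, but need not be identical with, those quoted. All names are hypothetical.

```python
import math

# Counts taken from the text of this section.
tested = 325
er_negative = 4      # centrally ER negative
her2_amplified = 7   # three borderline but amplified, four amplified
both = 1             # one tumour counted in both groups above
heterogeneous = 2    # mixed ER-positive and ER-negative subpopulations

# Avoid double counting the tumour that was both ER negative and HER2 positive.
discrepant = er_negative + her2_amplified - both + heterogeneous


def wald_interval(events: int, n: int, z: float = 1.96):
    """Proportion with a normal-approximation interval (assumed method)."""
    p = events / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, p - half_width, p + half_width


rate, lower, upper = wald_interval(discrepant, tested)
print(f"{discrepant} of {tested} discrepant: {rate:.1%} "
      f"(approximate 95% CI {lower:.1%} to {upper:.1%})")
```

With so few events, an exact or Wilson interval would arguably be preferable to the Wald form; it is used here only because it is the simplest to verify by hand.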

Therefore, given discordance rates of ≤ 2.2% for each receptor separately among the 325 tumours centrally tested for ER and HER2, our experience suggests that central review of receptor status is unnecessary in a large Phase 3 trial.

Multiparameter assays

Oncotype DX was chosen as the primary discriminator for decisions on treatment allocation for OPTIMA prelim as this was judged most likely among the available tests to be acceptable to patients and clinicians alike. The evidence supporting the use of Oncotype DX was and remains significantly stronger than for potential alternatives. There is widespread familiarity with the test as the result of its successful marketing in the UK private health-care sector. This has been reinforced by the NICE DG10 guidance,¹¹⁷ which provides it with a stamp of authority that its competitors lack. Nevertheless there remain significant uncertainties in relation to the use of Oncotype DX and other multiparameter assays that justify the OPTIMA study.

One particular issue in relation to Oncotype DX is the choice of a cut-off point above which patients should be advised to have chemotherapy treatment. The test output is calibrated as risk of distant recurrence at 10 years in patients with ER-positive disease treated with tamoxifen alone. The Oxford Overview has clearly shown that there is no evidence that the recurrence risk in breast cancer patients who remain disease free 5 years after diagnosis is any different between those who did or did not receive adjuvant chemotherapy. This is despite the observation that one-third of the recurrences during the 10 years after diagnosis occur in the second half of this period.⁹ Therefore, estimation of likelihood of chemotherapy benefit in relation to test output is inherently difficult. In the absence of robust data on the effectiveness of chemotherapy in relation to Oncotype DX RS, the choice of a cut-off point is always going to be somewhat arbitrary. The rationale for selection of a cut-off point of 25 is described above (see Chapter 2, Methods).

As there is no established method for the comparison between multiparameter assay platforms being attempted in the OPTIMA prelim, the analysis has required the development of a methodology. There is much uncertainty inherent in this, which is discussed in Chapter 6, Pathology study results. A particular issue is that the assays have mostly been developed to indicate residual risk of relapse after 10 years of endocrine therapy and their predefined risk groupings reflect this. The limitations of this definition in relation to decisions on chemotherapy use are described in the previous paragraph. As there is no standardised definition of risk of relapse, the equivalent cut-off points for the available assays differ. This means that the proportions of cases assigned by the assays to low-risk (or not high risk where there are multiple divisions) and high-risk groups differ, which for the purposes of OPTIMA means that proportions of patients advised chemotherapy differ. Consequently, the assumptions for treatment benefit made in the health economics analysis are uncertain. As a result of this, any conclusions that can be drawn from this analysis are, at best, an estimate and require formal testing in a prospective study.

Pathology study results

This study within the OPTIMA prelim represents, to our knowledge, one of the few and certainly the most comprehensive ‘head-to-head’ comparisons of multiple tests designed to utilise tumour biology to predict patient residual risk. Rarely has there been the opportunity to compare results of multiple tests on a patient-by-patient basis to explore differences in risk categorisation. However, it should be remembered in the interpretation of these data that this study was designed with limited and focused goals. We are aware of the intense interest around this study and wish to ensure that results are interpreted in the appropriate context.

The second objective of this study, after demonstrating feasibility, was to aid the selection of candidate biological risk stratification tools for inclusion in the main OPTIMA trial. The OPTIMA trial aims to test whether or not, for ER-positive HER2-negative patients deemed to be clinically high risk, an additional biological predictor of risk can inform treatment. Specifically, for the clinically high-risk population, is it appropriate to assign biologically low-risk patients to endocrine therapy without chemotherapy while retaining chemotherapy for clinically and biologically high-risk patients?

A key challenge for the OPTIMA triallists remains: which of the several validated diagnostic assays that assess biological risk should be used to address the trial question?

In addressing this second goal, we recognise a number of important factors, which both inform and limit the interpretation of results from the OPTIMA prelim pathology analysis.

OPTIMA prelim was neither designed nor powered to evaluate or compare the prognostic value of the tests included in the study. The tests that have been evaluated in the study have been extensively validated prior to inclusion in this study for their ability to inform risk, mostly following endocrine treatment. In addition, a number of elegant studies from the transATAC study group have shown that each test provides broadly similar risk discrimination in luminal breast cancers.⁴³^,⁵⁵^,⁵⁶ However, all tests, while of significant value in informing patient choice, are only modestly predictive of the risk of relapse at the individual patient level. Within the validation studies for each test there are patients assigned to ‘low-risk’ groups who relapse and die of their disease, and in all cases the majority of patients who are ‘high-risk’ do not in fact relapse and die from their disease. These tests therefore represent a significant step towards personalised medicine but are not yet perfect. This is important since it implies that discordance in risk estimates between tests that measure risk in different ways may not be unexpected.

The OPTIMA prelim study sought to evaluate diagnostic tests in a high-risk patient group as defined by tumour stage; for some tests this means that the OPTIMA prelim applied risk scores designed for node-negative disease to a predominantly node-positive population. This is in line with current thinking that ‘biological’ risk measures are informative across, and to a degree independent of, stage of disease. This thinking is supported by the fact that only grade, in the current study, is significantly associated with biological risk defined by multiparameter assay. This association may be explained by the fact that both grade and most biological risk tools assess tumour proliferation.

The key end point required for an appropriate choice between tests for the OPTIMA trial, which was also a key objective of the current OPTIMA prelim analysis, was to identify the proportion of cases assigned as high risk (or low risk) by each test, accepting that for each test this was a ‘true result’. As regards biological risk, since we cannot assess outcome, each test must function as its own gold standard.

Each test for which data are available provides an individual patient risk assessment. If all, or indeed any, of the candidate tests were 100% accurate in predicting risk (recurrence) it would be appropriate to use concordance estimates or kappa statistics to compare tests. However, in the absence of such a ‘gold standard’, kappa comparisons between tests simply reflect agreement between multiple modestly accurate tests. The low concordance between tests reflects the fact that the tests are measuring different genes, using different technology, and highlights the problems of predicting recurrence risk based on the biology and management of the tumour.

We have presented comparisons between two different types of test results: (1) molecular subtype comparisons that group tests into luminal A, luminal B, etc. and (2) risk assessments where tests have included a risk prediction (either categorical or continuous). For tests generating continuous risk scores we have grouped patients into ‘low risk’ (patients who might safely avoid chemotherapy in the OPTIMA study) or ‘high risk’ (patients for whom their biological risk would suggest that chemotherapy should be given). For example, Oncotype DX provides both a continuous risk score and assigns patients to ‘low’, ‘intermediate’ or ‘high’ relapse risk. For the purposes of the OPTIMA trial we have combined the OPTIMA prelim cases classified by Oncotype DX as ‘low’ (54% of cases) and ‘intermediate’ risk (28% of cases) into a single category of ‘low’ risk of relapse/low probability of chemotherapy benefit (called low score for convenience). For other tests, patients were categorised on the basis of pre-established cut-off points or, in the cases of tests providing subtype information (e.g. BluePrint), on the basis of subtype (luminal A vs. all others).

Comparison of risk categorisation and subtype prediction between tests

Of the five tests that categorise risk, Oncotype DX, Prosigna and IHC4 predicted similar rates of ‘low-risk’ cases, while MammaPrint identified fewer biologically low-risk cases in the OPTIMA prelim population (see Table 26). It should, however, be recognised that with the small sample size available (302 patients) we are not able to state clearly that, for example, Oncotype DX identifies more ‘low-risk’ patients than ‘Prosigna’, or vice versa. However, at a simple level the proportion of patients who, on the basis of these five tests, might safely avoid chemotherapy, is comparable and can be used to inform cost-effectiveness analyses (see Discussion of the health economic methodology and results). Differences in proportions of cases identified between tests will impact on the cost-effectiveness of different tests in the context of the main OPTIMA trial.

Three tests (BluePrint, Prosigna and MammaTyper) provide information on molecularly defined intrinsic subtypes. BluePrint and Prosigna provide this information alongside molecular profiling for residual risk, while MammaTyper is specifically focused on providing subtype information including a stratification of ‘luminal B’ cases into low and high risk.

As with risk scores, we show a broad similarity in the proportion of OPTIMA prelim cases, all of which are centrally confirmed as ER-positive and HER2-negative, assigned by each test to luminal A or to luminal B/HER2 enriched/basal-like. BluePrint assigned 61%, MammaTyper assigned 62% and Prosigna assigned 60% of cases to the luminal A category and 39%, 38% and 40%, respectively, to non-luminal A subtypes. Although interesting, given that the study was recruiting only patients with luminal breast cancer, the identification of small numbers of ‘HER2-enriched’ or ‘basal-like’ ER-positive/HER2-negative cancers, particularly with complex multiparameter tests like Prosigna, is unsurprising and has been frequently reported in other studies (e.g. Parker et al.²⁸).

What is also clear from these data is that different tests commonly disagree about individual tumours. Only 93 (31%) patients were classified as low/intermediate risk by all five tests (IHC4, IHC4 AQUA, Oncotype DX, MammaPrint and Prosigna) and only 26 (8%) patients were classified as high risk by all five tests. Over 60% of patients were classified as high risk by at least one test and low risk by at least one test. A similar finding has been reported in a comparison between PAM50 and Oncotype DX.⁵⁵ Similar, but perhaps more surprising, disagreement was observed between the three tests that predict subtype; only 59% of cases were consistently classed as either luminal A or ‘non-luminal A’ by all three subtyping tests. Kappa values give a measure of this disagreement where test thresholds are similar; for example, in the comparisons between Oncotype DX and IHC4 risk categorisation, and in the subtype predictions made by BluePrint and Prosigna, the kappa values were 0.53 and 0.55, respectively, indicating modest agreement.
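To illustrate how two tests can assign similar overall proportions of patients to the low-risk group while still agreeing only modestly on individual patients, the following minimal Python sketch computes Cohen's kappa for two dichotomised tests from a 2 × 2 table of counts. The function name and the example counts are hypothetical and chosen for illustration; they are not OPTIMA prelim data.

```python
def cohen_kappa(both_low, a_low_b_high, a_high_b_low, both_high):
    """Cohen's kappa for two binary (low/high risk) tests, from 2 x 2 counts."""
    n = both_low + a_low_b_high + a_high_b_low + both_high
    if n == 0:
        raise ValueError("the table contains no patients")

    # Observed agreement: proportion of patients given the same category.
    observed = (both_low + both_high) / n

    # Agreement expected by chance, from each test's marginal proportions.
    a_low = (both_low + a_low_b_high) / n
    b_low = (both_low + a_high_b_low) / n
    expected = a_low * b_low + (1 - a_low) * (1 - b_low)

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 100%")
    return (observed - expected) / (1 - expected)


# Hypothetical example: each test calls 80 of 100 patients low risk,
# and the two tests agree on 80 patients in total.
kappa = cohen_kappa(both_low=70, a_low_b_high=10, a_high_b_low=10, both_high=10)
print(round(kappa, 3))
```

In this hypothetical table the raw agreement is 80%, but 68% agreement would be expected by chance given the marginal proportions, so kappa is only 0.375. This is why identical headline proportions do not imply agreement at the individual patient level.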

The discrepancies between the tests are likely to reflect the differences in both the specific genes and the number of genes assessed by the individual tests. Test training may also contribute to discrepancies, although most of the tests were developed in ER-positive patient populations treated with endocrine therapy but not chemotherapy. This aspect of the OPTIMA prelim study will be of interest as we seek to understand the factors that underlie the different risk and subtype classifications.

It is not possible for us to draw any conclusions about the relative clinical performance of the tests in the absence of outcome data, and, in any case, the sample size is too small to allow an adequately powered comparison to be made. Our conclusion from this analysis is that these tests, each of which is independently validated as a tool to assess residual risk of relapse (although as yet there is no formal publication of validation studies for IHC4 AQUA and MammaTyper), are all modestly accurate. Differences in risk categorisation therefore highlight the potential for improvements in the assessment of residual risk without, at this stage, providing a means of discriminating between the tests currently available.

Summary

In one of the few detailed comparisons of risk signatures at a patient level we have demonstrated that marked differences between diagnostic tests exist at an individual patient level. This is as predicted from the relatively modest performance of all current residual risk tests. Other studies performed in the field, although with fewer comparisons, suggest that all existing tests are broadly similar in the overall ‘quantity’ of risk information they provide to patients.⁴³^,⁵⁵^,⁵⁶ The analysis from the OPTIMA prelim study, although we lack clinical outcome, is consistent with this conclusion. Therefore, although interesting and informative, this test-to-test comparison is unable to direct choice of tests for the OPTIMA main trial beyond triaging towards those tests most likely to be cost-effective in the economic analyses described in Discussion of the health economic methodology and results. Some tests, clearly, may be less ideal as candidates for the main OPTIMA trial for technical reasons, but others cannot be easily discarded at this stage, and indeed may be selected solely on the basis of potential cost-effectiveness between these biologically similar test approaches.

What this test-to-test comparison does reveal, illustrated by those cases where discrepancies between tests exist, is that collectively there is significant room for improvement of the predictive impact of tests of residual risks, which at present remains unrealised.

Discussion of the health economic methodology and results

The economic analysis excluded IHC4 because of lack of cost information and because of concern about its analytical validity. Although MammaTyper has demonstrated analytical validity, there are no published clinical validation data and no cost information available. There are no published data on either the analytical or the clinical validity of IHC4 AQUA.

The results of the economic analysis require careful interpretation. The differences in the expected value between the alternative tests are relatively small, and the uncertainty in the evidence for all tests is substantial. Even the test with the best current clinical evidence base, Oncotype DX, is less (rather than more) likely to be a good use of limited NHS resources when considered in comparison with alternative tests, according to NICE’s standard value criteria.¹⁴⁸^,¹⁹⁶

MammaPrint is expected to be dominated by, that is produce fewer health benefits and cost more than, other options in all of the scenarios, and the probability of MammaPrint being cost-effective never exceeds single figures. On this basis, it seems reasonable to exclude MammaPrint from consideration for current adoption, as only a small proportion of simulations indicate that choosing MammaPrint would be the optimal decision. One can speculate why this is the case: because this is a dichotomous test, developed across the whole spectrum of breast cancer subtypes, it may not have been optimised for patients with a luminal breast cancer, which is the target population for the OPTIMA study. Despite its apparent low expected cost-effectiveness, MammaPrint is associated with considerable uncertainty and this manifests as research value in the value-of-information analysis.
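The notion of dominance used above can be made concrete with a short sketch. The Python example below checks whether one strategy is dominated in the sense given in the text, that is, whether some other option both costs less and produces more health benefit. The strategy names, costs and quality-adjusted life-year (QALY) values are hypothetical placeholders and are not results from the OPTIMA prelim economic model.

```python
from typing import NamedTuple, Sequence


class Strategy(NamedTuple):
    name: str
    cost: float   # expected cost per patient (hypothetical units)
    qalys: float  # expected QALYs per patient (hypothetical values)


def is_dominated(candidate: Strategy, options: Sequence[Strategy]) -> bool:
    """True if another option costs less AND yields more QALYs than candidate."""
    return any(
        other.cost < candidate.cost and other.qalys > candidate.qalys
        for other in options
        if other != candidate
    )


# Hypothetical illustration only.
options = [
    Strategy("Test A", cost=9000.0, qalys=12.10),
    Strategy("Test B", cost=9500.0, qalys=12.05),
    Strategy("Test C", cost=8800.0, qalys=12.00),
]
for s in options:
    print(s.name, "dominated" if is_dominated(s, options) else "not dominated")
```

With these placeholder values only Test B is dominated, because Test A is both cheaper and more effective. Test C is cheaper but less effective than Test A, so choosing between them requires a willingness-to-pay threshold rather than a dominance argument.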

The other strong conclusion is that test-directed chemotherapy appears to be cost-effective, regardless of the choice of test. This supports the further development of an evidence base sufficient for robust adoption decision-making.

An additional question is whether or not there is value in further research on some of these tests. The estimates of the value of further research for the Oncotype DX, MammaPrint and Prosigna tests are sufficient to justify UK NHS investment in studies capable of reducing the test-specific uncertainty. The range of analyses presented here shows that there is substantial uncertainty about the performance of each of the tests and there is also substantial uncertainty about the appropriate characterisation of the long-term outcome, as demonstrated by the variation in the results of the analyses across the different scenarios. Addressing the latter source of uncertainty within the trial may require a trial design with long-term detailed follow-up. It may be efficient to reduce the uncertainty on these factors through analysis of existing cancer registry data, in addition to within a clinical trial, as information on cancer recurrence and post-recurrence treatments starts to be captured in national data sets.

It is important to note that differing conclusions are drawn regarding the optimal use of Prosigna (ROR_PT or Subtype) depending on which sensitivity analysis is correct. This kind of model-dependent structural uncertainty is difficult to characterise in the value-of-information analysis. Given the lack of consensus between models it is difficult to recommend an optimal use of Prosigna on purely economic grounds. Ideally, different ROR_PT cut-off points should be explored in further prospective research. If a single cut-off point needs to be chosen, as appears to be the case in light of the manufacturer’s marketing strategy, then it may be reasonable to base the choice on clinical or mechanistic justification.

The final notable observation from these results is that the value of undertaking further research around Oncotype DX, when the comparator is ‘chemotherapy for all’, is considerably lower than for the comparator technologies in all scenarios. This difference is likely to be driven by the higher uncertainties around the longer-term clinical outcomes, which are amplified in the model by discordance with Oncotype DX. The price differential is also likely to contribute, with Oncotype DX being more expensive per test than Prosigna. Again, further sensitivity analysis is required to fully understand what is driving the value of information, and a more granular EVPPI analysis will enhance understanding.

The economic analysis conducted at the end of the OPTIMA prelim has presented a number of challenges. The use of value-of-information analysis to inform a design decision between a feasibility study and a fully powered RCT is a new application for this method, much advocated by methodologists in the UK and worldwide.¹⁴⁶ The need for influential assumptions within the model prior to the availability of evidence from a definitive trial necessitates extensive scenario analysis to test these assumptions. This has resulted in the need for three different model specifications and multiple scenarios, making the results complex to understand. The recommendation of Prosigna as most valuable for further research has withstood testing of the main assumptions, suggesting that it is a robust recommendation.

A particular additional challenge in the use of value-of-information analysis within a short analysis period is the computational expense. Each EVPPI calculation in this analysis, which uses 5000 Monte Carlo simulations in both the inner and outer loops, required in excess of 7 months of processing time using a fast processor. The OPTIMA prelim analysis was therefore only possible using a high-performance computing facility (White Rose Grid: www.whiterose.ac.uk/projectswhite-rose-grid/) to implement analyses in parallel.
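The computational burden follows from the nested structure of the calculation. Assuming a standard two-level Monte Carlo scheme, 5000 outer and 5000 inner simulations imply 5000 × 5000 = 25 million inner samples per EVPPI calculation, each requiring a model evaluation for every strategy. The Python sketch below shows the shape of such a nested estimator using a deliberately trivial two-strategy net-benefit function. The function names, the parameters phi (the parameter of interest) and psi (the remaining uncertain parameters) and their distributions are all hypothetical, and the toy model bears no relation to the OPTIMA prelim economic model.

```python
import random


def net_benefit(strategy, phi, psi):
    """Toy net-benefit function (hypothetical, for illustration only)."""
    if strategy == 0:
        return 0.0        # comparator strategy
    return phi + psi      # incremental net benefit of the alternative


def evppi_nested(n_outer, n_inner, seed=0):
    """Two-level Monte Carlo estimate of EVPPI for the parameter phi."""
    rng = random.Random(seed)
    strategies = (0, 1)
    sum_of_best = 0.0                       # accumulates max over strategies
    sum_by_strategy = [0.0 for _ in strategies]

    for _ in range(n_outer):                # outer loop: sample phi
        phi = rng.gauss(0.0, 1.0)
        inner_totals = [0.0 for _ in strategies]
        for _ in range(n_inner):            # inner loop: sample psi given phi
            psi = rng.gauss(0.0, 1.0)
            for d in strategies:
                inner_totals[d] += net_benefit(d, phi, psi)
        inner_means = [t / n_inner for t in inner_totals]
        sum_of_best += max(inner_means)     # best decision once phi is known
        for d in strategies:
            sum_by_strategy[d] += inner_means[d]

    value_with_information = sum_of_best / n_outer
    value_with_current_information = max(s / n_outer for s in sum_by_strategy)
    return value_with_information - value_with_current_information


print(evppi_nested(n_outer=2000, n_inner=50, seed=1))
```

Because the outer iterations are independent of one another, they can be distributed across many processors, which is consistent with the use of a high-performance computing facility described above.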

A desirable further sensitivity analysis is exploration of the influence of Adjuvant! Online as a forecasting tool in the main economic analysis. It is possible that the reliance on this tool has biased the results in favour of tests for which the risk score correlates most highly with it. This is exemplified in Figures 18 and 19, which show that Prosigna ROR_PT is more highly correlated with Adjuvant! Online predictions than is Oncotype DX. The lack of CIs around Adjuvant! Online estimates also means that uncertainty will be underestimated in the model. The validation model is protected from this bias and further work exploring alternative tests using this model would be warranted. It is disappointing that validated alternatives to Adjuvant! Online reporting RFS are not available.

Overall conclusions

OPTIMA prelim has succeeded in all of its aims. The study was designed to establish whether or not a large-scale RCT of test-directed chemotherapy in patients with early breast cancer who are at significant future recurrence risk by virtue of axillary lymph-node involvement is feasible in the UK. The underlying hypothesis is that multiparameter assays are predictive of chemotherapy efficacy, which is a recommendation for future research in the NICE DG10 guidelines.¹¹⁷

Prespecified targets were built into the protocol and the study was opened in a broadly representative set of UK centres. All targets were met, indicating feasibility of a study designed to demonstrate non-inferiority of chemotherapy allocation based on the outcome of a multiparameter assay performed on individual patients’ tumour samples.

Recruitment posed significant challenges. PPI in the study design and in the early phases of recruitment, together with an integrated QRS, provided valuable insights into barriers to recruitment and resulted in changes to the PIS and to the advice given to participating centres. These insights will be important for the main study.

Oncotype DX was used as the primary discriminator for chemotherapy decisions in the OPTIMA prelim. Additional multiparameter assays were performed on participant tumour samples and health-economic modelling was performed using the data. A substantial degree of discordance was demonstrated in the risk assignments/biological subtype of individual tumours although the overall amount of information provided by each assay in the overall population appears to be similar. Effectiveness studies are required to demonstrate whether or not the differences between assays are clinically significant. The health economics analysis suggests that, of the assays that were evaluated, there is substantial value in further research involving all tests, although Prosigna can currently be considered the highest priority from the perspective of the UK NHS. The level of uncertainty in health economics analyses undertaken, however, is substantial. A prospective study is therefore required to validate both the clinical and economic hypotheses that underlay the OPTIMA prelim.

Recommendations for further research

OPTIMA prelim has succeeded in its aims of demonstrating the feasibility, within the UK, of a large-scale clinical trial to establish a method of selecting patients with moderate- or high-risk hormone-sensitive primary breast cancer who are likely to benefit or not benefit from post-operative chemotherapy and to establish the cost-effectiveness of alternative test-guided treatment strategies. Existing data that support the use of multiparameter assays as an aid to making chemotherapy decisions are based on retrospective analysis of historical data. Very few data are available on the clinical validity of multiparameter assays in patients with lymph node involvement, which is the population in whom the cost-effectiveness of test-guided treatment decisions is likely to be greatest. The NICE DG10 guidance,¹¹⁷ which applies only to women with lymph node-negative breast cancer, recommends that further research on the ability of multiparameter assays to predict chemotherapy benefit be performed. A prospective clinical trial designed to answer these questions would be of benefit to the NHS and should be undertaken. It should be noted that, from the perspective of the NHS, the value of information for further research into Prosigna is higher than for the other two assays that were evaluated in the base-case analysis. The discrepancies demonstrated between the tests that were evaluated in this study at an individual patient/sample level show that there is substantial scope for refinement of existing multiparameter assays.

Implications for practice

OPTIMA prelim was not designed and is not expected to generate any information that will result in a change in clinical practice. The finding of significant disagreement between multiparameter assays at an individual patient level should give clinicians currently using these technologies in their clinical practice pause for thought. As there is no established ‘best test’, not only are there limitations to the evidence base supporting multiparameter assay use but there are also limitations to the technology that is currently available. Nevertheless, the data generated by this study do not in any way undermine existing data supporting the use of multiparameter assays to guide chemotherapy decisions.

Copyright © Queen’s Printer and Controller of HMSO 2016. This work was produced by Stein et al. under the terms of a commissioning contract issued by the Secretary of State for Health. This issue may be freely reproduced for the purposes of private research and study and extracts (or indeed, the full report) may be included in professional journals provided that suitable acknowledgement is made and the reproduction is not associated with any form of advertising. Applications for commercial reproduction should be addressed to: NIHR Journals Library, National Institute for Health Research, Evaluation, Trials and Studies Coordinating Centre, Alpha House, University of Southampton Science Park, Southampton SO16 7NS, UK.

Included under terms of UK Non-commercial Government License.

Bookshelf ID: NBK343764
