
Observational evidence on the effectiveness of endovascular aneurysm repair compared with open surgical repair of unruptured abdominal aortic aneurysms: Abdominal aortic aneurysm: diagnosis and management: Evidence review K2. London: National Institute for Health and Care Excellence (NICE); 2020 Mar. (NICE Guideline, No. 156.)



2 Methods

2.1. General

This evidence review was developed using the methods and process described in Developing NICE guidelines: the manual (2014) and the general methods chapter for this guideline.

2.2. Identifying the evidence

A focused search strategy was used across multiple databases to identify relevant literature; see Appendix A for details. We reviewed the references of included articles and related reviews identified in the search to find any publications that the searches had missed.

2.3. Eligibility criteria

2.3.1. Study design

Recommended adjustment methods

As observational data are always subject to selection biases, we only included studies that attempted to account for differences in casemix between endovascular aneurysm repair (EVAR) and open surgical repair (OSR) cohorts. We considered comparative studies that used any of the methods of adjustment enumerated in NICE DSU Technical Support Document 17 (Faria et al., 2015), including:

  • Regression adjustment
  • Inverse probability weighting
  • Regression on the propensity score
  • Matching (nearest neighbour or propensity score)
  • Instrumental variable methods
  • Difference-in-differences designs
  • Regression discontinuity design

In practice, only 3 of these methods were represented in the assembled evidence: regression on the propensity score, matching by propensity score and inverse probability weighting. Each relies on the calculation of a propensity score, which estimates the probability that an individual receives 1 of the 2 treatments, given their characteristics. In regression analyses, the propensity score is used as a single measure of each participant’s characteristics, in an attempt to isolate the independent effect of the treatment they received from the effects of the factors that led to them being chosen for that approach. In matching by propensity score, each participant who received 1 of the treatments is matched with 1 or more similar participants who received the other, in an attempt to create the kinds of balanced cohorts that would be expected if treatment assignment had been randomised. In inverse probability weighting, the propensity score is used to assign weights to each of the participants in a study, again with the aim of creating 2 cohorts that have the same average characteristics.

In addition to the recommended methods of adjusting for selection bias, simple multivariable regression analyses are commonly reported in this area. Mostly, these take the form of logistic regression models (for example, for perioperative mortality) or Cox proportional hazards models (for example, for long-term survival). Such analyses attempt to isolate the independent effect of treatment when controlling for other covariates of outcome. These approaches are generally considered insufficient to identify treatment effects in the presence of selection bias, because differences in covariate distributions between treatment and control groups may compromise the linearity on which logistic regression and Cox proportional hazards models rely (Little and Rubin, 2000; Newgard et al., 2004). However, we identified studies using these techniques and, in stratified analyses, explored whether they estimate effects similar to those obtained with the recommended approaches.

The protocol for this review stated that studies with recommended adjustment methods should form our primary analysis, with naive regressions included only as a secondary analysis. However, in practice, we found no systematic discrepancies between the 2 types of data so, to avoid unhelpful duplication, we present both types in single analyses throughout (although analyses are always stratified, so that results for the different designs can be isolated and compared, if readers prefer).

2.3.2. Other criteria

We limited the evidence base to studies that commenced recruitment in or after 1999. There were 2 reasons for this: firstly, the RCTs began recruiting in 1999 and, as the point of this review is to examine whether outcomes have changed since the RCTs were conducted, it does not make sense to include evidence that predates them; secondly, 1999 is when the first EVAR devices received US FDA approval, so we had some concern that studies predating this timepoint may feature non-approved endografts (possibly including some physician-developed devices) that would not provide valid evidence as to the benefits and harms of more established practice.

Aside from study design and recruitment date, the eligibility criteria for this review were essentially unchanged from evidence review K. However, we refined our definition of long-term survival based on findings from randomised evidence, and also in the light of the original economic modelling that had been undertaken to support that review. In particular, the randomised evidence shows that, while EVAR is associated with reduced perioperative mortality, compared with OSR, long-term survival estimates for people who survive repair are either neutral or in favour of OSR. For this reason, survival data should be analysed using an approach that can account for variable hazards of at least 2 phases. Accordingly, we did not include studies that reported a single effect measure purporting to summarise overall survival, including both the perioperative and post-perioperative periods (for example, a hazard ratio from a single Cox proportional hazards regression). In a few instances, studies reported the outputs of a Cox proportional hazards model for long-term survival conditional on surviving the perioperative period, and we included those estimates.

We also omitted ‘successful exclusion of the aneurysm, aneurysm rupture, or further aneurysm growth’ as a specific outcome of interest, as we considered that these factors are well captured by reintervention data.

Table 1. Eligibility criteria (‘PICO’ table).


Studies were excluded if they:

  • were not in English
  • were not full reports of the study (for example, published only as an abstract)
  • were not peer-reviewed.

2.4. Critical appraisal

When compared with RCTs, all observational designs are at heightened risk of estimating biased treatment effects; therefore, it is particularly important to assess each study’s susceptibility to different biases and critically appraise the steps taken by investigators to minimise their effects.

In choosing an instrument with which to do this, we reviewed the generic provisions of the Cochrane Collaboration’s ‘Risk Of Bias In Non-randomized Studies of Interventions’ (ROBINS-I) tool (Sterne et al., 2016). Additionally, because it was necessary to undertake detailed appraisal of the statistical methods used to account for casemix in the included studies, we also considered the more technically focused criteria in the ‘Quality of Effectiveness Estimates from Non-Randomised Studies’ (QuEENS) checklist (Faria et al., 2015).

Neither instrument covered all relevant issues, and both contained multiple criteria that would never be helpful in discriminating the reliability of studies included in this particular review. For example, the ROBINS-I tool asks ‘Is there potential for confounding of the effect of intervention in this study?’ to which the answer will always be ‘yes’, and the QuEENS checklist asks ‘Are potential instrumental variables excluded from the set of conditioning variables?’ to which we would always have answered ‘no’, given that it is not clear that any such variables exist, in this case.

Therefore, we developed a bespoke instrument that incorporated important elements of the 2 published checklists, and also added consideration of areas that are not explicitly covered by either. Because the instrument was not designed to be reusable in other contexts, we also took the opportunity to specify criteria that are explicitly focused on our review question (for example, we ask ‘Does the study control appropriately for AAA characteristics?’, rather than ROBINS-I’s ‘Did the authors use an appropriate analysis method that controlled for all the important confounding domains?’ and so on). The instrument is shown in Appendix D; the table also highlights the domain(s) of the ROBINS-I and QuEENS checklists that each criterion reflects.

2.5. Outcomes

We adopted a single outcome for perioperative mortality, comprising data reported as 30-day mortality or in-hospital deaths. In the few cases where studies reported both outcomes, we extracted whichever was the higher.

Long-term survival corresponds to the definition of post-perioperative survival used in the original economic model – that is, survival conditional on surviving 30 postoperative days. A small number of studies report this outcome from their own analyses; however, in most cases, it was necessary for us to calculate the relevant effect ourselves. This was possible for any study that published Kaplan–Meier time-to-event curves for casemix-adjusted cohorts. We used digitising software (Engauge Digitizer v10.10) to extract data from the graphs, and then used the method of Guyot et al. (2012) to reconstruct approximate patient-level data and estimate hazard ratios (HRs) for the dataset.

We checked the accuracy of this method by (a) comparing our results across the length of the survival function with HRs published by authors – in each case, we found that our estimated HR and its 95% confidence interval matched the published values extremely closely (in most cases, to within 2 decimal places); and (b) overlaying Kaplan–Meier curves from the reconstructed data on the published graphs – again, there was excellent agreement between the 2.

Having reconstructed patient-level data, we removed cases that died or were lost to follow-up in the first 30 days and estimated post-perioperative HRs from the remaining data. We reviewed ‘log–log’ plots of the post-perioperative cumulative hazard functions, to assess the appropriateness of summarising treatment effect using a single HR (that is, assuming proportional hazards). We noted that, in most cases, the lines were broadly parallel, suggesting limited evidence of non-proportional hazards. Where anomalies were present, they tended to be in the early part of the functions, suggesting that the excess mortality associated with OSR may not, in some datasets, be fully realised by the 31st postoperative day. However, any deviations resolved quickly as follow-up time extended, so it appeared reasonable to assume approximately proportional hazards, and we calculated HRs using Cox models with a single explanatory variable for treatment assignment.

2.6. Evidence categorisation and synthesis

General methods for evidence synthesis are as set out in the methods chapter for this guideline.

As we had for Evidence review K, we wanted to subdivide the decision problem according to AAA anatomy – infrarenal and complex (with the latter representing juxta-, para- and supra-renal AAAs, as well as type IV thoracoabdominal aneurysms). However, in contrast to the RCTs, which all recruited people with infrarenal AAAs only, several studies in the observational dataset do not have clear eligibility criteria and many explicitly include all AAAs regardless of anatomy. Therefore, we have subdivided our analyses as follows:

  • ‘Exclusively or predominantly infrarenal AAAs’, in which we present stratified analyses, comprising
    • ‘Infrarenal AAAs’ – studies that are clearly limited to infrarenal anatomy
    • ‘All AAAs’ – studies that do not distinguish AAA anatomy (including some that label their results as ‘infrarenal’ but do not take adequate steps to limit their dataset to such cases)

We consider that these categories are likely to provide broadly comparable results, as any studies that do not distinguish between AAA anatomy are likely to include a preponderance of infrarenal cases. However, because this is a potential source of bias, we present results in stratified analyses, so estimated effects in cases we can confidently call infrarenal can be seen and compared with the less well defined group.

  • ‘Complex AAAs’. We were able to subdivide this group further into
    • Studies comparing fenestrated EVAR grafts only with OSR performed in an analogous population (although how investigators identified the latter group is a possible source of bias, as reflected in our critical appraisal).
    • Studies that identify a population with complex AAAs and compare all relevant endovascular approaches with OSR.

We noted that several datasources form the basis of more than 1 study. Examples include the US Medicare database, the American College of Surgeons’ National Surgical Quality Improvement Program and the National Inpatient Sample. To prevent double-counting of participants, we only entered 1 study per datasource into any given synthesis (except in the case of studies with non-overlapping recruitment periods). Where we had multiple studies to choose from, we selected the study to include according to the following hierarchy:

  • Where naive multivariable regressions are included alongside more robust methods of adjustment, we prefer any study that uses recommended methods.
  • We prefer studies that distinguish clearly between AAA anatomy – that is, we prefer those we can categorise as ‘infrarenal’ over those in the ‘all AAA’ category that are likely to comprise a preponderance of infrarenal AAAs, but provide no detail about the types of AAAs reflected in their data.
  • We aim to accrue the largest sample size possible. In most cases, this will be achieved by selecting the study with the largest number of participants. However, it might be better to include 2 smaller studies with non-overlapping recruitment periods if, between them, they represent a larger sample than is available in any other single study.
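The first 2 rules of this hierarchy, with sample size as the tiebreaker, amount to a lexicographic comparison (the final caveat – combining 2 smaller studies with non-overlapping recruitment periods – remains a judgment step). A minimal sketch, with illustrative field names:

```python
def pick_study(studies):
    """Select 1 study per datasource by the stated hierarchy:
    recommended adjustment method first, then clearly infrarenal
    anatomy, then largest sample size. Field names are illustrative."""
    return max(studies, key=lambda s: (s["recommended_method"],
                                       s["anatomy_clear"],
                                       s["n"]))
```

Python compares the key tuples element by element, so a study using a recommended method always beats a naive regression, however large the latter’s sample.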

Wherever we have excluded datapoints from meta-analyses on this basis, we have nonetheless shown the data in the relevant forest plot, to maximise transparency and to facilitate comparison with included data; however, the excluded datapoints do not contribute to pooled totals.

The potential for double-counting between – as well as within – datasources remains. For example, it is possible that cases in single-centre studies also had data submitted to regional and/or national surgical registries, and the same people may also appear in studies based on routine administrative data. It is not possible to ascertain the extent of any residual duplication from the aggregate-level data available to us.

2.7. Meta-analysis and meta-regression

We present results in stratified meta-analyses that distinguish both between extent of AAA and method of adjustment (recommended techniques versus naive multivariable regression).

We chose between fixed-effect and random-effects models on a priori grounds, rather than on inspection of statistical heterogeneity. For RCTs, we follow the decision rule adopted by the Cochrane review of randomised evidence in this area (Paravastu et al., 2014) – that is, where there is evidence of statistical heterogeneity (I² > 50%), a random-effects model is used; otherwise, a fixed-effect analysis is preferred. However, our a priori expectation of observational data is that – owing to varied selection biases, varied approaches to casemix adjustment and varied outcome definitions – they are likely to be heterogeneous. Moreover, whereas it is natural to place proportional weight on larger randomised studies, this is not a desirable property of syntheses of observational data, as the size of a study has no bearing on its ability to estimate unbiased effects – there is every danger that, in the fixed-effect paradigm, a single, large study could swamp a meta-analysis containing smaller studies that may be less biased. For these reasons, we used random-effects (DerSimonian and Laird) models for all syntheses of observational data. We test for differences between study designs with z-tests.
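The DerSimonian and Laird estimator can be sketched in a few lines. Given per-study effects (for example, log HRs) and their standard errors, it estimates the between-study variance τ² from Cochran’s Q and re-weights each study by the inverse of its total (within- plus between-study) variance; the inputs below are hypothetical.

```python
import math

def dersimonian_laird(effects, ses):
    """Pool study effects (e.g. log hazard ratios) with standard errors
    using the DerSimonian-Laird random-effects estimator.
    Returns (pooled effect, its standard error, tau-squared)."""
    w = [1.0 / se ** 2 for se in ses]          # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q and the method-of-moments between-study variance
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)
    # random-effects weights incorporate the between-study variance
    w_re = [1.0 / (se ** 2 + tau2) for se in ses]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))
    return pooled, se_pooled, tau2
```

When the studies are homogeneous (Q below its degrees of freedom), τ² is truncated to zero and the result coincides with the fixed-effect pooled estimate; heterogeneous inputs inflate τ² and widen the pooled confidence interval.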

To test the hypothesis that between-treatment effects have changed over time, we employ meta-regression with midpoint of recruitment as a continuous covariate of relative effects (mixed-effects model with REML estimator of between-study variance).
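Stated formally, this is a standard mixed-effects meta-regression. In the notation below, y_i is the observed relative effect in study i, t_i its midpoint of recruitment, v_i its (known) within-study variance and τ² the between-study variance, estimated by REML:

```latex
y_i = \beta_0 + \beta_1 t_i + u_i + \varepsilon_i,
\qquad u_i \sim \mathcal{N}(0, \tau^2),
\qquad \varepsilon_i \sim \mathcal{N}(0, v_i)
```

The hypothesis that between-treatment effects have changed over time corresponds to a test of whether β₁ differs from zero.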

Copyright © NICE 2020.
Bookshelf ID: NBK556895
