
National Guideline Centre (UK). Cirrhosis in Over 16s: Assessment and Management. London: National Institute for Health and Care Excellence (NICE); 2016 Jul. (NICE Guideline, No. 50.)


4 Methods

This chapter sets out in detail the methods used to review the evidence and to develop the recommendations that are presented in subsequent chapters of this guideline. This guidance was developed in accordance with the methods outlined in the NICE guidelines manual (the 2012 version was followed until the start of consultation, and the 2014 version from that point onwards).143,145

Sections 4.1 to 4.3 describe the process used to identify and review clinical evidence (summarised in Figure 1), Sections 4.2 and 4.4 describe the process used to identify and review the health economic evidence, and Section 4.5 describes the process used to develop recommendations.

Figure 1. Step-by-step process of review of evidence in the guideline.


4.1. Developing the review questions and outcomes

Review questions were developed using a PICO framework (patient, intervention, comparison and outcome) for intervention reviews; using a framework of population, index tests, reference standard and target condition for reviews of diagnostic test accuracy; and using population, presence or absence of factors under investigation (for example prognostic factors) and outcomes for prognostic reviews.

This use of a framework guided the literature searching process, critical appraisal and synthesis of evidence, and facilitated the development of recommendations by the GDG. The review questions were drafted by the NGC technical team and refined and validated by the GDG. The questions were based on the key clinical areas identified in the scope (Appendix A).

A total of 17 review questions were identified.

Full literature searches, critical appraisals and evidence reviews were completed for all the specified review questions.

Table 1. Review questions.


4.2. Searching for evidence

4.2.1. Clinical literature search

Systematic literature searches were undertaken to identify all published clinical evidence relevant to the review questions. Searches were undertaken according to the parameters stipulated within the NICE guidelines manual.143,145 Databases were searched using relevant medical subject headings, free-text terms and study-type filters where appropriate. Where possible, searches were restricted to articles published in English. Studies published in languages other than English were not reviewed. All searches were conducted in Medline, Embase, and The Cochrane Library. All searches were updated on 24 August 2015. No papers published after this date were considered.

Search strategies were quality assured by cross-checking reference lists of highly relevant papers, analysing search strategies in other systematic reviews, and asking GDG members to highlight any additional studies. Searches were quality assured by a second information scientist before being run. The questions, the study types applied, the databases searched and the years covered can be found in Appendix G.

The titles and abstracts of records retrieved by the searches were sifted for relevance, with potentially significant publications obtained in full text. These were assessed against the inclusion criteria.

During the scoping stage, a search was conducted for guidelines and reports on the websites listed below from organisations relevant to the topic.

All references sent by stakeholders were considered. Searching for unpublished literature was not undertaken. The NGC and NICE do not have access to drug manufacturers' unpublished clinical trial results, so the clinical evidence considered by the GDG for pharmaceutical interventions may be different from that considered by the Medicines and Healthcare Products Regulatory Agency (MHRA) and European Medicines Agency for the purposes of licensing and safety regulation.

4.2.2. Health economic literature search

Systematic literature searches were also undertaken to identify health economic evidence within published literature relevant to the review questions. The evidence was identified by conducting a broad search relating to cirrhosis in the NHS Economic Evaluation Database (NHS EED), the Health Technology Assessment database (HTA) and the Health Economic Evaluations Database (HEED) with no date restrictions (NHS EED ceased to be updated after March 2015; HEED was used for searches up to 27 August 2014 but subsequently ceased to be available). Additionally, the search was run on Medline and Embase using a health economic filter, from 2013 to ensure recent publications that had not yet been indexed by the economic databases were identified. This was supplemented by an additional search that looked for economic papers specifically relating to the modelling of liver disease on Medline, Embase, HTA, NHS EED and HEED to ensure no modelling studies were missed. Where possible, searches were restricted to articles published in English. Studies published in languages other than English were not reviewed.

The health economic search strategies are included in Appendix G. All searches were updated on 27 August 2015. No papers published after this date were considered.

4.3. Identifying and analysing evidence of effectiveness

Research fellows conducted the tasks listed below, which are described in further detail in the rest of this section:

  • Identified potentially relevant studies for each review question from the relevant search results by reviewing titles and abstracts. Full papers were then obtained.
  • Reviewed full papers against prespecified inclusion and exclusion criteria to identify studies that addressed the review question in the appropriate population, and reported on outcomes of interest (review protocols are included in Appendix C).
  • Critically appraised relevant studies using the appropriate study design checklist as specified in the NICE guidelines manual.143,145 Prognostic or qualitative studies were critically appraised using NGC checklists.
  • Extracted key information about interventional study methods and results using ‘Evibase’, NGC's purpose-built software. Evibase produces summary evidence tables, including critical appraisal ratings. Key information about non-interventional study methods and results was manually extracted onto standard evidence tables and critically appraised separately (evidence tables are included in Appendix H).
  • Generated summaries of the evidence by outcome. Outcome data were combined, analysed and reported according to study design:
    • Randomised data were meta-analysed where appropriate and reported in GRADE profile tables.
    • Observational data were presented as a range of values in GRADE profile tables or meta-analysed if appropriate.
    • Prognostic data were meta-analysed where appropriate and reported in GRADE profile tables.
    • Diagnostic data were meta-analysed where appropriate or presented as a range of values in adapted GRADE profile tables.
  • A sample of at least 10% of the abstract lists of the first 3 sifts by new reviewers was double-sifted by a senior research fellow. As no papers were missed by any reviewers, no further double-sifting was carried out. All of the evidence reviews were quality assured by a senior research fellow. This included checking:
    • papers were included or excluded appropriately,
    • a sample of the data extractions,
    • correct methods were used to synthesise data,
    • a sample of the risk of bias assessments.

4.3.1. Inclusion and exclusion criteria

The inclusion and exclusion of studies was based on the criteria defined in the review protocols, which can be found in Appendix C. Excluded studies by review question (with the reasons for their exclusion) are listed in Appendix L. The GDG was consulted about any uncertainty regarding inclusion or exclusion.

The key population inclusion criterion was:

  • Adults and young people (16 years and over) with cirrhosis

The key population exclusion criterion was:

  • Children <16 years with cirrhosis

Conference abstracts were not automatically excluded from any review. Abstracts were initially assessed against the inclusion criteria for the review question and processed further when a full publication was not available for that review question. If an abstract was included, the authors were contacted for further information. No relevant conference abstracts were identified for this guideline. Literature reviews, posters, letters, editorials, comment articles, unpublished studies and studies not in English were excluded.

4.3.2. Type of studies

Randomised trials, non-randomised trials, and observational studies (including diagnostic or prognostic studies) were included in the evidence reviews as appropriate.

For most intervention reviews in this guideline, parallel randomised controlled trials (RCTs) were included because they are considered the most robust type of study design that can produce an unbiased estimate of the intervention effects. If non-randomised studies were appropriate for inclusion (for example, non-drug trials with no randomised evidence) the GDG stated a priori in the protocol that either certain identified variables must be equivalent at baseline or else the analysis had to adjust for any baseline differences. If the study did not fulfil either criterion it was excluded. Please refer to the review protocols in Appendix C for full details on the study design of studies selected for each review question.

For diagnostic review questions, diagnostic RCTs, cross-sectional studies and retrospective studies were included. For prognostic review questions, prospective and retrospective cohort studies were included. Case-control studies were not included.

Where data from observational studies were included, the results for each outcome were presented separately for each study and meta-analysis was not conducted.

4.3.3. Methods of combining clinical studies

4.3.3.1. Data synthesis for intervention reviews

Where possible, meta-analyses were conducted using Cochrane Review Manager (RevMan5)2 software to combine the data given in all studies for each of the outcomes of interest for the review question.

For some questions stratification was used, and this is documented in the individual review question protocols (see Appendix C).

4.3.3.1.1. Analysis of different types of data
Dichotomous outcomes

Fixed-effects (Mantel-Haenszel) techniques (using an inverse variance method for pooling) were used to calculate risk ratios (relative risk, RR) for the binary outcomes. The absolute risk difference was also calculated using GRADEpro91 software, using the median event rate in the control arm of the pooled results.
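As a minimal illustration of the pooling described above (the guideline itself used RevMan and GRADEpro, not this code; function names are illustrative), the Mantel-Haenszel pooled risk ratio and the absolute risk difference implied by a control event rate can be sketched as:

```python
def mantel_haenszel_rr(studies):
    """Mantel-Haenszel pooled risk ratio across 2x2 tables.

    Each study is a tuple (events_tx, n_tx, events_ctrl, n_ctrl).
    RR_MH = sum(a_i * n2_i / N_i) / sum(c_i * n1_i / N_i)
    """
    num = sum(a * n2 / (n1 + n2) for a, n1, c, n2 in studies)
    den = sum(c * n1 / (n1 + n2) for a, n1, c, n2 in studies)
    return num / den


def absolute_risk_difference(rr, control_event_rate):
    """Absolute risk difference implied by a pooled RR and a control
    event rate (the guideline used the median control-arm rate):
    ARD = CER * (RR - 1)."""
    return control_event_rate * (rr - 1.0)
```

For example, a pooled RR of 0.5 applied to a median control event rate of 20% gives an absolute risk reduction of 10 percentage points.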

For binary variables where there were zero events in either arm or a less than 1% event rate, Peto odds ratios, rather than risk ratios, were calculated. Peto odds ratios are more appropriate for data with a low number of events.
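The Peto one-step method for a single 2×2 table can be sketched as follows (illustrative only; in practice RevMan performs this calculation):

```python
def peto_log_odds_ratio(a, n1, c, n2):
    """Peto one-step log odds ratio for a single 2x2 table.

    a = events in the treatment arm (size n1), c = events in the
    control arm (size n2). ln(OR) = (O - E) / V, where O is the
    observed treatment-arm event count, E its expectation under the
    null, and V the hypergeometric variance. Unlike the risk ratio,
    this remains defined when one arm has zero events.
    """
    n = n1 + n2
    m = a + c                                  # total events
    expected = n1 * m / n
    variance = (n1 * n2 * m * (n - m)) / (n * n * (n - 1))
    return (a - expected) / variance
```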

Where sufficient information was provided, hazard ratios were calculated in preference for outcomes such as mortality, where the time to the event occurring was important for decision-making.

Continuous outcomes

Continuous outcomes were analysed using an inverse variance method for pooling weighted mean differences. Where the studies within a single meta-analysis had different scales of measurement, standardised mean differences were used (providing all studies reported either change from baseline or final values rather than a mixture of the 2); each different measure in each study was ‘normalised’ to the standard deviation value pooled between the intervention and comparator groups in that same study.
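The standardisation step described above can be sketched as follows (a minimal illustration; the standard deviation is pooled between the intervention and comparator groups of the same study):

```python
import math


def standardised_mean_difference(m1, sd1, n1, m2, sd2, n2):
    """Standardised mean difference: the mean difference divided by
    the standard deviation pooled between the two groups of the same
    study, expressing the effect in units of standard deviations."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2)
                          / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd
```

With equal group sizes and equal standard deviations of 4, a mean difference of 2 gives an SMD of 0.5, i.e. half a standard deviation.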

The means and standard deviations of continuous outcomes are required for meta-analysis. However, in cases where standard deviations were not reported, the standard error was calculated if the p values or 95% confidence intervals (95% CI) were reported, and meta-analysis was undertaken with the mean and standard error using the generic inverse variance method in Cochrane Review Manager (RevMan5)2 software. Where p values were reported as ‘less than’, a conservative approach was undertaken. For example, if a p value was reported as ‘p≤0.001’, the calculations for standard deviations were based on a p value of 0.001. If these statistical measures were not available then the methods described in Section 16.1.3 of the Cochrane Handbook (version 5.1.0, updated March 2011) were applied.
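The back-calculations described above can be sketched as follows (function names are illustrative; the normal approximation is an assumption appropriate for large samples):

```python
from statistics import NormalDist


def se_from_ci(lower, upper, level=0.95):
    """Standard error recovered from a reported confidence interval,
    assuming normality: SE = (upper - lower) / (2 * z)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (upper - lower) / (2 * z)


def se_from_p(effect, p):
    """Standard error recovered from an effect estimate and a
    two-sided p value: SE = |effect| / z, with z = Phi^-1(1 - p/2).
    A reported 'p<=0.001' is treated conservatively as p = 0.001,
    matching the approach described above."""
    z = NormalDist().inv_cdf(1 - p / 2)
    return abs(effect) / z
```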

4.3.3.1.2. Generic inverse variance

If a study reported only the summary statistic and 95% CI the generic inverse variance method was used to enter data into RevMan5.2 If the control event rate was reported this was used to generate the absolute risk difference in GRADEpro.91 If multivariate analysis was used to derive the summary statistic but no adjusted control event rate was reported no absolute risk difference was calculated.
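A sketch of generic inverse-variance pooling (for illustration; RevMan's implementation is authoritative):

```python
import math


def inverse_variance_pool(estimates):
    """Fixed-effect generic inverse-variance pooling.

    estimates: list of (effect, se) pairs on a linear scale, e.g. log
    hazard ratios with their standard errors. Each study is weighted
    by 1/SE^2. Returns (pooled effect, pooled SE)."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = (sum(w * est for (est, _), w in zip(estimates, weights))
              / sum(weights))
    return pooled, math.sqrt(1.0 / sum(weights))
```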

4.3.3.1.3. Heterogeneity

Statistical heterogeneity was assessed for each meta-analysis estimate by considering the chi-squared test for significance at p<0.1 or an I-squared (I2) inconsistency statistic (with an I-squared value of more than 50% indicating significant heterogeneity) as well as the distribution of effects. Where significant heterogeneity was present, predefined subgrouping of studies was carried out.
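Cochran's Q (the chi-squared heterogeneity statistic) and I² can be computed as follows (an illustrative sketch using inverse-variance weights):

```python
def i_squared(estimates):
    """Cochran's Q and the I-squared inconsistency statistic for a
    fixed-effect inverse-variance meta-analysis.

    estimates: list of (effect, se) pairs.
    Returns (Q, I2 as a percentage); I2 = max(0, (Q - df) / Q)."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = (sum(w * e for (e, _), w in zip(estimates, weights))
              / sum(weights))
    q = sum(w * (e - pooled) ** 2 for (e, _), w in zip(estimates, weights))
    df = len(estimates) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i2
```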

If the subgroup analysis resolved heterogeneity within all of the derived subgroups, then each of the derived subgroups was adopted as a separate outcome (providing at least 1 study remained in each subgroup). Assessments of potential differences in effect between subgroups were based on the chi-squared tests for heterogeneity between subgroups. Any subgroup differences were interpreted with caution, as separating the groups breaks the study randomisation and is therefore subject to uncontrolled confounding.

If all predefined strategies of subgrouping were unable to explain statistical heterogeneity within each derived subgroup, then a random effects (DerSimonian and Laird) model was applied to the entire group of studies in the meta-analysis. A random effects model assumes a distribution of populations, rather than a single population. This leads to a widening of the confidence interval around the overall estimate, thus providing a more realistic interpretation of the true distribution of effects across more than 1 population. If, however, the GDG considered the heterogeneity was so large that meta-analysis was inappropriate, then the results were described narratively.
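The DerSimonian and Laird method-of-moments estimator can be sketched as follows (illustrative; RevMan implements this in practice):

```python
import math


def dersimonian_laird(estimates):
    """Random-effects (DerSimonian and Laird) pooled estimate.

    estimates: list of (effect, se) pairs. tau2 is the method-of-
    moments between-study variance; each study is then reweighted by
    1/(se^2 + tau2), which widens the pooled confidence interval
    relative to the fixed-effect analysis.
    Returns (pooled effect, pooled SE, tau2)."""
    w = [1.0 / se ** 2 for _, se in estimates]
    pooled_fe = sum(wi * e for (e, _), wi in zip(estimates, w)) / sum(w)
    q = sum(wi * (e - pooled_fe) ** 2 for (e, _), wi in zip(estimates, w))
    df = len(estimates) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)
    w_re = [1.0 / (se ** 2 + tau2) for _, se in estimates]
    pooled = sum(wi * e for (e, _), wi in zip(estimates, w_re)) / sum(w_re)
    return pooled, math.sqrt(1.0 / sum(w_re)), tau2
```

When the studies are homogeneous, tau2 is zero and the result coincides with the fixed-effect analysis.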

4.3.3.1.4. Complex analysis

Network meta-analysis was considered for the comparison of interventional treatments for acute hepatic encephalopathy, but was not pursued because of insufficient data available for the relevant outcomes.

4.3.3.2. Data synthesis for prognostic factor reviews

Odds ratios (ORs), risk ratios (RRs), or hazard ratios (HRs), with their 95% CIs, for the effect of the prespecified prognostic factors were extracted from the studies. Studies were only included if the confounders prespecified by the GDG were either matched at baseline or were adjusted for in multivariate analysis.

Studies of lower risk of bias were preferred, taking into account the analysis and the study design. In particular, prospective cohort studies were preferred if they reported multivariable analyses that adjusted for key confounders identified by the GDG at the protocol stage for that outcome. Data were not combined in meta-analyses for prognostic studies.

4.3.3.3. Data synthesis for diagnostic test accuracy reviews

Two review protocols were produced to reflect the 2 different diagnostic study designs.

4.3.3.3.1. Diagnostic RCTs

Diagnostic RCTs (sometimes referred to as test and treat trials) are a randomised comparison of 2 diagnostic tests, with study outcomes being clinically important consequences of the diagnosis (patient-related outcome measures similar to those in intervention trials, such as mortality). Patients are randomised to receive test A or test B, followed by identical therapeutic interventions based on the results of the test (so someone with a positive result would receive the same treatment regardless of whether they were diagnosed by test A or test B). Downstream patient outcomes are then compared between the 2 groups. As treatment is the same in both arms of the trial, any differences in patient outcomes will reflect the accuracy of the tests in correctly establishing who does and does not have the condition. Data were synthesised using the same methods for intervention reviews (see Section 4.3.3.1.1 above).

4.3.3.3.2. Diagnostic accuracy studies

For diagnostic test accuracy studies, a positive result on the index test was found if the patient had values of the measured quantity above or below a threshold value, and different thresholds could be used. The thresholds were prespecified by the GDG including whether or not data could be pooled across a range of thresholds. Diagnostic test accuracy measures used in the analysis were: area under the receiver operating characteristics (ROC) curve (AUC), and, for different thresholds (if appropriate), sensitivity and specificity. The threshold of a diagnostic test is defined as the value at which the test can best differentiate between those with and without the target condition. In practice this varies amongst studies. If a test has a high sensitivity then very few people with the condition will be missed (few false negatives). For example, a test with a sensitivity of 97% will only miss 3% of people with the condition. Conversely, if a test has a high specificity then few people without the condition would be incorrectly diagnosed (few false positives). For example, a test with a specificity of 97% will only incorrectly diagnose 3% of people who do not have the condition as positive. For this guideline, sensitivity was considered more important than specificity due to the consequences of a missed diagnosis of cirrhosis (false negative result).

Coupled forest plots of sensitivity and specificity with their 95% CIs across studies (at various thresholds) were produced for each test, using RevMan5.2 In order to do this, 2×2 tables (the number of true positives, false positives, true negatives and false negatives) were directly taken from the study if given, or else were derived from raw data or calculated from the set of test accuracy statistics.
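The derivation of accuracy statistics from a 2×2 table, and the reverse reconstruction mentioned above, can be sketched as follows (the rounding behaviour in the reconstruction is an assumption; function names are illustrative):

```python
def accuracy_from_table(tp, fp, fn, tn):
    """Sensitivity and specificity from a 2x2 diagnostic table.

    Sensitivity = TP / (TP + FN): the proportion of people with the
    condition correctly identified. Specificity = TN / (TN + FP)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


def table_from_accuracy(sensitivity, specificity, n_diseased, n_healthy):
    """Reconstruct (tp, fp, fn, tn) from reported summary statistics,
    as when a study gave only sensitivity, specificity and group
    sizes (counts rounded to the nearest integer)."""
    tp = round(sensitivity * n_diseased)
    fn = n_diseased - tp
    tn = round(specificity * n_healthy)
    fp = n_healthy - tn
    return tp, fp, fn, tn
```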

Diagnostic meta-analysis was conducted where appropriate, that is, when 3 or more studies were available per threshold. Test accuracy for the studies was pooled using the bivariate method for the direct estimation of summary sensitivity and specificity using a random effects approach in WinBUGS software.3 The advantage of this approach is that it produces summary estimates of sensitivity and specificity that account for the correlation between the 2 statistics. Other advantages of this method have been described elsewhere.172,238,239 The bivariate method uses logistic regression on the true positives, true negatives, false positives and false negatives reported in the studies. Overall sensitivity and specificity and confidence regions were plotted (using methods outlined by Novielli 2010.150) Pooled sensitivity and specificity and their 95% CIs were reported in the clinical evidence summary tables. For thresholds with fewer than 3 studies, median sensitivity and the paired specificity were reported where possible.

Heterogeneity or inconsistency amongst studies was visually inspected in the forest plots.

Area under the ROC curve (AUC) data for each study was also plotted on a graph, for each diagnostic test. The AUC describes the overall diagnostic accuracy across the full range of thresholds. The following criteria were used for evaluating AUCs:

  • ≤0.50: worse than chance
  • 0.51–0.60: very poor
  • 0.61–0.70: poor
  • 0.71–0.80: moderate
  • 0.81–0.90: good
  • 0.91–1.00: excellent or perfect test.

Heterogeneity or inconsistency amongst studies was visually inspected.

4.3.3.4. Data synthesis for risk prediction rules

Evidence reviews on risk prediction rules or risk prediction tool results were presented separately for discrimination and calibration. The discrimination data were analysed according to the principles of data synthesis for diagnostic accuracy studies as outlined in Section 4.3.3.3.2. Calibration data such as r-squared (R2), if reported, were presented separately to the discrimination data. The results were presented for each study separately along with the quality rating for the study.

4.3.4. Appraising the quality of evidence by outcomes

4.3.4.1. Intervention reviews

The evidence for outcomes from the included RCTs and, where appropriate, observational studies were evaluated and presented using an adaptation of the ‘Grading of Recommendations Assessment, Development and Evaluation (GRADE) toolbox’ developed by the international GRADE working group (http://www.gradeworkinggroup.org/). The software (GRADEpro91) developed by the GRADE working group was used to assess the quality of each outcome, taking into account individual study quality and the meta-analysis results.

Each outcome was first examined for each of the quality elements listed and defined in Table 2.

Table 2. Description of quality elements in GRADE for intervention studies.


Details of how the 4 main quality elements (risk of bias, indirectness, inconsistency and imprecision) were appraised for each outcome are given below. Publication or other bias was only taken into consideration in the quality assessment if it was apparent.

4.3.4.1.1. Risk of bias

The main domains of bias for RCTs are listed in Table 3. Each outcome had its risk of bias assessed within each study first. For each study, if there were no risks of bias in any domain, the risk of bias was given a rating of 0. If there was risk of bias in just 1 domain, the risk of bias was given a ‘serious’ rating of −1, but if there was risk of bias in 2 or more domains the risk of bias was given a ‘very serious’ rating of −2. A weighted average score was then calculated across all studies contributing to the outcome, by taking into account the weighting of studies according to study precision. For example, if the most precise studies each tended to have a score of −1 for that outcome, the overall score for that outcome would tend towards −1.
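The guideline does not publish the exact averaging formula; a plausible sketch, assuming inverse-variance (1/SE²) precision weights and rounding to the nearest whole GRADE step, is:

```python
def weighted_quality_score(studies):
    """Precision-weighted average of per-study quality scores
    (0, -1 or -2), rounded to the nearest whole GRADE step.

    studies: list of (score, se) pairs. The weight 1/se^2 is an
    assumption, chosen to mirror the weight each study carries in an
    inverse-variance meta-analysis, so precise studies dominate."""
    weights = [1.0 / se ** 2 for _, se in studies]
    avg = (sum(w * s for (s, _), w in zip(studies, weights))
           / sum(weights))
    return round(avg)
```

A precise study scored −1 alongside an imprecise study scored 0 pulls the outcome towards −1, as the text describes.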

Table 3. Principal domains of bias in randomised controlled trials.


4.3.4.1.2. Indirectness

Indirectness refers to the extent to which the populations, interventions, comparisons and outcome measures are dissimilar to those defined in the inclusion criteria for the reviews. Indirectness is important when these differences are expected to contribute to a difference in effect size, or may affect the balance of harms and benefits considered for an intervention. As for the risk of bias, each outcome had its indirectness assessed within each study first. For each study, if there were no sources of indirectness, indirectness was given a rating of 0. If there was indirectness in just 1 source (for example in terms of population), indirectness was given a ‘serious’ rating of −1, but if there was indirectness in 2 or more sources (for example, in terms of population and treatment) the indirectness was given a ‘very serious’ rating of −2. A weighted average score was then calculated across all studies contributing to the outcome by taking into account study precision. For example, if the most precise studies tended to have an indirectness score of −1 each for that outcome, the overall score for that outcome would tend towards −1.

4.3.4.1.3. Inconsistency

Inconsistency refers to an unexplained heterogeneity of results for an outcome across different studies. When estimates of the treatment effect across studies differ widely, this suggests true differences in the underlying treatment effect, which may be due to differences in populations, settings or doses. When heterogeneity existed within an outcome (chi-squared p<0.1, or I2>50%), but no plausible explanation could be found, the quality of evidence for that outcome was downgraded. Inconsistency for that outcome was given a ‘serious’ score of −1 if the I2 was 50–74%, and a ‘very serious’ score of −2 if the I2 was 75% or more.

If inconsistency could be explained based on prespecified subgroup analysis (that is, each subgroup had an I2<50%), the GDG took this into account and considered whether to make separate recommendations on new outcomes based on the subgroups defined by the assumed explanatory factors. In such a situation the quality of evidence was not downgraded for those emergent outcomes.

Since the inconsistency score was based on the meta-analysis results, the score represented the whole outcome and so weighted averaging across studies was not necessary.

4.3.4.1.4. Imprecision

The criteria applied for imprecision were based on the 95% CIs for the pooled estimate of effect, and the minimal important differences (MID) for the outcome. The MIDs are the thresholds for appreciable benefit and harm; between them, either side of the line of no effect, lies a zone where there is assumed to be no clinically important effect. If either end of the 95% CI of the overall estimate of effect crossed 1 of the MID lines, imprecision was regarded as serious and a ‘serious’ score of −1 was given. This was because the overall result, as represented by the span of the confidence interval, was consistent with 2 interpretations as defined by the MID (for example, both no clinically important effect and clinical benefit were possible interpretations). If both MID lines were crossed by either or both ends of the 95% CI then imprecision was regarded as very serious and a ‘very serious’ score of −2 was given. This was because the overall result was consistent with all 3 interpretations defined by the MID (no clinically important effect, clinical benefit and clinical harm). This is illustrated in Figure 2. As for inconsistency, since the imprecision score was based on the meta-analysis results, the score represented the whole outcome and so weighted averaging across studies was not necessary.

Figure 2. Illustration of precise and imprecise outcomes based on the 95% CI of dichotomous outcomes in a forest plot (Note that all 3 results would be pooled estimates, and would not, in practice, be placed on the same forest plot).


The position of the MID lines is ideally determined by values reported in the literature. ‘Anchor-based’ methods aim to establish clinically meaningful changes in a continuous outcome variable by relating or ‘anchoring’ them to patient-centred measures of clinical effectiveness that could be regarded as gold standards with a high level of face validity. For example, a MID for an outcome could be defined by the minimum amount of change in that outcome necessary to make patients feel their quality of life had ‘significantly improved’. MIDs in the literature may also be based on expert clinician or consensus opinion concerning the minimum amount of change in a variable deemed to affect quality of life or health. For binary variables, any MIDs reported in the literature will inevitably be based on expert consensus, as such MIDs relate to all-or-nothing population effects rather than measurable effects on an individual, and so are not amenable to patient-centred ‘anchor’ methods.

In the absence of values identified in the literature, the alternative approach to deciding on MID levels is the ‘default’ method, as follows:

  • For categorical outcomes the MIDs were taken to be RRs of 0.75 and 1.25. For ‘positive’ outcomes such as ‘patient satisfaction’, the RR of 0.75 is taken as the line denoting the boundary between no clinically important effect and a clinically significant harm, whilst the RR of 1.25 is taken as the line denoting the boundary between no clinically important effect and a clinically significant benefit. For ‘negative’ outcomes such as ‘bleeding’, the opposite occurs, so the RR of 0.75 is taken as the line denoting the boundary between no clinically important effect and a clinically significant benefit, whilst the RR of 1.25 is taken as the line denoting the boundary between no clinically important effect and a clinically significant harm.
  • For mortality any change was considered to be clinically important and the imprecision was assessed on the basis of whether the confidence intervals crossed the line of no effect: that is, whether the result was consistent with both benefit and harm.
  • For continuous outcome variables the MID was taken as half the median baseline standard deviation of that variable, across all studies in the meta-analysis. Hence the MID denoting the minimum clinically significant benefit was positive for a ‘positive’ outcome (for example, a quality of life measure where a higher score denotes better health), and negative for a ‘negative’ outcome (for example, a visual analogue scale [VAS] pain score). Clinically significant harms will be the converse of these. If baseline values are unavailable, then half the median comparator group standard deviation of that variable will be taken as the MID.
  • If standardised mean differences have been used, then the MID will be set at the absolute value of +0.5. This follows because standardised mean differences are mean differences normalised to the pooled standard deviation of the 2 groups, and are thus effectively expressed in units of ‘numbers of standard deviations’. The 0.5 MID value in this context therefore indicates half a standard deviation, the same definition of MID as used for non-standardised mean differences.
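The default MID rule for dichotomous outcomes can be illustrated as follows (the RR lines of 0.75 and 1.25 are the guideline's defaults; the function name is illustrative):

```python
def imprecision_score(ci_lower, ci_upper, mid_lower=0.75, mid_upper=1.25):
    """GRADE imprecision score for a pooled risk ratio.

    Returns 0 (no serious imprecision: the 95% CI crosses neither
    default MID line), -1 (serious: one line crossed) or -2 (very
    serious: both lines crossed, so the result is consistent with
    harm, no effect and benefit)."""
    crossed = sum(1 for mid in (mid_lower, mid_upper)
                  if ci_lower < mid < ci_upper)
    return -min(crossed, 2)
```

For example, a pooled RR with a 95% CI of 0.80 to 1.20 sits entirely within the no-effect zone and is scored 0, whereas a CI of 0.70 to 1.30 crosses both lines and is scored −2.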

The default MID value was subject to amendment after discussion with the GDG. If the GDG decided that the MID level should be altered, after consideration of absolute as well as relative effects, this was allowed, provided that any such decision was not influenced by any bias towards making stronger or weaker recommendations for specific outcomes.

For this guideline, no appropriate MIDs for continuous or dichotomous outcomes were found in the literature, and so the default method was adopted.

4.3.4.1.5. Overall grading of the quality of clinical evidence

Once an outcome had been appraised for the main quality elements, as above, an overall quality grade was calculated for that outcome. The scores (0, −1 or −2) from each of the main quality elements were summed to give a score that could be anything from 0 (the best possible) to −8 (the worst possible). However, scores were capped at −3. This final score was then applied to the starting grade that had originally been applied to the outcome by default, based on study design. All RCTs started as High and the overall quality became Moderate, Low or Very Low if the overall score was −1, −2 or −3 points respectively. The significance of these overall ratings is explained in Table 4. The reasons for downgrading in each case were specified in the footnotes of the GRADE tables.
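The summing, capping and mapping described above can be sketched as:

```python
def overall_grade(scores, start="High"):
    """Combine the four element scores (each 0, -1 or -2) into an
    overall GRADE rating, starting from the design-based grade
    (High for RCTs, Low for observational studies) and capping the
    total downgrade at -3."""
    levels = ["High", "Moderate", "Low", "Very Low"]
    total = max(sum(scores), -3)                 # cap at -3
    idx = min(levels.index(start) - total, len(levels) - 1)
    return levels[idx]
```

An RCT outcome downgraded once for risk of bias and once for imprecision therefore ends up as Low; an observational outcome with any downgrade ends up as Very Low.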

Table 4. Overall quality of outcome evidence in GRADE.


Observational interventional studies started at Low, and so a score of −1 would be enough to take the grade to the lowest level of Very Low. Observational studies could, however, be upgraded if all of the following applied: a large magnitude of effect, a dose-response gradient, and all plausible confounding acting to reduce the demonstrated effect.

4.3.4.2. Prognostic reviews

The quality of evidence for prognostic studies was evaluated according to the criteria given in Table 5. If data were meta-analysed, the quality for pooled studies was presented. If the data were not pooled, then a quality rating was presented for each study.

Table 5. Description of quality elements for prospective studies.


4.3.4.2.1. Inconsistency

Inconsistency was assessed as for intervention studies.

4.3.4.2.2. Imprecision

For both meta-analysed and non-pooled outcomes, the position of the 95% CIs in relation to the null line determined the existence of imprecision. If the 95% CI did not cross the null line then no serious imprecision was recorded. If the 95% CI crossed the null line then serious imprecision was recorded.

4.3.4.2.3. Overall grading

Quality rating started at High for prospective studies, and each major limitation brought the rating down by 1 increment to a minimum grade of Very Low, as explained for intervention reviews. For prognostic reviews, prospective cohort studies with a multivariate analysis are regarded as the gold standard, because RCTs are usually inappropriate for these types of review for ethical or pragmatic reasons. Furthermore, if a study is looking at more than 1 risk factor of interest, randomisation would be inappropriate, as it can only be applied to 1 of the risk factors.

4.3.4.3. Diagnostic studies

Risk of bias and indirectness of evidence for diagnostic data were evaluated per study using the Quality Assessment of Diagnostic Accuracy Studies version 2 (QUADAS-2) checklist (see Appendix H in the NICE guidelines manual 2014143). The QUADAS-2 assessment of risk of bias and applicability in primary diagnostic accuracy studies covers 4 domains (see Table 6):

Table 6. Summary of QUADAS-2 with list of signalling, risk of bias and applicability questions.


  • patient selection
  • index test
  • reference standard
  • flow and timing.

4.3.4.3.1. Inconsistency

Inconsistency refers to an unexplained heterogeneity of results for an outcome across different studies. Inconsistency was assessed by inspection of the sensitivity (based on the primary measure) using the point estimates and 95% CIs of the individual studies on the forest plots. Particular attention was placed on values above or below 50% (diagnosis based on chance alone) and a 95% threshold set by the GDG (the threshold above which it would be acceptable to recommend a test). The evidence was downgraded by 1 increment if the individual studies varied across 2 areas (for example, 50–95% and 95–100%) and by 2 increments if the individual studies varied across 3 areas (for example, 0–50%, 50–95% and 95–100%).

4.3.4.3.2. Imprecision

The judgement of precision was based on visual inspection of the confidence region around the summary sensitivity and specificity point from the diagnostic meta-analysis, if conducted. Where a diagnostic meta-analysis was not performed, imprecision was assessed according to the sensitivity confidence region of the largest study, the sensitivity confidence region being the primary measure for decision-making. Particular attention was placed on values above or below 50% (diagnosis based on chance alone) and the 95% threshold set by the GDG (the threshold above which it would be acceptable to recommend a test). The evidence was downgraded by 1 increment if the confidence interval varied across 2 areas (for example, 50–95% and 95–100%) and by 2 increments if the confidence interval varied across 3 areas (for example, 0–50%, 50–95% and 95–100%).
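The area-crossing rule used for both inconsistency and imprecision can be sketched as follows (a hypothetical helper, assuming sensitivities are expressed as proportions between 0 and 1):

```python
def downgrade_increments(ci_lower, ci_upper, chance=0.50, threshold=0.95):
    """Count how many of the 3 areas (0-50%, 50-95%, 95-100%) a
    sensitivity confidence interval spans; downgrade by (areas - 1)."""
    boundaries = [0.0, chance, threshold, 1.0]
    areas = sum(1 for lo, hi in zip(boundaries, boundaries[1:])
                if ci_lower < hi and ci_upper > lo)
    return areas - 1  # 0 = no downgrade; 1 or 2 increments otherwise

# A CI of 0.90 to 0.97 spans 2 areas (1 increment); 0.40 to 0.96 spans
# all 3 areas (2 increments); 0.96 to 0.99 sits in 1 area (no downgrade).
```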

4.3.4.3.3. Overall grading

Quality rating started at High for prospective and retrospective cross-sectional studies, and each major limitation (risk of bias, indirectness, inconsistency and imprecision) brought the rating down by 1 increment to a minimum grade of Very Low, as explained for intervention reviews.

4.3.5. Assessing clinical importance

The GDG assessed the evidence by outcome in order to determine if there was, or potentially was, a clinically important benefit, a clinically important harm or no clinically important difference between interventions. To facilitate this, binary outcomes were converted into absolute risk differences (ARDs) using GRADEpro91 software: the median control group risk across studies was used to calculate the ARD and its 95% CI from the pooled risk ratio.
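The conversion can be illustrated with a small sketch of the underlying arithmetic (ARD = control group risk × (RR − 1); the function name and figures below are hypothetical, not outputs of GRADEpro):

```python
def absolute_risk_difference(control_risk, rr, rr_ci_low, rr_ci_high):
    """Convert a pooled risk ratio and its 95% CI into an absolute risk
    difference per 1000, using the median control group risk."""
    per_1000 = lambda r: control_risk * (r - 1) * 1000
    return per_1000(rr), per_1000(rr_ci_low), per_1000(rr_ci_high)

# Hypothetical example: control risk 0.20, RR 0.75 (95% CI 0.60 to 0.95)
# gives about 50 fewer events per 1000 (95% CI 80 fewer to 10 fewer)
ard, ci_low, ci_high = absolute_risk_difference(0.20, 0.75, 0.60, 0.95)
```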

The assessment of clinical benefit, harm, or no benefit or harm was based on the point estimate of absolute effect for intervention studies, which was standardised across the reviews. For most of the outcomes in the intervention reviews, the GDG considered that if at least 100 more participants per 1000 (10%) achieved the outcome of interest (for a positive outcome) in the intervention group compared with the comparison group, then the intervention would be considered beneficial. The same point estimate, but in the opposite direction, applied if the outcome was negative. For adverse events, 50 events or more per 1000 (5%) represented clinical harm. For continuous outcomes, if the mean difference was greater than the minimally important difference (MID) then this represented a clinical benefit or harm. For critical outcomes such as mortality, any reduction or increase was considered to be clinically important.
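Taken together, these default thresholds amount to a simple classification, sketched below (illustrative only; effects are expressed per 1000, with positive values favouring the intervention, and the category labels are ours):

```python
def classify_importance(effect_per_1000, outcome_type):
    """Apply the GDG's default thresholds: 100/1000 for most outcomes,
    50/1000 for adverse events, any difference for mortality.
    Continuous outcomes (judged against the MID) are not covered here."""
    thresholds = {"standard": 100, "adverse_event": 50, "mortality": 0}
    if abs(effect_per_1000) > thresholds[outcome_type]:
        return ("clinically important benefit" if effect_per_1000 > 0
                else "clinically important harm")
    return "no clinically important difference"
```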

This assessment was carried out by the GDG for each critical outcome, and an evidence summary table was produced to compile the GDG's assessments of clinical importance per outcome, alongside the evidence quality and the uncertainty in the effect estimate (imprecision).

4.3.6. Clinical evidence statements

Clinical evidence statements are summary statements that are included in each review chapter, and which summarise the key features of the clinical effectiveness evidence presented. The wording of the evidence statements reflects the certainty or uncertainty in the estimate of effect. The evidence statements are presented by outcome and encompass the following key features of the evidence:

  • The number of studies and the number of participants for a particular outcome.
  • An indication of the direction of clinical importance (if one treatment is beneficial or harmful compared to the other, or whether there is no difference between the 2 tested treatments).
  • A description of the overall quality of the evidence (GRADE overall quality).

4.4. Identifying and analysing evidence of cost-effectiveness

The GDG is required to make decisions based on the best available evidence of both clinical effectiveness and cost-effectiveness. Guideline recommendations should be based on the expected costs of the different options in relation to their expected health benefits (that is, their ‘cost-effectiveness’) rather than the total implementation cost.143 Thus, if the evidence suggests that a strategy provides significant health benefits at an acceptable cost per patient treated, it should be recommended even if it would be expensive to implement across the whole population.

Health economic evidence was sought relating to the key clinical issues being addressed in the guideline. Health economists:

  • Undertook a systematic review of the published economic literature.
  • Undertook new cost-effectiveness analysis in priority areas.

4.4.1. Literature review

The health economists:

  • Identified potentially relevant studies for each review question from the health economic search results by reviewing titles and abstracts. Full papers were then obtained.
  • Reviewed full papers against prespecified inclusion and exclusion criteria to identify relevant studies (see below for details).
  • Critically appraised relevant studies using the economic evaluations checklist as specified in the NICE guidelines manual.143,145
  • Extracted key information about the studies' methods and results into economic evidence tables (included in Appendix I).
  • Generated summaries of the evidence in NICE economic evidence profile tables (included in the relevant chapter for each review question) – see below for details.

4.4.1.1. Inclusion and exclusion criteria

Full economic evaluations (studies comparing costs and health consequences of alternative courses of action: cost-utility, cost-effectiveness, cost-benefit and cost-consequences analyses) and comparative costing studies that addressed the review question in the relevant population were considered potentially includable as economic evidence.

Studies that only reported cost per hospital (not per patient), or only reported average cost-effectiveness without disaggregated costs and effects, were excluded. Literature reviews, abstracts, posters, letters, editorials, comment articles, unpublished studies and studies not in English were excluded. Studies published before 1999 and studies from non-OECD countries or the USA were also excluded, on the basis that the applicability of such studies to the present UK NHS context is likely to be too low for them to be helpful for decision-making.

Remaining health economic studies were prioritised for inclusion based on their relative applicability to the development of this guideline and the study limitations. For example, if a High quality, directly applicable UK analysis was available, then other less relevant studies may not have been included. However, in this guideline, no economic studies were excluded on the basis that more applicable evidence was available.

For more details about the assessment of applicability and methodological quality see Table 7 below and the economic evaluation checklist (Appendix G of the 2012 NICE guidelines manual145) and the health economics review protocol in Appendix D.

Table 7. Content of NICE economic evidence profile.


When no relevant health economic studies were found from the economic literature review, relevant UK NHS unit costs related to the compared interventions were presented to the GDG to inform the possible economic implications of the recommendations.

4.4.1.2. NICE economic evidence profiles

NICE economic evidence profile tables were used to summarise cost and cost-effectiveness estimates for the included health economic studies in each review chapter. The economic evidence profile shows an assessment of applicability and methodological quality for each economic study, with footnotes indicating the reasons for the assessment. These assessments were made by the health economist using the economic evaluation checklist from the NICE guidelines manual.145 It also shows the incremental costs, incremental effects (for example, quality-adjusted life years [QALYs]) and incremental cost-effectiveness ratio (ICER) for the base case analysis in the study, as well as information about the assessment of uncertainty in the analysis. See Table 7 for more details.

When a non-UK study was included in the profile, the results were converted into pounds sterling using the appropriate purchasing power parity.151
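The conversion itself is simple arithmetic, sketched below (the PPP rate used is hypothetical, for illustration only):

```python
def to_sterling(cost, ppp_per_gbp):
    """Convert a foreign cost to pounds sterling using a purchasing power
    parity (PPP) rate expressed as units of foreign currency per GBP."""
    return cost / ppp_per_gbp

# e.g. a US$1,500 cost at a hypothetical PPP of 1.5 USD per GBP is £1,000
```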

4.4.2. Undertaking new health economic analysis

As well as reviewing the published health economic literature for each review question, as described above, new health economic analysis was undertaken by the health economist in selected areas. Priority areas for new analysis were agreed by the GDG after formation of the review questions and consideration of the existing health economic evidence.

The GDG identified the highest priority areas for original health economic modelling as:

  • risk factors for cirrhosis
  • the appropriate tests (blood tests, non-invasive tests or a combination) for diagnosing cirrhosis
  • frequency of surveillance testing for the early detection of hepatocellular carcinoma
  • frequency of surveillance testing for the detection of oesophageal varices.

This was due to the number of people affected by these questions and the current uncertainty as to what the most cost-effective solutions would be, due to the lack of published economic models encompassing the whole pathway of cirrhosis from diagnosis to end-stage liver disease. New work was therefore conducted, which entailed the development of the NGC Liver Disease Pathway Model to address all of the questions prioritised for this guideline.

The following general principles were adhered to in developing the cost-effectiveness analysis:

  • Methods were consistent with the NICE reference case for interventions with health outcomes in NHS settings.143,146
  • The GDG was involved in the design of the model, selection of inputs and interpretation of the results.
  • Model inputs were based on the systematic review of the clinical literature supplemented with other published data sources where possible.
  • When published data were not available GDG expert opinion was used to populate the model.
  • Model inputs and assumptions were reported fully and transparently.
  • The results were subject to sensitivity analysis and limitations were discussed.
  • The model was peer-reviewed by another health economist at the NGC.

Full methods for the cost-effectiveness analysis are described in Appendix N.

4.4.3. Cost-effectiveness criteria

NICE's report ‘Social value judgements: principles for the development of NICE guidance’ sets out the principles that GDGs should consider when judging whether an intervention offers good value for money.144 In general, an intervention was considered to be cost-effective (given that the estimate was considered plausible) if either of the following criteria applied:

  • the intervention dominated other relevant strategies (that is, it was both less costly in terms of resource use and more clinically effective compared with all the other relevant alternative strategies), or
  • the intervention cost less than £20,000 per QALY gained compared with the next best strategy.
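The two criteria above can be expressed as a short decision sketch (illustrative only; incremental costs in pounds and incremental effects in QALYs, each versus the next best strategy):

```python
def assess_cost_effectiveness(delta_cost, delta_qalys, threshold=20000):
    """Return 'dominant' if the option is both cheaper and more effective,
    'cost-effective' if its ICER (delta cost / delta QALYs) is below
    GBP 20,000 per QALY gained, and 'not cost-effective' otherwise."""
    if delta_cost < 0 and delta_qalys > 0:
        return "dominant"
    if delta_qalys > 0 and delta_cost / delta_qalys < threshold:
        return "cost-effective"
    return "not cost-effective"

# e.g. +GBP 1,500 for +0.1 QALYs is an ICER of GBP 15,000 per QALY gained
```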

If the GDG recommended an intervention that was estimated to cost more than £20,000 per QALY gained, or did not recommend one that was estimated to cost less than £20,000 per QALY gained, the reasons for this decision are discussed explicitly in the ‘Recommendations and link to evidence’ section of the relevant chapter, with reference to issues regarding the plausibility of the estimate or to the factors set out in ‘Social value judgements: principles for the development of NICE guidance’.144

When QALYs or life years gained are not used in the analysis, results are difficult to interpret unless one strategy dominates the others with respect to every relevant health outcome and cost.

4.4.4. In the absence of economic evidence

When no relevant published health economic studies were found, and a new analysis was not prioritised, the GDG made a qualitative judgement about cost-effectiveness by considering expected differences in resource use between options and relevant UK NHS unit costs, alongside the results of the review of clinical effectiveness evidence.

The UK NHS costs reported in the guideline are those that were presented to the GDG and were correct at the time recommendations were drafted. They may have changed subsequently before the time of publication. However, we have no reason to believe they have changed substantially.

4.5. Developing recommendations

Over the course of the guideline development process, the GDG was presented with:

  • Evidence tables of the clinical and economic evidence reviewed from the literature. All evidence tables are in Appendices H and I.
  • Summaries of clinical and economic evidence and quality (as presented in Chapters 5–15).
  • Forest plots (Appendix K).
  • A description of the methods and results of the cost-effectiveness analysis undertaken for the guideline (Appendix N).

Recommendations were drafted on the basis of the GDG's interpretation of the available evidence, taking into account the balance of benefits, harms and costs between different courses of action. This was either done formally in an economic model, or informally. Firstly, the net clinical benefit over harm (clinical effectiveness) was considered, focusing on the critical outcomes. When this was done informally, the GDG took into account the clinical benefits and harms when one intervention was compared with another. The assessment of net clinical benefit was moderated by the importance placed on the outcomes (the GDG's values and preferences), and the confidence the GDG had in the evidence (evidence quality). Secondly, the GDG assessed whether the net clinical benefit justified any differences in costs between the alternative interventions.

When clinical and economic evidence was of poor quality, conflicting or absent, the GDG drafted recommendations based on its expert opinion. The considerations for making consensus-based recommendations include the balance between potential harms and benefits, the economic costs compared to the economic benefits, current practices, recommendations made in other relevant guidelines, patient preferences and equality issues. The consensus recommendations were agreed through discussions in the GDG meeting. The GDG also considered whether the uncertainty was sufficient to justify delaying making a recommendation to await further research, taking into account the potential harm of failing to make a clear recommendation (see Section 4.5.1 below).

The GDG considered the appropriate ‘strength’ of each recommendation. This takes into account the quality of the evidence but is conceptually different. Some recommendations are ‘strong’ in that the GDG believes that the vast majority of healthcare and other professionals and patients would choose a particular intervention if they considered the evidence in the same way that the GDG has. This is generally the case if the benefits clearly outweigh the harms for most people and the intervention is likely to be cost-effective. However, there is often a closer balance between benefits and harms, and some patients would not choose an intervention whereas others would. This may happen, for example, if some patients are particularly averse to some side effect and others are not. In these circumstances the recommendation is generally weaker, although it may be possible to make stronger recommendations about specific groups of patients.

The GDG focused on the following factors in agreeing the wording of the recommendations:

  • The actions health professionals need to take.
  • The information readers need to know.
  • The strength of the recommendation (for example the word ‘offer’ was used for strong recommendations and ‘consider’ for weak recommendations).
  • The involvement of patients (and their carers if needed) in decisions on treatment and care.
  • Consistency with NICE's standard advice on recommendations about drugs, waiting times and ineffective interventions (see Section 9.2 in the 2014 NICE guidelines manual143).

The main considerations specific to each recommendation are outlined in the ‘Recommendations and link to evidence’ sections within each chapter.

4.5.1. Research recommendations

When areas were identified for which good evidence was lacking, the GDG considered making recommendations for future research. Decisions about inclusion were based on factors such as:

  • the importance to patients or the population
  • national priorities
  • potential impact on the NHS and future NICE guidance
  • ethical and technical feasibility.

4.5.2. Validation process

This guidance is subject to a 6-week public consultation and feedback as part of the quality assurance and peer review of the document. All comments received from registered stakeholders are responded to in turn and posted on the NICE website.

4.5.3. Updating the guideline

Following publication, and in accordance with the NICE guidelines manual, NICE will undertake a review of whether the evidence base has progressed significantly to alter the guideline recommendations and warrant an update.

4.5.4. Disclaimer

Healthcare providers need to use clinical judgement, knowledge and expertise when deciding whether it is appropriate to apply guidelines. The recommendations cited here are a guide and may not be appropriate for use in all situations. The decision to adopt any of the recommendations cited here must be made by practitioners in light of individual patient circumstances, the wishes of the patient, clinical expertise and resources.

The National Guideline Centre disclaims any responsibility for damages arising out of the use or non-use of this guideline and the literature used in support of this guideline.

4.5.5. Funding

The National Guideline Centre was commissioned by the National Institute for Health and Care Excellence to undertake the work on this guideline.

Copyright © National Institute for Health and Care Excellence 2016.
Bookshelf ID: NBK385232
