
National Guideline Centre (UK). Non-Alcoholic Fatty Liver Disease: Assessment and Management. London: National Institute for Health and Care Excellence (NICE); 2016 Jul. (NICE Guideline, No. 49.)


4Methods

This chapter sets out in detail the methods used to review the evidence and to develop the recommendations that are presented in subsequent chapters of this guideline. This guidance was developed in accordance with the methods outlined in the 2012 version of the NICE guidelines manual.120,123

Sections 4.1 to 4.3 describe the process used to identify and review clinical evidence (summarised in Figure 1), Sections 4.2 and 4.4 describe the process used to identify and review the health economic evidence, and Section 4.5 describes the process used to develop recommendations.

Figure 1. Step-by-step process of review of evidence in the guideline.


4.1. Developing the review questions and outcomes

Review questions were developed using a PICO framework (patient, intervention, comparison and outcome) for intervention reviews; using a framework of population, index tests, reference standard and target condition for reviews of diagnostic test accuracy; and using population, presence or absence of factors under investigation (for example prognostic factors) and outcomes for prognostic reviews.

This use of a framework guided the literature searching process, critical appraisal and synthesis of evidence, and facilitated the development of recommendations by the GDG. The review questions were drafted by the NGC technical team and refined and validated by the GDG. The questions were based on the key clinical areas identified in the scope (Appendix A).

A total of 13 review questions were identified.

Full literature searches, critical appraisals and evidence reviews were completed for all the specified review questions.

Table 1. Review questions.


4.2. Searching for evidence

4.2.1. Clinical literature search

Systematic literature searches were undertaken to identify all published clinical evidence relevant to the review questions. Searches were undertaken according to the parameters stipulated within the NICE guidelines manual.123 Databases were searched using relevant medical subject headings, free-text terms and study-type filters where appropriate. Where possible, searches were restricted to articles published in English; studies published in other languages were not reviewed. All searches were conducted in Medline, Embase and the Cochrane Library. Additional subject-specific databases were used for some questions: AMED and CINAHL for the exercise, lifestyle and diet reviews, and PsycINFO for the lifestyle review. All searches were updated on 27 August 2015. No papers published after this date were considered.

Search strategies were quality assured by cross-checking reference lists of highly relevant papers, analysing search strategies in other systematic reviews, and asking GDG members to highlight any additional studies. Searches were quality assured by a second information scientist before being run. The questions, the study types applied, the databases searched and the years covered can be found in Appendix G.

The titles and abstracts of records retrieved by the searches were sifted for relevance, with potentially significant publications obtained in full text. These were assessed against the inclusion criteria.

During the scoping stage, a search was conducted for guidelines and reports on the websites listed below from organisations relevant to the topic.

All references sent by stakeholders were considered. Searching for unpublished literature was not undertaken. The NGC and NICE do not have access to drug manufacturers' unpublished clinical trial results, so the clinical evidence considered by the GDG for pharmaceutical interventions may be different from that considered by the MHRA and European Medicines Agency for the purposes of licensing and safety regulation.

4.2.2. Health economic literature search

Systematic literature searches were also undertaken to identify health economic evidence within published literature relevant to the review questions. The evidence was identified by conducting a broad search relating to non-alcoholic fatty liver disease in the NHS Economic Evaluation Database (NHS EED), the Health Technology Assessment database (HTA) and the Health Economic Evaluations Database (HEED) with no date restrictions (NHS EED ceased to be updated after March 2015; HEED was used for searches up to 13 June 2014 but subsequently ceased to be available). Additionally, the search was run on Medline and Embase using a health economic filter, from 1 January 2013, to ensure that recent publications not yet indexed by the economic databases were identified. This was supplemented by an additional search for economic papers specifically relating to the modelling of liver disease, run in the same databases with the same date limits, to ensure no modelling studies were missed. Where possible, searches were restricted to articles published in English; studies published in other languages were not reviewed.

The health economic search strategies are included in Appendix G. All searches were updated on 27 August 2015. No papers published after this date were considered.

4.3. Identifying and analysing evidence of effectiveness

Research fellows conducted the tasks listed below, which are described in further detail in the rest of this section:

  • Identified potentially relevant studies for each review question from the relevant search results by reviewing titles and abstracts. Full papers were then obtained.
  • Reviewed full papers against pre-specified inclusion and exclusion criteria to identify studies that addressed the review question in the appropriate population, and reported on outcomes of interest (review protocols are included in Appendix C).
  • Critically appraised relevant studies using the appropriate study design checklist as specified in the NICE guidelines manual.120,123 Prognostic studies were critically appraised using NGC checklists.
  • Extracted key information about interventional study methods and results using ‘Evibase’, NGC's purpose-built software. Evibase produces summary evidence tables, including critical appraisal ratings. Key information about non-interventional study methods and results was manually extracted onto standard evidence tables and critically appraised separately (evidence tables are included in Appendix H).
  • Generated summaries of the evidence by outcome. Outcome data were combined, analysed and reported according to study design:
    • Randomised data were meta-analysed where appropriate and reported in GRADE profile tables.
    • Observational data were presented as a range of values in GRADE profile tables or meta-analysed if appropriate.
    • Prognostic data were meta-analysed where appropriate and reported in GRADE profile tables.
    • Diagnostic studies were meta-analysed where appropriate or presented as a range of values in adapted GRADE profile tables.
  • A minimum 10% sample of the abstract lists from first sifts by new reviewers, and those for complex review questions (for example, prognostic reviews), was double-sifted by a senior research fellow and any discrepancies were rectified. All of the evidence reviews were quality assured by a senior research fellow. This included checking:
    • papers were included or excluded appropriately
    • a sample of the data extractions
    • correct methods were used to synthesise data
    • a sample of the risk of bias assessments.

4.3.1. Inclusion and exclusion criteria

The inclusion and exclusion of studies was based on the criteria defined in the review protocols, which can be found in Appendix C. Excluded studies by review question (with the reasons for their exclusion) are listed in Appendix M. The GDG was consulted about any uncertainty regarding inclusion or exclusion.

The key population inclusion criteria were:

  • Adults, children and young people with suspected or confirmed primary NAFLD.
  • No subgroups of people have been identified as needing specific consideration.

The key population exclusion criterion was:

  • People with secondary causes of fatty liver (for example, chronic hepatitis C infection, total parenteral nutrition treatment and drug-induced fatty liver).

Literature reviews, conference abstracts, posters, letters, editorials, comment articles, unpublished studies and studies not in English were excluded.

4.3.2. Type of studies

Randomised trials, non-randomised trials, and observational studies (including diagnostic or prognostic studies) were included in the evidence reviews as appropriate.

For most intervention reviews in this guideline, parallel randomised controlled trials (RCTs) were included because they are considered the most robust type of study design that could produce an unbiased estimate of the intervention effects. If non-randomised studies were appropriate for inclusion, for example, non-drug trials with no randomised evidence, the GDG identified a priori in the protocol the variables that either had to be equivalent at baseline or had to be adjusted for in the analysis. If the study fulfilled neither criterion it was excluded. Please refer to Appendix C for full details on the study design of studies selected for each review question.

For diagnostic review questions, diagnostic RCTs, cross-sectional studies and retrospective studies were included. For prognostic review questions, prospective and retrospective cohort studies were included. Case–control studies were not included.

Qualitative research was not considered in this guideline as no review questions exploring outcomes that would require investigation of qualitative research were prioritised in the scope.

4.3.3. Methods of combining evidence

4.3.3.1. Data synthesis for intervention reviews

Where possible, meta-analyses were conducted using Cochrane Review Manager (RevMan5)1 software to combine the data given in all studies for each of the outcomes of interest for the review question.

Most analyses were stratified by age (under 18 years and 18 years or over), which meant that studies with predominant age groups in different age strata were not combined and analysed together. For some questions the population was not stratified by age (diagnosis, assessment, extra-hepatic conditions, caffeine and the omega-3 section of the diet modification reviews), as the GDG felt that studies could be considered together in these instances and there was no clinical rationale for stratification.

The primary outcome for most of the reviews was progression of NAFLD. This could be measured by a range of different techniques, for example:

  • liver biopsy
  • MRI or MRS
  • ultrasound (presence or absence of steatosis only)
  • the enhanced liver fibrosis (ELF) score
  • transient elastography
  • NAFLD fibrosis score.

The GDG felt that, for liver biopsy, progression measured using only the NAFLD activity score (NAS) by Brunt/Kleiner/NASH-CRN was acceptable, and that progression of liver fat as measured by other methods, such as the ISHAK score, would be excluded. It was acknowledged that papers could report progression of NAFLD by the means listed above as either dichotomous (for example, an improvement of 2 or more on the NAS) or continuous (mean and SD of NAFLD fibrosis score) outcomes. With respect to ultrasound, the experience of the GDG was that, whilst ultrasound is a useful tool for identifying whether or not there is steatosis of the liver, it is not an appropriate technique for quantifying the degree of fat within the liver because of wide inter-observer variability. Furthermore, the degree of hepatic steatosis cannot be interpreted as a marker of severity of NAFLD. As such, the GDG considered that measurement of the degree of steatosis on ultrasound should not be considered a relevant outcome, and that ultrasound should only be reported if it was used to indicate the presence or absence of steatosis.

4.3.3.1.1. Analysis of different types of data
Dichotomous outcomes

Fixed-effects (Mantel-Haenszel) techniques were used to calculate risk ratios (relative risks) for the binary outcomes, which included:

  • progression of NAFLD (author thresholds of improvement/no improvement) as assessed by:
    • liver biopsy
    • MRI or MRS
    • ultrasound (presence or absence of steatosis only)
    • the enhanced liver fibrosis (ELF) score
    • transient elastography
    • NAFLD fibrosis score
  • serious adverse events
  • weight loss
  • liver blood tests (for example ALT levels, ALT/AST ratio)
  • adverse events.

The absolute risk difference was also calculated using GRADEpro56 software, using the median event rate in the control arm of the pooled results.
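The pooling and absolute-effect steps described above can be sketched as follows. This is an illustrative Python fragment, not the implementation in RevMan5 or GRADEpro (the software actually used); the function names are mine, and the formulas are the standard Mantel-Haenszel weighting and the usual risk-difference arithmetic applied to a median control event rate.

```python
def mantel_haenszel_rr(studies):
    """Fixed-effect Mantel-Haenszel pooled risk ratio.

    Each study is a tuple (a, n1, c, n2): events and total in the
    intervention arm, then events and total in the control arm.
    """
    numerator = sum(a * n2 / (n1 + n2) for a, n1, c, n2 in studies)
    denominator = sum(c * n1 / (n1 + n2) for a, n1, c, n2 in studies)
    return numerator / denominator

def absolute_risk_difference(rr, control_event_rate):
    """Absolute risk difference per 1000 people, derived from the pooled
    risk ratio and the median control-arm event rate (the approach the
    text attributes to GRADEpro)."""
    return (rr * control_event_rate - control_event_rate) * 1000
```

For a single study with 10/100 events in the intervention arm and 20/100 in the control arm, `mantel_haenszel_rr` returns 0.5; applied to a control event rate of 0.2, `absolute_risk_difference` gives 100 fewer events per 1000.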

For binary variables where there were zero events in either arm, or a less than 1% event rate, Peto odds ratios, rather than risk ratios, were calculated. Peto odds ratios are more appropriate for data with a low number of events.
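The Peto method can be sketched as below (an illustrative Python fragment using the standard one-step Peto formula; the function name is mine, and RevMan5 was the software actually used). Unlike the ordinary odds ratio, it remains defined when one arm has zero events, which is why it is preferred for rare events.

```python
import math

def peto_odds_ratio(studies):
    """Pooled Peto odds ratio from 2x2 tables (a, n1, c, n2):
    events/total in the intervention arm, events/total in the control arm.
    Works even when one arm of a study has zero events."""
    o_minus_e_total = 0.0
    variance_total = 0.0
    for a, n1, c, n2 in studies:
        n = n1 + n2
        events = a + c
        expected = n1 * events / n                      # E under the null
        variance = (n1 * n2 * events * (n - events)) / (n ** 2 * (n - 1))
        o_minus_e_total += a - expected                 # observed minus expected
        variance_total += variance
    return math.exp(o_minus_e_total / variance_total)
```

For a single study with 1/100 events in one arm and 0/100 in the other, this returns exp(2) ≈ 7.39, whereas a risk ratio would be undefined without a continuity correction.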

Where there was sufficient information provided, hazard ratios were calculated in preference for outcomes such as mortality where the time to the event occurring was important for decision-making.

Continuous outcomes

The continuous outcomes were analysed using an inverse variance method for pooling weighted mean differences. These outcomes included:

  • progression of NAFLD as assessed by:
    • liver biopsy
    • MRI or MRS
    • the enhanced liver fibrosis (ELF) score
    • transient elastography
    • NAFLD fibrosis score
  • health-related quality of life (HRQoL)
  • weight loss
  • liver blood tests (for example ALT levels, ALT/AST ratio).

Where the studies within a single meta-analysis had different scales of measurement, standardised mean differences were used (providing all studies reported either change from baseline or final values rather than a mixture of both), where each different measure in each study was ‘normalised’ to the standard deviation value pooled between the intervention and comparator groups in that same study.
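The 'normalisation' described above is the standardised mean difference. A minimal sketch (illustrative Python with a hypothetical function name; this is the standard Cohen's d formula with the SD pooled between the two arms of the same study, not the exact RevMan5 implementation, which applies a small-sample correction for Hedges' g):

```python
import math

def standardised_mean_difference(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d: the mean difference divided by the standard deviation
    pooled between the intervention and comparator groups of one study."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2)
                          / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd
```

For example, means of 12 and 10 with a common SD of 2 in groups of 10 give an SMD of 1.0, regardless of the scale the outcome was measured on.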

The means and standard deviations of continuous outcomes are required for meta-analysis. However, in cases where standard deviations were not reported, the standard error was calculated if the p values or 95% confidence intervals (95% CI) were reported, and meta-analysis was undertaken with the mean and standard error using the generic inverse variance method in Cochrane Review Manager (RevMan5)1 software. Where p values were reported as ‘less than’, a conservative approach was undertaken. For example, if a p value was reported as ‘p ≤0.001’, the calculations for standard deviations were based on a p value of 0.001. If these statistical measures were not available then the methods described in Section 16.1.3 of the Cochrane Handbook (version 5.1.0, updated March 2011) were applied.
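The back-calculations described above can be sketched as follows (illustrative Python; function names are mine). They assume a normally distributed test statistic, which is the standard assumption behind these conversions.

```python
from statistics import NormalDist

def se_from_ci(lower, upper, level=0.95):
    """Standard error recovered from a reported confidence interval."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # e.g. 1.96 for 95%
    return (upper - lower) / (2 * z)

def se_from_p(effect, p):
    """Standard error recovered from a two-sided p value and the point
    estimate, assuming a normal test statistic. Where p is reported as
    'p <= 0.001', the conservative approach is to pass p = 0.001."""
    z = NormalDist().inv_cdf(1 - p / 2)
    return abs(effect) / z
```

A 95% CI of (−1.96, 1.96) yields a standard error of about 1.0, as does an effect of 1.96 with p = 0.05; either can then be entered into the generic inverse variance method.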

4.3.3.1.2. Generic inverse variance

If a study reported only the summary statistic and 95% CI the generic-inverse variance method was used to enter data into RevMan5.1 If the control event rate was reported this was used to generate the absolute risk difference in GRADEpro.56 If multivariate analysis was used to derive the summary statistic but no adjusted control event rate was reported no absolute risk difference was calculated.
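The generic inverse variance method itself is simple enough to sketch (illustrative Python, not the RevMan5 implementation; the function name is mine). Each estimate is weighted by the reciprocal of its squared standard error, so more precise studies count for more; ratio measures should be passed on the log scale.

```python
import math

def generic_inverse_variance(estimates):
    """Fixed-effect inverse-variance pooling.

    estimates: list of (theta, se) pairs; for ratio measures such as
    odds or hazard ratios, theta should be the log of the estimate.
    Returns the pooled estimate and its standard error.
    """
    weights = [1 / se ** 2 for _, se in estimates]
    pooled = sum(w * theta for (theta, _), w in zip(estimates, weights)) \
        / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se
```

Two equally precise estimates of 1.0 and 3.0 (each with SE 1.0) pool to 2.0 with a smaller SE of √0.5 ≈ 0.71, reflecting the gain in precision from combining studies.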

4.3.3.1.3. Heterogeneity

Statistical heterogeneity was assessed for each meta-analysis estimate by considering the chi-squared test for significance at p<0.1 or an I-squared (I2) inconsistency statistic (with an I-squared value of more than 50% indicating significant heterogeneity) as well as the distribution of effects. Where significant heterogeneity was present, predefined subgrouping of studies was carried out as per the review question protocols.

If the subgroup analysis resolved heterogeneity within all of the derived subgroups, then each of the derived subgroups was adopted as a separate outcome (provided at least 1 study remained in each subgroup). Assessments of potential differences in effect between subgroups were based on chi-squared tests for heterogeneity between subgroups. Any subgroup differences were interpreted with caution, as separating the groups breaks the study randomisation and is therefore subject to uncontrolled confounding.

If predefined strategies of subgrouping were unable to explain statistical heterogeneity within each derived subgroup, then a random effects (DerSimonian and Laird) model was employed to the entire group of studies in the meta-analysis. A random-effects model assumes a distribution of populations, rather than a single population. This leads to a widening of the confidence interval around the overall estimate, thus providing a more realistic interpretation of the true distribution of effects across more than 1 population. If, however, the GDG considered the heterogeneity was so large that meta-analysis was inappropriate, then the results were described narratively.
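The heterogeneity statistics referred to above, and the between-study variance used by the DerSimonian and Laird random-effects model, can be sketched as follows (illustrative Python; the function name is mine and the formulas are the standard Cochran's Q, I² and method-of-moments τ², not the exact RevMan5 code):

```python
def heterogeneity(estimates):
    """Cochran's Q, I-squared (%) and the DerSimonian-Laird tau-squared
    for a list of (theta, se) effect estimates."""
    weights = [1 / se ** 2 for _, se in estimates]
    # Fixed-effect pooled estimate, needed for Q.
    theta_fixed = sum(w * t for (t, _), w in zip(estimates, weights)) \
        / sum(weights)
    q = sum(w * (t - theta_fixed) ** 2
            for (t, _), w in zip(estimates, weights))
    df = len(estimates) - 1
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    # Method-of-moments estimate of the between-study variance.
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau_squared = max(0.0, (q - df) / c) if c > 0 else 0.0
    return q, i_squared, tau_squared
```

Two equally precise estimates of 0 and 2 (SE 1.0 each) give Q = 2, I² = 50% (the threshold for significant heterogeneity used in this guideline) and τ² = 1; a non-zero τ² widens the confidence interval of the random-effects pooled estimate, as described above.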

4.3.3.2. Data synthesis for prognostic factor reviews

Odds ratios (ORs), risk ratios (RRs), or hazard ratios (HRs), with their 95% CIs, for the effect of the pre-specified prognostic factors were extracted from the studies. Studies were only included if the confounders pre-specified by the GDG were either matched at baseline or were adjusted for in multivariate analysis.

Studies of lower risk of bias were preferred, taking into account the analysis and the study design. In particular, prospective cohort studies were preferred if they reported multivariable analyses that adjusted for key confounders identified by the GDG at the protocol stage for that outcome.

If more than 1 study covered the same combination of population, risk factor and outcome, and adjusted for the same key confounders, then meta-analysis was used to pool results. Meta-analysis was carried out using the generic inverse variance function on RevMan51 using fixed effects. Heterogeneity was assessed using the same criteria as for intervention studies, with an I2 of 50–74% representing serious inconsistency and an I2 of 75% or more representing very serious inconsistency. If serious or very serious heterogeneity existed, then subgrouping strategies were based on pre-specified subgrouping criteria as for interventional reviews. If subgrouping failed to explain heterogeneity, then the random-effects model was used. If subgrouping successfully explained heterogeneity then each of the subgroups was presented as a separate outcome (for example, mortality in people under 30 years and mortality in people 30 years and over) and a fixed-effects model was used.

Where evidence was not meta-analysed, because studies differed in population, outcome or risk factors, then no alternative pooling strategies were carried out, on the basis that such pooling would have little meaning. Results from single studies were presented.

4.3.3.3. Data synthesis for prognostic monitoring review

The monitoring review question (Chapter 8) was undertaken using a stepwise approach in agreement with the GDG. The information extracted from the papers included the number of patients with NAFLD, NAFL and NASH at initial biopsy, the average time between biopsies, and the numbers who had progressed, regressed or remained stable in fibrosis staging on the Brunt/CRN criteria. For papers with mixed NAFLD populations, the data are presented as a total and also separately for those with initial NASH and NAFL where possible. If the fibrosis progression rate was reported this was also included in the modified clinical evidence summary table (a calculation based on the difference between fibrosis stage at baseline and follow-up using the Brunt/CRN criteria, divided by the time in years between the 2 measurements). The GDG recognised that the fibrosis progression rate was useful in comparing the included studies as they each had very different average times between the biopsies. This additional information was available within 1 identified meta-analysis176 as the authors had contacted the authors of primary studies for further information and had calculated fibrosis progression scores specifically for people within the studies who started with no fibrosis at baseline. After discussion with the GDG these summary statistics were included in the evidence table. The mean fibrosis progression rate for the studies where it was possible to extract was calculated for NAFLD, NAFL and NASH populations and meta-analysed using the generic inverse variance method described in section 4.3.3.1.2.
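The fibrosis progression rate defined in parentheses above is a simple calculation, sketched here for clarity (illustrative Python; the function name is mine):

```python
def fibrosis_progression_rate(stage_baseline, stage_followup, years):
    """Fibrosis progression rate: the change in Brunt/CRN fibrosis stage
    between 2 biopsies, divided by the time in years between them.
    Negative values indicate regression."""
    return (stage_followup - stage_baseline) / years
```

For example, progression from stage 0 to stage 2 over 4 years gives a rate of 0.5 stages per year, which allows studies with very different inter-biopsy intervals to be compared on a common scale.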

The GDG was also interested in which population required more intensive monitoring. Clinical evidence was extracted from studies that listed multivariate analysis on factors associated with fibrosis progression. Following discussion it was felt most useful to present these grouped into factors from initial biopsy and at follow-up. These were presented in modified GRADE tables with quality assessments and forest plots. The GDG felt that the forest plots axis should be labelled so that the point estimate reflected those with the identified risk factor, rather than favouring those without, in order to ease understanding.

4.3.3.4. Data synthesis for diagnostic test accuracy reviews

For diagnostic test accuracy studies, a positive result on the index test was recorded if the patient had values of the measured quantity above or below a threshold value, and different thresholds could be used. Few of the diagnostic tests listed in the review protocols had widely acknowledged or commonly pre-specified thresholds; therefore, results for all thresholds used were reported, and the GDG agreed groups of threshold ranges to aid the presentation of results. Diagnostic test accuracy measures used in the analysis were: area under the receiver operating characteristic (ROC) curve (AUC) and, for different thresholds (if appropriate), sensitivity and specificity.

The threshold of a diagnostic test is defined as the value at which the test can best differentiate between those with and without the target condition; in practice this varies amongst studies. If a test has a high sensitivity then very few people with the condition will be missed (few false negatives): for example, a test with a sensitivity of 97% will miss only 3% of people with the condition. Conversely, if a test has a high specificity then few people without the condition will be incorrectly diagnosed (few false positives): for example, a test with a specificity of 97% will incorrectly diagnose as positive only 3% of people who do not have the condition.

For this guideline, sensitivity was considered more important than specificity. A high sensitivity (true positives) means a test can pick up the majority of correct cases of NAFLD, NASH or fibrosis, who may benefit from treatment (non-pharmacological or pharmacological) and ongoing monitoring; conversely, a high specificity (true negatives) can correctly exclude people without NAFLD, NASH or fibrosis, who would not require management or monitoring.
Coupled forest plots of sensitivity and specificity with their 95% CIs across studies (at various thresholds) were produced for each test, using RevMan5.1 In order to do this, 2×2 tables (the number of true positives, false positives, true negatives and false negatives) were directly taken from the study if given, or else were derived from raw data or calculated from the set of test accuracy statistics.

Diagnostic meta-analysis was conducted where appropriate, that is, when 3 or more studies were available per threshold. Test accuracy for the studies was pooled using the bivariate method for the direct estimation of summary sensitivity and specificity, using a random-effects approach in WinBUGS software.2 See Appendix L for further details. The advantage of this approach is that it produces summary estimates of sensitivity and specificity that account for the correlation between the 2 statistics. Other advantages of this method have been described elsewhere.155,199,200 The bivariate method uses logistic regression on the true positives, true negatives, false positives and false negatives reported in the studies. Overall sensitivity and specificity and confidence regions were plotted (using methods outlined by Novielli 2010131). Pooled sensitivity and specificity and their 95% CIs were reported in the clinical evidence summary tables. For scores with fewer than 3 studies, individual studies' sensitivity and the paired specificity were reported where possible. Where an even number of studies was available, the results of the study with the lower sensitivity of the 2 middle studies were reported.

Heterogeneity or inconsistency amongst studies was visually inspected in the coupled forest plots and pooled diagnostic meta-analysis plots.

Area under the ROC curve (AUC) data for each study were also plotted on a graph, for each diagnostic test. The AUC describes the overall diagnostic accuracy across the full range of thresholds. The following criteria were used for evaluating AUCs:

  • ≤0.50: worse than chance
  • 0.50–0.60: very poor
  • 0.61–0.70: poor
  • 0.71–0.80: moderate
  • 0.81–0.90: good
  • 0.91–1.00: excellent or perfect test.
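The banding above can be expressed as a small lookup (illustrative Python; the function name is mine, and the bands are read as contiguous, taking 0.90 as the upper bound of the 'good' band):

```python
def auc_band(auc):
    """Map an AUC value to the descriptive bands used in this guideline."""
    if auc <= 0.50:
        return "worse than chance"
    if auc <= 0.60:
        return "very poor"
    if auc <= 0.70:
        return "poor"
    if auc <= 0.80:
        return "moderate"
    if auc <= 0.90:
        return "good"
    return "excellent or perfect test"
```

For example, an AUC of 0.75 falls in the 'moderate' band and an AUC of 0.95 in the 'excellent or perfect test' band.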

Heterogeneity or inconsistency amongst studies was visually inspected.

4.3.4. Appraising the quality of evidence by outcomes

4.3.4.1. Interventional studies

The evidence for outcomes from the included RCTs and, where appropriate, observational studies were evaluated and presented using an adaptation of the ‘Grading of Recommendations Assessment, Development and Evaluation (GRADE) toolbox’ developed by the international GRADE working group (http://www.gradeworkinggroup.org/). The software (GRADEpro56) developed by the GRADE working group was used to assess the quality of each outcome, taking into account individual study quality and the meta-analysis results.

Each outcome was first examined for each of the quality elements listed and defined in Table 2.

Table 2. Description of quality elements in GRADE for intervention studies.


Details of how the 4 main quality elements (risk of bias, indirectness, inconsistency and imprecision) were appraised for each outcome are given below. Publication or other bias was only taken into consideration in the quality assessment if it was apparent.

4.3.4.1.1. Risk of bias

The main domains of bias for RCTs are listed in Table 3. Each outcome had its risk of bias assessed within each study first. For each study, if there were no risks of bias in any domain, the risk of bias was given a rating of 0. If there was risk of bias in just 1 domain, the risk of bias was given a ‘serious’ rating of −1, but if there was risk of bias in 2 or more domains the risk of bias was given a ‘very serious’ rating of −2. A weighted average score was then calculated across all studies contributing to the outcome, taking into account the weighting of studies according to study precision. For example, if the most precise studies each tended to have a score of −1 for that outcome, the overall score for that outcome would tend towards −1.
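The weighted averaging described above can be sketched as below. This is a speculative illustration: the text does not specify the exact weighting formula, so inverse-variance weights are assumed here as the proxy for study precision, and the function name is mine.

```python
def weighted_bias_score(studies):
    """Precision-weighted average of per-study risk-of-bias scores.

    studies: list of (score, se) pairs, where score is 0, -1 or -2 and
    se is the study's standard error (smaller se = more precise = more
    weight, using inverse-variance weighting as an assumed proxy).
    """
    weights = [1 / se ** 2 for _, se in studies]
    return sum(w * score for (score, _), w in zip(studies, weights)) \
        / sum(weights)
```

Two equally precise studies scoring −1 and 0 average to −0.5, while a much more precise study scoring −1 pulls the overall outcome score towards −1, as the text describes.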

Table 3. Principal domains of bias in randomised controlled trials.


4.3.4.1.2. Indirectness

Indirectness refers to the extent to which the populations, interventions, comparisons and outcome measures are dissimilar to those defined in the inclusion criteria for the reviews. Indirectness is important when these differences are expected to contribute to a difference in effect size, or may affect the balance of harms and benefits considered for an intervention. As for the risk of bias, each outcome had its indirectness assessed within each study first. For each study, if there were no sources of indirectness, indirectness was given a rating of 0. If there was indirectness in just 1 source (for example in terms of population), indirectness was given a ‘serious’ rating of −1, but if there was indirectness in 2 or more sources (for example, in terms of population and treatment) the indirectness was given a ‘very serious’ rating of −2. A weighted average score was then calculated across all studies contributing to the outcome by taking into account study precision. For example, if the most precise studies tended to have an indirectness score of −1 each for that outcome, the overall score for that outcome would tend towards −1.

4.3.4.1.3. Inconsistency

Inconsistency refers to an unexplained heterogeneity of results for an outcome across different studies. When estimates of the treatment effect across studies differ widely, this suggests true differences in the underlying treatment effect, which may be due to differences in populations, settings or doses. When heterogeneity existed within an outcome (chi-squared p<0.1, or I2>50%), but no plausible explanation could be found, the quality of evidence for that outcome was downgraded. Inconsistency for that outcome was given a ‘serious’ score of −1 if the I2 was 50–74%, and a ‘very serious’ score of −2 if the I2 was 75% or more.

If inconsistency could be explained based on pre-specified subgroup analysis (that is, each subgroup had an I2<50%), the GDG took this into account and considered whether to make separate recommendations on new outcomes based on the subgroups defined by the assumed explanatory factors. In such a situation the quality of evidence was not downgraded for those emergent outcomes.

Since the inconsistency score was based on the meta-analysis results, the score represented the whole outcome and so weighted averaging across studies was not necessary.

4.3.4.1.4. Imprecision

The criteria applied for imprecision were based on the 95% CIs for the pooled estimate of effect, and the minimal important differences (MID) for the outcome. The MIDs are the threshold for appreciable benefits and harms, separated by a zone either side of the line of no effect where there is assumed to be no clinically important effect. If either end of the 95% CI of the overall estimate of effect crossed 1 of the MID lines, imprecision was regarded as serious and a ‘serious’ score of −1 was given. This was because the overall result, as represented by the span of the confidence interval, was consistent with 2 interpretations as defined by the MID (for example, both no clinically important effect and clinical benefit were possible interpretations). If both MID lines were crossed by either or both ends of the 95% CI then imprecision was regarded as very serious and a ‘very serious’ score of −2 was given. This was because the overall result was consistent with all 3 interpretations defined by the MID (no clinically important effect, clinical benefit and clinical harm). This is illustrated in Figure 2. As for inconsistency, since the imprecision score was based on the meta-analysis results, the score represented the whole outcome and so weighted averaging across studies was not necessary.
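The MID-crossing logic described above can be sketched as follows (illustrative Python; the function name is mine, and a MID line is assumed to count as crossed when it lies strictly inside the 95% CI):

```python
def imprecision_score(ci_lower, ci_upper, mid_lower, mid_upper):
    """GRADE imprecision score from the 95% CI of a pooled estimate and
    the 2 MID lines: 0 (no downgrade), -1 ('serious', the CI spans one
    MID line) or -2 ('very serious', the CI spans both)."""
    crossed = sum(1 for mid in (mid_lower, mid_upper)
                  if ci_lower < mid < ci_upper)
    return -crossed
```

With the default risk-ratio MID lines of 0.75 and 1.25, a 95% CI of (0.80, 1.10) crosses neither line (score 0, precise); (0.70, 1.10) crosses 0.75 only (score −1, two possible interpretations); and (0.60, 1.40) crosses both (score −2, consistent with harm, no effect and benefit).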

Figure 2. Illustration of precise and imprecise outcomes based on the 95% CI of dichotomous outcomes in a forest plot (Note that all 3 results would be pooled estimates, and would not, in practice, be placed on the same forest plot).

Figure 2

Illustration of precise and imprecise outcomes based on the 95% CI of dichotomous outcomes in a forest plot (Note that all 3 results would be pooled estimates, and would not, in practice, be placed on the same forest plot).

The position of the MID lines is ideally determined by values reported in the literature. ‘Anchor-based’ methods aim to establish clinically meaningful changes in a continuous outcome variable by relating or ‘anchoring’ them to patient-centred measures of clinical effectiveness that could be regarded as gold standards with a high level of face validity. For example, a MID for an outcome could be defined by the minimum amount of change in that outcome necessary to make patients feel their quality of life had ‘significantly improved’. MIDs in the literature may also be based on expert clinician or consensus opinion concerning the minimum amount of change in a variable deemed to affect quality of life or health. For binary variables, any MIDs reported in the literature will inevitably be based on expert consensus, as such MIDs relate to all-or-nothing population effects rather than measurable effects on an individual, and so are not amenable to patient-centred ‘anchor’ methods.

In the absence of values identified in the literature, the alternative approach to deciding on MID levels is the ‘default’ method, as follows:

  • For categorical outcomes the MIDs were taken to be RRs of 0.75 and 1.25. For ‘positive’ outcomes such as ‘patient satisfaction’, the RR of 0.75 is taken as the line denoting the boundary between no clinically important effect and a clinically significant harm, whilst the RR of 1.25 is taken as the line denoting the boundary between no clinically important effect and a clinically significant benefit. For ‘negative’ outcomes such as ‘bleeding’, the opposite occurs, so the RR of 0.75 is taken as the line denoting the boundary between no clinically important effect and a clinically significant benefit, whilst the RR of 1.25 is taken as the line denoting the boundary between no clinically important effect and a clinically significant harm.
  • For continuous outcome variables the MID was taken as half the median baseline standard deviation of that variable, across all studies in the meta-analysis. Hence the MID denoting the minimum clinically significant benefit was positive for a ‘positive’ outcome (for example, a quality of life measure where a higher score denotes better health), and negative for a ‘negative’ outcome (for example, a visual analogue scale [VAS] pain score). Clinically significant harms will be the converse of these. If baseline values are unavailable, then half the median comparator group standard deviation of that variable will be taken as the MID.
  • If standardised mean differences have been used, then the MID will be set at the absolute value of +0.5. This follows because standardised mean differences are mean differences normalised to the pooled standard deviation of the 2 groups, and are thus effectively expressed in units of ‘numbers of standard deviations’. The 0.5 MID value in this context therefore indicates half a standard deviation, the same definition of MID as used for non-standardised mean differences.
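The default rules above can be summarised in a short sketch. This is illustrative only (the function and its argument names are assumptions, not NICE software); it returns the lower and upper MID lines for each outcome type.

```python
from statistics import median

def default_mids(outcome_type, baseline_sds=None):
    """Return (lower_mid, upper_mid) under the 'default' MID rules.

    outcome_type: 'binary', 'continuous' or 'smd'
    baseline_sds: per-study baseline standard deviations
                  (continuous outcomes only)
    """
    if outcome_type == 'binary':
        # risk-ratio lines either side of the line of no effect (RR = 1)
        return (0.75, 1.25)
    if outcome_type == 'continuous':
        # half the median baseline standard deviation across studies
        half_sd = 0.5 * median(baseline_sds)
        return (-half_sd, half_sd)
    if outcome_type == 'smd':
        # SMDs are expressed in standard-deviation units, so the MID is 0.5 SD
        return (-0.5, 0.5)
    raise ValueError(f'unknown outcome type: {outcome_type}')
```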

The default MID value was subject to amendment after discussion with the GDG. If the GDG decided that the MID level should be altered, after consideration of absolute as well as relative effects, this was allowed, provided that any such decision was not influenced by any bias towards making stronger or weaker recommendations for specific outcomes.

For this guideline, no appropriate MIDs for continuous or dichotomous outcomes were found in the literature, and so the default method was adopted for imprecision and the clinical importance of each effect size was discussed with the GDG.

4.3.4.1.5. Overall grading of the quality of clinical evidence

Once an outcome had been appraised for the main quality elements, as above, an overall quality grade was calculated for that outcome. The scores (0, −1 or −2) from each of the main quality elements were summed to give a score that could range from 0 (the best possible) to −8 (the worst possible). However, scores were capped at −3. This final score was then applied to the starting grade that had originally been applied to the outcome by default, based on study design. All RCTs started as High and the overall quality became Moderate, Low or Very Low if the overall score was −1, −2 or −3 points respectively. The significance of these overall ratings is explained in Table 4. The reasons for downgrading in each case were specified in the footnotes of the GRADE tables.
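The summing and capping described above amounts to the following. This is a minimal sketch for RCT evidence (which starts at High); the function name is illustrative.

```python
def overall_grade(element_scores):
    """Map summed quality-element scores to an overall GRADE rating
    for an RCT outcome.

    element_scores: the 0/-1/-2 scores for risk of bias, indirectness,
    inconsistency and imprecision. The summed score is capped at -3.
    """
    total = max(sum(element_scores), -3)
    return {0: 'High', -1: 'Moderate', -2: 'Low', -3: 'Very Low'}[total]
```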

Table 4. Overall quality of outcome evidence in GRADE.

Table 4

Overall quality of outcome evidence in GRADE.

Observational interventional studies started at Low, and so a score of −1 would be enough to take the grade to the lowest level of Very Low. Observational studies could, however, be upgraded if all of the following applied: a large magnitude of effect, a dose–response gradient, and all plausible confounding acting to reduce the demonstrated effect.

4.3.4.2. Prognostic reviews

The quality of evidence for prognostic studies was evaluated according to the criteria given in Table 5. If data were meta-analysed, the quality for pooled studies was presented. If the data were not pooled, then a quality rating was presented for each study.

Table 5. Description of quality elements for prospective studies.

Table 5

Description of quality elements for prospective studies.

4.3.4.2.1. Inconsistency

Inconsistency was assessed as for intervention studies.

4.3.4.2.2. Imprecision

In meta-analysed outcomes, or for non-pooled outcomes, the position of the 95% CIs in relation to the null line determined the existence of imprecision. If the 95% CI did not cross the null line then no serious imprecision was recorded. If the 95% CI crossed the null line then serious imprecision was recorded.
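The null-line rule above is simple enough to express directly. This is an illustrative sketch (not NICE software); a null line of 1.0 is assumed, as for ratio measures of effect.

```python
def prognostic_imprecision(ci_lower, ci_upper, null=1.0):
    """Serious imprecision (-1) if the 95% CI crosses the null line,
    otherwise no serious imprecision (0)."""
    return -1 if ci_lower < null < ci_upper else 0
```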

4.3.4.2.3. Overall grading

Quality rating started at high for prospective studies, and each major limitation brought the rating down by 1 increment to a minimum grade of Very Low, as explained for interventional reviews. For prognostic reviews prospective cohort studies with a multivariate analysis are regarded as the gold standard because RCTs are usually inappropriate for these types of review for ethical or pragmatic reasons. Furthermore, if the study is looking at more than 1 risk factor of interest then randomisation would be inappropriate as it can only be applied to 1 of the risk factors.

4.3.4.3. Diagnostic reviews

Risk of bias and indirectness of evidence for diagnostic data were evaluated by study using the Quality Assessment of Diagnostic Accuracy Studies version 2 (QUADAS-2) checklists (see Appendix H in the NICE guidelines manual 2014120). Risk of bias and applicability in primary diagnostic accuracy studies in QUADAS-2 consists of 4 domains (see Table 6):

  • patient selection
  • index test
  • reference standard
  • flow and timing.

Table 6. Summary of QUADAS-2 with list of signalling, risk of bias and applicability questions.

Table 6

Summary of QUADAS-2 with list of signalling, risk of bias and applicability questions.

4.3.4.3.1. Inconsistency

Inconsistency refers to an unexplained heterogeneity of results for an outcome across different studies. Inconsistency was assessed by inspection of the sensitivity value (the primary measure), using the point estimates and 95% CIs of the individual studies on the forest plots. Particular attention was paid to values above or below 50% (a diagnosis based on chance alone) and to the 90% threshold set by the GDG (the value above which a test would be acceptable to recommend). The evidence was downgraded by 1 increment if the individual studies varied across 2 areas (50–90% and 90–100%) and by 2 increments if the individual studies varied across 3 areas (0–50%, 50–90% and 90–100%).
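The banding rule above can be sketched as follows (an illustrative Python sketch, not NICE software; sensitivities are expressed as proportions):

```python
def diagnostic_inconsistency(sensitivities):
    """Downgrade by (number of bands spanned - 1).

    Bands: 0-50% (no better than chance), 50-90% (below the GDG's
    90% threshold for recommending a test), and 90-100%.
    """
    def band(s):
        return 0 if s < 0.50 else (1 if s < 0.90 else 2)
    return -(len({band(s) for s in sensitivities}) - 1)
```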

4.3.4.3.2. Imprecision

The judgement of precision was based on visual inspection of the confidence region around the summary sensitivity and specificity point from the diagnostic meta-analysis, if a diagnostic meta-analysis was conducted. Where a diagnostic meta-analysis was not conducted, imprecision was assessed according to the range of point estimates or, if only 1 study contributed to the evidence, the confidence interval around the single study result. As a rule of thumb (agreed after discussion with the GDG), a variation of 0–20% was considered precise, 20–40% serious imprecision, and >40% very serious imprecision. Imprecision was assessed on the primary outcome measure for decision-making (sensitivity).
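For the non-pooled case, the rule of thumb above reduces to a range check on the point estimates (an illustrative sketch; estimates are expressed as proportions):

```python
def diagnostic_imprecision(point_estimates):
    """Apply the rule of thumb on the spread of sensitivity estimates:
    <=20% precise (0), 20-40% serious (-1), >40% very serious (-2)."""
    spread = max(point_estimates) - min(point_estimates)
    if spread <= 0.20:
        return 0
    return -1 if spread <= 0.40 else -2
```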

4.3.4.3.3. Overall grading

Quality rating started at High for prospective and retrospective cross sectional studies, and each major limitation (risk of bias, indirectness, inconsistency and imprecision) brought the rating down by 1 increment to a minimum grade of Very Low, as explained for interventional studies.

4.3.5. Assessing clinical importance

The GDG assessed the evidence by outcome in order to determine if there was, or potentially was, a clinically important benefit, a clinically important harm or no clinically important difference between interventions. To facilitate this, binary outcomes were converted into absolute risk differences (ARDs) using GRADEpro56 software: the median control group risk across studies was used to calculate the ARD and its 95% CI from the pooled risk ratio.

The assessment of clinical benefit, harm, or no benefit or harm was based on the point estimate of absolute effect for intervention studies, which was standardised across the reviews. For most outcomes in the intervention reviews, the GDG considered that if at least 100 more participants per 1000 (10%) achieved the outcome of interest in the intervention group compared with the comparison group for a positive outcome, then the intervention would be considered beneficial. The same point estimate, but in the opposite direction, applied for a negative outcome. For adverse events, 50 or more events per 1000 represented clinical harm. For continuous outcomes, if the mean difference was greater than the minimally important difference (MID) then this represented a clinical benefit or harm. For critical outcomes such as mortality, any reduction or increase was considered to be clinically important.
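The conversion to absolute effects and the default thresholds above can be sketched as follows. This is illustrative only (GRADEpro performs the actual calculation, including the CI around the absolute risk difference):

```python
def ard_per_1000(median_control_risk, pooled_rr):
    """Absolute risk difference per 1000 participants, derived from the
    median control group risk and the pooled risk ratio."""
    return 1000 * median_control_risk * (pooled_rr - 1)

def is_clinically_important(ard, adverse_event=False):
    """Apply the GDG's default thresholds: 100 more/fewer per 1000 for
    most outcomes, 50 per 1000 for adverse events."""
    return abs(ard) >= (50 if adverse_event else 100)
```

For example, with a median control group risk of 0.30 and a pooled RR of 1.50, the absolute effect is 150 more per 1000, which exceeds the 100 per 1000 threshold.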

This assessment was carried out by the GDG for each critical outcome, and an evidence summary table was produced to compile the GDG's assessments of clinical importance per outcome, alongside the evidence quality and the uncertainty in the effect estimate (imprecision).

4.3.6. Clinical evidence statements

Clinical evidence statements are summary statements that are included in each review chapter, and which summarise the key features of the clinical effectiveness evidence presented. The wording of the evidence statements reflects the certainty or uncertainty in the estimate of effect. The evidence statements are presented by outcome and encompass the following key features of the evidence:

  • The number of studies and the number of participants for a particular outcome.
  • An indication of the direction of clinical importance (if 1 treatment is beneficial or harmful compared to the other, or whether there is no difference between the 2 tested treatments).
  • A description of the overall quality of the evidence (GRADE overall quality).

4.4. Identifying and analysing evidence of cost-effectiveness

The GDG is required to make decisions based on the best available evidence of both clinical effectiveness and cost-effectiveness. Guideline recommendations should be based on the expected costs of the different options in relation to their expected health benefits (that is, their ‘cost-effectiveness’) rather than the total implementation cost.120 Thus, if the evidence suggests that a strategy provides significant health benefits at an acceptable cost per patient treated, it should be recommended even if it would be expensive to implement across the whole population.

Health economic evidence was sought relating to the key clinical issues being addressed in the guideline. Health economists:

  • Undertook a systematic review of the published economic literature.
  • Undertook new cost-effectiveness analysis in priority areas.

4.4.1. Literature review

The health economists:

  • Identified potentially relevant studies for each review question from the health economic search results by reviewing titles and abstracts. Full papers were then obtained.
  • Reviewed full papers against pre-specified inclusion and exclusion criteria to identify relevant studies (see below for details).
  • Critically appraised relevant studies using economic evaluations checklists as specified in the NICE guidelines manual.120,123
  • Extracted key information about the studies' methods and results into economic evidence tables (included in Appendix I).
  • Generated summaries of the evidence in NICE economic evidence profile tables (included in the relevant chapter for each review question) – see below for details.

4.4.1.1. Inclusion and exclusion criteria

Full economic evaluations (studies comparing costs and health consequences of alternative courses of action: cost-utility, cost-effectiveness, cost-benefit and cost-consequences analyses) and comparative costing studies that addressed the review question in the relevant population were considered potentially includable as economic evidence.

Studies that only reported cost per hospital (not per patient), or only reported average cost-effectiveness without disaggregated costs and effects were excluded. Literature reviews, abstracts, posters, letters, editorials, comment articles, unpublished studies and studies not in English were excluded. Studies published before 1999 and studies from non-OECD countries or the USA were also excluded, on the basis that the applicability of such studies to the present UK NHS context is likely to be too low for them to be helpful for decision-making.

Remaining health economic studies were prioritised for inclusion based on their relative applicability to the development of this guideline and the study limitations. For example, if a high quality, directly applicable UK analysis was available, then other less relevant studies may not have been included. However, in this guideline, no economic studies were excluded on the basis that more applicable evidence was available.

For more details about the assessment of applicability and methodological quality see Table 7 below and the economic evaluation checklist (Appendix G of the 2012 NICE guidelines manual123) and the health economics review protocol in Appendix D.

Table 7. Content of NICE economic evidence profile.

Table 7

Content of NICE economic evidence profile.

When no relevant health economic studies were found from the economic literature review, relevant UK NHS unit costs related to the compared interventions were presented to the GDG to inform the possible economic implications of the recommendations.

4.4.1.2. NICE economic evidence profiles

NICE economic evidence profile tables were used to summarise cost and cost-effectiveness estimates for the included health economic studies in each review chapter. The economic evidence profile shows an assessment of applicability and methodological quality for each economic study, with footnotes indicating the reasons for the assessment. These assessments were made by the health economist using the economic evaluation checklist from the NICE guidelines manual.123 It also shows the incremental costs, incremental effects (for example, quality-adjusted life years [QALYs]) and incremental cost-effectiveness ratio (ICER) for the base case analysis in the study, as well as information about the assessment of uncertainty in the analysis. See Table 7 for more details.

When a non-UK study was included in the profile, the results were converted into pounds sterling using the appropriate purchasing power parity.132

4.4.2. Undertaking new health economic analysis

As well as reviewing the published health economic literature for each review question, as described above, new health economic analysis was undertaken by the health economists in selected areas. Priority areas for new analysis were agreed by the GDG after formation of the review questions and consideration of the existing health economic evidence.

The GDG identified the highest priority areas for original health economic modelling as:

  • risk factors for NAFLD or severe NAFLD
  • the appropriate investigations for diagnosing NAFLD
  • the appropriate investigations for identifying the stage of NAFLD
  • how often people with NAFLD or NASH should be monitored.

This was due to the number of people affected by these questions and the current uncertainty as to what the most cost-effective solutions would be, due to the lack of published economic models encompassing the whole pathway of liver disease from early NAFLD to end-stage liver disease. New work was therefore conducted, which entailed the development of the NGC liver disease pathway model (LDPM) to address all of the questions prioritised for this guideline (as well as to address additional questions raised in the NICE cirrhosis guideline).

The following general principles were adhered to in developing the cost-effectiveness analysis:

  • Methods were consistent with the NICE reference case for interventions with health outcomes in NHS settings.120,124
  • The GDG was involved in the design of the model, selection of inputs and interpretation of the results.
  • Model inputs were based on the systematic review of the clinical literature supplemented with other published data sources where possible.
  • When published data were not available GDG expert opinion was used to populate the model.
  • Model inputs and assumptions were reported fully and transparently.
  • The results were subject to sensitivity analysis and limitations were discussed.
  • The model was peer-reviewed by another health economist at the NGC.

Full methods for the cost-effectiveness analysis are described in Appendix N.

4.4.3. Cost-effectiveness criteria

NICE's report ‘Social value judgements: principles for the development of NICE guidance’ sets out the principles that GDGs should consider when judging whether an intervention offers good value for money.122 In general, an intervention was considered to be cost-effective (given that the estimate was considered plausible) if either of the following criteria applied:

  • the intervention dominated other relevant strategies (that is, it was both less costly in terms of resource use and more clinically effective compared with all the other relevant alternative strategies), or
  • the intervention cost less than £20,000 per QALY gained compared with the next best strategy.
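The two criteria above can be sketched as a pairwise comparison against the next best strategy. This is an illustrative simplification (in practice dominance is assessed against all relevant alternatives, and judgements of plausibility also apply):

```python
def is_cost_effective(cost, qalys, comp_cost, comp_qalys, threshold=20000):
    """Apply the two criteria: dominance over the comparator, or an
    incremental cost-effectiveness ratio below £20,000 per QALY gained."""
    d_cost = cost - comp_cost
    d_qalys = qalys - comp_qalys
    if d_cost < 0 and d_qalys > 0:
        return True  # less costly and more effective: dominates
    if d_qalys > 0:
        return d_cost / d_qalys < threshold  # ICER vs next best strategy
    return False
```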

If the GDG recommended an intervention that was estimated to cost more than £20,000 per QALY gained, or did not recommend one that was estimated to cost less than £20,000 per QALY gained, the reasons for this decision are discussed explicitly in the ‘Recommendations and link to evidence’ section of the relevant chapter, with reference to issues regarding the plausibility of the estimate or to the factors set out in ‘Social value judgements: principles for the development of NICE guidance’.122

When QALYs or life years gained are not used in the analysis, results are difficult to interpret unless one strategy dominates the others with respect to every relevant health outcome and cost.

4.4.4. In the absence of economic evidence

When no relevant published health economic studies were found, and a new analysis was not prioritised, the GDG made a qualitative judgement about cost-effectiveness by considering expected differences in resource use between options and relevant UK NHS unit costs, alongside the results of the review of clinical effectiveness evidence.

The UK NHS costs reported in the guideline are those that were presented to the GDG and were correct at the time recommendations were drafted. They may have changed subsequently before the time of publication. However, we have no reason to believe they have changed substantially.

4.5. Developing recommendations

Over the course of the guideline development process, the GDG was presented with:

  • Evidence tables of the clinical and economic evidence reviewed from the literature. All evidence tables are in Appendices H and I.
  • Summaries of clinical and economic evidence and quality (as presented in Chapters 5–17).
  • Forest plots and diagnostic meta-analysis plots (Appendix K).
  • A description of the methods and results of the cost-effectiveness analysis undertaken for the guideline (Appendix N).

Recommendations were drafted on the basis of the GDG's interpretation of the available evidence, taking into account the balance of benefits, harms and costs between different courses of action. This was either done formally in an economic model, or informally. Firstly, the net clinical benefit over harm (clinical effectiveness) was considered, focusing on the critical outcomes. When this was done informally, the GDG took into account the clinical benefits and harms when 1 intervention was compared with another. The assessment of net clinical benefit was moderated by the importance placed on the outcomes (the GDG's values and preferences), and the confidence the GDG had in the evidence (evidence quality). Secondly, the GDG assessed whether the net clinical benefit justified any differences in costs between the alternative interventions.

When clinical and economic evidence was of poor quality, conflicting or absent, the GDG drafted recommendations based on its expert opinion. The considerations for making consensus-based recommendations include the balance between potential harms and benefits, the economic costs compared to the economic benefits, current practices, recommendations made in other relevant guidelines, patient preferences and equality issues. The consensus recommendations were agreed through discussions in the GDG. The GDG also considered whether the uncertainty was sufficient to justify delaying making a recommendation to await further research, taking into account the potential harm of failing to make a clear recommendation (see Section 4.5.1 below).

The GDG considered the appropriate ‘strength’ of each recommendation. This takes into account the quality of the evidence but is conceptually different. Some recommendations are ‘strong’ in that the GDG believes that the vast majority of healthcare and other professionals and patients would choose a particular intervention if they considered the evidence in the same way that the GDG has. This is generally the case if the benefits clearly outweigh the harms for most people and the intervention is likely to be cost-effective. However, there is often a closer balance between benefits and harms, and some patients would not choose an intervention whereas others would. This may happen, for example, if some patients are particularly averse to some side effect and others are not. In these circumstances the recommendation is generally weaker, although it may be possible to make stronger recommendations about specific groups of patients.

The GDG focused on the following factors in agreeing the wording of the recommendations:

  • The actions health professionals need to take.
  • The information readers need to know.
  • The strength of the recommendation (for example the word ‘offer’ was used for strong recommendations and ‘consider’ for weak recommendations).
  • The involvement of patients (and their carers if needed) in decisions on treatment and care.
  • Consistency with NICE's standard advice on recommendations about drugs, waiting times and ineffective interventions (see Section 9.2 in the 2014 NICE guidelines manual120).

The main considerations specific to each recommendation are outlined in the ‘Recommendations and link to evidence’ sections within each chapter.

4.5.1. Research recommendations

When areas were identified for which good evidence was lacking, the GDG considered making recommendations for future research. Decisions about inclusion were based on factors such as:

  • the importance to patients or the population
  • national priorities
  • potential impact on the NHS and future NICE guidance
  • ethical and technical feasibility.

4.5.2. Validation process

This guidance is subject to a 6-week public consultation and feedback as part of the quality assurance and peer review of the document. All comments received from registered stakeholders are responded to in turn and posted on the NICE website.

4.5.3. Updating the guideline

Following publication, and in accordance with the NICE guidelines manual, NICE will undertake a review of whether the evidence base has progressed significantly to alter the guideline recommendations and warrant an update.

4.5.4. Disclaimer

Healthcare providers need to use clinical judgement, knowledge and expertise when deciding whether it is appropriate to apply guidelines. The recommendations cited here are a guide and may not be appropriate for use in all situations. The decision to adopt any of the recommendations cited here must be made by practitioners in light of individual patient circumstances, the wishes of the patient, clinical expertise and resources.

The National Guideline Centre disclaims any responsibility for damages arising out of the use or non-use of this guideline and the literature used in support of this guideline.

4.5.5. Funding

The National Guideline Centre was commissioned by the National Institute for Health and Care Excellence to undertake the work on this guideline.

Copyright © National Institute for Health and Care Excellence 2016.
Bookshelf ID: NBK384743
