This chapter documents the results of this study in several parts. We first discuss the outcome of our data collection efforts (chiefly the two literature searches, one for rating study quality and the second for grading the strength of a body of evidence). We then provide our findings for rating study quality overall and by study type (i.e., systematic reviews, randomized controlled trials [RCTs], observational studies, and diagnostic studies). Last, we provide our findings on grading the strength of a body of evidence. Detailed tabular information is derived from the full assessments of all types of studies provided in Grids 1-4 (Appendix B) and Grid 5 (Appendix C); the domain labels used in the tables in this chapter are in some cases abbreviated versions of the domains defined in Tables 7-11 in Chapter 2 (e.g., funding or sponsorship is denoted funding).
For both study quality and strength of evidence, we identify selected systems that appear to cover domains we regard as particularly important. These systems might be regarded as ones that could be used today with confidence that they represent the current state of the art of assessing study quality or strength of evidence. Chapter 4, Discussion, examines the implications of these findings in more detail and gives our recommendations for research priorities concerned with systems for rating the scientific evidence for evidence reviews and technology assessments.
Data Collection Efforts
Rating Study Quality
Our first task was to identify instruments ("systems" in the original legislation mandating this report for the Agency for Healthcare Research and Quality [AHRQ]) for rating study quality. During our search process, we identified scales, checklists, and evaluations of quality components. In addition, we identified publications that discussed the importance of assessing article quality and that included quality items for consideration; we refer to these publications as guidance documents. To be complete, we include the guidance documents in Grids 1-4 (Appendix B), but in their current state we do not believe such documents can or should be used to rate the quality of individual studies.
Overall, we reviewed 82 different quality rating instruments or guidance documents for all four grids. This number encompasses reference papers that describe a study quality rating scheme or a rating method that is specific to work from an AHRQ-supported Evidence-based Practice Center (EPC). Because several of these 82 systems could be used to rate quality for more than one study design, we included them on multiple grids. Some came from our literature search, but we identified most by reviewing the previous effort of the Research Triangle Institute-University of North Carolina EPC [1] and the work of Moher et al. [101] and by hand searching Internet sites and bibliographies.
As shown in Table 12, we assessed 20 systems for Grid 1 (systematic reviews), 49 systems for Grid 2 (RCTs), 19 for Grid 3 (observational studies), and 18 for Grid 4 (diagnostic studies). These systems can be characterized by instrument type as scales, checklists, or component evaluations; guidance documents; and EPC quality rating systems.
Grading the Strength of a Body of Evidence
We found it difficult to discern the most productive, yet specific, search terms for identifying literature that discussed grading a body of evidence. We approached our search from many different perspectives. In the end, although we identified numerous papers through the search, we found the majority of the relevant publications through hand searches and contacts with experts in the field. We suspect that, at present, the subject headings for coding the literature on this topic are not adequate to yield an appropriately thorough and productive search.
Thus, many of the 40 systems on which we provide information in Grid 5 (Appendix C) were identified through other sources or by reviewing bibliographies from papers retrieved by the search. Excluding the six evidence grading systems developed by the EPCs, approximately two-thirds (n = 22) of the remaining 34 systems arose from the guideline or clinical recommendations literature. Only 12 of the evidence grading systems we reviewed were developed for nonguideline needs, such as literature synthesis or evidence-based practice in general.
Findings for Systems to Rate the Quality of Individual Studies
Background
Chapter 2 describes the four study quality grids in Appendix B, including both the domains and elements used to compare rating systems (see Tables 7-10) and the properties used to describe them.
Evaluation According to Domains and Elements
The first part of each grid provides our assessment of the extent to which each system covered the relevant domains; we used a simple categorical scheme for this assessment:
- "Yes" (, the system fully addressed the domain);
- "No" (, it did not address the domain at all); or
- "Partial" (, it addressed the domain to some extent).
In defining domains, we differentiated between "empirical" elements and "good (or best) practice" elements. The former have been shown to affect the conduct and/or analysis of a study based on the results of rigorously designed methodological research. The latter elements have been identified as critical for the design of a well-conducted study but have not been tested in real life. As noted in Chapter 2 (and Appendix D), few empirical studies have been conducted; as a result, we have specified few empirical elements. Results of our analysis of each system appear below.
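To illustrate how such a scheme might be operationalized, the sketch below encodes a grid row in Python. It is only a minimal illustration under our own assumptions: the domain names are abbreviated examples drawn from Table 7, the "all elements covered" rule for a Yes is a simplification of the qualitative judgments we actually made, and all function and variable names are ours.

```python
from enum import Enum

class Coverage(Enum):
    YES = "fully addressed the domain"
    PARTIAL = "addressed the domain to some extent"
    NO = "did not address the domain at all"

# Hypothetical encoding of two domains: each lists its elements and flags
# which of them are empirically derived (the rest are best-practice items).
DOMAINS = {
    "search strategy": {
        "elements": {"comprehensive search", "justified restrictions",
                     "databases named", "years covered stated"},
        "empirical": {"comprehensive search", "justified restrictions"},
    },
    "funding or sponsorship": {
        "elements": {"funding source reported"},
        "empirical": {"funding source reported"},
    },
}

def rate_domain(domain: str, items_addressed: set) -> Coverage:
    """Simplified rule: Yes if every element is addressed, Partial if some
    are, No if none are. (Our actual ratings were qualitative judgments.)"""
    elements = DOMAINS[domain]["elements"]
    covered = elements & items_addressed
    if covered == elements:
        return Coverage.YES
    return Coverage.PARTIAL if covered else Coverage.NO

print(rate_domain("funding or sponsorship", {"funding source reported"}))
# -> Coverage.YES
```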
Description According to Key Characteristics
The second, descriptive part of each grid (see Table 6) provides general information on each rating system (e.g., type of system; whether inter-rater reliability had been assessed; how rigorously the system was developed). Although we focused on generic instruments, we did identify 18 "topic-specific" systems or instruments, especially among the EPC rating systems, and we also differentiate among the systems based on whether each is a scale, a checklist, a component evaluation only, or a guidance document.
Item Selection
In terms of the approaches used by system developers to select the specific items or questions in their quality rating instruments, we found it difficult to determine whether they had chosen items on the basis of empirical research (theirs or others') or simply on good practices (accepted) criteria. We based our categorization on whether the authors of the rating system referenced any empirical studies. One system included only empirical items [34]; another was a component evaluation of two empirical elements for RCTs (randomization and allocation concealment) [51]. The remaining systems were based on accepted criteria, a mixture of accepted and empirical criteria, or modifications of another system.
Rigorous Development
As described in Chapter 1, a quality rating instrument could be developed in several steps, one of which is to measure inter-rater reliability. However, inter-rater reliability is only one facet of the instrument development process; by itself, it does not make an instrument "rigorously developed." We gave a system a Yes rating for a rigorous development process if the authors indicated that they used "typical instrument development techniques," regardless of our rating for inter-rater reliability. Developmental rigor was typically a No for guidance documents, but we did give a Partial to some guidance documents because their quality criteria had been developed through formal expert consensus.
Inter-rater Reliability
Inter-rater reliability had been assessed in only 39 percent of the scales and checklists we reviewed, including those from the EPCs. We gave five systems (8 percent) a Partial rating for inter-rater reliability because the developers evaluated agreement among their raters but did not present the actual statistics. Inter-rater reliability was not relevant for guidance documents (always a No).
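For readers unfamiliar with how such agreement is quantified, inter-rater reliability for categorical ratings of this kind is commonly summarized with Cohen's kappa, which corrects observed agreement for the agreement expected by chance. A minimal sketch follows; the kappa formula is standard, but the two raters' labels are invented for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels to the
    same items: (observed agreement - chance agreement) / (1 - chance)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: proportion of items the raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / n**2
    return (p_o - p_e) / (1 - p_e)

# Invented example: two reviewers rate six quality items Yes/Partial/No.
a = ["Yes", "Yes", "Partial", "No", "Yes", "Partial"]
b = ["Yes", "Partial", "Partial", "No", "Yes", "No"]
print(round(cohens_kappa(a, b), 2))  # 0.5, i.e., moderate agreement
```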
Quality Definition and Scoring
The last two descriptive items for quality rating systems -- whether quality was defined or described and whether instructions were provided for use -- had been included in an earlier summary of quality rating systems prepared by Moher and colleagues [107]. Of the 82 systems we evaluated, 53 (65 percent) discussed their definition of quality to some extent (Yes or Partial for the category). Most of the systems did provide information on how to score each of the quality items; 64 systems (78 percent) were given either a Yes or a Partial for instructions.
Rating Systems for Systematic Reviews
Type of System or Instrument
Twenty systems were concerned with systematic reviews or meta-analyses (Grid 1). Of these (Table 13), we categorized one as a scale [3] and 10 as checklists [4-14]. The remainder are considered guidance documents [15-23,59,68]. In the presentation below, we group scales and checklists into one set of results and comment on guidance documents separately.
Evaluation of Systems According to Coverage of Empirical or Best Practices Domains
Empirical Domains
The 11 domains used for assessing these systems (Table 13 or Grid 1A) reflect characteristics specific to both systematic reviews and general study design (see Table 7). Of these domains, four contain elements that are derived from empirical research: search strategy, study quality, data synthesis, and funding or sponsorship. Funding had only a single element (and it had an empirical basis). The study quality and data synthesis domains each comprised two or more elements (but only one element was empirically derived). Search strategy had four elements (of which two were empirical -- comprehensive search strategy and justification of search restrictions). We give particular attention in the results below to the extent to which the systems we reviewed covered these empirical domains.
The one scale addressed all four domains with empirical elements (with a Partial grade for search strategy) [3]. Of the 10 checklists, that by Sacks and colleagues fully addressed all four domains with empirical elements [7]. The checklist developed by Auperin and colleagues addressed three of the four empirically derived domains fully; the Partial score was for the study quality domain [8].
All of the remaining eight systems excluded funding [4-6,9-14]. Five systems fully addressed three of the four empirically derived domains, omitting only funding [4-6,11,12,14]. The remaining three systems either did not address one or more empirically derived domains [9,13] or did so only partially [10].
Best Practices Domains
The remaining seven domains -- study question, inclusion and exclusion criteria, interventions, outcomes, data extraction, results, and discussion -- come from best practices criteria. We included these for comparison purposes, mainly because many of the systems we evaluated included items addressing these domains.
The scale by Barnes and Bero fully addressed study question and inclusion/exclusion criteria but did not deal with, or only partially addressed, interventions, outcomes, data extraction, results, and discussion [3]. Of the 10 checklists, only one fully addressed all these best practices domains [12], and two others addressed these domains to some degree [7,8]. The remaining seven systems entirely omitted one or more of these seven domains [4-6,9-11,13,14].
Every system addressed the inclusion/exclusion criteria at least partially. Most of these systems covered study question and results, but which of the other domains were excluded varied by system. One checklist did not address results in any way [10]. Four systems did not include intervention at all [4,5,9,11,13]; four did not include outcomes [3,9-11]; and five did not include data extraction [3,10,11,13,14]. The discussion domain was absent from four systems [4-6,14] and rated as Partial for five others [3,7,8,10,13].
Because guidance documents have not been developed as tools for assessing quality per se, we did not contrast them with the scales and checklists and included them primarily for illustrative purposes. As with the scales and checklists, results varied among the guidance documents. The two consensus statements that provide reporting guidelines include nearly all of the 11 domains. MOOSE included all 11 but received a Partial for the intervention domain [23]. The QUOROM statement did not include funding [21].
Evaluation of Systems According to Descriptive Attributes
According to the descriptive information available in Grid 1B, none of the scales and checklists underwent rigorous development as defined earlier. We gave two checklists a score of Partial for this attribute [11,14], mainly because their quality domains were selected by consensus. Four systems provided inter-rater reliability estimates suggesting that quality ratings from multiple reviewers are consistent [3-5,8,9]. Interestingly, none of the systems that measured inter-rater reliability had been rigorously developed.
Evaluation of Systems According to Seven Domains Considered Informative for Study Quality
Apart from the four domains that contained empirical elements, we concluded that three additional domains provide important information on the quality of a systematic review -- study question, inclusion/exclusion criteria, and data extraction. The degree to which instruments concerned with systematic reviews covered these three domains is described just below, followed by a discussion of systems that appeared to deal with all seven domains.
Study Question
A clearly specified study question is important to define the search appropriately, determine which articles to exclude from the analysis, focus the interventions and outcomes, and conduct a meaningful data synthesis. Only two of the 20 systems omitted study question as a domain [17,22], and an additional two received a Partial score for this domain [8,10].
Inclusion/Exclusion
After the search is completed, determination of article eligibility is based on clearly specified selection criteria with reasons for inclusion and exclusion. Developing and adhering to strict inclusion and exclusion criteria makes the systematic review more reproducible and less subject to selection bias. Of the 20 systems we reviewed, every one addressed the inclusion/exclusion domain, with only three systems receiving a Partial for this domain [4,5,14,15].
Data Extraction
How data were extracted from individual articles is often overlooked in assessing the quality of a systematic review. Like the search strategy domain, the data extraction domain provides useful insight into the reproducibility of the systematic review. Reviews that do not use dual extraction may miss or misrepresent important concepts. Of the 20 systems we reviewed, six omitted data extraction altogether [3,10,11,13,14,22] and three were given a Partial score for this domain [4,5,15,19].
Coverage of Seven Key Domains
To arrive at a set of high-performing scales or checklists pertaining to systematic reviews, we took account of seven domains in all: study question, search strategy, inclusion/exclusion criteria, data extraction, study quality, data synthesis, and funding. We then used these seven domains as the criteria by which to identify a selected group of systems that could be said with some confidence to represent acceptable approaches that could be used today without major modifications. These are depicted in Table 14.
Five systems met most of the criteria for systematic reviews. One checklist fully addressed all seven domains [7]. A second checklist also addressed all seven domains but merited only a Partial for study question and study quality [8]. Two additional checklists [6,12] and the one scale [3] addressed six of the domains. These latter two checklists excluded funding; the scale omitted data extraction and had a Partial score for search strategy.
Rating Systems for Randomized Controlled Trials
Type of System or Instrument
In evaluating systems concerned with RCTs, we reviewed 20 scales [18,24-42], 11 checklists [12-14,43-50], one component evaluation [51], and seven guidance documents [1,11,52-57]. In addition, we reviewed 10 EPC rating systems [58-68]. In the presentation below, we group scales, checklists, and the component system into a single set of results. We comment on guidance documents and EPC rating systems separately.
Our literature search focused on articles that described quality rating systems from 1995 until June 2000. Earlier work in this field had identified many scales and checklists for evaluating RCTs [1,107], so we did not duplicate that effort. We did review and include many systems that we identified through the bibliographies of the more recent articles on RCT quality rating systems.
Evaluation of Systems According to Coverage of Empirical or Best Practices Domains
Empirical Domains
The 10 domains used for assessing these systems (Table 15 or Grid 2A) reflect characteristics specific to both RCTs and general study design (see Table 8). Of these domains, four contain elements that are derived from empirical research: randomization, blinding, statistical analysis, and funding or sponsorship. Both blinding and funding had only a single element (which was based on empirical research). The randomization domain comprised three elements, all of which were empirically derived. Statistical analysis had four elements, only one of which was empirically derived. In the results below, we focus on the extent to which the systems we reviewed covered these empirical domains.
Of the 32 scales, checklists, and component systems concerned with RCTs (Grid 2), only two fully addressed the four domains with empirical elements [25,45]. An additional 12 systems fully addressed randomization, blinding, and statistical analysis but not source of funding [12,14,18,26,36,38-42,49,51]. If we consider the systems that addressed the first three domains (randomization, blinding, statistical analysis) either partially or fully, we would add another 14 to this count [13,25,27-29,31-35,37,43,44,47,48]. Thus, only four of the RCT scales or checklists failed to address one or more of the three empirical domains of randomization, blinding, or statistical analysis [29,30,46,50].
Best Practices Domains
The remaining six domains -- study question, study population, interventions, outcomes, results, and discussion -- come from best practices criteria. We included these for comparison purposes and because many of the systems we evaluated included items addressing these domains.
Of the 14 scales, checklists, and the component evaluation (Table 15) that fully addressed the three empirical domains -- randomization, blinding, and statistical analysis -- few included either study question or discussion [14,38,40,45]. However, 11 systems did address three other domains -- study population, intervention, and results -- either partially or fully [12,14,18,24,26,36,38-40,42,45]. Of these 11 systems, 10 also included outcomes as a domain; the exception is the work of the NHS Centre for Reviews and Dissemination [12]. Thus, these 11 systems included, either fully or in part, most of the domains that we selected to compare across systems.
Because guidance documents have not been developed as tools for assessing quality per se, we have examined them primarily for illustrative purposes (Table 16). The number of domains addressed in the guidance documents varied by system -- from as few as three to all 10 of the domains. The consensus statements typically include most of the 10 domains [55-57]. The earliest consensus statement fully addressed seven domains, partially addressed one other, and failed to address two domains [55]. The Asilomar Working Group included all 10 domains but received a Partial for the randomization, blinding, and statistical analysis domains [56]. The most recent CONSORT statement fully addressed nine domains, omitting funding [57].
Of the 10 EPC rating systems (see Grid 2A in Appendix B), all included both randomization and blinding at least partially. Statistical analysis was addressed either fully or partially by all but one system [63]. Study population, interventions, outcomes, and results were covered fully by five EPC systems [60,61,63,65,66]. EPC quality systems for RCTs rarely included either study question or discussion.
Evaluation of Systems According to Descriptive Attributes
The RCT system attributes are compared in Grid 2B (Appendix B). Most systems provided their definition of quality and selected their quality domains based on best practices criteria. Several used both best practices and empirical criteria for the selection process. Eight non-EPC scales and checklists were modifications of other systems [26,27,31,33,35,37,41,44].
According to their authors, five scales underwent rigorous scale development along with the calculation of inter-rater reliabilities [34,35,37,38,40]; the one component system was both rigorously developed and assessed for inter-rater reliability [51]. Several scales and checklists were given a Partial score for their development process [14,27,30-32,48]; three of these also reported inter-rater reliability [30-32].
Evaluation of Systems According to Seven Domains Considered Informative for Study Quality
As noted above, we identified four empirically based quality domains. To these we added three domains derived from best practices -- study population, interventions, and outcomes -- that we regarded as important for evaluating the quality of RCTs.
Study Population
The most important element in the study population domain is the specification of inclusion and exclusion criteria for entry of participants into the trial. Although such criteria constrain the population being studied (thereby making the study less generalizable), they reduce heterogeneity among the persons being studied. In addition, the criteria reduce variability, which improves our ability to detect a treatment effect if one truly exists.
Interventions
Intervention is another important quality domain mainly for one of its elements -- that the intervention be clearly defined. For reasons of reproducibility both within the study and for comparison with other studies, investigators ought to describe fully the intervention under study with respect to dose, timing, administration, or other factors. Paying careful attention to the details of an intervention also tends to reduce variability among the subjects, which also influences what can be said about the study outcome.
Outcomes
As important as it is to describe the intervention clearly, it is also critical to specify clearly the outcomes under study and how they are to be measured. Again, this is important for both reproducibility and to decrease variability.
Coverage of Seven Key Domains
We designated a set of high-performing scales or checklists pertaining to RCTs by assessing their coverage of the following seven domains: study population, randomization, blinding, interventions, outcomes, statistical analysis, and funding. As with the five systems identified for systematic reviews, we concluded that these eight systems for RCTs represent acceptable approaches that could be used today without major modifications (Table 17).
Two systems fully addressed all seven domains [24,45], and six others addressed all but funding [14,18,26,36,38,40]. Two were rigorously developed [38,40]. We might assume that the rigor with which an instrument was developed matters for assessing quality, but this has not been tested. Users wishing to adopt a system for rating the quality of RCTs will need to do so on the basis of the topic under study, whether a scale or checklist is desired, and apparent ease of use.
Rating Systems for Observational Studies
Type of System or Instrument
Seventeen systems concerned observational studies (Grid 3). Of these, we categorized four as scales [31,32,40,69] and eight as checklists (Table 18) [12-14,45,47,49,50,70]. We classified the remaining five as guidance documents [1,71-74]. Two EPCs used quality rating systems for evaluating observational studies; these systems were identical to those used for RCTs. In the presentation below, we discuss scales and checklists separately from guidance documents and EPC rating systems.
Evaluation of Systems According to Coverage of Empirical or Best Practices Domains
Empirical Domains
The nine domains used for assessing these systems (Grid 3) reflect general study design issues common to observational studies (see Table 9). Of these domains, two have empirical elements: comparability of subjects and funding or sponsorship. Because the funding domain had only one element, a system had to address that element to receive a full Yes for the domain. We did not require that systems address the empirical element, use of concurrent controls, to receive a full Yes grade for the comparability-of-subjects domain. With the exception of one checklist that received a Partial score [70], all scales and checklists received a full Yes rating for the comparability-of-subjects domain. Only one checklist received a full Yes for the funding domain [45].
Best Practices Domains
The remaining seven domains -- study question, study population, exposure/intervention, outcomes, statistical analysis, results, and discussion -- come from best practices criteria. These domains are typically evaluated when critiquing an observational study. Of the 12 scales and checklists in Table 18, half fully addressed study question [14,31,32,40,45,70]; the remainder did not address this domain at all [12,13,47,49,50,69]. Similarly, for the discussion domain, we gave Yes or Partial ratings to only seven instruments [13,31,32,40,45,47,50]. Many systems covered results as a study quality domain, either fully or in part [13,14,31,32,40,45,49,50,70]. We rated the study population, exposure/intervention, outcome measure, and statistical analysis domains as Yes or Partial for most of the scales and checklists we reviewed.
Of the 12 scales and checklists, three fully addressed all these best practices domains [32,40,45]. Five others addressed most of the seven domains to some degree: one omitted exposure/intervention [31], two did not include study question [13,50], and the remaining two missed the discussion domain [14,70]. The remaining four systems entirely omitted two or more of the seven domains [12,47,49,69].
Guidance Documents and EPC Systems
Guidance documents pertinent to observational studies (Grid 3) were not developed as tools for assessing quality, but all of them included comparability of subjects and outcomes either partially or fully. Most also included study population, statistical analysis, and results. The two EPC rating systems for observational studies are the same as those used for RCTs but with minor modifications; they were evaluated using the observational quality domains. One EPC system fully covered seven of the nine domains [60]; it omitted study question and funding. The other EPC system covered four domains -- fully addressing comparability of subjects and outcomes but only partially addressing statistical analysis and results [64].
Evaluation of Systems According to Descriptive Attributes
Of the 12 scales or checklists relating to observational studies, six selected their quality items based on accepted criteria [12,45,47,50,69,70]; five used both accepted and empirical criteria for item selection [13,14,32,40,49]; and one scale was a modification of another system [31]. One system was rigorously developed and provided an estimate of inter-rater reliability [40]. Three others received a Partial score for rigor of development but reported inter-rater reliability as well [31,32,70].
Evaluation of Systems According to Domains Considered Informative for Study Quality
To arrive at a set of high-performing scales or checklists pertaining to observational studies, we considered the following five domains: comparability of subjects, exposure/intervention, outcomes, statistical analysis, and funding or sponsorship. As before, we concluded that systems covering these domains represent acceptable approaches for assessing the quality of observational studies. The inclusion of the two empirical domains (comparability of subjects and funding or sponsorship) is self-explanatory; we explain below why we considered the remaining three domains critical.
Exposure or Intervention
Unlike RCTs where treatment is administered in a controlled fashion, exposure or treatment in observational studies is based on the clinical situation and may be subject to unknown biases. These biases may result from provider, patient, or health care system differences. Thus, a clear description of how the exposure definition was derived is critical for understanding the effects of that exposure on outcome.
Outcomes
Investigators need to supply a specific definition of outcome that is independent of exposure. The presence or absence of an outcome should be based on standardized criteria to reduce bias and enhance reproducibility.
Statistics and Analysis
Of the six elements in the statistical analysis domain, confounding assessment was considered essential for a full Yes rating. Observational studies are particularly subject to several biases; these include measurement bias (usually addressed by specific exposure and outcome definitions) and selection bias (typically addressed by ensuring the comparability among subjects and confounding assessment). We did not consider any of the remaining five statistical analysis elements -- statistical tests, multiple comparisons, multivariate techniques, power calculations, and dose response assessments -- as more important than any other when evaluating systems on this domain.
Coverage of Five Key Domains
Of the 12 scales and checklists we reviewed, all included comparability of subjects either fully or in part. Only one included funding or sponsorship as well as the other four domains we considered critical for observational studies [45]. Five additional systems fully included all four domains other than funding or sponsorship (Table 19) [14,32,40,47,50]. In choosing among these six systems for assessing study quality, one will have to evaluate which system is most appropriate for the task being undertaken, how long it takes to complete each system, and its ease of use. We were unable to evaluate these three instrument properties in this project.
Rating Systems for Diagnostic Studies
Type of System or Instrument
As discussed in Chapter 2, the domains that we used to compare systems for assessing the quality of diagnostic test studies are to be used in conjunction with those relevant for judging the quality of RCTs or observational studies. Thus, here we contrast systems on the basis of five domains -- study population, adequate description of the test, appropriate reference standard, blinded comparison of test and reference, and avoidance of verification bias. We identified 15 systems for assessing the quality of diagnostic studies. Seven are checklists (Grid 4) [12,14,49,75-78,111]; of these, one is a test-specific instrument [111]. The remainder are guidance documents. In addition, three EPCs used systems to evaluate the quality of diagnostic studies [59,68,79,80]. In the discussion below, we comment on the checklists separately from the guidance documents and EPC scales.
Evaluation of Systems According to Coverage of Empirical or Best Practices Domains
Empirical Domains
The five domains used for assessing these systems (Table 10 and Grid 4) reflect design issues specific to evaluating diagnostic tests. Three domains -- study population, adequate description of the test, and avoidance of verification bias -- have only a single, empirical element; the other two domains each contain two elements, one of which has an empirical base.
Of the generic checklists we reviewed (Table 20), three fully addressed all five domains [49,77,78]. Two systems dealt with four of the five domains either fully or in part [12,14]. One checklist, the oldest of those we reviewed, addressed only one domain fully -- use of an appropriate reference standard -- and partially addressed the blinded reference comparison domain [75,76].
Almost all of the nine guidance documents included all these domains. One omitted the avoidance of verification bias domain [71]; another omitted adequate description of the test [6]. Of the three EPC scales, two addressed all five domains either fully [80] or in part [59,68]. We gave the remaining EPC system a No for adequate description of the test under study, although the information about the test was likely to have been captured apart from the quality rating system [79].
Evaluation of Systems According to Descriptive Attributes
The six checklists were all generic instruments. Two systems used accepted criteria for selecting their quality items [75-77]; three used both accepted and empirical criteria [12,14,78]; and one was a modification of another checklist [49]. We gave two checklists a Partial score for development rigor, primarily because they involved some type of consensus process [14,78]. Only the oldest system we reviewed addressed inter-rater reliability [75,76,111].
Evaluation of Systems According to Domains Considered Informative for Study Quality
We consider all five domains in Table 20 to be critical for judging the quality of diagnostic test reports. As noted there, three checklists met all these criteria [49,77,78]. Two others did not address test description, but this omission is easily remedied should users wish to put these systems into practice [12,14]. The oldest system appears to be too incomplete for wide use [75,76].
Findings for Systems to Rate the Strength of a Body of Evidence
Background
Chapter 2 describes the development of the Summary Strength of Evidence Grid (Grid 5A) and the Overall Strength of Evidence Grid (Grid 5B) that appear in Appendix C. Table 11 outlines our domains -- quality, quantity, and consistency -- for grading the strength of a body of evidence and gives their definitions.
We reviewed 40 systems that addressed grading the strength of a body of evidence. In discussing these approaches, we treat the 34 systems identified through our searches and prior research separately from the six developed by EPCs. The non-EPC systems came from numerous international sources, with the earliest coming from Canada. Based on the affiliation of the lead author, they originated as follows: Canada (11), United States (10), United Kingdom (6), Australia/New Zealand (3), the Netherlands (3), and a multinational consensus group (1).
Evaluation According to Domains and Elements
Grid 5A distills the detailed information in Grid 5B. We use the same rating scheme as we did for the quality grids: Yes (the instrument fully addressed the domain); No (it did not address the domain at all); or Partial (it addressed the domain to some extent). Our findings for each system are discussed below.
Quality
The quality domain included only one element, which incorporated our definition of quality (cited in Chapter 1) based on methodologic rigor -- that is, the extent to which bias was minimized. Although the 34 non-EPC systems we reviewed all included study quality in some way -- that is, quality was graded as fully or partially met -- their definitions of quality varied. Many systems defined quality solely by study design, with meta-analyses of RCTs and RCTs in general receiving the highest quality grade [87-89,91,112-121]; we gave these systems a Partial score. Systems indicating that the conduct of the study was incorporated into their definition of quality received a Yes for this domain [11-14,22,39,70,81-86,90,122-128].
Of the six EPC grading systems, five received a full Yes score for quality [59,60,67,68,129]. One EPC system received an NA (not available) for quality because published information about its evidence levels for efficacy did not directly incorporate methodologic rigor [66]. However, we know that this EPC measures study quality as part of its evidence review process.
Quantity
We combined three elements -- numbers of studies, sample size or power, and magnitude of effect -- under the heading of "quantity." As indicated in Chapter 2, a full Yes for this domain required that two of the three elements be covered. The quantity domain included magnitude of effect along with both numbers of studies and sample size because we felt that these three elements together provide assurance that the identified finding is true. Of the 34 non-EPC systems, 16 fully addressed quantity [11,13,22,81-86,88,89,91,117,122,124,125,127], and 15 addressed quantity in part [12,14,39,70,84,90,112-114,118,121,123,126,128]. Three systems did not include magnitude of effect, number of studies, or sample size as part of their evidence grading scheme [117,119,120].
All the EPC systems that assessed the strength of the evidence in their first evidence reports included at least one of the three attributes we required for quantity; five fully addressed this domain [59,65-68], and one did so in part [60].
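The two-of-three rule for quantity is mechanical enough to express directly. The sketch below applies it; the full-Yes threshold comes from the text above, while the Partial/No split for fewer than two elements is our own assumption for illustration, as are the function and variable names.

```python
QUANTITY_ELEMENTS = {"numbers of studies", "sample size or power",
                     "magnitude of effect"}

def grade_quantity(elements_covered: set) -> str:
    """Full Yes requires that at least two of the three quantity elements
    be covered; the one-element Partial and zero-element No are assumed."""
    hits = len(QUANTITY_ELEMENTS & elements_covered)
    if hits >= 2:
        return "Yes"
    return "Partial" if hits == 1 else "No"

print(grade_quantity({"numbers of studies", "magnitude of effect"}))  # Yes
print(grade_quantity({"sample size or power"}))                       # Partial
```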
Consistency
The consistency domain had only one element, but it could be met only if the body of evidence on a given topic itself comprised more than one study. This would typically occur in the development of systematic reviews, meta-analyses, and evidence reports, for which numerous studies are reviewed to arrive at a summary finding. As indicated in Chapter 2, this domain is dichotomous; a Yes indicates that the system took consistency into account, and a No indicates that the system appeared not to consider consistency in its view of the strength of evidence. Of the 34 non-EPC systems, approximately half incorporated the consistency domain into their approach to grading strength of evidence [11,12,14,39,70,81-91]. Only one EPC system included this domain [65].
Evaluation of Systems According to Three Domains That Address the Strength of the Evidence
Domains
As indicated in Table 21, the 34 non-EPC systems incorporated quality, quantity, and consistency to varying degrees. Seven systems fully addressed all three domains [11,81-86]. Nine others incorporated the three domains at least in part [12,14,39,70,87-91].
Of the six EPC grading systems, only one incorporated quality, quantity, and consistency [65]. Four others included quality and quantity either fully or partially [59,60,67,68]. The one remaining EPC system included quantity; study quality is measured as part of that EPC's literature review process, but this domain is apparently not directly incorporated into the grading system [66].
Domains, Publication Year, and Purpose of System
Whether grading systems for the overall strength of evidence addressed all three domains appeared to differ by year of publication. The more recent systems included all three domains, either fully or partially, more frequently than did the older systems. Of the 23 evidence grading systems published before 2000, seven (30 percent) included quality, quantity, and consistency to some degree; the same was true for nine (82 percent) of the 11 systems published in 2000 or later. This wide disparity among the systems can be attributed to the consistency domain, which began to appear more frequently from 2000 onward.
As discussed above, many evidence grading systems came from the clinical practice guideline literature. Table 22 shows that, at least among the 34 non-EPC grading systems, whether the three domains were incorporated differed by year of publication and primary purpose (i.e., guideline development per se or evidence grading). The nonguideline systems tended to incorporate all three domains more often than the guideline systems did, and this trend appears to be increasing over time.
Evaluation of Systems According to Domains Considered Informative for Assessing the Strength of a Body of Evidence
Of the seven systems that fully addressed quality, quantity, and consistency [11,81-86], four were used for developing guidelines or practice recommendations [81-83,86], and the remaining three were used for promoting evidence-based health care [11,84,85].
These seven systems are very different (Table 23). Three appear to provide hierarchical grading of bodies of evidence [82,83,85], and a fourth provides this hierarchy as part of its recommendations language [86]. Whether a hierarchy is desired will depend on the purpose for which the evidence grading is being done. However, as a society, we are used to numerical grading systems for comparing educational attainment, restaurant cleanliness, and other qualities, and a hierarchical system for grading the strength of bodies of evidence would be well understood and received.