NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Nelson HD, Fu R, Humphrey L, et al. Comparative Effectiveness of Medications To Reduce Risk of Primary Breast Cancer in Women [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2009 Sep. (AHRQ Comparative Effectiveness Reviews, No. 17.)

  • This publication is provided for historical reference only and the information may be out of date.

Appendix C. Quality and Strength of Evidence Criteria and Rating

Appendix C-1. Quality Rating Criteria* and Applicability Assessment with PICOTS

Quality Rating Criteria

Randomized Controlled Trials (RCTs) and Cohort Studies

Criteria:

  • Initial assembly of comparable groups: RCTs—adequate randomization, including concealment and whether potential confounders were distributed equally among groups; cohort studies—consideration of potential confounders with either restriction or measurement for adjustment in the analysis; consideration of inception cohorts
  • Maintenance of comparable groups (includes attrition, cross-overs, adherence, contamination)
  • Important differential loss to follow-up or overall high loss to follow-up
  • Measurements: equal, reliable, and valid (includes masking of outcome assessment)
  • Clear definition of interventions
  • Important outcomes considered
  • Analysis: adjustment for potential confounders for cohort studies, or intention-to-treat analysis for RCTs; for cluster RCTs, correction for correlation coefficient

Definition of ratings based on above criteria:

Good: Meets all criteria: Comparable groups are assembled initially and maintained throughout the study (follow-up at least 80 percent); reliable and valid measurement instruments are used and applied equally to the groups; interventions are spelled out clearly; important outcomes are considered; and appropriate attention is given to confounders in analysis.

Fair: Studies will be graded “fair” if any or all of the following problems occur, without the important limitations noted in the “poor” category below: Generally comparable groups are assembled initially but some question remains whether some (although not major) differences occurred in follow-up; measurement instruments are acceptable (although not the best) and generally applied equally; some but not all important outcomes are considered; and some but not all potential confounders are accounted for.

Poor: Studies will be graded “poor” if any of the following major limitations exists: Groups assembled initially are not close to being comparable or maintained throughout the study; unreliable or invalid measurement instruments are used or not applied at all equally among groups (including not masking outcome assessment); and key confounders are given little or no attention.
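The rating definitions above can be read as a simple decision rule. The following is a minimal sketch under stated assumptions, not the reviewers' actual procedure: the class and field names are hypothetical, each criterion is reduced to a yes/no reviewer judgment, and the presence of any "poor"-level limitation is supplied by the reviewer as a single flag. Only the 80 percent follow-up threshold comes from the text.

```python
from dataclasses import dataclass


@dataclass
class StudyAssessment:
    """Reviewer judgments for one RCT or cohort study (hypothetical field names)."""
    comparable_groups_assembled: bool
    comparable_groups_maintained: bool
    follow_up_fraction: float            # 0.85 means 85 percent of participants followed up
    measurements_reliable_and_equal: bool
    interventions_clearly_defined: bool
    important_outcomes_considered: bool
    confounders_addressed: bool
    has_major_limitation: bool = False   # reviewer flags any limitation listed under "poor"


def rate_study(a: StudyAssessment) -> str:
    """Return 'good', 'fair', or 'poor' following the definitions above."""
    # Any major limitation overrides everything else.
    if a.has_major_limitation:
        return "poor"
    # "Good" requires every criterion, including follow-up of at least 80 percent.
    meets_all_criteria = all([
        a.comparable_groups_assembled,
        a.comparable_groups_maintained,
        a.follow_up_fraction >= 0.80,
        a.measurements_reliable_and_equal,
        a.interventions_clearly_defined,
        a.important_outcomes_considered,
        a.confounders_addressed,
    ])
    # Anything short of "good" without a major limitation is "fair".
    return "good" if meets_all_criteria else "fair"
```

In practice these ratings rest on reviewer judgment, so a rule like this is best treated as a consistency check on recorded judgments rather than a substitute for them.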

Studies of Risk Assessment Tools

Adapted from the United States Preventive Services Task Force Quality Rating Criteria for Diagnostic Accuracy Studies

Criteria:

  • Risk assessment tool appropriate for a primary care screening tool
  • Tool evaluates diagnostic test performance in a population other than the one used to derive the instrument
  • Study evaluates a consecutive clinical series of patients or a random subset
  • Study adequately describes the population in which the risk instrument was tested
  • Study adequately describes the instrument evaluated
  • Study includes appropriate criteria in the instrument (must include age, family history and/or some other measure of risk)
  • Study adequately describes the method used to calculate the risk index
  • Study uses appropriate criterion to assess the risk factors (uses either a validated questionnaire or other corroborated method)
  • Study evaluates outcomes or the reference standard in all patients enrolled (up to 20% loss considered acceptable)
  • Follow up with standard diagnostic testing (mammogram/biopsy/pathology) performed consistently without regard for the results of the risk assessment
  • Study evaluates outcomes blinded to results of the screening instrument

Definition of ratings based on above criteria:

Good: Evaluates relevant screening test appropriate for primary care setting; risk instrument is validated in a population other than the one used to derive the instrument; risk instrument adequately described; uses an appropriate reference standard (e.g., SEER data); handles indeterminate results in a reasonable manner; broad spectrum of patients and adequate number of incident cases; use of primary data; appropriate duration of follow up and standardized diagnostic screening in follow up (mammogram).

Fair: Evaluates relevant available screening test; moderate sample size; medium spectrum of patients; risk instrument not validated in a population other than the one used to derive the instrument; handling of indeterminate results not reported or inadequate; inadequate follow up - either inadequate duration or inconsistent use of standardized diagnostic screening (mammogram); instrument not derived from primary data.

Poor: Has important limitations such as inappropriate reference standard, very small sample size, very narrow spectrum of patients; not appropriate for primary care.
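One criterion in the list above is numeric: outcomes or the reference standard should be evaluated in all enrolled patients, with up to 20% loss considered acceptable. A minimal sketch of that single check follows; the function names are hypothetical, and the other criteria are judgment-based and are not modeled here.

```python
def loss_fraction(enrolled: int, with_outcome: int) -> float:
    """Fraction of enrolled patients whose outcome or reference standard is missing."""
    if enrolled <= 0:
        raise ValueError("enrolled must be positive")
    if not 0 <= with_outcome <= enrolled:
        raise ValueError("with_outcome must be between 0 and enrolled")
    return (enrolled - with_outcome) / enrolled


def follow_up_acceptable(enrolled: int, with_outcome: int, max_loss: float = 0.20) -> bool:
    """True when loss is at or below the threshold (20% in the criteria above)."""
    return loss_fraction(enrolled, with_outcome) <= max_loss
```

For example, `follow_up_acceptable(1000, 800)` treats a loss of exactly 20% as acceptable, which matches the "up to 20%" wording.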

Applicability Assessment with PICOTS: Limitations that Reduce Applicability

Population:

  • Narrow eligibility criteria and/or high exclusion rate.
  • Large differences between demographics of study population and that of patients in the community.
  • Narrow or unrepresentative severity or stage of illness.
  • Run-in period with high exclusion rate for non-adherence or side effects.
  • Event rates much higher or lower than observed in population-based studies.
  • Study size too small to represent the population of interest.

Intervention:

  • Doses or schedules not reflected in current practice.
  • Intensity of behavioral interventions that is not likely to be feasible for routine use.
  • Co-interventions that are likely to modify effectiveness of therapy.
  • Monitoring practices or visit frequency not used in typical practice.
  • Highly selected intervention team or level of training/proficiency not widely available.

Comparator:

  • Inadequate dose of comparison therapy.
  • Use of sub-standard alternative therapy.

Outcomes:

  • Surrogate rather than clinical outcomes.
  • Failure to measure most important outcomes.
  • Failure to distinguish minor from serious adverse effects.

Timing of Outcomes Measurement:

  • Follow-up too short to detect important benefits or harms.
  • Lack of long-term follow-up for interventions requiring long-term interventions.

Setting:

  • Settings where standards of care differ markedly from setting of interest.
  • Specialty population or level of care that differs importantly from that seen in primary care.

Appendix C-2. EPC GRADE Domains and Definitions for Assessing the Strength of Evidence

Domain: Risk of Bias
Definition and Elements: Risk of bias is the degree to which the included studies for a given outcome or comparison have a high likelihood of adequate protection against bias (i.e., good internal validity), assessed through two main elements:
  • Study design (e.g., RCTs or observational studies)
  • Aggregate quality of the studies under consideration. Information for this determination comes from the rating of quality (good/fair/poor) done for individual studies
Score and Application: Use one of three levels of aggregate risk of bias:
  • Low risk of bias
  • Medium risk of bias
  • High risk of bias

Domain: Consistency
Definition and Elements: The principal definition of consistency is the degree to which reported effect sizes from included studies appear to have the same direction of effect. This can be assessed through two main elements:
  • Effect sizes have the same sign (that is, are on the same side of “no effect”)
  • The range of effect sizes is narrow.
Score and Application: Use one of three levels of consistency:
  • Consistent (i.e., no inconsistency)
  • Inconsistent
  • Unknown or not applicable (e.g., single study)
As noted in the text, single-study evidence bases (even mega-trials) cannot be judged with respect to consistency. In that instance, use “Consistency unknown (single study).”

Domain: Directness
Definition and Elements: The rating of directness relates to whether the evidence links the interventions directly to health outcomes. For a comparison of two treatments, directness implies that head-to-head trials measure the most important health or ultimate outcomes.
Two types of directness, which can coexist, may be of concern. Evidence is indirect if:
  • It uses intermediate or surrogate outcomes instead of health outcomes. In this case, one body of evidence links the intervention to intermediate outcomes and another body of evidence links the intermediate to most important (health or ultimate) outcomes.
  • It uses two or more bodies of evidence to compare interventions A and B -- e.g., studies of A vs. placebo and B vs. placebo, or studies of A vs. C and B vs. C but not A vs. B.
Indirectness always implies that more than one body of evidence is required to link interventions to the most important health outcomes.
Directness may be contingent on the outcomes of interest. EPC authors are expected to make clear the outcomes involved when assessing this domain.
Score and Application: Score dichotomously as one of two levels of directness:
  • Direct
  • Indirect
If indirect, specify which of the two types of indirectness account for the rating (or both, if that is the case) -- namely, use of intermediate/surrogate outcomes rather than health outcomes, and use of indirect comparisons. Comment on the potential weaknesses caused by, or inherent in, the indirect analysis. The EPC should note if both direct and indirect evidence was available, particularly when indirect evidence supports a small body of direct evidence.

Domain: Precision
Definition and Elements: Precision is the degree of certainty surrounding an effect estimate with respect to a given outcome (i.e., for each outcome separately). If a meta-analysis was performed, this will be the confidence interval around the summary effect size.
Score and Application: Score dichotomously as one of two levels of precision:
  • Precise
  • Imprecise
A precise estimate is an estimate that would allow a clinically useful conclusion. An imprecise estimate is one for which the confidence interval is wide enough to include clinically distinct conclusions. For example, results may be statistically compatible with both clinically important superiority and inferiority (i.e., the direction of effect is unknown), a circumstance that will preclude a valid conclusion.

Printed from: Lohr K, Helfand M, Owens D, et al. Grading the strength of a body of evidence. J Clin Epidemiol in press. [PubMed: 19595577]
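The consistency and precision definitions in Appendix C-2 lend themselves to a short illustration. The sketch below is an assumption-laden simplification, not part of the EPC guidance: function names are hypothetical, effect sizes are taken to be relative risks (so “no effect” is 1.0), consistency is judged on direction only (whether the range is narrow is left to the reviewer), and the clinical importance thresholds passed to `rate_precision` are placeholders that a reviewer would have to choose.

```python
import math
from typing import Sequence, Tuple


def rate_consistency(relative_risks: Sequence[float]) -> str:
    """Classify direction-of-effect consistency across studies."""
    if len(relative_risks) == 0:
        raise ValueError("at least one effect size is required")
    # A single study cannot be judged for consistency.
    if len(relative_risks) == 1:
        return "unknown (single study)"
    # On the log scale, the sign shows which side of "no effect" an estimate is on.
    logs = [math.log(rr) for rr in relative_risks]
    same_side = all(x < 0 for x in logs) or all(x > 0 for x in logs)
    return "consistent" if same_side else "inconsistent"


def rate_precision(ci: Tuple[float, float],
                   benefit_threshold: float,
                   harm_threshold: float) -> str:
    """Return 'imprecise' when the confidence interval is compatible with both
    clinically important benefit and clinically important harm.

    Assumes a relative risk where values below 1 favor the intervention.
    The two thresholds are hypothetical, reviewer-chosen values.
    """
    lower, upper = ci
    includes_important_benefit = lower <= benefit_threshold
    includes_important_harm = upper >= harm_threshold
    if includes_important_benefit and includes_important_harm:
        return "imprecise"
    return "precise"
```

With placeholder thresholds of 0.8 and 1.25, an interval of (0.4, 0.7) supports a single clinical conclusion and is rated precise, while (0.6, 1.5) is compatible with both important benefit and important harm and is rated imprecise.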

Appendix C-3. EPC GRADE Criteria for Assigning Strength of Evidence

Grade and definition:

High: High confidence that the evidence reflects the true effect. Further research is very unlikely to change our confidence in the estimate of effect.

Moderate: Moderate confidence that the evidence reflects the true effect. Further research may change our confidence in the estimate of effect and may change the estimate.

Low: Low confidence that the evidence reflects the true effect. Further research is likely to change the confidence in the estimate of effect and is likely to change the estimate.

Insufficient: Evidence either is unavailable or does not permit estimation of an effect.

Printed from: Lohr K, Helfand M, Owens D, et al. Grading the strength of a body of evidence. J Clin Epidemiol in press. [PubMed: 19595577]

Appendix C-4. Optional EPC GRADE Domains and Definitions for Assessing the Strength of Evidence

Domain: Coherence
Definition and Elements: Coherence is the degree of plausibility of results in relation to epidemiology or, in some cases, biology and pathophysiology.
Score and Application: This additional domain does not need to be described or noted unless something “implausible” has emerged, in which case EPC authors should comment on it.
Use one of two levels:
  • Coherent: the results are plausible given other epidemiologic or biologic data.
  • Not coherent: the results are not plausible given the weight of epidemiologic or biologic data.
Explanation of Non-use in Report: No “implausible” findings emerged in this report.

Domain: Dose-response association
Definition and Elements: This association, either across or within studies, refers to a pattern of a larger effect with greater exposure (dose, duration, adherence).
Score and Application: This additional domain should be rated if studies in the evidence base have noted levels of exposure. Use one of three levels:
  • Present: Dose-response pattern observed
  • Not present: No dose-response pattern observed (dose-response relationship not present)
  • NA (not applicable or not tested)
Explanation of Non-use in Report: No multiple dose effects were tested in the trials included in this report.

Domain: Impact of plausible residual confounders
Definition and Elements: Occasionally, in an observational study, residual confounders would work in the direction opposite that of the observed effect. A case in point is when a study is biased against finding an effect and yet it finds an effect. Thus, had these confounders not been present, the observed effect would have been even larger than the one observed.
Score and Application: This additional domain should be considered if a plausible impact of residual confounding exists.
Use one of three levels:
  • Unlikely: Confounding unlikely to explain observed effect: Plausible residual confounders are more likely to have decreased the observed effect than to have increased the observed effect
  • Possible: Confounding may explain observed effect: Plausible residual confounders are unlikely to have decreased the observed effect and could be responsible for observed effect
  • Cannot assess
Explanation of Non-use in Report: Few observational studies were included and had little impact in the GRADE table.

Domain: Strength of association (magnitude of effect)
Definition and Elements: Strength of association refers to the likelihood that the observed effect is large enough that it cannot have occurred solely as a result of bias from potential confounding factors.
Score and Application: This additional domain should be considered if the effect size is particularly large.
Use one of two levels:
  • Strong: large effect size that is unlikely to have occurred in the absence of a true effect of the intervention
  • Weak: small enough effect size that it could have occurred solely as a result of bias from confounding factors
Explanation of Non-use in Report: Effect sizes were not particularly large and came from well-designed RCTs.

Domain: Publication bias
Definition and Elements: Publication bias indicates that studies may have been published selectively with the result that the estimated effect of an intervention based on published studies does not reflect the true effect. The finding that only a small proportion of relevant trials (or other studies) has been published or reported in a results database may indicate a higher risk of publication bias, which in turn may undermine the overall robustness of a body of evidence.
Score and Application: Publication bias need not be formally scored. However, it can influence ratings of consistency, precision, magnitude of effect (and, to a lesser degree, risk of bias and directness). If EPCs identify unpublished trials, and if those results differ from those of published studies, they can take these factors into account in their rating for consistency and in calculating a summary confidence interval for an effect. We encourage authors to comment on publication bias when circumstances suggest that relevant empirical findings, particularly negative or no-difference findings, have not been published or are not otherwise available.
Explanation of Non-use in Report: No unpublished trials identified. Only very large, well known trials could provide the breast cancer outcomes needed for this report.

Printed from: Lohr K, Helfand M, Owens D, et al. Grading the strength of a body of evidence. J Clin Epidemiol in press. [PubMed: 19595577]

Appendix C-5. Quality and Applicability Ratings of Included Trials

Columns from “Adequate randomization?” through “Intention-to-treat analysis?” are criteria for quality; columns from “Population” through “Setting” are criteria for applicability.

Primary Prevention Trials

| Trial; author, year | Adequate randomization? | Blinding? | Maintenance of comparable groups? | Loss to follow-up? | Measures equal, reliable, valid? | Clear definition of interventions | Important outcomes considered? | Intention-to-treat analysis? | Rating/limitations | Population | Intervention | Comparator | Outcomes | Timing of outcomes measures | Setting | Quality rating for applicability |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| STAR; Vogel, 2006 (ref. 12) | Method not described | Yes | 68% tamoxifen, 72% raloxifene completed study | 1.5% loss tamoxifen; 1.3% raloxifene | Yes | Yes | Yes | Yes | Good | Increased risk for breast cancer; broad inclusion criteria | Appropriate | Appropriate | Appropriate | Appropriate | Multi-center, relevant to primary care | Good |
| IBIS; Cuzick, 2002 (ref. 19) | Yes | Yes | 64% tamoxifen, 74% placebo completed study, p<0.001; 25% completed 5 yrs | NR; assume all included in analysis | Yes | Yes | Yes | Yes | Fair; 40% estrogen use may confound | Increased risk for breast cancer; broad inclusion criteria | Appropriate | Appropriate | Appropriate | Appropriate | Multi-center, relevant to primary care | Good |
| NSABP P-1; Fisher, 1998 (ref. 24) | Yes | Yes | 76% tamoxifen, 80% placebo completed study | 1.6% loss in both groups | Yes | Yes | Yes | Yes | Good | Increased risk for breast cancer; broad inclusion criteria | Appropriate | Appropriate | Appropriate | Appropriate | Multi-center, relevant to primary care | Good |
| Royal Marsden; Powles, 1998 (ref. 25) | Yes | Yes | 53% tamoxifen, 63% placebo completed study, p<0.0005 | 11% loss in both groups | Yes | Yes | Yes | Yes | Fair; unequal use of estrogen in groups | Increased risk for breast cancer; broad inclusion criteria | Appropriate | Appropriate | Appropriate | Appropriate | Multi-center, relevant to primary care | Good |
| Italian; Veronesi, 1998 (ref. 28) | Method not described | Yes | 69% tamoxifen, 73% placebo completed study | <1% loss overall | Yes | Yes | Yes | Yes | Fair; hysterectomy, estrogen use may confound | Increased risk for breast cancer; prior hysterectomy | Appropriate | Appropriate | Appropriate | Appropriate | Multi-center, relevant to primary care | Fair; women in study have hysterectomy modifying risk |
| RUTH; Barrett-Connor, 2006 (ref. 46) | Yes | Yes | 80% raloxifene, 79% placebo completed study | NR; assume all included in analysis | Yes | Yes | Yes | Yes | Good | Heart disease or increased heart risk | Appropriate | Appropriate | Appropriate | Appropriate | Multi-center, relevant to primary care | Good |
| MORE; Cummings, 1999 (ref. 34) | Yes | Yes | 78% raloxifene, 75% placebo completed study | NR; assume all included in analysis | Yes | Yes | Yes | Yes | Good | Osteoporosis | Appropriate | Appropriate | Appropriate | Appropriate | Multi-center, relevant to primary care | Good |
| LIFT; Cummings, 2008 (ref. 10); Ettinger, 2008 (ref. 87) | Yes | Yes | 91% overall received 80% of doses | NR; assume all included in analysis | Yes | Yes | Yes | Yes | Good | Osteoporosis | Appropriate | Appropriate | Appropriate | Appropriate | Multi-center, relevant to primary care | Good |
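
Raloxifene Trials

| Author, year | Adequate randomization? | Blinding? | Maintenance of comparable groups? | Loss to follow-up? | Measures equal, reliable, valid? | Clear definition of interventions | Important outcomes considered? | Intention-to-treat analysis? | Rating/limitations | Population | Intervention | Comparator | Outcomes | Timing of outcomes measures | Setting | Quality rating for applicability |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cohen, 2000* (ref. 73) | Yes | Yes | Yes | 35% discontinued therapy | Yes | Yes | Yes but not all harms are reported | NR | Fair | Healthy women average risk | Appropriate | Appropriate | Appropriate | Appropriate | 2 Multi-center trials | Fair |
| Delmas, 1997 (ref. 74) | Yes | NR | Yes | NR | Yes | Yes | Yes but not all harms are reported | Yes | Fair | Healthy women | Appropriate | Appropriate | Appropriate | Appropriate | Multi-center; no US sites | Poor |
| Goldstein, 2005 (ref. 76) | Yes | Yes | Yes | 40% discontinued therapy | Yes | Yes | Yes but not all harms are reported | Yes | Fair | Healthy women with prior hysterectomy | Appropriate | Appropriate | Appropriate | Appropriate | Multi-center trial; includes US sites | Fair |
| Johnston, 2000* (ref. 77) | Yes | Yes | Yes | 23–42% | Yes | Yes | Yes but not all harms are reported | Yes | Fair | Healthy women | Appropriate | Appropriate | Appropriate | Appropriate | Multi-center trial; includes US sites | Fair |
| Jolly, 2003* (ref. 78) | Yes | No | Yes | NR | Yes | Yes | Yes but not all harms are reported | No | Poor; only includes those continuing therapy | Healthy women | Appropriate | Appropriate | Appropriate | Appropriate | Multi-center; includes US sites | Fair |
| Lufkin, 1998† (ref. 79) | Yes | Yes | NR | ~10% | Yes | Yes | Yes but not all harms are reported | Yes | Fair | Osteoporosis | Appropriate | Appropriate | Appropriate | Appropriate | Multi-center | Fair |
| McClung, 2006 (ref. 80) | Yes | Yes | NR | ~30% | Yes | Yes | Yes but not all harms are reported | NR | Fair | Healthy | Appropriate | Appropriate | Appropriate | Appropriate | Multi-center; includes US sites | Fair |
| Meunier, 1999 (ref. 81) | Yes | Yes | Yes | ~16% | Yes | Yes | Yes but not all harms are reported | Yes | Fair | Osteoporosis | Appropriate | Appropriate | Appropriate | Appropriate | Multi-center; France | Poor |
| Morii, 2003 (ref. 82) | Yes | Yes | Yes | ~15% | Yes | Yes | Yes but not all harms are reported | NR | Fair | Japan; osteoporosis narrow inclusion criteria | Appropriate | Appropriate | Appropriate | Appropriate | Multi-center; Japan | Poor |
| Nickelson, 1999† (ref. 83) | NR | Yes | Yes | 9.1% discontinued | Yes | Yes | Yes but not all harms are reported | Yes | Fair | Osteoporosis | Appropriate | Appropriate | Appropriate | Appropriate | 2 centers; US | Fair |
| Palacios, 2004 (ref. 84) | Yes | Yes | Yes | 11–13% | Yes | Yes | Yes but not all harms are reported | Yes | Fair | Healthy women | Appropriate | Appropriate | Appropriate | Appropriate | Multi-center; no US sites | Poor |
| Walsh, 1998 (ref. 85) | Yes | Yes | Yes | 16% | Yes | Yes | Yes but not all harms are reported | Yes | Fair | Healthy women | Appropriate | Appropriate | Appropriate | Appropriate | Multi-center; includes US sites | Fair |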
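
Tibolone Trials

| Trial; author, year | Adequate randomization? | Blinding? | Maintenance of comparable groups? | Loss to follow-up? | Measures equal, reliable, valid? | Clear definition of interventions | Important outcomes considered? | Intention-to-treat analysis? | Rating/limitations | Population | Intervention | Comparator | Outcomes | Timing of outcomes measures | Setting | Quality rating for applicability |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OPAL; Bots, 2001 (ref. 89); Langer, 2006 (ref. 90) | Yes | Yes for treatment group; NR for other outcomes | Yes | No; 31% tx, 30% placebo | Yes | Yes | Yes | Yes | Fair | Healthy | Appropriate | Appropriate | Appropriate | Appropriate | Multi-center; includes US sites | Fair |
| Landgren, 2002 (ref. 91) | Yes | NR | Yes | No; 11% tx, 20% placebo | Yes | Yes | Yes | NR | Fair | Healthy; vasomotor symptoms | Appropriate | Appropriate | Appropriate | Appropriate | Multi-center; no US sites | Poor |
| Gallagher, 2001 (ref. 92) | Yes | Yes for treatment group; NR for other outcomes | Yes | No; 34% tx, 29% placebo | Yes | Yes | Yes | Yes | Fair | Healthy | Appropriate | Appropriate | Appropriate | Appropriate | Multi-center; US | Fair |
| Swanson, 2006 (ref. 93) | Yes | NR | Yes | No | Yes | Yes | Yes | Yes | Fair | Healthy; vasomotor symptoms | Appropriate | Appropriate | Appropriate | Appropriate | Multi-center; US | Poor |
| Hudita, 2003 (ref. 94) | NR | NR | Yes | No | Yes | Yes | Yes | No | Poor | Healthy; symptoms | Appropriate | Appropriate | Appropriate | Appropriate | 1 Center; Romania | Poor |
| Onalan, 2005 (ref. 96) | Yes | NR | NR | No; 18% tx, 9% placebo | Yes | Yes | Yes | No | Poor | Healthy | Appropriate | Appropriate | Appropriate | Appropriate | 1 Center; Turkey | Poor |
| Lundstrom, 2002 (ref. 95) | Yes | NR | Yes | No | Yes | Yes | Only breast density | No | Fair | Healthy | Appropriate | Appropriate | Appropriate | Appropriate | 1 Center; Sweden | Poor |
| Million Women Study; Beral, 2003 (ref. 98); Beral, 2005 (ref. 97) | NA | NA | NA | No | Yes | Yes | Yes | NA | Fair | Healthy; symptoms | Appropriate | Appropriate | Appropriate | Appropriate | Multi-center | Poor |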

Appendix C-6. Quality of Risk Assessment Tools

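Quality Criteria

| Study | Primary care tool? | Tested in secondary population? | Population adequately described? | Instrument adequately described? | Appropriate criteria? | Risk calculation adequately described? | Results appropriately handled? | Reference standard? | Adequate sample size? | Adequate duration of follow up? | Quality rating |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gail, 1989 (ref. 49) | Yes | No* | Yes | Yes | Yes | Yes | Yes | No* | Yes | Yes | Good |
| Costantino, 1999 (ref. 124) | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Good |
| Rockhill, 2001 (ref. 122) | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Good |
| Chlebowski, 2007 (ref. 125) | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Good |
| Gail M, 2007 (ref. 126) | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Good |
| Adams-Campbell, 2007 (ref. 127) | Yes | Yes | Yes | Yes | Yes | Yes | NR | Yes | Yes | Yes | Good |
| DeCarli, 2006 (ref. 121) | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Good |
| Boyle, 2004 (ref. 118) | Difficult† | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Fair |
| Chen, 2006 (ref. 128) | Yes | No* | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Good |
| Barlow, 2006 (ref. 129) | Yes | No* | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Fair |
| Tice, 2008 (ref. 130) | Yes | No* | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Good |
| Rockhill, 2003 (ref. 131) | Yes | Yes | Yes | Yes | Yes | Yes | Yes | NR | Yes | Yes | Good |
| Colditz, 2000 (ref. 119) | Yes | No* | Yes | Yes | Yes | Yes | Yes | NR | Yes | Yes | Good |
| Colditz, 2004 (ref. 120) | Yes | Yes | Yes | Yes | Yes | Yes | Yes | NR | Yes | Yes | Good |
| Tyrer, 2004 (ref. 123) | Yes | No* | No* | Yes | No | Yes | Yes | Yes | Yes | NR | Fair |
| Amir, 2003 (ref. 132) | Yes | Yes | No§ | Yes | Yes | Yes | Yes | Yes | No | Yes | Fair |

* Appropriate due to study purpose.
† Logistically difficult due to an extensive dietary questionnaire.
‡ Tyrer, 2004 did not use primary data.
§ Amir, 2003 did not use a primary care population.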

Footnotes

* Reference: Harris RP, Helfand M, Woolf SH, et al. Current methods of the US Preventive Services Task Force: a review of the process. Am J Prev Med. 2001;20(3S):21–35.
