This evidence review followed methods outlined in the Agency for Healthcare Research and Quality (AHRQ) Evidence-based Practice Center (EPC) Methods Guide for Effectiveness and Comparative Effectiveness Reviews (hereafter the “AHRQ Methods Guide”) for Key Questions (KQs) 1 and 2.39 Most methods were determined a priori; we obtained additional input from Subject Matter Experts (SMEs) to finalize them. The protocol was developed in collaboration with a Technical Expert Panel (TEP), Key Informants (KIs), and federal partners, and was informed by public input on the KQs and study eligibility criteria. For additional details, see the review protocol posted on the AHRQ Effective Health Care Program website (https://effectivehealthcare.ahrq.gov/products/racial-disparities-health-healthcare/protocol). The protocol was registered with PROSPERO (CRD42022335090).
2.1. Methods To Address Key Questions
2.1.1. Classification of Studies by Key Question
Studies were included in KQ 1 if they evaluated an algorithm’s effect on health or healthcare outcomes stratified by racial and ethnic groups (i.e., studies reporting only model fit and accuracy were excluded). Studies were included in KQ 2 if they examined a strategy’s ability to mitigate 1) racial and ethnic algorithmic bias or 2) a known racial and ethnic disparity associated with an algorithm. Studies that described both a racial and ethnic disparity associated with an algorithm, and an intervention on the algorithm to mitigate the disparity, were included in both KQ 1 and KQ 2.
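To make the classification rule above concrete, the following minimal Python sketch shows how a study would be routed to KQ 1, KQ 2, or both. The function name and record fields are hypothetical illustrations and are not part of the review's actual screening instruments.

```python
# Minimal sketch of the KQ classification rule described above.
# Field names are hypothetical, not the review's screening form items.

def classify_key_questions(study: dict) -> set[str]:
    """Return the set of Key Questions a study informs."""
    kqs = set()
    # KQ 1: algorithm's effect on health/healthcare outcomes, stratified by race and ethnicity
    if study.get("reports_health_outcomes_by_race_ethnicity"):
        kqs.add("KQ1")
    # KQ 2: strategy to mitigate algorithmic bias or a known algorithm-associated disparity
    if study.get("evaluates_mitigation_strategy"):
        kqs.add("KQ2")
    return kqs

# A study describing both a disparity and a mitigation intervention lands in both KQs:
example = {"reports_health_outcomes_by_race_ethnicity": True,
           "evaluates_mitigation_strategy": True}
assert classify_key_questions(example) == {"KQ1", "KQ2"}
```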
2.1.2. Literature Search Strategies for Key Questions
EPC information specialists conducted a comprehensive literature search following established systematic review protocols. We searched the following databases using controlled vocabulary and text words from January 1, 2011, to February 7, 2023: Embase and Medline (via embase.com), PubMed (in-process citations to capture items not yet indexed in Medline), and the Cochrane Library. Based on guidance from SMEs, KIs, and TEP members, articles published before 2011 were considered unlikely to be contemporaneous to current algorithms. The search strategy included controlled vocabulary terms (e.g., MeSH, EMTREE), along with free-text words, related to race, ethnicity, algorithms, disparities, and inequities. Searches used a hedge to remove conference abstracts, editorials, letters, and news items; however, we retained some of these items in the final search to help inform the Contextual Questions. Information specialists independently peer reviewed searches using the Peer Review of Electronic Search Strategies Checklist. The search strategy for Embase and Medline is included in Appendix A. We also reviewed submissions to AHRQ’s Supplemental Evidence and Data portal to identify other studies meeting protocol eligibility criteria.
Information specialists also conducted grey literature searches of the following resources: Association for Computing Machinery Digital Archives, medRxiv and bioRxiv preprint servers, ClinicalTrials.gov, and websites of relevant organizations (e.g., AHRQ, American Actuarial Association, American Hospital Association Institute for Diversity and Health Equity, American Medical Informatics Association, Centers for Disease Control and Prevention, Consumer Financial Protection Bureau, Healthcare Information and Management Systems Society, U.S. Food and Drug Administration, Health Resources and Services Administration, National Institute of Standards and Technology, Office of the National Coordinator for Health Information Technology, Observational Health Data Sciences and Informatics, and others as recommended by SMEs and TEP). We hand-searched published systematic reviews to identify any studies missed by our searches. Scopus was also used to identify related publications through citation tracking.
We screened records for eligibility using DistillerSR (Evidence Partners, Ottawa, Ontario, Canada). After title screening, two analysts independently screened each abstract for eligibility. We then retrieved the full-text articles of eligible abstracts and screened them for final eligibility, again in duplicate. All disagreements were resolved by consensus discussion between the two screeners.
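The duplicate-screening workflow can be summarized in a brief sketch. This is an illustration only, not DistillerSR's interface or API; the function and decision labels are hypothetical.

```python
# Illustrative sketch of the duplicate-screening workflow described above:
# two independent reviewers per record, with disagreements resolved by consensus.

from typing import Literal

Decision = Literal["include", "exclude"]

def adjudicate(reviewer_a: Decision, reviewer_b: Decision) -> str:
    """Combine two independent screening decisions for one record."""
    if reviewer_a == reviewer_b:
        return reviewer_a                      # agreement stands
    return "consensus discussion required"     # disagreement goes to the two screeners

print(adjudicate("include", "include"))   # -> include
print(adjudicate("include", "exclude"))   # -> consensus discussion required
```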
2.1.3. Analytic Framework for Key Questions
KQs were addressed by a systematic review of published studies and grey literature. Figure 1 presents the analytic framework that displays the interaction between major components of the evidence base, organized according to the PICOTS (Population, Intervention, Comparator, Outcome, Timing, Setting) model.
2.1.4. Inclusion and Exclusion Criteria for Key Questions
As suggested in the AHRQ Methods Guide,39 we list eligibility criteria in several categories: publication type, study design, intervention characteristics, setting, and outcome.
2.1.4.1. Publication Criteria
- We did not include abstracts or meeting presentations, which do not provide sufficient details about experimental methods to permit an evaluation of study design and conduct; they may also contain only a subset of measured outcomes.40,41 Also, abstracts that are published as part of conference proceedings can have inconsistencies compared with the final study publication or may describe studies that are never published as full articles.42–45
- We included studies published from 2011 to the present. Based on guidance from subject matter and technical experts, earlier articles were considered unlikely to be contemporaneous to current algorithms.
- To avoid double-counting patients, when several reports of overlapping patients were available, we included outcome data only from the report with the most patients. We included data from a smaller, overlapping publication when it reported data on different racial and ethnic group(s), included an outcome not provided by the larger report, or reported longer follow-up data for an outcome (a minimal illustration of this selection rule follows this list).
- This review’s timeline did not permit translation of non-English-language articles.
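As noted in the publication criteria above, when multiple reports described overlapping patients we took outcome data from the largest report and retained smaller reports only when they added racial and ethnic groups, outcomes, or longer follow-up. The following minimal Python sketch illustrates that selection rule; all field names are hypothetical.

```python
# Hedged sketch of the overlapping-publication rule: keep the largest report,
# plus any smaller overlapping report that adds racial/ethnic groups, adds
# outcomes, or reports longer follow-up. Field names are hypothetical.

def select_reports(reports: list[dict]) -> list[dict]:
    largest = max(reports, key=lambda r: r["n_patients"])
    selected = [largest]
    for r in reports:
        if r is largest:
            continue
        adds_groups = bool(set(r["race_ethnicity_groups"]) - set(largest["race_ethnicity_groups"]))
        adds_outcomes = bool(set(r["outcomes"]) - set(largest["outcomes"]))
        longer_followup = r["max_followup_months"] > largest["max_followup_months"]
        if adds_groups or adds_outcomes or longer_followup:
            selected.append(r)
    return selected
```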
2.1.4.2. Study Design Criteria
- We included only full-length research studies; thus, we excluded narrative reviews, letters, guidelines, position statements, and commentaries. We used systematic reviews only to identify individual studies as a supplement to the full literature search (described above in the Literature Search Strategy).
- We considered any study design with a relevant comparator, or with no comparator, as described in Table 2.
- We included studies with prospective or retrospective patient identification, as well as studies that modeled potential outcomes. In modeling studies, real-world or synthetic source data were used to calculate algorithm scores, and the outcomes that would have resulted from using the algorithm were simulated.
- For KQ 1, the study must have measured an algorithm’s effect on racial and ethnic disparities. For KQ 2, the study must have measured a mitigation strategy’s effect.
2.1.4.3. Intervention Criteria
- To be considered an “algorithm,” a mathematical formula or model must combine different input variables or factors to produce a numerical score, a scaled ranking, or a classification scheme that may be used to guide healthcare decisions (a purely hypothetical example follows this list). We also included studies of algorithm-informed decision support tools, defined as any clinical guideline, pathway, clinical decision support intervention in an electronic health record (EHR), or operational system used by health systems and payers that is informed by an algorithm as defined above. We did not require that an algorithm explicitly use race or ethnicity as an input.
- For KQ 1, the algorithm must have been applied to a patient/participant population other than the derivation population. We excluded newly developed algorithms evaluated only in a derivation population.
- Three studies directly evaluated both the effect of an algorithm on racial and ethnic disparities and strategies to mitigate racial and ethnic bias; therefore, relevant data from these studies were summarized within both KQs. Additionally, a few studies were applicable primarily to one of the KQs while indirectly addressing the other KQ. These studies were analyzed with the most appropriate KQ following consensus discussion among reviewers.
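To illustrate the intervention criterion above, the sketch below shows a purely hypothetical example of an “algorithm” in the sense used in this review: a formula combining input variables into a numerical score, with a simple decision rule layered on top. The variables, weights, and threshold are invented for illustration and do not correspond to any algorithm evaluated in this review.

```python
# Purely hypothetical example of an "algorithm" as defined above: a model that
# combines input variables into a numerical score used to guide a care decision.
# Variables, weights, and threshold are invented for illustration only.

def hypothetical_readmission_score(age_years: float, prior_admissions: int,
                                   comorbidity_count: int) -> float:
    """Combine inputs into a single numerical risk score."""
    return 0.02 * age_years + 0.5 * prior_admissions + 0.3 * comorbidity_count

def flag_for_care_management(score: float, threshold: float = 2.0) -> bool:
    """A decision-support rule layered on the score (also hypothetical)."""
    return score >= threshold

score = hypothetical_readmission_score(age_years=70, prior_admissions=2, comorbidity_count=3)
print(score, flag_for_care_management(score))   # -> approximately 3.3, True
```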
2.1.4.4. Setting Criteria
- For representativeness, we included only studies of patients in the United States for KQ 1. For KQ 2, we did not restrict by country as strategies to mitigate bias outside the United States may be generalizable to settings in the United States.
- We included any study conducted in a clinical or nonclinical site, as described in Table 2.
2.1.4.5. Outcome Criteria
- For KQ 1, a study must have evaluated whether an algorithm affects a racial or ethnic disparity in outcomes. Studies must have reported outcomes separately for two or more races or ethnicities. For KQ 2, we allowed studies that reported outcomes for only one race or ethnicity. We did not require that reported effect sizes be statistically significant or that a study control or adjust for possible confounders (confounding is addressed in our narrative appraisal of the evidence).
- For KQ 1, the study must have reported health or healthcare outcomes. Studies that reported only diagnostic or prognostic accuracy without specifying clinical implications were excluded.
- For both KQs, a study must have reported race and ethnicity-based outcomes in at least one of three outcome categories (access to care, quality of care, and health outcomes).
Table 2 presents criteria that guided study eligibility and categorization of outcomes, organized according to the PICOTS framework.
2.1.5. Data Abstraction and Data Management
Data were extracted into Microsoft Word and/or Excel. Elements abstracted included general study characteristics (e.g., study design, setting, number of patients enrolled), patient characteristics (e.g., age, sex, race and ethnicity, clinical condition), intervention details (e.g., study objective, type of algorithm, intent of algorithm, input variables used, data sources), and outcome data.
2.1.6. Assessment of Methodologic Risk of Bias and Data Synthesis
Some included studies were prediction modeling studies, so we first considered using PROBAST (Prediction model study Risk Of Bias Assessment Tool) to assess risk of bias (ROB).46 After piloting PROBAST in our evidence base, we determined it was not applicable: our KQs address the effects of algorithms (KQ 1) or of mitigation strategies (KQ 2) on clinical outcomes, which PROBAST, with its focus on algorithm development, does not consider. Therefore, based on EPC guidance,39 we focused ROB assessment on how well a study measured the true effect of algorithms or mitigation strategies (i.e., neither overestimating nor underestimating it). Although a randomized controlled trial (RCT) would be the ideal design to measure this effect, only one of the included studies was an RCT. Because no ROB tool exists specifically for studies of the effects of algorithms, we used an existing tool, ROBINS-I (Risk Of Bias in Nonrandomized Studies of Interventions), to assess ROB.47 Using this tool involves rating a study’s ROB on each of seven domains, listed below, and then combining the domain-specific ratings to categorize the study as being at Low, Moderate, or High ROB. The domains are as follows:
- Bias due to confounding
- Bias in selection of participants into the study
- Bias in classification of interventions
- Bias due to deviations from intended interventions
- Bias due to missing data
- Bias in measurement of outcomes
- Bias in selection of the reported result
Based on feedback from our TEP, there was consensus that studies evaluating algorithms’ effects on racial and ethnic disparities should undergo ROB assessment in the context of race- and ethnicity-specific factors; we therefore incorporated racial and ethnic equity-related considerations into the ROB assessment. For four of the seven ROBINS-I domains, we added ROB signaling questions related to racial and ethnic health equity, adapted from a prior AHRQ project by another EPC:38
- Bias due to confounding domain: “Was a transparent rationale provided for including or removing race and ethnicity as an input variable?”
- Bias in selection of participants into the study domain: “Were data on racial and ethnic groups gathered using consistent definitions or categories with adequate response options?”
- Bias due to missing data domain: “Were there sufficient outcomes occurring in specific racial and ethnic groups to assess model performance separately in these groups?”
- Bias in measurement of outcomes domain: “Were relevant model performance measures evaluated appropriately in racial and ethnic groups?”
A study was deemed at overall High ROB if any single domain (also considering the health equity signaling question in that domain) was judged to be at High ROB. A study was deemed at overall Low ROB if all domains were at Low ROB. All other studies were deemed at Moderate ROB.
Given the variation in study designs among included studies, “Not Applicable (N/A)” was an acceptable response to the racial and ethnic equity-based signaling questions; N/A ratings did not affect the ROB assessment. Further, the racial and ethnic health equity question in the “bias due to missing data” domain was applied only to studies of algorithm derivation and internal validation; therefore, this question was usually N/A for studies addressing KQ 1, all of which examined established, previously validated algorithms. Similarly, the health equity question added to the “bias in measurement of outcomes” domain was restricted to model performance measures addressing discrimination and calibration; because only studies evaluating the clinical effects of algorithm use, rather than measures of discrimination and calibration, were eligible for KQ 1, this signaling question was also N/A for studies addressing KQ 1. If a study reported both model performance outcomes (e.g., model calibration) and clinical outcomes, we reported only the latter.
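The overall ROB rating rule described above (High if any domain is rated High, Low if all domains are rated Low, otherwise Moderate, with N/A responses not affecting the rating) can be expressed compactly. The following Python sketch is an illustration under those assumptions; the data structure is hypothetical, although the domain names follow ROBINS-I.

```python
# Minimal sketch of the overall ROB rating rule: High if any domain is High,
# Low if all domains are Low, otherwise Moderate. Entries rated "N/A"
# (e.g., an equity signaling question that does not apply) are ignored.

ROBINS_I_DOMAINS = [
    "confounding", "selection of participants", "classification of interventions",
    "deviations from intended interventions", "missing data",
    "measurement of outcomes", "selection of the reported result",
]

def overall_rob(domain_ratings: dict[str, str]) -> str:
    ratings = [r for r in domain_ratings.values() if r != "N/A"]
    if any(r == "High" for r in ratings):
        return "High"
    if all(r == "Low" for r in ratings):
        return "Low"
    return "Moderate"

example = {domain: "Low" for domain in ROBINS_I_DOMAINS}
example["missing data"] = "Moderate"
print(overall_rob(example))   # -> Moderate
```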
For KQ 1, we organized the evidence into clinical assessment categories, described the purpose of the identified algorithms, and narratively summarized the evidence with a focus on three potential effects of algorithms: exacerbation or introduction of racial and ethnic disparities, reduction of existing racial and ethnic disparities, or no discernible effect related to race and ethnicity. For KQ 2, synthesis focused on the types of mitigation strategies identified. We analyzed the extent of different mitigation approaches, examined and classified their key features, reviewed evidence of their effectiveness when available, and summarized interventions and approaches identified for mitigation of racial and ethnic bias.
2.2. Methods for Contextual Questions
2.2.1. Contextual Questions 1-3
In addition to the literature searches conducted to address the KQs, we conducted supplemental searches to identify studies, standards, frameworks, white papers, and other relevant resources that addressed Contextual Questions (CQs) 1, 2, and 3. We also reviewed responses to AHRQ’s Request for Information (RFI)32 and discussions with SMEs, TEP, and KIs to inform our analysis of the CQs.
2.2.2. Contextual Question 4
The algorithm development-to-clinical implementation lifecycle involves multiple steps, each of which has the potential to introduce racial and ethnic bias. The conceptual model in Figure 2 guided our analysis and helped describe and summarize mechanisms through which racial and ethnic bias can be introduced and result in disparities in access, quality, and health outcomes. This conceptual model is informed by the Sociotechnical Model for Studying Health Information Technology in Complex Adaptive Healthcare Systems48 and the conceptual model for biases in healthcare proposed by Rajkomar et al.49
Racial and ethnic biases can be introduced at any step in the algorithm development-to-implementation process. Figure 2 organizes this process into two major steps: algorithm development (Figure 2a) and algorithm translation, dissemination, and implementation (Figure 2b). Table 3 details potential racial and ethnic biases that can be introduced during the algorithm development phase.
Racial and ethnic biases can be introduced de novo during dissemination and implementation or carried over from the development phase. Dissemination focuses on spreading knowledge and evidence by passively informing audiences. Implementation is a more active initiative that focuses on integrating and incorporating guidance into clinical workflow, often with technological support. We outlined three vulnerabilities through which racial and ethnic bias can be newly introduced during implementation (Figure 2b). Racial and ethnic bias can be introduced first during translation, the process of operationalizing algorithms into decision tools or clinical processes. Interaction with an algorithm can result in racial and ethnic bias when a clinician is presented with guidance during care but chooses not to act; implicit or explicit bias might also occur, for example, when a clinician determines, on behalf of a mixed-race patient, which race category to document in an EHR. Use of consumer-facing health information technology (HIT) may contribute additional racial and ethnic biases, for example through HIT design and language choices that do not account for differences in healthcare literacy, numeracy, and language. Furthermore, racial and ethnic bias can result when an algorithm is not updated as the evidence base evolves or changes.
The method by which algorithms are disseminated and implemented provides additional vulnerabilities for the introduction of racial and ethnic bias. We organized dissemination and implementation methods into hierarchical tiers based on their increasing potential impact on outcomes. Standard dissemination is defined as non-HIT-supported methods for providing guidance to clinicians; it requires a clinician to be aware that the guidance exists, to understand the guidance and its applicability to the patient, and to understand how to integrate the guidance into care. Systems-level dissemination is defined by the use of HIT to reach clinicians, such as through a cloud-based clinical pathways library.50 This has a potentially larger impact on outcomes than standard dissemination, as use of HIT may increase the number of clinicians who use the algorithm.51 Systems-level implementation is defined as the translation and integration of algorithms into the clinical workflow to display guidance at the right time, through the right system, to the right person, and in the right format, so that it has the greatest likelihood of affecting patient care and outcomes.
Racial and ethnic biases introduced during algorithm development can also be amplified, such as when an algorithm is incorporated in an EHR and clinicians interpret algorithm-based guidance with implicit or explicit biases. The magnitude and impact of racial and ethnic biases depend on the dissemination and implementation method (Figure 2[b]) as well as the interaction between the clinician user, dissemination and implementation method, and patient.
To inform CQ 4, we identified six algorithms, not evaluated in the studies included in KQs 1 or 2, to examine their potential impact on health and healthcare disparities. In selecting the algorithms, we considered a variety of patient populations, clinical conditions, types of algorithms, settings, and end-users. We prioritized algorithms by considering disease prevalence and burden in addition to conditions for which racial and ethnic disparities in healthcare and/or health outcomes are well-documented.
To assess the potential effects of the algorithms identified for CQ 4, we examined the health and intermediate outcomes delineated in Table 2. We also described development and validation methods and reported algorithm accuracy measures. We also documented whether algorithm developers explicitly considered potential racial and ethnic bias (e.g., by examining algorithm performance by race and ethnicity) or used any strategies that might mitigate racial and ethnic bias. Finally, we described key components of dissemination and implementation strategies used by algorithm developers and end-users and estimated the effects of these dynamics on racial and ethnic disparities. Findings for CQ 4 are available in the Results section of this report.
2.2.2.1. CQ 4 Sample Algorithm Identification and Selection
We employed five distinct approaches for identifying sample algorithms for CQ 4. Figure 3, step 1, depicts the flow and organization for these activities. First, we identified conditions with the highest disease burden and/or extreme racial and ethnic disparities in outcomes by examining available sources, such as the Centers for Disease Control and Prevention (CDC) mortality and morbidity reports and AHRQ’s National Healthcare Quality and Disparities Reports.52,53 Second, we reviewed findings of the searches for the KQs and performed supplemental searches as needed to identify algorithms and studies relevant to these conditions. Third, we reviewed our discussions with KIs, SMEs, and the TEP related to specific algorithms recommended for inclusion. We contacted select experts for follow-up when needed. Fourth, we reviewed responses to the RFI32 and public posting of the KQs. Fifth, we queried select vendors to identify critical or high-use algorithms.
Results from each of the algorithm selection approaches were collated and duplicates removed. We constructed a database of algorithms from this pool and added key data, such as type of algorithm, intent of algorithm, developer/vendor, intended user, patient population, clinical condition, setting, and anticipated evidence base (e.g., citations). We used an iterative, consensus-driven approach to select the final six samples. Finally, we identified relevant and representative exemplars by study type (e.g., development, validation, implementation, comparative effectiveness) for each algorithm in the sample.
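The following sketch illustrates what one record in the CQ 4 algorithm database might look like; the field names mirror the key data elements listed above, but the data structure and example values are hypothetical and do not describe the review's actual database.

```python
# Illustrative sketch of one record in the CQ 4 algorithm database described
# above. Field names mirror the key data listed in the text; the dataclass
# and example values are hypothetical.

from dataclasses import dataclass, field

@dataclass
class AlgorithmRecord:
    name: str
    algorithm_type: str          # e.g., regression-based risk score, machine learning model
    intent: str                  # what decision the algorithm is meant to inform
    developer_or_vendor: str
    intended_user: str           # e.g., clinician, payer, health system
    patient_population: str
    clinical_condition: str
    setting: str
    evidence_citations: list[str] = field(default_factory=list)

record = AlgorithmRecord(
    name="Hypothetical Risk Score",
    algorithm_type="regression-based risk score",
    intent="prioritize patients for care management",
    developer_or_vendor="unspecified (illustrative)",
    intended_user="health system",
    patient_population="adults",
    clinical_condition="chronic disease",
    setting="outpatient",
)
```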
2.2.2.2. Data Abstraction
For each algorithm in the CQ 4 sample, we abstracted technical specifications, such as input variables used, datasets used for development and validation, and types of outcomes produced. We also included, when available, details about the processes used for development and validation, along with outcome data. Finally, we documented, when possible, the extent of use in clinical practice; dissemination and implementation activities (e.g., incorporation into a guideline or EHR); and years in use or since publication. Additional variables were abstracted depending on findings.
2.2.2.3. Algorithm Evaluation
Each sample algorithm was evaluated qualitatively and quantitatively, as feasible, to determine its likelihood of contributing to racial and ethnic disparities. Working with our SMEs, TEP, KIs, and other stakeholders, we used existing evaluation tools, identified emerging standards, and identified gaps and deficiencies related to assessing racial and ethnic bias in algorithms. Descriptive data for each algorithm were summarized.
2.3. Peer Review and Public Commentary
Experts in clinical care, health equity, and bioinformatics, along with individuals representing stakeholder and user communities, provided external peer review of this evidence review; AHRQ and an EPC program associate editor also reviewed draft reports. The draft report was posted on the AHRQ website for 4 weeks to elicit public comment. A disposition of comments table will be posted on the AHRQ Effective Health Care (EHC) website 3 months after AHRQ posts the final systematic review.