2 Methodology

This evidence report is based on a systematic review of the literature. Our EPC formed an evidence review team consisting of pediatricians and EPC methodological staff to review the literature and perform data abstraction and analysis. The evidence review team held several meetings and teleconferences with external technical experts representing the American Academy of Pediatrics, the American Academy of Family Physicians, the National Association of Pediatric Nurse Practitioners, the Center for Quality of Care Research and Education, Harvard School of Public Health, and the organization, Parents of Infants and Children with Kernicterus. The EPC and its panel of external technical experts refined key questions proposed by the AAP and identified issues central to this report. A comprehensive search of the medical literature was conducted to identify the evidence available to address the questions. For this evidence report, we compiled evidence tables of study features and results, appraised the methodological quality of the studies, assessed the correlations of the predictors and outcomes, summarized the results, and performed meta-analyses when there were sufficient data.

This section documents the methods and procedures that were used to develop this evidence report. It begins with a description of the scope of the research questions and proceeds to a detailed description of the techniques and approaches that were used in the literature review.

Key Questions

Question 1: What is the relationship between peak bilirubin levels and/or duration of hyperbilirubinemia and neurodevelopmental outcome?

Question 2: What is the evidence for effect modification of the results in question 1, by gestational age, hemolysis, serum albumin, and other factors?

Question 3: What are the quantitative estimates of efficacy of treatment for: 1) reducing peak bilirubin levels (e.g., number-needed-to-treat (NNT) at 20 mg/dl to keep TSB from rising); 2) reducing the duration of hyperbilirubinemia (e.g., average number of hours by which time TSB greater than 20 mg/dl may be shortened by treatment); and 3) improving neurodevelopmental outcomes?

Question 4: What is the efficacy of various strategies for predicting hyperbilirubinemia, including hour-specific bilirubin percentiles?

Question 5: What is the accuracy of transcutaneous bilirubin measurements?

Literature Search and Review Parameters

This section describes the search terms, strategies, and databases that were used in the literature retrieval; the article screening and selection process; methods that were used for developing the data extraction forms, abstracting data, and reviewing and analyzing the literature; and the results of the literature review.

Search Strategies

We searched the MEDLINE® database on September 25, 2001 for publications from 1966 to the present using relevant MeSH terms (“hyperbilirubinemia”, “hyperbilirubinemia, hereditary”, “bilirubin”, “jaundice, neonatal”, “kernicterus”) and text words (“bilirubin”, “hyperbilirubinemia”, “jaundice”, “kernicterus”, “neonatal”). The search was limited to English-language studies in humans focusing on newborns between birth and one month of age. In addition, the same text words used for the MEDLINE® search were used to search the Pre-MEDLINE® database. The strategy yielded 4,280 MEDLINE® and 45 Pre-MEDLINE® abstracts. We consulted domain experts and examined relevant review articles for additional studies. A supplemental search for case reports of kernicterus in reference lists of relevant articles and reviews was also performed.

Screening and Selection Process

In our preliminary screening of abstracts, we identified over 600 potentially relevant articles for questions 1, 2, and 3. To handle this large number of articles, we devised the following scheme to address the key questions and to ensure that the report was completed within the time and resource constraints. We included only studies that measured neurodevelopmental or behavioral outcomes (except for question 3, part 1, for which we evaluated all studies addressing the number-needed-to-treat (NNT) issue regardless of whether the study reported these outcomes). For the specific question on quantitative estimates of treatment efficacy, all studies concerning therapies designed to prevent hyperbilirubinemia (generally defined as bilirubin ≥ 20 mg/dl) were included in the review. The inclusion and exclusion criteria for the systematic review were discussed in several teleconferences of the EPC evidence review team and technical experts. The criteria underwent several revisions before their final acceptance by the panel members. The final screening criteria for inclusion and exclusion of articles are described below.

Inclusion Criteria

The target population of this review was healthy, full-term infants. For the purpose of this review, we included articles concerning infants who were at least 34 weeks estimated gestational age (EGA) at the time of birth. For studies that reported birthweight rather than gestational age, infants whose birthweight was greater than or equal to 2,500 grams were included. This cut-off was derived from findings of the National Institute of Child Health and Human Development hyperbilirubinemia study (Bryla, 1985), in which none of the 1,339 infants weighing greater than or equal to 2,500 grams was less than 34 weeks EGA. Articles were selected for inclusion in the systematic review based on the following additional criteria:

  • Key Question 1 or 2 (risk association)
    • Population: Infants ≥ 34 weeks EGA or ≥ 2,500 grams
    • Sample Size: More than 5 subjects per arm
    • Predictors: Jaundice or hyperbilirubinemia
    • Outcomes: At least one behavioral/neurodevelopmental outcome reported in the article
    • Study Design: Prospective cohorts (more than 2 arms), prospective cross-sectional study, prospective longitudinal study, prospective single-arm study, or retrospective cohorts (more than 2 arms).
    Case reports of kernicterus:
    • Population: Kernicterus case
    • Study Design: Case reports with kernicterus as a predictor or an outcome
    • Definitions of kernicterus: Acute phase of kernicterus (poor feeding, lethargy, high-pitched cry, increased tone, opisthotonus, seizures), kernicterus sequelae (motor delay, sensorineural hearing loss, gaze palsy, dental dysplasia, cerebral palsy, mental retardation), necropsy finding of yellow-staining in the brain nuclei.
  • Key Question 3 (treatments)
    Number-Needed-to-Treat (NNT) question:
    • Population: Infants ≥ 34 weeks EGA or ≥ 2,500 grams
    • Sample Size: More than 10 subjects per arm
    • Treatments: Any treatment for neonatal hyperbilirubinemia
    • Outcomes: Serum bilirubin level ≥ 20 mg/dl or frequency of exchange transfusion specifically for bilirubin level ≥ 20 mg/dl
    • Study Design: Randomized or non-randomized controlled trials
    All other issues:
    • Population: Infants ≥ 34 weeks EGA or ≥ 2,500 grams
    • Sample Size: More than 10 subjects per arm for phototherapy; any sample size for other treatments
    • Treatments: Any treatment for neonatal hyperbilirubinemia
    • Outcomes: At least one neurodevelopmental outcome was reported in the article
  • Key Question 4 or 5 (diagnosis)
    • Population: Infants ≥ 34 weeks EGA or birthweight ≥ 2,500 grams
    • Sample Size: More than 10 subjects
    • Reference Standard: Laboratory-based serum bilirubin
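The population and sample-size thresholds listed above can be read as a simple screening rule. The following is a minimal sketch, not the review team's actual procedure: the function names, the dictionary keys, and the choice to fall back on birthweight only when gestational age is not reported are illustrative assumptions.

```python
# Hypothetical encoding of the thresholds listed above; all names are illustrative.
MIN_SUBJECTS = {
    "association": 6,   # "more than 5 subjects per arm" (questions 1 and 2)
    "nnt": 11,          # "more than 10 subjects per arm" (question 3, NNT)
    "diagnosis": 11,    # "more than 10 subjects" (questions 4 and 5)
}

def population_eligible(ega_weeks=None, birthweight_g=None):
    """True if the infants meet the >= 34 weeks EGA or >= 2,500 g criterion."""
    if ega_weeks is not None:
        return ega_weeks >= 34
    if birthweight_g is not None:
        # Assumption: birthweight is used only when EGA is not reported.
        return birthweight_g >= 2500
    return False  # neither value reported, so eligibility cannot be confirmed

def passes_screen(topic, smallest_group, ega_weeks=None, birthweight_g=None):
    """Apply the sample-size and population thresholds for one topic."""
    return (smallest_group >= MIN_SUBJECTS[topic]
            and population_eligible(ega_weeks, birthweight_g))
```

Outcome, treatment, and study-design criteria still require reading the article and are not captured by this sketch.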

Exclusion Criteria

Case reports of kernicterus were excluded if they did not report serum bilirubin level, or gestational age and birthweight.

Results of Screening of Titles and Abstracts

Preliminary screening identified 663 potentially relevant abstracts out of a total of 4,560 abstracts located through the literature search described above. There were 158, 174, 99, 153, and 79 abstracts for questions 1, 2, 3, 4, and 5, respectively.

Screening of Full-Text Articles

After full-text screening (according to the inclusion and exclusion criteria described above), 138 of the 253 retrieved articles were included in this report. There were 35 articles in the correlation section (questions 1 and 2), 28 articles of kernicterus case reports, 21 articles in the treatment section (question 3), and 54 articles in the diagnosis section (questions 4 and 5). There were inevitable overlaps because treatment effects and neurodevelopmental outcomes were inherent in the study designs. Below is a summary of the four-step selection process.

Table. Literature Selection Process by Topic.

Reporting the Results

Articles that passed the full-text screening were grouped according to topic and were carefully analyzed in their entirety. Data were abstracted onto the data extraction form that had been specially designed for each topic (see Appendix B for examples of the data extraction forms). Extracted data were synthesized into evidence tables.

The evidence found for the key questions is summarized in three complementary forms. The evidence tables provide detailed information about key features of the study design and results of all the studies reviewed. In addition, narrative description and tabular summary of the strength and quality of the evidence of each study are provided for each question. For question 5, meta-analyses were performed to provide quantitative estimates of test performance.

A total of six evidence tables are included in this evidence report. Two evidence tables were created for question 3 because of different types of outcomes analyzed. One table each was created for question 4 and question 5. For case reports of kernicterus, one evidence table was created for recording relevant data for each kernicterus case.

Summarizing the Evidence of Individual Studies

Grading of the evidence can be useful for indicating the overall methodological quality of a study. While a simple evidence grading system using a single scale may be desirable, the “quality” of evidence is multi-dimensional, and a single metric cannot fully capture the information needed to interpret a clinical study. We believe that information on individual components of a study contributes more to the evaluation of evidence by deliberating bodies than a single summary score. The evidence-grading scheme used here assesses four dimensions that are important for the proper interpretation of the evidence:

  • study size
  • applicability
  • summary of results
  • methodological quality

Applicability, also known as generalizability or external validity, addresses the issue of whether the study population is sufficiently broad to be generalizable to the population at large. Individual studies are often unable to achieve broad applicability due to restricted study population characteristics and a small number of study subjects (Lau, Ioannidis, and Schmid, 1997). In this evidence report, because of the relative homogeneity (primary focus being healthy newborns) of the study populations, applicability is not explicitly graded for each study. Instead, where applicable, studies are grouped together to form more similar subgroups for analyses or discussion.

Study Size

The study sample size is used as a measure of the weight of the evidence. A large study provides a more precise estimation of the treatment effect but does not automatically confer broad applicability unless the study included a broad spectrum of patients. Very small studies, taken individually, cannot achieve broad applicability. But several small studies that enrolled diverse populations, taken together, may have broad applicability. The study size is included as a separate dimension used to assist the assessment of applicability. For summarizing all studies, this would be the number of studies and the total number of patients in these studies.

Methodological Quality

Methodological quality, or internal validity, addresses the design, conduct, and reporting of the study. Some of the items belonging to this entity are widely used in various “quality” scales for randomized controlled trials. They usually include items such as concealment of random allocation, treatment blinding, and handling of dropouts. Because different types of study designs are used to address different questions, and for consistency in the interpretation across different designs, we defined a three-category scale to report the methodological quality of the studies in the evidence report: A (least bias), B (susceptible to some bias), or C (likely to have large bias). These criteria are described below.

  • Criteria for evaluating the methodological quality of studies that assess association (questions 1 and 2):
    A. Prospective. Complete methods and results (including inclusion/exclusion criteria). Proper control/comparison group, correct analyses performed.
    B. Prospective or retrospective. Not all criteria of A. Some deficiencies; however, unlikely to cause major bias.
    C. Prospective or retrospective. Significant design or reporting errors, large amount of missing information or potential bias.
  • Criteria for evaluating the methodological quality of studies that assess effects of treatments (question 3):
    A. Randomized controlled trial. Complete methods and results (including inclusion/exclusion criteria) described. Proper randomization and/or blinding, and correct analyses performed.
    B. Non-randomized controlled trial or other prospective design (prospective cohort or case-control study). Proper selection of control group. Not all criteria of A. Some deficiencies; however, unlikely to cause major bias.
    C. Retrospective or no control group. Significant design or reporting errors, large amount of missing information or significant potential bias.
  • Criteria for evaluating the methodological quality of studies that assess diagnostic test performance (questions 4 and 5):
    A. Prospective. Complete methods and results (including inclusion/exclusion criteria) described. Proper reference standard used and correct analyses performed.
    B. Prospective or retrospective. Not all criteria of A. Some deficiencies; however, unlikely to cause major bias.
    C. Prospective or retrospective. Significant design or reporting errors, large amount of missing information or bias.

Definition of Terminology in This Report

  • Confounders (for Key Question 1 only): (1) An ideal study design to answer Question 1 would be to follow two groups, jaundiced and normal infants, without treating any infant for a current or consequent jaundice condition, and observe their neurodevelopmental outcomes. Therefore, any treatment received by the subjects in the study was defined as a confounder. (2) If subjects had known risk factors of jaundice, such as prematurity or low birth weight, the risk factors were defined as confounders. (3) Any disease condition other than jaundice was defined as a confounder. (4) Since bilirubin level is the essential predictor, if the study did not report or measure bilirubin levels for the subjects, lack of bilirubin measurements was defined as a confounder.
  • Acute phase of kernicterus: poor feeding, lethargy, high-pitched cry, increased tone, opisthotonus, seizures.
  • Chronic kernicterus sequelae: motor delay, sensorineural hearing loss, gaze palsy, dental dysplasia, cerebral palsy, mental retardation.

Brainstem Auditory Evoked Potential (BAEP) or Brainstem Auditory Evoked Response (BAER)

  • This test is generally used for screening of newborn hearing and the report uses the following definition:
    “It is recorded as five to seven waves. Waves I, III and V can be obtained consistently in all age groups. Waves II and IV appear less consistently. The latency of each wave (time of occurrence of the wave peak after stimulus onset) increases, and the amplitude decreases with reductions in stimulus intensity or loudness. Developmental change occurs in the latency of the various waves; latency decreases with increasing age, with the earliest waves reaching mature latency values earlier in life than the later waves. It is used as an audiometric test, providing information about the ability of the peripheral auditory system to transmit information to the auditory nerve and beyond, and it is used in the monitoring of central nervous system pathology. It is not accurate in predicting neurologic recovery and outcome. (Behrman, Kliegman, Jenson, 2000).”

Statistical Analyses

In this report, two statistical analyses were performed when there were sufficient data: the number needed to treat (NNT) and receiver operating characteristics (ROC) curves.

Number Needed to Treat (NNT)

NNT, expressing the benefit of an active treatment over a control, was calculated to quantify the efficacy of treatment for neonatal hyperbilirubinemia. NNT can be used either for summarizing the results of a therapeutic trial or for individualized medical decision-making. For key Question 3 in this report, NNT can be interpreted as the number of newborns needed to be treated at 20 mg/dl to keep the TSB in one newborn from rising.

The absolute risk reduction (ARR) is the difference between the event rate in the treatment group and the event rate in the control group. It is the denominator in the NNT calculation. We report 95% confidence intervals along with all estimates.

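$$\mathrm{NNT} = \frac{1}{\mathrm{ARR}}, \qquad \mathrm{ARR} = \text{event rate}_{\text{control}} - \text{event rate}_{\text{treatment}}$$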
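The NNT calculation described above can be sketched in a few lines. This is an illustration under stated assumptions, not the report's actual computation: the function name is hypothetical, and the confidence interval uses a simple Wald interval for the ARR that is then inverted, a method the report does not specify.

```python
import math

def nnt_with_ci(events_control, n_control, events_treated, n_treated, z=1.96):
    """Return (ARR, NNT, NNT confidence interval or None) from two-arm counts."""
    cer = events_control / n_control    # event rate in the control group
    eer = events_treated / n_treated    # event rate in the treatment group
    arr = cer - eer                     # absolute risk reduction
    # Wald standard error for a difference of two proportions (assumed method)
    se = math.sqrt(cer * (1 - cer) / n_control + eer * (1 - eer) / n_treated)
    arr_lo, arr_hi = arr - z * se, arr + z * se
    nnt = 1 / arr if arr > 0 else math.inf
    if arr_lo <= 0:
        # The ARR interval includes zero, so the NNT has no finite interval.
        return arr, nnt, None
    return arr, nnt, (1 / arr_hi, 1 / arr_lo)

# Hypothetical counts: 20/100 control infants and 10/100 treated infants
# reach TSB >= 20 mg/dl.
arr, nnt, ci = nnt_with_ci(20, 100, 10, 100)
```

Inverting the ARR limits swaps their order: the upper ARR limit gives the lower NNT limit.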

Receiver Operating Characteristics (ROC) Curve

ROC curves were developed for individual studies of key question 4 if multiple thresholds of a diagnostic technology were reported. The areas under the ROC curves (AUC) were calculated to provide an assessment of the overall accuracy of the test and allow indirect comparisons with other tests.
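As an illustration of how an AUC can be obtained from a handful of reported thresholds, the sketch below uses the trapezoidal rule. The report does not state which integration method was used, so the trapezoidal rule and the function name are assumptions.

```python
def roc_auc(points):
    """Trapezoidal area under an ROC curve.

    points: iterable of (sensitivity, specificity) pairs, one per threshold.
    """
    # Convert to (false positive rate, true positive rate) and add the
    # trivial end points (0, 0) and (1, 1) that every ROC curve passes through.
    roc = sorted({(1 - spec, sens) for sens, spec in points}
                 | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(roc, roc[1:]):
        area += (x1 - x0) * (y0 + y1) / 2   # trapezoid between adjacent points
    return area
```

A test no better than chance gives an area of 0.5, and a perfect test gives 1.0.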

Meta-analyses of Diagnostic Test Performance

Meta-analyses were performed to quantify the performance of transcutaneous bilirubin measurements where the data were sufficient. We used three complementary methods for assessing diagnostic test performance: summary receiver operating characteristics (SROC) analysis, independently combined sensitivity and specificity values, and meta-analysis of correlation coefficients. All meta-analysis schematics and statistics are reported in the Meta-Analyses chapter.

Summary Receiver Operating Characteristics (SROC) Analysis

The SROC method assumes that the variability in the reported sensitivity and specificity values from different studies is due to different cut-off values applied (Moses, Shapiro, and Littenberg, 1993). Each study provides a pair of sensitivity and specificity values to the analysis. It uses a regression method to fit a curve that best describes the data in the ROC space. We used the unweighted SROC method because it is probably less biased than the weighted regression method (Irwig, Macaskill, Glasziou, et al., 1995).

The areas under different SROC curves can also be calculated and compared across technologies. However, the range of sensitivity and specificity values from studies in a meta-analysis of diagnostic tests is often limited, and extrapolation of the SROC analysis beyond the values of actual data is not reliable. Most of the technologies we examined have narrow reported ranges of sensitivity or specificity values. Therefore, we did not calculate the area under the SROC curve for any of the technologies.
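The unweighted regression behind the SROC curve can be sketched as follows. This is a minimal illustration of the approach described by Moses, Shapiro, and Littenberg, not the Meta-Test implementation; the function names are hypothetical, and the sketch assumes every sensitivity and specificity lies strictly between 0 and 1 (no continuity correction is applied).

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def fit_sroc(pairs):
    """Unweighted least-squares fit of D = a + b*S.

    pairs: list of (sensitivity, specificity), one pair per study.
    D = logit(TPR) - logit(FPR) and S = logit(TPR) + logit(FPR).
    """
    d, s = [], []
    for sens, spec in pairs:
        tpr, fpr = sens, 1 - spec
        d.append(logit(tpr) - logit(fpr))
        s.append(logit(tpr) + logit(fpr))
    n = len(pairs)
    s_mean, d_mean = sum(s) / n, sum(d) / n
    sxx = sum((si - s_mean) ** 2 for si in s)
    sxy = sum((si - s_mean) * (di - d_mean) for si, di in zip(s, d))
    b = sxy / sxx
    a = d_mean - b * s_mean
    return a, b

def sroc_sensitivity(a, b, fpr):
    """Sensitivity on the fitted SROC curve at a given false positive rate."""
    x = a / (1 - b) + (1 + b) / (1 - b) * logit(fpr)
    return 1 / (1 + math.exp(-x))
```

Consistent with the caution above, `sroc_sensitivity` should only be evaluated at false positive rates within the range actually observed in the included studies.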

Independently Combined Sensitivity and Specificity Values

When there is little variability in the test results (studies appear to be operating at similar thresholds and report similar results), SROC analysis provides little additional information. In this case, sensitivity and specificity values averaged separately across studies give similarly useful summary information.

We combined the sensitivity and specificity values of the tests across studies using a random-effects model to estimate the average values. A random-effects model incorporates both the within-study variation (sampling error) and between-study variation (true treatment-effect differences) into the overall treatment estimate. It gives a wider confidence interval than the fixed-effects model (which considers only within-study variability) when estimates are based on heterogeneous results.
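A random-effects average of sensitivity (or specificity) across studies can be sketched as below. The report does not name the estimator, so the DerSimonian-Laird between-study variance, the logit scale, the 0.5 continuity correction, and the function names are all assumptions made for illustration.

```python
import math

def pool_random_effects(estimates, variances, z=1.96):
    """DerSimonian-Laird pooling; returns (pooled, (lower, upper), tau2)."""
    w = [1.0 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    # Between-study variance, truncated at zero
    tau2 = max(0.0, (q - (len(estimates) - 1)) / c) if c > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return pooled, (pooled - z * se, pooled + z * se), tau2

def pool_proportions(counts):
    """counts: list of (events, total), e.g. (true positives, all with disease)."""
    ys, vs = [], []
    for x, n in counts:
        x_adj, n_adj = x + 0.5, n + 1.0          # continuity correction (assumed)
        ys.append(math.log(x_adj / (n_adj - x_adj)))
        vs.append(1.0 / x_adj + 1.0 / (n_adj - x_adj))
    pooled, (lo, hi), tau2 = pool_random_effects(ys, vs)
    expit = lambda t: 1.0 / (1.0 + math.exp(-t))
    return expit(pooled), (expit(lo), expit(hi)), tau2
```

When the studies agree closely, the estimated between-study variance is zero and the result coincides with the fixed-effects average.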

When each is combined separately, sensitivity and specificity tend to underestimate the true test sensitivity and specificity. They are nonetheless useful estimates of the average test performance and provide an indication of the approximate test operating point for most of the studies. The appropriateness of this method can be verified by inspecting the location of the combined estimates and noting the distance of the estimates from the SROC curve. In our experience, the random-effects averaged sensitivity and specificity results are close to the unweighted SROC curve and well within the confidence intervals of each other. Average sensitivity and specificity results also serve as useful baseline test performance values for the decision and cost-effectiveness analysis.

Meta-analysis of Correlation Coefficients

Correlation coefficients measure the correlation of one diagnostic test to another, but do not provide any information about the clinical utility of the diagnostic test. They are also inadequate for measuring the accuracy of a test in estimating serum bilirubin levels, for two reasons. First, although correlation coefficients (r) measure the association between transcutaneous bilirubin and “standard” serum bilirubin measurements, the correlation coefficient is highly dependent on the distribution of serum bilirubin in the study population selected. Second, correlation measures ignore bias and measure relative rather than absolute agreement (Bland and Altman, 1986). Nonetheless, a large number of studies continued to use the correlation coefficient, and only some of them performed sensitivity and specificity analyses to examine test accuracy.

Meta-analyses were performed to compare transcutaneous bilirubin (TcB) against laboratory assay of total serum bilirubin (TSB) measurements, where the data were sufficient. These studies provided the correlation coefficient for TcB vs. TSB, but several analytic issues arose regarding potential duplication of information. To address these issues, the following rules were developed and applied:

Rule 1. Whenever a study provided not only data for the whole population but also data splitting the population into subgroups of patients, only the one entry for the whole population was used to avoid duplicating the data. However, if some subgroups of patients did not meet our inclusion criteria described above, the whole-population data were dropped and only the qualifying subgroup data were entered.

Rule 2. When a study gave data for separate conditions on the same subjects and then also an aggregate of all conditions, only the aggregate data were used to avoid duplication.

Rule 3. Studies were indexed by different measurement sites or measurement protocols (metrics). When several different metrics were provided for a specific comparison, all were retained and entered in the respective metric-specific subgroup analyses. However, only one metric was retained for the overall synthesis to avoid duplication. The order of preference was: forehead, sternum, other sites (preferring the most often used site).
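The site preference in Rule 3 amounts to a small ranking function. The sketch below is illustrative only: the record layout, the function name, and the `site_counts` argument (how often each site is used across the included studies) are hypothetical.

```python
# Preferred measurement sites in the order stated in Rule 3.
SITE_PREFERENCE = ["forehead", "sternum"]

def pick_overall_entry(entries, site_counts=None):
    """Choose the single entry per comparison used in the overall synthesis.

    entries: list of dicts with at least a 'site' key (one dict per metric).
    site_counts: optional dict mapping a site to how often it is used across
    studies, used to break ties among sites other than forehead and sternum.
    """
    site_counts = site_counts or {}

    def rank(entry):
        site = entry["site"]
        if site in SITE_PREFERENCE:
            return (SITE_PREFERENCE.index(site), 0)
        # Other sites: prefer the most often used one
        return (len(SITE_PREFERENCE), -site_counts.get(site, 0))

    return min(entries, key=rank)
```

All entries remain available for the metric-specific subgroup analyses; only the overall synthesis is restricted to the selected one.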

We combined the correlation coefficients across studies using a random-effects model. The random-effects model incorporates both within-study and between-study variation. Such models yield more “conservative” (i.e., wider) confidence intervals. Subgroup analyses on different factors that might affect the correlation coefficient between the test and the gold standard were also performed.

Since using the repeated measurements in one subject would overestimate the correlation coefficient, the number of measurements was replaced by the number of patients whenever it was available. We report 95 percent confidence intervals for all estimates.
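One common way to combine correlation coefficients under a random-effects model is to pool them on the Fisher z scale and transform back. The report does not describe the internals of the software used, so the Fisher transformation, the DerSimonian-Laird estimator, and the function name below are assumptions. The sample size entered for each study is the number of patients, following the convention stated above.

```python
import math

def pool_correlations(studies, z_crit=1.96):
    """studies: list of (r, n_patients). Returns (pooled r, (lower, upper), tau2)."""
    ys = [math.atanh(r) for r, _ in studies]     # Fisher z transform
    vs = [1.0 / (n - 3) for _, n in studies]     # approximate variance of z
    w = [1.0 / v for v in vs]
    fixed = sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, ys))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    # DerSimonian-Laird between-study variance, truncated at zero
    tau2 = max(0.0, (q - (len(ys) - 1)) / c) if c > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in vs]
    pooled = sum(wi * yi for wi, yi in zip(w_re, ys)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return (math.tanh(pooled),
            (math.tanh(pooled - z_crit * se), math.tanh(pooled + z_crit * se)),
            tau2)
```

Entering the number of measurements instead of the number of patients would shrink each variance `1 / (n - 3)` and give repeated-measures studies more weight than they warrant.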

Software and Statistical Analyses

Statistical analyses using the SROC curve method and combining sensitivity and specificity using the random-effects model were performed using “Meta-Test” version 0.6. The computer program was developed by the EPC director (Dr. Lau) and is available to the public. Meta-analysis of correlation coefficients using a random-effects model was performed using Comprehensive Meta-Analysis™ version 1.0.23 (Biostat™ Inc.). We report 95 percent confidence intervals along with all estimates.

Analyses for Kernicterus Case Reports

Where data were available, we extracted serum bilirubin measurements on admission or at the time of diagnosis of kernicterus. We also extracted the peak serum bilirubin levels. From these data, we created histograms of the distribution of serum bilirubin levels. We performed subgroup analyses of these cases to see whether any factors were related to the patterns of serum bilirubin distribution.