Chapter 2. Methods

In this chapter, we document the procedures that the Vanderbilt Evidence-based Practice Center used to develop this comprehensive evidence report on the treatment of overactive bladder (OAB) in women. We first describe our strategy for identifying articles relevant to our five key questions (KQs), our inclusion/exclusion criteria, and the process we used to abstract relevant information from the eligible articles and generate our evidence tables. We also discuss our criteria for grading the quality of individual articles and for rating the strength of the evidence as a whole. Finally, we describe the peer review process.

Literature Review Methods

Inclusion and Exclusion Criteria

Our inclusion/exclusion criteria were developed in consultation with the technical expert panel (TEP) to capture the literature most closely related to the key questions. The criteria are summarized below.

Table 1. Inclusion and Exclusion Criteria.

We excluded studies that (1) were not published in English; (2) did not report information pertinent to the key questions; (3) had fewer than 50 female participants at enrollment; (4) had less than 75 percent female participants and did not report results by gender; or (5) were not original studies.

For this review, the relevant population for all key questions was women with overactive bladder, defined as “idiopathic urinary urgency and frequency with or without associated urge urinary incontinence, not related to neurogenic conditions or as a result of (incontinence) surgery.” The same inclusion/exclusion criteria were applied to identify papers for treatment-related key questions. We applied additional restrictions for inclusion and exclusion of incidence and prevalence publications for KQ1. To inform this question about the epidemiology of OAB and/or its component symptoms, we required that the study methods specify a population base a priori from which a sample of individuals was drawn as a representative selection to estimate the true proportion of prevalent and incident cases of OAB in the larger population. Additional information is provided in the results for KQ1. For KQ5, we required that publications provide data on direct costs in United States dollars for treatments reviewed in this report.

Treatment studies in which at least 75 percent of participants were women were included. This threshold was selected with expert input for two reasons: to avoid restricting the review to studies with only female participants, because a large proportion of this literature includes both men and women, and to establish a level below which differences in underlying processes (e.g., BPH in men) might substantially influence treatment effects. Studies with lower proportions of women were included if they presented results separately for women, indicated that an interaction with gender was tested and found not to exist, or controlled for gender in the analysis. Publications about incontinence that did not distinguish among, or present results by, urge, stress, and/or mixed incontinence were excluded.
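The gender-related rules above can be read as a simple predicate. The following is a minimal sketch under stated assumptions, not the review team's screening tool: the `Study` record, its field names, and the function name are hypothetical, and the remaining criteria (English language, original research, relevance to the key questions) are assumed to have been checked separately.

```python
from dataclasses import dataclass


@dataclass
class Study:
    """Hypothetical record holding only the fields the gender rules need."""
    n_women: int                           # female participants at enrollment
    n_total: int                           # all participants at enrollment
    results_by_gender: bool = False        # results presented separately for women
    interaction_tested_null: bool = False  # gender interaction tested, none found
    gender_adjusted: bool = False          # gender controlled in the analysis


def meets_gender_criteria(s: Study) -> bool:
    """Apply the 50-women minimum and the 75 percent threshold with its exceptions."""
    if s.n_women < 50:
        return False
    if s.n_women / s.n_total >= 0.75:
        return True
    # Below the threshold: eligible only if gender was handled explicitly.
    return s.results_by_gender or s.interaction_tested_null or s.gender_adjusted
```

For example, a hypothetical trial with 60 women among 100 participants would pass only if it reported results separately for women, tested and ruled out an interaction with gender, or adjusted for gender.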

Literature Search and Retrieval Process

Databases. We employed multi-term search strategies to retrieve research about treatment of overactive bladder in women, searching three databases: MEDLINE® (via PubMed), EMBASE, and the Cumulative Index to Nursing and Allied Health Literature (CINAHL). We also hand-searched the reference lists of relevant articles to identify additional studies for review.

Search terms. Controlled vocabulary terms served as the foundation of our search in each database, complemented by additional keyword phrases to represent the myriad ways in which overactive bladder is referred to in the clinical literature. We also employed indexing terms within each of the databases to exclude undesired publication types (e.g., reviews, case reports, CME handouts) and items published in languages other than English.

Tables 2, 3, and 4 outline our search terms and the yield from each database. Our searches were executed between April and October 2008 and were not limited by date. From PubMed, we identified 2,400 items for further review; EMBASE yielded 318 items, including 310 already identified in PubMed and 8 unique items; CINAHL yielded 264 citations, including 240 duplicates with PubMed and 24 new articles for review.

Table 2. PubMed search strategies (last updated October 1, 2008).

Table 3. EMBASE search (OVID) (last updated October 1, 2008).

Table 4. CINAHL search (EBSCO) (last updated October 1, 2008).

Yield of literature searches. Figure 2 presents the yield and results from our searches. Beginning with a yield of 2,559 articles, we retained 232 articles covering 221 studies that we determined were relevant to answer our key questions and met our inclusion/exclusion criteria.

Figure 2. Disposition of articles for the treatment of overactive bladder (OAB).

KQ = key question. *The number of articles addressing each key question and those excluded exceed the total number of articles in each category because some articles fit into multiple ...

Article selection process. Once we identified articles through the electronic database searches, review articles, and bibliographies, we examined abstracts of articles to determine whether studies met our criteria. Two reviewers separately evaluated each abstract for inclusion or exclusion, using an Abstract Review Form (Appendix B). If either reviewer concluded that the article could be eligible for the review based on the abstract, we retained it. The review group included three physicians (KH, DB, RW) and two senior health services researchers (MM, SM).

Of the entire group of 2,559 articles, 586 required full text review. For the full article review, two reviewers read each article and decided whether it met our inclusion criteria, using a Full Text Inclusion/Exclusion form. Reasons for article exclusion are listed in Appendix D.

Literature Synthesis

Development of Evidence Tables and Data Abstraction Process

The staff members and clinical experts who conducted this review jointly developed the evidence tables. We designed the tables to provide sufficient information to enable readers to understand the studies and to determine their quality; we gave particular emphasis to essential information related to our KQs. We based the format of our evidence tables on successful designs used for prior systematic reviews.

Team members trained by abstracting several articles into evidence tables and then reconvening as a group to discuss the utility of the table design. We repeated this process through several iterations until we decided that the tables included the appropriate categories for gathering the information contained in the articles. Working a priori with the technical expert panel, we developed a hierarchy of baseline characteristics and outcome measures: UUI episodes, urgency, incontinence, voids, nocturia, QoL, urodynamic measures, and adverse events. All team members shared the task of initially entering information into the evidence tables. A second team member then reviewed each article and edited all initial table entries for accuracy, completeness, and consistency. The two abstractors reconciled disagreements concerning the information reported in the evidence tables. The full research team met regularly during the article abstraction period and discussed global issues related to the data abstraction process. In addition to outcomes related to treatment effectiveness, we abstracted all data available on harms. Harms encompass the full range of specific negative effects, including those that meet the narrower definition of adverse events.

The final evidence tables are presented in their entirety in Appendix C. Studies are presented in the evidence tables alphabetically by the last name of the first author. When possible, studies resulting from the same study population were grouped into a single evidence table. A list of abbreviations and acronyms used in the tables appears at the beginning of that appendix.

Synthesis of the Evidence

A series of spreadsheets was created to support systematic tabulation and assessment of study characteristics including key study population characteristics, number of participants by group, treatment received, length of followup, age of participants by group, outcomes measured and outcomes. This allowed us to identify common threads in reporting across publications.

Within the pharmaceutical treatment studies, all unique trial arms were entered into a spreadsheet with exact values to two decimal places, as available, for baseline measures, followup measures, difference from baseline, and the related statistical indicators of precision (such as standard deviations, standard errors, or confidence bounds). This was done to facilitate calculation of weighted averages and to support meta-analysis.

Conduct of meta-analysis. Descriptive statistics were computed and examined for homogeneity among studies. Studies that reported weekly rates for UUIs and voids were standardized to daily rates. When only ranges of continuous variables were reported (instead of standard deviations), we estimated the standard deviations by dividing the range by four [5]. Study results were combined and summarized using two meta-analysis techniques: weighted averages and fixed effects regression models [6]. In particular, minimum variance weighted averages of the mean daily decrease in UUI and voids per arm were computed using weights that were inversely proportional to their standard errors. To borrow strength across arms, we used fixed effects regression models with robust standard errors (to account for the clustering by study) and weighted the study arms inversely proportional to their standard errors of the mean. Each arm was treated as a fixed effect, and study was not included in the model except in the sense that the clustering was addressed by the robust standard errors. Fixed effects models were also adjusted for mean age and proportion of women in each arm. We used Stata 10.0 and R for computations.
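The data preparation and weighted-average steps described above can be sketched in a few lines. This is an illustrative sketch only, not the Stata or R code used for the report: the function names are hypothetical, the weights follow the wording of the text (proportional to 1/SE), and the regression models with robust standard errors are not reproduced here.

```python
import math


def weekly_to_daily(rate_per_week: float) -> float:
    """Standardize a weekly event rate (e.g., UUI episodes) to a daily rate."""
    return rate_per_week / 7.0


def sd_from_range(minimum: float, maximum: float) -> float:
    """Approximate a standard deviation as range / 4 when only the range is reported."""
    return (maximum - minimum) / 4.0


def standard_error(sd: float, n: int) -> float:
    """Standard error of the mean for an arm with n participants."""
    return sd / math.sqrt(n)


def weighted_mean(means, standard_errors):
    """Weighted average of arm-level means with weights proportional to 1 / SE.

    This follows the wording of the text. Classical inverse-variance
    weighting would use 1 / SE**2; change the weight line to compare.
    """
    weights = [1.0 / se for se in standard_errors]
    return sum(w * m for w, m in zip(weights, means)) / sum(weights)
```

With two made-up arms reporting mean daily decreases of 1.0 and 2.0 and standard errors of 0.1 and 0.3, `weighted_mean([1.0, 2.0], [0.1, 0.3])` gives 1.25, pulled toward the more precisely estimated arm.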

Summary tables within this report. Each of the pharmacologic agents in this report is available in at least one dose form for clinical use. As part of the process of Food and Drug Administration (FDA) approval, each has been determined superior to placebo for at least one facet of the syndrome, e.g., urge urinary incontinence, frequency of urination, symptoms of urgency, or nocturia. The experience of having overactive bladder is a constellation of these self-reported events, symptoms, and the impact that they have on an individual’s life. Thus measures of quality of life, interference with daily activities, degree of distress from symptoms, and satisfaction with the outcomes of treatment are also common and helpful metrics in this literature. Where common measures are available across studies using roughly comparable assessments (i.e., similar index questions, time intervals, etc.), we have compiled tables to summarize outcomes of treatments.

Given the content of the literature, this means that the majority of the information included in tables is for the outcomes of number of urge incontinence episodes per day and number of voids per day. (Studies with weekly or other metrics that could be converted to a daily metric are included.) Because momentum in drug development within related classes of drugs has been toward daily dosing, the pharmacologic treatments are arranged from highest dose at lowest frequency of administration (daily) to lower doses and greater administration frequency (twice or more daily). Placebo arms from the same trials in which the drugs were evaluated are included within the table or, for areas in which the literature is large, in companion placebo outcome tables. The number of weeks of treatment and the timing of followup outcome assessment were the same in these trials. The weeks of treatment column is therefore comparable to weeks at evaluation of outcomes.

Summary tables include data from distinct clinical trial arms in which the drug and dose or other type of treatment were evaluated for the related outcome. As a result, a single study may contribute more than one treatment arm as well as a placebo arm to the summary tables. For pharmacologic treatment we included only study arms in which no dose adjustment was allowed. Because many studies are dose finding with multiple drug arms or are direct comparisons of pharmacologic agents, there are more drug arms than placebo arms for virtually all drugs and treatment types.

Quality Rating of the Individual Studies

Rating the quality of individual articles. We developed our approach to assessing the quality of individual articles based on our prior experience with conducting systematic reviews.

Internal validity. The criteria for assessing internal validity were as follows:

Randomized allocation to treatment. This assessment combines randomization and method of randomization into a single criterion with a three-point scale.

Rationale: By randomly assigning groups to the intervention of interest, other factors that may confound the results are equally distributed between groups (assuming a large enough sample size). This equal distribution minimizes the chances of over- or underestimation of treatment effect based on unequal distribution of confounding factors.

If randomized, we also evaluated the study for randomization methods, using the rationale described in Matchar and colleagues, 2001 [190].

Rationale: “Pseudo-randomization” methods may be susceptible to bias, as demonstrated by evidence of unequal distribution of subject characteristics [191] and larger effect sizes compared with studies using more rigorous methods [192]. In addition, methods of allocation concealment are also important in preventing bias (e.g., use of prepared sealed envelopes).

We combined these elements into a single operational definition, as described below:

Operational definition: Criterion met if randomization methods were not susceptible to bias, such as computer-generated numbers in sealed sequentially numbered envelopes (+). Criterion not met by studies that either used methods more prone to bias, such as alternate medical record numbers, or did not describe randomization methods or methods of allocation concealment (−). Criterion not applicable if treatment was not randomly allocated (NA).
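As an illustration, the three-level randomization criterion could be encoded as follows. This is a sketch of one reading of the operational definition, not the scoring tool used by the review team; the function and parameter names are hypothetical, and an ASCII hyphen stands in for the minus sign.

```python
def rate_randomization(
    randomized: bool,
    method_described: bool = False,
    method_bias_prone: bool = False,
    concealment_described: bool = False,
) -> str:
    """Return '+', '-', or 'NA' for the randomized-allocation criterion."""
    if not randomized:
        return "NA"  # criterion does not apply to non-randomized designs
    if method_described and concealment_described and not method_bias_prone:
        return "+"   # e.g., computer-generated numbers in sealed, numbered envelopes
    return "-"       # bias-prone method, or methods/concealment not described
```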

Masking.

Rationale: Masking, also known as blinding, refers to the concealment of treatment allocation from the care provider, the assessor, and the patient [193]. In certain trials, particularly surgical trials, masking the patient or the surgeon from the treatment allocation can be challenging or impossible. Similarly, masking the assessor assigned to record immediate post-procedural outcomes such as wound healing can also be difficult. Nevertheless, when possible, masking prevents expectations from influencing findings.

Operational definition: Criterion was met if assessors and participants were masked to treatment or group (+). Criterion was not met if the care provider, assessor, or patient was not masked (−). Criterion not applicable if treatment was not randomly allocated (NA).

Adequate description of patients and control selection criteria.

Rationale: Patient characteristics that might affect outcomes (such as severity of symptoms, duration of symptoms, failure of prior treatment, or medical comorbidities) are likely to differ between two interventions. If these differences are not characterized, then erroneous conclusions may be drawn.

Operational definition: Criterion met if inclusion and exclusion criteria for participation in the study were well described.

We expected the study population to be described well enough to make clear the potential for confounding in the analysis and to allow another investigator, in principle, to reproduce it. We also expected comparable methods to be used to identify and screen participants across exposure or treatment groups.

Description of loss to followup.

Rationale: Failing to account for patients lost to followup may lead to erroneous conclusions, especially if the loss to followup is related to either the underlying disease or the intervention (e.g., patients seeking care elsewhere because of continuing symptoms or unacceptable side effects of treatment).

Operational definition: Criterion met for adequate followup (+) if (a) loss to followup was explicitly reported and (b) no more than 20 percent of any study arm was lost to followup. Those studies with less than 10 percent lost to followup were given an extra (+). Studies with greater than 20 percent lost to followup were considered inadequate for this measure (−).
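The thresholds in this definition translate directly into a small scoring function. The sketch below is illustrative only: the names are hypothetical, and it assumes that both the 20 percent and the 10 percent thresholds are judged against the arm with the greatest loss, which the text does not state explicitly for the 10 percent rule.

```python
def rate_loss_to_followup(reported: bool, lost_fraction_by_arm) -> str:
    """Return '++', '+', or '-' for the loss-to-followup criterion.

    lost_fraction_by_arm holds the proportion lost in each study arm,
    e.g., [0.05, 0.12] for 5 percent and 12 percent.
    """
    if not reported or not lost_fraction_by_arm:
        return "-"           # loss to followup not explicitly reported
    worst = max(lost_fraction_by_arm)
    if worst > 0.20:
        return "-"           # more than 20 percent lost in some arm
    if worst < 0.10:
        return "++"          # under 10 percent lost: extra (+)
    return "+"               # reported, and no arm above 20 percent
```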

Description of dropout rates.

Rationale: Dropout rates may reflect differences in clinically important variables, such as side effects or treatment response. Failure to account for dropouts may result in erroneous conclusions similar to those seen with failure to account for loss to followup.

Operational definition: Criterion met if (a) patients dropping out of the study prior to completion were reported and (b) no more than 10 percent in any study arm left the study for reasons related to the study intervention or withdrawal of consent.

Power calculation provided.

Rationale: Many studies, especially case series, lack sufficient power to detect clinically important differences in outcomes or patient characteristics.

Operational definition: Criterion met if a power calculation (pre or post) was provided.

Recognition and description of statistical issues.

Rationale: Use of inappropriate tests may lead to misleading conclusions. For example, variables such as number of voids per day or costs are often not normally distributed; use of means instead of medians when data may be affected by outlying observations can be misleading.

Operational definition: Criterion met if (a) appropriate statistical tests were used (e.g., nonparametric methods for variables with nonnormal distributions, or survival analysis techniques to account for loss to followup and dropouts) and (b) potential study limitations regarding design and analysis were discussed. Criterion not met if (a) inappropriate statistical tests were used or (b) study limitations were not discussed. An intention-to-treat (ITT) analysis was required of clinical trials.

External validity. The criteria for assessing external validity were as follows:

Baseline characteristics: We created a composite score for adequacy of the description of baseline characteristics. At minimum, we expected age and baseline OAB status to be presented. If either of these was omitted, the criterion was not met. If the authors provided information beyond age and OAB status at baseline on any of the following, they were awarded an additional +: race/ethnicity, BMI, parity, menopausal status, prior treatment/surgery, duration of symptoms.

Required elements:

Description of age of study population.

Rationale: The outcomes of many interventions are affected by patient age. Age is especially important in studies related to reproductive health in women and is associated with rates of overactive bladder.

Operational definition: Criterion met if summary statistics of subject age were given by comparison group. Criterion not met if summary statistics were not given.

Baseline OAB status.

Rationale: The baseline level of severity of OAB could affect the likelihood of successful treatment. Furthermore, definitions of OAB are not consistent across studies and may include different combinations of urgency, frequency, and incontinence that could affect interpretation of the outcomes. Therefore, we sought to determine whether studies defined OAB status by ICS or other criteria, by UUI alone or by combinations of UUI, urgency and frequency.

Operational definition: Criterion met if symptoms of OAB were presented by study group.
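The composite baseline score described in this subsection can be sketched as a set comparison. This is an assumption-laden illustration rather than the scoring tool itself: the labels and function name are hypothetical, and it assumes a single additional + is awarded regardless of how many optional characteristics are reported.

```python
REQUIRED = frozenset({"age", "OAB status"})
OPTIONAL = frozenset({
    "race/ethnicity", "BMI", "parity", "menopausal status",
    "prior treatment/surgery", "duration of symptoms",
})


def rate_baseline_description(reported) -> str:
    """Return '++', '+', or '-' for the baseline-characteristics composite."""
    reported = set(reported)
    if not REQUIRED <= reported:
        return "-"    # age or baseline OAB status missing
    if reported & OPTIONAL:
        return "++"   # required elements plus at least one optional one
    return "+"        # required elements only
```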

Length of followup.

Rationale: Outcome measures may vary depending on when they are obtained. Description of when outcomes were measured facilitates comparison between studies. We considered three months to be a minimally acceptable period of followup for observing effectiveness of treatment for OAB.

Operational definition: Criterion met if the study followed participants for at least three months, with an extra point provided for greater than or equal to six months.
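A minimal sketch of the followup-length rule follows, assuming followup is expressed in months; the function name is hypothetical.

```python
def rate_followup_length(months: float) -> str:
    """Return '++' for six months or more, '+' for three to under six, else '-'."""
    if months >= 6:
        return "++"
    if months >= 3:
        return "+"
    return "-"
```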

Adequate description of methods used for outcome measurement.

Rationale: Comparison between studies requires common methods of measurement, which in turn requires adequate description of the methods used to assess comparability.

Operational definition: Criterion met if (a) methods used to measure outcomes were adequately described or referenced (e.g., 2002 ICS; QoL scales), (b) definitions were given (e.g., description of outcomes classified as “adverse events”), or (c) outcomes were unambiguous (e.g., reduction in number of voids per day). Criterion not met if none of (a), (b), or (c) was present.

Adequate description of validity and reliability of outcome measurement.

Rationale: Measurements of outcomes are only useful if changes in the outcome being measured are reflected in changes in the measurement (validity) and if these changes are reasonably consistent between the same observer measuring at different times or between different observers (reliability). For example, changes in a scale to assess menstrual blood flow should correlate with some other physiological measure of menstrual blood loss, and this correlation should be consistent when different women apply the same scale.

Operational definition: Criterion met if (a) a description of the methods used to assess validity and reliability of at least one outcome measure was provided, (b) a reference to another article documenting validity and reliability was provided, or (c) only unambiguous outcomes were included as primary outcomes. Criterion not met if none of (a), (b), or (c) was present.

Adequate description of the intervention provided to subjects.

Rationale: The ability to replicate study results is dependent on adequate description of methods. Additionally, readers should be aware of aspects of clinical care that might influence outcomes.

Operational definition: Criterion met if (a) a detailed description of the therapy (dose, dosing schedule, protocols for behavioral interventions, and route of administration for medications and/or techniques for invasive therapies) was provided; (b) a reference to another publication describing the procedure was provided; or (c) statistical adjustment was made for likely sources of variation in clinical care (e.g., site where care was given, type of specialist providing care, individual provider, dose and timing).

Criterion not met if none of (a), (b), or (c) was provided.

Table 5. Scoring algorithm for quality rating of individual studies.

Strength of Available Evidence

Our scheme follows the criteria applied in earlier systematic reviews of systems for rating the strength of a body of evidence [194, 195]. That system includes three domains: quality of the research, quantity of studies (including number of studies and adequacy of the sample size), and consistency of findings. Two senior investigators assigned grades by consensus.

We graded the body of literature for each key question and present those ratings as part of the discussion in Chapter 4. The possible grades were:

I. Strong: The evidence is from studies of strong design; results are both clinically important and consistent with minor exceptions at most; results are free from serious doubts about generalizability, bias, or flaws in research design. Studies with negative results have sufficiently large samples to have adequate statistical power.

II. Moderate: The evidence is from studies of strong design, but some uncertainty remains because of inconsistencies or concern about generalizability, bias, research design flaws, or adequate sample size. Alternatively, the evidence is consistent but derives from studies of weaker design.

III. Weak: The evidence is from a limited number of studies of weaker design. Studies with strong design either have not been done or are inconclusive.

IV. No evidence: No published literature.

External Peer Review

As is customary for all systematic evidence reviews done for AHRQ, this report was reviewed by a wide array of outside experts in the field, including our TEP, and by representatives of relevant professional societies and public organizations. AHRQ also requested review from its own staff. The Scientific Resource Center sent 11 invitations for peer review. Reviewers included clinicians (e.g., urologists, urogynecologists, gynecologists, geriatricians, family medicine physicians, and nurse practitioners), representatives of federal agencies, advocacy groups, and potential users of the report.

The Scientific Resource Center charged peer reviewers with commenting on the content, structure, and format of the evidence report, providing additional relevant citations, and pointing out issues related to how we had conceptualized and defined the topic and KQs. We also asked reviewers to complete a peer review checklist. The Scientific Resource Center received eight responses in addition to comments from AHRQ staff. The individuals listed in Appendix E gave us permission to acknowledge their review of the draft. We compiled all comments and addressed each one individually, revising the text as appropriate.