Trial and Error

Trial and Error: How to Avoid Commonly Encountered Limitations of Published Cardiovascular Clinical Trials

The randomized controlled clinical trial is the gold standard scientific method for the evaluation of diagnostic and treatment interventions. Such trials are cited frequently as the authoritative foundation for evidence-based management policies. Nevertheless, they have a number of limitations that challenge the interpretation of the results. The strength of evidence is often judged by conventional tests that rely heavily on statistical significance. Less attention has been paid to the clinical significance or the practical importance of the treatment effects. One should bear in mind that extremely large studies are more likely to find a formally statistically significant difference for a trivial effect that is not meaningfully different from the null. Trials often employ composite end points that, although they enable assessment of nonfatal events and improve trial efficiency and statistical precision, entail a number of shortcomings that can undermine the scientific validity of the conclusions drawn from these trials. Finally, clinical trials often employ extensive subgroup analysis. However, lack of attention to proper methods can lead to chance findings that might misinform research and result in suboptimal practice. Accordingly, this review highlights these limitations using numerous examples of published clinical trials and describes ways to overcome them, thereby improving the interpretability of research findings.

The randomized controlled trial (RCT) is the highest level of scientific progress in clinical medicine. Although the randomization process minimizes the imbalance in measured and unmeasured confounding variables, thereby allowing one to infer causation and not just association, a number of limitations nevertheless challenge the interpretation of the results. Three technical limitations stand out:

Clinical Significance Versus Statistical Significance

Less attention has been paid to the clinical significance or the practical importance of the treatment effects. Strength of evidence is often judged by conventional tests that rely heavily on statistical significance and estimation of confidence intervals (CIs).

Composite end points are often used to increase the proportion of outcome events and thereby reduce the requisite sample size. Although this practice improves trial efficiency and statistical precision, it entails a number of shortcomings that can undermine the scientific validity of the conclusions drawn from these trials.

Additional exploratory subgroup analyses are frequently performed without sufficient attention to the reliability of these subordinate analyses. This leads to the reporting of chance findings that encourage suboptimal patterns of practice. In this review, we highlight each of these limitations using numerous examples of published clinical trials and propose practical ways to avoid them, with the aim of improving the interpretation of published findings.

Emphasize Clinical Importance Over Statistical Significance

The conventional approach to assessing the strength of the association between an intervention and an outcome (the evidence) focuses on p values and confidence intervals (CIs).1-4 A p value, or observed significance level, provides a measure of the inconsistency of the data with respect to a specific hypothesis. In clinical trials, investigators pre-specify a significance level (most commonly 0.05) that represents the maximum probability they will tolerate of rejecting a hypothesis when it is in fact true. Some have suggested that p values provide a measure of the strength of the evidence against the null hypothesis: the smaller the p value, the stronger the evidence against the null hypothesis. For example, Sterne and Smith3 suggest that a p value of 0.05 need not provide strong evidence against the null hypothesis, but it is reasonable to say that a p value of 0.001 does. In contrast, others have cautioned that because p values are dependent on sample size, a p value of 0.001 should not be interpreted as providing more support for rejecting the null hypothesis than one of 0.05.4

In contrast to the well-established standards for decisions regarding statistical significance, no particular guidelines exist for deciding what magnitude of difference is “clinically significant” or “practically important”.1-6 The latter decision depends upon the quantitative magnitude of the treatment effect and the associated context: the seriousness and frequency of the outcome of interest and the benefit-risk-cost profile. Because of the inherently subjective and context-specific nature of these judgments, investigators have been reluctant to establish fixed boundaries for what constitutes a clinically significant difference.

An unintended consequence of this lack of an established standard has been an erroneous tendency to equate statistical significance with clinical significance. In some instances, statistically significant results may not be clinically important (e.g., small differences in studies with large sample size), and conversely, statistically nonsignificant results do not completely rule out the possibility of clinically important effects (e.g., large differences in studies with small sample size).6 Ideally, assessment of both statistical and clinical significance should be used to appraise the strength of the evidence and to aid in optimal utilization of therapeutic interventions in clinical practice.
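The dependence of the p value on sample size can be illustrated with a minimal sketch. The event counts below are hypothetical (9% versus 10% event rates at two trial sizes), and the pooled two-proportion z test is one common choice assumed here for illustration only.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def two_proportion_p_value(events_a: int, n_a: int, events_b: int, n_b: int) -> float:
    """Two-sided p value from a pooled two-proportion z test."""
    risk_a = events_a / n_a
    risk_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (risk_a - risk_b) / se
    return 2.0 * (1.0 - normal_cdf(abs(z)))


# Hypothetical trials: the same 1% absolute difference at two sample sizes.
p_small = two_proportion_p_value(90, 1_000, 100, 1_000)
p_large = two_proportion_p_value(9_000, 100_000, 10_000, 100_000)
print(f"1,000 per arm:   p = {p_small:.3f}")
print(f"100,000 per arm: p = {p_large:.2e}")
```

The treatment effect is identical in both hypothetical trials; only the sample size, and therefore the p value, differs.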

For example, consider the TACTICS–TIMI 18 (Treat Angina With Aggrastat and Determine Cost of Therapy With an Invasive or Conservative Therapy–Thrombolysis In Myocardial Infarction 18) trial, a randomized trial of early invasive versus early conservative management of patients with acute coronary syndromes (ACS).7 In designing the trial, the investigators powered the study to detect a 25% relative risk reduction, presumably representing their estimate of a minimum clinically important difference (MCID) in outcome. In the trial itself, a total of 177 events (15.9%) were observed among 1,114 patients assigned to early invasive management versus 215 events (19.4%) among 1,106 patients assigned to early conservative management.7 The relative risk reduction for this 3.5% absolute difference was 18% (95% CI: 2% to 32%), and this was determined to be statistically significant (p = 0.028). The investigators thereby concluded that early invasive management is superior to early conservative management. However, the key question for the thoughtful clinician is, “What is the probability that early invasive management is associated with a ‘clinically important’ benefit over early conservative management?” Simply stated, is the 18% risk reduction observed in this study “clinically important?”
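The arithmetic behind these figures can be checked with a short sketch. The function name is hypothetical, and the CI is computed with the standard large-sample method on the log relative risk, which is an assumption here and not necessarily the method used by the trial investigators.

```python
import math


def relative_risk_reduction(events_trt, n_trt, events_ctl, n_ctl, z=1.96):
    """Return the ARR, the RRR, and an approximate 95% CI for the RRR.

    Assumption: large-sample normal approximation on the log relative risk.
    """
    risk_trt = events_trt / n_trt
    risk_ctl = events_ctl / n_ctl
    rr = risk_trt / risk_ctl
    se_log_rr = math.sqrt(1 / events_trt - 1 / n_trt + 1 / events_ctl - 1 / n_ctl)
    rr_low = math.exp(math.log(rr) - z * se_log_rr)
    rr_high = math.exp(math.log(rr) + z * se_log_rr)
    return {
        "arr": risk_ctl - risk_trt,  # absolute risk reduction
        "rrr": 1 - rr,               # relative risk reduction
        "rrr_low": 1 - rr_high,      # lower CI limit of the RRR
        "rrr_high": 1 - rr_low,      # upper CI limit of the RRR
    }


# Event counts as reported for TACTICS-TIMI 18 in the text.
result = relative_risk_reduction(177, 1114, 215, 1106)
print(f"ARR = {result['arr']:.2%}")
print(f"RRR = {result['rrr']:.0%} "
      f"(95% CI: {result['rrr_low']:.0%} to {result['rrr_high']:.0%})")
```

Under this approximation, the relative risk reduction and its interval agree, to rounding, with the 18% (2% to 32%) quoted above.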

Sackett6 has proposed the use of confidence intervals to answer this question. According to this approach, if the summary treatment effect is large enough to exclude values smaller than the MCID, and not just the null value of zero, then the treatment provides both a statistically and clinically significant benefit.
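This rule can be expressed as a small decision function. Only the first branch (the CI excludes values below the MCID) is stated in the text; the remaining branches and the label wording are assumptions added here to make the sketch complete.

```python
def classify_against_mcid(ci_low: float, ci_high: float, mcid: float) -> str:
    """Classify a treatment effect (expressed as a benefit, e.g., RRR) by its CI.

    Assumes larger values mean more benefit and that 0 is the null value.
    """
    if ci_low > mcid:
        # Whole interval lies above the MCID.
        return "statistically significant and clinically important"
    if ci_low > 0:
        if ci_high < mcid:
            return "statistically significant but not clinically important"
        return "statistically significant; clinical importance not established"
    if ci_high >= mcid:
        return "not statistically significant; important effect not ruled out"
    return "not statistically significant; important effect unlikely"


# TACTICS-TIMI 18 figures from the text: RRR 95% CI of 2% to 32%, MCID of 25%.
print(classify_against_mcid(0.02, 0.32, 0.25))
```

For the TACTICS–TIMI 18 interval, the lower limit excludes zero but not the 25% MCID, so the sketch reports statistical significance without established clinical importance.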

Another index, the number needed to treat (NNT) or the number needed to harm, has also been used to assess clinical importance.8 These indexes and their 95% CIs are calculated as the inverse of the absolute risk difference and of its 95% confidence limits, respectively. In general, treatment interventions are deemed to be clinically important when the number needed to harm is much greater than the NNT or when the NNT is < 50. However, these judgments are context-specific (with respect to disease and outcome severity) and are influenced by the benefit-risk-cost profile of the intervention and the duration of follow-up.
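As a minimal sketch of this calculation, the following applies it to the TACTICS–TIMI 18 event counts quoted earlier. The function name is hypothetical, and the CI for the absolute risk difference uses a simple normal (Wald) approximation, which is an assumption of this example.

```python
import math


def number_needed_to_treat(events_trt, n_trt, events_ctl, n_ctl, z=1.96):
    """Return (NNT, lower limit, upper limit) from the absolute risk reduction.

    The NNT limits are the inverses of the ARR limits. They are returned as
    None when the ARR interval includes zero, because the inversion is then
    not a simple interval.
    """
    risk_trt = events_trt / n_trt
    risk_ctl = events_ctl / n_ctl
    arr = risk_ctl - risk_trt
    se = math.sqrt(risk_trt * (1 - risk_trt) / n_trt
                   + risk_ctl * (1 - risk_ctl) / n_ctl)
    arr_low, arr_high = arr - z * se, arr + z * se
    if arr_low <= 0:
        return 1 / arr, None, None
    # A larger ARR corresponds to a smaller NNT, so the limits swap.
    return 1 / arr, 1 / arr_high, 1 / arr_low


nnt, nnt_low, nnt_high = number_needed_to_treat(177, 1114, 215, 1106)
print(f"NNT = {nnt:.1f} (95% CI: {nnt_low:.1f} to {nnt_high:.1f})")
```

The point estimate falls below the NNT threshold of 50 mentioned above, but the upper limit of the interval is far larger, which shows why the point estimate alone can overstate the certainty of benefit.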

Alternatively, use of Bayesian analysis to estimate probabilities for a range of clinically important treatment effects is advocated.1 Briefly, such inferences are predicated on Bayes’ theorem, which postulates that the posterior probability of any given hypothesis is directly related to its prior probability based on previous knowledge and the empirical evidence generated from within the study. Unlike the frequentist approach, the Bayesian approach allows one to specify the probability for any given threshold of clinical importance.1,5,9 Although a CI is often interpreted in this manner (as in Sackett’s approach6), the correct interpretation of a 95% CI from a frequentist perspective is that 95% of all CIs derived from an unlimited number of repeated experiments would contain the true parameter. It does not actually ascribe any probability to the value of the parameter itself.5

Bayesian analysis computes the probability for any given threshold in terms of the area under the probability density curve,1,5,9 and can display this probability graphically across a range of thresholds. Bayesian analysis helps clarify whether the benefit is clinically important. Given these posterior probabilities and the important side effects of cost and bleeding, some clinicians might opt not to use an invasive strategy, despite its statistically significant benefit. In contrast, if invasive management were inexpensive and safe, clinicians might still decide to use it, even though there is a < 95% chance that it provides an important magnitude of benefit. Thus, Bayesian analysis provides a straightforward, patient-, physician-, and context-specific statement of clinical importance and complements the frequentist analysis in improving the interpretation of the data and informing clinical decision making.1,5,9
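A rough sketch of such a calculation for the TACTICS–TIMI 18 counts follows. It assumes a noninformative (flat) prior on the log relative risk and a normal approximation to the likelihood, so the posterior is simply a normal curve centered on the observed estimate. These are simplifying assumptions made here for illustration; a different prior would give different probabilities.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_rrr_exceeds(threshold, events_trt, n_trt, events_ctl, n_ctl):
    """Posterior probability that the relative risk reduction exceeds `threshold`.

    Assumptions: flat prior and normal approximation on the log relative risk.
    """
    log_rr = math.log((events_trt / n_trt) / (events_ctl / n_ctl))
    se = math.sqrt(1 / events_trt - 1 / n_trt + 1 / events_ctl - 1 / n_ctl)
    # RRR > threshold is equivalent to log(RR) < log(1 - threshold).
    return normal_cdf((math.log(1 - threshold) - log_rr) / se)


# Area under the posterior curve beyond each threshold of clinical importance.
for threshold in (0.0, 0.10, 0.15, 0.20, 0.25):
    p = prob_rrr_exceeds(threshold, 177, 1114, 215, 1106)
    print(f"P(RRR > {threshold:.0%}) = {p:.2f}")
```

Under these assumptions, the probability of any benefit is high, whereas the probability of exceeding the 25% risk reduction used to power the trial is much lower, which is the distinction the text draws between statistical significance and clinical importance.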

The statistical significance and clinical importance of treatment interventions for ACS can be compared using the results of RCTs and meta-analyses.10-14 A relative reduction of 15% in recurrent adverse events was considered clinically important.15 Statistically significant differences were observed for 3 of 5 interventions, with aspirin being the only treatment intervention providing both statistically significant and clinically important benefits. Two examples highlight a disconnect between statistical significance and clinical importance. Treatment with unfractionated heparin was not statistically significant despite a large risk reduction (owing to a relatively small sample size),11 yet the probability of a > 15% risk reduction was 87% (i.e., statistically not significant but possibly clinically important). In contrast, treatment with a platelet glycoprotein IIb/IIIa inhibitor was statistically significant despite a modest risk reduction (owing to a large sample size),13 yet the probability of a > 15% risk reduction was only 4% (i.e., statistically significant but not clinically important).

In summary, while statistical significance tells us whether a difference is likely to be real, it does not place that reality into a meaningful clinical context by telling us whether the difference is small or large, trivial or important. A formal evaluation of clinical importance (using frequentist CIs, the NNT and the number needed to harm indexes, or Bayesian probabilities), given the overall risk-benefit-cost profile of each therapeutic intervention, should be included in the analysis, interpretation, and presentation of the results of clinical trials.

It might seem convenient to define the threshold of clinical importance as the minimum detectable difference employed in the design of the trial. Unfortunately, however, the minimum detectable difference is often selected by trialists on purely pragmatic grounds such as financial constraints, restrictions in available candidates, and limitations in follow-up duration (allowing the design of a trial with frugal sample size requirements) and does not necessarily reflect the MCID from the perspective of the practitioner or the patient. The optimal thresholds of clinical importance might well vary from disease to disease, treatment to treatment, outcome to outcome, physician to physician, and patient to patient. Thus, lower thresholds of importance might attach to the reduction in mortality or serious irreversible morbid events such as Q-wave myocardial infarction (MI) or stroke versus higher thresholds for reduction in reversible and less serious morbid events such as recurrent hospitalization, asymptomatic periprocedural troponin elevation, or refractory ischemia. Ideally, assessment of both statistical significance and clinical importance should aid in optimal utilization of therapeutic interventions in clinical practice.

Employ Appropriate Composite End Points

Composite end points combine measurable events that lie on a pathophysiologic spectrum, reflecting aspects of the same underlying biologic process, to quantify the overall treatment effect. Composite end points are frequently used in clinical trials,16-22 with 1 recent survey reporting that 37% of the 1,231 trials published over 7 years used composite outcomes.21 Such use reduces the sample size and cost requirements of clinical trials and is thereby thought to improve trial efficiency and to facilitate the formal evaluation and ultimate availability of effective new treatments.

The typical cardiovascular trial combines “hard” but infrequent end points such as death, Q-wave MI, disabling stroke, and emergency coronary artery bypass graft surgery (CABG) with “soft” but more frequent end points such as reintervention, periprocedural MI (e.g., biomarker elevation), recurrent angina, and rehospitalization. Because of their greater frequency, these less important disparate outcomes often drive the effect of therapy on the composite. A systematic review of 114 cardiovascular trials that used composite end points reported a moderate to large gradient in the hierarchy of clinical importance of component events in nearly 40% of the trials.19 Of the 27 trials that reported a statistically significant difference in the composite outcomes, only 7 were driven by hard outcomes.

Major adverse cardiac events (MACE) are arguably the most commonly used composite end point in cardiovascular research.22 There is no consensus definition of MACE, yet its use has become virtually pervasive in cardiovascular research in the last 2 decades. At the broadest level, definitions of MACE in use today include end points that reflect both the safety (death, MI, stroke) and effectiveness (target vessel revascularization [TVR], restenosis, recurrent ischemia, rehospitalization) of various treatment approaches. Three recent literature reviews revealed that although death and MI were included in most definitions of MACE,19,21,22 inclusion of the remaining components was highly variable, thereby contributing to significant heterogeneity across trials. Even the definition of MI was not consistent. Very few trials use the more reliable Q-wave criterion rather than the less reliable non–Q-wave and/or cardiac biomarker criterion to define MI.22 Such use opens investigators and sponsors to the charge of “gaming” their trials by inflating the number of (arguably unimportant) outcome events, thereby increasing statistical power.22 Because varying definitions of composites such as MACE lead to substantially different conclusions, some have called for a reappraisal of their use.22

The construction of the composite end point is generally based on the premise that each component end point is interchangeable. However, for this assumption to be valid, 3 criteria need to be fulfilled:18-21

Each component should be of comparable clinical importance
Each component should occur with similar frequency
Each component should be similarly sensitive to treatment intervention

All 3 criteria are seldom fulfilled together.

The unconventional use of a composite efficacy and safety outcome poses even greater challenges in the assessment of noninferiority. Typically, the noninferiority claim is confined to efficacy alone. Although combining efficacy and safety into 1 composite outcome inflates the event rate and thereby enhances trial feasibility, it can often be misleading because drugs that are relatively ineffective but safer can be made to appear as good as or even better than effective drugs.23 This is illustrated in the REPLACE-2 (Randomized Evaluation in Percutaneous Coronary Intervention Linking Angiomax to Reduced Clinical Events) and the ACUITY (Acute Catheterization and Urgent Intervention Triage Strategy) trials, in which the difference in major bleeding events (statistically significant 43% and 47% relative reductions, respectively, in favor of bivalirudin) exceeded the difference in MI (statistically nonsignificant 13% and 9% relative increases, respectively), thereby biasing the assessment of noninferiority in favor of bivalirudin compared with its active comparator.23

Heterogeneity in treatment effects across the component events has important regulatory implications. For example, the composite end point in the evaluation of losartan in the LIFE (Losartan Intervention for End Point Reduction in Hypertension) study was driven by the impact on nonfatal stroke only. Thus, the subsequent regulatory labeling of losartan was restricted to prevention of nonfatal stroke and not the original claim for the triple end point.

Inappropriate use of composite end points sometimes leads to an unfounded illusion of benefit. The use of the MACE composite end point in the bare-metal stent versus sirolimus-eluting stent trial SIRIUS (Sirolimus-Eluting Balloon Expandable Stent in the Treatment of Patients With De Novo Native Coronary Artery Lesions)24 (Table 2) could erroneously lead one to conclude that the sirolimus-eluting stent is significantly better at reducing death, MI, stent thrombosis, and target lesion revascularization in totality, even though the statistically significant effect on MACE is driven primarily by a reduction in target lesion revascularization alone. The potential for misleading conclusions depending on the study-specific definition of MACE is not trivial.22

The common statistical approach of using equal weights to combine disparate constituent components of a composite end point is decidedly counterintuitive. A potential solution to this problem is to assign meaningful weights to the components. Such approaches, however, can be highly subjective.19,23,25,26 Thus, these definitions need to be prospectively agreed upon based on clinical and statistical considerations that impact the power of the test.

However, there are sample size implications, and the strategy should be prospectively defined.

In short, composite end points can be of value if both their requirements and their limitations are respected, and if the results are reported in a straightforward manner, providing all information necessary for proper interpretation of the trial. Suggested recommendations for their appropriate use in clinical trials are summarized in Table 2.

Subgroup analysis means any evaluation of treatment effects (benefit or harm) for a specific end point in subdivisions of the study population defined by various nonrandom baseline characteristics. As the number of RCTs has dramatically increased over the last 3 decades, the exploration of treatment effects in patient subgroups has increased in parallel. While such analyses may provide useful information for the care of patients and for future research, they also introduce analytic and interpretive challenges that can lead to overstated results, misleading conclusions, and suboptimal care.

Several reviews have highlighted problems in the reporting of subgroup analyses.27-31 Assmann et al.28 reported shortcomings of subgroup analyses in 50 trials published in 1997 in 4 leading medical journals. More recently, Parker et al.,29 who reviewed 67 cardiovascular trials published between 1980 and 1997, and Hernández et al.,30 who reviewed 63 cardiovascular trials published in 2002 and 2004, noted the same problems. Chief among them are a lack of pre-specification and the testing of a large number of subgroups without statistically appropriate adjustment for interactions and multiple comparisons. Because a fairly large number of subgroup analyses are often undertaken, false positive errors are quite likely. The collective probability of a false positive error (A) can be computed from the equation A = 1 - (1 - a)^x, where x is the number of independent subgroup analyses and a is the false positive error for each individual subgroup analysis (usually 0.05).32 For example, if 20 subgroup analyses are conducted, the collective probability of at least 1 false positive error is 0.64. Conversely, because these analyses are often underpowered owing to small sample size, false negative errors are also common. Finally, these analyses are usually nonrandomized, resulting in imbalances in prognostic factors in subgroups. For these reasons, subgroup analyses should be considered exploratory for informing future research and not conclusive to guide clinical practice.
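The equation above is easy to evaluate directly. The sketch below uses a hypothetical function name and, like the equation itself, assumes that the subgroup analyses are independent.

```python
def collective_false_positive_probability(n_analyses: int, alpha: float = 0.05) -> float:
    """A = 1 - (1 - a)^x: chance of at least 1 false positive in x independent tests."""
    return 1.0 - (1.0 - alpha) ** n_analyses


# How quickly the collective error grows with the number of subgroup analyses.
for x in (1, 5, 10, 20):
    print(f"{x:>2} analyses: A = {collective_false_positive_probability(x):.2f}")
```

For 20 analyses at a = 0.05, this reproduces the 0.64 quoted in the text.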

A trial is typically designed to detect an effect in the whole population, and the most reliable estimate of the effect in a subgroup is still the overall result, not the estimate from that particular subgroup. The principal value of subgroup analysis is to assess the robustness of the primary conclusions by demonstrating consistency within the subgroups, not to demonstrate inconsistencies in one or another arbitrary subgroup. Subgroup findings should be regarded with suspicion unless they are independently confirmed. Failure to recognize the capriciousness of random variation often leads to premature acceptance of the results, risking the adoption of inferior or unnecessarily costly treatments.16

Despite repeated discussion of the potential problems associated with subgroup analysis and published guidelines to improve the quality of subgroup analyses, a recent analysis of 97 trials published in the New England Journal of Medicine in 2005 and 2006 showed that problems and ambiguities persist.33 In approximately two-thirds of the published trials, it was unclear whether any of the reported subgroup analyses were pre-specified or post hoc. In more than one-half of the trials, it was unclear whether interaction tests were used, and in approximately one-third of the trials, within-level results were not presented in a consistent way. Recommendations on when and how subgroup analyses should be conducted and reported are shown in Table 3. The goal is to avoid unwarranted data dredging and increase the clarity and completeness of the information reported, thereby improving the interpretability of the findings.

Conclusion

The randomized controlled clinical trial has become the gold standard scientific method for the evaluation of diagnostic and treatment interventions. However, there are a number of limitations that challenge the interpretation of the results of these trials. Careful attention to these caveats is not only key for critical evaluation of the published literature, but it also has implications for the care and treatment of patients, and for the development and implementation of practice guidelines and reimbursement policy.