Does Drug X Really Work?

Evaluating Medical Evidence

The internet is filled with ads promoting various drugs, vitamins, and supplements. How do you ever really know that they work?  In this essay I will show you how scientists answer that question.

(Want a brief answer? Vitamins and supplements don't work — dump 'em! (with very few exceptions))

I just spent the past week writing a review of TRANSCEND, a new health book by Ray Kurzweil and Terry Grossman — the book advocates lots of supplements.  The key question it (implicitly) raises is "should one believe the claims in it or not?" How does one arrive at pharmacologic truth?

My PhD thesis project at Stanford (the RX Project) was devoted to precisely this topic. RX was an early experiment (1976 to 1986) in automated data mining.

It took as input a huge collection of observations that had been made on thousands of patients over a decade and combed that database for possible causal links.

A causal link or relationship means that A causes B.  A could be a treatment, for example a drug.  B might be a side-effect or a desired effect (like longer life).

In designing the RX Project I led a small team of statisticians and computer scientists that needed (as a sideline) to address precisely the question of this essay: "how do we ever know that A causes B?"

Never mistake correlation (i.e., association) for causation!

 (One of Tom Jech's delightful cartoons on fallacious thought.)

(Also note: when trying to get into Pandora's Box, be sure to cover your fallacy with a conundrum!)

How would one know that a given drug or food or habit or even exercise works? The obvious answer is to try it. If it makes you feel better, stronger, faster, calmer, more energetic then it works.  If it doesn't or harms you, then dump it. This evidence is direct, incontrovertible, and not lightly dismissed.  It is using your body as the original scientist.

Unfortunately, life is not that simple.  How about drugs that make you feel great now but are rapidly destructive: cocaine, amphetamines, or narcotics? How about drugs that make you feel good now but are destructive long-term: nicotine or alcohol? 

And, of course, many drugs fall into the category "neutral to negative." That is, feeling nothing or feeling bad now in exchange for possible long-term benefit. The negative feeling might be having to swallow cod-liver oil or having to pay hundreds of dollars a year for a drug, herb, vitamin, or supplement.  Also, the long-term benefit may be imperceptible: bones that fracture less readily, arteries that are open, or less risk of cancer.

What is true of drugs also applies to foods, habits, and even exercise. As I walk the aisles of Safeway I find less and less that is healthy, although every product has been carefully designed to taste good. We spend billions for foods that are convenient and taste great, but that ultimately contribute to the obesity and health care crisis in the United States and elsewhere.

Even the seemingly incontrovertible habit of EXERCISE cannot automatically be assumed to be beneficial. Some of my friends run or bike over a hundred miles a week. While that's testimony to their glowing health, it's not clear that it promotes their longevity. How about the wear and tear factor? How about free radicals wreaking havoc? How about thousands of calories pouring through their arteries?

Massive exercise (like running marathons) cannot be automatically and stupidly assumed to be beneficial!

So, how do we find out whether something is beneficial? The obvious answer is to do a study. Have a thousand people take vitamin C for ten years and see whether they are healthier or have lived longer than a control group. Or, look at folks who have run marathons for decades and compare them to a control group.  Easy, huh?

No. It's not easy.  The basic problem is that there is infinite variability that may explain why one person who is fat and smokes lives to be a hundred and another who is a lean vegetarian dies at age 30. Each of our ten trillion cells is different from everyone else's.
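To get a feel for how much mischief plain chance can cause, here is a minimal simulation sketch. Everything in it is made up for illustration: the group size, the 20% death rate, and the function name are my assumptions, not data from any real study. The "drug" in this toy world does nothing at all, yet some of the fake studies still make it look good.

```python
import random

def simulate_null_trials(n_per_group=100, death_rate=0.20, n_trials=2000, seed=1):
    """Run many fake studies in which the 'drug' does nothing at all.

    Both groups share the same true death rate, so any gap between them
    is pure chance. Returns the observed gaps (control minus treated).
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_trials):
        treated_deaths = sum(rng.random() < death_rate for _ in range(n_per_group))
        control_deaths = sum(rng.random() < death_rate for _ in range(n_per_group))
        gaps.append((control_deaths - treated_deaths) / n_per_group)
    return gaps

gaps = simulate_null_trials()
# Share of do-nothing studies in which the "drug" group still looked
# at least 5 percentage points better than the control group.
lucky = sum(g >= 0.05 for g in gaps) / len(gaps)
print(f"{lucky:.0%} of useless-drug studies showed a 5-point 'benefit'")
```

A noticeable share of these useless-drug studies show an apparent benefit. That is why a single small study, however glowing, proves very little.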

So, how do you arrive at medical truth?  In this little essay I can just scratch the surface. My aim is to show you the kind of evidence that health scientists require to evaluate a drug or other treatment. Absent that level of evidence you must be SKEPTICAL of every health claim you see or hear.


How do you grade or rate medical evidence? (How about: it works! Or, dump it!) If you must reduce the evidence to a single expression, my favorite scale assigns it a letter grade like this.

Letter Grade:

A — Strong Scientific Evidence that the drug works

B  — Good Scientific Evidence

C  —  Unclear or conflicting scientific evidence

D  —  Fair negative scientific evidence

F  —  Strong negative scientific evidence (that the drug does NOT work)

This kind of letter-grade scale has these merits: 1) It is widely used. 2) Its correspondence to school grades is easy to understand. 3) A five point scale is just enough (like Goldilocks). (It is sometimes confused with the Jadad scale or score, which is something narrower: a 0-to-5 rating of the quality of a single trial.) (My RX Project used a ten point validity scale, because, as you'll see, the C Grade is a huge grab bag.)

Note that negative scientific evidence (rating D or F) does NOT mean that the drug is harmful. It DOES mean that we have fair (D) or strong (F) evidence that the drug does NOT work. The question of harm is an entirely separate matter.

Also, we may simply be unable to evaluate the effectiveness of a drug because of a lack of human data.  Note that a lack of evidence means we cannot make any claim whatsoever (although the internet pitchmen do it anyway.) This scale addresses the AMOUNT of scientific evidence, the QUALITY of that evidence,  the EFFECT SIZE, and the CONSISTENCY of the evidence.

Note that the main focus is on FORMAL STUDIES on PEOPLE as opposed to anecdotal reports, folklore, animal studies, or even what experts think.

Where are we at so far? Here's where. 

Forget anecdotal reports (it worked for my friends, I saw it on tv or on the internet).  Even practice standards may be wrong (4 out of 5 doctors recommend Bayer). Folklore, a testimonial, or even test tube verification is just the first step in a thousand mile journey toward a scientific conclusion.

Note that even Grade A evidence may always be overturned, and a Grade C drug may later be upgraded (or more likely, downgraded) by more study. Scientific evidence is not religious dogma.   It's always capable of being falsified or modified by further evidence. One bunny rabbit skeleton in billion year old rock would be headline news, because it flies in the face of the theory of evolution.

Also the accumulated evidence may not be relevant to you. You might be older or younger than the study group or different in other ways.

The Randomized Controlled Trial (RCT)

The gold standard for proof that a drug works is the double blind randomized controlled trial (RCT).

The researchers assign patients to two groups: the study group, which gets the treatment, and the control group, which gets a placebo (a look-alike treatment). Double blind means that neither the patients nor the researchers know which patients are in which group. Randomized means the assignment is done using computer-generated random numbers.
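The assignment step itself is simple enough to sketch in a few lines. This is only an illustration under my own assumptions (the function name, the equal split, and the patient numbering are hypothetical); real trials use more elaborate schemes such as blocked or stratified randomization.

```python
import random

def randomize(patient_ids, seed=None):
    """Randomly split patients into equal-sized treatment and placebo arms.

    Returns a dict mapping patient id -> "treatment" or "placebo".
    In a double blind trial this table would be held by a third party,
    so that neither patients nor evaluators could see it.
    """
    rng = random.Random(seed)
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)          # computer-generated random order
    half = len(shuffled) // 2
    allocation = {}
    for position, pid in enumerate(shuffled):
        allocation[pid] = "treatment" if position < half else "placebo"
    return allocation

# Twenty hypothetical patients, numbered 1 through 20.
allocation = randomize(range(1, 21), seed=42)
for pid in sorted(allocation):
    print(pid, allocation[pid])
```

The point of the shuffle is that nobody, including the researchers, gets to decide who lands in which group.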

Nothing can replace an RCT.  Here's why.  Unless people are randomly assigned in a clinical study, it is always possible that some outside factor may account for the difference in outcome between the study group (given the drug) versus the control group (taking the placebo). In a sufficiently large randomized trial ALL extraneous factors - both known and unknown - are automatically equalized between the groups (at least in theory). (I'm even highly skeptical of most RCTs. Many times the MDs or other evaluators can guess which treatment the participant is getting — and that skews their evaluation. This is especially true where evaluations are subjective, as in neuropsychiatric trials. My bias is that the treatments and trials are all worthless (or worse)!)

It is easiest to illustrate this by considering a much weaker kind of study: a cohort study. Consider a study comparing marathon runners to couch potatoes to find out whether the marathoners live longer.

For every marathoner in the study include in the control group a person of the same age, sex, race, and health history. Now follow them for twenty years and count the number of deaths in each group or the number of heart attacks or whatever. Even if you were to show that the marathoners enjoyed longer life or fewer heart attacks, how could you ever refute the claim that the difference was due to genetics or hardiness or cleaner living or special diet or supplements or occupation or any of an infinitude of variables?

That infinitude of confounding variables can only be controlled by random allocation.  By randomly allocating people to the marathon group you control for genetics and hardiness and all the other spurious variables. Randomization is the only means for demonstrating that the benefit was conferred by the study variable alone. (Good luck trying to randomly assign people to marathoning versus sitting on a couch, and doing it double blind.  It can't be done.)
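Here is a toy simulation of that argument. It is a sketch built entirely on assumptions of mine: a single hidden trait called "hardiness", invented death probabilities, and a world in which running has no effect on survival whatsoever. The only difference between the two runs is who decides which people run.

```python
import random

def death_rate_gap(n_people, randomized, seed=0):
    """Return (couch potato death rate) minus (runner death rate).

    Assumption for this toy model: running has NO effect on survival.
    A hidden trait, "hardiness", lowers everyone's risk of death.
    """
    rng = random.Random(seed)
    deaths = {True: 0, False: 0}   # keyed by "is a runner?"
    counts = {True: 0, False: 0}
    for _ in range(n_people):
        hardiness = rng.random()               # hidden confounder, 0 to 1
        if randomized:
            runs = rng.random() < 0.5          # a coin flip decides
        else:
            runs = rng.random() < hardiness    # hardy people choose to run
        died = rng.random() < 0.5 - 0.4 * hardiness
        counts[runs] += 1
        deaths[runs] += died
    return deaths[False] / counts[False] - deaths[True] / counts[True]

cohort_gap = death_rate_gap(20000, randomized=False)
trial_gap = death_rate_gap(20000, randomized=True)
print(f"cohort study gap:     {cohort_gap:+.3f}")
print(f"randomized trial gap: {trial_gap:+.3f}")
```

In the cohort version the runners appear to die far less often, even though running does nothing in this model; the hardy people simply sorted themselves into the running group. With the coin flip, the gap shrinks toward zero, which is the right answer.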

Criteria for Assessing Strength of Scientific Evidence

Level of Evidence Grade


A (Strong Scientific Evidence)

Statistically significant evidence of benefit from >2 properly randomized trials (RCTs), OR evidence from one properly conducted RCT AND one properly conducted meta-analysis, OR evidence from multiple RCTs with a clear majority of the properly conducted trials showing statistically significant evidence of benefit AND with supporting evidence in basic science, animal studies, or theory.

B (Good Scientific Evidence)

Statistically significant evidence of benefit from 1-2 properly randomized trials, OR evidence of benefit from >1 properly conducted meta-analysis OR evidence of benefit from >1 cohort/case-control/non-randomized trials AND with supporting evidence in basic science, animal studies, or theory. This grade applies to situations in which a well designed randomized controlled trial reports negative results but stands in contrast to the positive efficacy results of multiple other less well designed trials or a well designed meta-analysis, while awaiting confirmatory evidence from an additional well designed randomized controlled trial.

C (Unclear or conflicting scientific evidence)

Evidence of benefit from >1 small RCT(s) without adequate size, power, statistical significance, or quality of design by objective criteria,* OR conflicting evidence from multiple RCTs without a clear majority of the properly conducted trials showing evidence of benefit or ineffectiveness, OR evidence of benefit from >1 cohort/case-control/non-randomized trials AND without supporting evidence in basic science, animal studies, or theory, OR evidence of efficacy only from basic science, animal studies, or theory.

D (Fair Negative Scientific Evidence)

Statistically significant negative evidence (i.e., lack of evidence of benefit) from cohort/case-control/non-randomized trials, AND evidence in basic science, animal studies, or theory suggesting a lack of benefit. This grade also applies to situations in which >1 well designed randomized controlled trial reports negative results, notwithstanding the existence of positive efficacy results reported from other less well designed trials or a meta-analysis. (Note: if there are >1 negative randomized controlled trials that are well designed and highly compelling, this will result in a grade of "F" notwithstanding positive results from other less well designed studies.)

F (Strong Negative Scientific Evidence)

Statistically significant negative evidence (i.e. lack of evidence of benefit) from >1 properly randomized adequately powered trial(s) of high-quality design by objective criteria.*

Lack of Evidence

Unable to evaluate efficacy due to lack of adequate available human data.

If you look carefully at the above table, you see that the quantity and quality of the evidence is roughly U-shaped.  It takes a lot of high quality evidence to demonstrate a benefit from a drug as well as to shoot it down. A weak study or lack of evidence is not a reason to believe that a drug or other intervention does not work.

The low point in the U-shape is the Grade C evidence: unclear or conflicting evidence. Look at the final two clauses in that giant OR statement that defines Grade C evidence: evidence only from one or more non-randomized studies WITHOUT supporting evidence from basic science (test tubes), animal studies, or theory … OR evidence ONLY from basic science (test tubes), animal studies, or theory (with NO studies in human beings).

Unfortunately this category Grade C evidence is a huge scrap heap that includes every worthless drug, vitamin, supplement, or other intervention that you read about on the internet.

In Category C are the snake oils that make up the kit bag of every medical charlatan  in history. (Scientific lingo and a white coat are not what distinguish scientists from snake oil merchants. The key distinguishing feature is their use of evidence.)

A detailed evidence table for a treatment lists, for each study:

the condition or disease for which the treatment was given

the study design (i.e., was it an RCT or Case Series or Cohort Study etc.)

the literature citation

the number of patients in the study

whether the result was statistically significant (i.e., could it have simply been due to chance)

the overall QUALITY of the study

the Magnitude of Benefit (i.e., was the effect size large, medium, or small)

the Absolute Risk Reduction (the difference in event rates between the groups: if cancer struck 10% of the controls but only 5% of the treated group, this would be 5 percentage points, even though the risk was cut in half)

the Number Needed to Treat (how many patients must be treated for one of them to benefit)

Comments (usually dosage and duration of the trial)
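The last few entries in that list are plain arithmetic, so here is a small sketch with made-up numbers (the function name and the counts are mine, not from any real trial). Note the distinction it draws: cutting a cancer rate in half is a 50% relative risk reduction, while the absolute risk reduction depends on how common the cancer was to begin with.

```python
import math
from fractions import Fraction

def effect_measures(events_treated, n_treated, events_control, n_control):
    """Compute standard effect-size measures from a two-group trial.

    ARR = control risk minus treated risk (absolute risk reduction)
    RRR = ARR divided by control risk (relative risk reduction)
    NNT = 1 / ARR, rounded up (number needed to treat)
    Fractions keep the arithmetic exact, so the rounding is reliable.
    """
    risk_treated = Fraction(events_treated, n_treated)
    risk_control = Fraction(events_control, n_control)
    arr = risk_control - risk_treated
    rrr = arr / risk_control if risk_control > 0 else None
    nnt = math.ceil(1 / arr) if arr > 0 else None   # undefined if no benefit
    return {
        "arr": float(arr),
        "rrr": float(rrr) if rrr is not None else None,
        "nnt": nnt,
    }

# Hypothetical trial: cancer in 10 of 100 controls versus 5 of 100 treated.
result = effect_measures(events_treated=5, n_treated=100,
                         events_control=10, n_control=100)
print(result)
```

In this made-up trial the risk is halved (RRR of 0.5), but the absolute risk reduction is only 0.05, so twenty people must take the drug for one of them to be spared. The same "50% reduction" in a rarer disease would mean treating far more people per person helped.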

So, what should you as a patient do with this level of evidence?  I think the answer is clearly wait. Hold off.  The evidence simply does not justify taking this treatment for the studied indications. For each of the listed diseases there are treatments that work better. Also, we have not even begun to address the side-effects or cost of this treatment.

For any given treatment what we're really hoping to see is a letter grade A or B and repeated high quality evidence of large beneficial effects.  Examples of these would be insulin for diabetes or penicillin for pneumonia - large irrefutable effects in study after study.

Unfortunately, for each Nobel prize winning therapy like insulin or penicillin there are thousands of drugs that fall into the Grade C unclear or conflicting scrap heap.  Most may gradually fall into disuse or disrepute: a very few may graduate to a Grade B or even A as further evidence accumulates.

For readers looking for more on the topic of evidence-based medicine, I recommend the Wikipedia article on it. There I see that Adrian Smith, President of the Royal Statistical Society, recommends evidence-based medicine as an exemplar for all public policy.

Before closing,  I'd like to tip my hat to the National Center for Complementary and Alternative Medicine (NCCAM). NCCAM is a branch of the National Institutes of Health (NIH) whose most important function is to collect and disseminate information on complementary and alternative medicines.  Occasionally they will also  conduct large scale clinical trials of drugs that have caught the attention of the public.  For example, they conducted a large trial of glucosamine and chondroitin (previously used to alleviate joint pain and preserve knee cartilage.) Their study showed that the combo drug did not work. (Following that, I threw mine in the garbage.)

For anyone looking for unbiased, scientific information on specific alternative medicine therapies, I recommend NCCAM's website. Here is their Health Topics A-Z index.