Why do you do what you do when you do it?

Initially, we do it because we’re taught to. It comes down to us from on high, from instructors or textbooks — when you see x, do y — and we’re simply expected to learn it, memorize it, and recite it back. Then, once in the field, to follow it mechanically.

But before long, if we’re to become more than just medical Roombas, we really have to start asking Why. It’s not because we’re difficult children, or to satiate our curiosity. It’s because even at the best of times, the rules can’t address every situation. And in order to make intelligent, appropriate decisions when the circumstances aren’t clear and simple, we need to understand the underlying principles behind the rules we learn. We need to understand both the potential value and potential harm of the interventions we provide. We need to understand the meaning and importance of specific assessment findings. We need to be students of reality, and the human body, rather than of arbitrary rules.

In order to do all this, we need to be able to read research. Medical research is where these answers come from; it’s where we learn what works, and how well, and what importance to attach to the things we see. To read research, though, we need to understand the basic statistical methods that researchers use.

Statistics is a big, big topic, and I don’t have a strong background in it, so if you really want to dive into this, take a class. The analytical and regression methods used to crunch the data in a study are something we won’t touch here. But we do need to understand a few basic terms, because they’re central to how the results of a study are presented; in other words, if you’re looking for answers, this is the language in which they’re written. So although the idea of a post about statistics may sound as appealing as a brochure on anal ointment, bear with me; this won’t be too painful, and it’s information you can use over and over and take to your grave. Right now, let’s talk about the numbers used to describe the accuracy of diagnostic signs.

### Sensitivity and Specificity

Take a certain test. It could be anything. A clinical finding. A laboratory test. Even a suggestive element from a patient history. Call it Test X.

Let’s say that this test is linked to a certain patient condition, Condition Y. Something bad. Something we want to find. In fact, Condition Y is the whole reason we’re looking at Test X.

What would make Test X a good test for Condition Y? Well, when the test says “You have Condition Y!”, then you should really have it. And if it says “You don’t have Condition Y!”, then indeed, you shouldn’t have it. It doesn’t have to be perfect. But it should be pretty good — otherwise, what’s the point in using the test? If it doesn’t tell us something we didn’t know before, we might as well ignore it.

When the test says “You have Condition Y,” and you really *do* have it, we’ll call that a **true positive**. When the test says, “You don’t have Condition Y,” and indeed you don’t have it, we’ll call that a **true negative**. Those are the findings we want; we want the test to tell us the truth, so we can base our treatments and decisions on reality.

On the other hand, when the test says, “You have Condition Y,” but you DON’T have it — in other words, an error, the test got it wrong — we call that a **false positive**. We thought you were positive, but whoops, you’re actually fine. And when the test says, “You don’t have Condition Y,” but it turns out that you *do*, we’ll call that a **false negative**, or a miss. The test cleared you, but it missed the badness; you actually do have the condition. These are the screw-ups.

How many true positives and true negatives does our test yield, versus how many false positives and false negatives? This determines how good our test is, how faithful to reality. The perfect test would have 100% true results, either positive or negative depending on the patient’s condition: if you have Condition Y, the test is positive, and if you don’t have Condition Y, the test is negative. There would be zero false positives or false negatives.

The worst possible test would have about 50% true and 50% false results. There would be no correlation between the test results and having the condition. In fact, it would be pointless to call this a test for Condition Y; we might as well flip a coin and call that Test X, because it would be just as useful.

Okay, so how do we determine the accuracy of a test? We take a bunch of patients, some of whom have Condition Y, and some of whom don’t, and we run them through Test X like sand through a sieve. Then we see which patients the test flagged, and see how accurate it was. (Obviously, we’ll need a way of knowing for sure who has Condition Y; this is usually done by a separate, “gold standard” test with known reliability. Correlation between Test X and the gold standard is what we’re examining here. Why not just use gold standard tests on all patients? Generally these are difficult, invasive, time-consuming, and expensive procedures: not appropriate for everyone, and certainly not of much use in the field.)

We’ll come up with a couple of figures. One is the test’s **sensitivity**. This describes how well our test picked up Condition Y; how alert was it, how often did it pick up what we’re looking for? If you have Condition Y, how likely is the test to say you have it? How many sick patients slipped past? If our test has 100% sensitivity, it will have zero false negatives; it will never miss, will never fail to flag a patient with Condition Y. A test with 0% sensitivity is blind; it will never notice Condition Y at all.

The other statistic is the test’s **specificity**. This describes how selective our test is, how cautiously it sounds its alarm. If you don’t have Condition Y, how likely is the test to say you don’t have it? Will it ever be fooled, and wrongly think that you do? A test with 100% specificity will never produce a false positive; if it shouts positive, it’s never wrong. On the other hand, a test with 0% specificity sounds the alarm on every healthy patient; it’s the boy who cried wolf.
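As a rough sketch of how these two figures fall out of the four kinds of results described above, here is the arithmetic in Python. The function names and the patient counts are made up for illustration; they don't come from any real study.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of patients WITH the condition that the test flagged."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of patients WITHOUT the condition that the test cleared."""
    return true_neg / (true_neg + false_pos)


# Hypothetical results: 100 patients with Condition Y, 900 without.
true_pos, false_neg = 90, 10    # sick patients: caught vs. missed
true_neg, false_pos = 810, 90   # healthy patients: cleared vs. wrongly flagged

print(f"sensitivity: {sensitivity(true_pos, false_neg):.0%}")
print(f"specificity: {specificity(true_neg, false_pos):.0%}")
```

With these invented counts, both figures come out to 90%. Notice that sensitivity only ever looks at the sick patients, and specificity only at the healthy ones.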

Together, sensitivity and specificity describe a test’s accuracy. Intuitively, you can see how the two parameters might often work against each other; we can make a test that is extremely “paranoid,” and will catch almost everything — high sensitivity — but will also flag a great many false positives — low specificity. (Heck, we could just make a flashing red light that said “POSITIVE!” every single time, and we’d never miss anyone — of course, it’d have so many false positives that it’d be useless.) Conversely, we can make a test which is extremely judicious and selective, and when it says “positive,” we can trust that it’s probably right — high specificity — but it’ll miss a lot of true positives — low sensitivity.
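That tug-of-war can be sketched with a toy example. Assume Test X is some numeric reading, and we call it positive at or above a cutoff. The readings, the cutoffs, and the `sens_spec` helper are all invented for illustration.

```python
# Each entry: (hypothetical test reading, whether the gold standard says
# the patient really has Condition Y). All values are made up.
patients = [
    (1, False), (2, False), (3, False), (5, False),
    (4, True), (6, True), (7, True), (8, True),
]


def sens_spec(patients, cutoff):
    """Call the test positive when reading >= cutoff.

    Returns (sensitivity, specificity) measured against the gold standard.
    """
    tp = fn = tn = fp = 0
    for reading, has_condition in patients:
        positive = reading >= cutoff
        if has_condition and positive:
            tp += 1   # true positive
        elif has_condition:
            fn += 1   # false negative (a miss)
        elif positive:
            fp += 1   # false positive
        else:
            tn += 1   # true negative
    return tp / (tp + fn), tn / (tn + fp)


for cutoff in (0, 4, 6, 9):
    sens, spec = sens_spec(patients, cutoff)
    print(f"cutoff {cutoff}: sensitivity {sens:.0%}, specificity {spec:.0%}")
```

A cutoff of 0 is the flashing red light: it catches everyone and clears no one. A cutoff of 9 is the opposite extreme. The useful settings live somewhere in between, and every step toward one end costs you something at the other.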

Ideally, we’d like a test with high sensitivity *and* high specificity. But when that’s not possible, then at least we need to understand how to interpret the results.

For instance, a test with high sensitivity is very good for *ruling a condition out*. Because it almost always catches Condition Y, if the test says “nope, I just don’t see it here,” then that’s very trustworthy; if the patient *did* have it, the test probably would’ve caught it. Think **SnOut**: a test with good **S**ensitivity that comes back **n**egative rules a condition **Out**.

Example: pinpoint pupils. For the patient with altered mental status, this is a very sensitive indicator of opiate use; almost everyone with a large amount of opiates in their system will present with small pupils. However, it’s not very specific, because many people will have small pupils without using narcotics (for instance, due to bright lighting). So if you *don’t* see pinpoint pupils, that finding rules *out* opiate overdose with fairly good reliability.

On the other hand, a test with high specificity is very good for ruling a condition *in*. Because it’s almost never wrong, if it says you do have Condition Y, you can take that to the bank. Think **SpIn**: a test with good **S**pecificity that comes back **p**ositive rules a condition **In**. (Thanks to Medscape for these mnemonics.)

Example: a pulsating abdominal mass is an extremely specific finding in abdominal aortic aneurysm. Very few other conditions can cause such a pulsating mass, so if you find one, you can pretty reliably say that the patient has a AAA. However, many AAA patients will not have such a mass, so this is not very sensitive. But if you *do* find a pulsating mass, this rules AAA *in* fairly well.

### Warning: Scary Statistics Ahead

Okay, that wasn’t so bad, was it?

Here’s where things get a little weirder. If you’re barely hanging on to the thread so far, you have permission to stop reading now.

Sensitivity and specificity are the most commonly used parameters describing the accuracy of a test. They’re properties of the test itself, so you can hang those numbers on it and they won’t change on you.

However, anyone who’s studied Bayesian statistics will understand that the true accuracy of our test is not only a function of the test, but also depends on the *prevalence* of Condition Y in the population. If Condition Y is exceptionally rare in the patient group we’re looking at, then *even if Test X is very specific, a large share of its positive results will be false positives.* Conversely, if Condition Y is exceptionally common, then *even if Test X is very sensitive, a large share of its negative results will be false negatives.*

The reasons for all of this are complex. But the general gist is this: if Condition Y is very unlikely to be present (either because it’s generally uncommon, such as scurvy; or because it’s an improbable diagnosis for the individual patient, such as an acute MI in an 8-year-old), then even if your test “rules it in,” it will still be unlikely. The positive test made it *more* likely, but it was so improbable to begin with, the odds didn’t change very much. And if Condition Y is very probable (such as a healthy heart in an asymptomatic teenager), then even if your test “rules it out,” the odds still support its presence.
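The rare-condition case is easier to believe with head counts in front of you. Here is a minimal sketch, assuming a made-up test that is 95% sensitive and 95% specific, run on 10,000 hypothetical patients; the `screen` function and every number in it are illustrative only.

```python
def screen(population: int, prevalence: float, sens: float, spec: float):
    """Return the expected (true positives, false positives) in a population."""
    sick = round(population * prevalence)
    healthy = population - sick
    true_pos = round(sick * sens)             # sick patients the test catches
    false_pos = round(healthy * (1 - spec))   # healthy patients wrongly flagged
    return true_pos, false_pos


# Same hypothetical test, two populations: one where 1% have Condition Y,
# and one where 50% do.
for prevalence in (0.01, 0.50):
    tp, fp = screen(10_000, prevalence, sens=0.95, spec=0.95)
    print(f"prevalence {prevalence:.0%}: {tp} true positives, {fp} false positives")
```

At 1% prevalence, the positives are mostly wrong (495 false against 95 true), simply because there are so many healthy patients available to be wrongly flagged. At 50% prevalence, the very same test is right far more often than not (4750 true against 250 false).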

What this all means is that in order to answer our real questions, we need another measure. The **positive predictive value** (PPV) and **negative predictive value** (NPV) are the answer, and really, these figures are what we’re after. The PPV answers: given a positive test result, how likely is the patient to have the condition? The NPV answers: given a negative test result, how likely is the patient to lack the condition? In other words, in a real patient, how likely is the test result to be correct?

The trouble is that PPV and NPV aren’t just characteristics of the test; as we saw above, they also depend on the prevalence of the condition, or the “pre-test probability.” What this means is that although the study you’re reading may report predictive values, they are not necessarily applicable to your patient. They’re only applicable to the patient population that was studied. Now, if your patient is similar to that population — in other words, has about the same pre-test probability of the condition as they did — then the predictive values should be correct. If not… not so much.
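To see how much the predictive values move with the population, here is a small sketch that applies Bayes' rule to a test's sensitivity and specificity. The function name and the 90%/90% test are assumptions for illustration, not figures from any study.

```python
def predictive_values(sens: float, spec: float, pretest: float):
    """Return (PPV, NPV) for a test applied at a given pre-test probability."""
    true_pos = sens * pretest
    false_neg = (1 - sens) * pretest
    true_neg = spec * (1 - pretest)
    false_pos = (1 - spec) * (1 - pretest)
    ppv = true_pos / (true_pos + false_pos)   # of all positives, how many are real
    npv = true_neg / (true_neg + false_neg)   # of all negatives, how many are real
    return ppv, npv


# One hypothetical test (90% sensitive, 90% specific), three populations.
for pretest in (0.01, 0.10, 0.50):
    ppv, npv = predictive_values(0.90, 0.90, pretest)
    print(f"pre-test {pretest:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

The PPV swings from roughly 8% to 90% without the test itself changing at all, which is why a predictive value lifted from a study only fits patients who resemble the study population.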

So do we have any more tricks? We have one more: **likelihood ratios**. Likelihood ratios factor out pre-test probability, producing a simple ratio that describes *how much the test changed the probability.* For instance, suppose we have a patient who we judge has a 10% probability of having Condition Y. We apply a test with a positive likelihood ratio of 5, and it comes up positive. What’s that mean? The math is a little bit roundabout, because we need to convert probability (a percentage of positive outcome out of *all* possible outcomes) into odds (a fraction of positive outcome over *negative* outcome): 10% is the same as 1:9 odds. 1/9 times 5 is 5/9, and if we convert that back to a percentage (positive outcome over total outcomes, or 5/14), we have the result: about 36%. The patient now has a 36% chance of having Condition Y. Conversely, suppose it came up negative, and the test had a negative likelihood ratio of .1. The post-test probability (by the same calculation) is now only around 1%.
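If the odds juggling is hard to follow in prose, the same steps can be written out as a few lines of Python. The function name is my own invention; the numbers are the ones from the example above.

```python
def post_test_probability(pretest: float, likelihood_ratio: float) -> float:
    """Apply a likelihood ratio to a pre-test probability."""
    pre_odds = pretest / (1 - pretest)        # probability -> odds (10% -> 1:9)
    post_odds = pre_odds * likelihood_ratio   # the ratio multiplies the odds
    return post_odds / (1 + post_odds)        # odds -> probability (5:9 -> 5/14)


# Pre-test probability of 10%, positive test with a likelihood ratio of 5.
print(f"positive result: {post_test_probability(0.10, 5):.0%}")

# Same patient, negative test with a likelihood ratio of 0.1.
print(f"negative result: {post_test_probability(0.10, 0.1):.0%}")
```

This gives about 36% after the positive result and about 1% after the negative one, matching the hand calculation.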

It’s a simple device that would be far more intuitive without the odds vs. probability conversion, but suffice to say that a likelihood ratio of 1 (1:1) changes nothing, higher than 1 is a positive test (1–3 slightly so, around 5–10 is a useful test, and over 10 is highly suggestive), and less than 1 is negative (1–0.5 just barely, around 0.5–0.1 decently, and under 0.1 is strongly negative). Try plugging numbers into this calculator to experiment, or drag around the sliders in the Diagnostics section at The NNT. The only bad news is that you still need to know the pre-test probability, but the good news is that you can come up with your own estimate, rather than having an inappropriate one already included in the predictive values.
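This post doesn't go into where likelihood ratios come from, but by the standard definitions they are built from sensitivity and specificity, so you can work them out yourself when a study reports only those two. A minimal sketch, using an invented 90%/90% test:

```python
def likelihood_ratios(sens: float, spec: float):
    """Return (LR+, LR-) using the standard definitions.

    Note: LR+ is undefined when specificity is exactly 100%
    (division by zero), so this sketch assumes spec < 1.
    """
    lr_positive = sens / (1 - spec)    # positive result: sick vs. healthy
    lr_negative = (1 - sens) / spec    # negative result: sick vs. healthy
    return lr_positive, lr_negative


# Hypothetical test: 90% sensitive, 90% specific.
lr_pos, lr_neg = likelihood_ratios(sens=0.90, spec=0.90)
print(f"LR+ {lr_pos:.1f}, LR- {lr_neg:.2f}")
```

By the rough scale above, this made-up test would land in the "useful" range when positive (an LR+ of about 9) and the "decent" range when negative (an LR- of about 0.11).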

How to come up with pre-test probabilities? Well, research-derived statistics do exist for various patient groups… but realistically, in the field, you will need to wing it. Taking into account the whole clinical picture, including history, physical exam, and complaints, how high-risk would you deem this patient? You don’t need to be exact, but you should be able to come up with a rough idea. Now, apply your test, and consider the results — about how likely is the condition now? If at any point, you have enough certainty (either positive or negative) to make a decision, then do it; there’s no point in tacking on endless tests if they won’t change your treatment.

Anybody still breathing? We’ll talk about odds ratios, NNT, and other intervention-related numbers another time.

[*Edit 5/15/13: the follow-up post on outcome metrics is posted at Lit Whisperers, our sister blog*]
