Category Archives: P value

What do you mean statistically significant?

Today’s post is brought to you by the letter P and the number .05. That’s right, we’re going to dig into what it really means to be “statistically significant.”

Setting the stage…imagine you wanted to educate primary care physicians about a guideline update for the management of hypertension. To do so, envision a CME roadshow of one-hour dinner meetings hitting all the major metros in the US. And to determine whether you had any impact on awareness…how about a case-based, pre- vs. post-activity survey of participants?

Once all the data is collected, you tabulate a score (% correct) for each case-based question (ideally matched for each learner) from pre to post. Now…moment of truth, you pick the appropriate statistical test of significance, say a short prayer, and hit “calculate”. The great hope being that your P values will be less than .05. Because…glory be! That’s statistical significance!

So let’s take this scenario to its hopeful conclusion. What does it really mean when we say “statistical significance”?

Maybe not quite what you thought.

You see…statistical tests of significance (eg, chi square, t test, Wilcoxon signed-rank, McNemar’s) are hypothesis tests. And the hypothesis (or expectation) is that the two comparison groups are the same. In this case, the hypothesis is that the pre and post activity % correct (for each question) of your CME participants are equivalent. So…when you cross the threshold into “statistical significance” you’re not saying “Hey, these groups are different!” Instead you’re saying, “Hey, these groups are supposed to be the same, but the data doesn’t support that expectation!” Which, if said quickly, sounds like the same thing…but there’s a very important distinction. Statistical tests of significance do not test whether two groups are different, they test whether two groups are the same. You may jump to the conclusion that if they aren’t the same, they must be different, but statistically, you have no evidence to that point.

Yes…that is confusing. Which is probably why it gets glossed over in so many reports of CME outcomes. In actuality, you should think of P value as a sniff test. If you expect every flower to smell like a rose, P value can tell you if it does, but if P is < .05 (indicating the data doesn’t support that expectation), you can’t make any assumption about the flower’s true scent. You’d need other tests to isolate the actual smell. Same thing with our CME example…a P < .05 indicates that we can’t confirm the expectation that pre- and post-activity % correct for a given question are equivalent, but it doesn’t tell us that pre and post are different in any substantive way. It’s simply a threshold test…if we find statistical significance, the correct interpretation should be “Hey, I didn’t expect that, we should look into this further.” And that’s when other tests come into play (eg, effect size).
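To make the "expected to be the same" framing concrete, here is a minimal sketch of an exact McNemar's test for one matched pre/post question, written in plain Python. The function name and the counts are hypothetical, invented for illustration; they are not data from any actual CME activity.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value for matched pre/post answers.

    b = learners who were correct pre but incorrect post
    c = learners who were incorrect pre but correct post
    Under the null hypothesis (pre and post are equivalent), each
    learner who changed answers is equally likely to land in b or c.
    """
    n = b + c
    if n == 0:
        return 1.0  # nobody changed answers, so no evidence against the null
    k = min(b, c)
    # Probability of a split at least this lopsided in one direction
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double it for a two-sided test, capped at 1
    return min(1.0, 2 * tail)

# Hypothetical counts for a single case question (illustration only)
p = mcnemar_exact(b=3, c=12)
print(f"p = {p:.4f}")
```

Notice that the test only ever asks how surprising the data would be if pre and post were the same. Nothing in it measures how big the change is, which is why the follow-up question belongs to other tools.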

In summary, P value is not an endpoint. Be wary of any outcome data punctuated solely by P values. Hence, the image for this post (which is from Gratisography – a truly wonderful resource of free images): are you really getting what you expected from P values?


Filed under CME, Outcomes, P value, Pre vs. Post

Statistical analysis in CME

Statistics can help answer important questions about your CME.  For example, was there an educational effect and, if so, how big was it?  The first question is typically answered with a P value and the second with an effect size.
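As a rough illustration of the second question (how big was the effect?), here is a sketch of Cohen's d using a pooled standard deviation. This is one common formulation, treating pre and post as two groups; it is an assumption on my part and not necessarily the calculation used in the step-by-step approach. The scores are made up.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(pre, post):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n1, n2 = len(pre), len(post)
    # Pooled variance weights each group's sample variance by its degrees of freedom
    pooled_var = ((n1 - 1) * variance(pre) + (n2 - 1) * variance(post)) / (n1 + n2 - 2)
    return (mean(post) - mean(pre)) / sqrt(pooled_var)

# Hypothetical % correct scores (illustration only)
pre_scores = [40, 50, 60, 50, 50]
post_scores = [60, 70, 80, 70, 70]

d = cohens_d(pre_scores, post_scores)
print(f"Cohen's d = {d:.2f}")
```

Cohen's conventional benchmarks treat roughly 0.2 as a small effect, 0.5 as medium and 0.8 as large, though those cutoffs are rules of thumb rather than hard limits.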

If this were 10 years ago, you’d either be purchasing some expensive statistical software or hiring a consultant to answer these questions.  Today (thank you Internet), it’s simple and basically free.

A step-by-step approach can be found here.



Filed under CME, CMEpalooza, Cohen's d, Effect size, P value, Statistical tests of significance, Statistics

P values – controlling for multiple comparisons

A p-value helps you decide whether to reject a given (null) hypothesis. In CME, however, we often don’t have a single hypothesis. Sure, we expect physicians who participate in CME to have changes in competency, performance or maybe even their patients’ health, but we’re not very good at testing this directly via a single hypothesis (as compared to clinical drug trials). Our typical approach is to give CME participants a survey containing several knowledge/self-efficacy/case questions and then run t tests or chi-square tests to see if they answer differently pre vs. post (or even post vs. a control group). This results in a p-value for each question, which means that each question is essentially a hypothesis. If you’re going to have more than one hypothesis in a single study, you need to control for multiple comparisons. This is because each additional hypothesis tested in a single study increases the likelihood that at least one difference uncovered is due to chance (as opposed to a true difference between the comparison groups).

For example, if you conduct a single statistical test and use the conventional threshold (P < .05), there is only a 5% chance that you’ll reject a true null hypothesis (i.e., find that a difference exists between groups when it doesn’t). But if you have a 20-question survey and you’re conducting a statistical test for each question, you now have a 64% chance of making one or more false findings, assuming the tests are independent (the formula from which this was derived can be found here).
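The 64% figure can be checked with a few lines of Python. This sketch assumes the tests are independent and that every null hypothesis is actually true; the function name is mine, not from the linked formula.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n_tests independent tests."""
    # (1 - alpha) ** n_tests is the chance that every test avoids a false positive
    return 1 - (1 - alpha) ** n_tests

for n in (1, 5, 10, 20):
    print(f"{n:>2} tests: {familywise_error_rate(n):.0%}")
```

With correlated questions (common in a single survey) the true rate will differ, so treat this as a ballpark rather than an exact figure.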

Although I’d first recommend not conducting multiple comparisons, there aren’t many viable alternatives for most CME providers and such an approach can have value for hypothesis generation. That being said, a simple way to address the multiple comparison issue is via the Simes-Hochberg correction [1,2].

Here are the steps:

  1. After you’ve run all your statistical tests, order all p-values from high to low.
  2. If the highest p-value is less than .05, stop here; all tests are significant.
  3. If not, and the second highest p-value is less than .025 (which is .05/2), stop here; that test and all following tests are significant.
  4. If not, and the third highest p-value is less than .017 (which is .05/3), stop here; that test and all following tests are significant.
  5. And so on, comparing each p-value with .05 divided by its ranking among all multiple comparison p-values.
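The steps above can be sketched in Python as follows. The function name and the p-values are hypothetical, and the strict "less than" comparison follows the steps as written here (Hochberg's paper [2] uses "less than or equal to", which only matters on exact ties).

```python
def hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where a test stays significant."""
    # Indices of the p-values, ordered from the highest p-value to the lowest
    order = sorted(range(len(p_values)), key=lambda i: p_values[i], reverse=True)
    significant = [False] * len(p_values)
    for rank, idx in enumerate(order, start=1):
        # Compare each p-value with alpha divided by its rank (1 = highest)
        if p_values[idx] < alpha / rank:
            # This test and all following (smaller) p-values are significant
            for j in order[rank - 1:]:
                significant[j] = True
            break
    return significant

# Hypothetical p-values for a five-question survey (illustration only)
example = [0.001, 0.012, 0.03, 0.04, 0.2]
print(hochberg(example))
```

In this made-up example, four questions fall below .05 before correction, but only the two smallest p-values survive it.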

Here’s an example (note that 5 comparisons were significant prior to the multiple comparison correction, after which none of the comparisons maintained statistical significance):


[Table: columns “P value” and “Adjusted p-value”; rows Question 1 through Question 10]

1. Simes RJ. An improved Bonferroni procedure for multiple tests of significance. Biometrika 1986;73:751–54.

2. Hochberg Y. A sharper Bonferroni procedure for multiple significance testing. Biometrika 1988;75:800–02.


Filed under P value, Statistics