A p-value helps you decide whether to reject a null hypothesis. In CME, however, we often don’t have a hypothesis. Sure, we expect physicians who participate in CME to have changes in competency, performance or maybe even their patients’ health, but we’re not very good at testing this directly via a single hypothesis (as compared to clinical drug trials). Our typical approach is to give CME participants a survey containing several knowledge/self-efficacy/case questions and then run t-tests or chi-square tests to see if they answer differently pre vs. post (or even post vs. a control group). This results in a p-value for each question, which means that each question is essentially a hypothesis. If you’re going to have more than one hypothesis in a single study, you need to control for multiple comparisons. This is because each additional hypothesis tested in a single study increases the likelihood that at least one of the differences uncovered is due to chance (as opposed to a true difference between the comparison groups).
For example, if you conduct a single statistical test and use the conventional significance threshold (.05), there is only a 5% chance that you’ll reject a true null hypothesis (i.e., find that a difference exists between groups when it doesn’t). But if you have a 20-question survey and you’re conducting a statistical test for each question, you now have a 64% chance of making one or more false findings, assuming the tests are independent (1 − (1 − .05)^20 ≈ .64).
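To see how quickly the chance of a false finding grows with the number of questions, here is a minimal Python sketch. It assumes the tests are independent and that every null hypothesis is true; the function name `familywise_error_rate` is just an illustrative label, not something from a library.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n_tests independent
    tests, assuming every null hypothesis is actually true."""
    return 1 - (1 - alpha) ** n_tests


# Print the chance of one or more false findings for a few survey lengths.
for n in (1, 5, 10, 20):
    print(f"{n:>2} tests: {familywise_error_rate(0.05, n):.0%}")
# The 20-test line shows 64%, matching the figure in the text.
```

Real survey questions are usually correlated, so the true rate will typically differ somewhat from this figure; the sketch only illustrates the trend.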
Although I’d first recommend not conducting multiple comparisons, there aren’t many viable alternatives for most CME providers, and such an approach can have value for hypothesis generation. That being said, a simple way to address the multiple comparisons issue is the Simes-Hochberg correction [1,2].
Here are the steps:
- After you’ve run all your statistical tests, order the p-values from high to low.
- If the highest p-value is < .05, stop here; all tests are significant.
- If not, and the second-highest p-value is < .025 (which is .05/2), stop here; that test and all following tests are significant.
- If not, and the third-highest p-value is < .017 (which is .05/3), stop here; that test and all following tests are significant.
- And so on, comparing each p-value with .05 divided by its rank among all the p-values.
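The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: the function name `hochberg_significant` is made up for this post, the p-values in the usage line are hypothetical, and the comparison uses `<=` (the usual convention for this procedure) where the steps say "less than".

```python
def hochberg_significant(p_values, alpha=0.05):
    """Return one boolean per p-value (in the original order): True where
    the test remains significant under the step-up procedure."""
    m = len(p_values)
    # Indices of the p-values, sorted from highest to lowest p-value.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        # Compare the rank-th highest p-value with alpha divided by its rank.
        if p_values[idx] <= alpha / rank:
            # Stop: this test and all tests with lower p-values are significant.
            for j in order[rank - 1:]:
                significant[j] = True
            break
    return significant


# Hypothetical p-values, not taken from any real survey.
print(hochberg_significant([0.30, 0.020, 0.004, 0.012]))
# [False, True, True, True]
```

In this made-up case the highest p-value (.30) misses .05, but the second highest (.020) falls under .025, so it and the two lower p-values are all declared significant.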
Here’s an example (note that 5 comparisons were significant at p ≤ .05 prior to the multiple comparisons correction, after which none of the comparisons maintained statistical significance):
| Comparison | P value | Rank | Adjusted p-value threshold |
|---|---|---|---|
| Question 1 | .43 | 1 | .05 |
| Question 2 | .37 | 2 | .025 |
| Question 3 | .28 | 3 | .017 |
| Question 4 | .18 | 4 | .0125 |
| Question 5 | .07 | 5 | .01 |
| Question 6 | .05 | 6 | .0083 |
| Question 7 | .04 | 7 | .0071 |
| Question 8 | .04 | 8 | .0063 |
| Question 9 | .03 | 9 | .0056 |
| Question 10 | .01 | 10 | .005 |
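As a check on the table above, the following Python sketch recomputes each question's rank and threshold from the ten p-values and counts how many comparisons pass before and after the correction. The variable names are illustrative, and the comparison uses `<=` so that the .05 result for Question 6 counts as significant before correction, which is an assumption needed to match the count of 5.

```python
# P-values copied from the example table.
p_values = {
    "Question 1": 0.43, "Question 2": 0.37, "Question 3": 0.28,
    "Question 4": 0.18, "Question 5": 0.07, "Question 6": 0.05,
    "Question 7": 0.04, "Question 8": 0.04, "Question 9": 0.03,
    "Question 10": 0.01,
}
alpha = 0.05

# Rank 1 is the highest p-value; ties keep their original order.
ranked = sorted(p_values.items(), key=lambda item: item[1], reverse=True)

rows = []
for rank, (name, p) in enumerate(ranked, start=1):
    threshold = alpha / rank
    rows.append((name, p, rank, threshold, p <= threshold))

for name, p, rank, threshold, passes in rows:
    print(f"{name:<12} p={p:.2f}  rank={rank:<3} threshold={threshold:.4f}  passes={passes}")

# Count of comparisons at or below .05 before any correction.
unadjusted = sum(p <= alpha for p in p_values.values())
# If any row passes, that test and every lower p-value would be significant.
any_after_correction = any(row[4] for row in rows)

print("Significant before correction:", unadjusted)
print("Any significant after correction:", any_after_correction)
```

No row passes its threshold, so the procedure never reaches a stopping point and none of the comparisons remain significant.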
References:
1. Simes RJ. An improved Bonferroni procedure for multiple tests of significance. Biometrika 1986;73:751–54.
2. Hochberg Y. A sharper Bonferroni procedure for multiple significance testing. Biometrika 1988;75:800–02.