What does it mean when your CME participants score worse on a post-test assessment (compared to pre-test)?
Here are some likely explanations:
- The difference was not statistically significant. Significance testing determines whether we reject the null hypothesis (null hypothesis = pre- and post-test scores are equivalent). If the difference was not significant (ie, P > .05), we can’t reject this assumption. And if the pre/post response rate was too low to warrant statistical testing at all, the direction of change is meaningless – you don’t have a representative sample.
- Measurement bias (specifically, “multiple comparisons”). This bias results from conducting many comparisons within a single sample (ie, asking dozens of pre/post questions of a single audience). The problem with multiple comparisons: the more questions you ask, the more likely you are to find a significant difference by chance alone – one that wouldn’t hold up under more focused assessment. And yes, this is a bias to which many CME assessments are subject.
- Bad question design. Did you follow key question development guidelines? If not, the post-activity knowledge drop could be due to misinterpretation of the question. You can learn more about question design principles here.
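If you want to sanity-check the first two explanations against your own data, here’s a minimal sketch of a paired pre/post significance test plus a Bonferroni correction for multiple comparisons. The scores are hypothetical, and it assumes you have scipy installed and matched (same-learner) pre/post responses:

```python
# Sketch: paired pre/post significance test with a multiple-comparisons
# correction. Scores below are hypothetical; substitute your own matched
# learner-level pre/post scores for a single question.
from scipy import stats

pre  = [40, 55, 60, 45, 50, 65, 55, 50, 60, 45]
post = [50, 60, 70, 55, 60, 70, 65, 60, 70, 55]

t, p = stats.ttest_rel(post, pre)
print(f"paired t = {t:.2f}, p = {p:.5f}")

# If you ran this same test on 20 questions, a Bonferroni correction
# keeps the family-wise error rate at .05 by testing each question
# at .05 / 20 instead of .05.
n_questions = 20
alpha_corrected = 0.05 / n_questions
print(f"per-question threshold: {alpha_corrected:.4f}")
print("significant after correction?", p < alpha_corrected)
```

Bonferroni is the bluntest correction available, but it makes the multiple-comparisons point: a P value that clears .05 may not clear .0025.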
I’ve talked a lot about effect size: what it is (here), how to calculate it (here, here and here), what to do with the result (here and here)…and then some about limitations (here). Overall, I’ve been trying to convince you that effect size is a sound (and simple) approach to quantifying the magnitude of CME effectiveness. Now it’s time to talk about how it may be total garbage.
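For readers who skipped those earlier posts, the calculation itself really is simple. A common choice is Cohen’s d: the difference between post and pre means, divided by the pooled standard deviation. A minimal sketch with hypothetical scores (this is one of several effect size formulas, not the only one):

```python
# Sketch: Cohen's d for pre/post mean scores (hypothetical data).
# d = (mean_post - mean_pre) / pooled standard deviation.
from statistics import mean, stdev

pre  = [40, 55, 60, 45, 50, 65, 55, 50, 60, 45]
post = [50, 60, 70, 55, 60, 70, 65, 60, 70, 55]

def cohens_d(a, b):
    """Standardized mean difference between two score lists."""
    n1, n2 = len(a), len(b)
    s1, s2 = stdev(a), stdev(b)
    pooled = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (mean(b) - mean(a)) / pooled

d = cohens_d(pre, post)
print(f"Cohen's d = {d:.2f}")  # by convention, ~0.8 or above is "large"
```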
All this effect size talk rests on the supposition that the data from which it is calculated are both reliable and valid. In CME, the data source is overwhelmingly the survey – and the questions within typically include self-efficacy scales, single-correct-answer knowledge tests, and/or case vignettes. But how do you know that your survey questions actually measure what they intend to measure (validity) and do so consistently (reliability)? CME has been repeatedly dinged for not using validated measurement tools. And if your survey isn’t valid (or reliable), why would your data be worth anything? Effect size does not correct for bad questions. So maybe next time you’re touting a great effect size (or trying to bury a bad one), you should also consider (and be able to document) whether you’ve demonstrated the effectiveness of your CME or the ineffectiveness of your survey.
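One standard reliability statistic, Cronbach’s alpha, is easy enough to compute yourself as a first pass. A minimal sketch with hypothetical 1–5 responses to a four-item self-efficacy scale (rows are respondents, columns are items):

```python
# Sketch: Cronbach's alpha as a quick internal-consistency check on a
# multi-item scale. Data are hypothetical: rows = respondents,
# columns = items scored 1-5.
from statistics import pvariance

responses = [
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 2],
    [4, 4, 5, 4],
]

def cronbach_alpha(rows):
    k = len(rows[0])  # number of items in the scale
    item_vars = [pvariance([r[i] for r in rows]) for i in range(k)]
    total_var = pvariance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")  # >= .70 is a common rule of thumb
```

Note that alpha speaks only to reliability (internal consistency); it says nothing about whether the items measure what you think they measure.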
So what can be done? Well, you can hire a psychometrician and add complicated-sounding things like “factor analysis” and “Cronbach’s alpha” to your lexicon (yell those out during the next CME presentation you attend…and then quickly run out of the room). Or (actually “and”), you can start with sound question-design principles. The key thing to note is that no amount of complex statistics can make a bad question good – so you need to know the fundamentals of assessing knowledge and competence in medical education. Where do you get those? Here are some suggestions to get you started:
- Take the National Board of Medical Examiners (NBME) U course entitled Assessment Principles, Methods, and Competency Framework. This is an awesome (daresay, the best) resource for anyone assessing knowledge and competence in medical education. Complete this course (there are 20 lessons, each under 30 minutes) and you’ll be as expert as anyone in CME. You can register here. And it’s free!
- Check out Dr. Wendy Turell’s session entitled Tips to Make You a Survey Measurement Rock Star during the next CMEpalooza (April 8th at 1:30 eastern). This is her wheelhouse – so steal every bit of her expertise you can. Once again, it’s free.