This past Thursday, I gave a short presentation on effect size at the SACME Spring Meeting in Cincinnati (a surprisingly cool city, by the way – make sure to stop by Abigail Street). Rather than talk about why effect size is important in CME, I focused on its limitations, hoping for feedback on how to refine current methods. My main concerns:
- Using mean and standard deviation from ordinal variables to determine effect size (how big of a deal is this?)
- Transforming Cramer’s V to Cohen’s d (is there a better method?)
- How many outcome questions should be aggregated for a given CME activity to determine an overall effect? (my current minimum is four)
The SACME slide deck is here. I got some good feedback at the meeting, which may lead to some changes in the approach I’ve previously recommended. Until then, if you have any suggestions, let me know.
Statistics can help answer important questions about your CME. For example, was there an educational effect and, if so, how big was it? The first question is typically answered with a P value and the second with an effect size.
If this were 10 years ago, you’d either be purchasing some expensive statistical software or hiring a consultant to answer these questions. Today (thank you Internet), it’s simple and basically free.
A step-by-step approach can be found here.
You have been calculating an effect size for each of your CME activities, right? And now you have a database full of activities with corresponding effect sizes for say, knowledge and competence outcomes. Sound familiar? Anyone…anyone…Bueller? Okay, for the one straggler, here’s a refresher:
- What is effect size? (link)
- How to calculate effect size (link)
- Reporting effect size (link)
- Effect size – other methodologic/statistical considerations (link)
Now that we’re all on the same page, let’s move on to the next question…what exactly is a “good” effect size? Well, you would first start with Cohen (Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates), who identified the following general benchmarks: 0.2 = small effect, 0.5 = medium effect, and 0.8 = large effect. Although effect size is relatively new to CME, thankfully more specific effect size data are available. Starting with recent literature (specifically, meta-analyses), the following effect sizes have been reported:
It’s important to note that these effect sizes are the result of mixed measurement methods (and that measurement approach influences effect size), but they are certainly more relevant than Cohen’s benchmarks (and we know that Cohen wouldn’t take offense, because refining effect sizes through repeated measurement in a given area is exactly what he recommended).
In regard to repeated measurement, we have been measuring knowledge- and competence-level effect sizes for a variety of CME activities over the past two years. In the next post, I’ll be publishing our effect size results for a variety of live and enduring material formats. I’d love to hear how these results jibe with your findings.
CE Measure just published our manuscript regarding effect size in CME assessment. In it, we compare traditional tests of statistical significance (e.g., t-test) with effect size measures and then provide a step-by-step guide for calculating Cohen’s d (one of the more popular effect size measures).
Check it out. Like this blog – all the information is free.
Over the previous three posts, I introduced effect size, discussed its calculation and interpretation, and even provided an example of how you can use effect size to demonstrate the effectiveness of your overall CME program. My intention was to present a method for CME assessment that is both practical and powerful.
For those a bit more statistically savvy, you likely noticed that my previous effect size example focused on paired, ordinal data. That is, I used a pre- vs. post-activity survey (i.e., paired) composed of rating-scale (i.e., ordinal) questions. I chose this path because it’s fairly common in CME outcome assessments and it’s the most straightforward calculation of Cohen’s d (which was the effect size measure of interest).
Here are some other scenarios:
- If you’re using pre- vs. post-activity case-based surveys, you’re now working with paired, nominal (or categorical) data that has most likely been dichotomized (e.g., transformed into correct/evidence-based preferred answer = 1, all other responses = 0). In this case, the road to effect size is a bit more complex (i.e., use McNemar’s test for statistical significance, calculate an odds ratio [OR], and convert the odds ratio to Cohen’s d). Of note, an OR is itself an effect size measure, and converting this to Cohen’s d is optional. The formula for this conversion is d = ln(OR)/1.81 (Chinn S: A simple method for converting an odds ratio to effect size for use in meta-analysis. Statistics in Medicine 2000, 19:3127-3131).
- If you’re using post-activity case-based surveys administered to CME participants and a representative control group, you’re now working with unpaired, nominal data (that is typically dichotomized into correct answer vs. incorrect answer). In this case, you’ll use a chi-square test (if the sample is large) or Fisher’s exact test (if the sample is small) and also calculate a Cramer’s V. You’ll then need to convert Cramer’s V to Cohen’s d (which you can do here).
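Both conversions are simple enough to script yourself. Here's a minimal Python sketch: the OR-to-d formula is Chinn's, while the Cramér's V conversion assumes a 2x2 table (where V equals the phi coefficient, which can be treated as a correlation r and converted via d = 2r/√(1 − r²)); the function names are mine, not from any particular calculator.

```python
import math

def odds_ratio_to_d(odds_ratio: float) -> float:
    """Convert an odds ratio to Cohen's d using Chinn (2000): d = ln(OR) / 1.81."""
    return math.log(odds_ratio) / 1.81

def cramers_v_to_d(v: float) -> float:
    """Convert Cramer's V from a 2x2 table to Cohen's d.

    For a 2x2 table, Cramer's V equals the phi coefficient, which can be
    treated as a correlation r and converted via d = 2r / sqrt(1 - r^2).
    """
    return 2 * v / math.sqrt(1 - v ** 2)

# Example: an OR of 2.5 for choosing the evidence-based answer post- vs. pre-activity
print(round(odds_ratio_to_d(2.5), 2))  # 0.51
print(round(cramers_v_to_d(0.3), 2))   # 0.63
```

Either way you land on the same d scale, so results from paired and unpaired designs can at least be eyeballed side by side.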
If you’ve been doing this, or any other analysis, incorrectly (as I have in the past, often do in the present, and will surely do in the future), don’t fret. Statisticians are constantly pointing out examples of faulty use of statistics in the peer-reviewed literature (even in such prestigious journals as JAMA and NEJM). Keep making mistakes; it means you’re moving forward.
In the previous two posts, I introduced effect size, walked through an effect size calculation and provided some insight regarding interpretation. Now I want to quickly identify one application of effect size data: ACCME reaccreditation.
ACCME criterion 11 states: The provider analyzes changes in learners (competence, performance, or patient outcomes) achieved as a result of the overall program’s activities/educational interventions. I can only imagine the pages and pages of material heaped on ACCME reviewers in response to this criterion. How can you succinctly describe the effectiveness of a CME program, consisting of hundreds of activities over a two-, four- or six-year period? Oh yeah, effect size.
If you remember from the last post, effect sizes can be aggregated across activities as long as the education outcome measurement (EOM) approach remains the same. So assume you’re a healthcare system that regularly produces RSS, conferences and eLearning activities. Furthermore, assume your standard EOM approach across these activities is to measure self-reported utilization of clinical tasks related to CME activity content. If you’ve been calculating an effect size for each of these activities, you can aggregate the effect size scores across all of these activities to come up with a single effect size for competence (Level 4 outcome). Compare this effect size to the benchmarks identified in the previous post (e.g., 0.2 = small, 0.5 = medium, and 0.8 = large) and you have data-based evidence of your overall program effectiveness at this outcome level (see example Figure).
Taking it a step further, you can stratify effect size by format type, which would tell you how effective your eLearning was in relation to conferences or RSS (see example Figure 2). You can even further stratify by topic focus to see how effective your primary care CME was in relation to rheumatology-based CME, for example.
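The aggregation itself is just averaging. As a sketch, assuming you’ve stored each activity’s effect size alongside its format (the numbers below are made up for illustration):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (format, effect size) records -- one per CME activity,
# all measured with the same EOM approach so the d values are comparable.
activities = [
    ("eLearning", 0.45), ("eLearning", 0.62),
    ("conference", 0.30), ("conference", 0.41),
    ("RSS", 0.25), ("RSS", 0.33),
]

# Overall program effect size: average d across all activities.
overall_d = mean(d for _, d in activities)

# Stratified by format, for the format-level comparison.
by_format = defaultdict(list)
for fmt, d in activities:
    by_format[fmt].append(d)
format_d = {fmt: mean(ds) for fmt, ds in by_format.items()}

print(round(overall_d, 2))  # 0.39
print(format_d)
```

Swap "format" for "topic focus" and the same grouping gives you the primary care vs. rheumatology comparison.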
Now you’re responding to this criterion with just a few figures and explanatory paragraphs. And you’re using good data to do so. Maybe the next reaccreditation review won’t look so scary.
In the previous post, I introduced effect size (more specifically, Cohen’s d) as a statistical tool that can answer whether a CME activity was effective, as well as quantify the magnitude of this effectiveness and allow for comparisons of effectiveness across CME activities. Using Cohen’s d, a CME provider can report the effectiveness of an annual meeting in affecting, for example, participant competency (Level 4 outcomes) and then compare the magnitude of effect to previous years’ meetings and/or other CME activities of similar format or topic focus. Ultimately, a CME provider can determine benchmarks for effectiveness at each outcome level (or for each educational format) to quickly diagnose the performance of each CME activity. That sort of info comes in real handy for accreditation review and for communicating with sponsors (but that will be the focus of the next post).
So, all that being said, it’s now time to discuss how to actually calculate a Cohen’s d. One caution: you will not need a statistician, an advanced grasp of mathematics, or any specialty certification…if you can calculate (or, more likely, use MS Excel to calculate) an average and a standard deviation, and you have access to the Internet, you’re good.
I’ll set the stage with a common example: assume that you are a CME provider who just produced a 2-hour, mixed didactic-interactive case discussion regarding advances in the detection, evaluation and treatment of high blood cholesterol in adults. You used a paper-based survey (administered both pre- and post-activity) to measure participants’ self-reported utilization (on a 5-point scale) of clinical tasks related to the CME activity content. Each survey consisted of eight assessment items (i.e., clinical tasks). Now you want to summarize this pre- vs. post-activity data into a single effect size. The steps are as follows:
1. Calculate a mean rating and standard deviation for each assessment item in the pre-survey.
2. Calculate a mean rating and standard deviation for each assessment item in the post-survey.
3. Type “effect size calculator” into Google and click any of the identified links (I like to use this one).
4. Enter the data from items #1 and 2 (above) into the effect size calculator.
5. Behold the effect size for your activity!
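If you’d rather skip the online calculator, the same arithmetic fits in a few lines of Python. One caveat: calculators differ in how they pool the standard deviation; this sketch uses the simple average of the pre and post variances, which matches the classic Cohen’s d formula for equal group sizes.

```python
import math

def cohens_d(mean_pre: float, sd_pre: float, mean_post: float, sd_post: float) -> float:
    """Cohen's d for pre- vs. post-activity ratings, using a pooled standard deviation."""
    pooled_sd = math.sqrt((sd_pre ** 2 + sd_post ** 2) / 2)
    return (mean_post - mean_pre) / pooled_sd

# Hypothetical means/SDs from the 5-point pre- and post-surveys.
print(round(cohens_d(3.1, 1.0, 3.9, 1.0), 2))  # 0.8
```

A positive d means the post-activity ratings were higher; a negative d would mean they dropped.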
There is one more step…interpretation. For that, you need to be aware of the following:
1. Cohen’s d is expressed in standard deviation units. Accordingly, a Cohen’s d of 1.0 indicates that one standard deviation separates the pre-activity average rating from the post-activity average rating (with the post-activity rating being greater).
2. Cohen’s d is proportional. Therefore, a Cohen’s d of 1.0 is twice the magnitude of a Cohen’s d of 0.5 (or half the magnitude of a 2.0).
3. In theory, there is no upper or lower bound to Cohen’s d. In practice, however, most values fall between -3 and +3, with the majority between -1 and +1.
4. Benchmarks are used to assess the magnitude of a Cohen’s d. Based on repeated measurement, benchmarks (or expected ranges of Cohen’s d) can be established in a given area (e.g., mixed didactic-interactive CME). In areas where benchmarks remain to be established, the following preliminary benchmarks can be used to assess the magnitude of effect: 0.2 (small), 0.5 (medium) and 0.8 (large) (Cohen 1988).
5. You can compare the Cohen’s d from one activity to the d from any other activity that used a similar outcome assessment method (e.g., a case-based survey).
6. You can aggregate Cohen’s d across activities (i.e., take an average d across all of your eLearning activities, or all of your cholesterol-focused CME), assuming you used the same outcome assessment method for these activities (see item #5 above).
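If you want to automate the interpretation step, a small helper can map a d value onto Cohen’s preliminary benchmarks (the “negligible” label for values under 0.2 is my own addition, not Cohen’s):

```python
def interpret_d(d: float) -> str:
    """Label the magnitude of a Cohen's d using Cohen's (1988) preliminary benchmarks."""
    magnitude = abs(d)  # magnitude only; the sign just gives the direction of change
    if magnitude < 0.2:
        return "negligible"
    if magnitude < 0.5:
        return "small"
    if magnitude < 0.8:
        return "medium"
    return "large"

print(interpret_d(0.63))  # medium
```

Once you have field-specific benchmarks from repeated measurement, you would swap in those cutoffs instead.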
And just like that, you are now proficient in calculating and interpreting effect size in CME. I told you this would be easy. Now go forth and make this look hard to all of your competition.
Reference: Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd edition. Erlbaum, Hillsdale, NJ.