Category Archives: Power calculation

Losing Control

CME has been walking around with spinach in its teeth for more than 10 years.  And while my midwestern mindset defaults to “don’t make waves”, I think it’s officially time to offer a toothpick to progress and pluck that pesky control group from the front teeth of our standard outcomes methodology.

That’s right, CME control groups are bunk. Sure, they make sense at first glance: randomized controlled trials (RCTs) use control groups and they’re the empirical gold standard.  However, as we’ll see, the magic of RCTs is the randomization, not the control: without the “R” the “C” falls flat.  Moreover, efforts to demographically match controls to CME participants on a few simple factors (eg, degree, specialty, practice type and self-reported patient experience) fall well short of the vast assemblage of confounders that could account for differences between these groups. In the end, only you can prevent forest fires and only randomization can ensure balance between samples.

So let’s dig into this randomization thing.  Imagine you wanted to determine the efficacy of a new treatment for detrimental modesty (a condition in which individuals are unable to communicate mildly embarrassing facts).  A review of clinical history shows that individuals who suffer from this condition represent a wide range of race, ethnicity and socioeconomic strata, and vary in health metrics such as age, BMI and comorbidities.  Accordingly, you recruit a sufficient sample* of patients with this diagnosis and randomly assign them to two groups: 1) those who will receive the new treatment and 2) those who will receive a placebo.  The purpose of this randomization is to balance the factors that could confound the relationship you wish to examine (ie, treatment to outcome).  Assume the outcome of interest is likelihood to tell a stranger he has spinach in his teeth.  Is there a limit to the number of factors you can imagine that might influence an individual’s ability for such candor?  And remember, clinical history indicated that patients with detrimental modesty are diverse in regard to social and physical characteristics.  How can you know that age, gender, height, religious affiliation, ethnicity or odontophobia won’t enhance or reduce the effect of your treatment?  If these factors are not evenly distributed across the treatment and control groups, your conclusion about treatment efficacy will be confounded.

So…you could attempt to match the treatment and control groups on all potential confounders or you could take the considerably less burdensome route and simply randomize your subjects into either group.  While all of these potential confounders still exist, randomization ensures that both the treatment and control group are equally “not uniform” across all these factors and therefore comparable.  It’s very important to note that the “control” group is simply what you call the population who doesn’t receive treatment.  The only reason it works is because of randomization.  Accordingly, simply applying a control group to your CME outcome assessment without randomization is like giving a broke man a wallet – it’s so not the thing that matters.
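For anyone who wants to see that balancing act rather than take it on faith, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration: the single confounder (age, spread uniformly from 25 to 75), the coin-flip assignment, and the self-selection rule in which older people are more likely to opt in. It compares the age gap between groups formed by randomization with the gap between groups formed by self-selection.

```python
import random
from statistics import mean


def simulate(n=10_000, seed=0):
    """Return (age gap under randomization, age gap under self-selection)."""
    rng = random.Random(seed)
    # Hypothetical confounder: age, uniform between 25 and 75.
    ages = [rng.uniform(25, 75) for _ in range(n)]

    # Randomization: a coin flip decides the group, independent of age.
    treated, control = [], []
    for age in ages:
        (treated if rng.random() < 0.5 else control).append(age)

    # Self-selection (assumed rule): chance of opting in rises with age.
    joiners, others = [], []
    for age in ages:
        p_join = (age - 25) / 50
        (joiners if rng.random() < p_join else others).append(age)

    return mean(treated) - mean(control), mean(joiners) - mean(others)


randomized_gap, self_selected_gap = simulate()
print(f"Mean age gap, randomized groups:    {randomized_gap:.2f} years")
print(f"Mean age gap, self-selected groups: {self_selected_gap:.2f} years")
```

Under these made-up rules the randomized groups end up with nearly identical average ages, while the self-selected groups differ by many years. Nothing was matched in either case; the coin flip did the balancing.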

Now let’s bring this understanding to CME.  There are approximately 18,000 oncology physicians in the United States.  In only two scenarios will the participants in your oncology-focused CME represent an unbiased sample of this population: 1) all 18,000 physicians participate or 2) at least 377 participate (sounds much more likely), all of whom were randomly sampled (wait…what?).  For option #2, the CME provider would need access to the entire population of oncology physicians, from which they would draw a random sample sized according to their empirically expected response rate to invitations in order to reach the 377-participant target.  Probably not standard practice.  If neither scenario applies to your CME activity, then the participants are a biased representation of your target learners.  Of note, biased doesn’t mean bad.  It just means that there are likely factors that differentiate your CME participants from the overall population of target learners and, most importantly, these factors could influence your target outcomes.  How many potential factors? Some CME researchers suggest more than 30.
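Where might a number like 377 come from? The post doesn’t say, but it matches what the common survey sample-size formula with a finite population correction returns for 18,000 physicians at 95% confidence and a ±5% margin of error. The sketch below assumes that formula and those settings (plus the conservative 50% response proportion); the function name is made up for illustration.

```python
import math
from statistics import NormalDist


def finite_population_sample_size(population, margin=0.05,
                                  confidence=0.95, p=0.5):
    """Sample size for estimating a proportion in a finite population.

    Assumes simple random sampling; p=0.5 is the most conservative choice.
    """
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Finite population correction, rounded up to a whole person.
    return math.ceil(n0 / (1 + (n0 - 1) / population))


print(finite_population_sample_size(18_000))  # prints 377
```

Note that the formula only tells you how many randomly sampled respondents you need. It does nothing for you if the 377 chose themselves.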

Now think about a control group. Are you pulling a random sample of your target physician population?  See scenario #2 above.  Also, are you having any difficulty attracting physicians to participate in control surveys?  What’s your typical response rate?  Maybe you use incentives to help?  Does it seem plausible that the physicians who choose to respond to your control group surveys would be distinct from the overall physician population you hope they represent?  Do you think matching this control group to participants based on just profession, specialty, practice location and type is sufficient to balance these groups?  Remember, it’s not the control group, it’s the randomization that matters.  RCTs would be a lot less cumbersome if they only had to match comparison groups on four factors.  Of course, our resulting pharmacy would be terrifying.

So, based on current methods, we’re comparing a biased sample of CME participants to a biased sample of non-participants (control) and attributing any measured differences to CME exposure.  This is a flawed model.  Without balancing the inherent differences between these two samples, it is impossible to associate any measured differences in response to survey questions to any specific exposure.  So why are you finding significant differences (ie, P < .05) between groups?  Because they are different.  The problem is we have no idea why.

By what complicated method can we pluck this pesky piece of spinach?  Simple pre- versus post-activity comparison.  Remember, we want to ensure that confounding factors are balanced between comparison groups.  While participants in your CME activity will always be a biased representation of your overall target learner population, those biases are balanced when participants are used as their own controls (as in the pre- vs. post-activity comparison).  That is, both comparison groups are equally “non-uniform” in that they are composed of the same individuals. In the end, you won’t know how participants differ from non-participants, but you will be able to associate post-activity changes with your CME.
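Here is a minimal sketch of what “participants as their own controls” can look like in practice. It is only one option: it assumes each participant’s pre- and post-activity responses can be linked, that the scores can reasonably be treated as numeric, and that a paired t-test is an acceptable choice (the post does not prescribe a test). The scores are invented for illustration.

```python
import math
from statistics import mean, stdev


def paired_t(pre, post):
    """Paired t statistic and degrees of freedom for linked pre/post scores."""
    if len(pre) != len(post):
        raise ValueError("each participant needs both a pre and a post score")
    # Work on within-person change, so every stable trait of a participant
    # (specialty, experience, motivation...) cancels out of the comparison.
    diffs = [after - before for before, after in zip(pre, post)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical 5-point self-efficacy ratings from six participants.
pre_scores = [2, 3, 3, 4, 2, 3]
post_scores = [3, 4, 3, 5, 3, 4]
t_stat, dof = paired_t(pre_scores, post_scores)
print(f"t = {t_stat:.2f} on {dof} degrees of freedom")
```

The point of the design is in the `diffs` line: the comparison is made within each person, so the two “groups” cannot differ on who is in them.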

Filed under Best practices, CME, Confounders, Control groups, Needs Assessment, Outcomes, Power calculation, Pre vs. Post

What sample size do I need?

I was recently asked the following:

Do you have any information on the sample size needed to obtain statistical significance for surveys?

That depends on the type of survey.  If you’re looking for sample size necessary for a needs assessment survey, you can find clear instructions here.  For a comparative assessment (e.g., participants pre- vs. post-CME activity or CME participants vs. representative control group), the necessary sample size would be determined by a  power calculation…but don’t worry about how to do a power calculation, odds are it doesn’t fit your assessment.

A very helpful explanation of power calculations by Professor Mean (think “average” not “unpleasant”) can be found here.   Professor Mean details three things needed for a power calculation:

  1. a research hypothesis,
  2. a standard deviation for your outcome measure, and
  3. an estimate of a clinically relevant difference for this outcome measure.
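To make those three ingredients concrete, here is a rough sketch of how they feed a textbook sample-size formula for comparing two group means (normal approximation, two-sided test). The formula choice, the 5% significance level, the 80% power target, and the example standard deviation and difference are all assumptions for illustration, not values taken from Professor Mean or from any CME instrument.

```python
import math
from statistics import NormalDist


def n_per_group(sd, difference, alpha=0.05, power=0.80):
    """Approximate participants needed per group to detect `difference`.

    sd         -- assumed standard deviation of the outcome measure
    difference -- smallest difference considered clinically relevant
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / difference) ** 2)


# Hypothetical inputs: sd of 1 scale point, relevant difference of 0.5 points.
print(n_per_group(sd=1.0, difference=0.5))  # prints 63
```

Notice that the function cannot run without a standard deviation and a relevant difference. If you can’t supply those two numbers, you don’t have a power calculation.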

The standard CME assessment is as follows: participants in a CME activity are given a survey (this survey consists of case-based questions, Likert-scale questions, or both) and their responses to this survey are compared pre- vs. post-participation, post-participation vs. the responses of a representative non-participant group, or both.  Other than the umbrella expectation that CME participants will respond better to each question after CME exposure (i.e., more in accordance with the educational messages of the CME activity), there is seldom a specific hypothesis defined (see power calculation criterion #1 above).  You could argue that each survey question is a hypothesis, in which case you would need to be able to identify a standard deviation (criterion #2) and clinically relevant difference (criterion #3) for each.  If you’re using a Likert-scale survey, what’s the standard deviation for self-efficacy in performing a diabetic foot exam?  And if a physician’s self-efficacy climbs 1 point, is that clinically relevant?  If you’re using a case-based instrument, what’s the standard deviation for prescribing an LDL-lowering drug in a patient with 0-1 risk factors for CHD and an LDL level of > 190 mg/dL?  Can you imagine having to answer these questions for every CME assessment instrument for every CME activity?  I can’t.  Which is why I/we don’t/shouldn’t worry about power calculations.

The purpose of a power calculation is to conserve resources and protect people from harm.  In regard to clinical drug trials, each subject added to your study increases both expense and exposure to potentially harmful treatment.  Clearly a calculation to identify the minimum number of study subjects is useful in this setting.  In CME, we want to educate as many physicians as possible and each additional physician educated should decrease the amount of harm experienced by their patients.  Power calculations don’t make sense in CME planning, and we shouldn’t pretend otherwise.

Now for the best part…go ahead and run statistical tests on your survey data.  If your results achieve statistical significance, then you had adequate power.  That doesn’t mean your assessment is without methodologic flaws…just that power isn’t one of them.  If your results don’t achieve statistical significance, then you have two conclusions: 1) in this assessment, there was not a detectable difference between CME participants and the comparison group, and 2) the inability to detect a difference could be due to an insufficient number of assessment participants.

I know it sounds smart to talk about power calculations, but in most cases the truth is exactly the opposite.  Next time you hear someone claiming they did a power calculation for a CME assessment, ask them to address each of Professor Mean’s three criteria.


Filed under Power calculation, Sample size