The Coordinators of Psychiatric Education (COPE) is a committee consisting of all 16 Canadian psychiatric residency training program directors and a designated resident from each program. The COPE exam has been a formative exam taken every other year by Canadian psychiatric residents in postgraduate years (PGY) 2 to 5 and is meant to help residents prepare for practice and for their specialty certification exam in psychiatry. Formative residency in-training exams have precedent in other medical specialties, including anesthesia (1), internal medicine (2, 3), general surgery (4), and obstetrics and gynecology (5), with evidence of predictive validity for performance on certifying exams. Likewise, in the United States, results of the Psychiatry Resident In-Training Examination (PRITE) have been found to correlate moderately (r=0.67) with scores on the American Board of Psychiatry and Neurology Part 1 Examination (6) and highly with faculty/preceptor clinical evaluations (r=0.82–0.89, depending on level of training) (7).
In previous iterations of the COPE exam, acceptability, reliability, and validity were not assessed, nor were standard setting techniques used to establish minimum performance levels. The exam was dependent on the content and quality of questions submitted from year to year by program directors. The COPE exam was reconstructed in 2006 using assessment best practices, including developing an exam blueprint, setting performance standards, making the exam annual, developing a question bank, and assessing the reliability and validity of the exam on an ongoing basis.
In reconstructing the COPE exam, the PRITE was used as a model. Created by The American College of Psychiatrists, the PRITE has been administered annually since 1979, with good reliability (Kuder-Richardson reliability coefficient for the global score in psychiatry >0.90) and with construct and content validity documented in the literature (8). The PRITE is formative for psychiatric residents: each year, after the examination has been scored, a copy of the examination and the correct responses, supported by references, are released to the candidates. Nearly all psychiatric residents in the United States take the exam multiple times during training, and it serves not only as a preparatory aid for board exams but also as a means of providing residents feedback about their knowledge compared with that of other psychiatric residents at the same level of training (9). As mentioned, predictive validity has been demonstrated with respect to scores on the American Board of Psychiatry and Neurology Part 1 Examination (6) and faculty/preceptor clinical evaluations (7).
Although the reconstructed COPE exam follows the general model and principles of exam development that characterize the PRITE, the content of the two exams differs. One-third of the PRITE consists of questions related to neurology, and the PRITE places greater emphasis on pharmacology than on psychotherapy. In contrast, the COPE exam is based on the Royal College Objectives of Training in Psychiatry (10). The impetus for a Canadian in-training exam also derives from the current lack of performance standards with which to compare Canadian resident performance on the PRITE, as well as from the cost of the PRITE for Canadian psychiatric residents ($150 per resident). These factors make the PRITE less than ideal for Canadian programs, and as such, there has been an ongoing desire to improve and standardize the COPE exam. The overarching goal of our study is to evaluate the reliability (internal consistency) and validity (construct and content) of the reconstructed COPE exam. As part of this evaluation, there is particular interest in resident performance relative to a minimum performance level, along with the ability of the exam to discriminate between PGY groups, based on the hypothesis that resident performance should improve with increasing levels of training.
The reconstructed COPE exam was planned to consist of 100 multiple-choice, single-best-answer questions testing knowledge, comprehension, application, and analysis of psychiatric material (11). Initially, 22 content areas were identified; these were reduced to 18 areas during exam development. Weightings were assigned to each content area in an attempt to reflect the practice of a general psychiatrist, and these weightings yielded an approximate number of exam questions dedicated to each content area. Each content area was further subdivided into four subsections: basic science, etiology, and epidemiology; assessment and diagnosis; biological therapies (pharmacotherapy and ECT); and psychotherapy.
The proposed exam blueprint and a blank blueprint (content areas but no assigned weightings) were distributed directly to all 16 Canadian psychiatry program directors. The program directors were asked to provide written, voluntary, signed consent and then complete a blank blueprint by filling in assigned weightings for each content area. A similar process of validation and potential modification has been utilized in development of the PRITE (12).
After the blueprint was reviewed, 100 new multiple-choice questions were developed, because there was no exam bank to reference. The questions were primarily based on knowledge content from core psychiatry textbooks (13, 14) and on practice guidelines from The Canadian Journal of Psychiatry and The American Journal of Psychiatry. The questions were reviewed by other psychiatrists with academic appointments in the Department of Psychiatry at the University of Calgary for relevance, content, and wording, and to ensure that the answers were correct and supported by appropriate literature references. The exam was then translated into French.
Standard setting of the exam was completed using a modified Angoff method to determine a minimum performance level for each question. Briefly, after discussing the intent of the exam and reviewing the blueprint, five psychiatrists were asked to define the characteristics of a group of examinees who would barely pass the exam (a minimally competent group). This rater training process is important to ensure that the judges have a clear understanding of the expectations of the exam and the performance of the examinees. The five judges then considered each question and estimated the percentage of the minimally competent group that would answer it correctly. Estimates were recorded and presented to the five judges for discussion and revision. The final values were averaged for each question, and the minimum performance level for the exam was the sum of those averages (15–17). The aim was for an absolute cutoff mark of 70%, to approximate that used by the Royal College of Physicians and Surgeons of Canada (RCPSC) for the psychiatry certification examinations. The minimum performance level set for the 2007 COPE exam was 71.2%.
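To make the aggregation concrete, the following is a minimal sketch of the modified Angoff arithmetic described above, written in Python with simulated judge ratings; the matrix shape, values, and variable names are illustrative, not study data.

```python
# Minimal sketch of the modified Angoff aggregation (hypothetical data).
import numpy as np

rng = np.random.default_rng(0)
# Rows are the 5 judges, columns are the 100 items; each entry is a judge's
# estimate of the proportion of minimally competent examinees who would
# answer that item correctly (simulated here for illustration).
ratings = rng.uniform(0.5, 0.9, size=(5, 100))

item_means = ratings.mean(axis=0)   # average the judges' estimates per item
cut_score = item_means.sum()        # expected raw score of a borderline examinee
# With 100 one-point items, the minimum performance level as a percentage:
min_performance_level = 100 * cut_score / ratings.shape[1]
print(f"Minimum performance level: {min_performance_level:.1f}%")
```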
The research protocol was approved by the University of Calgary Research and Ethics Review Board prior to distribution. After a complete description of the study, written, voluntary, and signed consent was obtained from all residents taking the exam. Each resident was assigned a unique identification number to ensure confidentiality. All psychiatric residents (PGY-2–5) from each of the Canadian psychiatric residency programs were subsequently administered the exam by their respective programs under usual test conditions. Answers to the 100 multiple-choice questions were completed on computer score sheets, sealed in provided envelopes, and returned to the researchers along with demographic data. A total of 436 exams were returned to the researchers, with 402 having valid consents and demographic data suitable for analysis. Because the exam is formative in nature, a copy of the exam and the answer key, supported by references and explanations, were released to the candidates after the examination had been scored. Likewise, each resident’s score was provided, along with the mean and standard deviation for each PGY group, as a means of providing feedback.
Preliminary data analyses comprised a combination of descriptive and inferential statistics. Score distributions were calculated, including mean overall scores with standard deviations for each PGY group, along with comparisons between PGY groups. Test reliability (Cronbach alpha) and validity (ANOVA) were assessed. Item analyses were performed to determine the difficulty of individual items (p<0.25 identifying very difficult items, p>0.95 identifying very easy items), item discrimination (point-biserial correlation, with r<0.20 flagging poorly discriminating items), and distractor effectiveness.
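As an illustration of these analyses, the following is a minimal Python sketch computing Cronbach alpha, item difficulty, and point-biserial discrimination from a scored 0/1 response matrix; the simulated data and matrix dimensions are hypothetical, and the thresholds mirror the description above.

```python
# Minimal sketch of the item analyses described above (simulated data).
import numpy as np

rng = np.random.default_rng(1)
# Rows are examinees, columns are items; 1 = correct, 0 = incorrect.
responses = (rng.random((402, 100)) < 0.7).astype(float)
totals = responses.sum(axis=1)

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total-score variance)
k = responses.shape[1]
alpha = k / (k - 1) * (1 - responses.var(axis=0, ddof=1).sum() / totals.var(ddof=1))

# Item difficulty: proportion of examinees answering each item correctly.
difficulty = responses.mean(axis=0)

# Item discrimination: point-biserial (Pearson) correlation of each item with
# the total score, here using the common corrected variant that removes the
# item from its own total.
discrimination = np.array([
    np.corrcoef(responses[:, j], totals - responses[:, j])[0, 1]
    for j in range(k)
])

flags = {
    "very difficult (p<0.25)": int(np.sum(difficulty < 0.25)),
    "very easy (p>0.95)": int(np.sum(difficulty > 0.95)),
    "poor discrimination (r<0.20)": int(np.sum(discrimination < 0.20)),
}
print(f"alpha = {alpha:.2f}; flagged items: {flags}")
```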
Four of the 16 Canadian psychiatric residency training directors confirmed the proposed blueprint, and two made minor modifications; these modifications produced no change in the mean weightings, so the initial blueprint was retained. The exam blueprint is shown in Table 1.
Table 2 shows the PGY-2–5 distributions, along with mean examination scores and standard deviations (SD) for the 402 Canadian psychiatric residents who took and returned the exam with valid consents and demographic data suitable for analysis. The overall mean score ± SD for all residents was 69.6%±8.5%. The mean score for the PGY-2 group was 65.3%±7.9%; for PGY-3, 68.3%±7.3%; for PGY-4, 71.2%±7.2%; and for PGY-5, 77.4%±6.9%. A total of 44 residents (11% of all residents) scored 80% to 89%, and only one resident scored above 90% (91%). Of the 402 residents completing the exam, 217 (54%) scored below the 71.2% minimum performance level. For the PGY-5 group, of particular importance given their impending RCPSC certification exam that same year, 56 of 71 residents (79%) scored above the established minimum performance level.
Overall, Cronbach alpha was 0.79, suggesting good internal consistency. When considered by PGY group, the Cronbach alpha results were slightly lower, ranging from 0.70 for PGY-3 to 0.74 for PGY-2 (Table 2). The standard error of measurement (SEM) was used to calculate 95% confidence intervals; given the measurement error in this exam, any examinee’s true score could fall within ±7.72 points of the reported score.
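For reference, the conventional classical test theory calculation behind such an interval, sketched with the rounded overall values reported above (SD=8.5, α=0.79); the small difference from the reported ±7.72 presumably reflects unrounded inputs:

```latex
\mathrm{SEM} = \mathrm{SD}\sqrt{1-\alpha} = 8.5\sqrt{1-0.79} \approx 3.90,
\qquad
95\%\ \mathrm{CI} = \pm\,1.96 \times \mathrm{SEM} \approx \pm 7.63
```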
To determine whether mean scores differed across PGY groups, a one-way ANOVA was conducted with post hoc tests (Tukey’s honestly significant difference). Notably, there were significant differences in total scores between each of the PGY groups, with consistently better performance with increasing time in residency (F=42.68; df=3, 397; p<0.001). The PGY-5 group performed significantly better than the PGY-4 (p<0.001), PGY-3 (p<0.001), and PGY-2 (p<0.001) groups; the PGY-4 group performed significantly better than the PGY-3 (p=0.03) and PGY-2 (p<0.001) groups; and the PGY-3 group performed significantly better than the PGY-2 group (p=0.01). These trends are in keeping with the a priori hypothesis that more training predicts higher scores and suggest good discrimination between levels of training with respect to exam performance.
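For illustration, the following is a minimal Python sketch of a one-way ANOVA with Tukey’s HSD, using scipy and statsmodels on synthetic scores; the group means and SDs mirror Table 2, but the individual scores and per-group n are simulated, so the outputs will not reproduce the published statistics.

```python
# Minimal sketch of the one-way ANOVA with Tukey's HSD (synthetic scores).
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(2)
groups = {"PGY-2": (65.3, 7.9), "PGY-3": (68.3, 7.3),
          "PGY-4": (71.2, 7.2), "PGY-5": (77.4, 6.9)}
n_per_group = 100  # hypothetical; actual group sizes appear in Table 2

scores, labels = [], []
for name, (mean, sd) in groups.items():
    scores.append(rng.normal(mean, sd, n_per_group))
    labels += [name] * n_per_group

# Omnibus test of mean differences across PGY groups.
f_stat, p_value = stats.f_oneway(*scores)
print(f"F = {f_stat:.2f}, p = {p_value:.3g}")

# Pairwise post hoc comparisons with family-wise error control.
print(pairwise_tukeyhsd(np.concatenate(scores), labels))
```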
Analyses of the difficulty of individual items, item discrimination, and distractor effectiveness for the exam revealed four questions that exhibited poor item performance. Two items were originally miskeyed; these were corrected and all exams remarked, and the scores presented in the Results section reflect the corrected keys. After correction, both items showed acceptable difficulty, discrimination, and distractor effectiveness based on the cutoff values outlined in the Methods section. Of the other two poorly performing items (19% and 11% mean correct responses), the first was difficult yet still discriminated, while the second was both difficult and failed to discriminate. For the purpose of this analysis, both were retained, but the first question was reworked for the examination bank and the second was deleted.
Initial psychometric assessment of the COPE exam suggests that it is reliable and valid, with the ability to discriminate between residents based on time in training. The exam demonstrated acceptable reliability (α=0.79), which is consistent with, but not as high as, that of the psychiatry component of the PRITE (KR-20=0.91), based on 237 questions (8). Analyses of the difficulty of individual items, item discrimination, and distractor effectiveness for the COPE exam revealed that only two questions exhibited poor item performance; these likely diminished the overall reliability only minimally. For the exam analysis, these items were retained because they provided formative feedback to both the candidates and the exam developers, although one item was subsequently deleted from the bank and the other reworked.
Perhaps the most impressive findings from this exam administration relate to the apparent discrimination between PGY group performances. Without exception, the groups with more training performed significantly better than those with less, in keeping with the a priori hypothesis. Though such differences are intuitively expected and such discrimination is generally regarded as a desirable feature of an in-training exam, it is worth noting that other medical in-training examinations, including the PRITE (6), have not necessarily shown such consistent delineation of performance by training time. These findings from the COPE exam are suggestive of good construct and content validity. Moreover, they are in keeping with a fundamental objective of the RCPSC that residents meet graded expectations commensurate with increasing clinical knowledge and abilities (10). One potential source of bias is that residents with higher levels of training have likely written more in-training exams during residency and may therefore perform better than those with less training, in part because of familiarity with this type of exam; such an effect could persist even though the exam was reconstructed and consisted of questions not previously used.
In addition to performance comparisons between PGY groups, scores were also compared with the minimum performance level of 71.2% derived using the modified Angoff method. Data from the RCPSC indicate that in 2007, 93% of candidates who trained in Canadian psychiatric residency programs and attempted the psychiatry certification examination for the first time were successful (18). In our study, 79% of PGY-5 residents (56 of 71) exceeded the minimum performance level, lower than the first-time pass rate on the RCPSC exam. Any direct comparison of the COPE exam with the RCPSC certification exam is difficult at this stage, given the distinct exam conditions and factors at play. These include, but are not limited to, different groups of PGY-5s writing the two exams and different timing: the COPE exam is administered in the early part of the PGY-5 year, when most residents are likely early in the intensive preparation phase for the certification exam and have yet to attend any formal preparatory course. As mentioned previously, there is no indication yet of whether the COPE exam predicts performance on the RCPSC certification exam, and this remains an avenue for future investigation.
Overall, our study provides preliminary support that the reconstructed COPE exam demonstrates adequate reliability and validity, including showing some capacity to discriminate between levels of training. These findings provide an important foundation for moving forward with plans to make the exam annual, develop an evolving question bank, and reassess the psychometric properties of the exam on an ongoing basis.
At the time of submission, the authors reported no competing interests.