Norcini JJ, Lipner R, Downing SM: How meaningful are scores on a take-home recertification examination? Acad Med 1996; 71:S71–S73
Norcini JJ, Lipner RS: Recertification: is there a link between take-home and proctored examinations? Acad Med 1999; 74:S28–S30
In these two articles, Norcini and his colleagues report on the American Board of Internal Medicine's (ABIM's) experience with the self-assessment component of a three-step recertification process. The first step involves verification of clinical competence via documentation of licensure and current certification in basic life-saving or advanced cardiac life support. The second step, a take-home modular examination that is described in more detail below, must be passed for admission to the final examination. The final examination is proctored and closed-book, and it is designed to evaluate knowledge that physicians should have without consulting medical resources.
The self-assessment modules contain primarily case-based multiple-choice items requiring synthesis and judgment to arrive at the correct response. The items focus on medical advances over the past decade and on well-established principles of patient care. Educational materials and references are not provided.
The first study analyzed performance on two 60-item modules: one in general-internal medicine (GIM), taken by 177 candidates, and one in critical-care medicine (CCM), taken by 156 candidates. For the GIM module, the mean item difficulty was 0.69, and the median level of item discrimination was 0.24. For the CCM module, the corresponding values were 0.72 and 0.16. The generalizability coefficient was 0.78 for the GIM module and 0.72 for the CCM module.
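For readers less familiar with these indices, the following minimal sketch shows one common way to compute them from scored responses. It assumes that item difficulty is the proportion of examinees answering an item correctly and that discrimination is the point-biserial correlation between the item and the rest of the test; the articles as summarized here do not say which discrimination index was used, and the function names and toy data are illustrative only.

```python
from statistics import mean, pstdev


def item_difficulty(responses, item):
    """Proportion of examinees who answered the item correctly (0 to 1)."""
    return mean(row[item] for row in responses)


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when either variable is constant
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sx * sy)


def item_discrimination(responses, item):
    """Point-biserial correlation of the item with the rest score.

    The rest score is the total score minus the item itself, so the item
    does not inflate its own correlation (an assumed convention).
    """
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    return pearson(item_scores, rest_scores)


# Toy data (hypothetical): 5 examinees by 4 items, 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
]

difficulties = [item_difficulty(responses, i) for i in range(4)]
discriminations = [item_discrimination(responses, i) for i in range(4)]
```

On this scale a higher difficulty value means an easier item, so a mean of 0.69 indicates that a typical item was answered correctly by about 69% of candidates.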
The mean raw score on the GIM module was 41.6±6.9; on the CCM module, it was 43.3±5.2. Fifty percent of those who took the GIM module fell below the passing score of 71%, and only 14% of the CCM examinees met the pass/fail standard of 83%. Of those who retook the modules, more than 90% passed.
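As a quick check on how these figures relate, the sketch below converts the reported mean raw scores to percentages and the percentage standards to item counts. It assumes that each standard is a percentage of the 60 items answered correctly and that a candidate needs a whole number of items at or above it; the function names are illustrative and do not come from the articles.

```python
import math

N_ITEMS = 60  # each module contained 60 items


def percent_correct(raw_score, n_items=N_ITEMS):
    """Express a raw score as a percentage of the items on the module."""
    return 100.0 * raw_score / n_items


def min_items_to_pass(pass_percent, n_items=N_ITEMS):
    """Smallest whole number of correct items that meets the standard."""
    return math.ceil(pass_percent / 100.0 * n_items)


def standard_in_sd_units(pass_percent, mean_raw, sd_raw, n_items=N_ITEMS):
    """Distance of the standard above the mean, in standard deviations."""
    return (pass_percent / 100.0 * n_items - mean_raw) / sd_raw


gim_mean_pct = percent_correct(41.6)  # about 69.3%
ccm_mean_pct = percent_correct(43.3)  # about 72.2%
gim_cut = min_items_to_pass(71)       # 43 of 60 items
ccm_cut = min_items_to_pass(83)       # 50 of 60 items
ccm_gap_sd = standard_in_sd_units(83, 43.3, 5.2)  # about 1.25 SD above the mean
```

Under these assumptions the GIM mean sits just below its standard, which is consistent with about half of the candidates failing, while the CCM standard lies well above the CCM mean, which is consistent with the low initial pass rate.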
The authors concluded that the traditional indices of item performance (difficulty and discrimination) could be used to assess the quality of the test items, although the discrimination indices were somewhat lower than those obtained from high-stakes, proctored examinations. The test scores were as reliable as those from 60-item tests typically used for licensure and certification.
Performance on the GIM module was correlated with performance on the initial certifying examination in GIM (r=0.34; P=0.001) and with percentage of time spent in the practice of general-internal medicine (r=0.21; P=0.004). For the CCM module, performance was correlated with scores on the initial certifying examination (r=0.18; P=0.03), with performance on the CCM certifying examination (r=0.32; P=0.001), and with percentage of time spent in the practice of the subspecialty (r=0.19; P=0.02).
The second study analyzed the performance of 637 CCM candidates who took three modules in the self-evaluation phase of subspecialty recertification and a 120-item final examination. Successful examinees were surveyed about their satisfaction with the recertification program, and 62% responded to the questionnaire.
These candidates required 3–11 attempts (mean=4.7±1.5) to pass all three self-assessment modules, whose reliabilities ranged from 0.76 to 0.78. The raw mean score on the final examination was 96.6±8.6, and 95% passed. The test reliability was 0.77. The mean scores from the first attempt at the modules were significantly correlated with the final examination score (r=0.33; P<0.001).
Of the survey respondents, 89% agreed that the recertification process had been of personal and professional value to them and that the self-assessment component had provided a valuable learning experience.
The authors caution that these findings may not hold as candidates become more familiar with the recertification process and with the items. For now, they conclude that self-evaluation is a useful component in the ABIM's recertification process.
Bordage G: Why did I miss the diagnosis? Some cognitive explanations and educational implications. Acad Med 1999; 74:S138–S143
In this review, based on 188 papers and a dozen books, Bordage focuses on two types of diagnostic error: faulty detection of clinical features and faulty triggering of diagnostic hypotheses. On the basis of his findings, he makes some recommendations for teaching "introduction to clinical medicine" (ICM) courses.
The first type of error occurs in the data-gathering phase, either when findings are not obtained (for example, when the history of present illness or the physical examination is incomplete) or when findings are misinterpreted (for example, when signs or symptoms are misidentified).
The second type of error occurs at the data integration phase, during which findings are collated to develop diagnoses. Examples of errors in data integration include assigning the wrong weight or importance to a finding, using nondiscriminatory or normal findings to support a diagnosis, and over-reliance on laboratory results. Bordage identified several themes from his review of the literature on errors in data-gathering and data integration:
"Time alone does not lessen the incidence of certain errors; diagnostic errors can have important adverse consequences (e.g., delayed or inappropriate treatment, unnecessary investigations, complications, or patient anxiety); diagnosis rests on a few key discriminatory findings; and house officers are reluctant to show their thinking (p S140)."
He then elaborates on the underlying cognitive causes for faulty detection or triggering. These include having no previous instances to refer to, overestimating certain features, and having no mental representation of the problem.
Bordage recommends that "ICM courses should move away from teaching countless head-to-toe maneuvers (as many as 150) to impart a much smaller, more focused, and discriminating set of symptoms and signs (p S140)." He emphasizes the importance of practice and of creating a nonjudgmental and supportive atmosphere to decrease stress and future mistakes.
Williamson DM, Bejar II, Hone AS: ‘Mental model’ comparison of automated and human scoring. J Educ Meas 1999; 36:158–184
Williamson and his colleagues compared the ratings yielded by two different methods used to score the Architect Registration Examination (ARE). This examination consists of six multiple-choice modules and three simulations of architectural-design tasks, all of which are computer-administered.
A mental model approach was used to develop an automated scoring system for the design vignettes. The system involved rating each feature of the solution as Acceptable (A), Indeterminate (I), or Unacceptable (U). These features were grouped into blocks of criteria that were rated, and finally an overall evaluation for the vignette on the same three-point scale was derived.
The computer-generated scores for a total of 3,613 examinee performances across 15 vignettes were compared with those produced by two three-member panels of experienced graders. The two systems produced identical scores for 69% of the design solutions. There was a one-category disagreement for 17% (A–I or I–U), and a two-category disagreement (A–U) for 14%.
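The agreement figures above can be illustrated with a short sketch that tallies exact, one-category, and two-category differences between paired ratings on the A/I/U scale. The function name and the ten paired ratings are hypothetical and serve only to show the calculation; they are not data from the study.

```python
from collections import Counter

# Ordered three-point scale: Unacceptable < Indeterminate < Acceptable.
LEVELS = {"U": 0, "I": 1, "A": 2}


def agreement_breakdown(human, automated):
    """Return the proportion of paired ratings that differ by 0, 1, or 2 categories."""
    if len(human) != len(automated) or not human:
        raise ValueError("rating lists must be non-empty and of equal length")
    gaps = Counter(abs(LEVELS[h] - LEVELS[a]) for h, a in zip(human, automated))
    n = len(human)
    return {
        "exact": gaps[0] / n,
        "one_category": gaps[1] / n,   # A-I or I-U
        "two_category": gaps[2] / n,   # A-U
    }


# Hypothetical paired ratings for ten design solutions.
human = list("AAAIUUAIUA")
automated = list("AAIIUAAUUA")

result = agreement_breakdown(human, automated)
# result == {"exact": 0.7, "one_category": 0.2, "two_category": 0.1}
```

Reporting the two kinds of disagreement separately matters here because a two-category gap reverses an Acceptable and an Unacceptable rating, whereas a one-category gap involves only the Indeterminate middle category.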
A subset of the pairs of scores that were discordant was reviewed by the human scorers. Of these 298 cases, there was perfect agreement for 43% after review. There remained a one-category disagreement for 28% of the solutions and a two-category disagreement for 30%. Six sources for these human–computer disagreements were identified: 1) subjective criteria; 2) objective criteria; 3) tolerances/weighting; 4) details; 5) examinee task interpretation; and 6) unjustified.
The authors noted that, "The human graders found some types of automated scoring evidence to be incontrovertible while at other times they felt the human holistic scores captured solution nuances overlooked by the ‘mental model’ in automated scoring (p 177)."
They concluded that the automated scoring system sufficiently captures expert human judgment for it to be useful; furthermore, it has several advantages, including reproducibility, consistency, objectivity, reliability, and efficiency. These results are encouraging because they demonstrate that it is possible to develop automated scoring systems for complex, constructed-response test items.