Hodges B, Regehr G, McNaughton N, et al: OSCE checklists do not capture increasing levels of expertise. Acad Med 1999; 74:1129–1134
Hodges and his colleagues conducted a study comparing the outcomes of two approaches to scoring examinations that use standardized patients (SPs), known as Objective Structured Clinical Examinations (OSCEs): checklists and global rating scales. Previous research had suggested that checklist-based scores could not distinguish among practitioners with different levels of expertise.
The authors argue that professionals pass through five stages of development: novice, advanced beginner, competent, proficient, and expert. Each stage is characterized by a distinct form of problem-solving. For example, novices typically collect large amounts of data in no particular order and then synthesize them to arrive at a solution. Experts gather much more focused information and can often reach a conclusion quickly and automatically, without going through a formal problem-solving process. The authors concluded that "while the checklists used in OSCEs probably reflect the approach of novices to clinical problems, it is unlikely that such measures can detect the complex and hierarchical problem-solving characteristic of the experienced clinician" (p 1130).
To investigate this hypothesis further, 42 subjects were recruited: one-third were medical students who had completed a 6-week psychiatry clerkship, one-third were family-practice residents, and one-third were practicing family physicians. Each subject interviewed two SPs for 15 minutes apiece. The scenario was a new-patient visit to a family-practice office; one SP had symptoms of panic disorder with agoraphobia, and the other had symptoms of obsessive-compulsive disorder. A physician examiner completed a 22-item checklist for each case as well as a global rating form consisting of five 5-point scales (knowledge and skills, empathy, coherence, verbal expression, and nonverbal expression). Diagnostic accuracy was rated on a 3-point scale at 2 minutes and again at the end of the interview.
In addition, two different sets of instructions were given to the subjects. One set was naturalistic: subjects were told to do what they would usually do in caring for a patient. With the other set, examinees were told that they would be evaluated with a checklist and that a high score would be obtained by covering as many of the items as possible.
For the global ratings, analysis of variance revealed that the practitioners scored higher than the residents or the medical students. The type of instructions had no effect, and there was no interaction between instruction and level of training. For the checklists, the experienced clinicians scored lower than the residents or the students; again, the type of instructions had no significant effect, and there was no interaction with level of training. For all groups, diagnostic accuracy was higher at the end of the interviews than at 2 minutes, but level of training had no significant effect on diagnostic accuracy.
Hodges et al. concluded that although checklists may be appropriate in some assessment situations, they may not be useful for evaluating physicians at more advanced levels of training. Because global ratings also have limitations, they suggested that other approaches to scoring be investigated, especially approaches that take into account the types of questions asked, the sequence of questions, and the degree to which the questions reflect the formation of a diagnostic hypothesis.
Zuriff GE: Extra examination time for students with learning disabilities: an examination of the maximum potential thesis. Applied Measurement in Education 2000; 13:99–117
Zuriff reports that approximately 50,000 students with learning disabilities enter college each year, representing 3% of first-year students. They constitute the fastest-growing group of individuals receiving disability accommodations, and the most frequently provided accommodation is extended test time, typically 1.5 to 2 times the standard examination time. The rationale is that learning-disabled students process certain kinds of information more slowly than their nondisabled peers, and hence their performance is impeded on timed examinations. Zuriff notes that extra time need not be granted if speed of work is part of what an examination is meant to assess.
Zuriff suggests that at least three issues need to be addressed when evaluating any request for a testing accommodation: Will alterations in testing conditions change the skills being measured? Will taking the examination under altered conditions change the meaning of the resulting scores? Would nondisabled examinees benefit if allowed the same accommodation? (p 101) If any of these questions is answered in the affirmative, the accommodation may not be appropriate because the validity of the test results could be affected.
The third issue is examined in more detail through a review of five studies comparing the performance of learning-disabled and nondisabled students under standard-time and extended-time conditions. All of the studies used performance on the Nelson-Denny Reading Test as the dependent variable; that instrument yields a vocabulary score, a reading comprehension score, and a total score. Some of the studies also used other performance measures, for a total of 32 score comparisons.
On 31 of the 32 measures, students with learning disabilities performed better under the untimed conditions; the non-learning-disabled groups performed better on 16 of the 32. The author concluded that, because there is evidence that nondisabled examinees would also benefit if allowed the same accommodation, the practice of granting extra time appears to violate one of the criteria for test integrity. Until further research can be done, two interim solutions are put forth: indicating on transcripts when a test score was obtained under nonstandard testing conditions and allowing generous time allotments to all examinees.
Engelhard G Jr, Davis M, Hansche L: Evaluating the accuracy of judgments obtained from item review committees. Applied Measurement in Education 1999; 12:199–210
Engelhard and his colleagues created a set of 75 multiple-choice questions containing item flaws of various types. The questions covered subject matter relevant to grades 3 through 8. A group of 39 experienced test developers received additional item-review training and were then asked to determine which of 16 flaws, if any, were present in each item.
The flaws were divided into two broad categories. Cultural flaws concerned whether an item unfairly favored examinees on the basis of gender, ethnic group, religious background, region of the state of Georgia, socioeconomic status, or type of community (e.g., rural, urban, suburban), or whether persons with handicaps or members of any ethnic group were portrayed in an unfavorable light.
Technical flaws concerned the accuracy of the item content, the complexity of the format and language, reliance on prior knowledge outside the area being tested, the appropriateness of the topic and difficulty, mechanical errors (e.g., misspellings, unclear graphics), and key errors.
Overall, these judges accurately identified 90% of the item flaws, with the most accurate judge identifying 94%, and the least accurate, 83%. Cultural flaws were easier to identify (mean accuracy rate=95%) than technical flaws (mean accuracy rate=85%). Given the crucial role that item reviewers play in the development of quality examinations, the authors concluded that these results were very encouraging.
Hurtz GM, Hertz NR: How many raters should be used for establishing cutoff scores with the Angoff method? A generalizability theory study. Educational and Psychological Measurement 1999; 59:885–897
The Angoff method appears to be the most widely used approach to setting pass/fail standards for multiple-choice tests. In this method, subject-matter experts estimate the probability that a minimally competent examinee will answer each item correctly. Each judge's estimates are summed across items to produce a pass/fail cutoff score for that judge, and these judge-level cutoffs are then averaged to produce the final standard. The purpose of this study was to estimate the number of judges needed to set dependable pass/fail standards with this method.
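The arithmetic of the method is simple enough to sketch in a few lines. The sketch below is illustrative only: the probability estimates and panel size are invented and are not drawn from the study.

```python
# Minimal sketch of the Angoff computation described above.
# The ratings are invented for illustration; they are not study data.
import numpy as np

# ratings[j, i] = judge j's estimated probability that a minimally
# competent examinee answers item i correctly.
ratings = np.array([
    [0.60, 0.75, 0.40, 0.85, 0.70],   # judge 1
    [0.55, 0.80, 0.50, 0.90, 0.65],   # judge 2
    [0.65, 0.70, 0.45, 0.80, 0.75],   # judge 3
])

# Sum each judge's estimates across items to get that judge's
# cutoff score (in raw points) ...
judge_cutoffs = ratings.sum(axis=1)

# ... then average the judge-level cutoffs to set the final standard.
standard = judge_cutoffs.mean()
print(f"judge cutoffs: {judge_cutoffs}, final standard: {standard:.2f}")
```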
Data were drawn from eight occupational licensure examinations administered in California. The number of items ranged from 100 to 240, and the number of judges ranged from 6 to 11. All judges were licensed in their respective occupations and participated in rater training. Generalizability analyses were used to generate dependability indices for each set of ratings.
The dependability indices varied across the examinations, ranging from a low of 0.66 to a high of 0.87. The number of raters necessary to meet the researchers' dependability criterion for cutoff scores was 15, although fewer than 10 sufficed for half of the examinations. The authors also recommended that the raters broadly represent their professions, including different specialty areas.
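For readers unfamiliar with the mechanics, the sketch below shows one common generalizability-theory formulation of this kind of analysis: variance components are estimated from an items-by-judges matrix of Angoff ratings, and a decision study projects the dependability coefficient (phi) for hypothetical panels of different sizes. The fully crossed design, the simulated data, and the 0.80 criterion are all assumptions made for illustration, not details taken from Hurtz and Hertz's article.

```python
# Illustrative generalizability (D-study) sketch for Angoff ratings.
# Assumptions not from the article: a fully crossed items x judges
# design with items as the object of measurement, simulated data,
# and a dependability criterion of 0.80.
import numpy as np

rng = np.random.default_rng(0)
n_items, n_judges = 150, 8
item_effect = rng.normal(0.65, 0.12, size=(n_items, 1))    # item easiness
judge_effect = rng.normal(0.00, 0.08, size=(1, n_judges))  # judge leniency
noise = rng.normal(0.00, 0.15, size=(n_items, n_judges))
ratings = np.clip(item_effect + judge_effect + noise, 0.0, 1.0)

# Sums of squares for the two-way crossed ANOVA (one rating per cell).
grand = ratings.mean()
ss_items = n_judges * ((ratings.mean(axis=1) - grand) ** 2).sum()
ss_judges = n_items * ((ratings.mean(axis=0) - grand) ** 2).sum()
ss_resid = ((ratings - grand) ** 2).sum() - ss_items - ss_judges

# Variance components via expected mean squares (negatives floored at 0).
ms_resid = ss_resid / ((n_items - 1) * (n_judges - 1))
var_resid = ms_resid                                        # ij + error
var_items = max((ss_items / (n_items - 1) - ms_resid) / n_judges, 0.0)
var_judges = max((ss_judges / (n_judges - 1) - ms_resid) / n_items, 0.0)

# Decision study: dependability (phi) for panels of n judges.
for n in range(1, 31):
    phi = var_items / (var_items + (var_judges + var_resid) / n)
    if phi >= 0.80:  # illustrative criterion, not the study's
        print(f"{n} judges reach phi = {phi:.2f}")
        break
```

Under this formulation, adding judges shrinks only the error terms attributable to judges and to the judge-by-item residual, which is why the projected phi rises, with diminishing returns, as the hypothetical panel grows.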