Wilkerson L, Irby DM: Strategies for improving teaching practices: a comprehensive approach to faculty development. Academic Medicine 1998; 73:387-396
The authors begin their review of faculty development by highlighting current demands on medical school faculty to adapt to innovations such as computer technology, problem-based learning, and small-group discussion, and by asking how faculty members can best be helped to develop these and other new skills. The authors describe the evolution of faculty development programs and outline an approach that addresses the differing needs of faculty members as they progress through their careers.
Until mid-century, teaching expertise was construed primarily as content expertise, and the primary means for enhancing faculty skills were academic leaves, sabbaticals, research funding, and travel to professional meetings.
Gradually, teaching came to be seen as a related but separate skill, acquired primarily through observation and experience. In the 1970s, behavioral learning theories predominated, and faculty development efforts emphasized skill acquisition, such as writing clear objectives, giving feedback to learners, and producing quality test items. By the 1980s, a new paradigm held sway: cognitive learning theory, which posited that learning requires the active construction of meaning. Faculty development programs came to emphasize understanding how students learn and using content and instructional expertise to craft meaningful learning activities. In the 1990s, the emphasis shifted to learning as the social construction of meaning, and teaching-improvement activities focused on increasing the reflective capacity of instructors and on socializing medical students into their new knowledge community.
The authors also review the effectiveness of various faculty development interventions, including workshops, teaching evaluations combined with consultation, and fellowship programs, and they propose a comprehensive approach to faculty development. When faculty members join the academy, they need to become familiar with the full range of their responsibilities, the expectations for promotion, and basic teaching techniques. As faculty members mature, they are likely to play broader roles in their institutions and will need programs that address curriculum and organizational development, leadership skills, and educational research, as well as further development of their instructional skills.
Wilkerson and Irby conclude that "Faculty development targeted to the several roles of faculty members is the key to academic vitality. Such activities are essential to the creation of a collegial learning community that values inquiry and innovation, that provides for both personal growth and corporate leadership, and that creates organizational vitality and success." (p. 394)
Norcini JJ, Shea JA: The credibility and comparability of standards. Applied Measurement in Education 1997; 10:39-59
Norcini and Shea argue that there are a number of acceptable methods to set pass/fail standards and that it is impossible to validate the "correctness" of a cut score. Hence, it is more fruitful for test developers to garner evidence that supports the credibility and comparability of standards. The authors describe several types of evidence and research related to both of these issues.
With regard to credibility, their main criterion is who set the standards, as judged by the qualifications and number of the experts involved and by how diligently they performed their assignments. Another is the method used, which should produce an absolute, rather than a normative or relative, standard. A third is evidence that the standard is realistic, as indicated by positive relationships with other measures of competence and by acceptance among stakeholders.
Norcini and Shea then address the issue of the comparability of standards across different forms of the same test. Examinees should not be at an advantage or a disadvantage because they encountered an easier or a more difficult version, respectively. The test forms should be constructed to be as similar in content as possible. The equating procedure used to adjust scores and/or standards across forms must be appropriately applied and should yield similar pass/fail rates for homogeneous subgroups of examinees.
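To make the adjustment concrete, here is a minimal sketch of linear (mean-sigma) equating, one common procedure for carrying a cut score from one form to another. The article does not prescribe a particular equating method, and the score distributions and cut point below are hypothetical.

```python
import statistics

# Hypothetical total scores from two groups assumed comparable in ability,
# one taking the reference form (A) and one taking a harder new form (B).
form_a = [52, 61, 58, 70, 65, 55, 63, 68]
form_b = [48, 57, 54, 66, 60, 51, 59, 64]

mu_a, sd_a = statistics.mean(form_a), statistics.stdev(form_a)
mu_b, sd_b = statistics.mean(form_b), statistics.stdev(form_b)

def equate_cut_to_form_b(cut_a: float) -> float:
    """Return the Form B score with the same z-score as the Form A cut."""
    return mu_b + sd_b * (cut_a - mu_a) / sd_a

cut_a = 60.0  # standard originally set on Form A
print(f"A cut of {cut_a} on Form A corresponds to "
      f"{equate_cut_to_form_b(cut_a):.1f} on Form B")
```

Lowering the standard on the harder form in this way is what keeps pass/fail rates comparable for subgroups of examinees of equal ability.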
These types of evidence are summarized in a checklist, and relevant research is cited to guide those who are developing, using, and/or evaluating standards, particularly in the contexts of licensure and certification.
Impara JC, Plake BS: Standard setting: an alternative approach. Journal of Educational Measurement 1997; 34:353-366
The authors discuss issues related to the use of the Angoff method for setting pass/fail standards for multiple-choice tests. This method, arguably the most widely used, has three stages. First, judges are trained, usually beginning with a discussion of what a given instrument is designed to assess and the characteristics of the target population for whom the cut score is to be set. This target population is often described as the minimally competent, just competent, or borderline candidates.
For each test item, judges are asked to estimate what percentage of this group would answer the item correctly. After working through a set of practice items, the panelists make the same judgments for the actual test items, after which they receive feedback, usually consisting of how the total group of examinees, not just the target group, performed on the items. Each judge's estimates are summed to produce a judge-level cut score, and these are averaged across judges to produce the panel's cut score; the resulting pass/fail rates are also reviewed. Finally, the judges revise their performance estimates in light of this feedback, and the final pass/fail point is calculated.
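The arithmetic of the traditional method is simple; the sketch below uses a hypothetical three-judge, four-item panel rather than data from the studies.

```python
# Each row holds one judge's estimates of the proportion of borderline
# candidates who would answer each item correctly.
estimates = [
    [0.60, 0.45, 0.80, 0.70],  # judge 1
    [0.55, 0.50, 0.75, 0.65],  # judge 2
    [0.65, 0.40, 0.85, 0.70],  # judge 3
]

# Sum each judge's estimates to get that judge's raw cut score, then
# average across judges to get the panel's cut score.
judge_cuts = [sum(row) for row in estimates]
panel_cut = sum(judge_cuts) / len(judge_cuts)
print(f"Panel cut score: {panel_cut:.2f} of {len(estimates[0])} points")
```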
Impara and Plake note that Angoff also suggested a variation on this approach in which judges are asked to think of a minimally competent candidate whom they know and to indicate whether that candidate would answer each item correctly. The yes responses are summed to produce a pass/fail score for each judge and then averaged across judges to produce the final standard. The authors undertook two studies to test their hypothesis that this method might be superior because it poses an easier cognitive task: judges often report difficulty both in conceptualizing a group of minimally competent candidates and in estimating that group's performance on items.
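Under the yes/no variant, the same hypothetical panel's task reduces to counting, as this continuation of the sketch shows.

```python
# Each row holds one judge's yes (1) / no (0) judgments of whether a known
# minimally competent candidate would answer each item correctly.
yes_no = [
    [1, 0, 1, 1],  # judge 1
    [1, 1, 1, 0],  # judge 2
    [1, 0, 1, 1],  # judge 3
]

judge_cuts = [sum(row) for row in yes_no]      # one raw cut score per judge
panel_cut = sum(judge_cuts) / len(judge_cuts)  # averaged across judges
print(f"Yes/no cut score: {panel_cut:.2f} of {len(yes_no[0])} points")
```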
In both of their studies, judges were randomly assigned to two groups, one of which used the traditional Angoff method and one of which used the yes/no method. The first study involved 18 elementary school mathematics teachers and a 46-item test. The second study involved 20 elementary school mathematics teachers and an 89-item test.
In both studies, the two methods yielded similar cut scores. However, there was more variance in the cut scores from the Angoff groups than from the yes/no groups. Teachers in both groups expressed similar levels of confidence in the method they used and in the resulting performance standard. Judges who had had prior experience with the Angoff method described the yes/no method as easier to understand and use.
These findings led the authors to conclude that the yes/no method has substantial promise as an approach to standard setting, because it may be easier for judges to implement and it may yield more valid standards.
Feldt LS: Can validity rise when reliability declines? Applied Measurement in Education 1997; 10:377-387
Feldt challenges the oft-stated deduction from classical test theory that decreased reliability will result in decreased validity. In recent years, advocates of performance-based assessment have almost always argued that the lower reliability of the scores yielded by their instruments, compared with scores from multiple-choice tests, was offset by increases in validity.
Feldt describes two conditions under which lowered reliability might reasonably be expected to accompany increased criterion-related validity, defined as the Pearson product-moment correlation between test scores and performance on a criterion measure. The first arises when a test is altered to eliminate elements that cause construct-irrelevant variance in examinee performance; the second, when the test and the criterion measure the same set of factors but weight those factors differently.
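In classical test theory, criterion-related validity cannot exceed the square root of reliability, which is why lower reliability is usually expected to depress validity; Feldt's point is that altering what counts as true-score variance moves that ceiling. His first condition can be illustrated with a small simulation; the variance components below are assumed for illustration and are not from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

trait = rng.normal(0.0, 1.0, n)       # construct the criterion reflects
irrelevant = rng.normal(0.0, 1.0, n)  # reliable but construct-irrelevant variance
criterion = trait + rng.normal(0.0, np.sqrt(0.5), n)

# Original test: contaminated by the irrelevant factor (two parallel forms,
# so their correlation estimates reliability).
orig1 = trait + irrelevant + rng.normal(0.0, np.sqrt(0.4), n)
orig2 = trait + irrelevant + rng.normal(0.0, np.sqrt(0.4), n)

# Revised test: irrelevant elements removed, at the cost of more random error.
rev1 = trait + rng.normal(0.0, np.sqrt(0.6), n)
rev2 = trait + rng.normal(0.0, np.sqrt(0.6), n)

def r(a, b):
    return np.corrcoef(a, b)[0, 1]

print(f"original: reliability = {r(orig1, orig2):.2f}, "
      f"validity = {r(orig1, criterion):.2f}")
print(f"revised:  reliability = {r(rev1, rev2):.2f}, "
      f"validity = {r(rev1, criterion):.2f}")
# Expected (approximately): reliability 0.83 -> 0.62, validity 0.53 -> 0.65
```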
In the two examples he presents using hypothetical data, the validity coefficient indeed increased as the reliability coefficient decreased, because the alteration strengthened the relationship between test and criterion. In one data set, the reliability coefficient fell from 0.70 to 0.60 while the validity coefficient rose from 0.40 to 0.49; in the second, reliability declined from 0.72 to 0.57 while validity increased from 0.38 to 0.51.
Feldt concludes that test authors often assess lower-order cognitive skills because these can typically be measured more reliably than higher-order skills, and that validity may suffer as a consequence. These results still need to be demonstrated with actual data; if they hold, they will lend support to the claims of those who favor performance tests over multiple-choice instruments.