Oral evaluations under standardized conditions can assess knowledge, skills, and attitudes not tested by written examinations (1—3). An oral and a written exam are required for certification (3) by the American Board of Psychiatry and Neurology (ABPN). The ABPN oral requires the candidate to 1) interview a patient in the presence of two examiners; 2) "present the case" to, and answer questions from, the examiners; and 3) observe a videotaped interview, present to, and answer questions from another pair of examiners. Moderate interrater reliability (κω=0.54—0.56) for this exam has been demonstrated (1).
The psychiatry Residency Review Committee (RRC) (4) expects that each resident will be examined in an ASO or other clinical format at least twice during the residency. Training in the skills assessed by this format may be of particular value (5) for international medical graduates (IMGs), who have lower pass rates on the ABPN examinations than do U.S. medical graduates (AMGs). To our knowledge, there have been no systematic studies of the reliability and validity of ASOs conducted within psychiatry residencies and no published reports of collaboration between residency programs in administering orals. This paper describes a two-year collaboration between two residencies that assessed interrater and test-retest reliability and concurrent, predictive, construct, face, and content validity of ASOs. Additionally, resident and faculty satisfaction with the orals, as well as resident performance, were examined.
The Stritch School of Medicine at Loyola University (LU) and Finch University of Health Sciences/The Chicago Medical School (FUHS) jointly administered ASOs to their residents. At both schools, anticipated benefits of the ASO were to give educational (formative, preparatory) feedback to residents about their interviewing and presentation skills and to give them practice in, and desensitize them to, ASOs (5—7). Other valuable program features included 1) the post-exam meeting of each trainee with his or her examiners to receive same-day oral and written feedback; and 2) the post-exam meeting among all the residents and examiners to discuss the process.
Anticipated benefits of the two-school collaboration were 1) to better recreate examination conditions by including at least one co-examiner who did not know the trainee; 2) to enable each department to improve its ASO and teaching methods by observing the other faculty at work; and 3) to enable the two faculty who were not yet boarded to learn about the process from an examiner's perspective. In addition to the ASO's formative purposes, FUHS used it for summative (rating, administrative) purposes, worth 2% to 5% of the resident's Spring quarterly departmental rating.
Psychiatry residents enrolled full time at either LU or FUHS were chosen for participation as part of residency requirements. Ninety-one residents were eligible for the study. Thirteen residents did not participate because 11 were on vacation and 2 were ill. No residents were excluded from the study. Therefore, 78 PGY-1 to PGY-4 psychiatry residents participated in the ASO over the two-year study period, 40 of whom participated both years. In 1996, 53 residents took the ASO; in 1997, 65 did.
Forty-seven faculty members administered the ASO: 35 in 1996 and 41 in 1997. Examiners were selected if they were "core teachers" (i.e., strongly invested in resident education) and experienced in administering ASOs.
Residents took the ASO at their respective schools. The ASOs were given in March 1996 and March 1997 for LU residents and in April 1996 and April 1997 for FUHS residents. In 1996, all residents interviewed a live patient and then were examined by a senior examiner (professor or associate professor) from one school and a junior examiner (assistant professor or instructor) from the other program. For logistic reasons at FUHS in 1997, a same-school (FUHS) examiner pair examined six residents. In 1997, all junior (PGY-1 and PGY-2) residents were examined as described above. Seniors (PGY-3 and PGY-4) at each site watched an identical videotaped interview and were then examined by two faculty members to simulate the video portion of the ABPN orals.
Examiners received a 30-minute orientation, in which the instructions were similar to those given in the refresher session for ABPN examiners. Among the instructions were that a passing grade was to be given to an examinee who could demonstrate competent and safe practice, although not necessarily "excellent or outstanding practice." Each examiner independently gave a preliminary grade (1), which could be pass (competent), conditional (neither clearly passing nor clearly failing) or fail (not competent). If the examiners gave the same preliminary grade to the resident, it became the resident's final grade. If the grades differed, the pair negotiated a final grade. If they could not agree, a "floating" senior examiner (chair or residency director), who observed part of each exam, helped determine the grade. To maximize reliability, faculty rated residents by the same standard as is used for ABPN candidates, regardless of the resident's level of training, in the following areas: MD-patient relationship; conduct of interview; organization and presentation of data; phenomenology, differential diagnosis, and prognosis; biopsychosocial treatment; and resident's strengths and weaknesses. Each pair rated two or three candidates. Patients, selected for having observable psychopathology and being willing and able to be interviewed, gave written informed consent.
Anonymous surveys of resident perceptions of the ASO were distributed in May 1996 and May 1997. Faculty perceptions of the 1997 exam were surveyed in June 1997. To assess concurrent and predictive validity of the ASOs, mean quarterly department ratings (outstanding, good, adequate, or failing) for academic year (AY) 1995—96 and 1996—97 and Psychiatry Resident In-Training Examination (PRITE) scores from October 1995 and 1996 were obtained for all but a few residents.
At each school, quarterly department ratings for each resident were determined by averaging the weighted grade given by each clinical supervisor and classroom instructor and were finalized in a discussion by the faculty's education committee. The ratings were based on the resident's attendance and punctuality, motivation and enthusiasm, interviewing skills, clinical differential diagnostic and other problem-solving skills, understanding and use of somatic and psychological treatments, knowledge and skills in general medicine, record keeping, therapeutic relationships with patients, creativity, formal and bedside teaching, and classroom participation. During the quarterly evaluation that followed the receipt of PRITE scores, the PRITE score contributed 2% to 5% of the overall rating. At FUHS, the ASO score contributed 2% to 5% of the overall rating during the quarterly evaluation that followed its administration.
SPSSx software was used to analyze the data (8). Data were incomplete for several residents' PRITE scores and departmental grades; analyses were conducted with available data. Spearman correlation coefficients, using two-tailed tests, were used to assess the relationship between ranked or ordinal variables. Chi-squares were used to assess whether nominal (qualitative, categorical) variables significantly differed in frequency of occurrence from other qualitative variables (e.g., proportion of residents from one school vs. proportion from the other school who perceived the ASO as fair vs. unfair). Cramer's V (9), which incorporates chi-square, was used to assess the association between ranked variables consisting of two or three possible ranks (e.g., degree of test anxiety and ASO score).
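For illustration, the chi-square and Cramer's V computations described above can be reproduced outside SPSS. The sketch below uses Python with scipy, and the 2x2 table of counts (school by perceived fairness) is entirely hypothetical, not the study's data.

```python
# Illustrative re-implementation of the paper's categorical analyses
# (the study itself used SPSSx). All counts below are hypothetical.
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Chi-square test plus Cramer's V for an r x c contingency table."""
    table = np.asarray(table)
    chi2, p, dof, _ = chi2_contingency(table)
    n = table.sum()
    # Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
    v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
    return chi2, p, v

# Hypothetical counts: rows = school (LU, FUHS), cols = ASO perceived fair/unfair
table = [[18, 2],
         [12, 8]]
chi2, p, v = cramers_v(table)
print(f"chi2={chi2:.2f}, p={p:.3f}, Cramer's V={v:.2f}")
```

Cramer's V rescales chi-square to a 0-1 range, which is what makes it usable as a measure of association strength between variables with two or three categories.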
Mann-Whitney U-tests (which yield Z-values), the nonparametric equivalent of t-tests, were used to detect differences between groups when the dependent variable is rank-ordered (ordinal) rather than an interval variable (9). For example, the Mann-Whitney was used to determine if ASO scores (e.g., 1=pass, 2=condition, 3=fail) differed between schools. PRITE scores were standardized (z-scored) since subjects were rank ordered within, rather than between, schools. Standardizing the scores allowed comparison between the schools (each Z-score has a mean of 0 and a standard deviation of 1).
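The two steps above (a Mann-Whitney U test on ordinal ASO grades, and within-school standardization of PRITE scores) can be sketched as follows; all grades and scores in this example are invented for illustration.

```python
# Sketch of the analyses described above, with made-up data.
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical ASO grades (1=pass, 2=condition, 3=fail) for two schools
lu_grades   = [1, 1, 1, 2, 1, 1, 2, 1]
fuhs_grades = [1, 2, 2, 3, 1, 2, 1, 2]
u, p = mannwhitneyu(lu_grades, fuhs_grades, alternative="two-sided")
print(f"U={u}, p={p:.3f}")

def zscore_within(scores):
    """Standardize raw PRITE scores within one school (mean 0, SD 1)."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / scores.std(ddof=1)

# Hypothetical raw PRITE scores, z-scored separately per school
lu_z   = zscore_within([480, 510, 530, 495])
fuhs_z = zscore_within([450, 470, 505, 460])
pooled = np.concatenate([lu_z, fuhs_z])  # now comparable across schools
```

Because each school's scores are standardized against its own mean and standard deviation, a z-score of +1 means "one SD above that school's mean," which is what allows the pooled comparison.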
We ascertained interrater reliability of ASO scores by using kappa (9,10), a strict standard that gives no credit for any disagreement, and weighted kappa (9,11), a more lenient standard that assigns partial credit for minor disagreements. As in McDermott and colleagues' (1) ABPN study, minor disagreements (pass vs. condition, condition vs. fail) were weighted 0.75, and major disagreements (pass vs. fail) were given no credit.
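The weighting scheme just described can be made concrete in code. The sketch below implements Cohen's kappa and its weighted variant from first principles (with the identity matrix it reduces to unweighted kappa); the two examiners' grade lists are hypothetical.

```python
# Cohen's kappa and weighted kappa with the study's 0.75 partial-credit scheme.
import numpy as np

def weighted_kappa(r1, r2, categories=(1, 2, 3), weights=None):
    """Kappa for two raters. weights[i][j] = agreement credit when
    rater 1 assigns category i and rater 2 assigns category j.
    With the identity matrix this reduces to unweighted kappa."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    weights = np.eye(k) if weights is None else np.asarray(weights, float)
    # Joint proportion table of the two raters' grades
    table = np.zeros((k, k))
    for a, b in zip(r1, r2):
        table[idx[a], idx[b]] += 1
    p = table / table.sum()
    row, col = p.sum(axis=1), p.sum(axis=0)
    po = (weights * p).sum()                   # observed weighted agreement
    pe = (weights * np.outer(row, col)).sum()  # chance-expected agreement
    return (po - pe) / (1 - pe)

# Agreement credits used in the study: full credit for agreement,
# 0.75 for minor disagreements (pass/condition, condition/fail),
# none for major disagreements (pass/fail).
W = [[1.00, 0.75, 0.00],
     [0.75, 1.00, 0.75],
     [0.00, 0.75, 1.00]]

# Hypothetical preliminary grades from two examiners (1=pass, 2=condition, 3=fail)
r1 = [1, 1, 2, 1, 3, 2, 1, 1, 2, 1]
r2 = [1, 2, 2, 1, 3, 2, 1, 1, 1, 1]
print("kappa          =", round(weighted_kappa(r1, r2), 2))
print("weighted kappa =", round(weighted_kappa(r1, r2, weights=W), 2))
```

In this example both disagreements are minor (pass vs. condition), so the weighted kappa exceeds the unweighted one, which is exactly the leniency the 0.75 weighting is meant to provide.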
A total of 40 residents participated from each school over the two-year period, with 53 participating in 1996 and 65 in 1997. Twenty-four faculty members from each school participated in the ASO over the two-year period. Resident and faculty demographics are shown in t1.
Residents' ASO Scores and Other Performance Indicators (t2, t3, t4)
In 1996, there was a significant difference in grade performance between schools (Mann-Whitney U-test, Z=6.77, P<0.001, n=53), such that LU residents scored higher on the ASO (t2). In general, AMGs scored significantly higher than IMGs in 1996 (Z=2.52, P<0.05, n=53). In 1997, there was no significant difference between ASO scores at LU and FUHS (Z=1.02, P=0.31, n=65; t2). Once again, AMGs scored significantly higher than IMGs (Z=2.79, P<0.01, n=65). Resident PRITE scores and departmental grades did not significantly differ between schools or between PGYs in either testing year. However, AMGs performed better than IMGs on the PRITE in AY 1996—97 (Z=2.86, P<0.01, n=58), but not AY 1995—96 (Z=1.40, P=0.16, n=53), and on departmental ratings in 1996 (Z=3.41, P<0.001, n=53) and 1997 (Z=2.99, P<0.01, n=58).
Examiner Characteristics and Resident Performance
There was no significant difference in the grades assigned by AMG and IMG LU faculty members in 1996 (Z=0.23, P=0.82, n=53) or 1997 (Z=0.68, P=0.50, n=65), or by AMG and IMG FUHS examiners in 1997 (Z=1.18, P=0.24, n=65). However, AMG faculty members from FUHS gave better grades to residents in 1996 (Z=2.39, P<0.05, n=53).
To ascertain whether having been an ABPN examiner caused an examiner to be a tougher or more lenient grader, we compared grades given by ABPN examiners with those of their colleagues. No significant differences in grades based on ABPN examiner status were found for LU faculty in 1996 (Z=1.34, P=0.18, n=53) or 1997 (Z=0.78, P=0.44, n=65) or for FUHS faculty in 1996 (Z=0.96, P=0.34, n=53) or 1997 (Z=0.34, P=0.74, n=65).
For the 1996 ASO at LU, kappa (κ=0.89, P<0.001, n=28) and weighted kappa (κω=0.94, P<0.001, n=28) were "almost perfect" by Landis and Koch's (L/K) standards (12), and had "excellent agreement" by Fleiss standards (13). Minor disagreements occurred only between 2 of 28 (7%) rater pairs. Interrater reliability was not ascertained at FUHS in 1996, where only composite grades were recorded.
For the 1997 ASO at LU, kappa (κ=0.95, P<0.001, n=30) and weighted kappa (κω=0.97, P<0.001, n=30) were also nearly perfect. However, interrater agreement for the 1997 ASO at FUHS was much weaker by comparison, being only fair (κ=0.36, P<0.01, n=35) by L/K standards, with poor agreement beyond chance by Fleiss standards. FUHS weighted kappa increased to moderate (κω=0.56, P<0.01, n=35) by L/K standards and fair to good agreement beyond chance by Fleiss standards. Kappa for all 1997 raters, regardless of institution, was substantial (κ=0.63, P<0.001, n=65) by L/K standards, with fair to good agreement beyond chance by Fleiss standards.
"Intactness" of rater pairs contributed to reduced reliability at FUHS in 1997. A rater pair was considered intact if the pair graded all three sessions together. Because the majority of pairs grading FUHS residents did not remain intact, analyses were rerun only with intact pairs. Kappa increased to substantial according to L/K standards (κ=0.63, P<0.01, n=9), whereas nonintact pairs had a nonsignificant kappa (κ=0.24, P=0.08, n=26). Also contributing to low reliability was the fact that one pair of FUHS faculty members rated 6 FUHS residents. When analyses were conducted without these raters, kappa increased to "moderate" by L/K standards (κ=0.44, P<0.001, n=29). The same-school faculty pairs' interrater reliability was not significant (κ=0.04, P=0.001, n=6). Therefore, intactness of pairs and pairs being from different schools were an important contribution to interreliability.
Lastly, kappa was examined for use of a live patient versus a videotape. Because only seniors were examined with the videotape, senior residents were compared both years. Kappa was nearly perfect both for the LU seniors who interviewed a live patient in 1996 (κ=0.90, P<0.001, n=16) and those who viewed a videotape in 1997 (κ=0.88, P<0.001, n=13). However, kappa was only fair for FUHS senior residents tested with a videotape in 1997 (κ=0.36, P<0.05, n=22). The aforementioned problems with faculty pairings were hypothesized to contribute to the low reliability, as LU had strong interrater reliability with both methods. When same-school rater pairs were eliminated, FUHS 1997 kappa rose to a moderate rating by L/K standards (κ=0.53, P<0.01, n=17). Analyses with intact pairs only were not conducted because of the small number of FUHS seniors in this category.
Forty residents took the ASO in both Spring 1996 and Spring 1997. Their ASO performance did not significantly differ between the two years: 16 (40%) performed better, 16 (40%) received the same grade, and 8 (20%) did worse. Test-retest reliability was not significant (Spearman r=0.08, P=0.65, n=40) and remained nonsignificant when reexamined for each school separately (FUHS, r=0.02, P=0.93, n=22; LU, r=0.16, P=0.52, n=18). Therefore, the pattern of scores across the two years was essentially random, and the ASO could not be deemed reliable on this criterion. In contrast, test-retest reliability was demonstrated between the two years for the PRITE exams (r=0.54, P<0.001, n=35) and for residents' departmental grades (r=0.69, P<0.001, n=39).
Content validity was determined during test construction with the intention of measuring clinical interviewing and diagnostic skills with a psychiatric population. The faculty selected the patients on the basis of their psychopathology and ability to be interviewed. The patients were thought to represent the population that residents encounter in a psychiatric setting and in the ABPN orals. Residents were expected to interview the patients as they would in a clinical setting. However, the residents' performance was not sampled in multiple situations, a central feature (2,14,15) of objective structured clinical examinations (OSCEs).
Face validity refers to the examinees' perceptions of whether the exam assesses fairly the knowledge, skills, and attitudes that the examinees expected it to assess (16). t5 shows that the vast majority of the residents perceived that the ASO was educationally valuable and well organized and that their residencies had prepared them well. Although the majority of residents also felt that their grade was fair, significantly more FUHS than LU residents (χ2=6.31, df=1, P<0.01) perceived their grade to be unfair in 1996.
To determine concurrent validity of the ASO, ASO scores were correlated with each resident's mean quarterly department rating and PRITE score during the same academic year. ASO scores from 1996 were positively and significantly correlated with AY 1995—96 departmental ratings (Spearman r=0.47, P<0.001, n=53) and with Fall 1995 PRITE scores (r=0.22, P<0.05, n=53), thereby demonstrating concurrent validity. Because LU residents performed significantly better than FUHS residents in 1996 (Mann-Whitney U, Z=2.08, P<0.05, n=53), a partial correlation, with the variance due to school taken into account, was performed between 1996 ASOs and AY 1995—96 departmental ratings and Fall 1995 PRITE scores. Concurrent validity remained significant in both cases (r=0.43, P<0.001, n=50; r=0.24, P<0.05, n=53).
In contrast, 1997 ASO grades were not significantly correlated with AY 1996—97 departmental ratings (r=0.16, P=0.22, n=59). Because senior residents performed better than junior residents in 1997 (Z=4.36, P<0.001, n=59), but not in 1996 (Z=0.52, P=0.60, n=53), a partial correlation, with resident year taken into account, was ascertained. Even when resident year was accounted for, the relationship remained nonsignificant (r=0.15, P=0.26, n=56).
Spring 1997 ASO scores and October 1996 PRITE scores were marginally significantly correlated (Spearman r=0.24, P=0.07, n=58). Again, because senior residents performed better than junior residents did in 1997, a partial correlation, with resident year taken into account, was determined. This resulted in a nonsignificant relationship between October 1996 PRITE scores and Spring 1997 ASO scores (r=0.18, P=0.18, n=55).
Because interrater reliability was much higher at LU than FUHS in 1997, concurrent validity was reexamined for each school separately. Under these circumstances, concurrent validity was demonstrated between Spring 1997 oral scores and the October 1996 PRITE for LU (r=0.66, P<0.001, n=29). However, all else remained nonsignificant.
Predictive validity was determined by correlating ASO scores with subsequent PRITE scores and departmental ratings. The Spring 1996 ASO predicted the October 1996 PRITE score (Spearman r=0.47, P<0.01, n=38), which remained significant once the variance related to school was accounted for (r=0.43, P<0.01, n=35). The Spring 1996 ASOs reached marginal significance in predicting departmental ratings for AY 1996—97 (r=0.30, P=0.06, n=39). However, once school was accounted for, that relationship became nonsignificant (r=0.23, P=0.16, n=36).
To determine construct validity (2,15,17)—that our test was actually measuring interview, presentation, and diagnostic skills—we first examined whether senior residents performed better than junior residents did, expecting that skills should improve as one advances through the residency. As mentioned above, senior residents performed better than junior residents in 1997 (Mann-Whitney U, Z=4.29, P<0.001, n=65); however, performance did not significantly differ in 1996 (Z=0.52, P=0.60, n=53).
Another way of determining construct validity is to examine the relationship between different methods that theoretically measure the same construct, in this case interview, presentation, and diagnostic skills (16). Because seniors were examined with a live patient in 1996 and a videotaped patient in 1997, we examined the correlation of ASO scores between the two years. The two methods were not significantly correlated (r=0.32, P=0.24, n=15). Of note, statistical power may have been low, since only 15 seniors took the exam both years. Therefore, construct validity was not reliably demonstrated.
t5 summarizes the resident satisfaction surveys. Not all residents completed the surveys (40 in 1996 and 37 in 1997), and not all questions were answered. For both 1996 and 1997, the vast majority of residents agreed that the ASO was valuable, was conducted efficiently, and was followed by a helpful discussion. However, significantly more FUHS than LU residents in 1996 felt the ASO was not conducted efficiently (χ2=4.91, df=1, P<0.05) and was unfair (χ2=6.31, df=1, P<0.01), and that the live patient was difficult to interview (χ2=4.91, df=1, P<0.05). This coincides with the finding that FUHS residents performed more poorly than LU trainees in 1996. No other significant differences between the schools on satisfaction ratings were observed for either year.
In 1996 (Cramer's V=0.53, P<0.01, n=38), but not 1997 (V=0.37, P=0.12, n=37), anxiety was positively and significantly associated with the grade received: residents who received lower grades later reported having been more anxious. Difficulty interviewing the patient was not significantly associated with anxiety (V=0.27, P=0.42, n=40) or with the ASO score (V=0.25, P=0.57, n=38). This suggests that the residents' anxiety was related to factors other than patients being difficult to interview, such as residents' trait or state anxiety or the retrospective reporting of anxiety after receiving a low grade.
Twenty-two (53%) of the faculty who examined in 1997 responded to the June 1997 survey (t6). The vast majority perceived that the ASO was valuable for the residents; enjoyed examining; felt they improved as teachers; enjoyed collaborating; thought the ASO was fair, valid, and efficient; and thought that using a videotaped exam was a good idea. None felt that the ASO was too time-consuming to be worth the effort. Of the 1997 faculty who responded to the question about their own bias (7 did not respond), 4 of 15 (26.7%) perceived that "my grade was influenced either positively or negatively by my knowing about the resident's performance in my program." Also, 7 of 15 faculty members who responded about their co-examiner's bias (7 did not respond) thought that "my co-examiners were influenced either positively or negatively by knowing about the resident's performance in their own program."
Because noteworthy proportions of examiners perceived that they themselves (26.7%) or their co-examiners (46.7%) were biased by knowing about the resident's performance in their own program, we ascertained whether there was a "home court advantage"; that is, whether, when examiners disagreed, they favored residents from their own school. This was not the case: of the 15 interrater disagreements at FUHS in 1997, the FUHS rater gave the FUHS resident a higher grade in 8 cases and the LU rater gave the FUHS resident a better grade in 7 cases.
Our results indicate that the psychometric properties (reliability and validity) of the ASOs administered jointly by the LU and FUHS residency programs are questionable. For this reason, FUHS stopped using ASOs for administrative (rating, summative) purposes. (LU had not used ASOs for summative purposes.) However, the value of administering such exams for preparatory (formative) purposes is considerable.
Our results should be viewed cautiously because the ASOs were administered by two programs only; because the sample size is small compared to that in ABPN exams; and because less than half of the LU and FUHS faculty members had previously been ABPN examiners. However, the results also strongly indicate a need for additional studies and for documentation of the reliability and validity of ASOs.
For a measure to be psychometrically sound, good reliability and validity must be demonstrated. We did not find significant test-retest reliability. Conceivably, the substantial time between ASOs reduced test-retest reliability. Interrater reliability was consistently high at one site but low at the other; the latter finding was attributed to lack of intact rater pairings. We found no evidence that a rater's being an ABPN examiner or an AMG or IMG affected rater fairness. Concurrent validity was demonstrated in 1996 but not in 1997: the 1996 ASO was correlated with departmental grades in AY 1995—96, whereas the 1997 ASO was not correlated with AY 1996—97 department grades. When each school was examined individually in 1997, the year in which interrater reliability differed between sites, concurrent validity was significant only between the ASO and the PRITE, and only at LU. Predictive validity was also inconsistently demonstrated; this could not be accounted for by junior versus senior performance or by institution. Lastly, even with findings of inconsistent reliability and validity, we replicated a problematic trend that occurs in the ABPN exam: IMGs score more poorly than AMGs (5). The study does not explain the AMG/IMG difference. It suggests, however, that the results are not due to AMG examiners being biased against IMG examinees.
Overall, the validity correlations were of small to moderate strength. Perhaps each rating method (ASO, quarterly evaluations, PRITE) augments the other two by assessing different facets of the resident's knowledge, skills, and attitudes (2). When the reliability of the ASO is in the substantial (0.61—0.80) or near perfect (0.81—1.00) ranges by L/K standards, using all three methods (ASO, department evaluations, PRITE) to rate residents could lead to a fairer, multidimensional evaluation system.
The high degree of resident and faculty satisfaction suggests that the ASO is valuable for educational purposes, and therefore worth the effort expended. But the inconsistent reliability and validity of the ASO raise doubts about its use for grading and promotion purposes. To document and improve interrater reliability and other psychometric properties with the object of using ASOs for summative, high-stakes purposes, several steps should be taken: 1) because of the major reduction of interrater reliability in the face of nonintact pairs, faculty must arrive on time and attend all sessions; 2) because of the reduction of interrater reliability when examiners from the same school were paired, examiners from different schools should be paired if possible; 3) departments "counting" the ASO should state the kappa or weighted kappa value for each ASO administration, and this value should be acceptably high; if it is unacceptably low, the score should not count for administrative purposes; 4) because of the brevity of the ASO examiners' orientation compared with ABPN examiner orientations, the orientation should be more thorough and detailed; 5) administration of the ASO by a consortium of at least three schools could better replicate ABPN conditions and eliminate the faculty's perceptions of their own or their co-examiner's biases based on knowing the candidates.
Because resident reports of being anxious before and during the ASO were significantly associated with poor exam performance, departments could provide test-anxiety reduction seminars for willing residents. Such seminars are provided in some review courses (5). Anecdotal reports (5,18—20) suggest that test anxiety is a serious problem for many candidates, some of whom may experience greater anxiety over having their professional qualifications at stake than they do during emergencies in their clinical practice (18—20).
The ABPN is striving to improve the reliability of its orals. McDermott et al. (1) write that their data "suggest that more explicit grading criteria and more extensive rater training should improve test reliability" and that "efforts to further standardize grading criteria have been undertaken." We suggest that the ABPN could use data from residency ASOs, where the stakes are relatively low, to experiment with methods to improve the reliability and document the validity of the ABPN orals, where the stakes and the consequences are considerable (5,18—20). Because administration of ASOs to residents provides valuable exposure to the testing format of the ABPN orals and facilitates preparation for the examination, further efforts should be made to increase the reliability and validity of the ASOs.