Residency programs invest significant time and resources in interviewing and selecting residents. The standing of the program, morale, resident and faculty satisfaction, the provision of quality clinical care, and future match success all hinge on obtaining the "best and brightest" residents. In psychiatry, the pool of eligible physicians is smaller than the number of available positions, which makes recruitment intensely competitive among programs. The stakes are substantial and involve more than simply obtaining the desired number of residents in a market that clearly favors the applicants. As many training directors know, a resident without the requisite skills or commitment to psychiatry can be more damaging to a program than an unfilled slot. On the other hand, the desire to fill spots can cause some to overlook potential flaws or lower standards. If a training director knows in advance which residents may be at risk for sub-average performance, specific interventions can be considered to help such residents achieve their full potential. The challenge is an international one. Koster et al. (1) write of drop-outs from a program in The Netherlands as a "failure of the selection procedure" with significant cost in time, training effort, and money. Still, in a retrospective examination, the 8.5% who dropped out did not appear significantly different in suitability or motivation from the trainees who completed the program (1). Wilkinson and Harris (2) write of "borderline trainee interns" who experience difficulty making the transition from student to physician in New Zealand. Difficulties may include lack of integration with the health care team; difficulty with organization, prioritizing, and time management; interpersonal issues; problems with appropriately seeking and using supervision; and deficits in medical knowledge (2). There is frequently a gap between the ability to measure the cognitive skills and the ability to measure the noncognitive skills (e.g., interpersonal skills, stamina, dexterity) that are germane to residency. Little has been written about valid and reliable predictors of psychiatry residency success.
Hemaida and Kalb (3) utilized the analytic hierarchy process to identify key factors in selecting residents at a family practice program. Six factors were identified (interview assessment, interpersonal skills, fit with current team, fit with the mission/culture of the program, content of personal statement, and future practice goals that match the program's training), and their consistency ratios and factor weights were determined. The resulting survey instrument was consistent with the informal ranking system. The residency program benefited from an exercise that made tacit selection factors explicit and appropriately weighted, resulting in less time and decreased subjectivity in preparing a rank order list. Subjective, noncognitive factors pertaining to interpersonal skills were highly valued, possibly assuming adequate academic and clinical skills (cognitive factors) in medical students who advance to the fourth year (3). Wagoner and Suriano (4) noted that some specialties (e.g., psychiatry, physical medicine and rehabilitation, pathology) have had lower fill rates with U.S. medical graduates and realistically must cast their nets wider with broader selection criteria. Still, the authors believe that as residency positions decrease, the competition, especially at university-based programs, will increase, with a resulting "uniform shift across all specialties toward a greater emphasis on academic variables" (4). In the 2004 National Resident Matching Program (NRMP) data, 1,020 positions in psychiatry were available and 979 were filled; graduates of United States medical schools comprised 62.8% of those filled positions (5). Even when programs can identify desired qualities, the ability to predict performance can be faulty. Most training directors can recall stellar applicants who performed poorly and others who presented less well on paper but turned out to be outstanding in person.
The goal of this study was to examine the predictive value of the evaluations of applicants for psychiatry residency at the University of North Carolina (UNC) in terms of the subsequent performance of those who matriculated. The preresidency applicant evaluations of those residents who were matched to, and entered, our training program were compared to their postresidency evaluations.
Selection and Evaluation of Applicants for Residency
During the resident recruitment period in each of the four academic years 1995–1996, 1996–1997, 1997–1998, and 1998–1999, several dozen applicants for psychiatry residency at UNC were interviewed and rated by the training faculty and a subgroup of residents using 11-point scales for each of five dimensions: empathic quality, academic potential, clinical potential, team player, and an overall rating. In addition, interviewers were invited to submit subjective comments. The University of North Carolina did not have explicit criteria for interview invitations, and this study was undertaken to examine that process. The number of applications exceeded the number invited to interview, and invitation to interview was based on overall academic performance. Some applicants who were extended an interview ultimately declined. Applicants who had significant difficulty with standardized tests or had instances of poor performance were given closer scrutiny, although they were not necessarily denied an interview when mitigating circumstances were present. An interview did not guarantee that an applicant would be ranked.
The overall ratings from those faculty who had each evaluated at least six applicants were ranked within each rater to produce normalized percentile scores, which were then averaged across raters to yield an average normalized percentile score for each applicant. Using these ratings, as well as the comments from other interviewers, the residency selection committee developed a match list, representing a consensus on the rank order of candidates for that year. Each year’s match list was submitted to the NRMP and the usual matching procedures were carried out by the national office. For this study, the matched residents for each year were divided into thirds, which were labeled "A," "B," and "C" from the top through the last third. This "match third" served as our primary preresidency measure of expected performance.
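The within-rater normalization can be illustrated with a brief sketch on hypothetical ratings; the rater names, column names, example data, and the use of Python with pandas are illustrative assumptions rather than a description of the software actually used.

```python
# Minimal sketch of the within-rater percentile normalization described above.
# All ratings, rater names, and column names are hypothetical.
import pandas as pd

# Each row holds one interviewer's overall rating (11-point scale) of one applicant.
ratings = pd.DataFrame({
    "rater":     ["R1", "R1", "R1", "R2", "R2", "R2"],
    "applicant": ["A01", "A02", "A03", "A01", "A02", "A03"],
    "overall":   [9, 6, 8, 7, 5, 8],
})

# Rank each rater's ratings within that rater and express them as percentiles (0-1),
# which removes differences in how leniently individual raters use the scale.
ratings["pct_within_rater"] = ratings.groupby("rater")["overall"].rank(pct=True)

# Average the normalized percentiles across raters to obtain one score per applicant.
avg_pct = ratings.groupby("applicant")["pct_within_rater"].mean()
print(avg_pct.sort_values(ascending=False))
```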
Postresidency Evaluation of Resident Performance
At the end of each 4-year residency training cycle (i.e., in the early summer of 1999, 2000, 2001, and 2002), the residents who had matched to the UNC training program were evaluated again as follows. The Psychiatry Education Office prepared the files of these graduates by temporarily removing all preapplication materials so that final rating could be done independently of the original evaluation materials (leaving rotation evaluations, inservice exams, awards and commendations, and other supporting materials). The Department Chair (RNG) and the Directors of Residency Education (AM and KD) independently reviewed each file and rated the residents with approximately the top third assigned an "A," the middle third a "B," and the remaining third a "C." After the independent evaluations were completed, the three raters met, discussed any evaluation discrepancies, and arrived at a consensus evaluation for each resident.
Our primary analyses evaluated the agreement between pre- and postresidency evaluations of residents by using the weighted kappa coefficient, which is the ratio of the proportion of times the evaluations agree (corrected for chance agreement) to the maximum proportion of times that the raters could agree (corrected for chance agreement). This statistic takes the evaluation weights (A=3, B=2, C=1) into account, so that a discrepancy between A and C is considered more serious than a discrepancy between A and B. Kappa may vary between −1 (perfect disagreement) and +1 (perfect agreement), with 0 indicating no better than chance agreement. The significance of the association between the preresidency "match thirds" and the postresidency evaluations was evaluated using Fisher’s exact test (two-tailed). Analyses of data across all four cohorts also used Cochran-Mantel-Haenszel statistics to control for the year of evaluation. In addition, the percentages of exact agreement and the Spearman correlation between pre- and postresidency evaluations are provided for descriptive purposes.
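A minimal sketch of these agreement statistics, applied to hypothetical pre- and postresidency grades coded A=3, B=2, C=1, follows; the use of scikit-learn and SciPy is an assumption made for illustration and is not a description of the software used for the analyses reported here.

```python
# Hedged sketch: weighted kappa, exact agreement, and Spearman correlation on
# hypothetical pre-/postresidency grades (A=3, B=2, C=1).
import numpy as np
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

pre  = np.array([3, 3, 2, 2, 1, 1, 3, 2, 1, 2])   # preresidency "match third"
post = np.array([3, 2, 2, 1, 1, 2, 3, 3, 1, 2])   # postresidency consensus grade

# Linearly weighted kappa: an A-versus-C discrepancy is penalized more than A-versus-B.
kappa_w = cohen_kappa_score(pre, post, weights="linear")

# Percentage of exact agreement, reported for descriptive purposes.
exact_agreement = np.mean(pre == post)

# Spearman rank correlation between the two ordinal evaluations.
rho, p_value = spearmanr(pre, post)

print(f"weighted kappa = {kappa_w:.2f}")
print(f"exact agreement = {exact_agreement:.0%}")
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```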
Approximately one-quarter of the applicant pool matched to UNC each year, and a greater number of the higher ranked candidates (i.e., "A" third) entered the program, thereby truncating the distribution of preresidency ratings. This tended to diminish the strength of any association between pre- and postresidency evaluations, since quality distinctions within an already highly selected group were attempted.
Table 1 shows the distribution of applicants interviewed and the number matched to UNC for each year of the study.
Reliability of Postresidency Final Evaluations
The reliability of the postresidency final evaluations among the three faculty members was analyzed across all four cohorts, which totaled 58 residents. The results showed only modest reliability, with pairwise agreements of 67% (kappa=0.61), 60% (kappa=0.51), and 55% (kappa=0.49). However, agreements between each of the three raters and the final consensus evaluation were generally good: 84% (kappa=0.83), 81% (kappa=0.78), and 71% (kappa=0.64).
Evaluation of Resident Selection
There was no significant association between preresidency selection evaluations and postresidency final evaluations in any of the four cohorts (1995–1996 entering cohort: Fisher’s exact test: p=0.46, kappa=−0.18, agreement=23%; 1996–1997: Fisher’s exact test: p=1.0, kappa=0.13, agreement=38%; 1997–1998: Fisher’s exact test: p=0.13, kappa=0.50, agreement=56%; and 1998–1999: Fisher’s exact test: p=0.76, kappa=0.28, agreement=54%). When the four cohorts were combined, there was no significant association (Fisher’s exact test: p=0.38, kappa=0.19, agreement=43%).
In a Cochran-Mantel-Haenszel analysis controlling for year, the null hypothesis of zero correlation between pre- and postresidency evaluations could not be rejected (p=0.077). The Spearman correlation between these measures was 0.24 (p=0.07). The weighted kappa was 0.20, and its 95% confidence interval (−0.004 to 0.40) included zero, indicating that the association between evaluations did not differ significantly from what would be expected by chance. At most, the data suggested a weak trend that did not reach statistical significance.
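The method used to derive the 95% confidence interval for kappa is not detailed above; a nonparametric bootstrap over residents, sketched here on hypothetical grades, is one common way to obtain such an interval and is offered only as an illustration, not as the procedure actually employed.

```python
# Illustrative bootstrap confidence interval for a linearly weighted kappa.
# The grades are hypothetical; this is one possible approach, not necessarily
# the method used for the interval reported above.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
pre  = np.array([3, 3, 2, 2, 1, 1, 3, 2, 1, 2, 3, 1])
post = np.array([3, 2, 2, 1, 1, 2, 3, 3, 1, 2, 2, 1])

boot = []
for _ in range(2000):
    idx = rng.integers(0, len(pre), len(pre))   # resample residents with replacement
    # Skip degenerate resamples in which either rating vector has a single category.
    if len(set(pre[idx])) > 1 and len(set(post[idx])) > 1:
        boot.append(cohen_kappa_score(pre[idx], post[idx], weights="linear"))

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"bootstrap 95% CI for weighted kappa: ({lo:.2f}, {hi:.2f})")
```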
The overall results of this 4-year study indicate only a small, nonsignificant association between pre- and postresidency evaluations; the null hypothesis of zero correlation between the two could not be rejected. As pointed out in the methods section, the postresidency evaluations were carried out on a highly selected group, which diminished the likelihood of successfully making further distinctions. In addition, there were no postresidency evaluation data on interviewed applicants (whose numbers have increased in recent years) who did not match at UNC; these include those we ranked highly as well as those low on our rank list. The findings suggest that our selection process is no better at identifying the top performers than the average ones, contrary to our initial hypothesis. Sensitivity was about 42% (8/19) for those applicants originally deemed outstanding and highly desirable, 42% (8/19) for the solid ones, and 45% (9/20) for the average applicants. A potential confound is our decision to compare postresidency cohorts with each other and force a uniform distribution; it could be that some groups were uniformly outstanding or uniformly average. Additional limitations include a relatively small sample size limited to one institution, and whether our findings could be generalized to other programs is not known. There were no problematic residents in the sample reported, although there have been some in the past who were not identified by our selection process. The faculty on the resident selection committee are seasoned veterans of applicant selection with years of experience in residency education, and they interviewed a majority of the applicants. The postresidency raters were not blind to the residents: even though their application materials were removed from their files, the raters may have seen the applications in the past and certainly would have had some type of working relationship with some of the residents. That could lend itself to bias, or greater accuracy, in the postresidency evaluations.
With the above caveats in mind, the findings illustrate the difficulty in making predictions about subsequent performance. There have been other attempts to correlate medical student performance with performance in residency. Alexander et al. (6) surveyed the residency training directors of the University of Michigan Medical School graduating classes of 1996, 1997, and 1998. Correlations between various medical school grades and residency director composite scores were positive, relatively high, and statistically significant. However, academic performance explained less than 20% of the variance in training directors’ assessments, suggesting that grades did not capture other important factors (6). Bell et al. (7) also concluded that overall obstetrics and gynecology resident performance could not be predicted by medical school achievements. In an examination of medical student surgical knowledge evaluated subjectively (observed performance) and objectively (shelf and oral exams), the subjective evaluations by faculty and residents correlated poorly with the objective measures; subjective evaluations may be influenced by variables like attitude, enthusiasm, and apparent interest (8).
Gilbart et al. (9) developed a rating form to determine the most important criteria and assessed its reliability across different orthopedic surgery programs in Canada. There was significant consensus about the most important qualities and behaviors. The scale had reasonable reliability and impressive validity within programs. There was no significant difference in reliability between programs that used structured versus unstructured interviews, despite earlier reports that structure increases reliability and validity. However, ratings were very unreliable across programs, suggesting that different programs have different sets of priorities despite reasonable agreement on the general characteristics of a good resident (9). Another study of interobserver reliability cautioned that screening applicants on objective data may be adequate, but that great variability exists for subjective elements (10). Conversely, bias can be introduced when interviewers have objective data. Miles et al. (11) note that ratings of surgical residency candidates at one site were less favorable when interviewers were given only the applicant’s medical school in comparison to interviewers who had the complete application package. At the same site, overall ratings correlated positively with U.S. Medical Licensing Examination (USMLE) scores in blinded interviews, but negatively with unblinded ones. The blinded interviews had less impact at a comparison institution, again suggesting that programs have "a distinct philosophical culture and history" that influences their evaluation process (11). Another study suggested that prior knowledge of USMLE scores provided a "halo effect," in which the interviewer quickly reaches a conclusion about an applicant, sometimes even prior to the interview (12).
While objective criteria may better correlate with future objective performance, they can be fallible. Information may be flawed, if not dishonest. Edmond et al. (13) reviewed deans’ letters for variables like failing or marginal grades, leaves of absence, and repeating an academic year. In their sample, 27% of failures in a preclinical course were not noted, 33% of failing grades in a clinical rotation went unmentioned, and 50% of leaves of absence were not stated. Overall, at least one suppressed variable was found in 34% of the 532 letters examined (13). In addition to omissions from deans’ letters, there have been reports of more overt deception on the part of applicants. Dale et al. (14) examined the publications cited by applicants to an orthopedic residency program and uncovered misrepresentation ("citations of nonexistent articles in actual journals and nonauthorship of existing articles") in 18% of 76 citations. The authors suggest that programs require copies of cited material as part of the application process, a sentiment echoed by Baker and Jackson (15). In their review of the literature, misrepresentations were found in 16% (imaging fellowship), 19.7% (pediatric residency), 20% (emergency medicine residency), and 34% (gastroenterology fellowship) of application samples. The authors examined 379 applications to a radiology residency, intentionally choosing a less competitive year; ultimately, by their restrained criteria, 13 (11%) were felt to be misrepresented. An applicant who would be deceitful in an application may also be deceitful in practice. Lest one believe that misrepresentation is found only in highly competitive residencies, Grover et al. (16) surveyed family practice program directors and found that almost one-half of the respondents felt they had discovered deception. This ranged from hiding consideration of another specialty to inaccuracies in personal statements, significant omissions, and outright falsifications. Recognition of deception was significantly associated with both older training directors and the presence of a post-Match credentialing process (e.g., history and physical, drug screening, disability accommodation screening, or criminal background checks). It was not known whether those deceptive applicants were ultimately model or problematic residents. The authors caution against taking applicants and supporting information at "face value" (16).
However, total reliance on objective, easily verified parameters also has pitfalls. Edmond et al. (17) examined the impact on African American applicants if only USMLE part I scores are utilized. Many programs (92% in their survey) may use the USMLE score (60% had a threshold minimum) in deciding which applicants will be granted an interview. The authors looked at hypothetical cutoff scores and the proportion of rejected applicants by race. Depending on the minimum score, African American applicants would be 3 to 6 times less likely to be interviewed (17). Foreign medical graduates are another group that could be adversely affected by overconfidence in objective scores. A survey (18) of psychiatry residency training directors revealed that for both U.S. and foreign medical graduates, personality, psychological mindedness, communication skills, and interview performance were highly valued. Examination scores (e.g., the Federal Licensing Examination, FLEX) were viewed as more important for foreign medical graduates, perhaps because training directors know little about the admission and training standards in foreign medical schools; clerkship performance was undervalued for similar reasons. The authors caution that exam scores may not reflect clinical ability or the skill sets required in psychiatry, but rather may be subject to the same complaints levied at other objective tests: that they measure cognitive knowledge and test-taking skills (18).
A couple of studies have suggested that the impact of gender on resident selection is fairly negligible. Rodenhauser et al. (19) found that both male and female physicians rated fictional female candidates more favorably. A similar study (20) pairing a true female application and an altered male version found no significant difference in ranking. In fact, there was a statistically insignificant trend in favor of women. In competitive specialties, there is an emphasis on academic performance as objectively measured, and a favorable application review is a prelude to an invitation to interview. The interview may be the juncture at which bias is introduced. While not sufficient, the interview is necessary. Getting a residency interview is predictive, in that the majority of U.S. medical graduates interviewed ultimately match with a program (20).
Given the vagaries of total reliance on objective measures and the bias they can introduce, and concerns about the impact on underrepresented groups, is there a way to objectify some of the intangible qualities favored in residents? There are complaints that too few medical schools utilize noncognitive tests to evaluate skills and behaviors required of the competent physician. An example is the Defining Issues Test, which is designed to measure moral reasoning (21). There may be "a generalized reluctance to employ the ‘soft’ or ‘subjective’ evaluation of skills and behaviors in summative assessment" (22). It would not be difficult to find those who have benefited from, and felt victimized by, both objective and subjective assessments. Clearly, neither can be relied on exclusively. Training directors cannot assume that the qualities they would avoid in a resident have already been screened out or remediated in medical school, and thus they must rely on interview data. The interview is invaluable in assessing interpersonal and communication skills. Based on a day’s interaction, evaluators also try to predict years of resident learning and performance. There is no foolproof way of doing this. Each program formally, or informally, determines its selection bias, based on its goals. All would seek to produce competent, well-trained physicians whom they would be proud to call colleagues and to whom they would refer patients without compunction. Program priorities on the production of clinicians, researchers and academicians, educators, administrators, and field leaders are invariably unique, as is each program’s commitment to care in underserved areas, diversity, and cultural competence (23). While none of these goals is mutually exclusive, programs hope to predict, even control, their mix of outcomes. On the other hand, the culture of a program (e.g., residents’ peer relationships, supportiveness, relationships among residents, faculty, and administration, and morale) is an important factor in the quality of a program, as perceived by both psychiatry program directors and residents (24). While we report statistically insignificant associations between pre- and postresidency evaluations, our ability to predict outcomes may improve with better defined criteria and tools.