The public expects vocational certifying bodies, such as clinical professional organizations like the Royal Australian and New Zealand College of Psychiatrists (RANZCP), to ensure that those they certify have the clinical skills necessary to function safely and effectively. Accordingly, examinations of knowledge, attitudes, and skills have been developed to ensure that required standards are met before vocational registration. One such examination is the “long case,” which has a long tradition in both undergraduate and postgraduate medical examinations. The equivalent RANZCP examination currently takes the form of an observed clinical interview (OCI), which reflects the College’s partially successful attempt to improve reliability (1) by including an observed interview and formal examiner training. This form of examination has been retained alongside a multi-station objective structured clinical examination (OSCE), in part because of its “approximation to the real world” (2), including its use of real, rather than simulated or standardized, patients.
Both long cases and OSCEs reflect the assumption that psychiatrist-examiners are best placed to judge the standards required. However, patients and psychiatrists differ in their opinions of what is important in clinical care (3). This raises the question of whether psychiatrists are, in fact, best placed to judge all of the standards required for clinical practice. As noted by Parkes, “in actual clinical encounters, the patient is often the only judge of the physician’s communication skills” (4). In an effort to capture the experiences and judgments of patients in psychiatric examinations, some research has examined the judgments of standardized patients (SPs) and “simulated” patients in psychiatry OSCEs (5, 6). These researchers have reported discrepancies between the scores of SPs and physician-examiners. However, despite increasing emphasis on taking patient preferences and feedback into account, there has not, to our knowledge, been any systematic consideration of “real” patients’ judgments of the performance of psychiatric trainees in clinical examinations. The only report describing patients’ experiences of participating in the equivalent British examination did not include the patients’ judgments of the candidates or their interview skills (7). To our knowledge, psychiatrist-examiners’ judgments of candidates’ interpersonal skills, interview style, and rapport have never been tested against the views of actual patients. The aim of the present study was to compare the judgments of psychiatrist-examiners with those of patients in a clinical examination, in order to inform further consideration and discussion of the role of patients in such assessments.
The Canadian Physician Achievement Review (PAR) peer and patient feedback forms were taken as the starting point (8). The PAR program is designed “to provide doctors with information about their medical practice through the eyes of those they work with and serve;” it covers a range of skills, including medical competency, communication skills, and patient management. The PAR program uses 5-point Likert-scale questionnaires, with slightly different versions completed by the clinician, medical colleagues, nonmedical coworkers, and patients. For the purpose of our study, questions from these versions were used to design two questionnaires, one for examiners (see Appendix 1) and one for patients (see Appendix 2). Questions from the PAR instruments were expanded, and some were modified to cover aspects required by the OCI examination process (see Appendix 1). As it would not be feasible or acceptable to “train” patients in their use, both groups of raters were simply asked to complete the questionnaires as they felt appropriate, from their own perspective.
The two questionnaires were initially piloted and refined as above. Data were collected during practice OCI examinations. Supervisors and trainees in psychiatry who were involved in formal practice examinations were given information about the study and invited to take part. The supervising psychiatrists took responsibility for obtaining informed consent from the trainees (who were already aware of the study) and from the patients. If the patient agreed to contribute data, then both the patient and the supervising psychiatrist completed their respective versions of the questionnaire at the end of the OCI. No examiner was allowed to contribute data from more than five mock examinations, to minimize the chance of the results being unduly influenced by any one participant. The main data collected were the paired examiner and patient ratings on the questionnaires. As English was the principal language of all participants, no specific training was given on the meaning of the questions or the use of the questionnaires.
This procedure was approved by the New Zealand Multi-Region Ethics Committee.
Sixteen trainee psychiatrists participated in one or more mock OCIs each. The trainees were aged 30–40 years, with approximately equal numbers of men and women. Thirty different patients participated in the 30 OCIs, which were conducted by 8 participating psychiatrist-examiners. Six of the examiners had been trained by the RANZCP to conduct OCI examinations; that training usually consisted of 3 hours of practice and review of videotapes of candidates being examined, with peer modification of standard-setting. Three-quarters of the examiners were men, and all were aged 45–65 years. Patients ranged in age from 20 to 65 years, with both genders equally represented.
Two-thirds of the patients were inpatients, with the remainder being community-based. All were from adult acute wards, but were regarded by their treating team as able to give consent and participate. No information about them, apart from gender, was recorded as research data.
Five-point Likert-scale ratings of each trainee’s performance were compared for 16 question pairings. For seven of these pairings, the examiners’ and patients’ questions were identically worded, and for an eighth (P3/E3), substantially so. For some of the remaining pairings, an attempt was made to match questions whose wording overlapped or was substantially similar. One patient question (P11) was paired with each of four examiner ratings, and an attempt was made to label the area of commonality in each case.
For each trainee, the mean of the differences between patient and psychiatrist scores was calculated. The nonparametric Wilcoxon signed-rank test (analogous to the parametric paired t-test) was then used to compare these differences across trainees.
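The per-trainee differencing and signed-rank test described above can be sketched as follows. This is a minimal, stdlib-only illustration, not the authors’ analysis code: the function names, the dictionary layout of the ratings, and the toy scores are all hypothetical, and the p-value uses the large-sample normal approximation with a tie correction, which is only a rough guide for a sample of 16 trainees.

```python
import math
from collections import Counter

def average_ranks(values):
    """Rank values 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def mean_difference_per_trainee(ratings):
    """ratings: {trainee_id: [(patient_score, examiner_score), ...]} (hypothetical layout)."""
    return {trainee: sum(p - e for p, e in pairs) / len(pairs)
            for trainee, pairs in ratings.items()}

def wilcoxon_signed_rank(diffs):
    """Two-sided Wilcoxon signed-rank test, normal approximation with tie correction.

    Zero differences are discarded (Wilcoxon's original convention).
    Returns (w_plus, w_minus, z, p).
    """
    d = [x for x in diffs if x != 0]
    n = len(d)
    if n == 0:
        raise ValueError("all differences are zero")
    abs_d = [abs(x) for x in d]
    ranks = average_ranks(abs_d)
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    w_minus = sum(r for r, x in zip(ranks, d) if x < 0)
    mean = n * (n + 1) / 4
    # Null variance of W+, reduced for groups of tied |d| values.
    tie_adj = sum(t ** 3 - t for t in Counter(abs_d).values()) / 48
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_adj
    z = (w_plus - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return w_plus, w_minus, z, p

# Toy, made-up scores for illustration only (NOT the study data).
toy = {
    "T1": [(5, 4), (4, 4)],
    "T2": [(5, 3)],
    "T3": [(4, 4)],
    "T4": [(5, 4)],
    "T5": [(3, 4)],
    "T6": [(5, 3), (5, 4)],
}
diffs = list(mean_difference_per_trainee(toy).values())
w_plus, w_minus, z, p = wilcoxon_signed_rank(diffs)
print(f"W+={w_plus}, W-={w_minus}, z={z:.2f}, p={p:.3f}")
```

A positive z here means patients tended to rate trainees higher than examiners did. For small samples an exact null distribution is preferable; library implementations such as SciPy’s `scipy.stats.wilcoxon` handle the choice between exact and approximate p-values.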
The mean ratings of patients and psychiatrists are shown in Table 1. All of the mean scores are above the neutral position (score: 3) and generally near “agree” (score: 4). Differences between psychiatrist and patient ratings were significant for seven of the domains.
When correlation coefficients were calculated (see Table 2), the only question for which examiner and patient scores were significantly correlated was #5 (“whether or not the candidate had asked appropriate questions”), and that correlation was negative. All of the other correlations were low, with only three (“asking appropriate questions; asking details of the patient’s life; communicating well”) approaching significance.
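The text does not state which correlation coefficient was used for Table 2. Assuming a rank-based (Spearman) coefficient, which suits ordinal Likert scores, a minimal stdlib-only sketch of the per-question calculation across examinations is given here; the function names and toy scores are hypothetical, and no significance test is included.

```python
import math

def average_ranks(values):
    """Rank values 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns NaN if either rater's scores do not vary."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy, made-up scores for one question across five examinations (NOT study data).
examiner = [4, 3, 5, 2, 4]
patient = [4, 5, 3, 5, 4]
print(f"rho = {spearman(examiner, patient):.2f}")
```

If one rater gives every candidate the same score on a question, the coefficient is undefined, which this sketch reports as NaN.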
One of the strengths of our study is that patients and psychiatrists used essentially the same questions for 8 of the 16 assessment fields. For all but one of these eight fields, however, there was a significant difference in the ratings, with patients scoring the candidates higher. Notably, five of these seven were domains that could be considered to reflect personal and subjective judgments that only the patient can validly make (listened, seemed to understand, allowed the patient to refer to all important issues, was likeable, would be recommended to a family member). Conversely, in areas that could be conceptualized as more “technical” (formulation, diagnosis, memory assessment), there were no significant differences between the assessments of patients and examiners, despite the variation in wording between the two groups’ versions of these questions.
The often poor correlation between examiner and patient assessments of trainee interviewing skills requires explanation. Although the overall sample was small (N=30 examinations), it was large enough to allow statistical testing. The wording of the questions answered by the psychiatrist-examiners and by the patients was identical (or, for Question 3, substantially so) for 8 of the questions (#1–8). The other eight pairings represent matches with substantial thematic or conceptual overlap between the two versions of the questionnaire, as noted in the two tables; the differences are therefore unlikely to be accounted for by differences in the rating scales used by the two groups of raters.
Internationally, over the last two decades, there has been increasing involvement of patients and caregivers in teaching, including the teaching of mental-health professionals and undergraduate medical students (9–11). There has, however, been little reported involvement of patients as examiners. The studies that have been reported largely involve SPs or “simulated” or “analog” patients, not real patients, and OSCEs rather than “long cases.” A study of undergraduate medical student examinations in internal medicine found, like ours, that SPs scored students higher on communication skills than did physician-examiners (12). In a diploma in palliative care in the U.K., simulated patients, like our patients, tended to give higher scores on the “communication” domains than did the medical examiners, leading the authors to suggest that the two could not be used interchangeably but should be regarded as complementary assessments (13). When the simulated patients were subsequently interviewed, this difference in grades appeared to be due to the actors’ reluctance to downgrade candidates, particularly as the candidates were known to be senior practitioners and the simulated patients were often nurses or clerical staff. By using real patients, we may have avoided that problem. It could, of course, be argued that real patients are even more “disempowered” than actors. However, both before and at the beginning of the examination, patients are assured that the examination candidates have no role in their treatment and that it is the candidate, not the patient, who is being examined. We believe that this mitigates the power imbalance to some extent.
The use of “consumers” as examiners of communication skills has been reported in the postgraduate examination for general practitioners in New Zealand, but these consumers observed the examination of a simulated patient rather than being the patient interviewed, and were thus more in the role of “analog” patients (14). Nevertheless, that study found differences similar to ours in the scoring of a number of aspects of communication: the consumer assessors scored the examinees more highly on such areas as “responding to the patient’s nonverbal communication,” “ensuring that the patient understood,” and “exploring patient knowledge.” These are congruent with some of the domains in which our patients scored the examinees more highly, such as “seemed to understand” and “allowed the patient to refer to important issues.”
In psychiatry, Whelan et al. (6) recently found only moderate agreement between SP and examiner scores in an OSCE examination, with particularly low correlations on communication scores. Their conclusion was that inclusion of SP scores lacked concurrent validity, given that “experienced physician-examiners are the closest thing to a ‘gold standard’…,” an assumption that has been questioned (4, 15).
It may be that patients judge practitioners by different criteria. A meeting of communication-skills experts in Michigan in 2002 noted that patient satisfaction with an encounter may not be reflected in current physician-rated scales and suggested that a new tool was needed “to measure the essence, or meaning, of the visit from the patient’s perspective” (16). Other, more recent, authors have suggested that the different judgments should be viewed as complementary. In a study of the differing views of patient, examiner, and student in an OSCE, the authors did not assume that any one view constituted a gold standard, and concluded that “triangulating perspectives” provided a more valid measure of a student’s clinical performance (17).
Although the differing expectations of patient and psychiatrist may explain the different judgments that each makes of the trainee, another factor to consider is the purpose of examinations and their role in gate-keeping. Foucault saw the examination as a means by which disciplinary power is exercised (18). Similarly, in one of the few investigations of the role of postgraduate examinations, Sarangi and Roberts (19) referred to professional certification and the conferring of membership in a professional group as “gate-keeping processes.” These authors suggested that the oral examination in postgraduate settings is “a blend of academic examination and selection interview.” If this is the case, one might expect the judgments of examiners to differ from those of patients, but one could also expect considerable resistance from boards and postgraduate certifying bodies to the idea of relinquishing that gate-keeping capacity.
This study is limited by its small sample size and the lack of training of patients for their roles as evaluators. However, this could be said to replicate the real world of clinical practice, in which patients are not trained to be assessors of their psychiatrists, and respond as unique individuals in that unique setting (20). Despite such random variability, one would hope that the doctor/trainee would conduct an adequate assessment and endeavor to ensure where possible that the patient is satisfied with the encounter. A further limitation is that, although most of the examiners were trained to examine, they were not trained in the use of this particular evaluation tool. However, the domains are very similar to those that examiners routinely examine, so we do not expect this to have significantly altered our results.
Despite increasing awareness of the discrepancy between patient and psychiatrist views of clinical interactions, there has been minimal research exploring their differing perspectives on trainees’ examination performance. The largely unspoken assumption has been that the psychiatrist’s judgment is the gold standard, particularly for the more “technical” aspects of the assessment. Our study is unique in its focus on “the long case” and its use of actual patients. Although our findings of discrepant judgments between patients and examiners are congruent with other findings using SPs in OSCEs, our finding of congruence on the more technical aspects of the assessment raises further questions about the appropriateness of excluding patients from contributing to summative assessments of trainees. Although change cannot be mandated on the basis of a pilot study, we suggest that this study be replicated in different settings with larger groups, with a view to including consumers in the examination process.
GM developed the study concept and managed the study. Se-B developed the questionnaires. All authors participated in the study and in preparing the paper. JMacD wrote the first draft of the Discussion.
The authors are not aware of any conflicts of interest.
The authors acknowledge particularly the contributions of participating examiners, trainee psychiatrists, and patients.