The outcome of healthcare education should preferably be measured by providers' performance in the real-world setting by the use of valid and reliable tools (1–3). Standardized patients (SPs) have been used in different domains of medical education for assessment of undergraduates' and postgraduates' skills. SPs were first introduced in 1963 by Howard Barrows and Steve Abrahamson, who used them for education and assessment in neurological disorders. They constructed an examination for third-year students by using volunteer SPs, who demonstrated neurological disorders (2). An SP is defined as an individual who has been well-trained to replicate the patient presentation in a clinical-encounter setting in a consistent and realistic way. To assess health professionals' performance in real practice, unannounced SP visits are recommended and have been used in a limited number of studies (3). The time for the visit of the unannounced SP should ideally not be known by the participating doctors. Such an arrangement requires fully-informed consent from the doctors, who accept that they may be visited by an SP a specified number of times within a defined time-frame.
Unannounced SPs have been used in educational research, including two randomized clinical trials (RCTs) to assess the effectiveness of educational interventions on the performance of general practitioners (GPs) regarding cancer control (4) and of GP trainees regarding nutrition (5). A more extensive use of simulations in continuing medical education (CME), encompassing SPs and other techniques, has been proposed to assess the clinical reasoning and skills of health professionals (6). Before extending the use of unannounced SPs to evaluation of educational interventions related to psychiatry, however, it is vital to determine the validity and reliability of the technique. In a literature review conducted in 2007, more than 5,000 articles or books were found on simulation and health-professional education, although only 72 of these were related to psychiatry education (7).
We report here on the procedures and results of training and validation of SPs' portrayal of depressive disorders in primary care, as part of an RCT that assessed various educational strategies for teaching GPs in Tehran, Iran, on how to manage depressive disorders (8–10). The SPs were used in the study for unannounced assessment of the GPs' performance in their own practice on two occasions, before and after the education. The aim of the present study was to assess the accuracy (validity) and consistency (reliability) of the SPs and the tools used for the assessment.
In order to assess performance of doctors through SPs, the following steps and procedures must be followed: 1) development of scenarios; 2) selection and training of SPs to portray the scenarios; 3) identification of relevant aspects or items for assessment; 4) choosing a method for assessing the items; and 5) training the SPs to use the method. We describe here how these steps and procedures were developed, validated, and tested for reliability. All parts of the study were conducted in Farsi.
Five different scenarios regarding depressive disorders were compiled by three psychiatrists on the research team. The scenarios were related to depression, combined with various problems, such as acute risk of suicide, pathologic grief, headache, hypothyroidism, and bipolar disorders. The scenarios were based on issues relating to the diagnosis and treatment of depressive disorders, which had been emphasized in a guidebook (11) for GPs, compiled by an expert team (three psychiatrists, two general physicians, one epidemiologist, and an educator), based on World Health Organization (WHO) documents (12) and the World Psychiatric Association Bulletin on depressive disorders (13). The scenarios were finalized during the SP training program. Optimally written SP scenarios should be highly detailed, envisaging more information than any GP might elicit. To be convincing, the SP should respond with the same certainty as a true client might show in answering any of the questions.
The content validity of the scenarios was determined by consensus in an expert panel with 10 faculty members from the department of psychiatry at Tehran University of Medical Sciences (TUMS).
Selection of Potential SPs
Twenty-five psychology and nursing students responded to an invitation and were interviewed and assessed against criteria related to well-being, availability, and the importance of payment. Fifteen students were assessed as eligible, although only 10 decided to participate in the training. The students were between 21 and 43 years of age; four students were men. All had some knowledge of psychiatric disorders.
SPs were trained by three psychiatrists (MSa, AT, MA) and an educator (MSh) in a small-group setting utilizing five sessions of 2 hours each regarding SP portrayal, plus a 3-hour training session on how to fill in the checklists that summarized each GP's performance. Before each educational meeting, the SPs read handouts on depressive disorders, watched videos regarding depressed patients, and participated in the ambulatory psychiatry clinic. They also read their assigned scenarios carefully and discussed their roles with other SPs. In each educational session, the trainees performed audio-recorded role-play based on their scenarios with one of the psychiatrists acting as the doctor. Continuous feedback was given by the trainers, and the SPs also received the audiotapes for their own use.
Validation of SPs' Portrayals
One week after the end of the training course, the SPs performed their final role-plays with one GP as the doctor. The performance was assessed by the three psychiatrists, using an observational rating scale (available from the first author) comprising five items for verbal expression and four for nonverbal. The scale was developed specifically for this purpose, and the content validity of the scale was determined by consensus among the panel of experts. The psychiatrists had also compiled a guideline for assessment of the various case scenarios. For the data analysis, the ratings of the verbal expression were collapsed to a score of 1 for “always” or “frequently,” a score of 2 for “randomly/sometimes,” and a score of 3 for “rarely” or “never.” For the nonverbal items, a score of 1 was given for “very poor” or “poor,” a score of 2 for “mild,” and a score of 3 for “good” or “excellent.” The final score was calculated as the mean score of the nine items, where a score of 1 meant overall poor performance, and a score of 3 meant excellent performance.
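The score collapsing described above can be sketched in a few lines of Python. This is an illustration only: the names `VERBAL_SCORES`, `NONVERBAL_SCORES`, and `portrayal_score` are hypothetical, the response labels are taken from the text, and the mapping reproduces the collapse exactly as stated rather than the authors' actual SPSS procedure.

```python
from statistics import mean

# Collapse of the verbal-expression ratings, exactly as stated in the text.
VERBAL_SCORES = {
    "always": 1, "frequently": 1,
    "randomly/sometimes": 2,
    "rarely": 3, "never": 3,
}

# Collapse of the nonverbal ratings, exactly as stated in the text.
NONVERBAL_SCORES = {
    "very poor": 1, "poor": 1,
    "mild": 2,
    "good": 3, "excellent": 3,
}

def portrayal_score(verbal, nonverbal):
    """Mean collapsed score over five verbal and four nonverbal items (hypothetical helper)."""
    if len(verbal) != 5 or len(nonverbal) != 4:
        raise ValueError("expected 5 verbal and 4 nonverbal ratings")
    scores = [VERBAL_SCORES[r] for r in verbal]
    scores += [NONVERBAL_SCORES[r] for r in nonverbal]
    return mean(scores)
```

A rater who marks every verbal item "never" and every nonverbal item "excellent" would therefore produce the maximum mean score of 3.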
Each of the 10 SPs was given one scenario, which was portrayed twice with the same GP. Each session took about 15 minutes, and afterward the SPs filled in the checklists (see below). The second role-play took place after an interval of about 10 minutes. Each of the five scenarios was performed by two SPs. All interactions were videotaped, for a total of 20 videos. The three psychiatrists individually watched each video and completed the observational rating scale.
The validity of the SPs' portrayals was assessed by comparing the score for the first role-play with the criterion of reaching at least 90% of the maximum mean score of 3.0.
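As a worked illustration of this criterion, 90% of the maximum mean score of 3.0 corresponds to a cut-off of 2.7. The sketch below encodes that check; the constant and function names are hypothetical and are not taken from the study.

```python
MAX_MEAN_SCORE = 3.0      # maximum mean score on the rating scale (from the text)
VALIDITY_FRACTION = 0.90  # required share of the maximum (from the text)

def meets_validity_criterion(mean_score):
    """True if a first role-play's mean score reaches the 90% cut-off (hypothetical helper)."""
    threshold = VALIDITY_FRACTION * MAX_MEAN_SCORE  # 0.90 * 3.0 = 2.7
    return mean_score >= threshold
```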
The reliability of the SPs' portrayals was assessed using a test–retest approach where the degree of identical scoring of the items for the first and second (final) role-play was determined.
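The degree of identical scoring can be expressed as a simple percentage of items that received the same score in both role-plays. A minimal sketch follows, assuming each role-play is represented as a list of collapsed item scores in the same item order; the function name `percent_identical` is hypothetical.

```python
def percent_identical(first, second):
    """Percentage of items scored identically in two role-plays (hypothetical helper)."""
    if len(first) != len(second) or not first:
        raise ValueError("both ratings must cover the same non-empty item set")
    same = sum(a == b for a, b in zip(first, second))
    return 100.0 * same / len(first)
```

For example, if eight of the nine items receive the same score in both role-plays, the function returns about 88.9%.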
Checklists for Assessment of Doctors' Performance
The SPs were trained to be able to make unannounced visits at GPs' practices. To assess the performance of the doctors, aspects of management of depression were identified, related to diagnosis and treatment. One checklist was compiled for each of the scenarios, by the same expert group that had developed the guidebook on the management of depression. The specified content of the checklists was based on DSM-IV criteria (13) for diagnosis and management of depression (the checklists are available from the first author).
Each checklist had two parts. The first part (Diagnosis) included 13–16 items, with binary questions (Yes/No) related to the most important aspects to be considered by the GPs when encountering a suspected case of depressive disorder. It was to be completed by the SP after the consultation with the GP. The second part (Treatment) had 2–4 items, related to the appropriate treatment procedure. This part was completed by the three psychiatrists based on documents collected by the SP, such as prescription, laboratory tests, and referral to a specialist. The content validity of the checklists was assessed by the panel of 10 psychiatrists from TUMS.
The reliability of the checklists was tested by assessing interrater concordance between groups of SPs, with each SP assessing her or his own role-plays plus two of the others' after watching the respective video recordings.
The concurrent validity of the SPs' completion of the checklists was assessed by determining the correlation between the SPs' completed checklists and the checklists completed by the three psychiatrists individually.
The reliability was assessed by a test–retest approach, where the SPs' initial checklists were compared with a checklist assessment made 1 week later, after watching the video recording of their own performance.
Assessment of Performance in the Real Situation
The criterion for assessing the actual performance of the SPs' portrayal was the reported detection rate by the GPs.
The study protocol was approved by the Ethics Committee at TUMS. The potential SPs were informed that they would remain anonymous in all publications of results from the study and that they could receive direct support via mobile telephone contact with the main investigator (MSh) if they encountered any problem during their visits to the GPs. Studies using SPs normally require the consent of the participating physicians (3). All 192 GPs in the intervention study (8–10) were informed about the objectives of the study and the methods used for education and assessment, including unannounced SPs. All GPs accepted these terms and signed an informed consent form.
SPSS software Version 16 (SPSS, Inc, Chicago, IL) was used for analyzing data. For assessing the correlation, the kappa (κ) coefficient was computed.
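For readers unfamiliar with the statistic, the sketch below shows how a kappa coefficient can be computed for two raters completing the same binary (Yes/No) checklist. It assumes Cohen's kappa for two raters; the text does not state which kappa variant was computed in SPSS, and the function name `cohen_kappa` is hypothetical.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (assumed variant)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical category throughout; kappa is
        # formally undefined, and 1.0 is returned here as a convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Kappa corrects raw agreement for the agreement expected by chance, so two raters who agree on half the items of a balanced binary checklist obtain a kappa of 0 rather than 0.5.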
The mean score for assessing the validity of SPs' portrayal was 2.95 (range: 2.86–3.0). The testing for reliability of the portrayals showed a mean of 90.5% (range: 80%–100%) identical responses to the items on the observational rating scale.
The mean κ for validity of completed checklists by individual SPs was 0.55 (range: 0.22–0.90). The mean κ for reliability of completed checklists was 0.72 (range: 0.26–1.0).
The scores and individual test results for the 10 participants are shown in Table 1. One of the trainees (#6) was assessed as not eligible to act as an unannounced SP, because of low performance in completing the checklist.
TABLE 1. Mean Scores of Three Raters After Watching Videos Regarding Standardized Patients' (SP) Portrayal (Validity of SPs' Portrayal) and Percentage of Identical Responses Among Raters for Each SP (Interrater Reliability [κ]) and the Validity and Reliability of Completed Checklists by Each SP
The mean κ for reliability of the five checklists was 0.82 (0.71, 0.74, 0.80, 0.83, and 1.0, respectively). Of the total of 372 unannounced visits made by the nine trained SPs to the GPs participating in the intervention study (8–10), only one SP was detected by the GP on the first visit.
This study demonstrates that SPs can be trained to become a reliable and valid tool for assessing the performance of individual GPs with regard to the diagnosis and management of depressive disorders when they serve as unannounced patients in clinical practice. Our overall findings are in concordance with studies in other areas of medicine (3, 15–18), which all concluded that unannounced SPs can reliably rate some aspects of physician behavior, such as physician examination, test-ordering, consultations, or physician response to specific patient requests. However, evaluation with SPs is more complicated in practice-based settings. Unlike in structured, directly observed out-of-clinic examinations such as the objective structured clinical examination (OSCE), the investigators have no control over the encounter in the unannounced-SP setting. The participants in an OSCE setting are also aware of being assessed, which makes that setting different from the natural environment (3, 15–18).
An important aspect that distinguishes the present study is the development of a valid observational rating scale to assess various aspects of the SP's portrayal. The state of the art has been dominated by global rating scales, which assess the overall quality of the SP (19), although some researchers have used a combination of detailed checklists and global rating scales (20–22). In the present study, we developed a valid and reliable scale to ensure an accurate rating by the psychiatrists of the SP's portrayal in defined skill components. This also enabled the psychiatrists to use the checklist as an instructional tool to provide the SP candidates with effective feedback. We assume that the thoroughness of the training contributed to the high levels of validity and reliability of the SP portrayals.
As mentioned in the most comprehensive review article (3), previous studies have paid attention to only a few aspects of SP validation, whereas we have covered the whole range of aspects of the validation process. Moreover, this is the first study in which unannounced SPs have been used to evaluate the performance of such a large number of doctors over such a large number of visits. It is also the first time in Iran that SPs have been systematically trained and validated to perform unannounced visits to practicing doctors.
We regard the reliability and validity of case presentations to be acceptable for all trained SPs. We also conclude that the validity and reliability of potential SPs' competence to complete the checklists was excellent or acceptable for all SPs but one, who was also excluded from the group of SPs that visited the GPs. Other researchers have surmised that inaccuracies of recording by unannounced SPs might be a bias in studies on provider performance (23). Our study has shown that standardized procedures for training and validation of SPs can be protective against such bias.
Another aspect regarding unannounced SPs is the rate at which they are detected by the GPs, which has been reported to be as high as 70% in some studies, although the majority of studies report rates below 15% (3). The low rate in our study confirms that careful and systematic training of SPs can reduce the detection rate almost to zero.
The process of training and validation of SPs that we have used is quite detailed and time-consuming, as it involves bringing together an expert group to create a valid scenario, development of a checklist to ensure adherence to the scenario, considerable training for the SP, and then validation through review of one or more rehearsals of the scenario. However, we argue that it is necessary to follow a process like this in order to ensure that the collected information about the providers' performance is as accurate as possible. Still, no independent verification of the actual performance of the SP in the real-world setting was obtained, which would have been ideal. Alternatively, SPs could be rechecked on their performance at the end of the study period. Careful SP training for completing checklists is necessary to ensure validity and reliability of the data for assessment of performance of doctors.
A limited budget was the main reason for restricting the number of trained and validated SPs in this study. Moreover, our findings regarding the individual SPs or the group of SPs cannot be generalized, nor was this the aim of the study. However, what can be transferred to other studies with similar objectives of evaluating performance is how we developed the processes of validation and how the testing took place.
We describe, for the first time, a thorough validation of the technique of standardized patients in the portrayal of depressive disorders in primary-care settings. The methods of validation of SPs allow for confidence in employing this technique for evaluation of educational interventions in the field of management of depression in primary care. Similar techniques can be used for other psychiatric disorders in continuing medical education and professional development. Further research is needed to find the most appropriate ways of ensuring the accuracy and consistency of unannounced SPs in real practice.
The study was supported by the Tehran University of Medical Sciences through an educational study grant. We thank the voluntary “standardized patients,” the staff of the Continuing Medical Education office, and the faculty members of the psychiatry department.
At the time of submission, the authors reported no competing interests.