Beyond being an inherent responsibility of medical licensing boards, competency assessment has been an obligation and an ongoing challenge for those institutions responsible for the training and certification of physicians. Such assessment has had the practical function of establishing minimal professional standards that ensure the basic fitness of future physicians. At the same time, it can serve the more esoteric function of defining ideals: those values and principles that members of the field of medicine strive for, providing a professional identity as well as motivation and direction for ongoing learning.
Definitions of Competence
It is not surprising that the broad meaning and varied purposes of “professional competence” have never been easily translated into one unifying definition that includes the various domains of medical practice. In their 2002 review article, Epstein and Hundert (1) proposed the following definition of professional medical competence: “the habitual and judicious use of communication, knowledge, technical skills, clinical reasoning, emotions, values and reflection in daily practice for the benefit of the individual and community being served.” This thoughtful definition does seem to capture the disparate layers of competence but does not lend itself to easy measurement. It incorporates cognitive, technical, integrative, contextual, relational, affective, moral and “mindful” qualities.
Medical schools, training programs and licensing bodies have had widely varied and frequently unstudied techniques for assessing the professional competence of those completing training or seeking a license to practice medicine. At the medical school level, these standards are set by the Liaison Committee on Medical Education (LCME) and are tested by the National Board of Medical Examiners (NBME). As of February 2004, the LCME requires that medical schools establish a system for the evaluation of medical student achievement: “not only retention of factual knowledge, but also development of the skills, behaviors, and attitudes needed in subsequent medical training and practice.” The LCME has specified that such evaluations address “problem solving, clinical reasoning and communication skills,” be given in a timely manner, and be developed with an awareness of the “uses and limitations of various test formats, the purposes and benefits of criterion-referenced versus norm-referenced grading, reliability and validity issues (and) formative versus summative assessment” (2). At the level of graduate medical education, such assessments have traditionally followed a “minimum standards” model and have been determined by each field within the American Board of Medical Specialties (ABMS); in the case of psychiatry they have been defined and measured by the American Board of Psychiatry and Neurology (ABPN). The Accreditation Council for Graduate Medical Education (ACGME) and its Residency Review Committees (RRCs) set the common program requirements for training in each medical specialty and the institutional requirements for those parent institutions that sponsor the residency training programs. The RRCs determined the necessary duration of rotations, and each program could define its own criteria for satisfactory performance. Until recently, the ACGME has simply required that programs document residents’ completion of the required rotations and clinical experiences, though the sponsoring institutions did have some accountability for the performance of the trainees.
The field of medicine has failed to keep pace with the competency assessments attempted in other professional disciplines, such as business, aviation, and education. In response to growing demands for accountability from the public and from the sources of funding for health education and health care delivery (3), the ACGME in 2000 laid out a definition of competence that included six specific areas of focus: patient care (including clinical reasoning), medical knowledge, practice-based learning and improvement, interpersonal and communication skills, professionalism, and systems-based practice (4). The new RRC requirements for general psychiatry training require each training program to demonstrate that it has “an effective plan for assessing resident performance throughout the program and for using assessment results to improve resident performance. This plan should include use of dependable measures to assess residents’ performance in all six general competencies” (5). The ACGME’s intention is to mandate that increasingly reliable and valid assessment measures be used by all training programs over the next decade with the goal of providing “more credible, accurate, reliable and useful educational outcome data” (6).
Implications for Psychiatry Training Programs
The challenges for psychiatry residency training programs include expanding the basic ACGME definitions of the general competencies into specific, observable, and measurable elements particular to general psychiatry and specifying psychotherapy competencies. This expansion needs to happen by consensus of experts within the field and needs to be very specific in order to be testable. This process has begun at both the ACGME and the American Association of Directors of Psychiatric Residency Training (AADPRT), and elements that are particular to psychiatry have been proposed, though not yet definitively established (3, 7). As the situation stands, each residency training program provides its own reasonable elaboration of these elements, and then training programs must determine which assessment tools will prove most accurate, constructive, reliable, and practical for their evaluations. Ideally, the assessment results will then provide specific information and directions for improvements in the development and performance of individual residents, as well as improvements in the educational programs and experiences that constitute their training. In this article, we will review various assessment tools currently in use in postgraduate medical training, as well as the available literature on these tools. We will consider their potential usefulness and possible liabilities for assessing professional competence in psychiatry training programs, though data on the use of these tools within psychiatry training programs have not yet been collected. For those who wish to read more extensively about specific topics within this field, we have included an annotated bibliography (Appendix 1) of the recent literature at the conclusion of the article. For those who would like to access available tools or learn more about the approaches of some of the programs using them, we have also attached a list of web-based resources (Appendix 2).
Before describing specific tools, it is important to consider certain principles of assessment. Beyond thinking about the general and specific components of competency, there are important distinctions between knowing, showing, and actually performing competent behaviors as a routine part of practice (not only when observed). Though the first of these may be easy to measure, the second is generally more labor- and time-intensive, and the last is quite challenging. Consequently, tools for the first are often well established and straightforward, whereas the development of tools to assess showing and doing has generally been more complex and problematic (8).
Assessment can be broadly thought of as being formative or summative: assessment done in order to further the learning process is formative assessment, whereas that which is done in order to determine adequate performance or acquisition of knowledge is summative assessment. Formative assessment is usually done early and often so that residents and their training programs have the opportunity to identify areas of weakness (both within the resident’s performance and pertaining to educational experiences), make constructive changes, and then evaluate whether or not those have helped. Summative assessment is often done only at the completion of a component of training and used primarily for decision-making.
Benchmarks and Thresholds
A critical concept that emerges in the development of assessment tools is the determination of benchmarks. These are observable or measurable skills or behaviors that are expected to reflect accurately the degree of mastery over one of the general competencies. In choosing which benchmarks to measure, it is important to consider how frequently something may occur and be observed, how serious failure to adequately perform a task would be, and whether description of this task is primarily subjective or objective (reflecting its likely reliability). Once these benchmarks have been chosen, thresholds of competence must be defined and then set for each level of training. Ideally, these thresholds should be set by consensus across training programs; otherwise, the standards for professional competence will remain distinctly program-dependent. Examples of possible thresholds include the frequency with which a mastered skill is demonstrated, compared with what is expected at the resident’s level of training, or performance on a task compared with “average peers.” Accreditors discourage peer-referenced methods because their reliability and validity are difficult to establish. Alternately, we might select a model of professional development that allows trainees to compare themselves primarily to objective standards rather than peers. Such a system could simply describe performance as “adequate” or better on a Likert scale (a scale on which respondents rate the degree to which they agree or disagree with a given statement) or could use a model such as the Dreyfus brothers’ model of skill acquisition (9). This is a five-step model of professional development: novice, advanced beginner, competent, proficient, and expert/master. This particular approach is supported by David Leach, M.D., the executive director of the ACGME (10). It should be noted that achievement of these developmental levels would still need to be clearly defined by graduated thresholds of the various agreed-upon benchmarks.
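To make the notion of graduated thresholds concrete, the following sketch shows one hypothetical way a program might encode a single benchmark’s expected level and frequency by year of training; the benchmark, Dreyfus-style cutoffs, and numbers are invented for illustration and are not published standards.

```python
# Hypothetical encoding of graduated thresholds for one benchmark
# ("obtains informed consent before starting a medication").
# Levels and frequency cutoffs are invented for illustration only.

DREYFUS_LEVELS = ["novice", "advanced beginner", "competent", "proficient", "expert"]

# Expected minimum level and minimum proportion of observed encounters
# in which the behavior was demonstrated, by postgraduate year.
THRESHOLDS = {
    1: ("novice", 0.50),
    2: ("advanced beginner", 0.70),
    3: ("competent", 0.85),
    4: ("proficient", 0.95),
}

def meets_threshold(pgy_year, observed_level, observed_frequency):
    """Return True if observed performance meets the year-specific threshold."""
    expected_level, min_frequency = THRESHOLDS[pgy_year]
    level_ok = DREYFUS_LEVELS.index(observed_level) >= DREYFUS_LEVELS.index(expected_level)
    return level_ok and observed_frequency >= min_frequency

print(meets_threshold(2, "competent", 0.80))  # True: exceeds both cutoffs for a PGY-2
```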
In choosing tools for measuring physician performance and professionalism, it is critical to appreciate what is known of their psychometric properties: specifically their reliability and validity (construct, content, and predictive validity), as well as their information yield. [Note: Predictive validity has been achieved when test items have predicted accurately some future behavior. Content validity has been achieved when the specific knowledge and skills covered by test items prove to be representative of the larger domain of knowledge and skills at issue. Construct validity refers to the degree to which inferences can legitimately be made from the operationalizations in your assessment to the theoretical constructs on which they were based: if you have construct validity, you can generalize from your measures to the concept that you meant to be assessing.] This information will help training directors place evaluations within their proper context as they also consider the possibility of rater errors, recall bias, and sampling bias. Most importantly, they must be able to estimate whether or not an evaluation is accurate and “generalizable.” The information yield of an instrument is an important factor to consider as well. The availability of detailed and specific information, not just general conclusions, can make the difference between discouraging feedback and feedback that is experienced as constructive and helpful by residents, motivating them toward focused efforts at improvement.
Efficacy of the Assessment Program
When creating methods to assess the developing competencies of trainees, it is also important to plan on assessing the system itself. That is, it should lend itself smoothly to providing feedback to the training program about its strengths and deficiencies and to the teachers and supervisors about the quality of their teaching and the adequacy and reliability of their evaluations. Finally, we should think about whether our evaluative model contributes to the development of physicians who are “more professional”: more mindful of their strengths and weaknesses and more motivated to learn or to improve as measured by other standards (patient satisfaction, their own job satisfaction, etc.). Such a system would also accomplish the annual performance evaluation of faculty required by the ACGME, as well as credentialing by hospital systems of care.
The importance of feedback, the constructive delivery of formative assessment, and its distinction from evaluation cannot be overstated. It is clearly important that residents feel supported and not judged in order to create a climate that is conducive to motivated learning and positive change. In his critical paper on feedback, Ende explained, “distinct from evaluation, feedback presents information, not judgment. Feedback is formative” (11). The key to providing such constructive feedback is that it be face-to-face, ongoing, mutual, timely, based on actual data, nonsubjective and nonsummative. Ende elaborated upon these important features of ideal feedback. He described mutuality, in which an evaluator is an ally, working with common goals, committed to improving his or her own performance as well as that of the trainee. Effective feedback also occurs regularly and is expected, is based on specific observable behaviors that are remediable, and is phrased descriptively, without interpretation or allusions to perceived intentions. In order for formative assessments to be meaningful, the evaluators must be free to be frank, and both the evaluators and the trainees must believe that their purpose is to ensure improvement, not to reach definitive conclusions about a resident. This does not mean that there would never be serious consequences for “unprofessional” behavior, but rather that the emphasis would be on the opportunity for improvement rather than sanction (unless attempts at improvement have already been exhaustive). Truly honest feedback is difficult but possible if undertaken thoughtfully.
With more systematic, detailed, and accurate assessments of the competencies, remediation may become an increasingly prominent issue for training programs. The challenge of remediation is a complex and difficult one, from how to make it effective to how to manage the strain it can place on the finances and workforce of a department. While a review of the specific difficulties of and models for remediation is beyond the scope of this article, thoughtful planning for effective remediation will be a crucial part of the new systems of assessment and should also be considered in future research on the competencies.
One final matter to consider is confidentiality. Beyond the importance of a trusting and confidential relationship between the evaluator and resident, one must also consider that records of such comprehensive evaluations might be sought by licensing bodies or might even be deemed discoverable by a court if a physician is the object of a lawsuit. Thus, careful consideration of the format of these evaluations is needed; perhaps the comprehensive assessments should be used only for formative purposes. The more summative information could then be preserved in aggregate form (that is, anonymously with regard to those being evaluated) to demonstrate that a program is teaching and measuring the general competencies. Distinct from confidentiality is the possible anonymity of raters (or at least blinding the resident to their identities). Such anonymity might be helpful in improving the accuracy and thoroughness of evaluations and maintaining a resident’s receptiveness to feedback, but it risks making such feedback general and thus less effective. While anonymity may be necessary for peer evaluations or patient evaluations, there is clearly some role for identified (and mutual) evaluators.
Specific Assessment Tools
There are many assessment tools that have varying amounts of subjective and objective information, require varying degrees of effort, and may have particular usefulness for certain general competencies. It should be noted that most of the literature and the established psychometric qualities of various instruments derive from the medical and surgical specialties. While helpful, their applicability to psychiatry may in some cases be limited, particularly when one considers the different priorities and outcomes in psychiatry, the longer time-course of treatments and the challenges of observing treatment.
Global ratings are currently the most widely used method of assessment in graduate medical education (6). Global rating forms are easy to create and use but, at best, provide only limited information and direction for remediation and, at worst, are very inaccurate.
These ratings require evaluators to make subjective and broad (“global”) conclusions about the quality of behaviors, skills, knowledge, and attitudes demonstrated by a resident. These evaluations typically are performed at the completion of each rotation, and there is great variability in the particulars of the rating forms and how evaluators use them; they rarely detail the performance of specific tasks or specify what observations have led to the evaluator’s conclusions. While research has found that residency directors consider global ratings their most important tool for evaluating resident competence, it also has demonstrated that their reliability and content validity are actually quite poor. Rater errors of leniency/severity, range restriction, and failure to distinguish among dimensions of competence (“the halo effect,” or generalizing from perceived talent to all specific areas of competence without actual assessment) have all been reported (12–14). Several studies have demonstrated that global ratings of medical residents have been inflated when compared with other assessment instruments (1, 6, 15, 16). The reliability measurements of these instruments have been at best inconsistent and at worst very poor (1, 6, 17–19). The best measured interrater reliability described in a medical encounter was in the range of 0.64 to 0.78, achieved when the rating forms were tailored to specific clinical tasks and based on the observation of a single encounter (13). The content validity is also highly suspect, as there is no way to determine what specific knowledge, skills, or behaviors are being evaluated. Finally, global ratings are not very useful for clarifying the specific areas that need improvement.
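For readers less familiar with these coefficients, the following minimal sketch shows what an interrater reliability figure of this kind reflects: the degree of agreement between two raters scoring the same encounters. A simple Pearson correlation is used here for clarity; the cited studies may have used other statistics (such as intraclass correlations or kappa), and the ratings below are invented.

```python
# Minimal sketch: interrater reliability as agreement between two raters
# scoring the same encounters. Ratings are invented; real studies may use
# intraclass correlation or kappa rather than a Pearson correlation.
from statistics import mean

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5

# Hypothetical global ratings (1-9 scale) by two faculty observing the same ten encounters.
rater_a = [6, 7, 5, 8, 6, 7, 4, 9, 6, 7]
rater_b = [5, 7, 6, 8, 5, 8, 5, 9, 7, 6]

print(round(pearson_r(rater_a, rater_b), 2))  # about 0.81 for these invented ratings
```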
Self-evaluations are quite similar to global ratings. They are widely used, easy to design and implement, and inexpensive. These evaluations, however, usually have poor predictive validity and poor correlation with evaluations by supervisors (of note, the self-ratings are typically the more severe). Some studies have demonstrated improved psychometric properties when the residents are trained in the specific criteria for the rating forms (1, 20, 21). Self-evaluations highlight the important distinction between competence and confidence and appear to have a helpful role in the promotion of honesty and self-awareness (20, 21). They can be an important component of 360-degree evaluations (see below), where residents’ own perceptions of their strengths and deficiencies can be compared with the perceptions of supervisors, patients, and other team members to highlight areas in need of improvement of which the resident is unaware. The contrast with the self-evaluation may enhance self-awareness and receptiveness to feedback, facilitating change. Self-evaluations also have been used with greater success as part of standardized patient exercises, where again the residents must be trained in their use; the self-ratings can then be compared with the other evaluations as part of the resident’s feedback.
Checklists consist of the specific and observable component actions that make up a more complex undertaking. These components are assumed to be necessary for the competent performance of a single task that is in turn considered a benchmark of a general competency. The checklist may record simply whether or not each component occurred or may expand upon its performance (whether certain tasks were performed completely or not, correctly or not). While they are, strictly speaking, subjective measures, the specific and structured nature of these measurements has contributed to better reliability than that of global ratings, in the range of 0.7–0.8 (6). Checklists yield useful and specific information that can direct improvement, but their content validity depends on the extent to which the behaviors on the checklist accurately reflect the components of the activity being observed and whether the score on the checklist has a demonstrated relation to a desired outcome. Given their reliability and rich yield of detailed information, checklists are useful tools within standardized oral exams or directly observed standardized patient examinations (objective structured clinical examinations, OSCEs). They have demonstrated reasonable construct validity, though this may be attributable to residents’ improvement in test-taking skills. Whether the information gathered is “generalizable” will depend on the number and variety of observational settings. Checklists are prescriptive, and as such will not account for developmental nuances (when skills are expected to be mastered), nor will they distinguish performance that is considered “expert.”
360-degree assessments were originally developed for use within business settings; they utilize evaluations from all relevant persons within an employee’s sphere of function. Such assessments have been piloted in several medical and surgical residency training programs in the last several years, with preliminary reports of their usefulness being very positive (6, 22–24). Evaluators may include faculty, ongoing supervisors, other professionals (social workers, nurses, etc.), paraprofessionals (administrative staff), medical students, patients, peers, and the residents themselves. While all subjective forms are vulnerable to rater bias and sampling bias, the array of feedback from many sources with different professional backgrounds can provide specific and credible information about patterns of strength or weakness. Such detailed information from a host of sources is a powerful tool for promoting self-awareness and motivating efforts at improvement.
Studies within the business sector have demonstrated that 360-degree evaluations can be information-rich and very reliable. Specific information, along with meaningful distinctions across settings and raters and a gap analysis with the self-assessment, can combine to provide highly precise and persuasive feedback (22, 24, 25). (A gap analysis is an analysis of the overall or recurring differences between self-evaluations and all others, which might suggest that a resident has a “blind spot” in one area or is overly self-critical in another.) They appear to have the greatest content and construct validity for assessing professionalism, interpersonal and communication skills, and systems-based practice (6, 22, 24–27). Research has demonstrated that as many as 20 to 50 attending evaluations, 20 patient evaluations, or as few as five nursing evaluations are necessary to achieve stable (reliable) ratings of a resident’s humanistic qualities (1, 6, 22–24, 28, 29). All of these psychometric strengths depend on the quality of the particular evaluation form used. The evaluation tools are typically subjective ratings (with scales) across specified domains of function. These domains may be tailored to evaluate the ACGME general competencies, provided they are thoughtfully broken down into observable components (benchmarks) that can then be scored by the various evaluators. The form’s questions must be tailored to the particular evaluator; while supervisors may reasonably be expected to evaluate medical knowledge and clinical reasoning, patients are better placed to evaluate professionalism and communication skills (though the validity of these assessments in medicine has not yet been studied). In constructing the form, consideration should also be given to the specific scales used: whether to rate on a standard 5-point Likert scale (more useful for patients and paraprofessionals) or to use standards such as “at the level of most residents in their year of training” (with the concerns for reliability and validity that peer-referenced standards produce) or “among the worst—average—best residents I have worked with,” as might be more fitting for supervisors, faculty, and other professionals.
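As an illustration of the gap analysis described above, the following sketch compares a resident’s self-ratings with the average ratings from the other groups, domain by domain; the domains, rater groups, and numbers are invented, and a hypothetical 1–5 Likert scale is assumed.

```python
# Illustrative gap analysis for a 360-degree evaluation: self-ratings versus
# the mean rating from all other groups, by domain. All values are invented.
from statistics import mean

ratings = {
    "self":     {"communication": 4.0, "professionalism": 4.5, "systems-based practice": 4.0},
    "faculty":  {"communication": 3.2, "professionalism": 4.4, "systems-based practice": 3.0},
    "nursing":  {"communication": 2.9, "professionalism": 4.3, "systems-based practice": 3.1},
    "patients": {"communication": 3.1, "professionalism": 4.6, "systems-based practice": 3.3},
}

others = [group for group in ratings if group != "self"]
for domain in ratings["self"]:
    others_mean = mean(ratings[g][domain] for g in others)
    gap = ratings["self"][domain] - others_mean
    print(f"{domain}: self {ratings['self'][domain]:.1f}, others {others_mean:.2f}, gap {gap:+.2f}")

# A large positive gap (here, communication) suggests a possible blind spot;
# a large negative gap would suggest the resident is overly self-critical.
```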
Peer ratings have demonstrated accuracy and reliability in the measurement of physician performance and may be particularly useful for evaluating professionalism (1). However, they raise concerns about the confidentiality of the raters, the effect on class morale, and the risk of rater biases. When peer evaluations are integrated into a 360-degree evaluation, though, these concerns are somewhat diminished and they can play a helpful role. Most of the training programs that have described piloting 360-degree assessments have reported gathering very meaningful information that directs remediation, and their residents have reported finding the information highly constructive (22–25, 29). While not yet well established in medicine, 360-degree assessments are a very promising tool for the evaluation of professional competence and of the adequacy of training programs.
A computer-based evaluation system may be applicable to many types of assessment tools, but it can be especially useful for data-intensive evaluation systems such as 360-degree evaluations. A computerized evaluation system can improve the efficiency of distributing and gathering many forms at relatively frequent intervals. If computer systems are used, it is easier to compile information not only about individual residents but also about patterns emerging within evaluator groups or for an individual evaluator across residents. When there are distinct deficiencies or redundancies in a training program, they will often become evident quickly with such comprehensive evaluations. Areas in need of improvement for an individual resident and for whole programs can then be remediated promptly and followed in subsequent evaluations. While these systems can greatly improve efficiency and return rates, they are frequently expensive, and the confidentiality of the data must be carefully ensured.
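The kind of aggregation such a system makes easy can be sketched as follows; the records, names, and domains are invented, and the point is simply that stored ratings can be rolled up per resident and per evaluator so that a program-wide weak area, or a consistently lenient or severe rater, becomes visible.

```python
# Sketch of aggregating stored evaluations per resident and per evaluator.
# All records are invented; a real system would read from its own database.
from collections import defaultdict
from statistics import mean

records = [
    # (evaluator, resident, domain, rating on a 1-5 scale)
    ("Dr. A", "Resident 1", "patient care", 5),
    ("Dr. A", "Resident 2", "patient care", 5),
    ("Dr. B", "Resident 1", "patient care", 3),
    ("Dr. B", "Resident 2", "patient care", 4),
    ("Dr. A", "Resident 1", "systems-based practice", 4),
    ("Dr. B", "Resident 2", "systems-based practice", 2),
]

by_resident = defaultdict(list)
by_evaluator = defaultdict(list)
for evaluator, resident, domain, rating in records:
    by_resident[(resident, domain)].append(rating)
    by_evaluator[evaluator].append(rating)

for key, values in sorted(by_resident.items()):
    print(key, round(mean(values), 2))        # per-resident, per-domain averages
for evaluator, values in sorted(by_evaluator.items()):
    print(evaluator, round(mean(values), 2))  # per-evaluator averages (possible leniency or severity)
```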
In direct observation, faculty shadow residents as they perform clinical duties and evaluate them with global ratings or checklists. Within internal medicine, the mini-clinical evaluation exercise (mini-CEX) is an example of a direct observation assessment. These assessments are very time-consuming and require patient consent. They are prone to poor interrater reliability unless they are done with checklists that have highly specific criteria and are performed by faculty who are trained to use them (30). They are also prone to sampling bias (observing unusually complex or simple cases by chance) unless multiple clinical situations are observed or the degree of complexity and acuity is accounted for (1, 6, 31). Finally, direct observation assessments are prone to the “Hawthorne effect,” in which residents are on their best behavior when they know they are being watched (15, 31). These are considerable liabilities. However, reasonable reliability can be achieved with adequate checklist forms, which can be used in the psychiatric emergency room, on inpatient rotations, or on consultation-liaison rotations. The validity of this approach appears to be best for medical knowledge and patient care and worst for professionalism, communication, and interpersonal skills (likely owing to the Hawthorne effect) (1, 6, 15, 26, 27, 30, 31). The observation and rating of actual videotaped physician-patient interactions have been used and studied in the United Kingdom, the Netherlands, and Canada. This approach has advantages and susceptibilities to error and bias similar to those of live evaluations, unless the resident and patient are blinded to the videotaping, which raises ethical concerns (1, 6, 17, 32, 33). However, direct observation of videotaped therapy sessions by trained supervisors with specific checklists potentially has great applicability to the assessment of competency in psychodynamic and other modalities of psychotherapy (3).
Standardized written examinations, such as the Psychiatry Residency In-Training Examination (PRITE), the Child PRITE, and the ABPN written board examination, are well known and have well-established reliability. However, their content and predictive validity are more questionable. While their relevance to behavior is limited, these examinations are clearly useful in assessing quantifiable competencies (such as factual medical knowledge). Their advantages include their reliability, limited time and cost requirements, and low need for faculty time. In addition, standardized written exams may be useful as supplements to standardized clinical exams (see below), reducing the needed number of stations (and thus cost) and improving reliability (1, 6, 34–36).
Structured oral examinations, such as vignettes and role playing, can achieve excellent levels of reliability when structured and specific rating instruments and trained raters are used, with reported reliability scores of 0.78 to 0.91. Though they are prone to the Hawthorne effect and sampling bias, they can evaluate rare but important scenarios. These examinations appear to have limited validity overall; they are most valid for the general competencies of patient care, medical knowledge, and interpersonal communication (1, 6, 25–27).
Standardized patient examinations and objective structured clinical examinations (OSCEs) use a series of stations with standardized patients to be examined or other clinical tasks to be performed (reading x-rays and ECGs, etc.). Trained faculty observe and rate residents’ performance at each station with detailed checklists. Standardized patients also are trained to rate the residents with a detailed checklist. The medical and surgical literature has established an adequate reliability rate for OSCEs (approximately 0.8) (1, 6, 37–39). More stations increase the reliability, and some literature suggests that 14 to 18 stations are needed for results to generalize to overall clinical competence. However, the number of stations needed can be reduced by the addition of a standardized written examination (35, 38). OSCEs also have well-established content and construct validity, but there remains the problem of establishing a gold standard for validity (1, 6, 32, 33). Though they can be rich in information, OSCEs are costly and time-consuming. The small number of skills that can be observed at discrete stations may limit their usefulness in assessing the competency of psychiatric trainees. The ratings by the standardized patients themselves appear to be quite useful for the evaluation of interpersonal communication and professionalism; however, the literature has demonstrated that they are prone to recall bias (1, 6, 19, 27, 39). With the use of resident-blinded standardized patient exams, a resident’s performance in a rare clinical scenario can be evaluated without the risk of the Hawthorne effect. However, these are usually rated by the standardized patient only and thus are prone to recall bias. Nonetheless, they appear useful for evaluating professionalism (particularly in managing ethically complex scenarios) and interpersonal skills (1, 6, 19, 25, 27, 33, 38).
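The relationship between the number of stations and overall reliability can be illustrated with the Spearman-Brown prophecy formula, a standard psychometric result; the per-station reliability used below is invented, and the studies cited above may have relied on other (e.g., generalizability-theory) analyses.

```python
# Spearman-Brown prophecy formula: projected reliability of an exam built
# from n parallel stations, given the reliability of a single station.
# The per-station reliability of 0.20 is invented for illustration.

def spearman_brown(single_station_reliability, n_stations):
    r = single_station_reliability
    return n_stations * r / (1 + (n_stations - 1) * r)

for n in (6, 10, 14, 18):
    print(n, round(spearman_brown(0.20, n), 2))
# 6 -> 0.6, 10 -> 0.71, 14 -> 0.78, 18 -> 0.82: with this assumed per-station value,
# reliability approaches 0.8 only as the station count reaches the mid-to-high teens.
```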
Portfolios are a collection of a resident’s work or evidence of a resident’s accomplishments. In relation to psychiatry, they may include published work or other research-related materials. They may also include process notes, written patient evaluations, the content of a grand rounds presentation the resident gave or work done during an elective, and letters from patients, from peers for whom the resident has provided a consultation, or from administrators with whom the resident has interacted. Evidence of response to clinical errors or of efforts to improve one’s practice, such as written learning or treatment plans, can also be included. While portfolios are very useful for areas that are hard to assess or might otherwise not be accounted for in the assessment process, they are notoriously difficult to “score,” given their diverse contents. Establishing specific requirements for the amount, type, and quality of materials with which a resident is expected to fill his or her portfolio over the course of training would make evaluation of the portfolio more reliable and useful. Portfolios are a logical forum for a resident’s ongoing self-evaluation or for evidence of growth in areas that the resident and training director have targeted, defining development and self-awareness as prized values within the training program.
While many other evaluation tools have been developed by various medical schools and residency training programs, Strategic Management Simulation (SMS) is an interesting approach to evaluating decision-making in complex, ambiguous situations, a skill that is essential to the practice of psychiatry and is difficult to assess. SMS was developed by a surgical training program to assess critical thinking and decision-making in situations that are volatile, complex, and/or ambiguous. It is a computer-based program with six half-hour tasks (clinical vignettes) that evolve dynamically. Residents are expected to respond to the evolving scenario, and the program assesses their planning, sequential thinking, use of information, strategizing, and learning from mistakes. The developers of the program have found it to be highly reliable and have found that ratings on this exam correspond closely with faculty evaluations of the residents. In addition to predicting performance in clinical rotations, the exam showed high predictive validity as measured against income, promotions, and the number of people supervised at a given age (40). This suggests that it might be fruitful to develop a similar computer program that could assess a resident’s capacity for critical thinking and decision-making in otherwise rare clinical scenarios (ethical challenges or rare but dangerous medication effects) or in those that would not lend themselves to direct observation, perhaps even psychodynamic psychotherapy. The SMS program appears to have been used for summative purposes but certainly could also be used to provide detailed feedback about the skills evaluated.
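A toy sketch of the general idea, not of the SMS program itself, is shown below: a vignette that evolves in response to each decision, with the sequence of choices scored. The scenario content, options, and point values are invented.

```python
# Toy sketch (not the SMS program) of a dynamically evolving, scored vignette.
# States, options, and point values are invented for illustration.

vignette = {
    "start": {
        "prompt": "An agitated patient in the emergency room refuses labs.",
        "options": {
            "attempt verbal de-escalation": ("calmer", 2),
            "order restraints immediately": ("escalated", 0),
        },
    },
    "calmer": {
        "prompt": "The patient agrees to talk but discloses an overdose an hour ago.",
        "options": {
            "obtain serum levels and contact poison control": ("end", 2),
            "continue the interview without a workup": ("end", 0),
        },
    },
    "escalated": {
        "prompt": "The patient is now combative and staff are at risk.",
        "options": {
            "call for assistance and reassess": ("end", 1),
            "proceed alone": ("end", 0),
        },
    },
}

def run(choices):
    """Walk the vignette with a fixed sequence of choices and return the total score."""
    state, score = "start", 0
    for choice in choices:
        next_state, points = vignette[state]["options"][choice]
        score += points
        state = next_state
        if state == "end":
            break
    return score

print(run(["attempt verbal de-escalation", "obtain serum levels and contact poison control"]))  # 4
```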
There are no perfectly accurate and reliable evaluation instruments available for assessing the competence of psychiatrists throughout their training, but there are promising instruments that could be assembled into an evaluation system. Ideally, such a system would be rich enough in accurate and relevant detail to provide constructive, directive feedback. At the same time, creating permanent records of specific resident problems could have unintended licensing or forensic repercussions. The need for thorough and accurate information must also be balanced against the realities of limited time and money available for designing, training, and implementing such evaluation programs. Finally, the goal of formative assessment may feel at odds with the summative requirements of accreditation and licensing bodies. Given the available literature, it appears that 360-degree instruments are promising formative and summative assessment techniques, lending themselves especially well to the assessment of practice-based learning, systems-based practice, interpersonal and communication skills, professionalism, and patient care. Of course, the devil is in the details, and the accuracy of these instruments will depend on the particulars of the forms used, including the training or explanations given to raters. Patient care and medical knowledge appear to be best evaluated by standardized clinical exams and standardized written exams, respectively. An evaluation program might then be rounded out with portfolios that capture the work of trainees that might not otherwise be included. Finally, direct observation of videotaped treatments, rated with checklists and specific feedback, could be a critical adjunct to the traditional methods of psychotherapy supervision, ensuring that these critical but difficult-to-measure skills are not ignored in the rush to evaluate and demonstrate competency. While generating a practical, reliable, accurate, and constructive evaluation system may now be an imperative, it also presents an opportunity: an opportunity to highlight, in active form rather than in recited oaths, the ACGME’s and our profession’s values and standards of self-awareness, honesty, humility, curiosity, and diligence, for faculty and trainees alike.