Original Research

Development of a handoff evaluation tool for shift‐to‐shift physician handoffs: The handoff CEX



Increasing frequency of shift‐to‐shift handoffs coupled with regulatory requirements to evaluate handoff quality make a handoff evaluation tool necessary.


To develop a handoff evaluation tool.


Tool development.


Two academic medical centers.


Nurse practitioners, medicine housestaff, and hospitalist attendings.


Concurrent peer and external evaluations of shift‐to‐shift handoffs.


The Handoff CEX (clinical evaluation exercise) consists of 6 subdomains and 1 overall assessment, each scored from 1 to 9, where 1 to 3 is unsatisfactory and 7 to 9 is superior. We assessed range of scores, performance among subgroups, internal consistency, and agreement among types of raters.


We conducted 675 evaluations of 97 unique individuals during 149 handoff sessions. Scores ranged from unsatisfactory to superior in each domain. The highest rated domain for handoff providers was professionalism (median: 8; interquartile range [IQR]: 7–9); the lowest was content (median: 7; IQR: 6–8). Scores at the 2 institutions were similar, and scores did not differ significantly by training level. Spearman correlation coefficients among the CEX subdomains for provider scores ranged from 0.71 to 0.86, except for setting (0.39–0.40). Third‐party external evaluators consistently gave lower marks for the same handoff than peer evaluators did. Weighted kappa scores for provider evaluations comparing external evaluators to peers ranged from 0.28 (95% confidence interval [CI]: 0.01, 0.56) for setting to 0.59 (95% CI: 0.38, 0.80) for organization.


This handoff evaluation tool was easily used by trainees and attendings, had high internal consistency, and performed similarly across institutions. Because peers consistently provided higher scores than external evaluators, this tool may be most appropriate for external evaluation. Journal of Hospital Medicine 2013;8:191–200. © 2013 Society of Hospital Medicine


Transfers of care among trainee physicians within the hospital typically occur at least twice a day and have been increasing as work hours have declined.[1] The 2011 Accreditation Council for Graduate Medical Education (ACGME) guidelines,[2] which restrict intern working hours to 16 hours from a previous maximum of 30, have likely increased the frequency of physician trainee handoffs even further. Similarly, transfers among hospitalist attendings occur at least twice a day, given typical shifts of 8 to 12 hours.

Given the frequency of transfers, and the potential for harm generated by failed transitions,[3, 4, 5, 6] the end‐of‐shift written and verbal handoffs have assumed increasing importance in hospital care among both trainees and hospitalist attendings.

The ACGME now requires that programs assess the competency of trainees in handoff communication.[2] Yet, there are few tools for assessing the quality of sign‐out communication. Those that exist primarily focus on the written sign‐out, and are rarely validated.[7, 8, 9, 10, 11, 12] Furthermore, it is uncertain whether such assessments must be done by supervisors or whether peers can participate in the evaluation. In this prospective multi‐institutional study we assess the performance characteristics of a verbal sign‐out evaluation tool for internal medicine housestaff and hospitalist attendings, and examine whether it can be used by peers as well as by external evaluators. This tool has previously been found to effectively discriminate between experienced and inexperienced nurses conducting nursing handoffs.[13]


Tool Design and Measures

The Handoff CEX (clinical evaluation exercise) is a structured assessment based on the format of the mini‐CEX, an instrument used to assess the quality of history and physical examination by trainees for which validation studies have previously been conducted.[14, 15, 16, 17] We developed the tool based on themes we identified from our own expertise,[1, 5, 6, 8, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29] the ACGME core competencies for trainees,[2] and the literature to maximize content validity. First, standardization has numerous demonstrable benefits for safety in general and handoffs in particular.[30, 31, 32] Consequently we created a domain for organization in which standardization was a characteristic of high performance.

Second, there is evidence that people engaged in conversation routinely overestimate peer comprehension,[27] and that explicit strategies to combat this overestimation, such as confirming understanding, explicitly assigning tasks rather than using open‐ended language, and using concrete language, are effective.[33] Accordingly we created a domain for communication skills, which is also an ACGME competency.

Third, although there were no formal guidelines for sign‐out content when we developed this tool, our own research had demonstrated that the content elements most often missing and felt to be important by stakeholders were related to clinical condition and explicating thinking processes,[5, 6] so we created a domain for content that highlighted these areas and met the ACGME competency of medical knowledge. In accordance with standards for evaluation of learners, we incorporated a domain for judgment to identify where trainees were in the RIME spectrum of reporter, interpreter, manager, and educator.

Next, we added a section for professionalism in accordance with the ACGME core competencies of professionalism and patient care.[34] To avoid the disinclination of peers to label each other unprofessional, we labeled the professionalism domain as patient‐focused on the tool.

Finally, we included a domain for setting because of an extensive literature demonstrating increased handoff failures in noisy or interruptive settings.[35, 36, 37] We then revised the tool slightly based on our experiences among nurses and students.[13, 38] The final tool included the 6 domains described above and an assessment of overall competency. Each domain was scored on a 9‐point scale and included descriptive anchors at high and low ends of performance. We further divided the scale into 3 main sections: unsatisfactory (score 1–3), satisfactory (4–6), and superior (7–9). We designed 2 tools, 1 to assess the person providing the handoff and 1 to assess the handoff recipient, each with its own descriptive anchors. The recipient tool did not include a content domain (see Supporting Information, Appendix 1, in the online version of this article).
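The three sections of the 9‐point scale can be illustrated with a minimal Python sketch. The function name `handoff_cex_band` is hypothetical and is not part of the published tool; the sketch only encodes the cut points stated above (1–3, 4–6, 7–9).

```python
def handoff_cex_band(score: int) -> str:
    """Map a 1-9 Handoff CEX score to its descriptive section.

    Hypothetical helper for illustration; the cut points follow the text:
    1-3 unsatisfactory, 4-6 satisfactory, 7-9 superior.
    """
    if not isinstance(score, int) or not 1 <= score <= 9:
        raise ValueError("Handoff CEX scores are integers from 1 to 9")
    if score <= 3:
        return "unsatisfactory"
    if score <= 6:
        return "satisfactory"
    return "superior"
```

For example, a domain score of 7 falls in the superior section, whereas a score of 6 is satisfactory.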

Setting and Subjects

We tested the tool in 2 different urban academic medical centers: the University of Chicago Medicine (UCM) and Yale‐New Haven Hospital (Yale). At UCM, we tested the tool among hospitalists, nurse practitioners, and physician assistants during the Monday and Tuesday morning and Friday evening sign‐out sessions. At Yale, we tested the tool among housestaff during the evening sign‐out session from the primary team to the on‐call covering team.

The UCM is a 550‐bed urban academic medical center in which the nonteaching hospitalist service cares for patients with liver disease, or end‐stage renal or lung disease awaiting transplant, and a small fraction of general medicine and oncology patients when the housestaff service exceeds its cap. No formal training on sign‐out is provided to attending or midlevel providers. The nonteaching hospitalist service operates as a separate service from the housestaff service and consists of 38 hospitalist clinicians (hospitalist attendings, nurse practitioners, and physician assistants). There are 2 handoffs each day. In the morning the departing night hospitalist hands off to the incoming daytime hospitalist or midlevel provider. These handoffs occur at 7:30 am in a dedicated room. In the evening the daytime hospitalist or midlevel provider hands off to an incoming night hospitalist. This handoff occurs at 5:30 pm or 7:30 pm in a dedicated location. The written sign‐out is maintained in a Microsoft Word (Microsoft Corp., Redmond, WA) document on a password‐protected server and updated daily.

Yale is a 946‐bed urban academic medical center with a large internal medicine training program. Formal sign‐out education that covers the main domains of the tool is provided to new interns during the first 3 months of the year,[19] and a templated electronic medical record‐based electronic written handoff report is produced by the housestaff for all patients.[22] Approximately half of inpatient medicine patients are cared for by housestaff teams, which are entirely separate from the hospitalist service. Housestaff sign‐out occurs between 4 pm and 7 pm every night. At a minimum, the departing intern signs out to the incoming intern; this handoff is typically supervised by at least 1 second‐ or third‐year resident. All patients are signed out verbally; in addition, the written handoff report is provided to the incoming team. Most handoffs occur in a quiet charting room.

Data Collection

Data collection at UCM occurred between March and December 2010 on 3 days of each week: Mondays, Tuesdays, and Fridays. On Mondays and Tuesdays the morning handoffs were observed; on Fridays the evening handoffs were observed. Data collection at Yale occurred between March and May 2011. Only evening handoffs from the primary team to the overnight coverage were observed. At both sites, participants provided verbal informed consent prior to data collection. At the time of an eligible sign‐out session, a research assistant (D.R. at Yale, P.S. at UCM) provided the evaluation tools to all members of the incoming and outgoing teams, and observed the sign‐out session himself. Each person providing a handoff was asked to evaluate the recipient of the handoff; each person receiving a handoff was asked to evaluate the provider of the handoff. In addition, the trained third‐party observer (D.R., P.S.) evaluated both the provider and recipient of the handoff. The external evaluators were trained in principles of effective communication and the use of the tool, with specific review of anchors at each end of each domain. One evaluator had a DO degree and was completing an MPH degree. The second evaluator was an experienced clinical research assistant whose training consisted of supervised observation of 10 handoffs by a physician investigator. At Yale, if a resident was present, she or he was also asked to evaluate both the provider and recipient of the handoff. Consequently, every sign‐out session included at least 2 evaluations of each participant, 1 by a peer evaluator and 1 by a consistent external evaluator who did not know the patients. At Yale, many sign‐outs also included a third evaluation by a resident supervisor.

The study was approved by the institutional review boards at both UCM and Yale.

Statistical Analysis

We obtained mean, median, and interquartile range of scores for each subdomain of the tool as well as the overall assessment of handoff quality. We assessed convergent construct validity by assessing performance of the tool in different contexts. To do so, we determined whether scores differed by type of participant (provider or recipient), by site, by training level of evaluatee, or by type of evaluator (external, resident supervisor, or peer) by using Wilcoxon rank sum tests and Kruskal‐Wallis tests. For the assessment of differences in ratings by training level, we used evaluations of sign‐out providers only, because the 2 sites differed in scores for recipients. We also assessed construct validity by using Spearman rank correlation coefficients to describe the internal consistency of the tool in terms of the correlation between domains of the tool, and we conducted an exploratory factor analysis to gain insight into whether the subdomains of the tool were measuring the same construct. In conducting this analysis, we restricted the dataset to evaluations of sign‐out providers only, and used a principal components estimation method, a promax rotation, and squared multiple correlation communality priors. Finally, we conducted some preliminary studies of reliability by testing whether different types of evaluators provided similar assessments. We calculated a weighted kappa using Fleiss‐Cohen weights for external versus peer scores and again for supervising resident versus peer scores (Yale only). We were not able to assess test‐retest reliability by nature of the sign‐out process. Statistical significance was defined by a P value ≤0.05, and analyses were performed using SAS 9.2 (SAS Institute, Cary, NC).
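To make the reliability measure concrete, the following is a minimal Python sketch of a weighted kappa with Fleiss‐Cohen (quadratic) weights for 2 raters scoring the same handoffs. It is not the study's SAS code. The function name and the example scores are invented for illustration, and the sketch does not compute confidence intervals.

```python
from collections import Counter

def weighted_kappa_quadratic(rater_a, rater_b, min_score=1, max_score=9):
    """Weighted kappa with quadratic (Fleiss-Cohen) disagreement weights.

    kappa = 1 - (observed weighted disagreement / expected weighted disagreement),
    where the weight for the score pair (i, j) is ((i - j) / (max - min)) ** 2.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of handoffs")
    n = len(rater_a)
    span = max_score - min_score
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (score_a, score_b)
    marginal_a = Counter(rater_a)              # how often rater A used each score
    marginal_b = Counter(rater_b)              # how often rater B used each score
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = ((i - j) / span) ** 2
            observed_disagreement += weight * observed[(i, j)] / n
            # Expected proportion under independence of the two raters.
            expected_disagreement += weight * (marginal_a[i] / n) * (marginal_b[j] / n)
    if expected_disagreement == 0:
        raise ValueError("kappa is undefined when both raters use one identical score")
    return 1 - observed_disagreement / expected_disagreement

# Invented scores for illustration only (not study data): the external
# evaluator rates every handoff exactly 1 point below the peer.
peer_scores = [8, 8, 9, 7, 8, 9]
external_scores = [7, 7, 8, 6, 7, 8]
kappa = weighted_kappa_quadratic(peer_scores, external_scores)
```

In this invented example the 2 raters rank the handoffs identically, yet kappa is only about 0.49, because a systematic 1‐point offset counts as disagreement. With quadratic weights the normalizing constant cancels between numerator and denominator, so the choice of `min_score` and `max_score` does not change the result.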


A total of 149 handoff sessions were observed: 89 at UCM and 60 at Yale. Each site conducted a similar total number of evaluations: 336 at UCM, 337 at Yale. These sessions involved 97 unique individuals, 34 at UCM and 63 at Yale. Overall scores were high at both sites, but a wide range of scores was applied (Table 1).

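Table 1. Median, Mean, and Range of Handoff CEX Scores in Each Domain, Providers and Recipients

| Domain | Provider (N=343): Median (IQR) | Provider: Mean (SD) | Provider: Range | Recipient (N=330): Median (IQR) | Recipient: Mean (SD) | Recipient: Range | P Value |
|---|---|---|---|---|---|---|---|
| Setting | 7 (6–9) | 7.0 (1.7) | 2–9 | 7 (6–9) | 7.3 (1.6) | 2–9 | 0.05 |
| Organization | 7 (6–8) | 7.2 (1.5) | 2–9 | 8 (6–9) | 7.4 (1.4) | 2–9 | 0.07 |
| Communication | 7 (6–9) | 7.2 (1.6) | 1–9 | 8 (7–9) | 7.4 (1.5) | 2–9 | 0.22 |
| Content | 7 (6–8) | 7.0 (1.6) | 2–9 | | | | |
| Judgment | 8 (6–8) | 7.3 (1.4) | 3–9 | 8 (7–9) | 7.5 (1.4) | 3–9 | 0.06 |
| Professionalism | 8 (7–9) | 7.4 (1.5) | 2–9 | 8 (7–9) | 7.6 (1.4) | 3–9 | 0.23 |
| Overall | 7 (6–8) | 7.1 (1.5) | 2–9 | 7 (6–8) | 7.4 (1.4) | 2–9 | 0.02 |

  • NOTE: Abbreviations: IQR, interquartile range; SD, standard deviation.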

Handoff Providers

A total of 343 evaluations of handoff providers were completed regarding 67 unique individuals. For each domain, scores spanned the full range from unsatisfactory to superior. The highest rated domain on the handoff provider evaluation tool was professionalism (median: 8; interquartile range [IQR]: 7–9). The lowest rated domain was content (median: 7; IQR: 6–8) (Table 1).
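The median and IQR summaries reported here can be reproduced in outline with the Python standard library. This is a sketch only: the helper name and the scores are invented, and quartile conventions differ between packages, so values from SAS may not match exactly.

```python
from statistics import median, quantiles

def median_and_iqr(scores):
    """Return (median, first quartile, third quartile) for a list of scores."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to compute quartiles")
    # The 'inclusive' method interpolates between observed values; other
    # quartile definitions can give slightly different cut points.
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return median(scores), q1, q3

# Invented scores for illustration only (not study data).
example_scores = [5, 6, 7, 7, 8, 8, 9, 9, 9]
summary = median_and_iqr(example_scores)
```

For these invented scores the summary is a median of 8 with an IQR of 7 to 9.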

Handoff Recipients

A total of 330 evaluations of handoff recipients were completed regarding 58 unique individuals. For each domain, scores spanned the full range from unsatisfactory to superior. The highest rated domain on the handoff recipient evaluation tool was professionalism, with a median of 8 (IQR: 7–9). The lowest rated domain was setting, with a median score of 7 (IQR: 6–9) (Table 1).

Validity Testing

Comparing provider scores to recipient scores, recipients received significantly higher scores for overall assessment (Table 1). Scores at UCM and Yale were similar in all domains for providers but were slightly lower at UCM in several domains for recipients (see Supporting Information, Appendix 2, in the online version of this article). Scores did not differ significantly by training level (Table 2). Third‐party external evaluators consistently gave lower marks for the same handoff than peer evaluators did (Table 3).

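Table 2. Handoff CEX Scores by Training Level, Providers Only

| Domain | NP/PA, N=33 | Subintern or Intern, N=170 | Resident, N=44 | Hospitalist, N=95 | P Value |
|---|---|---|---|---|---|
| Setting | 7 (2–9) | 7 (3–9) | 7 (4–9) | 7 (2–9) | 0.89 |
| Organization | 8 (4–9) | 7 (2–9) | 7 (4–9) | 8 (3–9) | 0.11 |
| Communication | 8 (4–9) | 7 (2–9) | 7 (4–9) | 8 (1–9) | 0.72 |
| Content | 7 (3–9) | 7 (2–9) | 7 (4–9) | 7 (2–9) | 0.92 |
| Judgment | 8 (5–9) | 7 (3–9) | 8 (4–9) | 8 (4–9) | 0.09 |
| Professionalism | 8 (4–9) | 7 (2–9) | 8 (3–9) | 8 (4–9) | 0.82 |
| Overall | 7 (3–9) | 7 (2–9) | 8 (4–9) | 7 (2–9) | 0.28 |

  • NOTE: Values are median (range). Abbreviations: NP/PA: nurse practitioner/physician assistant.

Table 3. Handoff CEX Scores by Peer Versus External Evaluators

| Domain | Provider: Peer, N=152 | Provider: Resident Supervisor, N=43 | Provider: External, N=147 | Provider: P Value | Recipient: Peer, N=145 | Recipient: Resident Supervisor, N=43 | Recipient: External, N=142 | Recipient: P Value |
|---|---|---|---|---|---|---|---|---|
| Setting | 8 (3–9) | 7 (3–9) | 7 (2–9) | 0.02 | 8 (2–9) | 7 (3–9) | 7 (2–9) | <0.001 |
| Organization | 8 (3–9) | 8 (3–9) | 7 (2–9) | 0.18 | 8 (3–9) | 8 (6–9) | 7 (2–9) | <0.001 |
| Communication | 8 (3–9) | 8 (3–9) | 7 (1–9) | <0.001 | 8 (3–9) | 8 (4–9) | 7 (2–9) | <0.001 |
| Content | 8 (3–9) | 8 (2–9) | 7 (2–9) | <0.001 | N/A | N/A | N/A | N/A |
| Judgment | 8 (4–9) | 8 (3–9) | 7 (3–9) | <0.001 | 8 (3–9) | 8 (4–9) | 7 (3–9) | <0.001 |
| Professionalism | 8 (3–9) | 8 (5–9) | 7 (2–9) | 0.02 | 8 (3–9) | 8 (6–9) | 7 (3–9) | <0.001 |
| Overall | 8 (3–9) | 8 (3–9) | 7 (2–9) | 0.001 | 8 (2–9) | 8 (4–9) | 7 (2–9) | <0.001 |

  • NOTE: Values are median (range). Abbreviations: N/A, not applicable.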

Spearman rank correlation coefficients among the CEX subdomains for provider scores ranged from 0.71 to 0.86, except for setting (Table 4). Setting was less well correlated with the other subdomains, with correlation coefficients ranging from 0.39 to 0.41. Correlations between individual domains and the overall rating ranged from 0.80 to 0.86, except setting, which had a correlation of 0.55. Every correlation was significant at P<0.001. Correlation coefficients for recipient scores were very similar to those for provider scores (see Supporting Information, Appendix 3, in the online version of this article).
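As a sketch of how a Spearman rank correlation between 2 subdomains is obtained, the following Python code ranks each set of scores (tied scores share the average rank, which matters on a 9‐point scale with many ties) and then computes the Pearson correlation of the ranks. The function names are hypothetical, and this is not the study's SAS code.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1  # mean of 1-based ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long lists with at least 2 scores")
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    mean_x, mean_y = mean(rank_x), mean(rank_y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    spread_x = sum((a - mean_x) ** 2 for a in rank_x)
    spread_y = sum((b - mean_y) ** 2 for b in rank_y)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return covariance / (spread_x * spread_y) ** 0.5
```

Calling `spearman_rho(organization_scores, communication_scores)` on 2 equally long lists of domain scores (hypothetical variable names) returns a value between −1 and 1.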

Table 4. Spearman Correlation Coefficients, Provider Evaluations (N=342)
  • NOTE: All P values <0.0001.


We analyzed 343 provider evaluations in the factor analysis; there were 6 missing values. The scree plot of eigenvalues did not support more than 1 factor; however, the rotated factor pattern for standardized regression coefficients for the first factor and the final communality estimates showed the setting component yielding smaller values than did other scale components (see Supporting Information, Appendix 4, in the online version of this article).

Reliability Testing

Weighted kappa scores for provider evaluations ranged from 0.28 (95% confidence interval [CI]: 0.01, 0.56) for setting to 0.59 (95% CI: 0.38, 0.80) for organization, and were generally higher for resident versus peer comparisons than for external versus peer comparisons. Weighted kappa scores for recipient evaluation were slightly lower for external versus peer evaluations, but agreement was no better than chance for resident versus peer evaluations (Table 5).

Table 5. Weighted Kappa Scores

| Domain | Provider: External vs Peer, N=144 (95% CI) | Provider: Resident vs Peer, N=42 (95% CI) | Recipient: External vs Peer, N=134 (95% CI) | Recipient: Resident vs Peer, N=43 (95% CI) |
|---|---|---|---|---|
| Setting | 0.39 (0.24, 0.54) | 0.28 (0.01, 0.56) | 0.34 (0.20, 0.48) | 0.48 (0.27, 0.69) |
| Organization | 0.43 (0.29, 0.58) | 0.59 (0.39, 0.80) | 0.39 (0.22, 0.55) | 0.03 (−0.23, 0.29) |
| Communication | 0.34 (0.19, 0.49) | 0.52 (0.37, 0.68) | 0.36 (0.22, 0.51) | 0.02 (−0.18, 0.23) |
| Content | 0.38 (0.25, 0.51) | 0.53 (0.27, 0.80) | N/A (N/A) | N/A (N/A) |
| Judgment | 0.36 (0.22, 0.49) | 0.54 (0.25, 0.83) | 0.28 (0.15, 0.42) | −0.12 (−0.34, 0.09) |
| Professionalism | 0.47 (0.32, 0.63) | 0.47 (0.23, 0.72) | 0.35 (0.18, 0.51) | −0.01 (−0.29, 0.26) |
| Overall | 0.50 (0.36, 0.64) | 0.45 (0.24, 0.67) | 0.31 (0.16, 0.48) | 0.07 (−0.20, 0.34) |

  • NOTE: Abbreviations: CI, confidence interval; N/A, not applicable.


In this study we found that an evaluation tool for direct observation of housestaff and hospitalists generated a range of scores and was well validated in the sense of performing similarly across 2 different institutions and among both trainees and attendings, while having high internal consistency. However, external evaluators gave consistently lower marks than peer evaluators at both sites, resulting in low reliability when comparing these 2 groups of raters.

It has traditionally been difficult to conduct direct evaluations of handoffs, because they may occur at haphazard times, in variable locations, and without very much advance notice. For this reason, several attempts have been made to incorporate peers in evaluations of handoff practices.[5, 39, 40] Using peers to conduct evaluations also has the advantage that peers are more likely to be familiar with the patients being handed off and might recognize handoff flaws that external evaluators would miss. Nonetheless, peer evaluations have some important liabilities. Peers may be unwilling or unable to provide honest critiques of their colleagues given that they must work closely together for years. Trainee peers may also lack sufficient clinical expertise or experience to accurately assess competence. In our study, we found that peers gave consistently higher marks to their colleagues than did external evaluators, suggesting they may have found it difficult to criticize their colleagues. We conclude that peer evaluation alone is likely an insufficient means of evaluating handoff quality.

Supervising residents gave marks very similar to those of intern peers, suggesting that they also are unwilling to criticize, are insufficiently experienced to evaluate, or alternatively, that the peer evaluations were reasonable. We suspect the latter is unlikely given that external evaluator scores were consistently lower than peer scores. One would expect the external evaluators to be biased toward higher scores given that they are not familiar with the patients and are not able to comment on inaccuracies or omissions in the sign‐out.

The tool appeared to perform less well in most cases for recipients than for providers, with a narrower range of scores and low weighted kappa scores. Although recipients play a key role in ensuring a high‐quality sign‐out by paying close attention, ensuring it is a bidirectional conversation, asking appropriate questions, and reading back key information, it may be that evaluators were unable to place these activities within the same domains that were used for the provider evaluation. An altogether different recipient evaluation approach may be necessary.[41]

In general, scores were clustered at the top of the score range, as is typical for evaluations. One strategy to spread out scores further would be to refine the tool by adding anchors for satisfactory performance, not just the extremes. A second approach might be to reduce the grading scale to only 3 points (unsatisfactory, satisfactory, superior) to force more scores to the middle. However, this approach might limit the discrimination ability of the tool.

We have previously studied the use of this tool among nurses. In that study, we also found consistently higher scores by peers than by external evaluators. We did, however, find a positive effect of experience, in which more experienced nurses received higher scores on average. We did not observe a similar training effect in this study. There are several possible explanations for the lack of a training effect. First, it is possible that the types of handoffs assessed played a role. At UCM, some assessed handoffs were night staff to day staff, which might be lower quality than day staff to night staff handoffs, whereas at Yale, all handoffs were day to night teams. Thus, average scores at UCM (primarily hospitalists) might have been lowered by the type of handoff provided. Second, given that hospitalist evaluations were conducted exclusively at UCM and housestaff evaluations exclusively at Yale, lack of difference between hospitalists and housestaff may also have been related to differences in evaluation practice or handoff practice at the 2 sites, not necessarily related to training level. Third, in our experience, attending physicians provide briefer, less comprehensive sign‐outs than trainees, particularly when communicating with equally experienced attendings; these sign‐outs may appropriately be scored lower on the tool. Fourth, the great majority of the hospitalists at UCM were within 5 years of residency and therefore not very much more experienced than the trainees. Finally, it is possible that skills do not improve over time given widespread lack of observation and feedback during training years for this important skill.

The high internal consistency of most of the subdomains and the loading of all subdomains except setting onto 1 factor are evidence of convergent construct validity, but also suggest that evaluators have difficulty distinguishing among components of sign‐out quality. Internal consistency may also reflect a halo effect, in which scores on different domains are all influenced by a common overall judgment.[42] We are currently testing a shorter version of the tool including domains only for content, professionalism, and setting in addition to overall score. The fact that setting did not correlate as well with the other domains suggests that sign‐out practitioners may not have or exercise control over their surroundings. Consequently, it may ultimately be reasonable to drop this domain from the tool, or alternatively, to refocus on the need to ensure a quiet setting during sign‐out skills training.

There are several limitations to this study. External evaluations were conducted by personnel who were not familiar with the patients, and they may therefore have overestimated the quality of sign‐out. Studying different types of physicians at different sites might have limited our ability to identify differences by training level. As is commonly seen in evaluation studies, scores were skewed to the high end, although we did observe some use of the full range of the tool. Finally, we were limited in our ability to test inter‐rater reliability because of the multiple sources of variability in the data (numerous different raters, with different backgrounds at different settings, rating different individuals).

In summary, we developed a handoff evaluation tool that was easily completed by housestaff and attendings without training, that performed similarly in a variety of different settings at 2 institutions, and that can in principle be used either for peer evaluations or for external evaluations, although peer evaluations may be positively biased. Further work will be done to refine and simplify the tool.


Disclosures: Development and evaluation of the sign‐out CEX was supported by a grant from the Agency for Healthcare Research and Quality (1R03HS018278‐01). Dr. Arora is supported by the National Institute on Aging (K23 AG033763). Dr. Horwitz is supported by the National Institute on Aging (K08 AG038336) and by the American Federation for Aging Research through the Paul B. Beeson Career Development Award Program. Dr. Horwitz is also a Pepper Scholar with support from the Claude D. Pepper Older Americans Independence Center at Yale University School of Medicine (P30AG021342 NIH/NIA). No funding source had any role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the article for publication. The content is solely the responsibility of the authors and does not necessarily represent the official views of the Agency for Healthcare Research and Quality, the National Institute on Aging, the National Institutes of Health, or the American Federation for Aging Research. Dr. Horwitz had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. An earlier version of this work was presented as a poster at the Society of General Internal Medicine Annual Meeting in Orlando, Florida on May 9, 2012. Dr. Rand is now with the Department of Medicine, University of Vermont College of Medicine, Burlington, Vermont. Mr. Staisiunas is now with the Law School, Marquette University, Milwaukee, Wisconsin. The authors declare they have no conflicts of interest.







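Handoff CEX scores by site of evaluation

| Domain | Provider: UCM, N=172, Median (Range) | Provider: Yale, N=170, Median (Range) | Provider: P‐value | Recipient: UCM, N=163, Median (Range) | Recipient: Yale, N=167, Median (Range) | Recipient: P‐value |
|---|---|---|---|---|---|---|
| Setting | 7 (2–9) | 7 (3–9) | 0.32 | 7 (2–9) | 7 (3–9) | 0.36 |
| Organization | 8 (2–9) | 7 (3–9) | 0.30 | 7 (2–9) | 8 (5–9) | 0.001 |
| Communication | 7 (1–9) | 7 (3–9) | 0.67 | 7 (2–9) | 8 (4–9) | 0.03 |
| Content | 7 (2–9) | 7 (2–9) | | N/A | N/A | N/A |
| Judgment | 8 (3–9) | 7 (3–9) | 0.60 | 7 (3–9) | 8 (4–9) | 0.001 |
| Professionalism | 8 (2–9) | 8 (3–9) | 0.67 | 8 (3–9) | 8 (4–9) | 0.35 |
| Overall | 7 (2–9) | 7 (3–9) | 0.41 | 7 (2–9) | 8 (4–9) | 0.005 |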



Spearman correlation, recipients (N=330)


All p values <0.0001



Factor analysis results for provider evaluations

Rotated Factor Pattern (Standardized Regression Coefficients) N=336

