The early identification of clinical deterioration among adult hospitalized patients remains a challenge.1 Delayed identification is associated with increased morbidity and mortality, unplanned intensive care unit (ICU) admissions, prolonged hospitalization, and higher costs.2,3 Earlier detection of deterioration using predictive algorithms of vital sign monitoring might avoid these negative outcomes.4 In this scoping review, we summarize current algorithms and their evidence.
Vital signs provide the backbone for detecting clinical deterioration. Early warning scores (EWS) and outreach protocols were developed to bring structure to the assessment of vital signs. Most EWS claim to predict clinical endpoints such as unplanned ICU admission up to 24 hours in advance.5,6 Reviews of EWS showed a positive trend toward reduced length of stay and mortality. However, conclusions about general efficacy could not be generated because of case heterogeneity and methodologic shortcomings.4,7 Continuous automated vital sign monitoring of patients on the general ward can now be accomplished with wearable devices.8 The first reports on continuous monitoring showed earlier detection of deterioration but not improved clinical endpoints.4,9 Since then, different reports on continuous monitoring have shown positive effects but concluded that unprocessed monitoring data per se falls short of generating actionable alarms.4,10,11
Predictive algorithms, which often use artificial intelligence (AI), are increasingly employed to recognize complex patterns or abnormalities and support predictions of events in big data sets.12,13 Especially when combined with continuous vital sign monitoring, predictive algorithms have the potential to expedite detection of clinical deterioration and improve patient outcomes. Predictive algorithms using vital signs in the ICU have shown promising results.14 The impact of predictive algorithms on the general wards, however, is unclear.
The aims of our scoping review were to explore the extent and range of and evidence for predictive vital signs–based algorithms on the adult general ward; to describe the variety of these algorithms; and to categorize effects, facilitators, and barriers of their implementation.15
MATERIALS AND METHODS
We performed a scoping review to create a summary of the current state of research. We used the five-step method of Levac and followed the Preferred Reporting Items for Systematic reviews and Meta-Analyses Extension for Scoping Reviews guidelines (Appendix 1).16,17
PubMed, Embase, and CINAHL databases were searched for English-language articles written between January 1, 2010, and November 20, 2020. We developed the search queries with an experienced information scientist, and we used database-specific terms and strategies for input, clinical outcome, method, predictive capability, and population (Appendix 2). Additionally, we searched the references of the selected articles, as well as publications citing these articles.
All studies identified were screened by title and abstract by two researchers (RP and YE). The selected studies were read in their entirety and checked for eligibility using the following inclusion criteria: automated algorithm; vital signs-based; real-time prediction; of clinical deterioration; in an adult, general ward population. In cases where there were successive publications with the same algorithm and population, we selected the most recent study.
For screening and selection, we used the Rayyan QCRI online tool (Qatar Computing Research Institute) and EndNote X9 (Clarivate Analytics). We extracted information using a data extraction form and organized it into four tables: descriptive characteristics of the selected studies (Table 1); input data, showing number of admissions, intermittent or continuous measurements, vital signs measured, and laboratory results (Appendix Table 1); study designs and settings (Appendix Table 2); and prediction performance (Table 2). We report characteristics of the populations and algorithms, prediction specifications such as area under the receiver operating characteristic curve (AUROC), and predictive values. Predictive values are affected by prevalence, which may differ among populations. To compare the algorithms, we calculated an indexed positive predictive value (PPV) and a number needed to evaluate (NNE) using a weighted average prevalence of clinical deterioration of 3.0%.
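The indexing step can be illustrated with a short sketch. It assumes that the indexed PPV is derived from a study's reported sensitivity and specificity by Bayes' rule at the fixed 3.0% prevalence, and that NNE is the reciprocal of PPV; the exact formula used for the review is not stated here, and the function names and example values are hypothetical.

```python
def indexed_ppv(sensitivity: float, specificity: float, prevalence: float = 0.03) -> float:
    """Return the PPV re-expressed at a common prevalence.

    Assumption: Bayes' rule applied to reported sensitivity and specificity.
    The default prevalence of 0.03 mirrors the 3.0% weighted average in the text.
    """
    if not (0.0 <= sensitivity <= 1.0 and 0.0 <= specificity <= 1.0):
        raise ValueError("sensitivity and specificity must lie in [0, 1]")
    if not (0.0 < prevalence < 1.0):
        raise ValueError("prevalence must lie strictly between 0 and 1")
    true_pos = sensitivity * prevalence                # fraction of all patients flagged correctly
    false_pos = (1.0 - specificity) * (1.0 - prevalence)  # fraction flagged incorrectly
    if true_pos + false_pos == 0.0:
        raise ValueError("the test never fires, so PPV is undefined")
    return true_pos / (true_pos + false_pos)


def number_needed_to_evaluate(ppv: float) -> float:
    """Return NNE = 1 / PPV: alarms to evaluate per true deterioration found."""
    if not (0.0 < ppv <= 1.0):
        raise ValueError("PPV must lie in (0, 1]")
    return 1.0 / ppv


if __name__ == "__main__":
    # Hypothetical algorithm: 50% sensitivity, 95% specificity.
    ppv = indexed_ppv(0.50, 0.95)
    print(f"indexed PPV = {ppv:.3f}, NNE = {number_needed_to_evaluate(ppv):.1f}")
```

Fixing the prevalence in this way removes one source of variation between study populations, although differences in endpoint definitions and prediction horizons remain.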
We defined clinical deterioration as endpoints, including rapid response team activation, cardiopulmonary resuscitation, transfer to an ICU, or death. Real-time was defined by the ability to automatically update predictions as new measurements are added. Predictions were defined as data-derived warnings for events in the near future. Prediction horizon was defined as the period for which a prediction is made. Special interest was given to algorithms that involved AI, which we defined as any form of machine learning or other nonclassical statistical algorithm.
Effects, facilitators, and barriers were identified and categorized using ATLAS.ti 8 software (ATLAS.ti) and evaluated by three researchers (RP, MK, and THvdB). These were categorized using the adapted frameworks of Gagnon et al18 for the barriers and facilitators and of Donabedian19 for the effects (Appendix 3).
The Gagnon et al framework was adapted by changing two of four domains—that is, “Individual” was changed to “Professional” and “Human” to “Physiology.” The domains of “Technology” and “Organization” remained unchanged. The Donabedian domains of “Outcome,” “Process,” and “Structure” also remained unchanged (Table 3).
We divided the studies into two groups: studies on predictive algorithms with and without AI when reporting on characteristics and performance. For the secondary aim of exploring implementation impact, we reported facilitators and barriers in a narrative way, highlighting the most frequent and notable findings.
As shown in the Figure, we found 1,741 publications, of which we read the full text of 109. There were 1,632 publications that did not meet the inclusion criteria. The publications by Churpek et al,20,21 Bartkowiak et al,22 Edelson et al,23 Escobar et al,24,25 and Kipnis et al26 reported on the same algorithms or databases but had significantly different approaches. For multiple publications using the same algorithm and population, we selected the most recent and incorporated the earlier findings.20,21,27-29 The resulting 21 papers are included in this review.
Descriptive characteristics of the studies are summarized in Table 1. Nineteen of the publications were full papers and two were conference abstracts. Most of the studies (n = 18) were from the United States; there was one study from South Korea,30 one study from Portugal,31 and one study from the United Kingdom.32 In 15 of the studies, there was a strict focus on general or specific wards; 6 studies also included the ICU and/or emergency departments.
Two of the studies were clinical trials, 2 were prospective observational studies, and 17 were retrospective studies. Five studies reported on an active predictive model during admission. Of these, 3 reported that the model was clinically implemented, using the predictions in their clinical workflow. None of the implemented studies used AI.
All input variables are presented in Appendix Table 1. In 10 of the studies, vital signs were combined with laboratory results; in 13 of the studies, vital signs were combined with patient characteristics. All of the studies used data derived from electronic medical records (EMRs), except for Bai et al33 and Hu et al,34 who used single-source waveforms directly from the bedside monitor. Three studies focused on continuous vital sign measurements.27,33,34

Most authors reported an AUROC to describe the predictive value of their algorithms. As shown in Table 2, AUROCs varied from 0.65 to 0.95, with indexed PPVs between 0.24 and 0.75. Sensitivity ranged from 7.2% to 52.5% in non-AI models and up to 82.4% in AI models. Prediction definitions, horizons, and the reported metrics differed too much to directly compare studies.
The non-AI algorithm prediction horizons ranged from 4 to 24 hours, with a median of 24 hours (interquartile range [IQR], 12-24 hours). The AI algorithms ranged from 2 to 48 hours and had a median horizon of 14 hours (IQR, 12-24 hours).
We found three studies reporting patient outcomes. The most recent of these was a large multicenter implementation study by Escobar et al25 that included an extensive follow-up response. This study reported a significantly decreased 30-day mortality in the intervention cohort. A smaller randomized controlled trial reported no significant differences in patient outcomes with earlier warning alarms.27 A third study reported more appropriate rapid response team deployment and decreased mortality in a subgroup analysis.35
Effects, Facilitators, and Barriers
As shown in the Appendix Figure and further detailed in Table 3, the described effects were predominantly positive—57 positive effects vs 11 negative effects. These positive effects sorted primarily into the outcome and process domains.
All of the studies that compared their proposed model with one of various warning systems (eg, EWS, National Early Warning Score [NEWS], Modified Early Warning Score [MEWS]) showed superior performance (based on AUROC and reported predictive values). In 17 studies, the authors reported their model as more useful or superior to the EWS.20-23,26-28,34,36-41 Four studies reported real-time detection of deterioration before regular EWS,20,26,42 and three studies reported positive effects on patient-related outcomes.26,35 Four negative effects were noted on the controllability, validity, and potential limitations.27,42
There were 26 positive effects on the clinical process mentioned, 7 of which pointed out the effects of earlier, predictive alarming. Algorithms with higher PPVs reported greater rates of actionable alarms, less alarm fatigue, and improved workflow.21,22,24-27,30,32,33,35-38,40 Potential alarm fatigue was named as a barrier.27,42 Smoother scoring instead of binned categories was mentioned positively.24,26

In the infrastructure domain, very few items were reported. The increased need for education on the techniques used was reported once as a negative effect.34 One of the positive infrastructural effects noted was more efficient planning and use of resources.24,37,40

We identified 57 facilitators and 48 barriers for the clinical implementation and use of real-time predictive analytics (Appendix Figure). In the Technology domain, there were 18 facilitators and 20 barriers cited, and in the Organization domain, 25 and 14, respectively. Facilitators and barriers were almost equally represented in the Professional and Physiology domains (6 vs 5 and 8 vs 9, respectively).
Of the 38 remarks in the Technology domain, difficulty with implementation in daily practice was a commonly cited barrier.22,24,40,42 Difficulties included creating real-time data feeds out of the EMR, though there were mentions of some successful examples.25,27,36 Difficulty in the interpretability of AI was also considered a potential barrier.30,32,33,35,39,41 There were remarks as to the applicability of the prolonged prediction horizon because of the associated decoupling from the clinical view.39,42

Conservative attitudes toward new technologies and inadequate knowledge were mentioned as barriers.39 Repeated remarks were made on the difficulty of interpreting and responding to a predicted escalation, as the clinical pattern might not be recognizable at such an early stage. On the other hand, it is expected that less invasive countermeasures would be adequate to avert further escalation. Earlier recognition of possible escalations also raised potential ethical questions, for instance, when to discuss palliative care.24
The heterogeneity of the general ward population and the relatively low prevalence of deterioration were mentioned as barriers.24,30,38,41 There were also concerns that not all escalations are preventable and that some patient outcomes may not be modifiable.24,38

Many investigators expected reductions in false alarms and associated alarm fatigue (reflected as higher PPVs). Furthermore, they expected workflow to improve and workload to decrease.21,23,27,31,33,35,38,41 Despite the capacity of modern EMRs to store large amounts of patient data, some investigators felt improvements to real-time access, data quality and validity, and data density are needed to ensure valid associated predictions.21,22,24,32,37
As the complexity and comorbidity of hospitalized adults grow, predicting clinical deterioration is becoming more important. With an ever-increasing amount of available patient data, real-time algorithms can predict the patient’s clinical course with increasing accuracy, positively affecting outcomes.4,21,25,43 The studies identified in this scoping review show, as measured by higher AUROC scores and improved PPVs, that predictive algorithms can outperform more conventional EWS, enable earlier and more efficient alarming, and be successfully implemented on the general wards. However, formal meta-analysis was made infeasible by differences in populations, endpoint definitions, cut-off points, and prediction horizons, as well as other methodologic heterogeneity.
There are several important limitations across these studies. In a clinical setting, these models would function as a screening test. Almost all studies report an AUROC; however, sensitivity and PPV or NNE (defined as 1/PPV) may be more useful than AUROC when predicting low-frequency events with high-potential clinical impact.44 Assessing the NNE is especially relevant because of its relation to alarm fatigue and responsiveness of clinicians.43 Alarm fatigue and lack of adequate response to alarms were repeatedly cited as potential barriers for application of automated scores. A more useful metric might be NNE over a certain timeframe and across a specified number of patients to more clearly reflect associated workload. Future studies should include these metrics as indicators of the usability and clinical impact of predictive models. This review could not assess PPV or NNE systematically due to inconsistencies in the reporting of these metrics.
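As a rough illustration of the workload-oriented metric suggested above, the sketch below converts a PPV into expected alarm counts for a ward over a given time window. The ward size, alarm rate per patient-day, and all names are hypothetical inputs chosen for illustration; none are taken from the reviewed studies.

```python
from dataclasses import dataclass


@dataclass
class AlarmWorkload:
    alarms: float        # expected total alarms in the window
    true_alarms: float   # alarms followed by actual deterioration
    false_alarms: float  # alarms that still require evaluation
    nne: float           # number needed to evaluate, 1 / PPV


def alarm_workload(n_patients: int, days: float,
                   alarms_per_patient_day: float, ppv: float) -> AlarmWorkload:
    """Estimate alarm workload for a ward over a time window.

    Assumption: alarms occur at a constant average rate per patient-day
    and the PPV applies uniformly to every alarm.
    """
    if n_patients < 0 or days < 0 or alarms_per_patient_day < 0:
        raise ValueError("counts, durations, and rates must be non-negative")
    if not (0.0 < ppv <= 1.0):
        raise ValueError("PPV must lie in (0, 1]")
    alarms = n_patients * days * alarms_per_patient_day
    true_alarms = alarms * ppv
    return AlarmWorkload(alarms, true_alarms, alarms - true_alarms, 1.0 / ppv)


if __name__ == "__main__":
    # Hypothetical 30-bed ward, one week, 0.1 alarms per patient-day, PPV of 0.25.
    print(alarm_workload(30, 7, 0.1, 0.25))
```

Expressing alarms per ward per week alongside NNE makes the staffing implication of a given PPV explicit, which a bare AUROC does not.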
Although the results of our scoping review are promising, there are limited data on clinical outcomes using these algorithms. Only three of the five algorithms that were active during admission were used to guide clinical decision-making.25,27,35 Kollef et al27 showed shorter hospitalizations and Evans et al35 found decreased mortality rates in a multimorbid subgroup. Escobar et al25 found an overall and consistent decrease in mortality in a large, heterogeneous population of inpatients across 21 hospitals. While Escobar et al’s findings provide strong evidence that predictive algorithms and structured follow-up on alarms can improve patient outcomes, the authors recognize that not all facilities will have the resources to implement them.25 Dedicated round-the-clock follow-up of alarms has yet to be proven feasible for smaller institutions, and leaner solutions must be explored. The example set by Escobar et al25 should be translated into various settings to prove its reproducibility and to substantiate the clinical impact of predictive models and structured follow-up.
According to expert opinion, the use of high-frequency or continuous monitoring at low-acuity wards and AI algorithms to detect trends and patterns will reduce failure-to-rescue rates.4,9,43 However, most studies in our review focused on periodic spot-checked vital signs, and none of the AI algorithms were implemented in clinical care (Appendix Table 1). A significant barrier to implementation was uncertainty surrounding how to react to generated alarms.9,45 As algorithms become more complex and predict earlier, interpretability and causality in general can diminish, and the response to this type of alarm will be different from that of an acute warning from an EWS.
The assessment of predictive algorithm protocols must include their impact on clinical workflow, workload, and resource utilization. Earlier detection of deterioration can potentially allow coordinated alarm follow-up and lead to more efficient use of resources.20,21,31,43,46,47
Greater numbers of variables do not always improve the quality of monitoring. For example, in one study, an algorithm combining only heart rate, respiration rate, and age outperformed an EWS that tracked six vital sign measures.23 Algorithms using fewer variables may facilitate more frequent, less complex, and less error-sensitive monitoring. Leaner measurements may also lead to higher patient and clinician acceptance.43,45

The end goal of implementing predictive algorithms on the general ward is to provide timely, reliable, and actionable clinical decision support.43 As shown in a recent study by Blackwell et al,48 multiple prediction models for specific clinical events may increase interpretability and performance. Disease-specific algorithms may complement general algorithms for clinical deterioration and enhance overall performance.
STRENGTHS AND LIMITATIONS
We performed a comprehensive review of the current literature using a clear and reproducible methodology to minimize the risk of missing relevant publications. The identified research is mainly limited to large US centers and consists of mostly retrospective studies. Heterogeneity among inputs, endpoints, time horizons, and evaluation metrics makes comparisons challenging. Comments on facilitators, barriers, and effects were limited. Positive publication bias may have led to overrepresentation of models showing clinical benefit.
RECOMMENDATIONS FOR FUTURE RESEARCH
Artificial intelligence and the use of continuous monitoring hold great promise in creating optimal predictive algorithms. Future studies should directly compare AI- and non-AI-based algorithms using continuous monitoring to determine predictive accuracy, feasibility, costs, and outcomes. A consensus on endpoint definitions, input variables, methodology, and reporting is needed to enhance reproducibility, comparability, and generalizability of future research. The current research is limited to a few research groups, predominantly in the United States. More international research could enhance validity and increase applicability across varying populations and settings. Greater collaboration would accelerate research and enhance short-cycled continuous improvements. Sharing databases with different populations, variables, and outcomes, such as the Medical Information Mart for Intensive Care database,49 could help develop, test, and compare models and contribute to consensus in data standardization and consistent reporting of results. Studies should be designed to determine clinical, societal, and economic effects in accordance with the Quadruple Aim principle.50 Successful implementation will depend not only on improved patient outcomes but also on cost-effectiveness, robust statistics, and end-user acceptance. Follow-up protocols and workflows also should be studied and optimized.
Predictive analytics based on vital sign monitoring can identify clinical deterioration at an earlier stage and can do so more accurately than conventional EWS. Implementation of such monitoring can simultaneously decrease alarm-related workload and enhance the efficiency of follow-up. While there is also emerging evidence of associated mortality reduction, it may be too soon to know how representative these findings are. The current literature is limited by heterogeneity across populations studied, monitoring frequency, definitions of deterioration, and clinical outcomes. Consensus is therefore needed to better compare tools and harmonize results. Artificial intelligence and continuous monitoring show great promise in advancing the field; however, additional studies to assess cost, actionability of results, and end-user acceptance are required.