Has your department been relying on clinical gestalt to risk stratify patients? Perhaps you are using something you heard or read about in a short paper (or on Twitter). Maybe something like post-exertional oxygenation has caught your eye and, well, it sure sounds like a good idea...
Determining who is sick, who is going to deteriorate, and who may die have been burning questions for front line physicians since this pandemic began. Nearly as quickly as we asked the question, the purported answers started pouring in. Risk prediction algorithms for COVID-19 have to date been unsuitable for clinical practice, due to a mixture of issues with the creation of the algorithm (derivation), the testing of it (validation) and the small samples being used.
Wynants et al (1) continue to publish a fantastic living systematic review of COVID-19 prediction models, and to date have not found any without a high risk of bias. However, a recently proposed model by Knight et al (2) is yet to be included in this review, and may buck the trend. Hopefully it goes a long way towards answering the question: who may die?
What did they do?
Knight et al used a dataset called ISARIC to build a multivariable model to predict in-hospital mortality for patients in whom COVID-19 was highly suspected. The dataset was split into a derivation (creation) dataset consisting of 35,463 patients from England, Wales and Scotland, and a testing (validation) dataset that consisted of a further 22,361 patients.
Using pre-selected candidate variables they developed a logistic regression and a machine learning model. Then these two models were compared to pre-existing pneumonia and COVID-19 models in the validation dataset. Knight et al created a scenario to compare the best of logistic regression, machine learning and the pre-existing models.
There is more detail in the paper describing how the candidate variables were selected for analysis, what determined which variables were included in the model, and how missing data were handled. It is important but niche, so I am not going to go into detail here, except to say it was relatively robust.
What did they make?
The logistic regression model was converted into a scaled scoring system giving points for each variable, e.g. 1 for male and 0 for female. The total score was then used to allocate patients to a risk group. They included age, sex, number of comorbidities according to the Charlson index (plus obesity), respiratory rate, peripheral O2 saturation, GCS, urea and CRP.
|Variable||Points|
|Sex at birth – Female||0|
|No. of comorbidities 0||0|
|Resp Rate (breaths per min) <20||0|
|Peripheral Oxygen saturation on room air >=92%||0|
|GCS = 15||0|
|Urea (mmol/L) ≤7||0|
|Urea (mmol/L) 7 – 14||1|
|CRP (mg/dL) <50||0|
The suggested risk groups were:
|Risk group||Score total||Proportion of patients||Mortality|
|Low||0-3||7.6 %||1.7 %|
|Intermediate||4-8||23.3 %||9.1 %|
|High||9-14||51.0 %||34.9 %|
|Very High||>=15||18.4 %||66.2 %|
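To make the allocation step concrete, here is a minimal sketch that maps a total score to the risk groups in the table above. The function name `risk_group` and the lookup `MORTALITY_PCT` are my own labels for illustration, not something from the paper; the cut-offs and mortality figures are copied straight from the table.

```python
def risk_group(total_score: int) -> str:
    """Map a total score to the risk group bands quoted in the table above."""
    if total_score < 0:
        raise ValueError("score cannot be negative")
    if total_score <= 3:
        return "Low"
    if total_score <= 8:
        return "Intermediate"
    if total_score <= 14:
        return "High"
    return "Very High"


# Mortality per group (per cent), copied from the table above.
MORTALITY_PCT = {"Low": 1.7, "Intermediate": 9.1, "High": 34.9, "Very High": 66.2}

for score in (2, 8, 9, 15):
    group = risk_group(score)
    print(score, group, MORTALITY_PCT[group])
```

Note that the group mortality is an average across everyone in the band, so it is a rough guide rather than an individual prediction.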
The area under the curve for this logistic regression model was 0.786 (95% CI 0.781 to 0.790), compared with 0.796 (95% CI 0.786 to 0.807) for the machine learning model, in the derivation dataset. The best-in-class machine learning model was marginally better; however, it is not feasible for widespread clinical practice, and the authors only offer it as a comparator. We are not missing out on much, as this simple logistic regression model, which can be counted on your hands, is just about as good as the best machine learning had to offer.
How did they test it?
The authors split their available data and used 22,361 cases to validate and test their model. They found similar discrimination to that seen in the derivation data: an AUROC of 0.767 (95% CI 0.760 to 0.773) vs 0.786 (95% CI 0.781 to 0.790). Performance tends to drop in a validation dataset, but this small drop implies that the model is stable, not over-fitted and therefore more generalisable (at least to UK patients, where the model has been developed and tested).
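If the AUROC feels abstract, it helps to see what it measures: the chance that a randomly chosen patient who died had a higher score than a randomly chosen survivor. The sketch below computes it by brute force. The function `auroc` and the toy data are made up for illustration (they are NOT ISARIC data), and this is not the authors' code.

```python
def auroc(scores, died):
    """Probability that a randomly chosen patient who died scored higher
    than a randomly chosen survivor (ties count as half a win)."""
    pos = [s for s, d in zip(scores, died) if d]
    neg = [s for s, d in zip(scores, died) if not d]
    if not pos or not neg:
        raise ValueError("need at least one death and one survivor")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up toy data: total scores and outcomes (1 = died, 0 = survived).
toy_scores = [2, 5, 7, 9, 11, 12, 15, 17]
toy_died = [0, 0, 0, 1, 0, 1, 1, 1]

print(auroc(toy_scores, toy_died))  # 0.9375 (15 of 16 pairs ranked correctly)
```

An AUROC of 0.5 is a coin toss and 1.0 is perfect ranking, which puts the 0.767 above in context.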
The authors did not stop there: they also compared their new model to a series of other risk scores. Again, the 4C mortality score came out best.
|Risk Tool||AUROC (95%CI)|
|SOFA||0.614 (0.530 to 0.698)|
|qSOFA||0.622 (0.615 to 0.630)|
|Surgisphere||0.630 (0.622 to 0.639)|
|SMARTCOP||0.645 (0.593 to 0.697)|
|NEWS||0.654 (0.645 to 0.662)|
|DL||0.669 (0.660 to 0.678)|
|SCAP||0.675 (0.620 to 0.729)|
|CRB65||0.683 (0.676 to 0.691)|
|COVID-GRAM||0.706 (0.675 to 0.736)|
|Xie score||0.718 (0.710 to 0.725)|
|A-DROP||0.736 (0.728 to 0.744)|
|PSI||0.736 (0.683 to 0.790)|
|E-CURB65||0.764 (0.740 to 0.788)|
|4C Mortality Score||0.774 (0.767 to 0.782)|
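One quick way to read this table is to ask which confidence intervals overlap with that of the 4C score. The sketch below does that with the numbers copied from the table; the names `RESULTS` and `overlaps` are mine. Overlap is only a crude eyeball check, not a formal statistical comparison, so treat the result as a prompt for caution rather than a verdict.

```python
# (AUROC, lower 95% CI, upper 95% CI), copied from the table above.
RESULTS = {
    "SOFA": (0.614, 0.530, 0.698),
    "qSOFA": (0.622, 0.615, 0.630),
    "Surgisphere": (0.630, 0.622, 0.639),
    "SMARTCOP": (0.645, 0.593, 0.697),
    "NEWS": (0.654, 0.645, 0.662),
    "DL": (0.669, 0.660, 0.678),
    "SCAP": (0.675, 0.620, 0.729),
    "CRB65": (0.683, 0.676, 0.691),
    "COVID-GRAM": (0.706, 0.675, 0.736),
    "Xie score": (0.718, 0.710, 0.725),
    "A-DROP": (0.736, 0.728, 0.744),
    "PSI": (0.736, 0.683, 0.790),
    "E-CURB65": (0.764, 0.740, 0.788),
    "4C Mortality Score": (0.774, 0.767, 0.782),
}


def overlaps(a, b):
    """True if two (estimate, lower, upper) intervals share any values."""
    return a[1] <= b[2] and b[1] <= a[2]


reference = RESULTS["4C Mortality Score"]
overlapping = [
    name
    for name, ci in RESULTS.items()
    if name != "4C Mortality Score" and overlaps(ci, reference)
]
print(overlapping)  # ['PSI', 'E-CURB65']
```

The wide intervals for PSI and E-CURB65 suggest those scores could be calculated in far fewer patients, which is worth keeping in mind when reading the ranking.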
Room for improvement…
This has the potential to be the best COVID-19 risk prediction tool to date in terms of results and size of dataset used to develop (and test) the model. However, whilst there are always going to be areas for improvement, there is one large potential issue.
Is it too good? Normally when a model is created and then tested in separate cohorts (derived and validated), you almost always see a substantial drop in the predictive power of the model. This did not happen with 4C and the split ISARIC dataset. I am not implying malice: this could be a model that bucks the trend and is fantastic, or it could be an error in the code. The latter would cast the validity of the whole model into question. Either way, according to a letter from expert modelling statisticians (3), it needs checking. This ought to be quick; hopefully Knight et al will release the code to an independent checker and it will all be cleared up. In the meantime, the letter in the BMJ expressing these concerns is very much worth a read.
Other minor areas for improvement
1. They split the data. There is an argument for not splitting your data into a derivation and validation dataset. Instead, use the whole dataset in its full strength to derive the algorithm, then estimate and correct for overfitting with bootstrap internal validation, which can also be used to derive a shrinkage factor.
2. Might not be a fair comparison. The authors’ comparison with other scores may not be as robust as it appears; the number of complete cases used to calculate each score varied from 197 to 19,361, and they didn’t have the data available to test some well-known scores such as APACHE II.
3. Missed some variables. The authors could only derive the algorithm with what they had, and ISARIC did not collect data on variables which are now being considered in other models such as post-exertional oxygenation.
4. Using an inpatient cohort to make front door decisions. The algorithm was derived from a large inpatient cohort, but the authors state that the outcome could be used to make the decision to admit or send home from the front door. What if the patients who are low risk by this new score were only surviving because they were given oxygen in hospital? I think care is needed before the model is used in practice to make front door decisions.
5. Continuous variables made categorical. Age is a continuous variable, but if you use it in a model as a categorical variable you lose information and predictive power. For example, age in this model is included as 50-59 = 2 points and 60-69 = 4 points. However, nothing dramatically changes on someone’s 60th birthday: age is a continuum. Splitting age into groups like this means that the model might be overestimating risk for a 50 year old and underestimating it for a 59 year old.
6. Simplification for clinical use. The algorithm is in reality a large equation, but for use on the shop floor it was converted into an integer score so it was more usable. This sacrifices predictive performance for simplicity. However, given how ubiquitous computers are, maybe we could have coped with the complicated version. For example, it would be relatively easy to create an app.
7. External validation is the final accreditation step for clinical algorithms, but this has not yet been undertaken for the 4C model. Ideally we would get external validation from a UK cohort prior to implementation, but such a cohort probably does not exist. ISARIC captured such a large proportion of the cases that any other study active at the same time may easily have co-enrolled the same patients, thereby confounding any external validation. The best we may get is an external validation from another country with a similar healthcare system and demographic. However, would that overrule the UK use case in such a large local dataset? I don’t think so.
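For those curious about point 1, the sketch below shows the general idea of bootstrap internal validation: refit the model on resampled copies of the whole dataset, see how much better each refit looks on its own resample than on the original data (the "optimism"), and subtract the average optimism from the apparent performance. Everything here is an assumption for illustration: the data are simulated, the "model" is a deliberately crude stand-in (weights equal to the difference in predictor means between deaths and survivors) rather than logistic regression, and none of it is the method used by Knight et al.

```python
import random


def auroc(scores, outcomes):
    """Chance that a random death outranks a random survivor (ties = 0.5)."""
    pos = [s for s, y in zip(scores, outcomes) if y]
    neg = [s for s, y in zip(scores, outcomes) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit(rows):
    """Toy model: weight each predictor by its mean difference between groups."""
    died = [x for x, y in rows if y]
    lived = [x for x, y in rows if not y]
    n_vars = len(rows[0][0])
    return [
        sum(x[j] for x in died) / len(died) - sum(x[j] for x in lived) / len(lived)
        for j in range(n_vars)
    ]


def performance(weights, rows):
    """AUROC of the weighted sum of predictors in the given rows."""
    scores = [sum(w * v for w, v in zip(weights, x)) for x, _ in rows]
    return auroc(scores, [y for _, y in rows])


def optimism_corrected_auroc(rows, n_boot=200, seed=0):
    """Return (apparent AUROC, optimism-corrected AUROC)."""
    rng = random.Random(seed)
    apparent = performance(fit(rows), rows)
    optimism = []
    while len(optimism) < n_boot:
        sample = rng.choices(rows, k=len(rows))  # resample with replacement
        if len({y for _, y in sample}) < 2:
            continue  # skip resamples missing deaths or survivors
        weights = fit(sample)
        optimism.append(performance(weights, sample) - performance(weights, rows))
    return apparent, apparent - sum(optimism) / len(optimism)


# Simulated data: one informative predictor and two pure-noise predictors.
data_rng = random.Random(42)
rows = []
for _ in range(80):
    x = (data_rng.gauss(0, 1), data_rng.gauss(0, 1), data_rng.gauss(0, 1))
    y = 1 if x[0] + data_rng.gauss(0, 1) > 0 else 0
    rows.append((x, y))

apparent, corrected = optimism_corrected_auroc(rows)
print(round(apparent, 3), round(corrected, 3))
```

The appeal of this approach is that no patients are held back: every case contributes to deriving the model, and the resampling gives an estimate of how much the headline performance flatters it.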
Global health perspective
For a paper that includes the World Health Organization (WHO) in its title, this risk prediction tool is very unlikely to share similar application in low- and middle-income countries (LMICs). COVID-19 is the closest high-income countries (HICs) have come to experiencing the level of powerlessness healthcare providers in LMICs have experienced for decades. In response, COVID-19 research has been prioritised, funded and made more accessible than any other infectious disease research topic since the Ebola outbreak of 2014. Given that LMICs make up 83% of the global population, it is disappointing that the same could not have been achieved for, say, malaria, tuberculosis, HIV and injury research a long time ago. But I digress…
Given the invariably damaged chain of survival in most LMICs (and less resourced parts of some HICs), a COVID prediction tool should be capable of predicting the resources that patients will require for their care. This will in turn assist clinicians to better allocate already limited resources.
It is worth bearing in mind that universal healthcare (such as provided by the NHS) is not a global standard. In LMICs (and some HICs) the cost of admission to healthcare centres may severely reduce the disposable income of many relatives, friends and others. Add the disproportionately poor outcomes recorded in some LMIC healthcare settings and you find yourself in a high stakes, low odds poker game. Not great.
Physiological scoring is widely accessible in LMICs, except for saturation monitoring, which is more variable. Triage tools such as the South African Triage Scale are based on physiological scoring and perform well even in novice hands. The APGAR score used in maternity is another example. So, scoring would not be the problem. What would be challenging would be access to laboratory tests: urea and CRP testing would not be as accessible everywhere as it ought to be.
Where to from here for LMICs on risk prediction? One could remove the lab variables, but that would affect the accuracy and validity. The alternative is deriving a bespoke risk prediction tool using similar regression methodology but with LMIC data. This would likely have to be pooled data from various sources. Societies such as the African Federation for Emergency Medicine are likely well-placed to help with the logistics of such a project.
A tool identified as having WHO involvement should ideally cater for more than 17% of the global population. Otherwise, with all the best intentions, we’re not in this together.
This has the potential to be the best risk prediction score out there to date:
1) it has good discriminatory performance
2) it uses one of the best and largest COVID-19 datasets in the world
3) it has been thoroughly internally validated
The 4C model could guide inpatient clinical care, such as whether to admit to critical care and also recalibrate our clinical gestalt as to how sick the ‘happy hypoxic’ patient is in front of us.
However, we need to see how the authors reply to the issues raised, particularly because an independent validation study of similar power is not likely, and this may be all we have to go on.
Is this algorithm suitable for clinical practice? No, but it is so close!
Dr Stevan R. Bruijns, Editor-in-Chief, African Journal of Emergency Medicine, for the global health perspective
Dr Glen Martin, Lecturer in Health Data Sciences from the University of Manchester who gave his expert advice regarding the limitations of this paper
1) Wynants L, Van Calster B, Bonten MM, Collins GS, Debray TP, De Vos M, Haller MC, Heinze G, Moons KG, Riley RD, Schuit E. Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal. BMJ. 2020;369.
2) Knight SR, Ho A, Pius R, Buchan I, Carson G, Drake TM, Dunning J, Fairfield CJ, Gamble C, Green CA, Gupta R. Risk stratification of patients admitted to hospital in the United Kingdom with covid-19 using the ISARIC WHO Clinical Characterisation Protocol: development and validation of a multivariable prediction model for mortality. BMJ. 2020 Aug 25.
3) Riley R, Collins G, van Smeden M, Snell K, Van Calster B, Wynants L. Rapid Response: Is the 4C Mortality Score fit for purpose? Some comments and concerns. BMJ [Internet]. 2020 Sep 15 [cited 2020 Sep 15];370:m3339. Available from: https://www.bmj.com/content/370/bmj.m3339/rr-3