Learning From Biological and Computational Machines: Importance of SARS-CoV-2 Genomic Surveillance, Mutations and Risk Stratification

The global coronavirus disease 2019 (COVID-19) pandemic has demonstrated the range of disease severity and pathogen genomic diversity arising from a single virus, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). This diversity in disease manifestations and genomic mutations has challenged healthcare management and resource allocation during the pandemic, especially in countries with a large population base, such as India. Here, we take a combinatorial approach to examining this diagnostic and genomic diversity in order to extract meaningful information from the chaos of COVID-19 in the Indian context. Using statistical correlation, machine learning (ML), and genomic sequencing on a clinically comprehensive patient dataset, with corresponding samples from patients with and without respiratory support, we highlight specific significant diagnostic parameters and ML models for assessing the risk of developing severe COVID-19. This information is further contextualized against the SARS-CoV-2 genomic features of the cohort for monitoring pathogen genomic evolution. Analysis of patient demographic features and symptoms revealed that age, breathlessness, and cough were significantly associated with severe disease; at the same time, no patient with severe disease reported an absence of physical symptoms. Among the biochemical/biophysical diagnostic parameters, respiratory rate, total leukocyte count (TLC), blood urea levels, and C-reactive protein (CRP) levels were directly correlated with the probability of developing severe disease. Of the five ML algorithms tested to predict patient severity, the multi-layer perceptron-based model performed best, with a receiver operating characteristic (ROC) score of 0.96 and an F1 score of 0.791. The SARS-CoV-2 genomic analysis highlighted a set of mutations with global frequency flips and subsequent incorporation into variants of concern (VOCs) and variants of interest (VOIs), which can be further monitored and annotated for functional significance. In summary, our findings highlight the importance of SARS-CoV-2 genomic surveillance and statistical analysis of clinical data for developing a risk assessment ML model.
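The abstract reports model quality as an ROC score and an F1 score. As a minimal sketch of how two such metrics are commonly computed for a binary severe/non-severe classifier, the Python below implements both from scratch. It assumes that "ROC score" refers to the area under the ROC curve, which the abstract does not state explicitly. The function names, the 0.5 decision threshold, and the toy labels and scores are all hypothetical illustrations, not the authors' code or data.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class (1 = severe)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Area under the ROC curve via its rank interpretation.

    Equals the probability that a randomly chosen positive case receives
    a higher score than a randomly chosen negative case (ties count half).
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present to compute ROC AUC.")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy, made-up example (NOT data from the study): 1 = severe, 0 = non-severe.
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]          # hypothetical model probabilities
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # assumed 0.5 decision threshold

print(f"F1 = {f1_score(y_true, y_pred):.3f}, ROC AUC = {roc_auc(y_true, y_score):.3f}")
```

The two metrics answer different questions. ROC AUC summarizes how well the scores rank severe above non-severe patients across all thresholds, while F1 depends on one chosen threshold and on class balance. This may explain how a model can show a high ROC score alongside a noticeably lower F1 score.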

All Keywords
【Author keywords】 COVID-19, SARS-CoV-2, machine learning, Genomic surveillance, risk stratification, healthcare
【Abstract keywords】 coronavirus disease, Evolution, Coronavirus disease 2019, coronavirus, pandemic, Mutation, severe COVID-19, disease severity, machine learning, Genomic surveillance, India, diagnostic, variants of concern, C-reactive protein, risk, Symptom, severe acute respiratory syndrome Coronavirus, cough, virus, variants, Risk assessment, Probability, Cohort, ROC, pathogen, management, Algorithm, VOCs, Patient, variants of interest, age, Genomic analysis, dataset, Severe patient, VOIs, respiratory, correlation, information, characteristic, genomic, resource, parameters, predict, genomic sequencing, blood urea, Frequency, severe disease, Respiratory Support, statistical analysis, respiratory rate, best, urea, receiver operating characteristic, C-reactive protein (CRP), physical symptoms, leukocyte, singular, genomic mutations, breathlessness, acute respiratory syndrome, acute respiratory syndrome coronavirus, acute respiratory syndrome coronavirus 2, Clinical data, ML model, disease manifestation, parameter, SARS-CoV-2 genomic surveillance, ML models, patient severity, approach, country, FIVE, feature, statistical, highlight, tested, performed, develop, significantly, the patient, clinically, functional, absence, demonstrated, correlated, Observing
【Title keywords】 learning, Biological, Importance
