Abstract
The rapid, exponential growth of COVID-19 infections and their catastrophic effects on patients' health have created a need for tools that help health systems diagnose the disease and assess its prognosis quickly and efficiently. In this context, the present study aims to identify potential factors associated with COVID-19 infection using machine learning techniques. Random forest, chi-squared, XGBoost, and rpart were used for feature selection, and ROSE and SMOTE were used as resampling methods to address class imbalance. Machine and deep learning algorithms such as support vector machines, C4.5, random forest, rpart, and deep neural networks were then explored during the train/test phase to select the best prediction model. The dataset contains clinical data, anthropometric measurements, and other health parameters related to smoking habits, alcohol consumption, quality of sleep, physical activity, and health status during the confinement imposed by the COVID-19 pandemic. The results showed that XGBoost selected the features most strongly associated with COVID-19 infection, and that random forest yielded the best predictive model, with a balanced accuracy of 90.41% using SMOTE as the resampling technique. The best-performing model provides a tool to help prevent SARS-CoV-2 infection, since it identifies the variables that carry the highest risk, some of which are, to a certain extent, controllable.
Keywords: COVID-19; feature selection; imbalanced data; machine learning; predictive model.
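The abstract relies on two ideas that a short sketch can make concrete: balanced accuracy as the evaluation metric under class imbalance, and SMOTE as a resampling step that synthesizes minority-class samples. The following is a minimal pure-Python illustration of the standard definitions, not the authors' implementation; the function names `balanced_accuracy` and `smote`, their parameters, and the toy data are assumptions introduced here for illustration only.

```python
import math
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; for two classes this equals
    (sensitivity + specificity) / 2."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == label]
        hits = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each interpolated between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    (Hypothetical helper; a simplified sketch of the SMOTE idea.)"""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest other minority samples (Euclidean distance).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic
```

As a usage illustration, a classifier that always predicts the majority class on labels `[1, 1, 0, 0, 0, 0, 0, 0, 0, 0]` has a plain accuracy of 0.8 but a balanced accuracy of 0.5, which is why balanced accuracy is the more informative figure when classes are imbalanced. In practice, resampling of this kind is normally applied to the training split only, so that the test set keeps its original class distribution.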