There is an urgent need to elucidate the mechanisms underlying coronavirus disease 2019 (COVID-19) so that vaccines and treatments can be devised. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is genetically similar to bat and pangolin viruses, but a comprehensive understanding of the functions of its proteins at the amino acid sequence level is lacking. A total of 4320 sequences of human and nonhuman coronaviruses were retrieved from the Global Initiative on Sharing All Influenza Data (GISAID) and the National Center for Biotechnology Information (NCBI). This work proposes an optimization method, COVID-Pred, with an efficient feature selection algorithm to classify species-specific coronaviruses based on the physicochemical properties (PCPs) of their sequences. Using a support vector machine, COVID-Pred identified a set of 11 PCPs and achieved 10-fold cross-validation and test accuracies of 99.53% and 97.80%, respectively. These findings could provide key insights into the driving forces during the course of infection and assist in developing effective therapies.
Author keywords: machine learning, support vector machines, physicochemical properties, SARS-CoV-2 classification
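To illustrate how an amino acid sequence can be turned into PCP-based features of the kind described above, the following is a minimal sketch, not the COVID-Pred implementation. It assumes each feature is the mean of a per-residue property over the sequence, and it uses the Kyte-Doolittle hydropathy index purely as an example scale, since the abstract does not name the 11 selected PCPs. The function names `mean_property` and `encode` are hypothetical.

```python
# Kyte-Doolittle hydropathy index, used here only as an illustrative
# physicochemical scale (assumption: the abstract does not list the 11 PCPs).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_property(sequence, scale):
    """Average a per-residue property over a sequence.

    Residues missing from the scale (e.g. the ambiguity code 'X') are skipped.
    """
    values = [scale[aa] for aa in sequence.upper() if aa in scale]
    if not values:
        raise ValueError("sequence contains no residues covered by the scale")
    return sum(values) / len(values)


def encode(sequence, scales):
    """Build a feature vector with one mean-property value per scale."""
    return [mean_property(sequence, scale) for scale in scales]


if __name__ == "__main__":
    # Toy peptide for demonstration only; not taken from the study's data set.
    features = encode("MFVFLVLLPLVSSQ", [KYTE_DOOLITTLE])
    print(features)
```

Feature vectors built this way, one entry per selected property, could then be passed to a classifier such as a support vector machine and evaluated with 10-fold cross-validation, as the abstract describes.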