Abstract
Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has had a devastating worldwide effect. Understanding the evolution and transmission of SARS-CoV-2 is of paramount importance for controlling, combating, and preventing COVID-19. Because both the number of SARS-CoV-2 genome sequences and the number of unique mutations are growing rapidly, phylogenetic analysis of SARS-CoV-2 genome isolates faces an emerging large-data challenge. We introduce a dimension-reduced K-means clustering strategy to tackle this challenge. We examine the performance and effectiveness of three dimension-reduction algorithms: principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). Using four benchmark datasets, we find that UMAP is the best-suited technique because of its stable, reliable, and efficient performance, its ability to improve clustering accuracy, especially for large Jaccard distance-based datasets, and its superior clustering visualization. UMAP-assisted K-means clustering enables us to shed light on increasingly large datasets from SARS-CoV-2 genome isolates.
Keywords: COVID-19; PCA; SARS-CoV-2; UMAP; t-SNE.
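The dimension-reduced K-means strategy summarized in the abstract can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration rather than the authors' implementation: the toy binary mutation matrix, the function names, the choice to represent each sample by its row of Jaccard distances to all samples, and the deterministic farthest-point seeding of K-means. PCA is used as the reducer because it needs only NumPy; in practice the PCA step could be swapped for t-SNE or UMAP (for example through the scikit-learn or umap-learn packages).

```python
import numpy as np


def jaccard_distance_matrix(binary):
    """Pairwise Jaccard distances between rows of a 0/1 matrix."""
    b = binary.astype(int)
    shared = b @ b.T                       # |A ∩ B| for every pair of rows
    counts = b.sum(axis=1)                 # |A| for every row
    union = counts[:, None] + counts[None, :] - shared
    # Two empty sets are treated as identical (distance 0).
    return np.where(union == 0, 0.0, 1.0 - shared / np.maximum(union, 1))


def pca_reduce(features, n_components):
    """Project features onto their leading principal components via SVD."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def kmeans(points, k, n_iter=100):
    """Plain Lloyd iterations; returns one cluster label per point."""
    # Deterministic farthest-point seeding (an assumption chosen for
    # reproducibility; random or k-means++ seeding is more common).
    centers = [points[0]]
    for _ in range(1, k):
        nearest = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centers], axis=0
        )
        centers.append(points[np.argmax(nearest)])
    centers = np.array(centers)

    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels


# Hypothetical toy data: rows are genome isolates, columns are mutation
# sites, and 1 means the isolate carries that mutation.
X = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
])

D = jaccard_distance_matrix(X)        # Jaccard distance-based representation
embedding = pca_reduce(D, 2)          # dimension reduction step
labels = kmeans(embedding, k=2)       # clustering in the reduced space
print(labels)
```

Clustering in the reduced space rather than on the full distance matrix is what keeps the K-means step cheap as the number of isolates grows, since each point then has only a few coordinates regardless of dataset size.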