Summary

We introduce VERSO, a two-step framework for characterizing viral evolution from sequencing data of viral genomes, which improves on phylogenomic approaches based on consensus sequences. VERSO first exploits an efficient algorithmic strategy to return robust phylogenies from clonal variant profiles, even under sampling limitations. It then leverages variant frequency patterns to characterize the intra-host genomic diversity of samples, revealing undetected infection chains and pinpointing variants likely involved in homoplasies. On simulations, VERSO outperforms state-of-the-art tools for phylogenetic inference. Notably, its application to 6,726 amplicon and RNA sequencing samples refines the estimation of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) evolution, while co-occurrence patterns of minor variants unveil undetected infection paths, which are validated with contact tracing data. Finally, the analysis of the SARS-CoV-2 mutational landscape uncovers a temporal increase in overall genomic diversity and highlights variants transitioning from the minor to the clonal state, as well as homoplastic variants, some of which fall on the spike gene. Available at: https://github.com/BIMIB-DISCo/VERSO

Highlights

• The analysis of raw sequencing data improves the reconstruction of viral evolution
• Our method reconstructs robust phylogenies with noisy data and sampling limitations
• The dissection of intra-host genomic diversity reveals undetected infection chains
• The identification of positively selected variants may drive experimental research

The Bigger Picture

The gravity of the COVID-19 pandemic has prompted a surge of studies analyzing SARS-CoV-2 consensus sequences to reconstruct phylogenomic models of its evolution and diffusion. Yet such approaches do not account for intra-host genomic diversity and may deliver inaccurate predictions when data are noisy and sampling is limited.
We propose VERSO, a data-science framework for the characterization of viral evolution from sequencing data. By accounting for uncertainty in the data, VERSO delivers robust phylogenies even under limited sampling and noisy observations. Additionally, the in-depth characterization of the intra-host genomic diversity of samples makes it possible to identify undetected infection chains and clusters and to intercept variants possibly undergoing positive selection. Accordingly, the joint application of our method and data-driven epidemiological models may deliver a high-precision platform for contact tracing and for pathogen surveillance and characterization.

The generation of reliable phylogenomic models describing the evolution of SARS-CoV-2 is essential to explain its diffusion and, possibly, to predict its next evolutionary steps. We introduce a data-science framework that improves on existing methods by accounting for noise and sampling limitations in sequencing data and by dissecting the intra-host diversity of single samples. The application to large-scale datasets demonstrates that our approach can improve the estimation of SARS-CoV-2 evolution, refine contact tracing, and pinpoint possibly hazardous mutations.
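The idea of using minor-variant co-occurrence to suggest infection links can be illustrated with a minimal sketch. It is not VERSO's algorithm, which this summary does not specify. It assumes that each sample is a mapping from variant to variant frequency, that hypothetical thresholds (`CLONAL_VF`, `MIN_VF`) separate clonal variants, minor variants, and noise, and that sharing a minor variant between samples with the same clonal genotype is taken as a hint of a possible link. All names, thresholds, and data below are illustrative assumptions.

```python
from itertools import combinations

# Hypothetical thresholds (assumptions, not values from the paper).
CLONAL_VF = 0.90  # frequency at or above which a variant is treated as clonal
MIN_VF = 0.05     # frequency below which a call is discarded as likely noise


def split_profile(profile):
    """Split a {variant: frequency} profile into (clonal, minor) variant sets."""
    clonal = {v for v, f in profile.items() if f >= CLONAL_VF}
    minor = {v for v, f in profile.items() if MIN_VF <= f < CLONAL_VF}
    return clonal, minor


def shared_minor_variants(samples):
    """Return {(sample_a, sample_b): shared minor variants} for every pair of
    samples that has an identical clonal genotype and at least one minor
    variant in common."""
    split = {name: split_profile(profile) for name, profile in samples.items()}
    links = {}
    for a, b in combinations(sorted(split), 2):
        clonal_a, minor_a = split[a]
        clonal_b, minor_b = split[b]
        if clonal_a != clonal_b:
            continue  # different clonal genotypes: not compared in this sketch
        shared = minor_a & minor_b
        if shared:
            links[(a, b)] = shared
    return links


# Toy data with placeholder variant names (not real observations).
samples = {
    "S1": {"varA": 0.99, "varB": 0.98, "varC": 0.22},
    "S2": {"varA": 0.97, "varB": 0.99, "varC": 0.35},
    "S3": {"varA": 0.99, "varB": 0.96, "varD": 0.02},
}
pairs = shared_minor_variants(samples)
print(pairs)
```

In this toy input, S1 and S2 share the same clonal genotype and the minor variant `varC`, so they are reported as a candidate link. S3 has the same clonal genotype, but its only other call falls below the noise threshold and is ignored.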
【Author keywords】 COVID-19, SARS-CoV-2, Genomic surveillance, viral evolution, Viral variants, Phylogenomics, intra-host genomic diversity
【Abstract keywords】 Evolution, coronavirus, Positive selection, COVID-19 pandemic, mutations, variant, Infection, Contact tracing, pathogen, Surveillance, spike gene, RNA sequencing, Phylogeny, Research, Evolution of SARS-CoV-2, Cluster, viral genomes, SARS-CoV-2 evolution, dataset, epidemiological, genomic, predict, platform, Intra-host diversity, Frequency, Analysis, phylogenetic inference, homoplasies, acute respiratory syndrome, Abstract, profiles, consensus sequences, limitation, sequencing data, diffusion, limitations, approach, data-driven, highlight, robust, joint, raw sequencing data, IMPROVE, selected, identify, involved, condition, reveal, explain, co-occurrence, homoplastic variants, infection chain, outperform, SARS-CoV-2 consensus sequence
【Title keywords】 Phylogeny, genomic, quantification, robust, viral sample