In this work we show that Incremental Machine Learning can be used to predict the classification of emerging SARS-CoV-2 lineages, dynamically distinguishing between neutral variants and non-neutral ones, i.e., variants of interest or variants of concern. Starting from the Spike protein primary sequences collected in the GISAID database, we derived a set of k-mer features, i.e., amino acid subsequences of fixed length k. We then implemented a Logistic Regression Incremental Learner that was tested monthly on the variants collected from February 2020 to October 2021. The average balanced accuracy of the classifier is 0.72 ± 0.2, which increased to 0.78 ± 0.16 in the last 12 months. The alpha, beta, gamma, eta, kappa and delta variants were recognized as non-neutral variants with mean recall ∼90%. In summary, incremental learning proved to be a useful instrument for pandemic surveillance, given its capability to update the model on new data over time.
[Author Keywords] Incremental Machine Learning, SARS-CoV-2 variant prediction
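
As an illustration of the approach summarized in the abstract, the following minimal Python sketch extracts k-mer counts from a sequence, updates a logistic regression model one monthly batch at a time, and scores it with balanced accuracy. It is not the authors' implementation. The class name `IncrementalLogReg`, the choice k = 3, the learning rate, the number of epochs, the test-then-train protocol and the toy sequences (made-up strings, not real Spike data) are all assumptions made for the example.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """Count all overlapping length-k subsequences of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


class IncrementalLogReg:
    """Hypothetical logistic regression updated batch by batch with SGD."""

    def __init__(self, lr=0.1):
        self.lr = lr
        self.w = {}   # one weight per k-mer; grows as new k-mers appear
        self.b = 0.0  # bias term

    def predict_proba(self, feats):
        z = self.b + sum(self.w.get(f, 0.0) * v for f, v in feats.items())
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, batch, epochs=20):
        """Update the existing weights on a new batch of (features, label)."""
        for _ in range(epochs):
            for feats, y in batch:
                err = self.predict_proba(feats) - y  # gradient of log loss
                self.b -= self.lr * err
                for f, v in feats.items():
                    self.w[f] = self.w.get(f, 0.0) - self.lr * err * v


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls over the classes present in y_true."""
    recalls = []
    for c in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Toy monthly batches of made-up sequences; label 1 = non-neutral (assumed).
months = [
    [("ACDEFGHIK", 0), ("ACDEFGHIR", 0), ("ACDKFGHIK", 1)],
    [("ACDEFGHIK", 0), ("ACDKFGHIR", 1), ("ACDKFGHIK", 1)],
]

model = IncrementalLogReg()
scores = []
for i, month in enumerate(months):
    batch = [(kmers(s), y) for s, y in month]
    if i > 0:  # test on the new month before learning from it
        preds = [int(model.predict_proba(f) >= 0.5) for f, _ in batch]
        scores.append(balanced_accuracy([y for _, y in batch], preds))
    model.partial_fit(batch)  # then fold the new month into the model

print(scores)
```

Storing the weights in a dictionary keyed by k-mer lets the feature space grow as previously unseen k-mers appear in later months, which is one simple way to handle emerging lineages without retraining from scratch.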