Lessons learnt from using the machine learning random forest algorithm to predict virulence in Streptococcus pyogenes

Sean Buckley

doi:10.25907/00114

Streptococcus pyogenes (Group A Streptococcus: GAS) is a strictly human bacterial pathogen that is responsible for over half a million deaths annually. The GAS genome encodes numerous virulence factors that cause multiple diseases of variable severity and outcome. Importantly, there is no licensed vaccine for GAS, and a subset of GAS strains have recently shown increased resistance to frontline antibiotic treatments. The gold standard for GAS genotyping, the emm type, is based on the variability of the nucleotide sequence of the 5’ end of emm, which encodes a surface-exposed protein that is important for evasion of host immunity. GAS two-component systems (TCSs) and response regulator (RR) transcription factors compose complex transcription regulatory networks that control the initiation of transcription, thereby shaping the gene expression profile of GAS as it adapts, survives and thrives in a diversity of tissues throughout infection. mga is an important GAS RR that controls the expression of up to 10% of the genes in the GAS genome, including emm. mga is encoded adjacent to emm as one of two divergent allele types (mga-1 or mga-2), which are correlated with tissue preference and are likely to influence virulence. The study of bacterial virulence is changing focus and expanding in scope, away from the ‘harmful pathogen’ and towards the dynamism of host-microbe interactions, whilst considering host immunity, the microbiome, and abiotic environmental factors. In the era of whole genome sequencing (WGS), the opportunity exists to increase our understanding of GAS virulence by developing comparative genomics techniques that employ machine learning (ML) and genome-wide association studies (GWAS) to interrogate the increasing abundance of high-quality genomic data. However, the prevailing shortage of high-quality, standardised GAS virulence phenotype metadata needs to be addressed.
I catalogued the distribution and diversity of the nucleotide sequences of the TCSs and RRs of 944 GAS genomes, and curated the available phenotype metadata. I developed a tool to type the alleles of these loci and, using phylogenetic and concordance metrics, I measured associations with the phylogenetic delineation and virulence metadata. I observed strain-dependent (that is, emm type-dependent) variation in the TCS and RR allele types. However, I saw no strong associations between individual allele types and virulence when applying phylogenetics and concordance metrics. I discovered novel recombination events including mga2-switching, emm-switching, and chimeric emm-enn events. I proposed that a subset of the RRs constitutes a novel GAS typing system, with inherent advantages over the emm-based systems. Following on, I applied the random forest ML algorithm to variation in 53 of these RR types to predict strain and virulence phenotype metadata. I was able to predict strain, country of sampling, and the invasiveness of the isolate with high accuracy, using surprisingly few RR alleles. However, I was unable to predict tissue tropism or clinical outcomes. Importantly, while investigating the causes of inaccurate predictions of emm type, I discovered several rare anomalies of the mga regulon, and subsequently developed models to explain them. These included a novel cell wall-spanning domain of the enn gene (SF5) that redefines the emm pattern typing system; subcategories of chimeric emm-enn events in which the emm subtype was retained or changed, which I defined as ‘likewise’ and ‘contrariwise’; and a model to explain the time-dependent excision of genes of the mga regulon. Furthermore, I contextualised my key findings amongst recent studies in the field of WGS-derived comparative genomic methods that have employed GWAS and ML, from which I communicate several lessons learnt.
I developed an ML-based workflow in which the microbe, host, and microbiome are sampled and sequenced at the same time, composing a ‘linked’ genome set that better represents the infection event. The workflow includes a quality-control system for collecting high-quality, standardised GAS virulence phenotype metadata. My workflows and models have advanced the understanding of GAS phylogenetic delineation and epidemiology. Moreover, they serve as templates for testing associations with hitherto untested GAS phenotype traits.
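
The allele typing and concordance steps summarised above can be illustrated with a minimal sketch. It is not the tool developed in this work: the function names, the exact-match typing rule, the toy sequences, and the choice of the Wallace coefficient as the concordance metric are all assumptions made for illustration, since the abstract does not name the specific metrics used.

```python
from itertools import combinations


def type_alleles(sequences):
    """Assign integer allele types by exact sequence match (assumed rule).

    Types are numbered from 1 in the order distinct sequences are first seen.
    """
    seen = {}
    types = []
    for seq in sequences:
        key = seq.upper()  # treat sequences case-insensitively
        if key not in seen:
            seen[key] = len(seen) + 1
        types.append(seen[key])
    return types


def wallace(a, b):
    """Wallace coefficient W(a -> b).

    The fraction of isolate pairs that share a type in `a` and also share
    a type in `b`. Returns None when no pair shares a type in `a`.
    """
    same_a = 0
    same_both = 0
    for i, j in combinations(range(len(a)), 2):
        if a[i] == a[j]:
            same_a += 1
            if b[i] == b[j]:
                same_both += 1
    return same_both / same_a if same_a else None


# Hypothetical toy data: one RR locus and the emm type for five isolates.
rr_sequences = ["ATGAAA", "ATGAAA", "ATGAAC", "atgaac", "ATGTTT"]
emm_types = ["emm1", "emm1", "emm12", "emm12", "emm1"]

rr_types = type_alleles(rr_sequences)
rr_predicts_emm = wallace(rr_types, emm_types)
emm_predicts_rr = wallace(emm_types, rr_types)
```

In this toy set every pair of isolates sharing an RR allele also shares an emm type, whereas the reverse holds for only half of the pairs. This directional asymmetry is the kind of signal a concordance metric can expose when comparing a candidate typing scheme with emm typing.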
