Conservation of vaccine antigen sequences encoded by sequenced strains of Streptococcus equi subsp. equi

Summary

Background: Streptococcus equi subspecies equi (S equi) is the cause of strangles, one of the most prevalent diseases of horses worldwide. Variation within the immunodominant SeM protein has been documented, but a new eight‐component fusion protein vaccine, Strangvac, does not contain live S equi or SeM, and the conservation of the antigens it contains has not been reported.

Objective: To define the diversity of the eight Strangvac antigens across a diverse S equi population.

Study design: Genomic description.

Methods: Antigen sequences from the genomes of 759 S equi isolates from 19 countries, recovered between 1955 and 2018, were analysed. Predicted amino acid sequences in the antigen fragments of SEQ0256(Eq5), SEQ0402(Eq8), SEQ0721(EAG), SEQ0855(SclF), SEQ0935(CNE), SEQ0999(IdeE), SEQ1817(SclI) and SEQ2101(SclC) in Strangvac and SeM were extracted from the 759 assembled genomes and compared.

Results: The predicted amino acid sequences of SclC, SclI and IdeE were identical across all 759 genomes. CNE was truncated in the genomes of five (0.7%) isolates. SclF was absent from one genome and another encoded a single amino acid substitution. EAG was truncated in two genomes. Eq5 was truncated in four genomes and 123 genomes encoded a single amino acid substitution. Eq8 was truncated in three genomes, one genome encoded four amino acid substitutions and 398 genomes encoded a single amino acid substitution at the final amino acid of the Eq8 antigen fragment. Therefore, at least 1579 (99.9%) of 1580 amino acids in Strangvac were identical in 743 (97.9%) genomes, and all genomes encoded identical amino acid sequences for at least six of the eight Strangvac antigens.

Main limitations: Three hundred and seven (40.4%) isolates in this study were recovered from horses in the UK.

Conclusions: The predicted amino acid sequences of antigens in Strangvac were highly conserved across this collection of S equi.


| INTRODUCTION
The host-restricted pathogen Streptococcus equi subspecies equi (S equi) causes the disease strangles, which is one of the most prevalent infectious diseases of horses worldwide. 1-5 S equi infects horses via nose-to-nose contact with diseased animals, ingestion of contaminated food or water, or contact with other fomites. 4 Once within the mouth or nose, S equi attaches to and invades the lingual and palatine tonsils via an array of cell surface receptors, 6-8 before transitioning to the lymph nodes of the head and neck within a few hours of infection. 9 Within the lymph nodes, S equi uses a multitude of immune-evasion strategies to neutralise the effects of the innate immune system and establish infection. [10][11][12][13][14] Active recruitment of neutrophils to infected lymph nodes and failure of the immune system to kill S equi result in enlargement of lymph nodes and formation of abscesses, which may be refractory to antibiotic treatment. 4 Treatment of cases with antibiotics may also impede the development of a humoral immune response 15 and select for resistant strains. [16][17][18] Abscesses within the lymph nodes eventually burst, draining from the head of affected animals, releasing S equi into the local environment and providing an opportunity for transmission to naïve animals. 4 Most horses then recover from the disease. However, some recovered horses remain persistently infected "carriers" of S equi, providing long-term potential for transmission of S equi through contact with naïve animals. 4,[19][20][21] Therefore, biosecurity, diagnostic testing and vaccination measures to prevent the establishment of infection are of vital importance to control the spread of this disease. 4,[22][23][24][25]

The development of rapid methodologies for the generation and analysis of genome sequence information has shed unprecedented light on the evolution and transmission of S equi. [17][18][19]26,27 In a recent study, the genomes of a population of 670 isolates from 19 countries were found to cluster into six Bayesian analysis of population structure (BAPS) groups, based upon polymorphisms within a core genome comprising 1286 loci, with an average of 90.9 single nucleotide polymorphisms separating each group. 19 The calculated mean substitution rate per core genome site per year was 5.22 × 10⁻⁷, suggesting that, within a core genome of 1.8 Mb, 26 approximately one base substitution accumulates per year. This dates the emergence of the contemporary strains of S equi to around the time of World War I. 17,28 However, genes encoding antigenic proteins may be subject to much greater selective pressure, leading to much more rapid genetic change. For example, genetic variation in the gene encoding the immunodominant SeM protein has been exploited in strain typing schemes, with 242 alleles described to date. 11,17,19,[29][30][31] The SeM protein is used within several cell extract and live vaccines that target S equi. [32][33][34][35][36] However, the multicomponent fusion protein vaccine Strangvac does not contain SeM and instead uses eight different S equi proteins, the diversity of which has not been described previously. 37 Thus, in this study, we examined the diversity of Strangvac antigens by combining and analysing three published genome collections 18,19,27 that together comprised 759 S equi isolates from 19 countries, which were recovered from horses between 1955 and 2018.
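The statement that roughly one base substitution accumulates per year follows directly from the two figures quoted above. A minimal sketch of that arithmetic is given here; the constant and function names are illustrative and do not come from the original study.

```python
# Values quoted in the text; the names below are illustrative only.
SUBSTITUTION_RATE = 5.22e-7  # substitutions per core genome site per year
CORE_GENOME_SITES = 1.8e6    # core genome of 1.8 Mb, expressed in bases


def substitutions_per_year(rate_per_site: float, sites: float) -> float:
    """Expected number of substitutions per core genome per year."""
    return rate_per_site * sites


per_year = substitutions_per_year(SUBSTITUTION_RATE, CORE_GENOME_SITES)
print(f"{per_year:.2f} substitutions per core genome per year")
```

The product is just under one, which is consistent with the "approximately one base substitution per year" given in the text.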

| Study collection
The origins of the genomes of the 759 isolates (three collections) of S equi analysed in this study are listed in Table S1. One of the collections, of 54 genomes of S equi isolates recovered from outbreaks in the USA, has been used previously to study the transmission of S equi in Texas and Kentucky (Collection A). 18 The largest collection of genomes used here, from 670 S equi isolates, 19 included the complete genome sequence of Se4047, which was used as the reference genome in this study as it was the first S equi genome sequenced to completion. 26 Of these, 224 genomes were from a study of the effects of persistent infection 17 and 445 genomes from a study of the international transmission of S equi (Collection B). 19 Finally, a collection of 35 genomes of isolates recovered from the USA (n = 21) or Sweden (n = 14) that have been used previously to examine the effects of persistent infection was included (Collection C). 27 Together, the combined collection of 759 S equi genomes originated from isolates recovered between 1955 and 2018 from 19 countries that comprised: Argentina (n = 15), Australia (n = 26), Belgium (n = 14), Canada (n = 1), France (n = 14), Germany (n = 12), Ireland (n = 16), Israel (n = 14), Japan (n = 12), Kuwait (n = 1), the Netherlands (n = 17), New Zealand (n = 4), Poland (n = 11), Saudi Arabia (n = 5), Spain (n = 2), Sweden (n = 26), the United Arab Emirates (UAE, n = 119), the UK (n = 307) and the USA (n = 143) (Table S1).
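The composition of the combined collection can be cross-checked with a short tally. The sketch below uses only the counts listed above; the variable and function names are illustrative.

```python
# Isolate counts per country, as listed in the text.
ISOLATES_BY_COUNTRY = {
    "Argentina": 15, "Australia": 26, "Belgium": 14, "Canada": 1,
    "France": 14, "Germany": 12, "Ireland": 16, "Israel": 14,
    "Japan": 12, "Kuwait": 1, "Netherlands": 17, "New Zealand": 4,
    "Poland": 11, "Saudi Arabia": 5, "Spain": 2, "Sweden": 26,
    "UAE": 119, "UK": 307, "USA": 143,
}

# Genomes contributed by each of the three published collections.
COLLECTION_SIZES = {"A": 54, "B": 670, "C": 35}

total_isolates = sum(ISOLATES_BY_COUNTRY.values())


def share_pct(country: str) -> float:
    """Percentage of the combined collection recovered in one country."""
    return 100.0 * ISOLATES_BY_COUNTRY[country] / total_isolates


print(total_isolates, len(ISOLATES_BY_COUNTRY))
print(f"UK: {share_pct('UK'):.1f}%  USA: {share_pct('USA'):.1f}%")
```

Both the per-country counts and the three collection sizes sum to the same total, and the UK and USA shares match the percentages reported elsewhere in the article.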

KEYWORDS
genetic conservation, horse, S equi, strangles, vaccine antigens

| Phylogenomic analysis
Genome assemblies for all 759 isolates were uploaded into the Pathogenwatch bioresource for S equi (https://cgps.gitbook.io/pathogenwatch/) and phylogenetic reconstruction of the combined populations was generated as described previously. 19 The collection in Pathogenwatch can be accessed at https://pathogen.watch/collection/j3qp5viupjjh-antigen-variation. A curated set of 1286 loci in the core genome of the Se4047 reference, excluding the mobile genetic elements (φSeq1, φSeq2, φSeq3, φSeq4, ICESe1 and ICESe2), insertion sequences and sortase-processed proteins, was used for typing purposes. 17,26 Alleles of loci for which multiple copies were encoded within the S equi genome, including hasC1 and hasC2, were also omitted. 26 BLAST matches of the 1286 loci across each genome relative to the core genome of the Se4047 reference were extracted and aligned using MAFFT, 38 and a database of the core genome segments with a per cent identity was constructed. Hits below 80% of the core gene length or below 80% identity were removed as fragments.
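The fragment filter described above can be illustrated with a small sketch. The `Hit` record and its field names are hypothetical and are not the Pathogenwatch data model; only the 80% thresholds come from the text.

```python
from dataclasses import dataclass

THRESHOLD_PCT = 80.0  # threshold quoted in the text


@dataclass(frozen=True)
class Hit:
    """Hypothetical summary of one BLAST match to a core genome locus."""
    locus: str
    identity_pct: float   # per cent identity to the reference locus
    aligned_length: int   # number of reference bases covered by the match
    locus_length: int     # full length of the reference locus


def coverage_pct(hit: Hit) -> float:
    """Percentage of the reference locus length covered by the match."""
    return 100.0 * hit.aligned_length / hit.locus_length


def keep_hit(hit: Hit, threshold: float = THRESHOLD_PCT) -> bool:
    """Keep a hit only if both length coverage and identity reach the threshold."""
    return hit.identity_pct >= threshold and coverage_pct(hit) >= threshold


hits = [
    Hit("locus_a", identity_pct=99.8, aligned_length=900, locus_length=900),
    Hit("locus_b", identity_pct=99.0, aligned_length=300, locus_length=900),
    Hit("locus_c", identity_pct=75.0, aligned_length=900, locus_length=900),
]
retained = [hit.locus for hit in hits if keep_hit(hit)]
print(retained)
```

In this toy input, the second hit is discarded for covering only a third of its locus and the third for low identity, so only the first is retained.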
Each specific combination of substitutions within the core genome loci relative to the Se4047 reference 26 was assigned an allele. Indels were excluded from further analysis, as they are often the result of assembly or sequencing error. 38 The variant sites between each pair of assemblies were then used to construct dendrograms using the APE package. 39 The resulting tree was midpoint-rooted using the phangorn package. 40 The phylogenetic reconstruction and associated metadata were visualised using Microreact 41 and can be viewed at https://microreact.org/project/8knxebFjP96CrKjv3uA9xY.
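As a rough illustration of how variant sites between pairs of assemblies might be counted while ignoring indels, the following sketch compares aligned core loci position by position. It is an assumption-laden simplification rather than the published pipeline: the input layout (isolate to locus to aligned sequence) and all names are hypothetical, and the tree construction performed with APE and phangorn is not reproduced here.

```python
from itertools import combinations
from typing import Dict, Tuple

GAP = "-"  # alignment gap character; gapped positions are treated as indels


def substitution_count(first: str, second: str) -> int:
    """Count mismatched positions between two aligned sequences, skipping gaps."""
    if len(first) != len(second):
        raise ValueError("aligned sequences must have equal length")
    return sum(
        1 for x, y in zip(first, second)
        if x != y and GAP not in (x, y)
    )


def pairwise_distances(
    alignments: Dict[str, Dict[str, str]]
) -> Dict[Tuple[str, str], int]:
    """Sum substitutions over the loci shared by each pair of isolates."""
    distances = {}
    for a, b in combinations(sorted(alignments), 2):
        shared_loci = alignments[a].keys() & alignments[b].keys()
        distances[(a, b)] = sum(
            substitution_count(alignments[a][locus], alignments[b][locus])
            for locus in shared_loci
        )
    return distances


toy = {
    "iso1": {"locusA": "ACGT", "locusB": "AA-A"},
    "iso2": {"locusA": "ACGA", "locusB": "AATA"},
    "iso3": {"locusA": "TCGA", "locusB": "AA-A"},
}
print(pairwise_distances(toy))
```

The resulting pairwise counts form the kind of distance matrix from which a dendrogram could then be built.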

| Extraction of antigen sequences
The DNA sequence encoding the antigen fragments of

| The genomes of S equi isolates generally clustered with others from the same geographical regions
The genomes from Collections A and C clustered closest to those of isolates recovered from the same geographical region (the USA or Sweden) (https://microreact.org/project/8knxebFjP96CrKjv3uA9xY). However, isolates ER14_125 and ER14_140, recovered from horses in Texas during 2014, clustered most closely with a group of isolates from the UAE and Saudi Arabia recovered between 2013 and 2015 (Figure 2). These data provide further evidence of a link between outbreaks in the UAE, Saudi Arabia and the USA that may have been associated with the international transport of horses.

| The predicted amino acid sequences of the antigens targeted by Strangvac were conserved across the combined collections of S equi
Strangvac contains eight antigens based on the 1866 strain of S equi, which was recovered from a horse with strangles in Hälsingland, Sweden, in 2000 (Table S1). The core genome of strain 1866 clustered into BAPS2, which was the most prevalent type of S equi causing strangles in horses within Europe in Collections B and C. The core genome of strain 4047 (UK4047 in Table S1), which served as a reference genome for this study and was used as the challenge strain in Strangvac vaccine trials, clustered into BAPS5, the second most prevalent type of S equi recovered from European horses.
Sequence analysis confirmed that the amino acid sequences of SclC, SclI and IdeE in Strangvac were identical to the predicted amino acid sequences of the homologous proteins encoded by all 759 (100%) genomes in the combined collection (Figure 3, Table 1 and Table S1).
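The comparison of each predicted antigen fragment with the Strangvac sequence can be sketched as a simple classification into the categories used in the results (identical, substituted, truncated or absent). This is an illustrative simplification: it assumes equal-length fragments can be compared position by position, treats any shorter sequence as truncated, and all names and toy sequences are hypothetical.

```python
from typing import List, Optional, Tuple


def classify_antigen(
    reference: str, predicted: Optional[str]
) -> Tuple[str, List[str]]:
    """Classify a predicted antigen fragment relative to the vaccine sequence.

    Returns a category and a list of substitutions written as
    <reference residue><1-based position><observed residue>.
    """
    if predicted is None:
        return "absent", []
    if len(predicted) < len(reference):
        # Simplifying assumption: any shorter product counts as truncated.
        return "truncated", []
    # zip stops at the end of the reference fragment, so extra residues
    # beyond the fragment are ignored.
    changes = [
        f"{ref}{position}{alt}"
        for position, (ref, alt) in enumerate(zip(reference, predicted), start=1)
        if ref != alt
    ]
    if not changes:
        return "identical", []
    return "substituted", changes


vaccine_fragment = "MKTIL"  # toy sequence, not a real antigen
print(classify_antigen(vaccine_fragment, "MKTIL"))
print(classify_antigen(vaccine_fragment, "MKTLL"))
print(classify_antigen(vaccine_fragment, "MKT"))
print(classify_antigen(vaccine_fragment, None))
```

Counting the categories per antigen across all genomes would give tallies of the kind summarised in Table 1.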
The predicted amino acid sequence of CNE was identical to that  Table 1 and Table S1). Therefore, the amino acid sequence of CNE in Strangvac was fully conserved in all 434 European genomes and all 144 North American genomes.
The predicted amino acid sequence of SclF was identical to that used in Strangvac in 757 (99.7%) of the 759 S equi genomes in the combined collection (Figure 3, Table 1 and Table S1). The BAPS1 strain NZLU, which was recovered from a horse in New Zealand  Table 1 and Table S1). The BAPS1 strain USA07_22, which was recovered from a horse in Indiana in 2009, and the BAPS2 strain UAE0015_3 (ST-179), which was recovered from a horse in Dubai in 2014, contained a truncation in the predicted EAG amino acid sequence. Therefore, the amino acid sequence of EAG in Strangvac was fully conserved in all 434 European genomes and 143 (99.3%) of 144 North American genomes.
The predicted amino acid sequence of Eq5 was identical to that used in Strangvac in 632 (83.3%) of the 759 S equi genomes in the combined collection (Figure 3, Table 1 and Table S1). The 123 strains in BAPS5 contained an isoleucine-to-leucine substitution at amino acid position 201, which is conservative and not predicted to significantly alter the antigenicity of Eq5. In support of this, the Se4047  Table 1 and Table S1). This is the final amino acid of the Eq8 fragment in Strangvac and is not predicted to significantly affect the antigenicity of this protein relative to Strangvac. Therefore, the amino acid sequence of Eq8

| The predicted amino acid sequences of SeM varied across the combined collections of S equi
A total of 111 different SeM alleles were identified across the combined collection of 759 S equi genomes (Table S1 and Table S2).
Analysis of the predicted 109 amino acids encoded by the 5' variable region of the SeM gene of these 759 genomes revealed that 44 (40%)

FIGURE 1 Distribution of isolates from Collections A, B and C into the six Bayesian analysis of population structure (BAPS) groups. Midpoint-rooted phylogenetic reconstruction of the Streptococcus equi population visualised in Microreact. The dendrogram was constructed from pairwise cgMLST scores using the APE package. 39 The resulting tree was midpoint-rooted using the phangorn package. 40 The scale bar relates to horizontal branch length and indicates the number of cgSNPs proposed to have occurred on the branches. Green and red circles indicate Collections A 18

Fifteen (2%) of the strains in the combined collection of S equi contained deletions or insertions in the 5' region of the SeM gene, leading to the production of a truncated product (Table S1). On average, the 744 S equi isolates which contained a full-length SeM gene encoded a product that contained 2.9 amino acid changes (2.7%) relative to the consensus 109 amino acids of the N-terminal region of SeM.
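The average number of amino acid changes relative to a consensus can be illustrated with a short sketch. It assumes equal-length sequences and a simple per-position majority consensus, which may differ from how the consensus was derived in the study; the function names and toy sequences are hypothetical.

```python
from collections import Counter
from typing import List


def consensus(sequences: List[str]) -> str:
    """Most common residue at each position of equal-length sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must have equal length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*sequences)
    )


def mean_changes(sequences: List[str]) -> float:
    """Mean number of residues per sequence that differ from the consensus."""
    reference = consensus(sequences)
    differences = [
        sum(a != b for a, b in zip(seq, reference)) for seq in sequences
    ]
    return sum(differences) / len(differences)


toy_regions = ["MKA", "MKA", "MRA", "MKT"]  # toy data, not SeM sequences
print(consensus(toy_regions), mean_changes(toy_regions))

# The percentage in the text is the mean change count over the region length.
REGION_LENGTH = 109
reported_pct = round(100 * 2.9 / REGION_LENGTH, 1)
print(reported_pct)
```

The final lines confirm that 2.9 changes across 109 residues corresponds to the 2.7% quoted above.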

| DISCUSSION
The collection of S equi genomes used in this study is the most comprehensive to date. However, large proportions of the genomes were from isolates recovered from horses residing in the UK and the USA (40.4% and 18.8%, respectively). Continued

FIGURE 2 Relationships of Bayesian analysis of population structure (BAPS) group 1 isolates from outbreaks in the USA, UAE and Saudi Arabia. Midpoint-rooted phylogenetic reconstruction of a BAPS1 subgroup of the Streptococcus equi population visualised in Microreact. The dendrogram was constructed from pairwise cgMLST scores using the APE package. 39 The resulting tree was midpoint-rooted using the phangorn package. 40 The scale bar relates to horizontal branch length and indicates the number of cgSNPs proposed to have occurred on the horizontal branches. Coloured circles indicate the country from which the isolates originated, as indicated in the key.

Immune responses towards the antigens within Strangvac conferred significant levels of protection to vaccinated ponies against infection with S equi following experimental challenge. 37 Therefore, the high level of conservation of the proteins used in Strangvac, relative to the variation identified in SeM, was surprising. The variability of Strangvac antigens may be restricted by functional constraints that do not affect variation in SeM to the same degree. 46 The SeM protein of S equi is immunodominant 53 and it is evident that selective pressure on SeM leads to variation in this important protein. 46 However, vaccines based on SeM failed to confer protection to horses. 48 We speculate that the immunodominance of SeM during natural infection may divert selective pressure exerted by the equine immune response away from other, less immunogenic antigens, the response to which may be more protective.

| CONCLUSION
The predicted amino acid sequences of antigens in Strangvac were

AUTHOR CONTRIBUTIONS
All authors contributed to data analysis and interpretation and manuscript preparation and approved the final manuscript.

ETHICAL ANIMAL RESEARCH
This study analysed publicly available genome sequencing data from previously published studies. No experimental animals or clinical cases were sampled during this study.

INFORMED CONSENT
Not applicable.

PEER REVIEW
The peer review history for this article is available at https://publons.com/publon/10.1111/evj.13552.

DATA AVAILABILITY STATEMENT
The Illumina sequences used in this study were deposited previously at the National Center for Biotechnology Information (NCBI) under the accession numbers SUB6350545, 18 PRJEB38019 19 and