Genetic diversity and population structure of Tenacibaculum maritimum, a serious bacterial pathogen of marine fish: from genome comparisons to high throughput MALDI-TOF typing

Tenacibaculum maritimum is responsible for tenacibaculosis, a devastating marine fish disease. This filamentous bacterium displays a very broad host range and a worldwide geographical distribution. We analyzed and compared the genomes of 25 T. maritimum strains, including 22 newly draft-sequenced genomes from isolates selected based on available MLST data, geographical origin and host fish. The genome size (~3.356 Mb in average) of all strains is very similar. The core genome is composed of 2116 protein-coding genes accounting for ~75% of the genes in each genome. These conserved regions harbor a moderate level of nucleotide diversity (~0.0071 bp−1) whose analysis reveals an important contribution of recombination (r/m ≥ 7) in the evolutionary process of this cohesive species that appears subdivided into several subgroups. Association trends between these subgroups and specific geographical origin or ecological niche remains to be clarified. We also evaluated the potential of MALDI-TOF-MS to assess the variability between T. maritimum isolates. Using genome sequence data, several detected mass peaks were assigned to ribosomal proteins. Additionally, variations corresponding to single or multiple amino acid changes in several ribosomal proteins explaining the detected mass shifts were identified. By combining nine polymorphic biomarker ions, we identified combinations referred to as MALDI-Types (MTs). By investigating 131 bacterial isolates retrieved from a variety of isolation sources, we identified twenty MALDI-Types as well as four MALDI-Groups (MGs). We propose this MALDI-TOF-MS Multi Peak Shift Typing scheme as a cheap, fast and an accurate method for screening T. maritimum isolates for large-scale epidemiological surveys.


Introduction
The rapid development of intensive aquaculture has been associated with a dramatic increase in outbreaks of infectious diseases [1,2]. Additionally, the international spread of pathogens through the trade of fish and eggs or as a response to environmental changes has been documented for some important viruses and bacteria [3,4]. In this context, the success and sustainability of aquaculture largely depend on the understanding of the evolution and epidemiology of pathogens [1]. Among those, several species of the genus Tenacibaculum (family Flavobacteriaceae, phylum Bacteroidetes) are responsible for diseases collectively designated as tenacibaculosis, a very serious bacterial condition of many commercial marine Open Access *Correspondence: eric.duchaud@inrae.fr 1 Université Paris-Saclay, INRAE, UVSQ, VIM 78350 Jouy-En-Josas, France Full list of author information is available at the end of the article fish species leading to considerable economic losses [5,6]. Tenacibaculum maritimum (formerly Flexibacter maritimus) was the first species to be characterized and probably the best-known pathogen in the genus. Moreover, T. maritimum can affect many feral, captive, and cultured fish species [5,7] and has been repeatedly identified in many marine aquaculture systems worldwide. Diseased fish usually exhibit a diversity of external symptoms including corroded mouth, skin ulcers, fin necrosis, and rotted tail. Skin lesions are often colonized by opportunistic pathogens such as Vibrio spp. So far, only one vaccine is commercially available, but it is restricted to the protection of turbot. Hence, the control of T. maritimum outbreaks essentially relies on the use of antibiotics, sometimes combined with external disinfectants [8].
Reliable methods for studying the relationships between isolates of the same bacterial species (i.e., strain typing) are a key step for understanding the population structure, the spreading and the epidemiology of pathogens. Different typing methods have been proposed for epidemiological investigations of T. maritimum [9]. Three serotypes displaying varying degrees of association with host fish species have been reported [10,11] and different molecular technics have been used to determine the intraspecific diversity of T. maritimum [12,13]. Serological data were compared with several PCR-based methods [14]. In 2014, we proposed a Multi Locus Sequence Analysis (MLSA) scheme [15] that proved to be a powerful discriminating tool for isolate identification and taxonomic affiliation [16,17]. MLSA revealed an unforeseen diversity including several, yet undescribed, pathogenic species in Norway [18]. In addition, this 11-locus sequenced-based method revealed a high number of distinct genotypes for the species T. maritimum, suggesting an endemic distribution of strains without significant contribution of long-distance dissemination linked to international fish movements. More recently, whole-cell matrix-assisted laser desorption ionization-time of flight mass spectrometry (MALDI-TOF MS) was used for the differentiation of several fish-pathogenic Tenacibaculum species [14,19]. However, these studies did not reveal any relationships between the proteomic profiles and the source of isolation of the strains for any of the Tenacibaculum species analyzed, and no biomarker below the species level (e.g., serotype-specific peaks) was detected for T. maritimum. Meanwhile, the complete genome of the T. maritimum type strain [20] as well as the draft genomes of the type strains and several field isolates of T. dicentrarchi and "T. finnmarkense" have been recently published [21], paving the way to comparative genomics.
In this study, we analyzed and compared 25 genomes (including 22 newly draft-sequenced) of T. maritimum isolates from various geographical origins and host fish species to draw a global picture of the genomic diversity of the species. In addition, we developed a genome-based MALDI-TOF MS scheme and typed 131 field isolates. While this technique is commonly used for species identification, bacterial characterization below the species level is much more challenging, requiring the identification of subtle differences between strains [22]. We also propose a dedicated website that allows both identification and typing of new Tenacibaculum isolates [23].

Bacterial strains
Tenacibaclum maritimum strains were grown in marine 2216E broth (Difco, Becton, Dickinson and Co., Franklin Lakes, New Jersey, USA) for 24 h at 28 °C and 170 rpm. Stock cultures were preserved in marine 2216E broth containing 20% (v/v) glycerol at −80 °C. The 25 strains used in this study are listed in Table 1 and the bacterial isolates subjected to MALDI-TOF MS analysis are listed in Additional file 1.

Genome sequencing, assembly and annotation
Following centrifugation of the liquid culture, genomic DNA was extracted from the pellet using the Wizard genomic DNA purification kit (Promega, Madison, Wisconsin, USA). The genomes were sequenced using Illumina (HiSeq, 100 paired-end or MiSeq, 300 paired-end) and genome assemblies were performed using Spades and Velvet on the PATRIC website with default settings [24]. The resulting contigs (> 2000 bp) were integrated into the MicroScope platform [25].

Genome analysis and comparisons
Average Nucleotide Identity analyses were performed using the ANIm method [26] with the Python module Pyani [27] using proposed threshold of ≈ 95-96% for species delineation. Genome annotation, including manual curation, and comparisons, including pan and core genome computation, were performed using the web interface MicroScope [28], which allows graphic visualization enhanced by a synchronized representation of synteny groups [29]. Persistent, shell and cloud genomes were computed using the PPanGGOLiN 0.1.4 software [30]. Considering the low levels of sequence divergence at typical core genome loci previously reported for T. maritimum [15], we chose a cutoff of 80% identity and 80% on the minimal coverage of the length between the aligned portions of two proteins to determine whether two CDSs were members of the same gene family.
The 25 genomes were aligned using Snippy [31] with the genome of strain NCIMB 2154 T (the type strain of T. maritimum) [20] serving as a reference. The resulting whole-genome alignment was used for phylogenetic The list of contributors is available in Additional file 1.
tree reconstruction using Gubbins [32] and the Phylip package version 3.6 released by Felsenstein in 2005, available online at [33] and originally released in 1980 [34]. Regions of high diversity (high number of single nucleotide polymorphisms [SNPs]) presumably linked to recombination events were masked at each Gubbins iteration. A Maximum Likelihood tree was built at the fifth iteration using non-masked polymorphic sites. Neighbor-joining and parsimony trees were obtained with Phylip suite v3.696 (programs dnadist, neighbor, and dnapars) after removing positions with gaps in the alignment. Custom Perl and R scripts were used for the nucleotide diversity analysis (computation of the average pairwise nucleotide diversity and of the homoplasy index) and graphical representations (including R library "ape" for the drawing of phylogenetic trees). An estimate of the ratio of recombination and mutation (r/m) based on the analysis of SNPs between pairs of closely related isolates was obtained using a two-state hidden-Markov model (HMM), as described in Duchaud et al. [35].

Preparation of bacterial samples for MALDI-TOF MS
The ethanol/formic acid extraction procedure, as described by Mellmann et al. [36] was used. Briefly, a colony of a fresh overnight culture was inoculated in 5 mL marine 2216E broth and cultivated at 28 °C for 24 h with shaking (170 rpm). The resulting bacterial culture was centrifuged at 12 000 g in a desktop centrifuge for 2 min and the supernatant discarded. About 10 mg of the resulting bacterial pellet was transferred in a clean Eppendorf tube with 300 μL of ultra-pure water (Acros organics, New Jersey, USA) and vigorously mixed to resuspend the cells. 900 μL of 100% ethanol (VWR Chemicals, Radnor, Pennsylvanie, USA) was added into the tube and mixed again. The mix was directly processed or kept at room temperature (RT) up to 1 month for the assessment of the ethanol-fixed bacteria protocol. The tube was centrifuged at 12 000 g in a desktop centrifuge for 2 min and the supernatant discarded. The tube was centrifuged for 2 additional minutes and the residual ethanol removed and the pellet was dried at RT. Thirty μL of 70% formic acid was added and mixed thoroughly by pipetting. An equal volume of acetonitrile was then added to the tube, mixed carefully and then centrifuged at 12 000 g in a desktop centrifuge for 2 min. One μL of the supernatant was dropped onto a 96-spot polished steel target. The sample spot was dried at RT and 1 μL of matrix solution (α-cyano-4-hydroxycinnamic acid in 50% acetonitrile, 47.5% water and 2.5% trifluoroacetic acid) was then added. The sample spot was finally air dried again before analysis. A calibration of the MALDI-TOF mass spectrometer was performed using Bruker bacterial test standard for each series of acquisitions with a mass tolerance limit of ± 300 parts-per-million (ppm). To evaluate alternative sample preparation procedures, bacteria were cultivated on solid medium [i.e., marine 2216E broth and 15 g/L agar (Invitrogen, Illkirch, France)] at 28 °C for 24 h. About 10 bacterial colonies were collected and processed as the bacterial pellet obtained from liquid culture. For the assessment of the direct colony picking protocol, a single bacterial colony was picked with a sterile toothpick and directly deposited onto a 96-spot polished steel target and processed as the sample spot previously mentioned.

Peak shift characterization using genomic data
Ribosomal protein sequences were retrieved from the MicroScope annotation platform using the 25 available T. maritimum genomes. The theoretical mass weight for each sequence was computed with the mw() function from the Peptides R-package and TermiNator [37] was used to predict the first methionine cleavage resulting in a theoretical loss of 131.2 Da. The theoretical masses obtained were compared to the peak list and spectra were screened to retrieve peak shifts using the predicted masses of presumptive polymorphic biomarkers. Several other peak shifts were identified during this step but were not kept for typing purpose because of the stringent criteria used (see section "Results").

MALDI-TOF data analysis algorithm and implementation in R
We used different R packages dedicated to mass spectrometry analysis from the BioConductor repository [38], i.e. MALDIquant Foreign, MALDIquant, MALDIrppa and MassSpecWavelet. Using these packages, any type of MALDI-TOF data can be loaded from any manufacturer. Raw data generated by Microflex ® Bruker were imported into R environment by MALDIquant Foreign. The spectra pre-processing steps (i.e., intensity transformation, baseline correction, intensity correction, spectra alignment, peak detection), curve smoothing and peak detection on average spectra (mean spectra) were performed using MALDIquant, MALDIrppa and MassSpecWavelet [39], respectively. This method is based on local maxima detection at different scales, and the retained peaks are observed at many scales as true maxima (true peaks). Thus, it theoretically gets rid of noisy peaks more efficiently than classical algorithms (e.g., about 300 peaks were detected with MALDIquant/MALDIrppa with soft parameters whereas around 100 peaks were detected with MassSpecWavelet for each T. maritimum average spectrum). The resulting peak list is then compared to a reference peak list that encompasses either: (i) full spectra of the type strains of Tenacibaculum species, (ii) species-specific biomarkers or (iii) subtyping biomarkers. We considered two peaks as matching between a sample and the reference with a tolerance of 700 ppm. MALDI-quantTypeR is a home-made R application developed for MALDI type assignment [23].

General genome features
The origins of the sequenced genomes, the main sequencing and assembly data and the genomic characteristics are listed in Table 1 The core-genome encompassed all the predicted toxins and virulence factors (i.e., cholesterol-dependent cytolysin, collagenase, sphingomyelinase, ceramidase, chondroitin AC lyase, streptopain family protease, sialidase, iron uptake systems and T9SS components) previously identified in strain NCIMB 2154 T [20]. In addition, the persistent-genome, equivalent to a relaxed core-genome (i.e., genes conserved in all but a few genomes) encompassed about 2500 gene families representing about 81% of the CDSs. There were about 10% of shell-genome genes (i.e., genes having intermediate frequencies corresponding to moderately conserved genes potentially associated to environmental adaptation capabilities) and about 9% of cloud-genome (i.e., genes found at a very low frequency). These two later values were likely overestimated since the newly sequenced genomes were not fully assembled and because of pseudogenization events and gaps between contigs that lead to gene fragments that artificially increase the total number of shell-genome and cloud-genome gene families. Strikingly, most of the shell-and cloud-genome genes were encompassed in genomic islands (region of genomic plasticity). Each genome contained an average of 26 (SD = 4.65) genomic islands (Table 1). Interestingly, strain FC isolated in Chile encompassed a unique (i.e., not identified in other strains), 83-kb long (MARITFC_v1_10015 to MAR-ITFC_v1_10098) genomic island that displayed prophage characteristics (i.e., located at an Arg tRNA and bordered with an integrase/recombinase encoding gene). This island contained many genes predicted to be involved in heavy metal resistance, including genes encoding proteins of the copper oxidase family and different cobaltzinc-cadmium efflux pumps. We focused on the two fully PacBio-assembled genomes (i.e., strains NCIMB 2154 T and TM-KORJJ) contained in our dataset to accurately identify genes encompassed in these islands. Strain NCIMB 2154 T contained 29 genomic islands (encompassing 384 genes and corresponding to 13% of the total number of CDSs) while strain TM-KORJJ contained 24 genomic islands (encompassing 351 genes and corresponding to 12% of the total number of CDSs). Most of these islands displayed classical prophage characteristics and encompassed CDSs encoding for phage structural proteins, integrases, transposases, insertion sequences, restriction/modification systems and Vgr/Rhs elements or their scars.

Nucleotide diversity and population structure
To estimate the nucleotide diversity, we first used assembled draft genomes and computed the average nucleotide identity (ANI) between pairs of genomes. ANI values ranged between 98.18% and 100% (Additional file 2), far above the species delineation threshold of 95-96% [26], revealing a cohesive species and confirming that all the isolates included in this genomic study indeed belonged to the species T. maritimum.
Using a whole genome alignment built on strain NCIMB 2154 T as a reference and corresponding to 2587 083 bp of conserved concatenated sequence, we identified a total of 86 217 SNPs (mean 21 126 SNPs ± SD 9795) out of which 84 694 (98.2%) where bi-allelic. The average pairwise nucleotide divergence between isolates (Table 1) amounted to 18 496 SNPs corresponding to a diversity π of 0.0071 bp −1 with a maximum of 0.0152 bp −1 . The smallest divergence was 0.0021 bp −1 (5477 SNPs), reflecting an absence of very closely related isolate pairs outside of the three isolates Aq18-85, Aq18-88 and Aq18-89 (no SNPs found between them).
The whole genome alignment was subjected to several tree reconstruction methods. Trees obtained by parsimony or with the more sophisticated Gubbins method that attempts to remove recombination tracts are shown in Additional files 3 and 4, respectively. These trees tended to contain internal branches of substantial length with poor bootstrap values (see parsimony tree S3A) presumably due to recombination. To reflect more faithfully the pairwise distances between isolates, Figure 1 presents a neighbor-joining tree based on a simple Jukes-Cantor distance. In this tree, whose topology is very similar to those of the Parsimony and Gubbins trees, branches with poor bootstrap support tended to disappear (i.e. to have length close to zero). Overall, the topologies and bootstrap values supported the division in three subgroups (corresponding to clades A, B and C) previously observed using MLST data [15]. Clades A, B and C encompassed 3, 20 and 2 strains, respectively. The core genome-based tree and the previously obtained MLSTbased tree [15] showed similar clade composition with the only exceptions of strains DPIF 89/3001-6.2 and DPIF 89/0239-1 that belong to ST24 and ST20, respectively (MLST subgroup C) whereas they belong to core genome clade B. However, the position of these two strains had very poor bootstrap support in the MLST-based tree.
Pervasive recombination was obvious from the difference between the 158 863 changes along the parsimony tree and the 87 770 changes that would have been needed in the absence of homoplasy and recombination at the 86 217 polymorphic positions (considering 1493 tri-and 30 quadri-allelic SNPs). These numbers corresponded to an apparent homoplasy index HI of . To further quantify the impact of recombination, we analyzed the pattern of pairwise nucleotide divergence between closely related genomes, selecting unambiguous pairs of tips in the reconstructed trees such as to be able to distinguish private and shared polymorphism. In practice, we used for this purpose the comparison USC SP9.1 vs. USC SE30.1 (5615 SNPs) and FS08 (1) vs. P2−48 (7980 SNPs). No isolates from clade B satisfied our needs of forming a clearly isolated pair. Only bi-allelic SNPs were used, and the fraction of shared polymorphism (i.e. polymorphism also found outside the pair) represented as much as 78.8% of the sites that distinguished the first pair of isolates and 69.3% for the second pair, indicating a considerable contribution of recombination to divergence since mutation is expected to produce almost exclusively private polymorphism while recombination is expected to produce tracts mixing shared and private polymorphism. A hidden Markov model served to delineate recombination tracts and lead to estimates of the ratio of recombination and mutations to nucleotide-level divergence (r/m) of 20.6 for USC SP9.1 vs. USC SE30.1 and 7.7 for FS08(1) vs. P2−48 (Additional files 5A and B, respectively).
As observed on the phylogenetic trees, strains retrieved from the same geographical origin tended to cluster together, indicating genetic relatedness. Indeed, the two strains originating from the Atlantic coast of Spain (USC SE30.1 and USC SP9.1) formed cluster C while the three strains in group-A [FS08(1), NAC SLCC MFF and P2-48] were all retrieved from countries bordering the Mediterranean Sea. The two Japanese strains NCIMB 2154 T and NBRC 15946 also appeared related. The three strains Aq16-85, Aq16-88 and Aq16-89 isolated in French Polynesia in 2016 were virtually identical to each other (0 SNPs in the 2587 083 aligned positions of the coregenome), suggesting a dissemination of a single clone between fish farms where Platax orbicularis are raised.
In addition, these three strains were related to strain TFA4 also isolated in French Polynesia in 2013. The Chilean strain FC, also originating from the South Pacific, belonged to the same group. Five strains (i.e., JIP 10/97, P1-39, P4-45, 902 and JIP 32/91-4) retrieved from France over 22 years were also grouped together in the NJ tree ( Figure 1).

Identification of potential biomarkers in genomic data
Because genomic data pointed to a cohesive species but still displaying variability to some extent, we aimed to evaluate the potential of MALDI-TOF MS for rapid screening and typing using specific sets of biomarker ions. Since half of the detected peaks in bacterial MALDI-TOF MS spectra correspond to ribosomal proteins [40], we focused on this protein set. Using genome sequence data, we first retrieved all deduced ribosomal protein sequences and predicted the first methionine removal using the Terminator software [37,41]. The molecular weight of each deduced ribosomal protein was computed. We retrieved ribosomal protein sequences without polymorphism (i.e., invariant ribosomal protein sequences hereafter designed as monomorphic) that could serve as species biomarkers and for MALDI-TOF MS spectra internal calibration. We also retrieved ribosomal protein sequences displaying polymorphism (i.e., ribosomal protein sequences with variation hereafter designed as polymorphic) since they are likely relevant biomarkers for strain typing. Strikingly, one-third (18/54) of the ribosomal proteins were monomorphic while the others (36/54) displayed some degree of amino-acid polymorphism that mostly gave rise to a mass change (the resulting information is summarized in Additional file 6). Strikingly, the topology of the hierarchical classification tree deduced from this data set was globally congruent with that of the trees obtained using the core genome genes (Figure 1 and Additional file 3). Indeed, some ribosomal protein-encoding genes displayed clade-specific variations. For example, gene rplU encoded two different versions of the 50S ribosomal subunit protein L21: a 209 amino-acid long RplU protein in clades A and C and a 161 amino-acid long RplU protein in clade B. Other examples were provided by proteins RpmI and RpsP that displayed isoforms only found in clade A or by RplD that displayed an isoform only found in clade C.

Selection of invariant biomarker ions for internal calibration of spectra and species identification
Using our monomorphic biomarker candidates, we performed visual inspection of spectra obtained from 24 out of the 25 genome-sequenced strains to identify the corresponding peaks. We selected 18 peaks (Additional file 7) according to the following stringent criteria: (i) they covered as much of the entire spectra as possible (ranging from 2958 m/z to 12 460 m/z); (ii) they corresponded to mono-, di-or tri-charged monomorphic ribosomal proteins; (iii) they occurred in all MALDI-TOF spectra of the 24 isolates; (iv) most of them were of high intensity (ranging from 98 to 1371, mean intensity = 547), even at both ends of the spectra (though intensity was globally lower at these m/z locations); and (v) they were not disrupted by the presence of other very close peaks which could degrade peak detection reliability; in other words, retained peaks had to be in a window devoid of additional peaks. Only two selected peaks (i.e., RpmJ-M-H1 and RpmE-M-H2) represented exceptions to the latter criterion. Indeed, strains USC SE30.1 and NCIMB 2158 displayed a slightly different peak shape for RpmJ-M-H1 and RpmE-M-H2, respectively, with a larger area under the curve and a flatter bump. This resulted from the presence of two close peaks with the same intensity. However, these peaks did not disrupt the signal and the peaks for RpmJ-M-H1 and RpmE-M-H2 could both be accurately detected. The 18 selected peaks corresponded to 9 ribosomal proteins with varying degrees of ionization. They could serve as internal calibration references using the alignSpectra function for accurate peak detection. In addition, these monomorphic biomarkers were of utmost interest for species identification and were included in the quality control process of spectra (Additional file 8).

Selection of 9 polymorphic biomarkers for strain typing and MALDI-Type attribution
Using our polymorphic biomarker candidates (Additional file 6), we performed visual inspection of the above-mentioned spectra to identify the corresponding peak shifts. We selected 8 polymorphic biomarkers corresponding to ribosomal proteins (i.e., RpmD, RpmC, RpsP, RpsN, RpsO, RpsQ, RplX and RplT) with a molecular weight ranging from 6600 Da to 13 200 Da. An additional polymorphic biomarker, RpsT, was included after screening our collection of 131 isolates (see next paragraph). Indeed, strain Aq8-57 displayed an unexpected peak shift (not correlated to any amino-acid polymorphism observed in the 25 genomes data set). Sanger sequencing of strain Aq8-57 revealed that this peak shift corresponded to a 122A > G mutation in the sequence of the rpsT gene leading to a R41K change in the aminoacid sequence. In addition, strain USC RPM 539.1 also displayed a peak shift not previously observed that corresponded to a 35G > A mutation in the sequence of the rplT gene leading to a R12K change in the amino-acid sequence. These mutations gave rise to a −27 Da shift observed in the spectra compared to the type strain used as a reference. Therefore, the 9 retained biomarkers displayed varying degrees of polymorphism (Additional file 9) ranging from two to up to five isoforms (IF) (Table 2, Figure 2). An arbitrary number was given to each IF for subsequent analysis. By convention, an IF1 numbering was given for each biomarker of the type strain NCIMB 2154 T .

Validation of the typing scheme using field isolates
To validate the strain typing approach, and in addition to the 24 above-mentioned spectra, we analyzed by MALDI-TOF MS 111 field isolates from worldwide origins and retrieved from a variety of fish species (Additional file 1). Among the field isolates, 56 had previously been typed using MLST [15]. Our strategy based on automated peak detection and assignation to the corresponding IF proved to be very efficient with a success rate of 97%. Only four isolates (P1-40, Aq12-62, Aq12-64 and Aq6-9) out of 111 were not fully typed corresponding to 5 (out of 1176) peak losses (0.42%). We chose an MLST-like strategy by combining IF numbering to produce the corresponding MALDI profile as proposed by Zautner et al. [42], and here referred to as MALDI-Type (MT). Using this scheme, we identified 20 MTs (Additional file 1). Using genome sequence data, one could predict the MT of a strain; for instance, strain TM-KORJJ was predicted to belong to MT3. This approach allowed unambiguous assignment for each isolate as well as the use of visualization and analysis tools developed for MLST such as the eBurst [43] and SplitsTree decompositions [44]. Indeed, using eBurst on our dataset, the 20 MTs could be grouped in 4 clusters based on connection by single-locus-variants ( Figure 3A) that can be designated as MALDI-Groups (MGs). Of note, the criterion for the determination of MGs is analogous to the criterion for the determination of clonal complexes (CC) reported in the MLST strategy [45] but does not correspond to the same level of divergence between isolates. Figure 3B is complementary to the graph drawn using eBurst. It depicts a hierarchical clustering (average-link) built from MALDI-Types isomorphic profiles with the corresponding number of isolates.
The MTs and MGs are globally congruent with the core genome-based phylogeny ( Figure 1). Indeed, strains USC SP9.1 and USC SE30.1 both belonged to MT6 and were encompassed in clade C. Strains NAC SLCC MFF, FS08(1) and P2-48 were encompassed in clade A and belonged to MTs 5, 9 and 10, respectively. These 3 latter MTs were linked by single locus variants. All of the above-mentioned strains were encompassed in MG1. Strain DPIF 89/0239-1 belonged to MT7, which was the only singleton among all MGs. In the core genome-based phylogeny, strain DPIF 89/0239-1 also appeared distantly related to the other strains encompassed in clade B. Strain P2-27 belonged to MT19 whereas the type strain Bridel et al. Vet Res (2020) 51:60 NCIMB 2154 T belonged to MT1 and strain NBRC 15946 belonged to MT4. These three strains were encompassed in MG3. Strain UCD SB2 belonged to MT8 (MG2) and also appeared more distantly related to the other strains encompassed in clade B in the core genome-based phylogeny. Finally, all the other strains belonged to MT3 (MG2) with only two exceptions, strains FC and DPIF 89/023-9 that were the only strains displaying incongruence between MT and core genome-based phylogeny.
Although the 131 strains studied were from worldwide origin, our dataset was not suitable to highlight sound association between the isolation sources and the MT. We tried several statistical analyses (AMOVA and Fisher exact test, data not shown), but each tested variable (i.e., country and year of isolation, host fish species) was drastically correlated with each other indicating that these variables are not truly independent. For instance, a high correlation between fish hosts and countries was obvious. However, and in accordance with our previous MLST study [15], we observed trends between the geographical origin of the strains and the MT. For instance, the 4 isolates belonging to MT1 and the 3 isolates belonging to MT4 all originated from Japan. In addition, these two MTs belong to the same MG3 and are linked by Double Locus Variants (DLV). The two isolates belonging to MT2 originated from California. The two isolates belonging to MT5 originated from Italy and Malta, two neighboring countries. The two isolates belonging to MT6 originated from Spain. The three isolates belonging to MT12 originated from Tasmania. In addition, 45 out of the 50 isolates from France belonged to MT3. On the other hand, strains from the same geographical origin could belong to different MTs (e.g., the 10 Tasmanian isolates belonged to 4 different MTs whereas the 5 Italian strains each belonged to a different MT). Of importance, strains retrieved from the same host fish species could belong to different, unrelated MTs (e.g., the 74 strains retrieved from Dicentrarchus labrax belonged to 5 different MTs).

Evaluation of reproducibility, repeatability and alternative sample preparations for MALDI-TOF MS
In order to evaluate reproducibility and repeatability, the T. maritimum type strain, the 22 genome-sequenced T. maritimum strains and the type strains of the 23 other Tenacibaculum species included in this study were subjected to multiple (n ≥ 3), independent, MALDI-TOF MS data acquisitions (one acquisition corresponding to an average spectrum of four technical replicates) using the ethanol/formic acid extraction procedure from fresh bacterial culture. Using this set of reference strains, the accuracy was 100%. In addition, laboratories may use different procedures [46] for MALDI-TOF MS analysis (from sample preparation to data processing and interpretation).
To address these issues, we tested several sample preparation protocols. The quickest and easiest way to acquire MALDI-TOF MS data from a bacterium is direct transfer method. We evaluated this protocol with 104 bacterial isolates arbitrarily sampled among the 135 above-mentioned isolates. Among these, only 9 strains were not fully typed (8.65%) because of about 1% peak loss. We also assessed our method with ethanol-fixed bacteria, a strategy that could facilitate the MALDI-TOF MS typing of isolates from distant locations. We evaluated the effects of long term (up to 1 month) ethanol conservation followed by the regular protein-extraction protocol. Fifty-one out of 62 isolates (82%) were fully typed after 7 to 15 days ethanol storage. However, only 15 out of 25 isolates (60%) were fully typed after 16 to 26 days ethanol storage. Hence, the conservation time significantly affected typing reliability. Of importance, IF assignments were always congruent for the same sample whatever the preparation method used.

A web-based tool for MALDI-Type assignment
We felt that a web-based tool for MT assignment of T. maritimum strains would facilitate data interpretation and help comparisons at a global scale in the same way as previously proposed for MLST schemes [47]. Hence, we developed a friendly application named MALDIquant-TypeR for Tenacibaculum MALDI-TOF data analysis that contains 3 tools. Firstly, in order to avoid misinterpretation with spectra from bacteria that do not belong to the species T. maritimum, we added reference spectra from 24 out of the 29 described Tenacibaculum species, including all the fish-pathogenic or fish-associated species (i.e., T. dicentrarchi, T. discolor, "T. finnmarkense", T. gallaicum, T. ovolyticum and T. soleae). Each reference spectrum is composed of a peak list obtained with Tenacibaculum type strains following Bruker's recommendations (> 20 independent acquisitions). We used stringent parameters and kept peaks with a signal/noise ratio > 3. Thus, each Tenacibaculum species is defined by 20 to 60 peaks. When raw spectra from an unidentified sample are uploaded in the application, a peak list is produced using the same stringent parameters and each retained peak is compared to the reference peak file. A peak is considered matching the reference if the difference between two values is less than 700 ppm. Then, the number of matching peaks between the sample and each type strain is computed. The output is a bar-plot providing the percentage of matching peaks with the references corresponding to the 24 Tenacibaculum species included so far. For instance, all T. maritimum isolates used in this study display at least 50% of common peaks with the T. maritimum type strain (vs < 20% with the type strains of other Tenacibaculum species). This first tool suggests a taxonomic affiliation and only selects the spectra likely belonging to the species T. maritimum. The second tool uses the 18 T. maritimum specific monomorphic biomarkers previously defined. It is based on the same peak matching strategy to accurately identify a spectrum as belonging to the species T. maritimum. In addition, it also provides a valuable measure of spectra data quality (Additional file 8). The third tool is the typing method itself that identifies the IF of the 9 polymorphic biomarkers and provides the isomorphic profile and the resulting combination corresponding to the MT. This profile can further be processed in the same way as an MLST profile. The MALDIquantTypeR application is available online [23] hosted by the Migale platform at INRAE Jouy-en-Josas.

Discussion
The fast development of aquaculture faces an array of sanitary issues, causing important economic losses and impacting the environment and animal welfare [1,2]. The rapid detection of pathogens and the continuous monitoring of circulating bacterial genotypes in space and time is a prerequisite for the implementation of rational control measures [48]. International and cross-sector surveillance of pathogens requires strain typing methods that must be rapid, accurate, resolutive, reproducible and affordable, and that must enable the exchange of molecular typing data via the Internet [1]. Tenacibaculosis is an ulcerative disease affecting many marine fish species of commercial interest worldwide and T. maritimum is considered the main causative agent of tenacibaculosis in wild and cultured fish [5]. Different typing methods for epidemiological investigations of T. maritimum have been proposed [9]. However, they do not fit all the above-mentioned recommendations. For

Figure 3 eBurst network, visual representation of the links between the MALDI-Types and the hierarchical tree based on MALDI-Types isomorphic profile.
A eBurst defines clusters when isolates share 8 out of the 9 polymorphic biomarkers. These clusters may be considered as MALDI-Groups. Each of them contains a central MT, which represents the "founder" MT of the corresponding group. B The classification tree, based on average-link hierarchical clustering (average-link method), offers another view of the links existing between MALDI-Types and MALDI-Groups. The distance used for this tree corresponds to the number of differences between each MALDI profile.
instance, the different serotyping schemes that have been proposed [10,11,49] appear to be unrelated and poorly discriminatory (only three serotypes have been reported to date and some strains display cross-reactivity). Moreover, conventional serology is costly, labor-intensive and requires significant technical expertise and the use of animals to raise anti-sera. On the other hand, an MLST scheme has been proposed [15] and proved to be a powerful discriminating tool for isolate identification and strain typing. However, classical MLST is also costly, labor-intensive and time consuming. Whole genome sequencing represents the "ultimate" typing methodology in terms of discriminatory power. However, it is not yet suitable for real time monitoring of large collections of bacterial isolates and is mostly used for retrospective analysis.
In this study, we first used genomic comparisons of 25 strains including 22 newly draft-sequenced genomes to draw a global picture of the genomic diversity of the species. Tenacibaculum maritimum appears as a cohesive bacterial species which strains are characterized by: (i) a high genomic identity (ANI > 98%); (ii) a similar genome size (~3,4 Mb); and (iii) a moderate level of nucleotide divergence (maximum 1.52% in pairwise core-genome sequence comparisons) while typical bacterial species can exhibit up to ~5% nucleotide divergence [50]. In addition, a major contribution of recombination (r/m ≥ 7) in the evolutionary process of the species was observed, reminiscent of the situation observed in another fish-pathogenic species of the family Flavobacteriaceae, Flavobacterium. psychrophilum [35]. Our data set encompasses two fully PacBio-assembled genomes (i.e., NCIMB 2154 T and TM-KORJJ) that are perfectly collinear, pointing to a similar chromosomal organization without major genomic rearrangements.
Overall, our results based on 25 genomes are not only in good accordance with the conclusions drawn from our previous analysis using a 11 loci-based MLST data set [15], but they provide unprecedented details on genome content and organization. Tentative phylogenomic tree reconstructions using different methods are congruent and support the division in three main clades (A, B and C). In addition to core-genome genes similarities, clades A and C share some common genomic features. For example, the RplU encoding gene version is obviously different between clades A/C and clade B strains (209 amino-acid long in clade A and C strains vs a 161 aminoacid long in clade B strains). The metE gene, encoding methionine synthase, is full length in clade B strains but pseudogenized in clades A/C strains suggesting a nonfunctional methionine biosynthesis pathway in the latter. Additional clade-specific features were identified. For example, the two clade C strains display non-functional, frame shifted versions of genes of which full-length versions are present in all the other genomes examined (e.g., strain NCIMB 2154 T locus tags: MARIT_0127,  MARIT_0229, MARIT_0258, MARIT _0523, MARIT  _0991, MARIT_1553, MARIT_1615, MARIT_1833,  MARIT_1972, MARIT_2492, MARIT_2635 and MARIT_2635). These observations performed on a limited number of genomes may face some exceptions (e.g., due to gene shuffling by homologous recombination). However, the presence of such gene remnants in the genome of some strains argues for genome reduction trends as frequently observed in bacterial pathogens [51]. Strikingly, most of the variable-genome encoding genes are located in genomic islands, some restricted to a single strain while others are shared between several strains. It is therefore tempting to speculate that some islands provide a fitness advantage such as the one containing heavy metal resistance genes identified in strain FC to face an environment polluted by heavy metals. Indeed, important copper concentrations in sediments because of disposal of copper mine tailings has been documented for bays of northern Chile, the region where strain FC was isolated from. A copper-resistant Vibrio sp. strain was retrieved from cultured scallop [52] from the same geographical area suggesting convergent evolutionary mechanisms for heavy metal resistance in these phylogenetically unrelated marine bacteria.
In addition to providing a global picture of the genomic diversity, the use of genomic data has been a key step in the MALDI-TOF MS scheme proposed in this study by allowing a better understanding of the spectral composition. The scheme follows Sauget et al. [46] MALDI-TOF MS recommendations by using the genomic information for selecting relevant biomarkers, by choosing ribosomal proteins that should be unambiguously identified within the spectra and by focusing exclusively on pic shifts for the typing purpose. Indeed, we were able to retrieve from the spectra 18 out of 54 in silico-identified ribosomal proteins. The total number of biomarkers (18 monomorphic and 9 polymorphic) corresponds to about one-third of the detected peaks. In addition, most of these peaks display high intensity, even at both ends of the spectra. Combining genome mining and visual exploration of the spectra, we linked each noteworthy peak shift with the corresponding polymorphism in genomic sequences (Additional file 9). Strikingly, the 9 polymorphic biomarkers and their corresponding isoforms (25 in total) were defined from only 10 strains (out of 25 sequenced isolates), reflecting the cohesive nature of the T. maritimum species. The strategy based on highly relevant biomarkers and their selection based on stringent criteria proved to be particularly effective. Using 111 field isolates and the ethanol/formic acid extraction procedure, automated peak detection displayed a very high level of success rate (> 99.5% of the polymorphic biomarkers identified). Because fluctuations in MALDI-TOF MS results might be caused by the use of different bacterial growth conditions, sample preparation procedures or matrix used [46], we evaluated the robustness and reliability of the proposed typing scheme by using biological and technical replicates as well as alternative sample preparations (i.e., direct transfer method and ethanol/formic extraction after ethanol conservation). As detailed above, the results are still very good when alternative sample preparation procedures are used, though the success rate may be lower, in particular using long-term ethanol conservation. The observed peak detection failures may be attributed to random ion suppression intrinsically linked to the analogical nature of the spectra [46]. This concern may, however, be simply addressed by data re-acquisition (Additional file 8).
Because peak shifts are the direct consequence of a non-synonymous mutation in a ribosomal protein, a peak shift may be considered as an allelic change as proposed by Zautner et al. [42]. We therefore treated peaks shifts as a combination of alleles. A single strain would thus display a unique combination that could be considered as a MALDI-type (MT). These MTs are therefore the equivalent of the sequence types (STs) or electrophoretic types (ETs) resulting from the MLST or MLEE schemes, respectively. In the same way as the latter two schemes, the proposed Multi Peak Shift Typing (MPST) scheme for T. maritimum could be enriched in the future by considering additional biomarkers and/or by the identification of additional IF in the retained biomarkers. New, yet unobserved, combinations of isoforms may also be identified in the future. The definition of new MTs will follow the same incremental process as previously proposed for MLST schemes.
Applied to our collection of T. maritimum isolates, this method identified 20 MTs grouped in 4 MGs. Because our goal was to propose and to evaluate a MPST scheme dedicated to the typing of T. maritimum strains, we used isolates from broad geographical, temporal and host-fish species origins to cover as much of the species diversity as possible. We are aware of the inherent bias of such a sampling choice. Indeed, the variables (i.e. country, year of isolation, host fish species) were highly correlated with each others; therefore, this dataset is not suitable for highlighting sound associations between isolation sources and MTs. However, we observed trends between the geographical origin of the strains and the MT as previously reported using MLST data [15]. In addition, and in line with previous MLST-based conclusions, our analysis revealed no trace of long-distance dissemination of T. maritimum that could be linked to the international trade of fish or eggs.
MPST schemes based on MALDI-TOF MS data may prove suitable for large-scale epidemiological studies for a number of reasons: (i) the approach is similar to MLST, and the tools used to analyze MLST data can also be used on MPST data; (ii) MALDI-TOF acquisitions are much cheaper and faster (about 20 minutes for a whole MALDI-TOF plate acquisition) than sequencing-or PCR-based MLST; (iii) 96 samples can be deposited on the same MALDI-TOF plate; and (iv) the scheme can be transposed to any bacterial species that contains polymorphic ribosomal proteins or any noteworthy polymorphic biomarker. Indeed, the peak shifts caught by MALDI-TOF MS are highly correlated to the genotype as defined in any MLST analysis. MALDI-TOF spectra databases and processing tools can be easily exported on the Internet as a webtool application in the same way as MLST web-based tools (e.g., PubMLST.org). MALDI-quantTypeR is one example of such application available online [23] as well as other dedicated websites such as the Mass Spectrometry Identification platform [53]. Largescale studies could easily benefit from worldwide data collection. We have shown that ethanol (at least shortterm) conservation can be used for MALDI-TOF identification and typing. This method also suppresses biological hazard and should facilitate international transportation of bacterial samples. Other solutions may be designed for international collaboration, such as direct shipment of MALDI-TOF plates with fixed biological material, ready for acquisition. MPST analysis could also be interesting in local surveys monitoring bacterial populations in particularly sensitive geographical areas.
USC SE30.1. In 5B the strains compared are FS08(1) and P2−48. (Upper) Positions of the SNPs along the genomes. SNP index is reset every 100 SNPs for this representation. Each dot corresponds to one SNP in the comparison between the two considered isolates. Colors distinguish two types of polymorphism: in blue, polymorphism observed only between the two considered genomes; in red, polymorphism also observed among the other sequenced genomes. Areas in gray correspond to regions not covered by our alignments. SNPs in regions where probability is < 0.5 (i.e. Additional file 6. Heatmap displaying the diversity of ribosomal protein weights. Lines correspond to strains and columns correspond to ribosomal proteins in ascending order by weight (from left to right). White lines indicate no variation of the weight of the corresponding proteins (i.e., monomorphic proteins). In purple, proteins with weight higher than the mean and in orange proteins with weight lower than the mean (i.e., polymorphic proteins). Additional file 7. Monomorphic biomarker peaks. Screenshots of the 18 conserved peaks produced by 9 ribosomal monomorphic proteins with several degrees of ionization. The m/z values are highlighted by dotted lines and cover the entire spectrum. The red curve corresponds to the T. maritimum type strain average spectra. For each peak, the corresponding ribosomal protein is indicated with the degree of ionization (H1, H2 and H3 corresponding to 1, 2 and 3 H + ) and the presence (M) or absence (m) of the first methionine. Color code: red line for the T. maritimum type strain NCIMB 2154 T and black for the sequenced T. maritimum isolates.
Additional file 8. Quality control and T. maritimum species identification. The full dataset is composed of representatives of 24 Tenacibaculum species including 135 isolates belonging to the species T. maritimum. It encompasses 476 independent acquisitions (one acquisition corresponds to an average spectrum of several technical replicates) corresponding to 5102 spectra including technical and biological replicates. This dataset was divided into two groups: the positive control group (T. maritimum isolates) and the negative control group (the type strains of 23 other Tenacibaculum species). In order to confirm that an isolate belongs to the species T. maritimum, the spectra were scanned to identify the 18 T. maritimum monomorphic biomarkers. However, some of these biomarkers could be absent from a number of T. maritimum strains. Reciprocally, strains that do not belong to the T. maritimum species may possess some T. maritimum monomorphic biomarkers. In order to set up a T. maritimum species identity threshold, a value corresponding to the number of monomorphic biomarkers identified in a single sample was computed. The monomorphic biomarkers frequency plot obtained shows a bimodal distribution ( Figure A). All samples with a score above 60% correspond to bona fide T. maritimum isolates while all samples with a score below 25% belong to other Tenacibaculum species. True and false positives correspond to isolates correctly and incorrectly identified as T. maritimum (TP and FP), respectively. On the other hand, true and false negatives correspond to correctly and incorrectly rejected isolates (TN and FN, respectively). Positive isolates correspond to those having an identity value above a defined threshold. Using the full dataset, the number of TP and FP and the number of TN and FN were counted. The accuracy of the tool [i.e., (TP + TN)/(TP + TN + FP + FN)] was then computed by increasing the threshold value from 0% to 100% by a 0.1% step. One could observe than the accuracy varies from 0% to 100% and is maximal (i.e., above 97%) between 25% and 60% of threshold value ( Figure B). It is therefore proposed that a 60% threshold value safely identifies isolates as belonging to the T. maritimum species. Using this 60% threshold value, only 10 false negatives (i.e., bona fide T. maritimum isolates discarded) and 0 false positive (i.e., not T. maritimum isolates) were found out of 476 acquisitions. Among the 10 false negatives, 3 resulted from the extraction protocol, 3 from the Direct Deposit with Formic Acid (DDFA) protocol and 4 from the ethanol conservation protocol (see section "Testing alternative sample preparation for MALDI-TOF MS"). One can hypothesize that these rare false negatives correspond to technical problems. Indeed, by performing new spectra acquisitions on the 3 isolates previously analyzed by the extraction protocol, the identification score reached at least 88%, far above the safe threshold value of 60%, demonstrating that the original spectra were likely faulty. Finally, spectra from other Tenacibaculum species all had a score below 23%, far below the confidence threshold.
Additional file 9. Amino-acid polymorphism of the 9 retained biomarkers. Protein sequence alignments of the 9 ribosomal proteins selected as polymorphic biomarkers and their corresponding isoform (IF).