Improved strategy for phylogenetic analysis of classical swine fever virus based on full-length E2 encoding sequences

Molecular epidemiology has proven to be an essential tool in the control of classical swine fever (CSF) and its use has significantly increased during the past two decades. Phylogenetic analysis is a prerequisite for virus tracing and thus allows implementing more effective control measures. So far, fragments of the 5´NTR (150 nucleotides, nt) and the E2 gene (190 nt) have frequently been used for phylogenetic analyses. The short sequence lengths represent a limiting factor for differentiation of closely related isolates and also for confidence levels of proposed CSFV groups and subgroups. In this study, we used a set of 33 CSFV isolates in order to determine the nucleotide sequences of a 3508–3510 nt region within the 5´ terminal third of the viral genome. Including 22 additional sequences from GenBank database different regions of the genome, comprising the formerly used short 5´NTR and E2 fragments as well as the genomic regions encoding the individual viral proteins Npro, C, Erns, E1, and E2, were compared with respect to variability and suitability for phylogenetic analysis. Full-length E2 encoding sequences (1119 nt) proved to be most suitable for reliable and statistically significant phylogeny and analyses revealed results as good as obtained with the much longer entire 5´NTR-E2 sequences. This strategy is therefore recommended by the EU and OIE Reference Laboratory for CSF as it provides a solid and improved basis for CSFV molecular epidemiology. Finally, the power of this method is illustrated by the phylogenetic analysis of closely related CSFV isolates from a recent outbreak in Lithuania.


Introduction
Classical swine fever is a devastating animal disease of great economic concern worldwide [1]. The causative agent, classical swine fever virus (CSFV), is highly contagious and infects domestic pigs as well as wild boar. Infection is transmitted either by direct or indirect contact between infected pigs, by contaminated food or swill feeding, but also by transmission via contaminated objects and/or persons. Molecular virus tracing helps to understand sources and pathways of infection and therefore is an important tool for disease control [2,3].
During the past two decades technical developments like real-time RT-PCR promoted a reliable and rapid CSF diagnosis [4-6]. CSFV can be divided into three genotypes (1, 2 and 3), each comprising three or four subgenotypes (1.1-1.3; 2.1-2.3; 3.1-3.4) [7,8]. To assign a newly identified CSFV isolate to a genotype and to describe its phylogenetic relations to other known isolates, nucleotide sequencing is mandatory: other techniques like restriction enzyme analysis may allow segregation on genotype level, but their resolution on subgenotype level is often insufficient [7,9-11]. Within the highly variable genus Pestivirus (single-stranded positive-sense RNA viruses) CSFV is the least variable member [7]. During long-lasting epidemics, CSFV has been shown to be relatively stable and only few nucleotide changes can be expected [7,12,13]. For example, during an outbreak in the Netherlands in 1997-1998 sixteen CSFV isolates were genetically characterized. In a time span of more than one year only 0-3 differing nucleotides were observed in the variable E1/E2 encoding region (850 nt) and no nucleotide exchanges were found in a 321 nt fragment located in the 5´NTR [13].
Substitution rates between 2 × 10⁻³ and 5 × 10⁻⁴ substitutions/nucleotide/year were estimated for different regions of the CSFV genome [7,14,15]. These examples emphasize that a region of high variability and adequate length is required for reliable phylogenetic analyses and molecular epidemiological investigations.
Until now, genetic typing has mainly relied on a short fragment (150 nucleotides) of the 5´ nontranslated region (NTR) [7,8,16]. Furthermore, two additional genome fragments of 190 nt and 409 nt length, located in the E2 [7] and NS5B [17] coding regions, respectively, were proposed for a standardized and harmonized strategy for genetic characterization of CSFV [8]. For this reason, sequences of these three regions were included in the CSFV database (CSFV-DB) of the EU Reference Laboratory for CSF (EURL) in Hannover [18]. To date (January 1st, 2012), this web-based database provides 662 5´NTR fragment, 526 E2 fragment, and 44 NS5B fragment sequences originating from 927 different CSFV isolates. Furthermore, 592 reference sequences from GenBank are included in CSFV-DB, resulting in a total of 1519 sequence entries. The CSFV collection at the EURL and its corresponding sequence database have become a valuable tool for CSFV control in Europe.
In today's routine diagnostic, genetic typing of CSFV relies on sequences of the 5´NTR and E2 fragments to characterize individual virus isolates. However, the short sequence lengths of these fragments often hamper the ability to distinguish closely related isolates during an outbreak situation and result in phylogenetic analyses showing only low statistical significance as reflected by bootstrap values below 70% [8]. These limitations are the reason for an ongoing debate on how to best improve the strategy for molecular characterization, phylogenetic analysis and classification of CSFV isolates into defined subgenotypes.
New technologies like high-throughput sequencing allow rapid determination of whole CSFV genome sequences, but are still cost-intensive and only available in a few institutions. Their usage is therefore limited to special scenarios [12]. To achieve broad acceptance, a new strategy should improve the quality of generated data while remaining easily practicable, robust and cost-effective.
The rapidly growing number of full-length E2 (1119 nt) encoding sequences in public databases like GenBank and recent publications reflect the interest in this genomic region [19-21]. In addition to the phylogenetic aspect, the E2 coding sequence is of particular interest as the E2 protein is the major immunogen besides the Erns and NS3 proteins [20,22]. For that reason E2 and Erns are suitable targets for diagnostic purposes, including development and implementation of a DIVA (differentiating infected from vaccinated animals) strategy in connection with a live-attenuated marker vaccine [23,24]. Full-length E2 gene sequencing may extend the knowledge about conserved epitopes in the E2 protein suitable for the development of reliable diagnostic tools.
The aim of the present study was to establish an improved strategy for genetic typing of CSFV isolates. The results of our work demonstrate that phylogenetic analyses of either 5´NTR-E2 or full-length E2 encoding sequences allow a clear assignment of CSFV isolates to a subgenotype, being supported by reliable bootstrap values. Discrimination of highly similar virus isolates which were not distinguishable by the analysis of the previously used short, partial 5´NTR or E2 fragment sequences is also possible. Compared to the latter, the analysis of full-length E2 encoding sequences provides a considerable increase of information without requiring more time or higher expenses and is therefore recommended to assist future epidemiological studies on CSF.

CSFV isolates and sequences
All isolates (n = 33) selected from the CSFV-DB held at the EURL in Hannover are shown in Table 1. The CSFV-DB was built up in the 1990s to collect European CSFV isolates and for that reason mainly contains isolates of genotype 2; this is also reflected in the genotype representation of the sequenced isolates (30/33 isolates belonging to genotype 2). The set of CSFV isolates used here corresponds to a selection applied in a former study, which aimed to choose a representative heterogeneous set from the isolates available in the CSFV-DB [25]. An additional 52 5´NTR-E2 sequences were obtained from the GenBank data library to achieve a dataset that represents a higher variety of genotypes and subgenotypes. Twenty-two of these 52 sequences were included in the phylogenetic analyses, comprising sequences of 15 genotype 1 isolates, three Asian genotype 2.1 isolates, two rare genotype 3 isolates, as well as the subgenotype 2.3 reference strain "Alfort-Tuebingen" and one recombinant isolate. Further GenBank sequences (including [GenBank: HQ148062, HQ148061, GU324242, GU233734, GU233733, GU233732, GU233731, FJ265020]) were used to analyze sequence variability, but were not included in the phylogenetic analyses to reduce tree sizes.
Furthermore, CSFV positive samples were obtained from a CSF outbreak in Lithuania in 2011. From each of the five pig holdings affected, two samples were chosen for determination of full-length E2 encoding sequences and subsequent phylogenetic analysis. The five cases were connected geographically and in time, as they occurred in holdings at most 15 km apart and within a period of 38 days. The first case (confirmed on June 1st), the second and the third case (June 3rd) were connected directly to each other as the respective pig holdings belonged to the same company. From the first affected holding piglets were delivered to holding no. 2 (on May 10th) and holding no. 3 (on May 19th). The epidemiological link to cases no. 4 (confirmed on July 4th) and no. 5 (July 8th) is unknown, but it was speculated that virus was transmitted by movement between the farms.

Isolation of viral RNA
All RNAs were isolated from cell culture supernatants derived from cells infected with CSFV isolates of the CSF virus collection located at the EURL, Hannover. RNA was purified from 140 μL supernatant of an infected PK15 cell culture using the ViralAmp RNA purification kit as recommended by the manufacturer (Qiagen, Hilden, Germany). RNA from samples of the recent CSF outbreak in Lithuania in 2011 was isolated from organ and serum samples using the ViralAmp and the RNeasy kit (Qiagen), respectively.

Double stranded nucleotide sequencing
A set of CSFV reference isolates (n = 33) from the EURL virus database was used to expand the knowledge on sequence variability in the 5´-region of the CSFV genome. A region of 3508-3510 nucleotides including the commonly sequenced 5´NTR fragment, the region encoding the N-terminal protease Npro and the structural proteins (C, Erns, E1 and E2) was amplified by RT-PCR as described above. RT-PCR amplicons were separated by agarose gel electrophoresis (100 V, 30 min) and purified using a commercial kit according to the manufacturer's recommendations (GeneJet Gel Extraction Kit, Fermentas, St. Leon-Rot, Germany). Band elution from the agarose gel yielding at least 15 ng DNA/μL was sufficient to obtain sequences of good quality with a length of 800-1000 nucleotides. Purified amplicons were subjected to double-stranded Sanger sequencing (Qiagen). Broadly reacting sequencing primers were designed using sequences available from the GenBank database (Table 2).

Sequence analysis
Sequences were analyzed, edited and trimmed using the freeware program Gentle (by M. Manske). Multiple sequence alignment was performed with the "MUltiple Sequence Comparison by Log-Expectation" tool (MUSCLE) [26]. Different parts of these sequences were used for phylogenetic analysis. Distances of sequences were calculated by the Kimura 2-parameter method, as stated in the legend to Figure 3.

Table 2 GenBank sequences used for phylogenetic analyses.

Amplification and determination of CSFV sequences
The 5´-terminal part of the CSFV genome was amplified in three overlapping amplicons to determine a nucleotide sequence of 3508-3510 nucleotides, comprising a part of the 5´NTR and the genomic region encoding Npro, C, Erns, E1, E2, and the N-terminal two thirds of p7 (Figure 1). For this purpose, regions conserved among all CSFV genotypes were identified by multiple sequence alignment using 52 CSFV sequences available from GenBank. Three primer pairs for RT-PCR based amplification of the target sequences and two additional sequencing primers located in the E2 coding sequence were designed (Figure 1, Table 3). For all of the 33 isolates included in this study, independently of their genotype, PCR products were obtained.

Genetic variability of CSFV
The genetic variability was calculated for all available CSFV sequences (n = 85) including the 33 newly determined sequences [GenBank: JQ411559-JQ411591].
The length of the 5´NTR-E2 region varies between 3508 and 3510 nucleotides due to one or two additional adenine bases in a poly-adenine stretch located in the 5´NTR. The majority of 5´NTR-E2 sequences (n = 62) have a length of 3508 nucleotides, 22 sequences have a length of 3509 nucleotides, and one sequence has a length of 3510 nucleotides [GU233731]. Calculations were performed for the 5´NTR and E2 fragments as well as for the sequences coding for the non-structural protein Npro and the structural proteins to determine the intrinsic discriminatory ability of the individual genomic regions (Table 4). Furthermore, variability was determined for all isolates independently of their genotype or subgenotype assignment (n = 85) as well as on genotype level for genotype 2 (n = 48) and on subgenotype level for subgenotype 2.3 (n = 28). Genotype 2 and subgenotype 2.3 were chosen as representatives because they include the majority of the more recently identified CSFV isolates from Europe and account for the largest number of available sequences. Analysis of the entire 5´NTR-E2 sequences revealed that about 46% of the positions were variable when all genotypes were included, while 34% and 17% variability were observed within genotype 2 and subgenotype 2.3, respectively. With the exception of the more conserved 5´NTR (9% variable nucleotide positions), variability in the different regions encoding the individual proteins was quite uniform (Table 4). The Npro coding sequence and the E2 fragment exhibited a slightly increased variability, about 4-7% higher than the variability of the entire 5´NTR-E2 region (Table 4).
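The share of variable positions reported above can be illustrated with a minimal Python sketch. It is not the authors' software; the function name `variable_fraction`, the toy alignment, and the choice to ignore gap and `N` characters are assumptions made for illustration only.

```python
def variable_fraction(alignment):
    """Count alignment columns that contain more than one distinct base.

    `alignment` is a list of equal-length aligned sequences (strings).
    Gaps ('-') and ambiguous 'N' are ignored when comparing bases; this
    is an assumption of the sketch, not a documented choice of the study.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("sequences must have equal length")

    n_variable = 0
    for column in zip(*alignment):
        bases = {b.upper() for b in column if b not in "-Nn"}
        if len(bases) > 1:
            n_variable += 1
    return n_variable, length, n_variable / length


# Toy alignment (hypothetical data): columns 3 and 5 differ.
toy = ["ACGTAC", "ACGTTC", "ACCTAC"]
n_var, n_cols, frac = variable_fraction(toy)
print(f"{n_var}/{n_cols} variable positions ({frac:.0%})")
```

Applied to a real alignment, the same count gives figures such as the 14/150 variable positions (about 9%) quoted for the 5´NTR fragment.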

Genetic distances on genotype, subgenotype and isolate level
Matrices of genetic distances were generated using the 5´NTR fragment, the E2 fragment, the full-length E2 and the 5´NTR-E2 sequence to find out whether it is possible to establish breakpoints between genotype, subgenotype and isolate level (Figure 2). Genetic distances of longer stretches like the full-length E2 and the 5´NTR-E2 sequences allowed clear segregation between genotypes and subgenotypes, respectively. Genetic distances of the 5´NTR-E2 region were ≤ 7.7% among isolates of an individual subgenotype, 5.5%-12.1% between the subgenotypes of one genotype, and 14.5%-19.9% between different genotypes. A similar pattern was found for the full-length E2 encoding sequences, displaying evolutionary distances of ≤ 8.5% on isolate level, 6.3%-14% on subgenotype level and 15.6%-23% on genotype level, respectively (Figure 2). This illustrates that even for longer sequence stretches, the genetic distance between two members of the same subgenotype can exceed the distance between isolates of two different subgenotypes. Particularly, the observed high variability within genotype 1 as well as between subgenotypes 2.1 and 2.2 does not allow a clear assignment of CSFV isolates to defined/established subgenotypes by the level of genetic distances (Figure 2).

Figure 1 Strategy for RT-PCR and nucleotide sequencing. The 5´-terminal portion of the classical swine fever virus genome encompassing parts of the 5´ nontranslated region (NTR) and the region encoding the N-terminal autoprotease Npro and the structural proteins C, Erns, E1, and E2 was amplified by RT-PCR in three overlapping amplicons (PCR1, PCR2 and PCR3). The analyzed regions include the commonly used 5´NTR fragment (150 nt) and E2 fragment (190 nt) sequences. Location of primers (indicated by arrows) and sequence length of the 5´NTR-E2 region (3508-3510 nt) correspond to the sequence of CSFV strain Alfort187 [GenBank: X87939] and can differ between the isolates due to presence or absence of nucleotides in the 5´NTR. Further information on all primers is given in Table 3.

[...] the 5´-terminal portion of "strain 39" and "cF114" (subgenotype 1.1) is confirmed by our analysis of the 5´NTR (Figure 3), whereas the entire 5´NTR-E2 sequence stretch of "strain 39" as well as the E2 fragment and the E2 full-length sequence showed the highest homology with various newly determined sequences of CSFV 2.2 isolates (CSF0014, CSF073, CSF0378, CSF0573, CSF0906) instead of CSFV 2.1 isolates (Figure 3, Figure 4). Genetic distances of these five 2.2 isolates and "strain 39" were 4.7%-8.8% in the full-length E2, whereas the genetic distance between "GWZ02" (subgenotype 2.1) and "strain 39" was 12.7% and thus considerably higher (data not shown). In consequence, the results of our analysis clearly demonstrate that "strain 39" harbors the structural genes of a subgenotype 2.2 isolate rather than of a 2.1 isolate (Figure 3, Figure 4). With the exception of "strain 39", no other recombination events between different genotypes or subgenotypes could be observed when the trees based on different parts of the 5´NTR-E2 genomic regions were compared.

Figure 3 Phylogenetic trees based on the 5´NTR fragment and the entire 5´NTR-E2 sequences. Phylogenetic trees of 33 sequences of isolates from the EURL database (country, year, CSF number) and additional 22 reference sequences originating from GenBank (isolate name, accession number) were calculated by the Neighbor Joining method including bootstrap values for 1000 repetitions. Only statistically significant bootstrap values (≥ 70.0%) are indicated. Evolutionary distances between sequences were calculated by the Kimura 2-parameter method. Trees were rooted at the distinct CSFV strain Great Britain/1964 "Congenital Tremor" [GenBank: JQ411575]. Genotypes and subgenotype names are indicated beside the trees [7,8]. Branch lengths are given as 0.01 substitutions per position according to the scale bars underneath each tree.
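The Kimura 2-parameter distance named in the Figure 3 legend corrects the observed proportion of transitions (P) and transversions (Q) as d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q). The following Python sketch shows this standard formula for a single pair of aligned sequences; it is an illustration only, not the software used in the study. The function name `k2p_distance`, the conversion of U to T, and the skipping of gaps and ambiguous bases are assumptions of this sketch.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned nucleotide sequences.

    Positions where either sequence has a gap or a non-ACGT symbol are
    skipped (an assumption of this sketch). Raises ValueError if the
    sequences are too divergent for the logarithms to be defined.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to equal length")

    s1 = seq1.upper().replace("U", "T")
    s2 = seq2.upper().replace("U", "T")
    transitions = transversions = compared = 0
    for a, b in zip(s1, s2):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable positions")

    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Toy example (hypothetical sequences): one transition in ten positions.
print(round(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"), 4))
```

A full distance matrix is obtained by applying the function to every pair of sequences in the alignment; tree building (Neighbor Joining with bootstrap resampling) is a separate step not covered here.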

Influence of the analyzed region on phylogeny
So far, phylogenetic analysis of CSFV was routinely performed on the basis of the 150 nt 5´NTR fragment and the 190 nt E2 fragment. To analyze the limitations of these short regions and to find a suitable improved strategy, the multiple sequence alignment of the 5´NTR-E2 region (3508-3510 nt) was divided into several subsets, corresponding to the 5´NTR and E2 fragments as well as the regions encoding the individual viral proteins Npro, C, Erns, E1, and E2, which were subsequently analyzed separately. To achieve better comparability, generated phylogenetic trees were rooted against the most distinct isolate "Congenital Tremor" (CSF0410). With the exception of the 5´NTR, Npro and E1 encoding sequences, all regions resulted in a similar phylogenetic grouping and subgrouping independently of the method used. For these three regions, a clear distinction of isolates of subgenotypes 1.1 and 1.2 was achieved neither by the commonly used Neighbor Joining method nor by other phylogenetic calculations like Maximum Likelihood or Bayesian analysis (data not shown). Neighbor Joining trees of the 5´NTR, the E2 fragment, the full-length E2 and the 5´NTR-E2 sequence rooted at the isolate Great Britain/1964 "Congenital Tremor" (CSF0410) are shown in Figure 3 and Figure 4. Trees based on the E2 fragment and the full-length E2 encoding sequences are similar to the trees based on the complete 5´NTR-E2 sequences. The phylogenetic tree based on the 5´NTR fragment allowed a rough genotyping, but failed to differentiate between the subgenotype 1.1 isolates "CAP" and "Glentorf" and the subgenotype 1.2 strains "CS" and "RUCSFPLUM" (Figure 3, Table 5). Apart from the isolates belonging to subgenotypes 1.1 and 1.2, the trees based on the Npro and E1 coding sequences showed a relatively high resolution (data not shown), whereas in the 5´NTR fragment based tree eleven branches comprised two or more isolates that were not distinguishable from each other (Figure 3, Table 5).

full-length E2
To gain more detailed insight into the discriminatory ability of the individual genomic regions, the different sequence data sets subjected to phylogenetic analyses were investigated systematically ( Table 5). Some of the individual groups of isolates not distinguishable by the analysis of the 5´NTR fragment comprise strains with an overall high identity reflecting their outbreak history and geographic origin, while other groups encompass strains showing a relatively high sequence divergence with respect to the entire 5´NTR-E2 region (up to 214/3509 variable positions, Table 5). The latter situation was observed for isolates belonging to individual subgenotypes (1.1, 2.1, and 2.3), but also for groups of isolates of different subgenotypes (1.1 and 1.2), again illustrating the limitations of the 5´NTR fragment for discrimination of CSFV isolates. Very closely related and almost identical, recently obtained 2.3 isolates from Slovakia and Hungary (CSF1027, CSF1032), German isolates from the 1990s (CSF0083 and CSF0600; CSF0485 and CSF0638), old subgenotype 1.1 reference strains like "Alfort187" and "LOM" [GenBank: X87939, EU789580] or sequences from two different passages of strain "Brescia" (CSF0947, [GenBank: AF091661]) are either not distinguishable from each other or only at low confidence levels ( Figure 3, Table 5).
For most isolates the best discrimination was achieved with the sequences encoding Npro (504 nt) and E2 (1119 nt), respectively (Table 5). Although Npro and E1 coding sequences show a high degree of variability, phylogenetic analyses revealed that these regions are less suited for clear differentiation of 1.1 and 1.2 isolates when compared to analysis of the full-length E2 genes (data not shown). A reliable differentiation of all analyzed strains, even of very closely related isolates, was possible based on phylogenetic analysis of the full-length E2 encoding sequences (Figure 4, Table 5). This is also reflected by significantly higher bootstrap values supporting the clustering in the tree based on full-length E2 gene sequences when compared to phylogenetic analyses based on the E2 fragment (Figure 4). For example, bootstrap values at the 17 nodes within subgenotype 2.3 (≤ 8.5% genetic distance) were significant (≥ 70%) in only five cases when trees were generated with the E2 fragment, whereas 11 and 13 of the 17 nodes showed values ≥ 70% when full-length E2 and the entire 5´NTR-E2 sequences were analyzed, respectively. Accordingly, phylogenetic analysis of the entire 5´NTR-E2 region resulted in only slightly increased bootstrap values when compared to the analysis of full-length E2 encoding sequences, although the former is about three times as long (Figure 3, Figure 4). Taken together, the results of the present study show that phylogenetic analysis of full-length E2 encoding sequences allows differentiation of even closely related isolates and that segregation is supported by adequate confidence levels.
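The node counts reported for subgenotype 2.3 can be expressed as shares of supported nodes with a few lines of Python. The counts are taken from the text; the helper name `support_share` and the region labels are illustrative assumptions.

```python
def support_share(supported_nodes, total_nodes):
    """Fraction of nodes whose bootstrap value reaches the chosen threshold."""
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    return supported_nodes / total_nodes


# Nodes with bootstrap values >= 70% among the 17 nodes within subgenotype 2.3.
TOTAL_NODES = 17
supported = {
    "E2 fragment (190 nt)": 5,
    "full-length E2 (1119 nt)": 11,
    "5'NTR-E2 (3508-3510 nt)": 13,
}

for region, n in supported.items():
    print(f"{region}: {n}/{TOTAL_NODES} = {support_share(n, TOTAL_NODES):.0%}")
```

Under these counts, extending the E2 fragment to full-length E2 roughly doubles the share of supported nodes, while the further extension to the entire 5´NTR-E2 region adds two more nodes.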

Application of the established strategy during recent Lithuanian CSF outbreak
In 2011, a CSF outbreak involving five domestic pig holdings was reported from Lithuania. From each of the five pig holdings affected, two samples were chosen for determination of full-length E2 encoding sequences. At first sight, routine analysis of the 5´NTR (150 nt) and E2 (190 nt) fragments detected no sequence differences from the Lithuanian CSFV isolate originating from an outbreak in 2009. To study the genetic relatedness of these isolates in more detail, the strategy of full-length E2 sequencing and subsequent phylogenetic analysis was applied (Figure 5). All full-length E2 encoding sequences were deposited at GenBank [GenBank: JQ411592-JQ411601]. Comparison of the full-length E2 encoding sequences revealed six and seven nucleotide exchanges between the 2009 sequence and two sequences of samples originating from the index case in 2011. Furthermore, the full-length E2 encoding sequences from four subsequent cases (cases 2-5) were determined for two samples each. The short 5´NTR and E2 fragment sequences displayed no differences between the isolates of the five cases in 2011. In contrast, analysis of the full-length E2 encoding sequences revealed at least three differences between the isolates of case 4 and the isolates from the four other cases in 2011. One of these differing nucleotides was also present in the sequence of the Lithuanian isolate from 2009.
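Counting nucleotide exchanges between two aligned sequences, as done for the Lithuanian isolates, amounts to a position-by-position comparison. The sketch below uses short made-up sequences, not the deposited E2 sequences, and the function name `differing_positions` is an assumption for illustration.

```python
def differing_positions(seq_a, seq_b):
    """Return 1-based positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    return [
        pos
        for pos, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()), start=1)
        if a != b
    ]


# Hypothetical toy sequences standing in for two aligned E2 sequences.
reference = "ATGCGTACCT"
sample = "ATGCATACCC"

diffs = differing_positions(reference, sample)
print(f"{len(diffs)} nucleotide exchanges at positions {diffs}")
```

With real data, the two inputs would be the aligned 1119 nt E2 coding sequences of the isolates being compared.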

Discussion
Different regions of the CSFV genome have been proposed for phylogenetic analysis, namely fragments of the 5´NTR as well as partial E2 and NS5B encoding regions [7,8,16,17,36]. During the past two decades, determination of 5´NTR and E2 fragment sequences became the world-wide accepted standard for characterization of CSFV isolates, although this strategy has several limitations which are mainly due to the short sequence lengths of these regions. Today, new technological developments like next-generation sequencing allow rapid determination of full-length sequences, but due to limited access and high expenses the application of such techniques will be restricted to a limited number of institutions and a small number of selected CSFV isolates in the near future. Against this background, rapid and reliable diagnostics in outbreak situations will still rely on analysis of adequate, shorter genomic regions on the basis of an internationally harmonized standard.
To establish an improved strategy for CSFV phylogeny, the 5´NTR-E2 sequences of 33 CSFV isolates from the virus collection held at the EU and OIE Reference Laboratory for CSF (EURL) were determined in this study and used for comparative sequence analyses. For all isolates, including representatives of the three major genotypes, specific amplicons could be generated by RT-PCR using conserved primers. These virus isolates include frequently requested reference strains, isolates of rare CSFV genotypes as well as isolates obtained from recent CSF outbreaks (e.g. in Slovakia, Hungary, Lithuania). It was not possible to include isolates of all known subgenotypes as some subgenotypes (e.g. 3.1, 3.2 and 3.3) are very difficult to obtain and are not represented in the virus collection of the EURL. For most of the sequenced isolates only the short 5´NTR (150 nt) and E2 fragment (190 nt) sequences were available beforehand. Therefore, the 5´NTR-E2 sequences (3508-3510 nt) reported in the present study add significant sequence information to this collection of CSFV isolates. The majority of CSF outbreaks, which occurred during the past decades in Europe, were caused by genotype 2 viruses. In consequence, mainly sequences of genotype 2 virus isolates were determined, comprising 19 isolates of subgenotype 2.3 and five isolates of subgenotypes 2.1 and 2.2 each.
Furthermore, 5´NTR-E2 sequences of the two distinct isolates "Congenital Tremor" (CSF0410, no assigned genotype) and "Kanagawa" (CSF0309, genotype 3.4), the reference strain "Brescia" (CSF0947, genotype 1.1) and one Malaysian isolate (CSF0306) of the rare genotype 1.3 were determined. With regard to the entire 5´NTR-E2 sequences determined in this study and 22 additional sequences obtained from GenBank, all CSFV isolates were assigned to established genotypes and subgenotypes (Figure 3). Our analyses revealed that CSFV "strain 39" [GenBank: AF407339], which has been previously described to be a natural recombinant strain of parental subgenotype 1.1 and 2.1 isolates [35], actually represents a chimera of subgenotype 1.1 and 2.2 isolates (Figure 3, Figure 4). Furthermore, it was recognized that strain The Netherlands/xxxx "Bergen" (CSF0906, subgenotype 2.2) partially displayed a higher genetic similarity to some genotype 2.1 isolates, e.g. to CSFV isolate CSF0021, than to different 2.2 isolates (data not shown). This observation might be a hint for a recombination event between subgenotype 2.1 and 2.2 isolates and is under further investigation. In consequence, strain The Netherlands/ xxxx "Bergen" (CSF0906) might disturb segregation of 2.1 and 2.2 isolates when further 2.1 and 2.2 isolates are added in phylogenetic analysis.
Variability and length of analyzed sequences are crucial parameters for the reliability of phylogenetic analyses. The overall variability observed for the different genomic regions is astonishingly uniform (Table 4). Exceptions are the more conserved fragment in the 5´NTR and the slightly more variable E2 fragment. In consequence, not variability but length of the sequence used seems to be crucial to optimize resolution and confidence levels of CSFV phylogeny. Low variability of 9% (14/150 nucleotide positions) in concert with the short sequence length of 150 nt explains the intrinsic limitation of the 5´NTR for phylogenetic analyses. Due to its variability, the 190 nt E2 fragment has the greatest intrinsic discriminatory ability among the above mentioned 5´NTR, E2, and NS5B fragments [7]. The E2 fragment encodes the N-terminal part of the E2 protein, which harbours several neutralizing epitopes resulting in selective pressure [22,37-39]. When comparing the variability of the sequences encoding the major immunogen E2 with the sequences of other viral proteins like Npro, E1 or C, which do not elicit a detectable immune response upon infection, it can be concluded that selection pressure mediated by specific immune reactions is not a major cause of E2 divergence, since the overall sequence divergence in other genomic regions reaches similar levels (Table 4). Nevertheless, it can be speculated that lack of antigenic selection pressure might be a reason for the failure of Npro- and E1-based analyses to discriminate subgenotype 1.1 and 1.2 isolates (data not shown). Genotype 1 represents an old and therefore highly variable CSFV genotype. Antigenic selection pressure might have been an important force for development of the 1.1 and 1.2 subgenotypes, while sequence divergence is less pronounced in genomic regions encoding less immunogenic proteins like Npro and E1.
In the present study, analysis of genetic variability in the regions encoding the individual viral proteins (overall 46% variable positions) did not identify regions of adequate length that are more variable than the 504 nt Npro encoding sequence and the 190 nt E2 fragment (50% variable positions). Taking into account the above mentioned limitations of the short 5´NTR fragment as well as the limitations of the nucleotide sequences encoding Npro and E1 for CSFV phylogeny, extension of the short sequence of the E2 fragment to full-length E2 gene sequences is an excellent strategy to obtain data for reliable and detailed phylogenetic analyses (Figure 4).
Calculation and analysis of genetic distances with respect to full-length E2 encoding sequences revealed that genetic distances of more than 15% define a genotype and distances of less than 14% can be found on subgenotype and isolate level (Figure 2). These values will probably not remain stable as the number of analyzed sequences increases. Furthermore, it was not possible to define universally valid breakpoints between isolate and subgenotype level. Discrimination of the isolate and subgenotype categories based on previously reported ranges for the NS5B fragment (4.5% and 10.5% genetic distance, respectively) is not supported by the analyses of the present study [8].
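Why no universal breakpoint exists between isolate and subgenotype level can be seen by checking a distance against the ranges reported for full-length E2 in this study (≤ 8.5%, 6.3%-14% and 15.6%-23%). The following Python sketch is illustrative only; the names `E2_RANGES` and `compatible_levels` are assumptions, and the ranges describe the present dataset rather than fixed classification rules.

```python
# Observed distance ranges (percent) for full-length E2 in this dataset.
E2_RANGES = {
    "isolate": (0.0, 8.5),       # within one subgenotype
    "subgenotype": (6.3, 14.0),  # between subgenotypes of one genotype
    "genotype": (15.6, 23.0),    # between genotypes
}


def compatible_levels(distance_percent, ranges=E2_RANGES):
    """List every level whose observed range contains the given distance."""
    return [
        level
        for level, (low, high) in ranges.items()
        if low <= distance_percent <= high
    ]


for d in (3.0, 7.0, 12.7, 18.0):
    print(f"{d:>5.1f}% -> {compatible_levels(d)}")
```

A distance of 7.0% falls into both the isolate and the subgenotype range, which is the overlap that prevents a distance-only assignment, whereas distances above 15.6% map unambiguously to the genotype level.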
For phylogenetic analysis, the use of a standardized method for tree calculation is desirable to achieve a better comparability of internationally published data. In the present study, genetic distances calculated by the Kimura 2-parameter method and phylogenetic trees generated by the Neighbor Joining method and subsequently rooted at the strain "Congenital Tremor" (CSF0410), which represents the isolate most distinct from all other CSFV isolates known so far, led to appropriate tree topologies and reliable confidence levels (Figure 3, Figure 4). The phylogenetic trees generated either with full-length E2 encoding sequences or with the 5´NTR-E2 sequences showed the same segregation of CSFV isolates into genotypes and subgenotypes. Compared to E2 full-length sequences, the sequences derived from the 5´NTR and E2 fragments which are currently used for phylogenetic analyses are considerably less suited for differentiation and tracing of CSFV isolates. In the case of the 5´NTR fragment, the sequence length and intrinsic variability are too low; in the case of the E2 fragment, the short sequence length significantly limits the information content and consequently diminishes confidence levels of many groupings. The data presented in Figure 3 and Table 5 demonstrate the limited ability of the 5´NTR based trees to differentiate between isolates within a certain subgenotype. In addition, analysis of the 5´NTR fragments fails to segregate isolates into defined subgenotypes, as observed for 1.1 and 1.2. This problem was also recognized earlier with other isolates of genotype 1 [7]. Segregation within genotype 1 can be improved by using the E2 fragment, but within a subgenotype, like 2.3, the ability to differentiate closely related isolates (e.g. Slovakian isolates) is still insufficient (Figure 4).
Moreover, the trees generated with the E2 fragment sequences display only very low confidence levels, which do not allow a further division of the established subgenotypes or a reliable epidemiological interpretation. The high similarity among European isolates, mainly belonging to genotype 2, makes the implementation of a strategy based on larger sequence sets an incontrovertible necessity. This is illustrated by the following examples of CSFV isolates that are not distinguishable on the basis of the short 5´NTR sequences (Table 5). With respect to the analyzed 5´NTR-E2 sequences, the two isolates CSF0277 (Germany, 1997) and CSF0283 (The Netherlands, 1997) differed in two sites, one of them located in the E2 encoding sequence. These isolates were obtained from a cross-border epidemic and have a direct epidemiological link [40]. Isolates CSF1027 and CSF1032 were obtained from wild boar during the 2007 epidemic in Slovakia and Hungary, respectively, and displayed two nucleotide differences in the E2 encoding sequences. Closely related virus isolates obtained from different German CSF outbreaks in the 1990s (CSF0083 and CSF0600; CSF0485 and CSF0638) were clearly distinguishable on the basis of full-length E2 encoding sequences (Figure 4, Table 5). Furthermore, isolates displaying a high degree of sequence similarity without an epidemiological link (e.g. isolates "LOM" and "Alfort187") also illustrate the discriminatory ability of the full-length E2 encoding sequences. These examples, as well as the recent experiences regarding the Lithuanian outbreaks in 2009 and 2011, clearly demonstrate that the information obtained by analysis of the full-length E2 encoding sequences allows discrimination even between very closely related virus isolates from the same epidemic and from (nearly) the same geographical origin (Figure 5).
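As an illustration of the distance measure underlying the trees discussed above, the following minimal Python sketch computes the Kimura 2-parameter distance between two aligned nucleotide sequences. It is not the software used in this study: the function name `k2p_distance` is hypothetical, and skipping gaps and ambiguous bases is an assumption made here for simplicity.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
VALID = PURINES | PYRIMIDINES


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned sequences.

    d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q)), where P and Q are the
    proportions of transitional and transversional differences.
    Assumption: sites with gaps or ambiguous bases are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")

    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in VALID or b not in VALID:
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine

    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError if the sequences are too divergent
    # for the model (argument <= 0).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

Under this model, two differing sites in a 1119 nt alignment correspond to a distance of roughly 0.18%, which gives a sense of how small the distances between closely related isolates are.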
Assuming a mutation rate of 3.3 × 10⁻³ to 3.7 × 10⁻³ substitutions/nucleotide/year in the E2 encoding sequence, as estimated for the E2 fragment sequence [7,15], approximately 0.6-0.7 nucleotide exchanges per year may be expected in the short E2 fragment (190 nt) and 3.7-4.1 exchanges per year in the complete E2 encoding sequence (1119 nt). Although analysis of full-length E2 encoding sequences results in a significant increase of information, the mutation rate is probably too low for exact determination of infection chains.
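The expected numbers of exchanges quoted above follow directly from multiplying the assumed rate by the sequence length. A minimal Python sketch of this arithmetic is given below; the function name `expected_substitutions` is hypothetical, and the calculation assumes that the rate estimated for the E2 fragment applies uniformly across the full-length E2 encoding sequence.

```python
def expected_substitutions(rate_per_site_per_year, length_nt, years=1.0):
    """Expected number of substitutions = rate x length x time.

    Assumption: a constant rate that is uniform across all sites.
    """
    return rate_per_site_per_year * length_nt * years


# Rate range taken from the text (substitutions/nucleotide/year).
RATE_LOW, RATE_HIGH = 3.3e-3, 3.7e-3

for name, length in (("E2 fragment", 190), ("full-length E2", 1119)):
    low = expected_substitutions(RATE_LOW, length)
    high = expected_substitutions(RATE_HIGH, length)
    print(f"{name} ({length} nt): {low:.1f}-{high:.1f} substitutions/year")
```

The sketch reproduces the ranges given in the text (0.6-0.7 for the 190 nt fragment and 3.7-4.1 for the 1119 nt sequence).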
To date, both fragments, 5´NTR and E2, are routinely amplified and sequenced for identification and characterization of novel CSFV isolates. The recent CSF outbreak in Lithuania demonstrated that determination of both sequences corresponding to the 5´NTR and E2 fragments was able neither to differentiate between isolates obtained during the outbreaks in 2009 and 2011 nor to detect differences between the isolates originating from different outbreak holdings in 2011 (Figure 5). In contrast, phylogenetic analysis of full-length E2 encoding sequences allowed the discrimination of the 2009 and 2011 Lithuanian isolates and identified significant differences between the isolates of case no. 4 and the isolates of the four other cases. These results suggest that the index case was the source of virus transmission for outbreaks no. 2, 3, and 5, while it can be speculated that the virus isolate from case no. 4 was introduced either after additional steps of (undetected) transmission or from another source. To allow a reliable interpretation of this finding, more full-length E2 encoding sequences from different CSF epidemics and the corresponding epidemiological information need to be analyzed. Against this background, molecular clock analyses of sequences obtained from well documented CSF epidemics would be highly desirable and will be the aim of future studies. Such analyses need to take into account that the speed of virus evolution is influenced by many factors, including host immunity, vaccination campaigns, presence of virus reservoirs, number of passages in hosts, and, last but not least, socio-economic determinants [41]. Nevertheless, even without detailed knowledge about the speed of molecular evolution in CSF epidemics, the analysis of full-length E2 encoding sequences provides valuable information about the origin of virus introduction, as this method increases the probability of identifying the ancestral virus isolate.
In the case of the two Lithuanian outbreaks in 2009 and 2011, identical isolates would have indicated an arrest of the molecular clock, as occurs when infectious material is kept frozen (frozen meat, frozen laboratory isolate, etc.). This scenario could be clearly excluded by analysis of the full-length E2 encoding sequences. Accordingly, the Lithuanian example illustrates the benefit of phylogenetic analysis of full-length E2 encoding sequences with regard to molecular virus tracing.
Taken together, the proposed strategy based on complete E2 coding sequences allows a clear assignment of CSFV isolates to a subgenotype, results in reliable and statistically significant bootstrap values, and even enables the discrimination of highly similar virus isolates without requiring more time or higher expenses.