A total of 75 accessions were chosen from the gene bank at
IBERS to make the ecotype panel. 70 of these accessions were from natural
population collections, representing 16 different countries from Europe, three counties
from Asia and one from the Middle East. A further five accessions were
commercially available varieties. The countries were allocated to their
geographic regions according to The World Factbook and Eurovoc (Table 1 and 2,
Figure 1). The five varieties (Table 3) were chosen for certain
characteristics. AberRuby is an old IBERS variety. It has a lax growth habit,
with few stems and now has little agronomic use. Broadway and Crossway are
early generation varieties developed at AgResearch in New Zealand from ecotype
collections from Portugal and Spain. The plants are low growing and may under
certain damp conditions produce runners that root (Rumball, et al., 2003). Milvus and Britta are tall
MattenKlee type varieties that were developed in Switzerland and Sweden,
respectively. Milvus is an early flowering variety, which has many high
yielding stems and was bred for dry hay production. It is also reported to show
resistance to Sclerotinia trifoliorum
(crown rot) (Boller, et al., 2012). Britta is a late flowering
variety (Lundin, et al., 1974) and shows a degree of
resistance to stem nematode (Taylor and Quesenberry, 1996). There were 640 plants in the
ecotype panel; eight plants from each ecotype accession and 16 from each of the
varieties. The clovers were planted at IBERS in 2015 in a randomized block
DNA extraction and GBS protocol
DNA was extracted from 100mg of fresh young leaf tissue
using a Qiagen DNeasy extraction kit in a 96 well format. The DNA concentration
was measured using a Qubit™2.0, and normalized to 10ngµl-1 with sterile TE
buffer. The DNA was prepared for sequencing following the published GBS
protocol (Elshire, et al., 2011), with modifications. ApeKI
was chosen for the restriction enzyme, and bar coded adapters were annealed to
each genotype. 16 of the annealed DNA samples were mixed at a time across the
plate, and cleaned with magnetic beads. The concentration of the 16 mixed
genotypes (6 per plate) was calculated by Qubit™2.0, and 40ng was used in a
PCR. The PCR product was cleaned by magnetic beads, and the concentration
calculated by Qubit™2.0; the PCR was repeated 3 more times. All of the PCR
products were then mixed at equal concentration (10ngµl-1) to
form the final library. The final library was analysed as 96-plex with 125bp
single end sequencing using an Illumina HiSeq2000 NGS platform.
TASSEL5v2 (Glaubitz, et al., 2014) and BWA (Li and Durbin, 2009) were used in conjunction with the red clover
genome (De Vega, et al., 2015) for a reference based GBS
pipeline to identify high quality SNP within the ecotype panel. In the TASSEL5v2
pipeline, the quality of the SNP calling was set with the following parameters:
20bp minimum and 64bp maximum length; a quality score of 20 was used which is
related to the quality of the base calling on the Illumina; and a minimum cut
off of 10 reads per SNP site; the minor allele frequency was set to 0.05 to
allow for heterozygote calling. For a SNP to pass into the final count, it had
to be present in least 80% of the genotypes. Finally, missingness was set to
0.1 per genotype; this ensured that genotypes with less than 10% of the SNP
were excluded from the dataset.
Data were analysed in R studio using a variety of packages (Ihaka and Gentleman, 1996, Team, 2014). Data re-coding and
imputation of missing markers was implemented in Synbreed (Wimmer, et al., 2012). The imputation was performed
by random sampling of the allele distributions, with the minor allele frequency
set to 0.05, and data missingness to 0.2.
Using all of the markers, the population was grouped into
respective clusters by using UPGMA in the program Cluster (Maechler, et al., 2017). This package was also used
to produce a dendrogram of phylogenetic relationships (termed relationship tree
in the text). A principal component analysis (PCA) was calculated within R, and
coloured according to the derived clusters from the above analysis. To further
identify delimitations in the population structure and to infer ancestry,
STRUCTURE v2.3 (Pritchard, et al., 2000) was used with an unbiased Bayesian approach Markov
chain Monte Carlo (MCMC) clustering of samples. The data were assessed for K
values ranging from 1 to 15 with burnin and MCMC iterations set to 20,000 each.
For each value of K, three replications were made. STRUCTURE harvester
web v0.6.94 (Earl and vonHoldt, 2012) was used to find the optimum
K value for the population using the ?K.
This assigned the ecotypes to their best fit genetic group. A total of
250 SNP were used per chromosome.
An analysis of molecular variance (AMOVA) was performed on
the data using Pegas (Paradis, 2010). The population was grouped into the varieties
and the ecotypes. AMOVA was also used to determine the index of genetic
which is a standardised inter-group genetic distance of two geographic regions.
It represents the correlation of genes of different individuals in a
population. Values of ?ST
index are defined as follows: differentiation may be little (?ST < 0.05), moderate (?ST 0.05 - 0.15), great (?ST 0.15 - 0.25) and very great (?ST > 0.25) (Hartl and Clark, 2007).
The population genetics parameters (FIT, FST,
FIS, and HW) for each SNP marker was implemented in Pegas (Paradis, 2010) using the allele frequencies. The overall
population genetic statistics were calculated in Hierfstat (Goudet, 2005).
The global pairwise FST was subsequently calculated in StAMPP (Pembleton, et al., 2013).
Haplotype phasing of the data
To reconstruct the haplotypes of the genotype data, the
program fastPHASE v1.4.8 (Scheet and Stephens, 2006) was used. fastPHASE
implements a Bayesian statistical method to reconstruct the haplotypes of a
given dataset. The setting used were 10 random start runs for the EM algorithm,
with 35 iterations per run (default setting), and the number of haplotype
clusters was determined by the program which showed the lowest overall error
rate (default setting). The haplotype clustering was performed by a Hidden Markov
Model (HMM), using a cross validation process (using a random 15% missing
data). The data were assessed on a per chromosome basis, so seven passes of the
fastPHASE program was used. The haplotype data were imported into excel, and
each cell coloured according to the SNP. This allowed visual examination of the
data for patterns of haplotype blocks.
To identify candidate loci potentially influenced by
selection, a three way comparison was made between loci identified in basic R (Team, 2014)
, BayeScan v2.0 (Foll and Gaggiotti, 2008) and Sam?ada
v0.5.3 (Anselin, 1995, Joost, et al., 2007, Stucki and Joost, 2015, Stucki, et
To test the effect of environment (i.e.
geographic origin) on genetic diversity, the geographical coordinates were used
for each sampling site. Sam?ada is a spatial correlation method,
which uses non-random associations between cause and effect. It uses Moran’s I
correlation (Moran, 1950)
which assesses the overall clustering of the data, and this is related to the
first law of geography “everything is related to everything else” (Tobler, 1970).
Parameters were set to spatial for the autocorrelation, and nearest neighbours
20 for the weighting. Significant loci were identified after Wald and G tests
following Bonferonni correction at a 99% confidence level.
The panel of samples were divided into the varieties and
ecotypes to assess patterns of linkage disequilibrium (LD). LD was calculated
using two methods: Synbreed, and a custom R script written for use with the Genetics
package (Grinberg, et al., 2016, Warnes, 2012)
using the method as described in Wang et al (Wang, et al., 2013).
LD was calculated as the squared allele frequency (r2) between each
pair of SNP loci. Alleles were only considered if the minor allele frequency
was above 0.05, as r2 has large variances if rare alleles are
considered (Wen, et al., 2009). LD decay was estimated by using estimated
second-degree locally weighted polynomial regression (LOESS) (Cleveland, 1979).For both the R script and Synbreed, all SNP loci
that passed quality control were implemented in the LD calculation.
Flowering date was recorded per plant from the initial start
point of 1st April until 21st June in 2016 (day 1 – 81)
and a further date recorded as day 100 (10th July 2016) as day of
harvest. Measurements for plant height, width, number of stems and two harvest
wet weights were also recorded. Any data that were not normally distributed were
log transformed to improve homogeneity and analysed in Genome Association and
Prediction Integrated Tool (GAPIT) for GWAS. The analysis was performed with
compressed mixed linear model (Zhang, et al., 2010)
implemented in the GAPIT R package (Lipka, et al., 2012).