Plant materialA total of 75 accessions were chosen from the gene bank atIBERS to make the ecotype panel. 70 of these accessions were from naturalpopulation collections, representing 16 different countries from Europe, three countiesfrom Asia and one from the Middle East.
A further five accessions werecommercially available varieties. The countries were allocated to theirgeographic regions according to The World Factbook and Eurovoc (Table 1 and 2,Figure 1). The five varieties (Table 3) were chosen for certaincharacteristics. AberRuby is an old IBERS variety. It has a lax growth habit,with few stems and now has little agronomic use. Broadway and Crossway areearly generation varieties developed at AgResearch in New Zealand from ecotypecollections from Portugal and Spain. The plants are low growing and may undercertain damp conditions produce runners that root (Rumball, et al., 2003).
Milvus and Britta are tallMattenKlee type varieties that were developed in Switzerland and Sweden,respectively. Milvus is an early flowering variety, which has many highyielding stems and was bred for dry hay production. It is also reported to showresistance to Sclerotinia trifoliorum(crown rot) (Boller, et al., 2012). Britta is a late floweringvariety (Lundin, et al., 1974) and shows a degree ofresistance to stem nematode (Taylor and Quesenberry, 1996). There were 640 plants in theecotype panel; eight plants from each ecotype accession and 16 from each of thevarieties. The clovers were planted at IBERS in 2015 in a randomized blockdesign.
DNA extraction and GBS protocolDNA was extracted from 100mg of fresh young leaf tissueusing a Qiagen DNeasy extraction kit in a 96 well format. The DNA concentrationwas measured using a Qubit™2.0, and normalized to 10ngµl-1 with sterile TEbuffer.
The DNA was prepared for sequencing following the published GBSprotocol (Elshire, et al., 2011), with modifications. ApeKIwas chosen for the restriction enzyme, and bar coded adapters were annealed toeach genotype. 16 of the annealed DNA samples were mixed at a time across theplate, and cleaned with magnetic beads. The concentration of the 16 mixedgenotypes (6 per plate) was calculated by Qubit™2.0, and 40ng was used in aPCR. The PCR product was cleaned by magnetic beads, and the concentrationcalculated by Qubit™2.
0; the PCR was repeated 3 more times. All of the PCRproducts were then mixed at equal concentration (10ngµl-1) toform the final library. The final library was analysed as 96-plex with 125bpsingle end sequencing using an Illumina HiSeq2000 NGS platform. SNP discoveryTASSEL5v2 (Glaubitz, et al., 2014) and BWA (Li and Durbin, 2009) were used in conjunction with the red clovergenome (De Vega, et al., 2015) for a reference based GBSpipeline to identify high quality SNP within the ecotype panel. In the TASSEL5v2pipeline, the quality of the SNP calling was set with the following parameters:20bp minimum and 64bp maximum length; a quality score of 20 was used which isrelated to the quality of the base calling on the Illumina; and a minimum cutoff of 10 reads per SNP site; the minor allele frequency was set to 0.
05 toallow for heterozygote calling. For a SNP to pass into the final count, it hadto be present in least 80% of the genotypes. Finally, missingness was set to0.
1 per genotype; this ensured that genotypes with less than 10% of the SNPwere excluded from the dataset. Data analysisData imputationData were analysed in R studio using a variety of packages (Ihaka and Gentleman, 1996, Team, 2014). Data re-coding andimputation of missing markers was implemented in Synbreed (Wimmer, et al.
, 2012). The imputation was performedby random sampling of the allele distributions, with the minor allele frequencyset to 0.05, and data missingness to 0.2.
Population structureUsing all of the markers, the population was grouped intorespective clusters by using UPGMA in the program Cluster (Maechler, et al., 2017). This package was also usedto produce a dendrogram of phylogenetic relationships (termed relationship treein the text). A principal component analysis (PCA) was calculated within R, andcoloured according to the derived clusters from the above analysis. To furtheridentify delimitations in the population structure and to infer ancestry,STRUCTURE v2.3 (Pritchard, et al., 2000) was used with an unbiased Bayesian approach Markovchain Monte Carlo (MCMC) clustering of samples. The data were assessed for Kvalues ranging from 1 to 15 with burnin and MCMC iterations set to 20,000 each.
For each value of K, three replications were made. STRUCTURE harvesterweb v0.6.94 (Earl and vonHoldt, 2012) was used to find the optimumK value for the population using the ?K.This assigned the ecotypes to their best fit genetic group. A total of250 SNP were used per chromosome.An analysis of molecular variance (AMOVA) was performed onthe data using Pegas (Paradis, 2010).
The population was grouped into the varietiesand the ecotypes. AMOVA was also used to determine the index of geneticdifferentiation (?ST),which is a standardised inter-group genetic distance of two geographic regions.It represents the correlation of genes of different individuals in apopulation. Values of ?STindex are defined as follows: differentiation may be little (?ST <0.05), moderate (?ST0.05 - 0.15), great (?ST0.
15 – 0.25) and very great (?ST> 0.25) (Hartl and Clark, 2007).Population statisticsThe population genetics parameters (FIT, FST,FIS, and HW) for each SNP marker was implemented in Pegas (Paradis, 2010) using the allele frequencies. The overallpopulation genetic statistics were calculated in Hierfstat (Goudet, 2005).The global pairwise FST was subsequently calculated in StAMPP (Pembleton, et al., 2013). Haplotype phasing of the dataTo reconstruct the haplotypes of the genotype data, theprogram fastPHASE v1.
4.8 (Scheet and Stephens, 2006) was used. fastPHASEimplements a Bayesian statistical method to reconstruct the haplotypes of agiven dataset. The setting used were 10 random start runs for the EM algorithm,with 35 iterations per run (default setting), and the number of haplotypeclusters was determined by the program which showed the lowest overall errorrate (default setting). The haplotype clustering was performed by a Hidden MarkovModel (HMM), using a cross validation process (using a random 15% missingdata). The data were assessed on a per chromosome basis, so seven passes of thefastPHASE program was used. The haplotype data were imported into excel, andeach cell coloured according to the SNP.
This allowed visual examination of thedata for patterns of haplotype blocks.Outlier detectionTo identify candidate loci potentially influenced byselection, a three way comparison was made between loci identified in basic R (Team, 2014), BayeScan v2.0 (Foll and Gaggiotti, 2008) and Sam?adav0.5.3 (Anselin, 1995, Joost, et al., 2007, Stucki and Joost, 2015, Stucki, etal., 2016).To test the effect of environment (i.
e.geographic origin) on genetic diversity, the geographical coordinates were usedfor each sampling site. Sam?ada is a spatial correlation method,which uses non-random associations between cause and effect. It uses Moran’s Icorrelation (Moran, 1950)which assesses the overall clustering of the data, and this is related to thefirst law of geography “everything is related to everything else” (Tobler, 1970).
Parameters were set to spatial for the autocorrelation, and nearest neighbours20 for the weighting. Significant loci were identified after Wald and G testsfollowing Bonferonni correction at a 99% confidence level. Linkage disequilibrium The panel of samples were divided into the varieties andecotypes to assess patterns of linkage disequilibrium (LD). LD was calculatedusing two methods: Synbreed, and a custom R script written for use with the Geneticspackage (Grinberg, et al.
, 2016, Warnes, 2012)using the method as described in Wang et al (Wang, et al., 2013).LD was calculated as the squared allele frequency (r2) between eachpair of SNP loci. Alleles were only considered if the minor allele frequencywas above 0.05, as r2 has large variances if rare alleles areconsidered (Wen, et al., 2009).
LD decay was estimated by using estimatedsecond-degree locally weighted polynomial regression (LOESS) (Cleveland, 1979).For both the R script and Synbreed, all SNP locithat passed quality control were implemented in the LD calculation.Phenotype measurementsFlowering date was recorded per plant from the initial startpoint of 1st April until 21st June in 2016 (day 1 – 81)and a further date recorded as day 100 (10th July 2016) as day ofharvest. Measurements for plant height, width, number of stems and two harvestwet weights were also recorded. Any data that were not normally distributed werelog transformed to improve homogeneity and analysed in Genome Association andPrediction Integrated Tool (GAPIT) for GWAS.
The analysis was performed withcompressed mixed linear model (Zhang, et al., 2010)implemented in the GAPIT R package (Lipka, et al., 2012).