Springboard-Capston-project 1

Incorporating molecular data-driven decisions towards soybean improvement

1. The problem I want to solve

Many agriculturally important traits such as yield, quality and some forms of disease resistance are controlled by many genes and are known as quantitative traits. It’s labor intensive and time consuming to evaluate those traits. The strategy of genomic selection (GS) estimates the genetic merit of an individual based on molecular genetic information by simultaneously accounting for all DNA markers. GS is a form of marker-assisted selection that selects progeny lines or parents based on the estimated genomic estimated breeding values (GEBVs), which leads to shorter breeding cycle duration as it is no longer necessary to wait for late filial generations’ performance trial. The problem I want solve herein is to build a genomic selection model for prediction of soybean complex traits (yield, protein and oil content).

2. Potential client and why do they care about this problem

The exponential growth of the world’s population correlates to a stark rise in the demand for food production and an unprecedented opportunity for agribusiness companies. To meet this challenge, agribusiness companies are innovating for accelerating the release of superiority crop varieties. Their efforts take the form of artificial intelligence (AI) and remote sensors in the field, drones for crop monitoring and genomic selection for complex traits. So potential clients are agriculture companies like Monsanto, Sygenta and Bayer.

3. Data source information

GS usually needs a large amount of molecular markers covering the entire genome in term of all relevant genomic regions are represented by molecular markers and phenotypic data. The complete data set I will use is for 20,087 G. max and G. soja accessions genotyped with 42,509 SNPs is available for Wm82.a1 and Wm82.a2 in HapMap format. It’s public available through the following link: https://soybase.org/snps/index.php. The phenotypic data is from a published paper https://www.g3journal.org/content/6/8/2329. Moreover, I will incorporate our own genotype and phenotype data (~1000 accessions)with the above big datasets.

4. Brief outline I’ll follow to solve this problem

Since the prediction model is calibrated with the genotypes and phenotypes of reference individuals constituting the training set, the following outline lists the major steps I will follow. a) Collection of both phenotypic and genotypic data for each line in the training population (TP). b) Integrating the genotypic and phenotypic data, conduct statistical analysis on TP to estimate allele effects at all loci on the phenotype. c) Generation of a prediction model for the GEBVs that combines all the marker genotypes with their effects on the predictive value of each line. d) Application of the prediction model on breeding population for which genotypes (but not phenotypes) are available. GEBVs are estimated and the best lines are selected for breeding. The essential steps of GS are also shown as following:

The allelic state of individuals can be coded as a matrix of 0, 1, or 2 to a diploid genotype value of AA, AB, or BB, respectively. Since the number of marker served as predictors (p) is usually far greater than the number of individual lines (n), it raise issues in multi-collinearity and overfitting among predictors. Basically, there are two regressions methods can be used, parametric and nonparametric regressions, including 11 models have been developed to address the “large p small n” issue. They are RR-BLUP (Meuwissen et al., 2001), least absolute shrinkage and selection operator (LASSO, Li et al., 2012) , elastic net (EN, Zou et al., 2005 ), Bayesian ridge regression (BRR, de los Campos et al., 2013), Bayesian version of LASSO (BL, Li et al., 2012) Bayes A (de los Campos et al., 2013), Bayes B (Heffner et al., 2011), Bayes C (de los Campos et al., 2013, 2016) and BayesCπ (Habier et al., 2011), reproducing kernels Hilbert spaces regression (RKHS, Gianola et al., 2008) and random forest (RF, Holliday et al., 2012). Moreover, improved GS approaches including non-linear algorithm, such as artificial neural networks (ANN) (i.e. deep learning) and gradient tree boosting will also be tested in this project.

Name		Name	Last commit message	Last commit date
Latest commit History 65 Commits
GWAS via GAPIT		GWAS via GAPIT
Capstone Project 1_ Data Wrangling_Z. Wen (updated).pdf		Capstone Project 1_ Data Wrangling_Z. Wen (updated).pdf
Capstone Project 1_ Final Report_Zixiang Wen.pdf		Capstone Project 1_ Final Report_Zixiang Wen.pdf
Capstone Project 1_ Milestone Report_Zixiang Wen.pdf		Capstone Project 1_ Milestone Report_Zixiang Wen.pdf
Capstone Project 1_Exploratory Data_Wen_10_30_2019.ipynb		Capstone Project 1_Exploratory Data_Wen_10_30_2019.ipynb
DL_Capstone project 1_2019_11_19.ipynb		DL_Capstone project 1_2019_11_19.ipynb
Data Story_for_CS_project 1_data_9_15_2019.ipynb		Data Story_for_CS_project 1_data_9_15_2019.ipynb
Data cleaning_unify_for_CS_project 1_data_9_25.ipynb		Data cleaning_unify_for_CS_project 1_data_9_25.ipynb
ML_Capstone project 1_2019_12_2.ipynb		ML_Capstone project 1_2019_12_2.ipynb
Project 1_7-25-2019(updated).pdf		Project 1_7-25-2019(updated).pdf
README.md		README.md
SB capstone project 1 report.pdf		SB capstone project 1 report.pdf
gs.png		gs.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Springboard-Capston-project 1

Incorporating molecular data-driven decisions towards soybean improvement

1. The problem I want to solve

2. Potential client and why do they care about this problem

3. Data source information

4. Brief outline I’ll follow to solve this problem

About

Releases

Packages

Languages

wzxsoy/Springboard-Capston-project-1

Folders and files

Latest commit

History

Repository files navigation

Springboard-Capston-project 1

Incorporating molecular data-driven decisions towards soybean improvement

1. The problem I want to solve

2. Potential client and why do they care about this problem

3. Data source information

4. Brief outline I’ll follow to solve this problem

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages