Presentation file for ipPCA workshop

Transcription

Slide 1: Workshop R.ipPCA
BSI STAFFS

Slide 2: ipPCA software
iterative pruning Principal Component Analysis: a tool for population clustering
•  input: genotype dataset
•  output: number of possible subpopulations and their members
ipPCA version 1 and 2 (Matlab):
Download: www4a.biotec.or.th/GI/tools/ippca
R.ipPCA version 0.1 (R language):
•  Graph visualization
•  Generate input file for ADMIXTURE and PLINK

Slide 3: ipPCA to assist population genetics studies
[Figure: workflow diagram. Populations' genotype data are clustered by ipPCA (scatter plots with legend pop1, pop2, pop3, pop4, popout) into 4 subpopulations and their members, which then feed ADMIXTURE and PLINK association.]

Slide 4: Population genetics analysis
[Figure: populations' genotype data are passed to ipPCA, which returns 4 subpopulations and their members (scatter plots with legend pop1, pop2, pop3, pop4, popout).]
Let's have an exercise with R.ipPCA

Slide 5: Install R.ipPCA package
1. Install required libraries: scatterplot3d, Matrix, expm, e1071, stats
[Screenshot: Package menu, step 1]

Slide 6: How to install R.ipPCA package
2. Download R.ipPCA package
http://www4a.biotec.or.th/GI/IRRDB_Workshop

Slide 7: How to install R.ipPCA package
2. Install R.ipPCA_package.tgz
[Screenshot with numbered steps 1 to 4]
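
The library requirement on the installation slide can be checked from the R console before installing R.ipPCA. The following is only a sketch: the package names come from the slide, while the variable names are illustrative and the install call is left commented out because it needs network access.

```r
# Packages listed on the installation slide ("stats" ships with base R).
required <- c("scatterplot3d", "Matrix", "expm", "e1071", "stats")

# Find which of them are not yet installed in the current R library paths.
installed <- rownames(installed.packages())
missing_pkgs <- setdiff(required, installed)

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install from CRAN (needs network access):
  # install.packages(missing_pkgs)
} else {
  message("All required libraries are installed.")
}
```

Running the check first avoids reinstalling libraries that are already present.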

Slide 8: How to install R.ipPCA package (practice)
3. Type library(R.ipPCA) in the R console panel, or click the package menu and tick ☑ the R.ipPCA package

Slide 9: Input: Genotype data
•  SNP data
•  Microsatellite data

Slides 10-11: ipPCA Practice
Example 1: simulated SNP data
Example 2: Human SNP data (HapMap)
Example 3: Plant microsatellite data

Slide 12: SNP convert to 0,1,2 format
Raw genotypes:
    snp   snp1 snp2 snp3 snp4 snp5 snp6 snp7 snp8 snp9 snp10
    POP1  AA   AA   AA   AA   AA   AA   AA   TT   AA   AA
    POP1  AT   AA   AA   AA   AA   AA   AA   TT   AA   AA
    POP2  AA   AA   AA   AA   AC   GG   AA   AA   AA   CC
    POP2  AA   AA   AA   AA   AA   GG   AA   AA   AA   --
    POP3  TT   AA   AA   CC   CC   AA   AA   AA   AA   CC
    POP3  AA   AA   AA   CC   CC   AG   AA   AA   AA   CC
    POP4  AA   GG   AA   AA   AA   AA   AA   AA   GG   CC
    POP4  AA   GG   AA   AA   AA   AA   AA   AA   GG   CC
Encoding:
    Genotype  Encoding
    AA        0
    AB        1
    BB        2
    Missing   -1
Encoded genotypes:
    snp   snp1 snp2 snp3 snp4 snp5 snp6 snp7 snp8 snp9 snp10
    POP1  0    0    0    0    0    0    0    2    0    2
    POP1  1    0    0    0    0    0    0    2    0    2
    POP2  0    0    0    0    1    2    0    0    0    0
    POP2  0    0    0    0    0    2    0    0    0    -1
    POP3  2    0    0    2    2    0    0    0    0    0
    POP3  0    0    0    2    2    1    0    0    0    0
    POP4  0    2    0    0    0    0    0    0    2    0
    POP4  0    2    0    0    0    0    0    0    2    0

Slide 13: R.ipPCA usage (practice)
Example 1: Input file: simulated_data.csv
•  Number of SNPs = 10,000
•  Number of samples = 125 samples from 5 populations
    chromosome,snp,position,allele1,allele2,POP1,POP1,POP2,POP2,…
    1,snpid1,1,A,B,0,1,2,1,…
    1,snpid2,2,A,B,0,1,1,1,…
    1,snpid3,3,A,B,0,1,2,0,…
    ...
Genotype encoding: AA = 0, AB = 1, BB = 2, missing data = -1

Slide 14: How to run R.ipPCA
Three parameters to use R.ipPCA:
    input.file = "simulated_data.csv"
    output.dir = "."   # ipPCA_output folder is created
    res.dir = ipPCA(input.file, output.dir, threshold=0.15)
    Parameters  Default value  Descriptions
    input file  -              An input file, e.g. /home/user/input.txt
    result.dir  -              An output directory, e.g. /home/user/output/
    snp.by.row  TRUE           TRUE if SNPs are listed row by row, otherwise FALSE
    threshold   0.15           A value to stop the execution. A higher value stops the process sooner; a lower value makes it run longer.

Slide 15: How to run R.ipPCA (practice)
Three parameters to use R.ipPCA:
    input.file = "simulated_data.csv"
    output.dir = "."   # ipPCA_output folder is created
    res.dir = ipPCA(input.file, output.dir, threshold=0.15)
The result files were saved at: ./ipPCA_output
    > res.dir
    [1] "./ipPCA_output"
If you run the command again, the results are written to a new folder, ipPCA_output1.

Slide 16: How to use R.ipPCA
Before running R.ipPCA:
1. Set your working directory
[Screenshot with numbered steps 1 and 2]
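
The 0,1,2 encoding on the SNP conversion slide can be reproduced with a few lines of R. This is a sketch rather than the R.ipPCA implementation: the function name encode_snp is hypothetical, and it assumes that the "B" allele is the less frequent allele at each SNP, which is inferred from the example table and not stated on the slide.

```r
# Count copies of the minor (less frequent) allele for one SNP.
# genotypes: character vector such as c("AA", "AT", "--"); "--" marks missing.
# Assumption: allele "B" of the slide is the minor allele (ties: first by name).
encode_snp <- function(genotypes, missing = "--") {
  is_missing <- genotypes == missing
  # Split the non-missing two-letter genotypes into single alleles.
  alleles <- unlist(strsplit(genotypes[!is_missing], ""))
  counts <- table(alleles)
  # Monomorphic SNP: no second allele, so every observed genotype is 0.
  if (length(counts) < 2) {
    return(ifelse(is_missing, -1L, 0L))
  }
  minor <- names(counts)[which.min(counts)]
  # Number of minor-allele letters in each genotype (0, 1 or 2).
  n_minor <- vapply(strsplit(genotypes, ""),
                    function(g) sum(g == minor), integer(1))
  ifelse(is_missing, -1L, n_minor)
}

# Columns snp1 and snp10 of the raw genotype table on the slide.
snp1  <- c("AA", "AT", "AA", "AA", "TT", "AA", "AA", "AA")
snp10 <- c("AA", "AA", "CC", "--", "CC", "CC", "CC", "CC")
print(encode_snp(snp1))   # 0 1 0 0 2 0 0 0
print(encode_snp(snp10))  # 2 2 0 -1 0 0 0 0
```

Both results match the corresponding columns of the encoded table, including the -1 for the missing genotype.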

Slide 17: How to use R.ipPCA (practice)
2. Run R.ipPCA command:
    > input.file = "simulated_data.csv"
    > output.dir = "."
    > res.dir = ipPCA(input.file, output.dir, threshold=0.15)

Slide 18: Output files for ipPCA
The R.ipPCA result directory (6 items):
    Name                         Type       Description
    ipPCA_result.html            file       The main output file.
    ipPCA_scatter_by_ipPCA.html  file       Scatter plots; the colors of each group follow the ipPCA clustering result.
    ipPCA_scatter_by_label.html  file       Scatter plots; the colors of each group follow the predefined groups.
    ipPCA_eigenvalue_plots.html  file       Eigenvalue plots.
    images                       directory  All plots are saved in this directory in PDF format.
    RData                        directory  The R objects are saved in this directory.
Suggestion: open the html files with the Google Chrome or Safari browser.

Slide 19: ipPCA_result.html
[Screenshot]

Slide 20: ipPCA_scatter_by_ipPCA.html
[Screenshot]

Slide 21: ipPCA_scatter_by_label.html
[Screenshot]

Slide 22: ipPCA_eigenvalue_plots.html
[Screenshot]

Slide 23: Other function in R.ipPCA package

Slide 24: Create ordinary PED file
save.ped: export the result to PLINK file format
Command:
    > save.ped(res.dir)
    Parameters  Default value  Descriptions
    res.dir     -              An output directory of ipPCA, e.g. /home/user/output/

Slide 25: Create ordinary PED file
Command:
    > save.ped(res.dir)
Generates 2 files: output.ped, output.map
Useful for: ADMIXTURE

Slide 26: Create .ped file for PLINK
Command:
    > save.ped(res.dir, group1=c(10))
Generates 3 files: output.ped, output.map, output_pheno.txt
If you specify the wrong nodes for case and control, a warning is shown:
    Warning
    Converting ...
    Please set the parameter 'group1' correctly, for example:
    group1=c()
    group1=c(1,3,4)
    group1=c(1,"R4","L7")
    NULL
Useful for: PLINK association analysis

Slide 27: ipPCA Practice
Example 1: simulated data
Example 2: Real data (HapMap genotype)
Example 3: Microsatellite data of plant

Slide 28: Test other input files (practice)
Example 2: HapMap genotype
•  Number of SNPs = 54,701
•  Number of samples = 209 samples from 4 populations
•  CEU: Utah residents with ancestry from northern and western Europe
•  CHB: Han Chinese in Beijing, China
•  JPT: Japanese in Tokyo, Japan
•  YRI: Yoruba in Ibadan, Nigeria
•  Contains chromosomes 1 to 22

Slide 29: Example 2: HapMap genotype
Run R.ipPCA:
    > input.file = "HapMap_Gty_ipPCA.csv"
    > output.dir = "."
    > res.dir = ipPCA(input.file, output.dir, threshold=0.15)

Slide 30: ipPCA_result.html (HapMap data)
[Figure: result at Threshold = 0.15]

Slide 31:
[Figure: results compared at Threshold = 1.21 and Threshold = 0.15]

Slide 32: ipPCA Practice
Example 1: simulated data
Example 2: Real data (HapMap genotype)
Example 3: Plant microsatellite data
3.1: Simulated microsatellite data
3.2: Boechera (rockcress) microsatellite data

Slide 33: Example 3_1: Simulated microsatellite data
Input file: testdata1_ipPCA_012Encoding.txt
•  Simulated microsatellite data with 200 diploid individuals from 2 populations

Slide 34: Convert to 0,1,2 format
Raw microsatellite data (two rows per locus, one allele per row):
    sample  1    2    3    4    5    6    7    8    9    10
    pop     I    I    V    V    G    G    CK   CK   R    R
    B07     185  185  188  188  185  185  197  197  182  182
    B07     185  185  188  188  185  185  197  197  182  182
    b6      337  337  333  349  337  337  337  337  329  329
    b6      337  351  333  349  337  337  337  337  329  329
    B11     180  180  178  178  174  174  180  180  172  172
    B11     180  180  178  178  174  174  180  180  172  172
    C02     170  170  172  172  172  172  172  172  170  0
    C02     170  170  172  172  172  172  172  172  170  0
    C03     173  173  174  174  174  174  174  174  173  173
    C03     173  173  174  174  174  174  174  174  173  173
Count the occurrences of each microsatellite allele per sample:
    sample   1    2    3    4    5    6    7    8    9    10
    pop      I    I    V    V    G    G    CK   CK   R    R
    B07 182  0    0    0    0    0    0    0    0    2    2
    B07 185  2    2    0    0    2    2    0    0    0    0
    B07 188  0    0    2    2    0    0    0    0    0    0
    B07 197  0    0    0    0    0    0    2    2    0    0
    b6 329   0    0    0    0    0    0    0    0    2    2
    b6 333   0    0    2    0    0    0    0    0    0    0
    b6 349   0    0    0    2    0    0    0    0    0    0
    b6 337   2    1    0    0    2    2    2    2    0    0
    b6 351   0    1    0    0    0    0    0    0    0    0
    n        …    …    …    …    …    …    …    …    …    …

Slide 35: Example 3_1: Simulated microsatellite data
Run R.ipPCA:
    > input.file = "testdata1_ipPCA_012Encoding.txt"
    > output.dir = "."
    > res.dir = ipPCA(input.file, output.dir, threshold=0.15, snp.by.row=FALSE)

Slide 36: ipPCA_result.html (Simulated microsatellite data)
[Figure: result at Threshold = 0.15]

Slide 37: ipPCA Practice
Example 1: simulated data
Example 2: Real data (HapMap genotype)
Example 3: Microsatellite data of plant
3.1: Simulated microsatellite data
3.2: Boechera (rockcress) microsatellite data

Slide 38: 3.2: Boechera (rockcress) microsatellite data
Data from the paper: Population Genetics of Braun's Rockcress (Boechera perstellata, Brassicaceae), an Endangered Plant with a Disjunct Distribution. Baskauf et al., 2013, Journal of Heredity
•  Microsatellite markers ("loci")
•  205 samples from 4 Tennessee (I, V, G, CK) and 3 Kentucky (R, C, H) populations

Slide 39: 3.2: Boechera (rockcress) microsatellite data
Run R.ipPCA:
    > input.file = "Boechera_microsatellite_Encoded012.csv"
    > output.dir = "."
    > res.dir = ipPCA(input.file, output.dir, snp.by.row=FALSE, threshold=0.15)

Slide 40: 3.2: Boechera (rockcress) microsatellite data
[Figure: ipPCA result]

Slide 41: Population genetics analysis
[Figure: workflow diagram. Population genotype data are clustered by ipPCA (scatter plots with legend pop1, pop2, pop3, pop4, popout) and passed to ADMIXTURE to show the pattern profile of the data.]
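
The microsatellite conversion shown for Example 3_1 (one row per allele size, counting how many copies each sample carries) can be sketched in R as below. The function name encode_locus is hypothetical and this is not the converter used to prepare the workshop files; it also does not treat the value 0 (seen in locus C02) as missing, since the slides do not say how missing alleles are coded.

```r
# Count how many copies of each allele size every sample carries at one locus.
# allele1, allele2: the two allele sizes per diploid sample (same sample order).
encode_locus <- function(allele1, allele2) {
  sizes <- sort(unique(c(allele1, allele2)))
  # outer() gives an (allele size x sample) logical matrix per allele row;
  # adding the two matrices yields copy counts of 0, 1 or 2.
  counts <- outer(sizes, allele1, "==") + outer(sizes, allele2, "==")
  rownames(counts) <- sizes
  counts
}

# Locus b6 from the slide: two allele rows for samples 1 to 10.
b6_a1 <- c(337, 337, 333, 349, 337, 337, 337, 337, 329, 329)
b6_a2 <- c(337, 351, 333, 349, 337, 337, 337, 337, 329, 329)
b6 <- encode_locus(b6_a1, b6_a2)
print(b6["337", ])  # 2 1 0 0 2 2 2 2 0 0
print(b6["351", ])  # 0 1 0 0 0 0 0 0 0 0
```

Rows come out sorted by allele size, so their order differs from the slide, but the counts per allele are the same and every column sums to 2 for a diploid sample.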

Slide 42: ADMIXTURE Practice
Example 1: simulated data (SNP genotyped)

Slide 43: Format input file of ADMIXTURE
Example 1: simulated_data.csv
Command:
    > save.ped(res.dir)
Generates 2 files: output.ped, output.map
Go to ~/Desktop/Workshop_IRRDB/ipPCA_Workshop/Simulated_SNPs/ipPCA_output

Slide 44: Convert input file for ADMIXTURE
Create 12 code file:
    ./plink --file output --recode12 --noweb --out mydata
Output: mydata.ped, mydata.map   # "12" codes file
Or create binary (.bed) file:
    ./plink --make-bed --file output --noweb --out mydata
Output: mydata.bed, mydata.bim (SNP information), mydata.fam (sample information)

Slide 45: RUN ADMIXTURE (practice)
K = 2 to 4:
    for i in {2..4}; do ./admixture mydata.bed $i; done;
Output files:
    mydata.2.P, mydata.2.Q
    mydata.3.P, mydata.3.Q
    …
where:
    Q = the ancestry fractions
    P = the allele frequencies of the inferred ancestral populations

Slide 46: Plot ADMIXTURE
Loop R command:
    > for (i in 2:4){
        admix = read.table(paste0("./mydata.",i,".Q"))
        pdf(paste0("./plot_admixture_k",i,".pdf"))
        barplot(t(as.matrix(admix)), col=rainbow(i), xlab="Individual #", ylab="Ancestry", border=NA, space=0)
        dev.off()
      }

Slide 47: Result ADMIXTURE
Example 1: simulated_data.csv
[Figure: ADMIXTURE bar plot, K = 5]

Slide 48: ADMIXTURE plot for all K in simulated data
[Figure: ADMIXTURE bar plots for K = 2, K = 3, K = 4, K = 5, K = 6, K = 7]

Slide 49: Population genetics analysis
[Figure: workflow diagram. Population SNP genotype data are clustered by ipPCA (scatter plots with legend pop1, pop2, pop3, pop4, popout); the result feeds ADMIXTURE and, as case + control groups, PLINK association.]
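
As a follow-up to the ADMIXTURE practice, the Q file (ancestry fractions, one row per individual and one column per ancestral population) can be turned into a hard cluster label by taking the largest fraction in each row. This is a sketch under that assumption; assign_clusters is a hypothetical helper and the toy matrix is made up, not workshop output.

```r
# Assign each individual to the ancestral population with the largest fraction.
# q: matrix or data frame with one row per individual and K columns (a .Q file).
assign_clusters <- function(q) {
  q <- as.matrix(q)
  apply(q, 1, which.max)
}

# Toy Q matrix for K = 3 (made-up numbers; each row sums to 1).
q_toy <- rbind(c(0.90, 0.05, 0.05),
               c(0.10, 0.85, 0.05),
               c(0.20, 0.30, 0.50),
               c(0.98, 0.01, 0.01))
clusters <- assign_clusters(q_toy)
print(clusters)         # 1 2 3 1
print(table(clusters))  # number of individuals per inferred cluster
```

With real output, q would come from read.table("mydata.3.Q") as in the plotting loop, and the resulting labels can be compared with the subpopulations reported by ipPCA.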

Slide 50: Basic Statistical Methods
Allele counting to test for association between SNP genotype and case/control status
Observed genotype counts:
           Cases  Controls  Total
    GG     r0     s0        n0
    GT     r1     s1        n1
    TT     r2     s2        n2
    Total  R      S         N
Observed allele counts:
           Cases    Controls  Total
    G      2r0+r1   2s0+s1    2n0+n1
    T      r1+2r2   s1+2s2    n1+2n2
    Total  2R       2S        2N
Expected allele counts:
           Cases             Controls
    G      2R(2n0+n1)/(2N)   2S(2n0+n1)/(2N)
    T      2R(n1+2n2)/(2N)   2S(n1+2n2)/(2N)
Chi-square test for independence of rows and columns:
    $\sum \frac{(\mathrm{Obs} - \mathrm{Exp})^2}{\mathrm{Exp}} \sim \chi^2$ with 1 df

Slide 51: Basic Statistical Methods
The odds ratio: a measure of effect size
Odds of an event occurring = Pr(event occurs) / Pr(event doesn't occur) = Pr(event occurs) / [1 - Pr(event occurs)]
Allele counts:
           Cases  Controls
    G      a      b
    T      c      d
Odds ratio = (odds that the G allele occurs in a case) / (odds that the G allele occurs in a control):
    $\mathrm{OR} = \frac{a/c}{b/d} = \frac{ad}{bc}$
OR = increase in odds of being a case for each additional G allele
*OR = 1: no association between genotype and disease
*OR > 1: G allele increases risk of disease
*OR < 1: T allele increases risk of disease

Slide 52: PLINK association
PLINK: a whole genome association toolset
http://pngu.mgh.harvard.edu/~purcell/plink/
The focus of PLINK is purely on analysis of genotype/phenotype data.
The basic association test is based on comparing allele frequencies between cases and controls.

Slide 53: PLINK association analysis
Input file: Thalassemia data set
•  618 samples
•  41,789 SNP markers
Command on Linux shell:
    ./plink --file input --assoc
(Default)

Slide 54: PLINK association analysis
Transposed genotype files: --tfile
normal.ped:
    1 1 0 0 1 1 A A G T
    2 1 0 0 1 1 A C T G
    3 1 0 0 1 1 C C G G
    4 1 0 0 1 2 A C T T
    5 1 0 0 1 2 C C G T
    6 1 0 0 1 2 C C T T
normal.map:
    1 snp1 0 5000650
    1 snp2 0 5000830
TPED/TFAM files:
trans.tped:
    1 snp1 0 5000650 A A A C C C A C C C C C
    1 snp2 0 5000830 G T G T G G T T G T T T
trans.tfam:
    1 1 0 0 1 1
    2 1 0 0 1 1
    3 1 0 0 1 1
    4 1 0 0 1 2
    5 1 0 0 1 2
    6 1 0 0 1 2
http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#tr

Slide 55: PLINK association analysis
To specify an alternate phenotype for analysis: --pheno pheno.txt
where pheno.txt is a file that contains 3 columns (one row per individual):
    Column 1: Family ID
    Column 2: Individual ID
    Column 3: Phenotype (1 = control, 2 = case)
Thalassemia_Phenotype.txt:
    FAM1 ID1 1
    FAM2 ID2 1
    …
    FAM617 ID617 2
    FAM618 ID618 2

Slide 56: PLINK association analysis
How to run PLINK in Linux shell:
    ./plink --file Thalassemia_Gty --allow-no-sex --noweb --pheno Thalassemia_phenotype.txt --assoc --out Thalassemia_Gty

Slide 57: Output of Thalassemia_Gty.assoc
File: Thalassemia_Gty.assoc
    Column  Definition
    CHR     Chromosome
    SNP     SNP id
    BP      Physical position (base-pair)
    A1      Minor allele name (based on whole sample)
    F_A     Frequency of this allele in cases
    F_U     Frequency of this allele in controls
    A2      Major allele name
    CHISQ   Basic allelic test chi-square
    P       Asymptotic p-value for this test
    OR      Estimated odds ratio (for A1)

Slide 58: Output of Thalassemia_Gty.assoc
Sort CHISQ values by using the Linux command:
    sort --key=8 -nr Thalassemia_Gty.assoc

Slide 59: Visualizing association result
Manhattan plot:
•  A type of scatter plot, used to display data with a large number of data points
Command in R shell:
    > results = read.table("Thalassemia_Gty.assoc", header=TRUE)
    > manhattan(results)

Slide 60: Example of Manhattan plot
[Figure: Manhattan plot of Thalassemia_Gty.assoc with SNPs rs9376092 and rs4895441 labeled]

Slide 61: Acknowledgements
Leader: Dr. Sissades Tongsima
R-ipPCA developer: Kridsadakorn Chaichoompu
Biostatistics and Informatics Laboratory (BSI lab) members
www.facebook.com/BSI.TH
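
The allele-counting test and odds ratio from the Basic Statistical Methods slides can be worked through in R. The sketch below is not PLINK's code: allelic_test is a hypothetical helper that follows the observed and expected tables on the slides, and the genotype counts are made-up toy numbers.

```r
# Allelic association test from genotype counts.
# cases, controls: counts of GG, GT, TT genotypes (r0, r1, r2 and s0, s1, s2).
allelic_test <- function(cases, controls) {
  # Observed allele counts: rows G and T, columns cases and controls.
  obs <- rbind(G = c(2 * cases[1] + cases[2], 2 * controls[1] + controls[2]),
               T = c(cases[2] + 2 * cases[3], controls[2] + 2 * controls[3]))
  colnames(obs) <- c("cases", "controls")
  # Expected counts under independence: row total * column total / grand total.
  expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
  chisq <- sum((obs - expected)^2 / expected)
  p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)
  # Odds ratio for the G allele: (a/c) / (b/d) = ad / (bc).
  odds_ratio <- (obs["G", "cases"] * obs["T", "controls"]) /
    (obs["G", "controls"] * obs["T", "cases"])
  list(observed = obs, chisq = chisq, p = p_value, or = odds_ratio)
}

# Toy data (made up): 100 cases and 100 controls, genotypes GG, GT, TT.
res <- allelic_test(cases = c(30, 50, 20), controls = c(50, 40, 10))
print(res$observed)
print(res$chisq)  # 9.6
print(res$or)     # 110 * 60 / (140 * 90), about 0.524
```

In this toy example the G allele is rarer among cases, so the odds ratio is below 1. The same statistic is returned by chisq.test(res$observed, correct = FALSE), which is a convenient cross-check.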
