PyRate Manual – v. 0.570
PyRate is a Python program to estimate speciation, extinction, and preservation rates from fossil occurrence data using a Bayesian framework. The methods and the program are described here:

Silvestro, D., Schnitzler, J., Liow, L.H., Antonelli, A. & Salamin, N. (2014) Bayesian Estimation of Speciation and Extinction from Incomplete Fossil Occurrence Data. Systematic Biology, 63, 349–367.

Silvestro, D., Salamin, N., Schnitzler, J. (in review) PyRate: A new program to estimate speciation and extinction rates from incomplete fossil record.

Please report suggestions and bugs to: [email protected]

Table of Contents
Compatibility and installation
Preparing input file for analysis
Input files, working directory
Main analysis settings
Preservation (fossilization) model
Birth-death model – constrained shifts
Settings for trait correlated rates (Covar models)
(Hyper)prior settings
Description of the output files
Plot/summarize results
Advanced (BD)MCMC settings
Tuning parameters
Miscellaneous
Acknowledgments
References

1. Compatibility and installation

PyRate has been tested under Unix operating systems (Mac OS 10.6 or higher and CentOS 6) and under Windows (XP and 7) using Python 2.6–2.7. The program requires the library argparse, which needs to be installed manually under Python 2.6 (it can be found here: https://code.google.com/p/argparse/), whereas it is part of the default installation in Python 2.7. Please note that Python 3.x is currently not supported; Python 2.7.x can be found here: https://www.python.org/downloads/. The libraries numpy and scipy are required. Source files and installers are available here: http://sourceforge.net/projects/numpy/files/ and http://sourceforge.net/projects/scipy/files/. Under UNIX systems, PyRate will use the libraries multiprocessing and thread (if available) to compute the likelihoods in parallel (see command -thread).
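A quick way to confirm that the libraries listed above are visible to a given interpreter is to try importing them. The following is a minimal sketch and not part of PyRate: the function name check_modules is hypothetical, and the code is written so that it runs under both Python 2 and Python 3.

```python
# Hypothetical helper (not part of PyRate): report which libraries
# can be imported by the interpreter that runs this script.

REQUIRED = ["argparse", "numpy", "scipy"]   # required according to the manual
OPTIONAL = ["multiprocessing", "thread"]    # used for parallel likelihoods;
                                            # 'thread' exists only under Python 2


def check_modules(names):
    """Return a dict mapping each module name to True if it imports."""
    status = {}
    for name in names:
        try:
            __import__(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


report = check_modules(REQUIRED + OPTIONAL)
for name in REQUIRED + OPTIONAL:
    kind = "required" if name in REQUIRED else "optional"
    state = "found" if report[name] else "NOT found"
    print("%s (%s): %s" % (name, kind, state))
```

Run the script with the same python executable that will be used to launch PyRate, since different installations on one machine may have different libraries available.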
To launch a PyRate analysis on UNIX, browse via Terminal to the PyRate directory and type:

./PyRate.py path_to_input_file/file_name.py [commands]

or

python PyRate.py path_to_input_file/file_name.py [commands]

To start PyRate on Windows, browse via Command Prompt to the PyRate directory and type:

python PyRate.py path_to_input_file/file_name.py [commands]

If you are working under Windows, please make sure that the path to python.exe is included in the PATH environment variable. To do so, edit the PATH environment variable and add the folder in which Python 2.x is installed (e.g. ‘C:\python27’). An easy tutorial on how to do that can be found, for example, on the Java website: https://www.java.com/en/download/help/path.xml

The function -plot (see below) generates an R script that is used to produce a graphic output. The script is automatically executed by PyRate using the shell command Rscript. If you are working under Windows, please make sure that the path to Rscript.exe is included in the PATH environment variable (this is the default on Mac/Linux). To do so, edit the PATH environment variable and add the \bin\ folder of the R installation (e.g. ‘C:\Program Files\R\R-2.14.0\bin\i386’), following the same tutorial: https://www.java.com/en/download/help/path.xml

The R script pyrate_utilities.r was tested under R version 2.x and 3.x. The function fit.prior requires the package fitdistrplus.

2. Preparing input file for analysis

PyRate requires a specific format for the input data, so please follow the next steps carefully. A correctly formatted input file can be generated from a table of fossil occurrence data using an R function provided in the script ‘pyrate_utilities.r’. All fossil occurrences need to be provided in a table (a tab-delimited text file) with four columns: species name, status ("extant" or "extinct"), minimum age, and maximum age.
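As an illustration of the table layout just described, the sketch below assembles a tab-delimited occurrence table in Python. It is only a sketch and not part of PyRate: the function name is hypothetical, the column headers (Species, Status, MinT, MaxT) follow this manual's Ursidae example, and whether extract.ages requires exactly these header names is an assumption. The two rows are taken from that same example.

```python
# Hypothetical helper (not part of PyRate): build the text of a
# tab-delimited occurrence table with the four mandatory columns.

COLUMNS = ["Species", "Status", "MinT", "MaxT"]  # header names assumed


def format_occurrence_table(rows):
    """Return the table as one string; rows are (species, status, min, max)."""
    lines = ["\t".join(COLUMNS)]
    for species, status, min_age, max_age in rows:
        if status not in ("extant", "extinct"):
            raise ValueError("status must be 'extant' or 'extinct': %r" % (status,))
        if min_age > max_age:
            raise ValueError("MinT exceeds MaxT for %s" % species)
        lines.append("\t".join([species, status, str(min_age), str(max_age)]))
    return "\n".join(lines) + "\n"


# Two occurrences from the manual's Ursidae example.
rows = [
    ("Ursus_etruscus", "extinct", 1.9, 2.6),
    ("Agriotherium_insigne", "extinct", 4.2, 5.3),
]
table_text = format_occurrence_table(rows)
print(table_text)
```

The returned string can be written to a text file and passed to extract.ages; the checks only catch the two formatting mistakes that are easiest to make by hand.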
The minimum and maximum ages commonly correspond to the temporal boundaries of the stage a particular fossil is assigned to and are generally available from the databases. At present, PyRate cannot deal with missing information in these four columns, so make sure that you remove such entries beforehand. One additional column may be added providing a trait value, if available, which can be used in the birth-death analysis (here missing data are allowed and should be given as NA). A typical input file may look like this:

Species                    Status    MinT   MaxT   Trait
Ursus_etruscus             extinct   1.9    2.6    90
Ursus_etruscus             extinct   1.2    1.8    90
Ursus_etruscus             extinct   2.6    3.4    90
Agriotherium_insigne       extinct   4.2    5.3    285
Ursavus_brevirhinus        extinct   8.2    9.0    80
Ursavus_brevirhinus        extinct   11.2   15.2   80
Agriotherium_intermedium   extinct   3.4    4.2    NA
...                        ...       ...    ...    ...

This file can then be processed using the R function extract.ages from the R utilities script provided with the PyRate package. To load the extract.ages function, open an R console and type:

> source(file = "/path_to_file/pyrate_utilities.r")

The extract.ages function needs the path to the file containing the raw data and has a few options that can be specified.

replicates
This option allows the user to generate several replicates of the data set in a single input file, each time re-drawing the ages of the occurrences at random from uniform distributions with boundaries MinT and MaxT. The replicates can be analyzed in different runs (see PyRate command -j), and combining the results of these replicates is a way to account for the uncertainty of the true ages of the fossil occurrences (see also Silvestro et al. 2014).
Examples:
replicates=1 (default; generates 1 data set)
replicates=10 (generates 10 random replicates of the data set)

cutoff
Specify a threshold to exclude fossil occurrences with a high temporal uncertainty, i.e. with a wide temporal range between MinT and MaxT.
Examples:
cutoff=NULL (default; all occurrences are kept in the data set)
cutoff=5 (all occurrences with a temporal range of 5 Myr or higher are excluded from the data set)

random
Specify whether to take a random age (between MinT and MaxT) for each occurrence or the midpoint age. Note that this option defaults to TRUE if several replicates are generated (i.e. replicates > 1).
Examples:
random=TRUE (default)
random=FALSE (use midpoint ages)

The extract.ages function can be called in an R console as follows:

> extract.ages(file = "/path_to_file/Ursidae.txt", replicates=10, cutoff=5, random=TRUE)

This resamples the ages of the fossil occurrences 10 times, randomly within their respective temporal ranges, and generates a Python file (here called ‘Ursidae_PyRate.py’) that can now be imported in PyRate for diversification rate analyses.

3. Input files, working directory

<input file>
Set the input file, including path and file name. The file is a Python file with the fossil occurrence data (as a list of NumPy arrays) and, optionally, one or more continuous traits. Input files with the correct formatting can be generated using the R script ‘pyrate_utilities.r’ (see above).
Example: python PyRate.py path_to_input_file/Ursidae_PyRate.py

-wd
Define the working directory where all output files will be saved. If not specified, output files will be saved in the same directory as the input file.
Example: -wd path_to_target_directory

-j
If the input file includes several data sets, e.g. generated using the R script ‘pyrate_utilities.r’ to account for uncertainties of fossil ages, this command defines which data set will be analyzed.
Example: -j 1 (default; the first data set from the input file is analyzed)

-out
Add a tag to the default stem name of the output files. This command is useful to avoid overwriting output files when running several instances of the same analysis.
Example: -out run_2 (adds ‘run_2’ to the name of all output files)

4. Main analysis settings

-A
-A 0: run an MCMC for parameter estimation, i.e. with a fixed number of shifts in the birth-death model (see section ‘Birth-death model – constrained shifts’). This analysis generates output files with posterior samples of the parameters, e.g. speciation/extinction/preservation rates and times of rate shifts (see section ‘Description of the output files’ for more details).
-A 1: run an MCMC with thermodynamic integration (TI; Lartillot & Philippe 2006) to estimate the fit of a birth-death model. This analysis computes the marginal likelihood of a birth-death model (saved to a ‘*_marginal_likelihood.txt’ file), which can be used to compare models, e.g. with different numbers of rate shifts or trait correlations.
-A 2: run a BDMCMC analysis (Silvestro et al. 2014; Stephens 2000) to jointly estimate the number of rate shifts in the birth-death process, their temporal placement, and the speciation and extinction rates between shifts. This analysis generates output files with posterior samples of the parameters, e.g. preservation rates and speciation/extinction rates through time (see section ‘Description of the output files’).
Example: -A 2 (default)

-n
Number of MCMC or BDMCMC generations.
Example: -n 10000000

-s
MCMC sampling frequency.
Example: -s 1000

-p
MCMC print-on-screen frequency.
Example: -p 1000

-b
Set the number of iterations to be discarded (i.e. not logged) from the analysis as burnin. It must be set to a reasonable number when estimating the marginal likelihood through TI (see command -A). This command is also used to exclude burnin samples when summarizing MCMC results (e.g. with the -mProb and -plot functions). When set to a number between 0 and 1, it is interpreted as a fraction of the total number of MCMC generations.
Examples:
-b 0 (default; all samples are logged)
-b 1000 (first 1,000 samples are discarded)
-b 0.10 (the first 10% of the samples are discarded)

NOTE: Additional commands are provided in the section ‘Advanced (BD)MCMC settings’.

5. Preservation (fossilization) model

-mHPP
Use a homogeneous Poisson process (HPP) for preservation rates instead of the non-homogeneous Poisson process (NHPP) with ‘hat-shaped’ (PERT) distributed preservation rates (Liow et al. 2010). Note that the NHPP is used unless otherwise specified.
Example: -mHPP

-mG
Set the Gamma model, allowing heterogeneity of the mean preservation rate across the taxa in a data set. Preservation rates under the Gamma model are assumed to follow a gamma distribution with shape parameter (alpha) estimated from the data. On most empirical data sets the Gamma model strongly outperforms the alternative assumption of constant rates, and modelling across-taxa rate heterogeneity with the Gamma model has been shown to account to some extent for temporal changes of the preservation rates (Silvestro et al. 2014). This option can be applied to both the NHPP and the HPP preservation models.
Example: -mG

-ncat
Set the number of categories used to obtain the gamma distributed preservation rates under the Gamma model. Increasing this number will allow for more variability of the rates across taxa, with comparatively little effect on the speed of the analysis. This command is used only when running under a Gamma model of preservation (i.e. with -mG).
Example: -ncat 4 (default)

-fixSE
Fix the times of speciation and extinction of all taxa based on a previous analysis. This command can be used to load a ‘mcmc.log’ file from which posterior mean times of speciation and extinction are calculated for all taxa. PyRate will then run the analysis using these values, i.e. without estimating them and without calculating the likelihood of the preservation process (NHPP or HPP), sampling only the parameters of the birth-death model.
If set to -fixSE null, first and last appearances are used as fixed times of speciation and extinction.
Example: -fixSE /path_to_file/Ursidae.mcmc.log

6. Birth-death model – constrained shifts

-mL
Set the number of speciation rates through time used in the MCMC (with -A 0 or -A 1). In BDMCMC analyses (-A 2) this command is only used to set the number of starting rates.
Example: -mL 2 (set 2 speciation rates and 1 rate shift)

-mM
Set the number of extinction rates through time used in the MCMC (with -A 0 or -A 1). In BDMCMC analyses (-A 2) this command is only used to set the number of starting rates.
Example: -mM 3 (set 3 extinction rates and 2 rate shifts)

-mC
Constrain the time frames of speciation and extinction rates to be equal. This command is only used in birth-death models with at least one rate shift (e.g. with -mL 2 and -mM 2).
Example: -mC

-fixShift
Fix the number and times of the rate shifts based on ages provided in a text file. This should simply be a text file with the ages at which the rate shifts should be fixed (see file ‘epochs.txt’ in PyRate’s example files). When this command is used, the number and temporal placement of the rate shifts are fixed and set identical for both speciation and extinction.
Example: -fixShift path_to_file/epochs.txt

7. Settings for trait correlated rates (Covar models)

-mCov
Set Covar models in which the birth-death rates (and preservation rate) vary across lineages as the result of a correlation with a continuous trait, provided as an observed variable, based on estimated correlation parameters (cov_sp, cov_ex, cov_q).
Examples:
-mCov 1 (correlated speciation rate)
-mCov 2 (correlated extinction rate)
-mCov 3 (correlated speciation and extinction rates)
-mCov 4 (correlated preservation rate)
-mCov 5 (correlated speciation, extinction, and preservation rates)

-trait
If the input file includes several traits, this command defines which trait will be analyzed.
Example: -trait 1 (default; the first trait is used)

8. (Hyper)prior settings

-N
Set the number of extant species (regardless of whether or not they are included in the fossil occurrence data). This is used to calculate the hyperprior on the speciation and extinction rates, i.e. conditioning on the present known diversity (Kubo and Iwasa 1995; Equations 12–15 in Silvestro et al. 2014). If set to -N -1 or if not specified, a gamma prior will be used instead, with shape and scale parameters defined by the flags -pL and -pM (see below). The latter option might be appropriate for clades that have gone extinct or when the current diversity is doubtful.
Example: -N 24

-pL
Shape and scale parameters of the gamma distributed prior on the speciation rate, i.e. the expected number of speciation events per lineage per Myr. This command is used only if the number of extant taxa is not specified (i.e. -N -1).
Example: -pL 1.1 1.1 (default)

-pM
Shape and scale parameters of the gamma distributed prior on the extinction rate, i.e. the expected number of extinction events per lineage per Myr. This command is used only if the number of extant taxa is not specified (i.e. -N -1).
Example: -pM 1.1 1.1 (default)

-pP
Shape and scale parameters of the gamma distributed prior on the preservation rate, i.e. the expected number of fossil occurrences per lineage per Myr.
Example: -pP 1.5 1 (default)

-pS
Shape parameter of the Dirichlet prior on the length of the time frames relative to the total time span of the data set. If set to -pS 1 the prior is a uniform distribution, whereas larger values will favor equally sized time frames.
Example: -pS 2.5 (default)

-pC
Standard deviation of the Normal prior on the correlation parameters under the Covar model. The Normal priors are centered on 0.
Example: -pC 1 (default)

9. Description of the output files

A typical PyRate analysis produces three output files:

*_sum.txt
Text file providing the complete list of settings used in the analysis.
*_mcmc.log
Tab-separated table with the MCMC samples of:
the posterior, prior, and likelihoods of the preservation process and of the birth-death (indicated by PP_lik and BD_lik, respectively);
the preservation rate (q_rate);
the shape parameter of its gamma distributed heterogeneity (alpha);
the parameters of the Covar model (cov_sp, cov_ex, cov_q);
the number of sampled rate shifts (k_birth, k_death; only logged in BDMCMC analyses);
the value of the scaling factor used in TI analyses (beta);
the time of origin of the oldest lineage (root_age);
the speciation/extinction rates between shifts (lambda_0, lambda_1, ... and mu_0, mu_1, ...; only logged under a fixed number of shifts, i.e. with -A 0 or -A 1);
the times of rate shifts in speciation and extinction (shift_sp_1, ... and shift_ex_1, ...; only logged with -A 0 or -A 1);
the total branch length (tot_length);
the times of speciation and extinction of all taxa in the data set (*_TS and *_TE, respectively).
This file can be used to calculate the sampling frequencies of birth-death models with different numbers of rate shifts after a BDMCMC analysis using the function -mProb (see section Plot/summarize results). Additionally, the file can be opened in the program Tracer (Rambaut and Drummond 2007) to check the efficiency and mixing of the MCMC and the proportion of burnin.

*_marginal_rates.log
Tab-separated table with the posterior samples of the marginal rates of speciation, extinction, and net diversification, calculated within 1 time unit (typically Myr). This file can be used to generate rates-through-time plots using the function -plot (see section Plot/summarize results).

When running an analysis to estimate the marginal likelihood of a birth-death model by TI (option -A 1), the ‘*_marginal_rates.log’ file is replaced by the following:

*_marginal_likelihood.txt
Text file providing the marginal likelihood of a birth-death model estimated by TI.
This value can be used to compare the relative fit of different birth-death models (e.g. with different numbers of shifts, fixed shift ages, trait-correlated rates using the Covar model, ...). The calculation of Bayes factors to quantify the relative model support can be done using the command -BF described below.

10. Plot/summarize results

-plot
This function takes the marginal speciation and extinction rates logged by a PyRate analysis in a file (named ‘*_marginal_rates.log’) and generates a rates-through-time (RTT) plot using the scripting language R. Two output files are generated: an R script named ‘*_RTTplot.r’ and a pdf file named ‘*_RTTplot.pdf’. The former contains the R source code for generating the graphic output saved in the pdf file. As for all other output files, by default these files will be saved in the same directory as the input file. Mean speciation, extinction, and net diversification rates through time are plotted in 1 Myr time bins with the respective 95% HPD. Several log files (e.g. from different replicates) can be loaded at once and combined in a single plot. The proportion of burnin to be excluded can be specified using the command -b (by default set to 0). We recommend inspecting the ‘*_marginal_rates.log’ file in Tracer to define the appropriate proportion of burnin.
Example: -plot file_name_marginal_rates.log

-mProb
Takes the posterior samples logged by a BDMCMC analysis in a file (named ‘*_mcmc.log’) and calculates the sampling frequencies of birth-death models with different numbers of rate shifts. The proportion of burnin to be excluded can be specified using the command -b (by default set to 0). We recommend inspecting the ‘*_mcmc.log’ file in Tracer to define the appropriate proportion of burnin.
Example: -mProb file_name_mcmc.log

-BF
Takes the marginal likelihoods calculated under two birth-death models (from ‘*_marginal_likelihood.txt’ files) to calculate Bayes factors and quantify the support of one model against the other. The Bayes factor is calculated as twice the difference of the log marginal likelihoods, and the degree of support is divided into four categories (negligible, positive, strong, very strong) based on Kass and Raftery (1995).
Example: -BF path_to_file/file_1_marginal_likelihood.txt path_to_file/file_2_marginal_likelihood.txt

11. Advanced (BD)MCMC settings

-r
Set the number of parallel ‘heated’ chains for Metropolis Coupled MCMC (MC3) analysis. Each chain will use a different processor if available. The number includes the ‘cold’ chain. This command is used only when running MCMC analysis for parameter estimation (i.e. with -A 0).
Example: -r 4 (for 1 cold and 3 heated chains)

-t
Set the ‘temperature’ parameter for the MC3 heated chains. This command is used only when running MC3 analysis (i.e. with -r > 1).
Example: -t 0.03 (default)

-sw
Frequency of attempted swaps between chains in MC3 analysis. This command is used only when running MC3 analysis (i.e. with -r > 1).
Example: -sw 100 (default)

-k
Number of scaling factors used for marginal likelihood estimation in TI. A higher number of scaling factors will improve the accuracy of the estimated marginal likelihood, but requires longer computational time. This command is used only when running TI analysis (i.e. with -A 1).
Example: -k 10 (default)

-a
Shape parameter of the beta distributed scaling factors in TI analyses (cf. Xie et al. 2011; Silvestro et al. 2014). This command is used only when running TI analysis (i.e. with -A 1).
Example: -a 0.3 (default)

-M
Frequency of model update in BDMCMC analysis. This parameter determines how frequently new birth-death models will be explored. Reducing this number can improve the sampling of the number of shifts in birth-death rates.
This command is used only when running BDMCMC analysis for parameter estimation (i.e. with -A 2).
Example: -M 25

-B
Set the birth rate at which the BDMCMC algorithm will propose new rate shifts in the model. This also corresponds to the parameter of the Poisson distributed prior on the number of speciation/extinction rates in the model. This command is used only when running BDMCMC analysis for parameter estimation (i.e. with -A 2).
Example: -B 1

-T
Set the time spent in updating the model in BDMCMC analysis. Increasing this parameter has a similar effect to increasing the birth rate (command -B). This command is used only when running BDMCMC analysis for parameter estimation (i.e. with -A 2).
Example: -T 1

-S
Set the number of generations after which the BDMCMC algorithm will start updating the model (e.g. after a burn-in phase). This command is used only when running BDMCMC analysis for parameter estimation (i.e. with -A 2).
Example: -S 1000

12. Tuning parameters

-tT
Window size of updates of speciation/extinction times (uniform sliding window).
Example: -tT 1 (default)

-nT
Maximum number of speciation/extinction times updated at a time. If set to 0, speciation/extinction times are set equal to first/last appearances and the preservation model is automatically set to HPP.
Example: -nT 5 (default)

-tQ
Window sizes of the preservation rate (q) and of the shape parameter of the gamma distributed rate heterogeneity (alpha), respectively (uniform sliding windows).
Example: -tQ 0.33 3 (default)

-tR
Window size of updates of speciation/extinction rates (uniform sliding window).
Example: -tR 0.05 (default)

-tS
Window size of updates of shift times for speciation/extinction rates (uniform sliding window).
Example: -tS 1 (default)

-fS
Frequency of updating shift times when updating birth-death parameters (otherwise rates are updated).
The value will be automatically set to 0 when no rate shifts are being sampled or if the times of shifts are fixed (with command -fixShift).
Example: -fS 0.7 (default)

-tC
Window sizes of updates of correlation parameters, when using models with rates covarying with a trait. Window sizes are given for covariation with speciation, extinction, and preservation rates, respectively. The parameters will be updated (or not) depending on the Covar model selected (see command -mCov).
Example: -tC 0.025 0.025 0.1 (default)

-fU
Update frequencies for preservation rate, birth-death parameters, and correlation parameters under the Covar model, respectively. The remaining frequency is used for updating speciation and extinction times.
Example: -fU 0.02 0.18 0.08 (default under Covar model; updates preservation parameters with frequency 2%, birth-death parameters with frequency 18%, Covar parameters with frequency 8%, and times of speciation and extinction with frequency 72%)

-fR
Fraction of birth-death rates updated at a time (with frequency defined by the command -fU). This command should be used to reduce the fraction of updated rate parameters, especially when running birth-death models with many shifts (e.g. defined by the command -fixShift), to improve the MCMC mixing.
Example: -fR 1 (default)

13. Miscellaneous

-v
Print the program’s version.

-h
Print the command list.

-cite
Print the PyRate citation.

-thread
Set the number of threads used for calculating the likelihood of the birth-death process and the NHPP likelihood. When both values are set to 0, PyRate will compute the likelihood sequentially (which slows the analysis down slightly) and will not use the multiprocessing Python library. Under MS Windows operating systems the sequential likelihood calculation is set by default.
Examples:
-thread 1 3 (default on UNIX operating systems)
-thread 0 0 (default on Windows operating systems)

The calibration of molecular phylogenies heavily relies on the assignment of prior distributions on the ages of particular internal nodes (generally derived from first appearances of fossils that belong to the clade of interest). However, the selection of appropriate prior distributions can be difficult and is critical for the correct calibration of the tree (Heath 2012). The R function fit.prior from the pyrate_utilities.r script can be used to derive a gamma distribution for the estimated time of speciation of any species in the data set. To load the fit.prior function, open an R console and enter:

> source(file = "/path_to_file/pyrate_utilities.r")

The function fit.prior needs the path to the log file (i.e. the ‘*mcmc.log’ file containing the posterior sample of the model parameters) and the name of the species of interest.

lineage
Name of the species for which a prior distribution should be generated. The default (“root age”) is the age of origin of the diversification process, i.e. the time of speciation of the oldest sampled species.
Example: lineage="Ursus_minimus" (generate a prior distribution for the time of speciation of Ursus minimus)

The fit.prior function can be called in an R console as follows:

> fit.prior(file = "/path_to_file/Ursidae_mcmc.log", lineage = "Ursus_minimus")

This generates a text file with the shape and scale parameters as well as the offset of the gamma distribution.

14. Acknowledgments

We are grateful to Susanne Fritz and Ingo Michalak for help testing the software on Windows and to the VITAL-IT cluster of the Swiss Institute for Bioinformatics where we ran simulations and analyses.

15. References

Kass, R.E. & Raftery, A.E. (1995) Bayes Factors. Journal of the American Statistical Association, 90, 773–795.

Kubo, T. & Iwasa, Y. (1995) Inferring the rates of branching and extinction from molecular phylogenies. Evolution, 49, 694–704.

Lartillot, N. & Philippe, H. (2006) Computing Bayes factors using thermodynamic integration. Systematic Biology, 55, 195–207.

Liow, L.H., Skaug, H.J., Ergon, T. & Schweder, T. (2010) Global occurrence trajectories of microfossils: environmental volatility and the rise and fall of individual species. Paleobiology, 36, 224–252.

Rambaut, A. & Drummond, A.J. (2007) Tracer: Available from http://beast.bio.ed.ac.uk/Tracer.

Silvestro, D., Schnitzler, J., Liow, L.H., Antonelli, A. & Salamin, N. (2014) Bayesian Estimation of Speciation and Extinction from Incomplete Fossil Occurrence Data. Systematic Biology, 63, 349–367.

Silvestro, D., Salamin, N., Schnitzler, J. (in review) PyRate: A new program to estimate speciation and extinction rates from incomplete fossil record.

Stephens, M. (2000) Bayesian Analysis of Mixture Models with an Unknown Number of Components – an alternative to reversible jump methods. The Annals of Statistics, 28, 40–74.

Xie, W., Lewis, P.O., Fan, Y., Kuo, L. & Chen, M.H. (2011) Improving marginal likelihood estimation for Bayesian phylogenetic model selection. Systematic Biology, 60, 150–160.
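The model comparison performed by the command -BF (twice the difference of the log marginal likelihoods, graded into four categories following Kass and Raftery 1995) can be sketched in Python as follows. This is a minimal illustration and not PyRate's own code: the function name and the example log marginal likelihoods are hypothetical, and the cut-offs of 2, 6, and 10 are the thresholds tabulated by Kass and Raftery (1995); that PyRate applies exactly these cut-offs, and how it treats values falling exactly on a boundary, are assumptions.

```python
# Hypothetical sketch of a Bayes factor comparison between two models.
# Inputs are log marginal likelihoods, e.g. as read by hand from two
# '*_marginal_likelihood.txt' files.


def bayes_factor_support(log_marginal_lik_1, log_marginal_lik_2):
    """Return (2 * |difference in log marginal likelihood|, support label)."""
    bf = 2.0 * abs(log_marginal_lik_1 - log_marginal_lik_2)
    # Thresholds after Kass and Raftery (1995); assumed, not taken from PyRate.
    if bf < 2:
        label = "negligible"
    elif bf < 6:
        label = "positive"
    elif bf < 10:
        label = "strong"
    else:
        label = "very strong"
    return bf, label


# Made-up values for illustration only.
bf, support = bayes_factor_support(-250.0, -246.5)
print("2*logBF = %.1f (%s support for the model with the higher likelihood)"
      % (bf, support))
```

The model with the higher (less negative) log marginal likelihood is the one favored; the label only grades how strongly.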