IST-Africa 2013 Conference Proceedings
Paul Cunningham and Miriam Cunningham (Eds)
IIMC International Information Management Corporation, 2013
ISBN: 978-1-905824-38-0

i4Life: Standardising the World’s Biodiversity Catalogue

Alastair CULHAM1, Magdalena SITKO1, Yuri ROSKOV1, Viktoras DIDŽIULIS1, Kwok CHEUNG1, Thomas KUNZE1, Peter SCHALK2, Wouter ADDINK2, Markus DÖRING3, Guy COCHRANE4, Stéphane RIVIÈRE4, Vincent ROBERT5, Wieslaw BOGDANOWICZ6, Craig HILTON-TAYLOR7, Walter BERENDSOHN8, Anton GÜNTSCH8, Andrew JONES9, Richard WHITE9, Thierry BOURGOIN10

1 University of Reading - UoR, Whiteknights Shinfield Rd, Reading, RG6 6UR, UK
Tel: +44 118 378 6466, Email: [email protected]
2 ETI BioInformatics, Van Steenisgebouw, Einsteinweg 2, Leiden, 2333 CC, Netherlands
Tel: +31 71 527 1350, Fax: +31 71 527 1351, Email: [email protected]
3 GBIF, Universitetsparken 15, Copenhagen, DK-2100, Denmark
Tel: +45 35 32 14 70, Fax: +45 35 32 14 80, Email: [email protected]
4 EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, CB10 1SD, UK
5 KNAW, CBS, Uppsalalaan 8, Utrecht, 3584 CT, Netherlands
Tel: +31 30 2122600, Fax: +31 30 2512097, Email: [email protected]
6 MIZ-PAS, Wilcza 64, Warsaw, 00-679, Poland
Tel: +48 22 629 32 21, Fax: +48 22 629 63 02, Email: [email protected]
7 UICN Species Programme Office, 219c Huntingdon Road, Cambridge, CB3 0DL, UK
Tel: +44 1223 277966, Fax: +44 1223 277845, Email: [email protected]
8 FUB-BGBM, Königin-Luise-Straße 6-8, Berlin, 14195, Germany
Tel: +49 30 838-50166, Email: [email protected]
9 Cardiff University - CU, 5 The Parade, Cardiff, CF24 3AA, UK
Tel: +44 29 20875537, Fax: +44 29 20874598, Email: [email protected]
10 MNHN, 45 rue Buffon, Paris, 75005, France
Tel: +33 (0)1 40 79 53 92, Email: [email protected]

Abstract: i4Life provides linkages between the Catalogue of Life, an expert based knowledge portal for living species on earth, and global partners (IUCN, GBIF, ENA
at EBI, BOLD, EoL, and LifeWatch) providing data portals for distribution, genetic diversity and conservation information. This novel e-infrastructure offers the only single global consensus list of living species on earth and their associated data. It uses custom services to cross-map, transfer, and make available subsets of this global list to interested users. It facilitates both global and local understanding of biodiversity, its distribution, variation and threats.

Keywords: e-infrastructure, biodiversity informatics, Catalogue of Life, species lists, data portals, taxonomy

1. Introduction

Degradation and loss of global biodiversity is one of the key issues of our time. Conservation of biodiversity is key to human survival, food security and quality of life. Climate change and population growth combine to worsen this threat. However, we are still without a complete, comprehensive and global list of the Earth’s living species against which to measure biodiversity change.

Copyright © 2013 The authors www.IST-Africa.org/Conference2013 Page 1 of 10

The i4Life project (www.i4Life.eu) is a 3-year European e-Infrastructure project, launched at the University of Reading on 1st November 2010, and is a continuation and expansion of the 4D4Life project. The project has brought together the Species 2000 [1] and ITIS [2] Catalogue of Life [3], an index of almost 1.4 million species, with six global biodiversity portals that together deliver around 700 million data records. The result is an accessible, cross-referenced service for species lists and for data associated with genetic diversity, conservation, identification and distribution, allowing a truly global understanding of life on earth. The Catalogue of Life is a clearing house for databases covering some 70% of the world’s known species, and this catalogue is growing steadily in quality and coverage. It represents common data from over 100 constituent databases curated and reviewed by experts.
Combining this knowledge infrastructure with the data portals of GBIF [4], IUCN Red List [5], ENA at EBI [6], BOLD [7] and EoL [8] gives unrivalled access to human knowledge of biodiversity, supporting appropriate legislative decisions, conservation planning and the focusing of research priorities.

Vast areas of agricultural landscape are predicted to become desert, and whole ecosystems will disappear or be degraded by species loss due to climate change and human impact. Through population growth, urbanisation, and changing land use, some of the ecosystems richest in wild species will continue to become fragmented or degraded beyond repair, leaving us with lost ecosystem services and lost species diversity. Internationally there is a perception that many sectors of society, especially in urban landscapes around the world, do not appreciate our extreme dependence on biodiversity, or the severe threats its loss brings to the livelihoods of the next generation. For scientists there remains a more fundamental issue: we are still racing to explore the extent of the species present on earth before extinctions wipe these records from our evolutionary history.

A key issue in global biodiversity science is how scientists can synthesise a comprehensive view of what global biodiversity is [9], of its myriad components and of how it functions, and how they can model and forecast its response to the pressures of urbanisation and climate change [10] [11] [12] [13] [14] [15]. Several projects have attempted this previously; particular challenges have included the lack of complete data on a global scale [16] and the knitting together of legacy software [17] [18]. Harmonising the differing catalogues of species around the world is a crucial part of this synthesis and has enormous practical significance in indexing the knowledge needed to protect biodiversity.
The six ‘global biodiversity programmes’ that are partners in i4Life are the Global Biodiversity Information Facility (GBIF), the European Nucleotide Archive (part of INSDC), the Barcode of Life initiatives, the IUCN Red List, the new LifeWatch [19] [20] programme, and the Encyclopedia of Life. Each currently has its own taxonomy of species, which makes cross-referencing species between data portals an unreliable process [21]. Through i4Life these portals are being cross-referenced to a standardised Catalogue of Life list, so that searches for a named species will be searches for the same thing in each place (Figure 1). The target is to enable each programme to enhance its taxonomic catalogue with the assistance of the others, and to create a harmonised list for the entire set of organisms based around the expert-edited Catalogue of Life.

These key players present particular hurdles to Catalogue integration because they a) have established their own architectures, standards and protocols, b) have special requirements, and c) have their own partial catalogues that need to be integrated with the Catalogue of Life in a two-way flow. In each case i4Life has designed and implemented the necessary special data pipelines, is now testing them, and is contributing to the enhancement of the Catalogue of Life taxonomic coverage through input from the global partners. By providing access to a common species catalogue within each organisation, we expect to contribute a much needed level of knowledge integrity across the various scientific and community studies of the global biota. To make sense of global biodiversity it is vital that these organisations can communicate through a unified view of the extent of life.
Figure 1 - The Six Global Biodiversity Partners in i4Life are Becoming Interlinked with the Taxonomic Backbone of the Catalogue of Life

A host of independently run and separately funded programmes are making progress in documenting specific dimensions of global and regional biodiversity. Progress is slower, however, in synthesising these data into more broadly coherent pictures of life on earth, its processes of change, and opportunities for its sustainable management. Most such programmes are “owned” by one country or regional authority; the Catalogue of Life is not, and it provides free access to data via a web portal. Across the various data portals, information on species in particular is fragmented and confused, with different standards and cultural perspectives persisting in each. The Catalogue of Life offers the opportunity to bridge those gaps and, through the i4Life project, is building the required linkage: an ecosystem of services for standardised species naming and classification.

The Catalogue of Life is a long-established and widely used product. Its community of users is developing to encompass a wealth of public, scientific, societal and commercial needs for a single, consistent and reliable index of the world’s species (Figure 2).

Figure 2 - Community of Users of the Catalogue of Life

Apart from its many individual users, the CoL provides services to global biodiversity portals. GBIF has used the CoL as its taxonomic backbone since 2007. The Encyclopedia of Life has used the CoL as its taxonomic backbone from the beginning. Thanks to the i4Life project this cooperation is now strengthened.

Although the majority of visits to the Catalogue of Life portal are from North America and Europe, the Catalogue of Life is widely used in African countries as well. The map below shows the distribution of visits to the CoL by country in Africa (Figure 3).
There is so far no indication of direct use of the Catalogue of Life web portal from the following countries: Chad, Equatorial Guinea, Eritrea, Guinea-Bissau, Sierra Leone, Western Sahara and South Sudan. However, in addition to its online functionality, the Catalogue of Life Annual Edition is distributed in DVD form to many places around the world, including many African countries. We have sent DVDs directly on individual request to 16 African countries, including most of the countries listed above, with the exception of Western Sahara and South Sudan. We also send DVDs to GTI/CHM Focal Points, which send out DVDs to 49 countries in Africa, and to GBIF and GBIF Node Managers covering 16 African countries.

Figure 3 - Distribution of online visits from Africa to the CoL, by country (Google statistics 2013).

2. Objectives

The principal goal of i4Life has been to establish the Virtual Research Community that will interlink and harmonise the global taxonomic catalogues presently developed by each of the global partners. The existing Catalogue of Life has been used as a backbone.

1. The project has conducted a gap analysis of our knowledge and worked to facilitate communication among scientists and among all people with an interest in life on earth.
2. It has enabled the programme of each global partner to enhance its catalogue with the assistance of the others, and to create a harmonised list for the entire set of organisms.
3. i4Life will for the first time provide a summary of all species known across these programmes.
4. It is creating a global standard for taxonomic data integration in electronic infrastructures world-wide.
5. It has built on the work of the 4D4Life Project that presently supports the internal ‘service ecosystem’ of the Species 2000 Catalogue of Life database networks.

3. Methodology

The global programme partners are exploring or aggregating knowledge on the extent of species diversity on earth as a part of their scientific work on different aspects of global biodiversity: global species distribution modelling, genome and sequence diversity, species identification using DNA Barcodes, conservation, etc. Several of these programmes use and publish well-developed and widely used taxonomic catalogues, for instance the IUCN Red List and the NCBI Taxonomy displayed in EBI. This project takes the first steps towards a global virtual research community working with these leading global organisations (the five global programmes plus the CoL) to share some of the responsibility for exploring and cataloguing the extent of species diversity, making use where appropriate of the CoL as a common base alongside their own catalogues (Figure 5). It includes both practical installations and informatics:

1. Installing the Catalogue of Life within each global partner’s informatics platform, and making it available for comparison with its own taxonomy and, where appropriate, for that organisation’s user interface.
2. Establishing maintained cross-maps at the taxonomic concept level, so that internal staff and external users of each organisation can, where appropriate, see the relations between their own taxonomy and that of the CoL.
3. Establishing a workflow by which each organisation can contribute missing taxa and names to the CoL. Although large and growing, the CoL remains patchily incomplete. Data flow from partners can provide both additional names and taxa encountered in their present systems, and the ongoing inflow of new taxa arising from their own business. Because the CoL is a federated system with taxonomic expert responsibilities taken by an array of 115 organisations, this exercise also requires an extensive new workflow within the CoL organisation.
In some cases adding the precise taxa or names used by one of the global partner organisations will add significantly to the usability of the CoL within that community.

Figure 5

4. Technology Description

The Virtual Research Network established by i4Life has the largest direct impact on the working process, on standardisation, and on coherent outcomes for those working on species exploration and documentation within the Global Programmes. To fulfil the goals of the project, three workflows have been established:

1. The Download Tools Workflow - the purpose of this workflow is to deliver the download tool. The download format for data exchange across the project is the Darwin Core Archive format (DwC-A) [22] [23]. The download service is operational.
2. The Cross-mapping Tools Workflow - the purpose of this workflow is to deliver the cross-mapping tool. The development version of the tool is operational.
3. The CoL Piping Tools Workflow - the purpose of this workflow is to deliver the piping tool. The development version of the tool is operational.

These tools provide a web-accessible route to connect data among the partners and to deliver output in Darwin Core Archive format to users. Development is through an international partnership led by Reading University with collaborators at ETI Bioinformatics (Netherlands) and Cardiff University (UK). An overview of the workflows is given in the diagram below (Figure 6).

Figure 6 - The i4Life systems diagram. GBPs: Global Biodiversity Programmes. GSDs: the Catalogue of Life source databases, called Global Species Databases.

5. Developments

One of the tasks in the project was to establish a workflow that delivers refreshed instances of the Catalogue of Life taxonomy within the very heterogeneous working platforms and portals of the global programme partners (GBIF, EBI, ECBOL, IUCN and LifeWatch), using a CoL download service developed by ETI as its base. The i4Life project has designed and implemented the enhanced download service of the Catalogue of Life, which serves all global partners. This service is currently available only within the project, but the next stage of development will be to open the service more widely.

A crucial requirement for comparing differing catalogues, and for the partners to be able to share species information, is a tool allowing taxonomic cross-mapping between the lists of names present in each of them. This will enable identification of how taxonomy differs from one organisation to another. The cross-mapping tools have been developed by the team from Cardiff University and the first phase of development has finished. Establishing the relationships among names and actual species is a complex process usually requiring experience and knowledge; however, a large element of the work can be entirely or partially automated. The software has to cope with relationships as simple as “equal to” and as complex as “partially included in” or “overlaps with”, which can be very challenging to work out.

The piping tools are designed to improve the quality of the taxonomic backbone of the i4Life partners’ databases. The piping tools workflow developed at Reading University accepts submissions of lists of species names that are not in the CoL database and then assigns them to the relevant GSDs for inclusion in their taxonomic databases. This has been achieved by providing a web-based user interface for name-list uploads, downloads, reporting placement status, commenting on species and taxa names, and displaying workflow statistics and progress.
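The two ideas in this section, classifying how taxon concepts in two catalogues relate and routing names that are missing from the CoL towards a source database, can be illustrated with a minimal sketch. It is not the Cardiff or Reading implementation: it assumes each taxon concept can be represented as the set of names it covers, that responsibility for a name can be looked up from its family, and that all function names, the `"unassigned"` bucket and the placeholder names (`Aus bus`, `GSD-X`) are hypothetical.

```python
from enum import Enum


class Relation(Enum):
    """Illustrative set of concept relationships (an assumption, not the i4Life list)."""
    EQUAL = "equal to"
    INCLUDED_IN = "included in"
    INCLUDES = "includes"
    OVERLAPS = "overlaps with"
    DISJOINT = "no overlap"


def relate(concept_a, concept_b):
    """Compare two taxon concepts, each given as the set of names it covers."""
    a, b = frozenset(concept_a), frozenset(concept_b)
    if a == b:
        return Relation.EQUAL
    if a < b:  # every name in a is also in b, and b has more
        return Relation.INCLUDED_IN
    if a > b:
        return Relation.INCLUDES
    if a & b:  # some names shared, but neither contains the other
        return Relation.OVERLAPS
    return Relation.DISJOINT


def route_missing_names(submitted, col_names, gsd_by_family):
    """Group submitted names that are absent from the CoL by responsible database.

    submitted: iterable of (scientific_name, family) pairs.
    Names whose family has no known database are placed under "unassigned".
    """
    routed = {}
    for name, family in submitted:
        if name in col_names:
            continue  # already catalogued, nothing to pipe
        gsd = gsd_by_family.get(family, "unassigned")
        routed.setdefault(gsd, []).append(name)
    return routed


# Placeholder names only; no real taxa or databases are implied.
partner_concept = {"Aus bus", "Aus cus"}
col_concept = {"Aus cus", "Aus dus"}
print(relate(partner_concept, col_concept).value)  # overlaps with
```

Comparing name sets only captures part of the problem described above; deciding which names belong to a concept in the first place is where expert judgement remains necessary.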
The piping tools are operational and have already been used to serve data distribution for the i4Life pilot projects.

A key step towards achieving the project objectives was the transformation of the Catalogue of Life data to a common, internationally recognised data exchange format. The initial specification of a reduced DwC-A format, agreed in the first year of the project, was enhanced and refined during the second year into a document called ‘The i4Life DwC-A profile’ (www.i4life.eu). This document forms the basis for a common checklist exchange format.

6. Results

A project-specific version of the DwC-A format has been specified and is in use across the project for exchanging information between the partners and the Catalogue of Life.

The initial implementation of tools in the Global Partner Organisations has started. The GBIF data portal has now been upgraded from the 2007 Catalogue of Life, with 1 million species, to the current version of the Catalogue, containing almost 1.4 million species. This increased coverage by the Catalogue of Life has already helped to resolve some of the problematic taxonomic records in the GBIF data portal, making searches more successful.

The DNA database at EBI now has a public taxon portal available in prototype, and EBI are currently engineering a more robust backend data warehouse to support it. There are more than 800,000 species (and infraspecies) in the EBI database.

The first implementation linking the IUCN Red List web site to the CoL using the CoL webservice has been completed. It tells outside users when a species name they have searched for is not in the IUCN Red List but is in the CoL. The fundamental importance of this is that users are reassured that they have searched for a real species with a recognised spelling, and are not tempted to try other spellings or simply to give up their search.

The specification of how the Catalogue of Life will cooperate with the Barcode community was discussed and agreed.
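The Red List fallback described above can be sketched as a two-stage lookup. This is an illustration only: the real link uses the CoL webservice, whereas this sketch assumes both name lists are already available in memory, and the function names, status strings and placeholder species names are hypothetical.

```python
def normalise(name):
    """Collapse repeated whitespace and ignore case before comparing names."""
    return " ".join(name.split()).casefold()


def search_with_fallback(query, red_list_names, col_names):
    """Classify a searched name.

    Returns "assessed" if the name is on the Red List,
    "recognised, not assessed" if it is only in the CoL,
    and "not found" otherwise.
    """
    key = normalise(query)
    if key in {normalise(n) for n in red_list_names}:
        return "assessed"
    # Fallback: the name is absent from the Red List, so check the catalogue.
    if key in {normalise(n) for n in col_names}:
        return "recognised, not assessed"
    return "not found"


# Placeholder names only; no real taxa are implied.
red_list = ["Aus bus"]
catalogue = ["Aus bus", "Aus cus"]
print(search_with_fallback("Aus cus", red_list, catalogue))  # recognised, not assessed
```

The middle status is the one that matters to the user: it distinguishes a correctly spelled species that has simply not been assessed from a name that neither source recognises.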
A connection with the BOLD database will be made through the BOLD European Mirror based in Utrecht (KNAW). The first stage of the process by which the Catalogue of Life will be incorporated into the BOLD platform has been completed.

The pipeline between the Catalogue of Life and the EDIT platform (part of the LifeWatch project) has been implemented. The specification addresses both the data flow between the two platforms and the element mapping between concepts of the DwC-A format and the EDIT-CDM.

The i4Life download service has been successfully used for data quality measures in the OpenUp! project (http://open-up.eu/). The service is part of the “Data Quality Toolkit” used by curators of specimen databases to check names associated with zoological objects. In the first 6 months of deployment of the Data Quality Toolkit, 220,000 zoological metadata records have already been checked. In this way the i4Life download service contributes significantly to the quality and retrievability of collection data provided by European natural history collections.

The cross-mapping tools have also been adopted by the related EC-funded EU-Brazil OpenBio project (http://www.eubrazilopenbio.eu/Pages/Home.aspx) and adapted for use within the GCUBE high-performance computing environment, and the cross-mapping software design is being incorporated as a service within the EC-funded BioVeL project (http://www.biovel.eu/). The software is therefore becoming recognised as a product that can contribute to the scientific infrastructure of a number of biodiversity initiatives, helping to support inter-operation at the taxonomic level between heterogeneous data sets.

7. Business Benefits

Given the huge societal, political, and economic interest in global biodiversity, these six global biodiversity data organisations (with their network partners) find themselves the bearers of a significant world responsibility, both in enumerating the extent of species diversity and in assembling an electronic taxonomic backbone that can serve society’s needs all around the world. The UN Convention on Biological Diversity has clearly recognised the significance and urgency of this task. i4Life is projected to make a significant impact in five key areas:

1. The single largest impact will be to put in place the process of integration between the major global biodiversity programmes, so that those in molecular biology, biodiversity science, ecology and conservation can co-operate in the creation of a single authoritative catalogue showing the detail and extent of species on earth.
2. Opening up a range of electronic e-taxonomy services as infrastructure that can be built into the main global biodiversity information portals: GBIF, EBI, Barcode initiative, IUCN, LifeWatch, EoL.
3. Opening up the same range of electronic e-taxonomy services as infrastructure that can become part of the seamless architecture of biodiversity informatics, available to be built into the much wider range of national, local and NGO portals in Europe and around the world, and into novel experimental systems that are the subject of a growing research and development community. This will provide a significant contribution to the connected structures needed for the global virtual biodiversity analysis laboratory prioritised and endorsed by the e-Biosphere 09 ‘Roadmap’ planning group of June 4/5 2009.
4. Opening up the same range, plus additional specially tailored e-taxonomy services, to certain scientific communities, such as the biodiversity and climate change modelling community, the oceanographic community, and the genomics community.
5. And finally, simply by enhancing Catalogue of Life content and services for global biodiversity partners we anticipate significant impact on the wider user community.

8. Conclusions

This project is providing a vital and significant improvement to the provision of a world species list and, through co-working with the global programmes, a vital first step towards many future improvements. The technology is now in place to deliver an enhanced Catalogue of Life via a dedicated data portal (www.catalogueoflife.org) with associated services and archived annual checklists. The project is also developing a range of additional services which are planned to serve the global biodiversity community in its common goal to enhance data quality and harmonise the species catalogues worldwide.

List of Terms Used in the Text

The Catalogue of Life - The Catalogue of Life, started in June 2001 by Species 2000 and the Integrated Taxonomic Information System (ITIS), is becoming a comprehensive catalogue of all known species of organisms on Earth. The Catalogue currently compiles data from 115 peer-reviewed taxonomic databases that are maintained by specialist institutions around the world. The twelfth annual edition of the catalogue (May 2012) included 1,404,038 species.

Species 2000 - Species 2000 is a federation of database organizations across the world that compiles the Catalogue of Life, a comprehensive checklist of the world's species, in partnership with the Integrated Taxonomic Information System (ITIS). Species 2000 was initiated by Frank Bisby and colleagues at the University of Reading in the UK in 1997 and the Catalogue of Life was first published in 2001.

ITIS - The Integrated Taxonomic Information System (ITIS) is a partnership designed to provide consistent and reliable information on the taxonomy of biological species. ITIS was originally formed in 1996 as an interagency group within the U.S.
federal government, involving several US Federal agencies, and has now become an international body, with Canadian and Mexican government agencies participating.

GBIF - The Global Biodiversity Information Facility (GBIF) is an international organisation that focuses on making scientific data on biodiversity available via the Internet using web services.

IUCN RedList - Provides taxonomic, conservation status, and distribution information on taxa that are facing a risk of global extinction.

ENA at EBI - The European Nucleotide Archive (ENA) captures and presents information relating to experimental workflows that are based around nucleotide sequencing.

INSDC - International Nucleotide Sequence Database Collaboration. The International Nucleotide Sequence Databases (INSD) have been developed and maintained collaboratively between DDBJ, ENA, and GenBank for over 18 years.

NCBI - National Center for Biotechnology Information. U.S. government-funded national resource for molecular biology information.

BOLD - The Barcode of Life Database is designed to support the generation and application of DNA Barcode data.

EoL - The Encyclopedia of Life, a free, online collaborative encyclopedia intended to document all of the 1.9 million living species known to science.

LifeWatch - A European research infrastructure in development (http://www.lifewatch.eu/en_GB).

DwC-A - Darwin Core Archive is a biodiversity informatics data standard that makes use of the Darwin Core terms to produce a single, self-contained dataset for species occurrence or checklist data (http://www.gbif.org/informatics/standards-and-tools/publishing-data/datastandards/darwin-core-archives/).

Acknowledgement

Indexing for Life (i4Life) is a European e-Infrastructure project co-funded by the European Commission’s Seventh Framework Programme for Research and Technological Development.

References

1. Species 2000. Available from: www.sp2000.org.
2. ITIS, Integrated Taxonomic Information System. Available from: www.itis.gov/.
3. Catalogue of Life - 2012 Annual Checklist. Available from: http://www.catalogueoflife.org/annualchecklist.
4. Global Biodiversity Information Facility, GBIF. Available from: www.gbif.org.
5. The IUCN Red List of Threatened Species. Available from: www.iucnredlist.org.
6. ENA - European Bioinformatics Institute. Available from: www.ebi.ac.uk/ena.
7. BOLD Systems. Available from: www.barcodinglife.com.
8. Encyclopedia of Life - Animals - Plants - Pictures & Information. Available from: eol.org.
9. Bisby, F.A., The quiet revolution: Biodiversity informatics and the internet. Science, 2000. 289(5488): p. 2309-2312.
10. Heap, M.J. and A. Culham, Automated Pre-processing Strategies for Species Occurrence Data Used in Biodiversity Modelling, in Knowledge-Based and Intelligent Information and Engineering Systems, Pt Iv, R. Setchi, et al., Editors. 2010. p. 517-526.
11. Heap, M.J., A. Culham, and J. Osborne, The Benefits of a Compute Cluster Approach to High Spatial Resolution Biodiversity Richness Modelling: Projecting the Impact of Climate Change on Mediterranean Flora. The International Journal of Climate Change: Impacts and Responses, 2012. 4: p. 115-218.
12. Culham, A. and C. Yesson, Biodiversity informatics for climate change studies, in Climate change, ecology and systematics, T. Hodkinson, et al., Editors. 2011, Cambridge University Press: Cambridge. p. 131-242.
13. Yesson, C. and A. Culham, Phyloclimatic modeling: Combining phylogenetics and bioclimatic modeling. Systematic Biology, 2006. 55(5): p. 785-802.
14. Yesson, C. and A. Culham, Biogeography of cyclamen: an application of phyloclimatic modelling, in Climate change, ecology and systematics, T. Hodkinson, et al., Editors. 2011, Cambridge University Press: Cambridge. p. 265-279.
15. Yesson, C., N.H. Toomey, and A. Culham, Cyclamen: time, sea and speciation biogeography using a temporally calibrated phylogeny. Journal of Biogeography, 2009. 36(7): p. 1234-1252.
16. Yesson, C., et al., How Global Is the Global Biodiversity Information Facility? Plos One, 2007. 2(11).
17. Pahwa, J.S., et al., Accessing Biodiversity Resources in Computational Environments from Workflow Applications. 2006 Workshop on Workflows in Support of Large-Scale Science. 2006. 11-20.
18. Jones, A.C., et al., Building a biodiversity GRID, in Grid Computing in Life Science, A. Konagaya and K. Satou, Editors. 2005. p. 140-151.
19. Lifewatch Project. Available from: http://www.lifewatch.eu/en_GB
20. Berendsohn, W.G., et al., Biodiversity information platforms: From standards to interoperability. Zookeys, 2011(150): p. 71-87.
21. Thessen, A.E. and D.J. Patterson, Data issues in the life sciences. Zookeys, 2011. 150: p. 15-51.
22. Robertson, T., et al., Darwin Core Text Guide. Biodiversity Information Standards (TDWG). 2009. 2011.
23. Wieczorek, J., et al., Darwin Core. Biodiversity Information Standards (TDWG). 2009.