Study of the distribution of Fredericella sultana in
Scuola di Ingegneria Civile, Ambientale e Territoriale
Dipartimento di Elettronica, Informazione e Bioingegneria
Master of Science in Environmental and Land Planning Engineering

Study of the distribution of Fredericella sultana in Switzerland in relation to environmental variables to predict the diffusion of Proliferative Kidney Disease in fish populations

Submitted April 1st, 2015 by Irene Bardi
Student ID n. 801780
Supervisors: Enrico Bertuzzo, Prof. Renato Casagrandi
Academic Year 2014-2015

Abstract

Proliferative Kidney Disease (PKD) is one of the most important parasitic diseases of salmonid populations in Europe and North America. It causes significant economic losses to fish farms and has a significant impact on wild fish populations. The causative agent of PKD is the myxozoan Tetracapsuloides bryosalmonae, which uses freshwater bryozoans as intermediate hosts. The most common host species for T. bryosalmonae is the bryozoan Fredericella sultana. The objective of this Master thesis was to create a probability distribution model for the presence of F. sultana in Switzerland from presence-only records of infected trout and by analyzing various environmental variables. The selected environmental variables were both local and relative to the entire upstream catchment area. These environmental variables describe climate, river slope, land cover, geology and pollutant release facilities.
All of the data were first processed with a GIS and Matlab in order to compute the required environmental variables. The probability distribution was then created with MaxEnt, a Species Distribution Model (SDM) that combines observations of a species' presence with environmental variables that may affect the suitability of the species' habitat. Several runs were performed in order to evaluate the different modelling options. All of the runs performed significantly better than the random model, and the AUC was high in all cases. The predicted probability of presence of F. sultana is higher near the large cities of Switzerland; this is consistent with the most important variable, the percentage of built-up area in the upstream basins. The other important variables were found to be the mean river slope, the mean altitude (local and upstream) and the percentage of alluvial rocks in the upstream basins.

Summary

Proliferative kidney disease (PKD) is one of the most important parasitic diseases of salmonid populations in Europe and North America. The disease causes major economic losses to fish farms and has a considerable impact on wild fish populations. The agent that causes PKD is the myxozoan Tetracapsuloides bryosalmonae, which uses freshwater bryozoans as intermediate hosts. The most common host species for T. bryosalmonae is the bryozoan Fredericella sultana. The objective of this Master's thesis is to create a probability distribution model for the presence of F. sultana in Switzerland using presence-only data on infected trout and analysing several environmental variables. The selected environmental variables are both local and relative to the entire upstream catchment. They cover climate, river slope, land cover, geology and any pollutant release facilities.

All the data were processed with GIS tools and Matlab to obtain the desired environmental variables. The probability distribution was created with MaxEnt, a species distribution model (SDM) that combines observations of species presence with environmental variables that could affect habitat suitability for the species. Several runs were carried out to evaluate all the possible modelling options. All the runs perform significantly better than the random model, and the AUC value was high in all cases. The predicted probability of presence of F. sultana is higher near the large Swiss cities, which is consistent with the most important variable: the percentage of built-up area in the upstream basins. The other important environmental variables turned out to be river slope, mean altitude (local and of the upstream basins) and the percentage of alluvial rocks in the upstream basins.

Acknowledgements

My first thanks go to the people who made this work possible, my supervisors Enrico Bertuzzo and Renato Casagrandi. Thank you to Enrico Bertuzzo for always being available and present for every new idea, doubt or problem. Thank you to Renato Casagrandi for giving me the opportunity to carry out my thesis abroad and for encouraging me in moments of uncertainty. An enormous thank you goes to my parents for their continuous and unconditional encouragement. Thank you for always believing in me; you have been fundamental. Thanks also to Elena, who has put up with me all these years, even when I used the excuse... "well, I am at university... things are harder here"; now it will be your turn too! Thank you to Martin for always being there and helping me with every little difficulty; you have been a point of reference for me. Thank you to Marco, coffee companion at Sat, fount of Matlab wisdom and trusted friend.
Thanks also to Bea, Iris and Gib, not only colleagues but friends, always ready to support and encourage me; without you these five years would have been much harder. Thank you to Dani, Simo, Umbe, Pietro and Rebbi, companions in this wonderful adventure, always able to make me smile.

Contents

Acknowledgements
1 Introduction
  1.1 A general overview
  1.2 Methods
2 Data
  2.1 Subdivision of Switzerland in drainage basins
    2.1.1 PKD presence data
    2.1.2 Altitude
    2.1.3 Temperature
    2.1.4 Geology
    2.1.5 Land cover
    2.1.6 Swiss pollutant register
  2.2 Data Analysis
    2.2.1 Drainage basins
    2.2.2 Altitude
    2.2.3 River slope
    2.2.4 Temperature
    2.2.5 Geology
    2.2.6 Land cover
    2.2.7 Swiss pollutant register
3 Maxent
  3.1 Introduction to Maxent
  3.2 Explanation of Maxent
4 Results
  4.1 Data used in Maxent
  4.2 Correlation
  4.3 The runs
    4.3.1 Run 1
      4.3.1.1 Analysis of omission/commission
      4.3.1.2 Response curves
      4.3.1.3 Analysis of variable contributions
    4.3.2 Run 2
      4.3.2.1 Analysis of omission/commission
      4.3.2.2 Analysis of variable contributions
    4.3.3 Run 3 and run 4
    4.3.4 Run 5
      4.3.4.1 Analysis of omission/commission
      4.3.4.2 Response curves
      4.3.4.3 Analysis of variable contributions
    4.3.5 Run 6
    4.3.6 Run 7
    4.3.7 Run 8
    4.3.8 Run 9
    4.3.9 Run 10
    4.3.10 Run 11
5 Discussion and conclusions
  5.1 Discussion
      5.1.0.1 Analysis of the data
    5.1.1 Correlation
      5.1.1.1 Local variables
      5.1.1.2 Up-stream variables
    5.1.2 The runs
      5.1.2.1 Run 1
      5.1.2.2 Run 2
      5.1.2.3 Run 3 and 4
      5.1.2.4 Run 5
      5.1.2.5 Run 6
      5.1.2.6 Run 7
      5.1.2.7 Run 8
      5.1.2.8 Run 9, 10, 11
    5.1.3 All the runs
  5.2 Conclusion and future work

List of Tables

2.1 Data sources PKD
2.2 Extract from the legend of the lithology
2.3 Data source land cover
2.4 Geology legend
4.1 Environmental variables for the distribution of Fredericella sultana
5.1 Percent contribution of the most important variables for all the different runs

List of Figures

1.1 T. bryosalmonae life cycle [1]
1.2 Map of Switzerland
2.1 Basic geometry
2.2 Zoom of basic geometry
2.3 Example of selection of all the upstream basins
2.4 Example of selection of all the downstream basins
2.5 Map presence PKD
2.6 Altitude
2.7 Mean temperature in January
2.8 Example of a system of basins and its adjacency matrix
2.9 Adjacency matrix
2.10 Coloured adjacency matrix with coordinates
2.11 Adjacency matrix with coordinates and lakes
2.12 Workflow altitude
2.13 Comparison between the raster of the polygons and the DTM
2.14 Mean altitude
2.15 Mean upstream altitude
2.16 Plot log flow accumulation vs. log river slope
2.17 Mean flow accumulation
2.18 Log mean flow accumulation
2.19 Log mean river slope
2.20 Log mean drained river slope
2.21 How Zonal Statistics works [2]
2.22 Mean monthly temperature January 2013
2.23 Geology
2.24 Zoom of geology before and after dissolve
2.25 How Statistic Tabulate Intersection works [2]
2.26 Extract of the geology table
2.27 Percentage of different classes of geology
2.28 Land cover
2.29 Percentage of different classes of land cover
2.30 Pollutant releases
2.31 Waste and waste water management facilities
4.1 Extract of the csv file used for Maxent
4.2 Correlation matrices for the variables
4.3 Omission and predicted area for F. sultana - run 1
4.4 ROC for Fredericella sultana - run 1
4.5 Response curves - run 1
4.6 Importance of different variables - run 1
4.7 Jackknife - run 1
4.8 Omission and predicted area for Fredericella sultana - run 2
4.9 ROC - run 2
4.10 Thresholds table - run 2
4.11 Importance of different variables - run 2
4.12 Jackknife, regularized training gain - run 2
4.13 Jackknife - run 2
4.14 Response curves - run 3 and 4
4.15 Average omission and predicted area for F. sultana - run 5
4.16 ROC - run 5
4.17 Response curves for variable 39 - run 5
4.18 Jackknife - run 5
4.19 Importance of the most important variables - run 2 and 6
4.20 Probability of presence F. sultana - run 6
4.21 Probability of presence Fredericella sultana - run 8
4.22 Importance of different variables - run 10
4.23 Probability of presence F. sultana - run 10
4.24 Importance of different variables - run 11
4.25 Probability of presence F. sultana - run 11
5.1 Response curve of variable 39
5.2 ROC for Fredericella sultana - run 7
5.3 Response curves variable 39 - run 7
5.4 Probability of presence F. sultana - run 6 and 8
5.5 Response curve of variable 39 - run 9
5.6 Response curve of variable 4 - run 2

Chapter 1 Introduction

1.1 A general overview

Freshwater ecosystems are among the most important of Earth's ecosystems: it is estimated that 40% of all species in the world have their origins in freshwater ecosystems. These habitats are also home to many other organisms, such as aquatic plants, invertebrates and amphibians [3]. However, they are also one of the most threatened natural resources, and human activity plays an important role in this process. The degradation of freshwater ecosystems has important impacts on the biodiversity of species; it can change water quality and cause disease. The increased presence of disease threatens aquatic animal health and can change the resilience and biodiversity of a whole population.

Proliferative Kidney Disease (PKD) is one of the most important parasitic diseases of salmonid populations in Europe and North America [4]. It causes significant economic losses to fish farms and has a significant impact on populations of wild fish [5]. It causes a massive proliferation of the interstitial kidney tissue, anemia, ascites, exophthalmos and apathy [1]. The causative agent of PKD is the myxozoan Tetracapsuloides bryosalmonae, a multicellular endoparasite of freshwater bryozoans and salmonid fish. Its life cycle is based on the exploitation of vertebrate and invertebrate hosts. The most common hosts of T. bryosalmonae are bryozoans [6]. Bryozoan colonies can be mistaken for a mat of moss, and they are found in environments with scarce light, for example under stones or logs [7].
They represent a source of food for many species of fish and a microhabitat for small invertebrates. The dispersion of bryozoans is facilitated by statoblasts. Their colonies expand during summer and regress to inactive, asexual hibernating stages (statoblasts) in the fall. PKD is most prevalent in summer, when the water temperature reaches about 15 °C or more [4].

Box: Tetracapsuloides bryosalmonae life cycle

T. bryosalmonae is a myxozoan parasite of salmonid fish and the causative agent of PKD. PKD has been a known illness since the 1900s, but it was only discovered in 1985 that myxozoans were its cause [8] and in 1999 that freshwater bryozoans were the invertebrate hosts [5]. Myxozoa are a group of very small endoparasites of vertebrate and invertebrate animals in aquatic environments, usually ranging from 10 µm to 20 µm in size. There are around 1300 species of myxozoans, and most of them have a two-host life cycle involving, for example, fish, annelid worms or bryozoans. They are multicellular organisms, and studies indicate that they may have originated from cnidarians [9].

The life cycle of T. bryosalmonae is still under study, but the identification of bryozoans as intermediate hosts has brought important progress in understanding it. The first stage of the infection caused by T. bryosalmonae is a covert infection, in which single-cell stages are present in the bryozoans. Subsequently, multicellular sacs develop from the single-cell stages and multiply in the cavity of the bryozoan colonies (overt infection). The mature sacs contain spores, each with two internal amoeboid cells and four polar capsules [10]. The spores released by the bryozoans infect trout via filaments contained in the polar capsules, which enter the fish via the gills or skin. The infection in fish is caused by the amoeboid cells of the spores, which reach the vascular system. The bryozoans can cycle from covert to overt infection [1].

Figure 1.1: T. bryosalmonae life cycle [1]

T. bryosalmonae reaches the kidneys and spleen, causing inflammatory responses and damage to kidney tissues. Some of the spores reach the lumen of the kidney tubules and are released in the urine of the trout. In the case of brown and brook trout, these spores are infective to bryozoans [1]. Laboratory experiments show that infection by T. bryosalmonae alone can lead to the death of the fish, without necessarily a secondary infection [11].

Fredericella sultana (Blumenbach) is the bryozoan most commonly responsible for PKD. It is a freshwater bryozoan typical of lotic and lentic habitats [12]. It has been found in Europe, North America, Australia and New Zealand [13]. F. sultana prefers lakes that are rather low in altitude and have rich vegetation with plant species typical of eutrophic environments, gyttja sediments, stony "hard" shores, some wave action, a high calcium content and slightly colored water. On the other hand, F. sultana avoids lakes with pH below 5.4, ponds, ditches and mires, and dystrophic lakes surrounded by Sphagnum bogs and with dy sediments [7].

The objective of this study is to create a probability distribution model for the presence of F. sultana in all of Switzerland from presence-only records in rivers and by analyzing various environmental variables. The study covers a large area (about 40,000 km²), namely the whole Swiss territory. Figure 1.2 shows the map of Switzerland with its cantons and cities. In Switzerland, PKD is the most prevalent disease in fish [4].

Figure 1.2: Map of Switzerland

1.2 Methods

In this study, the distribution of F. sultana is studied in order to predict the presence of PKD. Unfortunately, no data on the presence of F. sultana are available in Switzerland, but it is known that F. sultana is the most common intermediate host species for T. bryosalmonae. The available data are records of the presence or absence of infected trout collected across the whole Swiss territory.
The analysis is carried out at the scale of drainage basins: the whole territory of Switzerland is divided into drainage basins and partial drainage basins (with a mean area of 2 km²). Given this spatial scale and the limited mobility of the trout (especially in spring, the season when the trout contract the disease), the mobility of the trout can be neglected. For this reason, the presence of F. sultana is inferred from the presence of infected trout.

The environmental variables selected for this study were both local and relative to the entire upstream catchment area. The local variables refer to the basin under consideration, while the upstream variables take into account all the basins that drain into it. These environmental variables describe climate, river slope, land cover, geology and pollutant release facilities. To represent the climate, altitude was selected because in previous studies [7] it was a discriminant variable for the presence of F. sultana and because it has a relevant impact on local climate conditions. The other variable selected to represent climate was the mean monthly temperature of 2013, because temperature has a strong impact on climate and on the habitat of many species [14]. To represent the river characteristics, river slope and flow accumulation were calculated. Geological and land cover characteristics may have an impact on water quality [15], [16]. For land cover, 6 classes were considered: rocks, agriculture, built up, forest, glacier and lakes/water. For geology, 6 further classes were studied: alluvial rocks, peat/loam, sedimentary rocks, sand/gravel, granite and gneiss. This classification was made according to the chemical and physical properties of the rocks that can influence water quality [16]. In order to consider human impact, land cover and pollutant release facilities were investigated.
All of these environmental variables were also calculated over all the upstream basins of the basin under consideration.

Chapter 2 Data

2.1 Subdivision of Switzerland in drainage basins

To analyze the distribution of F. sultana in the whole territory of Switzerland, the environmental variables were linked to the hydrology of Switzerland. To study the hydrology, the dataset bassisgeometrie (basic geometry) was used. The source of this data is OFEV (Office Fédéral de l'Environnement). The file contains the subdivision of Switzerland into drainage basins: it is a mosaic of polygons, corresponding to partial drainage basins, that covers the whole territory. The mean surface of these polygons is 2 km². The subdivision distinguishes partial and complete drainage basins. For each partial drainage basin there is a unique complete drainage basin that drains through the same outlet, and each complete drainage basin can be defined using a simple query. Figure 2.1 shows the basic geometry of this file, and Figure 2.2 shows a zoomed view of it.

Figure 2.1: Basic geometry

Among the attributes of this file, H1 and H2 are two auxiliary codes developed by OFEV to help the user represent the hierarchical structure of the basins. They correspond to the "right" and "left" values of a dataset, according to a model. With these two parameters it is possible to select the complete drainage basin for each partial drainage basin. They are not a unique key for identifying a polygon, because they are not fixed values: they change with each new version of the subdivision.
All the partial drainage basins $T_n$ that drain through the outlet of basin $T_i$, and that together make up the complete drainage basin of $T_i$, satisfy:

$H1_{T_i} < H1_{T_n} < H2_{T_i}$

where:
$H1_{T_i}$: value of H1 of the partial drainage basin $T_i$
$H2_{T_i}$: value of H2 of the partial drainage basin $T_i$
$H1_{T_n}$: value of H1 of each partial drainage basin $T_n$

Figure 2.2: Zoom of basic geometry

With a GIS it is possible to select all the basins that drain through the outlet of the selected basin (Figure 2.3) using the following query:

H1 ≥ P1 AND H1 < P2

where P1 and P2 are respectively the values of H1 and H2 of the selected basin. The query uses ≥ rather than >, so the selected basin itself is included in the result. It is also possible to do the opposite and find all the basins downstream from the selected basin (Figure 2.4), with the following query:

H1 ≤ P1 AND H2 ≥ P2

As before, P1 and P2 are the values of H1 and H2 of the selected polygon (Office fédéral de l'environnement OFEV, Subdivision de la Suisse en bassins versants (Bassins versants Suisse)).

Figure 2.3: Example of selection of all the upstream basins

Figure 2.4: Example of selection of all the downstream basins

2.1.1 PKD presence data

The PKD presence data were obtained from OFEV (Office Fédéral de l'Environnement). The dataset contains information about PKD and the names of the water bodies where PKD had been detected in fish. It is a point Shapefile with 504 records, of which 236 are positive for PKD. Table 2.1 shows the characteristics of this dataset, while Figure 2.5 represents the distribution of the PKD presence data. The attribute Befund records whether or not PKD was present in the fish analyzed.

Figure 2.5: Map presence PKD
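The two H1/H2 selection queries of Section 2.1 can be illustrated with a small sketch. This is not the GIS workflow used in the thesis: the five-basin network, its H1/H2 values and the function names `upstream` and `downstream` are hypothetical and serve only to show how the two interval conditions behave.

```python
from typing import Dict, List, Tuple

# Hypothetical toy network (not OFEV data): basin id -> (H1, H2).
# C drains into B, E drains into D, and B and D both drain into A.
BASINS: Dict[str, Tuple[int, int]] = {
    "A": (1, 10),
    "B": (2, 5),
    "C": (3, 4),
    "D": (6, 9),
    "E": (7, 8),
}


def upstream(basins: Dict[str, Tuple[int, int]], selected: str) -> List[str]:
    """Basins draining through the outlet of `selected` (itself included)."""
    p1, p2 = basins[selected]
    # Query from the text: H1 >= P1 AND H1 < P2
    return sorted(b for b, (h1, _h2) in basins.items() if p1 <= h1 < p2)


def downstream(basins: Dict[str, Tuple[int, int]], selected: str) -> List[str]:
    """Basins located downstream of `selected` (itself included)."""
    p1, p2 = basins[selected]
    # Query from the text: H1 <= P1 AND H2 >= P2
    return sorted(b for b, (h1, h2) in basins.items() if h1 <= p1 and h2 >= p2)


if __name__ == "__main__":
    print("upstream of B:", upstream(BASINS, "B"))
    print("downstream of E:", downstream(BASINS, "E"))
```

In this toy case, both queries return the selected basin together with its neighbours, which matches the use of ≥ and ≤ in the two conditions.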
Data type                     Shapefile Feature Class
Shapefile                     pkd_06.shp
Geometry Type                 Point
Coordinates have Z values     Yes
Coordinates have measures     Yes
Projected Coordinate System   CH1903_LV03
Projection                    Hotine_Oblique_Mercator_Azimuth_Center
False_Easting                 600000,00000000
False_Northing                200000,00000000
Scale_Factor                  1,00000000
Azimuth                       90,00000000
Longitude_Of_Center           7,43958333
Latitude_Of_Center            46,95240556
Linear Unit                   Meter
Geographic Coordinate System  GCS_CH1903
Datum                         D_CH1903
Prime Meridian                Greenwich
Angular Unit                  Degrees

Table 2.1: Data sources PKD

2.1.2 Altitude

The data used to represent altitude was the DTM (digital terrain model) of Switzerland. The DTM is a raster file with a resolution of 25 m; it is shown in Figure 2.6.

2.1.3 Temperature

The temperature data were obtained from MeteoSwiss. The variable used in this study is the temperature two meters above ground, averaged over calendar months, in degrees Celsius. The monthly means were calculated by averaging daily mean values, which were in turn calculated from automatic 10-minute measurements taken both day and night at about 80 measurement stations. To interpolate these measurements, MeteoSwiss used spatial interpolation. This method gives better results than ordinary linear temperature interpolation with a height relationship, and the product should better reproduce temperature variations such as those from inversions over the Swiss Plateau or winter-time cold pools. However, some physical effects are not modelled, and the interpolation accuracy varies in space. For example, the standard error ranges from 0.6 to 1.8 degrees in winter months and from 0.5 to 0.7 degrees in summer months (MeteoSwiss). The data were downloaded as raster files in TIFF format. Figure 2.7 shows the mean temperature of January.
Figure 2.6: Altitude

Figure 2.7: Mean temperature in January

2.1.4 Geology

The geology data were produced by Swisstopo as the structured vector dataset GK500_V1_1_FR, which corresponds to the printed geological and tectonic map of Switzerland at 1:500 000. The part of the data used was PY_surfaces_base, a shapefile that contains information about the geological formations, the tectonic units and the reservoir aquifers. This file distinguishes 46 different lithology classes. Table 2.2 shows an extract from the legend of the lithology data.

ID  LITHOLOGY
1   Roche meuble en general
2   Limon, argile, tourbe (tourbieres, marais)
3   Limon, argile, (loss, limon de pente, limon d'alteration)
4   Principalement blocs (eboulement)
5   Gravier limoneux, sable, limon, p.p. blocs (moraine)
6   Blocs, gravier grossier, sable (depot d'eboulis)
7   Gravier, sable, limon, p.p. blocs (cone de dejection)
8   Gravier et sable («Schotter»)
9   Gravier et sable, p.p. cimentes («Schotter»)

Table 2.2: Extract from the legend of the lithology

2.1.5 Land cover

The land cover data were downloaded through GeoVITE (Geodata Visualization and Interactive Training Environment). The aim of GeoVITE is to provide easy-to-use online access to the most important Swisstopo geodatasets. The data used was an extract of VECTOR200, a landscape model that represents the features of the landscape of Switzerland in vector format. The shapefile had 11 different land cover categories. Table 2.3 describes the data source of the file used for the land cover variable.

2.1.6 Swiss pollutant register

SwissPRTR is the Swiss Pollutant Release and Transfer Register.
It provides information about releases of pollutants and transfers of waste from facilities and diffuse sources. The data cover different types of facilities, pollutants and waste treatment. The facilities belong to the following fields of activity: animal and vegetable products from the food and beverage sector, chemical industries, energy industries, mineral industries, paper and wood production and processing, production and processing of metals, waste and waste water management, and other industrial activities.

Data Type: Shapefile Feature Class
Shapefile: Primary surface land cover\VEC200_LandCover
Geometry Type: Polygon
Coordinates have Z values: no
Coordinates have measures: no
Projected Coordinate System: CH1903_LV03
Projection: Hotine_Oblique_Mercator_Azimuth_Center
False_Easting: 600000,00000000
False_Northing: 200000,00000000
Scale_Factor: 1,00000000
Azimuth: 90,00000000
Longitude_Of_Center: 7,43958333
Latitude_Of_Center: 46,95240556
Linear Unit: Meter
Geographic Coordinate System: GCS_CH1903
Datum: D_CH1903
Prime Meridian: Greenwich
Angular Unit: Degree
Table 2.3: Data source land cover

2.2 Data Analysis
2.2.1 Drainage basins
All the data were processed in order to keep only the information needed for the study, to validate the spatial position and to compute some statistics. First, an adjacency matrix was created with MATLAB to build a network of the connections between the partial drainage basins. To do that, the query in Equation 2.1 was used:

$$H1 \le P1 \ \text{AND} \ H2 \ge P2 \qquad (2.1)$$

As explained before, H1 and H2 are attributes of the bassisgeometrie data, and P1 and P2 are the values of H1 and H2 for the selected basin. The matrix C has a one in position $C_{ij}$ if basin i is immediately upstream of basin j (where i is the row index and j is the column index). Figure 2.8 shows an example of a system of basins and its adjacency matrix.
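The adjacency matrix described above can be illustrated with a minimal sketch, written in Python rather than the MATLAB used in the thesis. The five-basin network and the `links` list are hypothetical; in the thesis the links come from the H1/H2 attributes through query (2.1), which is not reproduced here. The second function shows one possible way to walk the network and collect every basin that drains, directly or indirectly, into a given basin.

```python
def adjacency_matrix(n_basins, links):
    """Build C with C[i][j] = 1 if basin i is immediately upstream of basin j."""
    C = [[0] * n_basins for _ in range(n_basins)]
    for upstream, downstream in links:
        C[upstream][downstream] = 1
    return C


def upstream_basins(C, j):
    """Return the set of all basins that drain (directly or not) into basin j."""
    found, stack = set(), [j]
    while stack:
        k = stack.pop()
        # Column k of C lists the basins immediately upstream of basin k.
        for i, row in enumerate(C):
            if row[k] and i not in found:
                found.add(i)
                stack.append(i)
    return found


# Hypothetical network: basins 0 and 1 drain into 2; basins 2 and 3 drain into 4.
links = [(0, 2), (1, 2), (2, 4), (3, 4)]
C = adjacency_matrix(5, links)
print(sorted(upstream_basins(C, 4)))
```

A dense nested list is used only to keep the sketch short; with roughly 20,000 basins, as in the sparse matrix of the thesis, a sparse representation would be the natural choice.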
Figure 2.9 shows the structure of the sparse matrix: the points mark the cells of the matrix that differ from zero, which in this case are all equal to one.

Figure 2.8: Example of a system of basins and its adjacency matrix
Figure 2.9: Adjacency matrix (sparsity pattern, nz = 23440)

Next, the coordinates of each polygon centroid were calculated. The file bassisgeometrie also has the attribute SEE, which is equal to one if the polygon is a lake or backwater, two if the polygon drains into a polygon with value one, and zero in the normal case. Using this information, it was possible to identify the lakes. Figure 2.11 shows the sparse matrix plotted with the coordinates and the lake information.

To understand the structure of the matrix in Figure 2.9, the different squares in it were plotted with different colours; the result is shown in Figure 2.10. The squares in Figure 2.9 correspond to different parts of Switzerland, because the polygons were numbered according to their position in the Swiss territory. The connections drawn in blue in Figure 2.10 are some lakes; they appear in the lower-right corner, in the lower part and on the right side of the adjacency matrix in Figure 2.9.

Figure 2.10: Coloured adjacency matrix with coordinates (legend: first square, second square, third square, fourth square, extreme right values)

From the same original data, bassisgeometrie, a second matrix was created that records, for each polygon, all the polygons that drain into it. This matrix was used to calculate, for each polygon, the mean value of every environmental variable over all its upstream basins.

2.2.2 Altitude
With Matlab, the mean altitude was calculated for each polygon.
In order to do this, the shapefile bassisgeometrie was converted into a raster file with a cell size of 100 m. Each cell of this new file contains the ID of the polygon located at that position in the bassisgeometrie file. The first operation applied to the DTM was to fill the sinks, because anomalies are common in digital terrain models. To make the two raster files comparable, the DTM was resampled from a cell size of 25 m to 100 m with the resample function of ArcMap; Figure 2.13 shows an example of the two raster files. Using the clip function, the two raster files were resized to the same dimensions. They were then converted into ASCII format to be processed in Matlab, where the mean altitude of each polygon was computed (Figure 2.14).

Figure 2.11: Adjacency matrix with coordinates and lakes (legend: connections, lake)
Figure 2.12: Workflow altitude
Figure 2.13: Comparison between the raster of the polygons and the DTM

Subsequently, the mean elevation of the upstream basins was computed in Matlab for every basin (Figure 2.15), using the matrix with the information of all the drained basins, the mean altitude of each basin and the raster derived from bassisgeometrie.

2.2.3 River slope
The river slope was calculated in GIS from the DTM of Switzerland using the Spatial Analyst tool (Surface toolset, Slope). The resolution of each cell was 25 m. The flow accumulation was also calculated from the DTM. To do this, the DTM was filled using the Hydrology toolset in ArcMap 10.2 (Spatial Analyst toolbox, Hydrology toolset, Fill), and the flow direction was then computed with the Flow Direction tool of the same toolset. Using the flow direction as input, the flow accumulation was calculated with the Flow Accumulation tool of the Hydrology toolset.
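The per-polygon mean altitude of Section 2.2.2 amounts to averaging the DTM cells that share the same polygon ID. The following Python sketch illustrates the idea on two tiny hypothetical rasters; it is not the thesis's Matlab code. The function name `zonal_mean` and the NODATA marker -9999 are assumptions made for the example.

```python
from collections import defaultdict


def zonal_mean(id_raster, value_raster, nodata=-9999):
    """Mean of value_raster over the cells of each polygon ID in id_raster.

    Both rasters are lists of rows with identical shape (same extent and
    cell size, as obtained after the resample and clip steps).
    nodata is an assumed marker for empty cells.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for id_row, value_row in zip(id_raster, value_raster):
        for polygon_id, value in zip(id_row, value_row):
            if polygon_id == nodata or value == nodata:
                continue  # skip cells outside the study area
            sums[polygon_id] += value
            counts[polygon_id] += 1
    return {pid: sums[pid] / counts[pid] for pid in sums}


# Hypothetical 3 x 3 rasters: polygon IDs and altitudes in metres.
ids = [[1, 1, 2],
       [1, 2, 2],
       [3, 3, 2]]
dtm = [[400.0, 420.0, 900.0],
       [410.0, 950.0, 1000.0],
       [300.0, 320.0, 1150.0]]
mean_alt = zonal_mean(ids, dtm)
print(mean_alt)
```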
Matlab was used to calculate the mean slope of each polygon. The shapefile bassisgeometrie was rasterized with a resolution of 25 m; then the flow accumulation, the slope and the bassisgeometrie rasters were clipped to the same dimensions, so that the corresponding matrices shared the same indexes in Matlab. The flow accumulation was used to select the pixels from which the mean slope was computed: first, all the pixels of the bassisgeometrie raster with the same ID were selected. They were then sorted by decreasing flow accumulation, and the slope values of the first pixels were used to calculate the mean slope. Figure 2.16 shows the log of the flow accumulation against the log of the river slope.

Figure 2.14: Mean altitude
Figure 2.15: Mean upstream altitude
Figure 2.16: Plot of log flow accumulation vs. log river slope (legend: data, fitted line)

Figure 2.17, Figure 2.18, Figure 2.19 and Figure 2.20 show the maps of the mean flow accumulation, the log of the mean flow accumulation, the log of the mean river slope and the log of the mean drained river slope for each polygon, respectively.

2.2.4 Temperature
The mean temperature of each polygon was calculated with ArcMap 10.2 using the Spatial Analyst Tools, Zonal Statistics. The spatial coordinate system was first added to each of the 12 files containing the mean monthly temperature. The function Zonal Statistics as Table summarizes the values of a raster within the zones of another dataset and reports the results as a table. In this case, the feature zone data was bassisgeometrie, the zone field was the polygon ID, and the input value raster was the raster containing the mean monthly temperature.
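The pixel selection used for the mean river slope in Section 2.2.3 can be sketched as follows. The text does not state how many of the highest-accumulation pixels were kept, so the parameter `n_pixels` and its default value are purely illustrative assumptions, as are the function name and the small example rasters.

```python
def mean_river_slope(ids, flow_acc, slope, polygon_id, n_pixels=10):
    """Mean slope of the n_pixels cells with the highest flow accumulation.

    ids, flow_acc and slope are rasters (lists of rows) of identical shape.
    n_pixels is an assumed parameter; the thesis only says "the first pixels".
    Returns None if the polygon has no cells.
    """
    cells = [
        (acc, s)
        for id_row, acc_row, slope_row in zip(ids, flow_acc, slope)
        for pid, acc, s in zip(id_row, acc_row, slope_row)
        if pid == polygon_id
    ]
    if not cells:
        return None
    # Cells with the largest flow accumulation are taken as the river cells.
    cells.sort(key=lambda cell: cell[0], reverse=True)
    top = cells[:n_pixels]
    return sum(s for _, s in top) / len(top)


# Hypothetical 2 x 3 rasters.
ids = [[7, 7, 7],
       [7, 7, 8]]
flow_acc = [[1, 50, 3],
            [200, 2, 120]]
slope = [[30.0, 4.0, 25.0],
         [1.0, 28.0, 2.0]]
print(mean_river_slope(ids, flow_acc, slope, polygon_id=7, n_pixels=2))
```

Restricting the mean to high-accumulation cells keeps hillslope pixels, which are usually much steeper than the channel, out of the river-slope estimate.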
Figure 2.21 illustrates how Zonal Statistics works. Once the table with the mean monthly temperatures was created, it was exported and used in Matlab. Figure 2.22 shows the map of the mean monthly temperature of January 2013.

Figure 2.17: Mean flow accumulation
Figure 2.18: Log mean flow accumulation
Figure 2.19: Log mean river slope
Figure 2.20: Log mean drained river slope
Figure 2.21: How Zonal Statistics works [2]
Figure 2.22: Mean monthly temperature January 2013

2.2.5 Geology
For this study, the different types of geology were grouped into six classes: alluvial rocks, peat/loam, sedimentary rocks, sand/gravel, granite and gneiss. Figure 2.23 shows the map of the geology classes.

Figure 2.23: Geology
Figure 2.24: Zoom of geology before and after dissolve. (a) Geology before dissolve. (b) Geology after dissolve.
Figure 2.25: How Statistic Tabulate Intersection works [2]

As said before, the classification was based on information from the literature about the chemical and physical rock properties that can influence water quality [16], and on the presence and importance of each rock type in Switzerland. The chemical constituents usually associated with water quality are Ca, Mg and SO4, because they form the principal solutes derived from rock in most stream systems.
Some physical attributes are also important: rock strength (uniaxial compressive strength) and rock hydraulic conductivity [16]. To compute the percentage of each geology class in each polygon, a dissolve was performed on the file PY_surfaces_base. The dissolve aggregates the features of this file according to the classification performed before. Figure 2.24a and Figure 2.24b show a zoom on the geology map before and after the dissolve. Once the dissolve was performed, the percentage of each geology class in each polygon was computed in ArcMap 10.2 with the Analysis Tools, Statistics, Tabulate Intersection. Figure 2.25 illustrates how Tabulate Intersection works.

Geology class  Name
0              Other
1              Alluvial rocks
2              Peat and loam
3              Sedimentary rocks
4              Sand and gravel
5              Granite
6              Gneiss
Table 2.4: Geology legend

In the studied case, the result is a table like the one in Figure 2.26, with one row per geology class for each polygon. Table 2.4 gives the legend of the geology map. The table in Figure 2.26 was then exported to Matlab in order to group, for each polygon, the percentages of the different geology classes. Next, maps of the percentage of each geology class per polygon were created with Matlab and, as was done for the altitude, the percentages of the classes over the upstream basins were computed. Figure 2.27 shows the maps of the percentage of each geology class for each basin. According to Figure 2.27a, the percentage of alluvial rocks is higher near the rivers; for example, a higher percentage is clearly visible along the Rhone river. Figure 2.27b shows that the percentage of peat and loam is higher in the upper (northern) part of Switzerland.
This area lies at a lower altitude than the lower (southern) part of Switzerland, according to Figure 2.14. As shown in Figure 2.27e and Figure 2.27f, the percentages of granite and gneiss are higher in the lower part of Switzerland, which is the area of higher altitude due to the presence of the Alps.

Figure 2.26: Extract of the geology table
Figure 2.27: Percentage of different classes of geology. (a) Geology class 1. (b) Geology class 2. (c) Geology class 3. (d) Geology class 4. (e) Geology class 5. (f) Geology class 6. Colour scale: 0 to 100.

2.2.6 Land cover
The land cover was processed in a similar way to the geology. Starting from the Geovite data, the land cover was divided into 6 classes: rocks, forest, built-up, agriculture, glacier and water/lakes. To calculate the percentage of land cover in each class, an intersect between the land cover file and bassisgeometrie was performed. The percentages of the classes in each polygon were then computed with Matlab. Figure 2.28 is a map of all the land cover classes together, and Figure 2.29 contains one map for each class. The maps in Figure 2.29 show the percentage of each land cover class for each basin.
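For both geology and land cover, the per-polygon percentages follow from the intersected areas: the area of each class inside a polygon is divided by the polygon's total intersected area. The Python sketch below illustrates this step under stated assumptions; the input rows (polygon ID, class ID, area) and the function name `class_percentages` are hypothetical and only mimic the kind of table produced by the intersection tools.

```python
from collections import defaultdict


def class_percentages(intersections, n_classes):
    """Percentage of each class (0..n_classes) within each polygon.

    intersections: iterable of (polygon_id, class_id, area) rows, one row per
    intersected piece; several rows may share the same polygon and class.
    """
    area = defaultdict(lambda: [0.0] * (n_classes + 1))
    for polygon_id, class_id, piece_area in intersections:
        area[polygon_id][class_id] += piece_area
    percentages = {}
    for polygon_id, by_class in area.items():
        total = sum(by_class)
        percentages[polygon_id] = [100.0 * a / total for a in by_class]
    return percentages


# Hypothetical rows: areas in km^2, class IDs as in the geology legend (0 to 6).
rows = [
    (101, 1, 2.0),
    (101, 3, 6.0),
    (101, 1, 2.0),
    (102, 5, 4.0),
]
pct = class_percentages(rows, n_classes=6)
print(pct[101])
```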
As Figure 2.29a shows, the percentage of rocks is higher in the lower (southern) part of Switzerland, the area where the Alps are. This is consistent with the map of altitude (Figure 2.14) and with the maps of geology classes 5 and 6 (Figure 2.27e and Figure 2.27f). Figure 2.29a also shows some areas in the lower part of Switzerland where the percentage of rocks is very low although the surrounding areas have a very high percentage of rocks. This is explained by the presence of glaciers, as shown in the map of land cover class 5 (Figure 2.29e). Figure 2.29c maps the percentage of built-up area; it correctly shows the cities of Switzerland as the areas with the highest built-up percentage. Similarly, the lakes are correctly represented in Figure 2.29f.

Figure 2.28: Land cover
Figure 2.29: Percentage of different classes of land cover. (a) Land cover class 1. (b) Land cover class 2. (c) Land cover class 3. (d) Land cover class 4. (e) Land cover class 5. (f) Land cover class 6. Colour scale: 0 to 100.

2.2.7 Swiss pollutant register
The Swiss pollutant register data contain 1427 different facilities in total, and in a first analysis they were all considered.
Then only the waste and waste water management activities were considered. The original data were in CSV format. They were first processed with Excel in order to keep only the useful information and to select the waste and waste water management facilities. Subsequently, the two resulting datasets (the pollutant releases file and the waste and waste water file) were imported into ArcMap 10.2 to create two new layers. To do that, the two files were first converted into tables with the Conversion Tools, Excel to Table. Next, two point layers were created with the Data Management Tools (Layers and Table Views, Make XY Event Layer). The Analysis Tools (Overlay, Spatial Join) were then used to join the attributes of the point layers to the file basisgeometrie according to their spatial relationship. The two resulting layers were processed in Matlab and added to the table of all the variables previously computed. Figure 2.30 shows a map of all the facilities, and Figure 2.31 shows only the waste and waste water management facilities.

Figure 2.30: Pollutant releases
Figure 2.31: Waste and waste water management facilities

With Matlab, three new variables were computed for each of the two new layers. The first is the presence or absence of a pollutant release facility in the basin, a binary variable equal to 1 for presence and 0 for absence. The second, computed with the matrix of all the drained basins, is the number of pollutant release facilities in all the basins that drain into the considered basin. The third is the number of pollutant release facilities in the considered basin itself. The same three variables were calculated for the waste and waste water management facilities.
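The three facility variables can be illustrated with a short Python sketch. The inputs are hypothetical: `n_local` maps each basin to the number of facilities located in it, and `upstream` maps each basin to the set of basins that drain into it. Whether the upstream count in the thesis also includes the considered basin itself is not stated; the sketch assumes it does not.

```python
def facility_variables(n_local, upstream):
    """Presence flag, upstream facility count and local facility count per basin.

    n_local:  dict basin -> number of facilities inside that basin
    upstream: dict basin -> set of basins draining into it
              (assumed here to exclude the basin itself)
    """
    variables = {}
    for basin, drained in upstream.items():
        local = n_local.get(basin, 0)
        variables[basin] = {
            "presence": 1 if local > 0 else 0,
            "n_upstream": sum(n_local.get(b, 0) for b in drained),
            "n_local": local,
        }
    return variables


# Hypothetical network of five basins with three facilities in total.
n_local = {0: 2, 2: 1}
upstream = {0: set(), 1: set(), 2: {0, 1}, 3: set(), 4: {0, 1, 2, 3}}
poll = facility_variables(n_local, upstream)
print(poll[4])
```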
Chapter 3 Maxent

3.1 Introduction to Maxent
MaxEnt was created by Steven Phillips, Miro Dudík and Robert Schapire with support from AT&T Labs-Research, Princeton University and the Center for Biodiversity and Conservation of the American Museum of Natural History. MaxEnt is a Species Distribution Model (SDM). SDMs are numerical tools that combine observations of a species' presence with environmental variables that could affect the suitability of the environment for that species [17]. They are used to predict the distribution of species across a landscape. SDMs are applied in terrestrial, freshwater and marine ecosystems [18], and they are used to address questions concerning ecology, biogeography and conservation of species [19].

There are different methods to apply these models, but the biggest difference among them is the kind of species data that they use: either both presence and absence occurrence data, or presence data only. Presence data are the most common and can be found, for example, in natural history museums or in herbaria. Absence data are usually quite rare, and even when they are available they are of uncertain value in some cases [20]. Presence-only models, on the other hand, are considered very valuable, and research based on museum data is widely used [21], even if such data have problems of their own. For example, a species may not have been detected in an area because some factor caused its local extinction. This could create misleading patterns in the presence data, because the missing detection suggests that the area has unsuitable environmental conditions for that species [22].
3.2 Explanation of Maxent
In order to explain how MaxEnt works from a statistical point of view, let us start with Bayes' rule:

$$\Pr(y = 1 \mid z) = \frac{f_1(z)\,\Pr(y = 1)}{f(z)}$$

Consider a set of locations where the species has been observed in a landscape of interest, L. Let y = 1 denote the presence of the species and y = 0 its absence. The vector of environmental covariates, which represents the environmental conditions, is called z. $f(z)$ is the probability density of the covariates across L, $f_1(z)$ is the probability density of the covariates where the species is present, and $f_0(z)$ is the probability density of the covariates where the species is absent. The quantity to estimate is the probability of presence of the species conditioned on the environment, $\Pr(y = 1 \mid z)$. $f_1(z)$ can be modelled from presence-only data, but it cannot be taken as an approximation of the probability of presence. $f(z)$ can be modelled from presence/background data. To calculate $\Pr(y = 1 \mid z)$ with Bayes' rule, we also need $\Pr(y = 1)$, the prevalence of the species in the landscape [22]. Formally, this quantity cannot be estimated exactly from presence-only data [23]. This is an important limitation of presence-only data. However, absence data can also suffer from detection problems [24], so presence-absence data could also lead to a poor estimate of prevalence. Another important limitation is that sample selection bias has a greater impact on species distribution models based on presence-only data than on models that use presence-absence data [25]. The data held in herbaria or museums are typically occurrence records collected by individuals, and they can therefore be concentrated in more accessible locations, such as near roads or rivers [26].
These data could also be autocorrelated, for example when they are collected from nearby locations within a small area. Additionally, the sampling intensity and the collection methods may differ across the study area [27]. The bias problem can be reduced by using background data that carry the same bias as the presence data [25].

In SDMs, the environmental factors relevant to habitat suitability are the so-called “independent variables” or “covariates” of the statistical literature. In this Master's thesis they are temperature, land cover, geology, river slope, altitude and the presence of pollutant release facilities. A species distribution usually has a complex response to these factors, so nonlinear functions are recommended for this kind of problem [28]. The MaxEnt fitted function is defined over 6 feature classes: linear, product, quadratic, hinge, threshold and categorical. A linear feature is the variable itself; it constrains the mean of the variable under the estimated distribution to be close to its mean at the sample locations. A quadratic feature is the square of the variable and imposes a constraint on the variance: the variance of the variable under the estimated distribution must be close to its variance in the sample. Product features are the products of all pairwise combinations of covariates, which makes it possible to fit simple interactions between variables. Threshold features introduce a step in the fitted function, so that the response can differ below and above the threshold. Hinge features are similar to threshold features, but they allow a change in the gradient of the response. Hinge features are usually not used together with linear features because the two are very similar; a linear feature can be created from a hinge feature.
A categorical feature is a binary indicator that shows whether a categorical variable belongs to a given class or not [29]. In this study, even though some variables describe classes (geology or land cover classes), categorical features are not used, because the information available is the percentage of each class in each polygon. MaxEnt therefore fits the model on features that are transformations of the environmental variables, which makes it possible to model complex relationships between covariates. By default, MaxEnt uses every feature type for which the conditions are satisfied: all feature types are used if there are at least 80 training samples. With 15 to 79 samples MaxEnt uses linear, quadratic and hinge features, with 10 to 14 samples it uses linear and quadratic features, and with fewer than 10 samples it uses only linear features [30].

As mentioned previously, the landscape of interest is called L, and L1 is the subset of L where the species is present. The distribution of the covariates in the landscape is described by a finite number of sample points called the background sample. These can be given as a grid of pixels (in ESRI ASCII grid format or Diva-GIS format) or in SWD (samples-with-data) format, as in the case studied here. By default, MaxEnt uses 10,000 random samples from the background locations, but this number can be modified. Referring again to Bayes' rule, MaxEnt uses the presence sample points and the background samples to estimate the ratio $f_1(z)/f(z)$. It estimates $f_1(z)$ by choosing the estimate that is closest to $f(z)$, the null model. Indeed, in a model without occurrence data the species is expected to have a random distribution, because there is no reason to expect it to prefer particular environments over others.
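The default rule for feature classes quoted above, together with two of the feature transformations, can be sketched in Python as follows. The function names are hypothetical, and the threshold and hinge functions are simplified, unscaled illustrations of the idea (a step and a change of gradient at a knot), not MaxEnt's internal implementation.

```python
def default_feature_classes(n_samples):
    """Feature classes enabled by default for a given number of training samples,
    following the sample-size rule stated in the text."""
    if n_samples >= 80:
        return ["linear", "quadratic", "product", "threshold", "hinge", "categorical"]
    if n_samples >= 15:
        return ["linear", "quadratic", "hinge"]
    if n_samples >= 10:
        return ["linear", "quadratic"]
    return ["linear"]


def threshold_feature(z, threshold):
    """Step function: 0 below or at the threshold, 1 above it."""
    return 1.0 if z > threshold else 0.0


def hinge_feature(z, knot):
    """Illustrative unscaled hinge: flat below the knot, linear above it."""
    return max(0.0, z - knot)


# With the 236 presence records of this study, every feature class is available.
print(default_feature_classes(236))
```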
The distance of $f_1(z)$ from $f(z)$ can be taken to be the relative entropy of $f_1(z)$ with respect to $f(z)$. Minimizing the relative entropy can also be seen as maximizing the entropy of the “raw” distribution. The “raw” distribution is $\pi(x) = \Pr(x \mid y = 1)$, the probability distribution over the locations x. It expresses the probability of finding the species in pixel x, given that the species is present [30].

• Box entropy
To find the best approximation of an unknown probability distribution, the maximum-entropy principle selects the distribution that respects all the constraints imposed on it and, among the distributions satisfying them, has the maximum entropy, that is, the most unconstrained one [31]. The unknown probability distribution is π, defined over a set X of sites in the study area. Each element of X can be considered a point. Each point has a non-negative probability, and all these probabilities sum to one. The approximation of the probability distribution is called $\hat{\pi}$ [32]. The entropy of $\hat{\pi}$ is:

$$H(\hat{\pi}) = -\sum_{x \in X} \hat{\pi}(x) \ln \hat{\pi}(x)$$

The entropy is non-negative. Entropy can be described as a “measure of how much ‘choice’ is involved in the selection of an event” [33]. A distribution with higher entropy involves more choices and is less constrained. The maximum entropy principle therefore states that no baseless constraints should be applied to $\hat{\pi}$ [32].

In MaxEnt, some constraints are imposed so that the solution reflects the information in the presence records. For example, if one of the analyzed covariates is the temperature in January, the mean temperature in January for the estimate of $f_1(z)$ will be close to the mean temperature in January at the locations where the species has been found [22]. Coefficients are defined to weight the contribution of each feature. The vector of these coefficients is called β, while the vector of the features is called h(z) [22].
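The entropy formula in the box can be checked numerically with a small Python sketch. The two four-site distributions are hypothetical; they only illustrate that, among distributions over the same sites, the uniform one has the highest entropy, while a distribution concentrated on a few sites has a lower one.

```python
import math


def entropy(p):
    """Shannon entropy H = -sum p(x) ln p(x); terms with p(x) = 0 contribute 0."""
    return -sum(q * math.log(q) for q in p if q > 0)


uniform = [0.25, 0.25, 0.25, 0.25]  # no site preferred
peaked = [0.70, 0.10, 0.10, 0.10]   # most probability on one site

print(entropy(uniform), entropy(peaked))
```

For n equally likely sites the entropy equals ln n, which is the largest value attainable over n sites.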
In the solution, MaxEnt looks for the coefficients β that satisfy the constraints without overfitting the data, since an overfitted model would have limited power of generalization. To avoid this problem, MaxEnt allows an error bound to be set, that is, a maximum allowed deviation from the sample feature means. The features are first rescaled to the range 0 to 1, and an error bound ($\lambda_j$) is then computed for each feature. These error bounds can be estimated, for example, with cross-validation. However, to simplify model fitting, there are default settings that were tuned and validated on different datasets [30]. The default parameters can be changed by the user if necessary.

• Box Cross-validation
Cross-validation is a resampling method used to train and test the generated models. It is also called k-fold cross-validation because the data set is divided into k (usually 5 or 10) mutually exclusive subsets of about the same size, called “folds”. To compute the model performance, each subset is removed in turn, so that one subset is excluded and k-1 are retained. The model is fitted on the k-1 retained subsets and used to predict the omitted one [34]. This process is repeated k times, so that each subset is used exactly once as validation data. Each fold is also used k-1 times to fit the model, in different combinations with the other folds. The k results can be averaged to obtain a more accurate estimate.

The error bounds $\lambda_j$ regulate how focused or closely fitted the output distribution will be. The closeness of fit can be modified by changing the regularization multiplier (1.0 by default). With a smaller value, the output distribution is more localized and fits the presence records more closely, but it may be overfitted.
Conversely, a larger regularization will yield a more spread-out prediction [29].

• Box Regularization
MaxEnt can overfit the training data; the regularization parameter exists to avoid this. Regularization affects how focused the output distribution is: it allows the distribution to be smoothed or to be brought closer to the sample data. Regularization is a frequently used approach to model selection. It relaxes the constraints on the variables in order to trade off model fit against model complexity. As said before, the regularization parameter is $\lambda_j$:

$$\lambda_j = \lambda \sqrt{\frac{s^2[h_j]}{m}}$$

where $s^2[h_j]$ is the variance of feature $h_j$ over the m presence sites, and λ is the tuning parameter for that feature class [22]. Regularization forces the model to concentrate on the most important features. The resulting models are less affected by overfitting the training data, because they have fewer parameters. To find the MaxEnt probability distribution, the algorithm starts with all the $\lambda_j$ at 0 (uniform distribution) and then repeatedly changes their values [32].

MaxEnt has three output formats: raw, cumulative and logistic. Logistic is the default output and the easiest to understand. It gives, for each pixel, a probability of presence between 0 and 1. The values are rescaled in a nonlinear way to make them easier to interpret. The probability of presence depends on the details of the sampling design, for example the size of the plot. It also depends on the arbitrary value assigned to the probability of presence at sites with “typical” conditions for the species. This value is usually set to 0.5, though it can be changed with the “default prevalence” parameter. The raw output is the probability of presence (with range 0-1). The raw output values are usually very small, because they sum to 1 over all the cells used during training.
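The error bound of the regularization box can be computed directly from the feature values at the presence sites. The Python sketch below is only an illustration of the formula: the function name `error_bound` and the four feature values are hypothetical, and the use of the sample variance (division by m-1) is an assumption, since the text does not say which variance estimator is used.

```python
import math


def error_bound(feature_values, lam=1.0):
    """lambda_j = lam * sqrt(s^2[h_j] / m) for one feature.

    feature_values: values of feature h_j at the m presence sites
                    (assumed already rescaled to the range 0 to 1).
    lam:            tuning parameter of the feature class.
    The sample variance (m - 1 in the denominator) is an assumption.
    """
    m = len(feature_values)
    mean = sum(feature_values) / m
    s2 = sum((v - mean) ** 2 for v in feature_values) / (m - 1)
    return lam * math.sqrt(s2 / m)


values = [0.2, 0.4, 0.6, 0.8]  # hypothetical rescaled feature values
print(error_bound(values), error_bound(values, lam=2.0))
```

The bound shrinks as the number of presence sites grows, so with more data the feature means of the fitted distribution are held closer to the sample means.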
In the cumulative output format, the value of each grid cell is the sum of the probabilities of all grid cells with a probability lower than or equal to that of the current grid cell, multiplied by 100. For example, a value of 45 means that 45% of presences would be predicted as absences if this value were used as a threshold to create a presence-absence surface. The range of this output is 0 to 100 [30].

Chapter 4 Results

4.1 Data used in Maxent
As discussed previously, the species data were obtained from the Office fédéral de l'environnement (OFEV). They contain information about PKD and the names of the water bodies where PKD was detected in fish. There are 504 records in total, of which 236 describe a positive presence of PKD. Only the records describing a positive presence of PKD were considered in this study. With this number of samples, MaxEnt can use all the feature types, because the constraints on the number of samples are satisfied. 46 environmental variables were used; the process to obtain them was described previously. The environmental layers are in SWD (samples-with-data) format. Figure 4.1 shows the structure of the environmental variables file. The first column is ignored; in this study it was set to 1. The second and third columns contain the geographic X and Y coordinates, respectively, of the centre of each polygon. The remaining columns contain the environmental variables. First comes the mean altitude of each polygon, followed by the 6 geology classes (each cell contains the percentage of that geology class in the considered polygon) and the 6 land cover classes (again, each cell contains the percentage of that land cover class in the considered polygon). Then, columns 17 to 28 contain the mean monthly temperatures for each polygon (from January to
Then, columns 17 to 28 contain the mean monthly temperatures for each polygon (from January to December). Column 29 contains the mean river slope for each polygon. From column 30 onwards there are the altitude, the different classes of geology and land cover, and the mean river slope, each calculated as the mean over all the polygons that drain into the considered polygon. All of these variables were treated as continuous variables. Finally, there are 6 variables concerning the pollutant release facilities. Table 4.1 lists all the environmental variables used in this study.

Figure 4.1: Extract of the csv file used for Maxent

4.2 Correlation

To detect linear correlation among the variables, the matrix R was calculated with Matlab. R is the matrix of correlation coefficients calculated from an input matrix X, in our case the matrix with all the local variables. Each coefficient ranges between -1 and 1, where 1 means perfect positive correlation, 0 means no correlation, and -1 means perfect negative correlation. The matrix R is related to the covariance matrix C by:

\[ R(i,j) = \frac{C(i,j)}{\sqrt{C(i,i) \cdot C(j,j)}} \]

Each coefficient of this matrix is therefore the covariance of the two variables divided by the product of their standard deviations. The matrix R is symmetric and, as we can see in Figure 4.2, all the entries on the diagonal are 1. In the matrices the variables are ordered as in the previous table. For this study the correlation matrix was computed among the local variables and among the upstream variables; we can see them in Figure 4.2a and Figure 4.2b respectively.

4.3 The runs

Several runs were performed with MaxEnt; this section explains the different runs and the parameters used. The html output of MaxEnt contains some information about the model performance, the importance of each variable and its influence.
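For the correlation matrix of Section 4.2, the relation between R and C can be illustrated with a minimal sketch. The thesis computed R with Matlab; the pure-Python version below is only an illustration of the same formula, and the function names and the three small columns of values are hypothetical, not thesis data.

```python
import math


def covariance_matrix(columns):
    """Sample covariance matrix C for a list of equally long variable columns."""
    n = len(columns[0])
    k = len(columns)
    means = [sum(col) / n for col in columns]
    cov = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            cov[i][j] = sum(
                (columns[i][t] - means[i]) * (columns[j][t] - means[j])
                for t in range(n)
            ) / (n - 1)
    return cov


def correlation_matrix(columns):
    """R(i, j) = C(i, j) / sqrt(C(i, i) * C(j, j)).

    Assumes no column is constant (a zero variance would divide by zero).
    """
    cov = covariance_matrix(columns)
    k = len(cov)
    return [
        [cov[i][j] / math.sqrt(cov[i][i] * cov[j][j]) for j in range(k)]
        for i in range(k)
    ]


# Hypothetical values for four polygons (illustration only, not thesis data).
altitude = [400.0, 900.0, 1500.0, 2100.0]
temp_jan = [1.0, -1.5, -4.0, -7.5]
slope = [0.5, 2.0, 1.0, 4.0]

R = correlation_matrix([altitude, temp_jan, slope])
for row in R:
    print(["%.3f" % value for value in row])
```

In this toy example the altitude and January temperature columns are strongly negatively correlated, which is the same qualitative pattern the thesis reports for the local variables.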
Some information about the chosen output, and links to the data files used, can also be found there.

4.3.1 Run 1

For the first run, the first 39 variables were used and all the default parameters were kept.

4.3.1.1 Analysis of omission/commission

The html output of MaxEnt begins with 2 graphs and a table that evaluate model performance and bias. The first output is the omission and predicted area for the species, which in this study is called 1 for simplicity. The omission rate, in Figure 4.3, is the fraction of the test localities that fall into pixels not predicted as suitable for the species, and the predicted area is the fraction of all the pixels that are predicted as suitable for the species (Phillips et al. [32]). This plot shows the relationship between predicted values of occurrence probability (in this case from the training samples) and the proportion of occurrences selected. In other words, it shows how training omission and predicted area vary when the cumulative threshold changes. The predicted omission rate is a straight line, by definition of the cumulative output format.

Table 4.1: Environmental variables for the distribution of Fredericella sultana

Figure 4.2: Correlation matrices for the variables (color scale from -1 to 1)
(a) Correlation matrix for the local variables: altitu, geo_01 to geo_06, land01 to land06, temp01 to temp12, Rslope
(b) Correlation matrix for the upstream variables: D_alti, D_geo1 to D_geo6, Dland1 to Dland6, Dslope, P_poll, ND_pol, N_poll, Pwaste, NDwast, Nwaste

Figure 4.3: Omission and predicted area for F. sultana - run 1

The second plot is a receiver operating characteristic (ROC) curve, shown in Figure 4.4. It illustrates how well the model performs in predicting occurrences compared to a random selection of points. An important advantage of ROC analysis is that the area under the ROC curve provides a single measure of model performance that is independent of the threshold. The random model is represented by a straight line, the bisector. The perfect model, on the other hand, would appear as a right angle with its corner at the top left of the graphic. A good curve maximizes sensitivity for low values of the false-positive fraction. The higher the area under the curve (AUC), the better the model is performing. The AUC ranges between 0 and 1. An AUC close to 0.5 indicates that the model is not much better than a random model (whose AUC is 0.5), while 1 is the AUC of the perfect model. Values below 0.5 are also possible; in that case the model is performing worse than the random model.

Figure 4.4: ROC for Fredericella sultana - run 1

After that, the html file contains the thresholds table. This table provides some common thresholds and the corresponding omission rates, to represent “suitability” vs. “non-suitability”. Here the sensitivity, also called the true positive rate, measures the proportion of positives that are correctly identified as positive, and the specificity, also called the true negative rate, measures the proportion of negatives that are correctly identified as negative. In Maxent the thresholds are used if we want to display the output in a more discrete way, as suitable and unsuitable habitat. To do that it is necessary to choose which threshold value is the best one; the threshold represents the minimum probability for a suitable habitat.
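The quantities used in this subsection (sensitivity, specificity and AUC) can be made concrete with a small sketch. It assumes a set of scored sites with known labels, which is a simplification: MaxEnt works with presence-only data and evaluates against background points, and the function names, scores and labels below are hypothetical. The AUC is computed through its rank interpretation, the probability that a randomly chosen positive site scores higher than a randomly chosen negative one.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity) when sites with score >= threshold
    are predicted as suitable. labels: 1 = presence, 0 = absence/background."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(scores, labels):
    """Area under the ROC curve as the fraction of (positive, negative) pairs
    in which the positive site has the higher score; ties count as one half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities and labels (illustration only).
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

print(sensitivity_specificity(scores, labels, 0.5))  # (0.75, 0.75)
print(auc(scores, labels))                           # 0.9375
```

Note that the AUC does not depend on the threshold passed to the first function, which is the threshold independence mentioned above.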
There is no set rule for choosing the best threshold value; deciding on it requires an understanding of the species considered and of the objective of the map (Young-2011). In this case study no threshold was chosen, because we are interested in creating a probability map of the presence of Fredericella sultana and not just a suitable-unsuitable habitat map.

4.3.1.2 Response curves

It is usually interesting to understand how each variable influences the prediction of MaxEnt, which variables have the greatest influence on the model, and how these variables influence the presence of the species. To show how the prediction depends on the variables, the output of MaxEnt provides some graphics, Figure 4.5. The graphics show how each environmental variable affects the prediction. In Figure 4.5a we can see how the prediction changes when each environmental variable changes while all the other variables are kept at their average sample value. The x-axis of these graphics shows the value of the analyzed variable and the y-axis shows the predicted probability of suitable conditions when all the other variables are set to their average values. The second set of response curves, in Figure 4.5b, represents the different models created using only one variable each. There are as many models as there are variables under consideration.

4.3.1.3 Analysis of variable contributions

Two methods can be used to understand the importance of each variable. First, MaxEnt gives the percent contribution of each variable to the final model. To do this, during training it records how much the overall model gain improves each time a small change is made to the coefficient of a feature, and attributes that improvement to the variable the feature depends on. At the end of the run, all of these small changes are taken into account to compute the share of each variable in the total contribution. However, here too, the result can be difficult to interpret if the variables are strongly correlated.
In Figure 4.6 we can see the output of this method of determining the importance of each variable. The variable with the biggest contribution (25.1%) is variable number 39, which represents the percentage of class 3 land cover (built up) in the polygons that drain into the considered polygon. Variable 29, the mean river slope, has the next largest contribution with 12.8%. After that comes variable number 4, the mean altitude, with a percent contribution of 8.7%. A similar percent contribution (8.5%) comes from variable 26, the mean temperature in October.

Another method of computing the importance of each variable is the jackknife approach. In this method MaxEnt excludes one variable at a time in order to collect information about the importance of each variable in explaining the species distribution and about the uniqueness of the information that the variable provides. To do that, MaxEnt first creates a model excluding each variable in turn, and then creates a model using each variable in isolation. Figure 4.7 shows the results of the jackknife analysis. The dark blue bars represent how well the model performs using only that variable. The red bar represents the model built with all the variables. The light blue bars represent the model built without the considered variable.

4.3.2 Run 2

For this run the inputs were the same as for Run 1, but the default value of the “random test percentage” setting was changed. This setting tells the program to randomly set aside 25% of the sample records for testing. Doing so, MaxEnt is able to perform some basic statistical analysis. The main differences between this run and the previous one are described here. The main results are the same, but there is more statistical analysis.
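The effect of the “random test percentage” setting can be sketched as a simple random hold-out split. This is only an illustration of the idea: the function name, the fixed seed and the rounding rule are assumptions, and MaxEnt's internal sampling may differ in detail. The 236 integers stand in for the 236 positive PKD records used in this study.

```python
import random


def split_presence_records(records, test_percentage=25, seed=0):
    """Randomly set aside test_percentage percent of the records for testing.

    Returns (train_records, test_records). The seed is fixed here only so
    that the illustration is reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_percentage / 100)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in identifiers for the 236 positive records (not the real data).
records = list(range(236))
train_records, test_records = split_presence_records(records)

print(len(train_records), len(test_records))  # 177 59
```

The model is then fitted on the training records only, and the held-out records provide the test omission and test AUC described in the following subsections.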
4.3.2.1 Analysis of omission/commission

The first difference is in the first plot, Figure 4.8, where there is one more line: the omission on test samples. It can sometimes happen that the test omission line lies below the predicted omission line, although this is not the case in this study. This phenomenon could be due to dependence between the test and the training data.

Figure 4.5: Response curves - run 1
(a) Marginal response curves - run 1
(b) Single variable response curves - run 1

Figure 4.6: Importance of different variables - run 1

Figure 4.7: Jackknife - run 1

Figure 4.8: Omission and predicted area for Fredericella sultana - run 2

The second difference is in the ROC plot. Here too, in Figure 4.9, there is one more line, that of the test data. Before, the lines of the training and test data overlapped. Here the data were divided into 2 parts, one for training and one for testing. Usually the training line has a higher AUC than the test line. The training line (the red one) represents the “fit” of the model to the training data. The test line (the blue one), on the other hand, shows the fit of the model to the testing data and is used to test the real predictive power of the model.

The next difference is in the table of thresholds and omission rates, Figure 4.10. The table now also contains the test omission rate and the P-value, because some of the sample records were used for training and others for testing. The P-value is the result of a hypothesis test: it is the probability of obtaining a result equal to or “more extreme” than the observed one, assuming the null hypothesis is true. The P-value is compared with a significance level, usually 5% or 1%.
If the P-value is equal to or smaller than the significance level, the null hypothesis is rejected in favour of the alternative hypothesis.

4.3.2.2 Analysis of variable contributions

Another difference is in the analysis of the variable contributions: the most important variables are now a bit different from before, as we can see in Figure 4.11. The most important one is still the percentage of land cover class 3 in the polygons that drain into the considered polygon, but with a contribution of 28%. The second and third most important variables are switched: the mean altitude has a contribution of 15.5% and the mean river slope a contribution of 10.2% (in Run 1 they were 8.7% and 12.8% respectively). Then we have variable 30, the mean altitude of the polygons that drain into the considered polygon, with a percent contribution of 9.7% (before it was 6.5%). Variable 26, the mean temperature in October, loses importance, dropping from 8.5% to 1.5%.

Figure 4.9: ROC - run 2

Figure 4.10: Thresholds table - run 2

The jackknife of regularized training gain for the species is also a bit different, but the environmental variable with the highest gain when used in isolation, and the one that decreases the gain the most when omitted, is still number 39, as we can see in Figure 4.12. The biggest difference in the jackknife analysis is the presence of 2 more plots: one uses test gain (Figure 4.13b) and the other uses AUC in place of training gain (Figure 4.13a). Comparing these 3 plots (Figure 4.13 and Figure 4.12) can be very useful and can provide additional information. In this case all 3 plots show that the most effective single variable for predicting the distribution of the occurrence data is variable 39.
We can see that variable 19, the mean temperature in March, increases the test gain and the AUC even though it was hardly used in the model built with all the variables (0.2%). We can also see that, in the test gain and AUC plots, some of the light blue bars are longer than the red one, especially for variable number 30 (the mean altitude of the polygons that drain into the considered polygon).

4.3.3 Run 3 and run 4

In these runs all the parameters were kept at the same values as in the previous run, with the sole exception of the feature class. Instead of the “Auto features” used in the previous 2 runs, “Threshold features” was selected in run 3 and “Hinge features” in run 4. The relevant difference is in the plots of the response curves. Figure 4.14a shows the response curves of the run with the threshold features, and Figure 4.14b those of the run with the hinge features.

4.3.4 Run 5

In this run all the parameters were set as in run 1, but in addition the “replicates” parameter was set to 10. This parameter is used to perform multiple runs for the same species. In this analysis the form of replication used was cross-validation (the default). In this case the occurrence data are randomly divided into a number of mutually exclusive subsets of equal size, called “folds”. The model performance is evaluated by removing each subset in turn: the omitted subset is used for the evaluation and all the others are used to fit the model. With this run 11 html pages are obtained, one for each subset and one that summarizes all the statistical information for the cross-validation.

4.3.4.1 Analysis of omission/commission

In the html file that summarizes all the statistical information, the first plot, Figure 4.15, shows the test omission rate and predicted area as a function of the cumulative threshold, averaged over the replicate runs.
The mean omission on test data plus or minus the standard deviation is represented in yellow, while the mean omission on test data is represented in light blue. The mean predicted area is represented in red and its standard deviation in blue. Then there is the plot of the receiver operating characteristic (ROC) curve, Figure 4.16, which is also averaged over the replicate runs. The average test AUC is 0.895 and the standard deviation is 0.028.

Figure 4.11: Importance of different variables - run 2

Figure 4.12: Jackknife - regularized training gain - run 2

Figure 4.13: Jackknife - run 2
(a) Jackknife - AUC - run 2
(b) Jackknife - test gain - run 2

Figure 4.14: Response curves - run 3 and 4
(a) Response curves - run 3
(b) Response curves - run 4

Figure 4.15: Average omission and predicted area for F. sultana - run 5

Figure 4.16: ROC - run 5

4.3.4.2 Response curves

Then we have the response curves, both the single-variable and the marginal ones. We can notice that the single-variable response is, in general, less variable than the marginal one. For variable 39, Figure 4.17a shows the single-variable response and Figure 4.17b the marginal one.

4.3.4.3 Analysis of variable contributions

A further difference from the other runs is in the jackknife test, Figure 4.18. Here the environmental variable that decreases the gain the most when it is omitted is the mean river slope (number 29), which therefore appears to carry the most information that is not present in the other variables. In all the previous runs it was always variable 39.

4.3.5 Run 6

In this run the mean river slope of all the polygons that drain into the considered polygon was added (as variable number 43), and the variables concerning the mean monthly temperature were removed, with the sole exception of the mean monthly temperature of May.
All the default settings were kept except the random test percentage, which was set to 25. There are no important differences, but we can notice that the new variable has a contribution of 5.6%. Figure 4.19 shows an extract of the 2 tables of variable contributions for run 2 and run 6.

The map in Figure 4.20 was obtained from the output of MaxEnt for run 6. The data, in csv format, were imported into ArcMap 10.2 and converted into a layer using the Data Management Tools (Layers and Table Views - Make XY Event Layer). This yields a point layer in which each point carries the predicted probability of presence computed by MaxEnt. The value of each point was then added to the corresponding polygon in the bassisgeometrie layer. This operation was performed with the Analysis Tools of ArcMap 10.2, more specifically with a Spatial Join.

4.3.6 Run 7

This run was performed with the same input as run 6 but with the replicates setting at 10, so that cross-validation was performed. The results are very similar to those of the cross-validation obtained in run 5.

4.3.7 Run 8

In this run the “regularization multiplier” parameter was changed. This parameter allows us to change how focused the output distribution is. The default value is 1.0; the smaller the parameter, the more localized the output distribution and the closer the fit to the given presence records. This can lead to overfitting of the data and to a loss of the model's power to generalize to other data. A larger value of this parameter gives a more spread out distribution. For this run the parameter was set to 3, and the other inputs were the same as in run 6.

Figure 4.17: Response curves for variable 39 - run 5
(a) Single-variable response curve - run 5
(b) Marginal response curve - run 5

Figure 4.18: Jackknife - run 5

Figure 4.19: Importance of the most important variables - run 2 and 6
(a) Importance of the most important variables - run 2
(b) Importance of the most important variables - run 6

Figure 4.20: Probability of presence F. sultana - run 6

The map in Figure 4.21 was obtained from the output of MaxEnt for run 8, in the same way as the map for run 6.

Figure 4.21: Probability of presence Fredericella sultana - run 8

4.3.8 Run 9

For this run a new variable (number 44) was added; it concerns the pollutant release facilities in Switzerland. The variables concerning the mean monthly temperature were removed, with the sole exception of the mean monthly temperature of July (number 23), because the mean monthly temperatures were strongly correlated with each other and had not been very relevant in the previous runs. The new variable (number 44) was treated as categorical because it takes the value 1 or 0, for presence or absence respectively. All the default parameters were kept except the “random test percentage”, which was set to 25 in order to use 25% of the sample records for testing. There are no important differences compared with the other runs: as always, the variable with the largest percent contribution is number 39, followed by variables 4, 29, 31 and 30. The new variable has a contribution of 0.8% and a permutation importance of 0.

4.3.9 Run 10

For this run all the previous settings were kept the same, but 2 new variables were introduced. The first is the number of pollutant release facilities in all the basins that drain into the considered basin (number 45). The second is the number of pollutant release facilities in the considered basin (number 46). These 2 variables were added because the presence of F. sultana might depend on the number of pollutant release facilities and not just on their presence.
Here too there are no big differences between this run and the previous one, but the importance of the variables has changed slightly. After the usual important variables there is now also variable 45, the number of pollutant release facilities in the drained basins. Variable 44, on the other hand, has lost all of its percent contribution, as we can see in Figure 4.22. The map in Figure 4.23 was obtained from the output of MaxEnt for run 10, in the same way as the maps for runs 6 and 8.

Figure 4.22: Importance of different variables - run 10

Figure 4.23: Probability of presence F. sultana - run 10

4.3.10 Run 11

For this run all the parameters were the same as in run 10, but 3 new variables were introduced. These are the presence or absence of waste and waste water management facilities, the number of these facilities in the basins that drain into the considered basin, and the number of these facilities in the considered basin, respectively variables 47, 48 and 49. As in run 10, the variable concerning the presence or absence of waste and waste water management facilities is categorical. Here too there are no very important differences: the variables with the largest percent contribution are still variables 39, 4, 29, 30, 43 and 31. In this run variable 45 (the number of pollutant release facilities in all the basins that drain into the considered basin) loses a little of its percent contribution, going from 5.8% to 4.9%. The percent contribution of the new variables is very low, and variable 47 is not even used, as we can see in Figure 4.24.

Figure 4.24: Importance of different variables - run 11

The map in Figure 4.25 was obtained from the output of MaxEnt for run 11, in the same way as the maps for runs 6, 8 and 10.

Figure 4.25: Probability of presence F. sultana - run 11

Chapter 5
Discussion and conclusions

5.1 Discussion

5.1.0.1 Analysis of the data

In all the maps it is possible to notice some large basins, for example the one below the Lac Léman or the one near the Canton of Ticino, that have a single value for all the different data. This is because they are polygons at the border of Switzerland, and in the basissisgeometrie data they are considered as single basins even though they are composed of different basins. For this reason, the values of these basins should not be considered as real values. In addition, one part of the geology and land cover maps is missing (at the top right), because that part is not in Switzerland and the data used covered Switzerland only. In any case, this is not a significant limitation, because the missing part is outside our area of study and is small compared with the area analyzed.

5.1.1 Correlation

5.1.1.1 Local variables

As we can see in Figure 4.2a, which represents the correlation matrix for all the local variables, there is a strong positive correlation among all the mean monthly temperature variables, as would be expected for a given location. There is another strong correlation, this time negative, between the mean altitude and the mean monthly temperature variables. A negative correlation arises when one variable increases while the other decreases; as we would expect, we observe that temperature decreases when altitude increases. This is why most of the mean monthly temperature variables were excluded in the last runs. We can also notice a strong negative correlation between the variable that represents land cover class 1 (rocks) and the mean monthly temperatures. This means that the percentage of rocks increases where temperatures are low.
This agrees with what we would expect, since places where the temperatures are low, typically at high altitude (negative correlation between temperature and altitude), are in large part mountainous regions. Consistently with this negative correlation, we can also see a positive correlation between mean altitude and land cover class 1. As we said before, the percentage of rocks increases with altitude because of the mountainous regions of Switzerland. Furthermore, altitude and river slope are positively correlated: as the altitude increases, river slopes increase. Accordingly, the river slope is negatively correlated with the mean monthly temperatures and positively correlated with land cover class 1, which represents rocks.

5.1.1.2 Up-stream variables

In Figure 4.2b we can see the correlation matrix of the up-stream variables. A strong positive correlation can be observed among all the variables concerning the geology and the land cover of the up-stream basins. There are also positive correlations among the variables concerning the pollutant release facilities. For example, there is a strong correlation between the number of waste water management facilities in the considered polygon and the number of pollutant release facilities in the polygons that drain into the considered polygon. This is because all these variables come from the same data; by construction they are not independent, so a positive correlation is expected.

5.1.2 The runs

5.1.2.1 Run 1

As we can see in Figure 4.4, the AUC for run 1 is relatively high (0.957), which suggests that the model is performing well. In all the runs, the marginal response curves can be hard to interpret if the variables are strongly correlated. These curves show the marginal effect of changing exactly one variable, whereas the model may rely on the variables changing together.
On the other hand, the single-variable response curves can be easier to interpret if the variables are strongly correlated. If a curve has an upward trend there is a positive association, whereas a downward trend represents a negative relationship. The strength of these relationships is represented by the magnitude of these movements.

In Figure 4.7 we can see the jackknife test of variable importance. The longest dark blue bars, which correspond to the most important variables, are those of variables 39, 29, 4 and 26. Looking at the light blue bars, we can see that no variable contains a substantial amount of useful information that is unique with respect to all the other variables. We can tell this because all the light blue bars have more or less the same length, so excluding any one of these variables does not reduce the training gain by much. If one of the light blue bars were longer than the red one, it would mean that the model without that variable performed even better than the model created with all the variables. This could happen with a variable that causes overfitting or a lack of generality in the test runs.

5.1.2.2 Run 2

In run 2 the AUC for the training data is even higher than in run 1 (0.973). As in the previous run, this large AUC is an indication that the model is performing well. The test line, on the other hand, shows the fit of the model to the testing data and so tests the real predictive power of the model. In this case the AUC of the test data is 0.875; this value is lower than the AUC found for the training data but is in any case much higher than the AUC of the random model (0.5). This is to be expected, because the model was created starting from the training data, which are therefore well represented.
Analyzing the table of thresholds, Figure 4.10, we can see that the p-values are very small, which gives further confirmation that the model predicts the test points better than a random prediction. In the jackknife test (test gain and AUC plots), shown in Figures 4.13b and 4.13a, some of the light blue bars are longer than the red ones, especially for variable 30 (the mean altitude of the polygons that drain into the considered polygon). This means that the predictive performance improves when variable 30 is not used. On the other hand, it is possible to notice that the model made with only variable 8, the percentage of the geology class sand and gravel, has a negative test gain. This means that this model is worse than a null model at predicting the distribution of occurrences.

5.1.2.3 Run 3 and 4

Analyzing these response curves, we can see that the shapes of the profiles are similar, but there are some differences due to the different feature types. The threshold feature gives a function made of steps; the hinge feature is similar to the threshold feature but allows changes in the gradient of the response. The result is a step function for the threshold feature and a piece-wise linear function for the hinge feature. We can see that the response curve of variable 39 using only hinge features is a sequence of connected line segments. Figure 5.1 shows a zoom of the profiles of the most used variable, variable 39, obtained with the threshold features, the hinge features and all features, in Figures 5.1a, 5.1b and 5.1c respectively. Using all the feature classes together makes it possible to model complex responses. These figures show that the probability of presence of F. sultana increases when the percentage of built up areas in all the basins that drain into the considered basin is around 20%.
However, this result raises some issues of interpretation. First, this variable is strongly correlated with other up-stream variables of geology and land cover. Second, it is necessary to consider the range that this variable can actually assume: it is almost impossible for the percentage of built up area of the up-stream basins to be 100%.

Figure 5.1: Response curve of variable 39
(a) Response curve of variable 39 using threshold feature - run 3
(b) Response curve of variable 39 using hinge feature - run 4
(c) Response curve of variable 39 using all features - run 2

Figure 5.2: ROC for Fredericella sultana - run 7

5.1.2.4 Run 5

Cross-validation was performed in this run, so all the graphics show some additional statistics. For example, the graphic of the average omission and predicted area and the ROC curve both show a mean value and a standard deviation. For this run the mean AUC is a bit lower than in the previous runs, at 0.895 (in any case still higher than that of the random model). This value is similar to the AUC obtained for the test data in run 2 (0.875), because here too some of the data are used to test the model.

5.1.2.5 Run 6

In this run a new variable was introduced, the mean river slope of the up-stream polygons. This variable has an importance of the same order of magnitude as the 5 most important variables. Figure 4.20 represents the probability map obtained with this run. We can see that the probability of finding F. sultana is higher in the area that runs through Genève, Lausanne, Bern, Zurich and St Gallen. There is also a high probability of finding F. sultana in the Canton of Ticino, especially near Lugano. The probability of presence of F. sultana is also high in the area of Basel and the Jura.
By contrast, in the Alps the probability of finding F. sultana is low, with the sole exception of the areas of Sion and Monthey. The probability of finding F. sultana appears to be high near the large cities, which is consistent with the results of the variable importance test, where the most important variable is the percentage of built-up areas in the upstream basins.

5.1.2.6 Run 7

In this run the mean AUC is higher than in run 5, increasing from 0.895 to 0.905. This increase could be attributed to the addition of variable 43 to the model. The ROC curve from this run, for which cross-validation was also performed, is shown in Figure 5.2. In addition, the single-variable response curves are less variable than the marginal ones. As an example, the single-variable and marginal response curves of variable 39 are shown in Figures 5.3b and 5.3a, respectively. Figure 5.3 shows that the probability of presence of F. sultana increases when the percentage of built-up areas in all the basins that drain into the considered basin is around 20%. However, as noted before, this result is difficult to interpret, for two reasons. First, this variable is strongly correlated with other upstream variables of geology and land cover. Second, the range of values that the variable can actually take must be considered: the upstream basins are unlikely to be 100% built up.

Figure 5.3: Response curves of variable 39, run 7. (a) Marginal response curve of variable 39, run 7. (b) Single-variable response curve of variable 39, run 7.

5.1.2.7 Run 8

In this run the “regularization multiplier” parameter was changed; the result is shown in Figure 4.21.
Comparing this map with the map obtained in run 6 (Figure 5.4), we can see that the distribution obtained in the current run is more spread out. In this run more areas have a higher probability of presence of F. sultana, while the areas with the highest probability of presence are the same as in run 6. In run 8, the areas that increase the most are those with a probability of presence of around 0.5 to 0.6.

5.1.2.8 Run 9, 10, 11

For these runs, the interesting results are the response curves of the newly introduced variables. They differ from the others because the variables were set as categorical. Figures 5.5a and 5.5b show the marginal and the single-variable response curves, respectively, of variable 44 for run 9. As Figure 5.5 shows, the probability of finding the species is higher if there are pollutant release facilities in the considered polygon, subject to the limits mentioned in the discussion of run 7. The variables concerning pollutant releases have little influence on the model, because they are too general. For example, the presence of pollutant release facilities is not influential because the data do not consider the size of the facilities or the amounts and types of pollutants released. To improve these data, population density should also be included. Likewise, the number of pollutant release facilities and the number of waste or waste water management facilities in the upstream basins are not influential, because they are too general and because the dispersion of the pollutants is not considered. The fact that the model does not rely on these overly general data sets indicates a strength of the model: it gives little weight to data that are not useful for the prediction.
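The AUC values compared across the runs above (for example 0.895 in run 5 against 0.905 in run 7) are means over cross-validation folds. As a minimal sketch of what such a summary involves, the AUC can be read as the probability that a presence site receives a higher model score than a background site, computed per fold and then averaged. The scores below are hypothetical, the function name is our own, and the rank comparison is an assumption about the general idea rather than the exact procedure implemented in the modelling software.

```python
from statistics import mean, stdev


def auc(presence_scores, background_scores):
    """Fraction of presence/background pairs ranked correctly (ties count 0.5)."""
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(presence_scores) * len(background_scores))


# Hypothetical model scores for three cross-validation folds (not thesis data):
# each fold is (scores at held-out presence sites, scores at background sites).
folds = [
    ([0.9, 0.8, 0.6], [0.1, 0.3, 0.7, 0.2]),
    ([0.7, 0.9, 0.25], [0.2, 0.5, 0.1, 0.3]),
    ([0.8, 0.65, 0.95], [0.4, 0.1, 0.6, 0.2]),
]

fold_aucs = [auc(presence, background) for presence, background in folds]
print("per-fold AUC:", [round(a, 3) for a in fold_aucs])
print("mean AUC:", round(mean(fold_aucs), 3), "sd:", round(stdev(fold_aucs), 3))
```

A value of 0.5 corresponds to the random model, since presence and background sites are then ranked correctly only half of the time; the standard deviation across folds gives a rough idea of how much a difference such as 0.895 against 0.905 can be trusted.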
5.1.3 All the runs

In all the runs the AUC is above 0.5 (the value for the random model); more precisely, all the values are above 0.872. This suggests that the models perform well. Across all the runs, the most important variable is variable 39, the percentage of land cover class 3 (built up) in all the polygons that drain into the considered polygon. The other important variables are variables 29, 4, 30, 32, 31 and 43 (respectively: mean river slope, mean altitude, mean upstream altitude, the percentages of geology class 2 and of geology class 1 (alluvial rocks) in all the polygons that drain into the considered polygon, and mean upstream river slope). Table 5.1 lists the most important variables for all the runs. The river slope has a percent contribution between 5.6 and 13.8, the mean altitude between 0.6 and 17.5, and the mean altitude of the upstream basins between 0.8 and 10.1.

As noted above, the mean altitude is among the most important variables. Its response curves are shown in Figure 5.6. Bearing in mind all the limits and the difficulties of interpretation of these curves, there is a higher probability of presence of F. sultana at rather low altitudes. In both curves (Figure 5.6a and Figure 5.6b) there is a drastic change in probability at an altitude of around 800 m above sea level. This agrees with the result obtained in the probability maps, where the probability of presence of F. sultana was higher near the big cities (Zurich is at an altitude of 408 m, Lausanne at 495 m and Berne at 542 m).

Figure 5.4: Probability of presence of F. sultana, run 6 and 8. (a) Probability of presence of F. sultana, run 6. (b) Probability of presence of F. sultana, run 8.

Figure 5.5: Response curve of variable 39, run 9. (a) Marginal response curve of variable 39, run 9. (b) Single-variable response curve of variable 39, run 9.

Each entry gives a variable and its percent contribution (perc. cont.), in decreasing order of contribution.

Run   1st             2nd             3rd             4th             5th             6th
1     39-Dland3 25.1  29-Rslope 12.8  4-altitu 8.7    26-temp10 8.5   30-D_alti 6.5   32-D_geo2 5.3
2     39-Dland3 28    4-altitu 15.5   29-Rslope 10.2  30-D_alti 9.7   32-D_geo2 5.7   31-D_geo1 5.4
3     39-Dland3 31.7  4-altitu 17.5   29-Rslope 11.5  30-D_alti 10.1  32-D_geo2 4.7   31-D_geo1 3.8
4     39-Dland3 45.9  31-D_geo1 12.4  37-Dland1 6.5   29-Rslope 5.6   32-D_geo2 5.5   36-D_geo6 4.1
5     39-Dland3 27.9  29-Rslope 13.8  4-altitu 10.2   26-temp10 9.8   30-D_alti 7.6   32-D_geo2 3.8
6     39-Dland3 28.6  4-altitu 15.9   29-Rslope 9.8   30-D_alti 7.4   31-D_geo1 7     43-Dslope 5.6
7     39-Dland3 45.9  31-D_geo1 12.4  37-Dland1 6.5   29-Rslope 5.6   32-D_geo2 5.5   36-D_geo6 4.1
8     39-Dland3 40.7  4-altitu 14.1   29-Rslope 10.3  31-D_geo1 4.9   32-D_geo2 3.6   30-D_alti 3.2
9     39-Dland3 28.7  4-altitu 15.7   29-Rslope 9.6   31-D_geo1 7.8   30-D_alti 6     32-D_geo2 5.6
10    39-Dland3 27.6  4-altitu 15.6   29-Rslope 9.4   31-D_geo1 6.1   30-D_alti 5.8   45-ND_pol 5.8
11    39-Dland3 26.7  4-altitu 14.9   29-Rslope 8.5   30-D_alti 6.4   43-Dslope 6     31-D_geo1 5.3

Table 5.1: Percent contribution of the most important variables for all the different runs

5.2 Conclusion and future work

All the runs performed significantly better than the random model. The threshold-independent ROC analysis also showed a considerably better performance than the random model, and the area under the ROC curve (AUC) was high in all the runs. All the runs produced reasonable predictions of the potential distribution of F. sultana. The predicted probability of presence of F. sultana is higher near the big cities of Switzerland, which is consistent with the results of the variable importance test, where the most important variable is the percentage of built-up areas in the upstream basins.
All the probability maps show that the diagonal band that runs from Genève to St. Gallen has the highest probability of presence of F. sultana. To evaluate the prediction of the potential distribution of F. sultana, a survey of its true presence is needed. Such a study could aim to find out whether F. sultana is also present in the places where the infected trout were found.

The species data analyzed in the current study could be affected by bias, because we assumed that F. sultana is present only in the places where the trout were found, whereas F. sultana may also have suitable habitat where trout are not present. For this reason it would be interesting to carry out a survey to detect the real presence of F. sultana.

Figure 5.6: Response curve of variable 4 - run 2. (a) Marginal response curve of variable 4, run 2. (b) Single-variable response curve of variable 4, run 3.