Scuola di Ingegneria Civile, Ambientale e Territoriale
Dipartimento di Elettronica, Informazione e Bioingegneria
Master of Science in Environmental and Land Planning Engineering
Study of the distribution of Fredericella
sultana in Switzerland in relation to
environmental variables to predict the
diffusion of Proliferative Kidney Disease in
fish populations.
SUBMITTED APRIL 1ST, 2015
BY
IRENE BARDI
Student Id n. 801780
SUPERVISORS
Enrico Bertuzzo, Prof. Renato Casagrandi
Academic Year 2014-2015
Abstract
Proliferative kidney disease (PKD) is one of the most important parasitic diseases of salmonid populations in Europe and North America. It causes substantial economic losses to fish farms and has a significant impact on wild fish populations. The causative agent of PKD is the myxozoan Tetracapsuloides bryosalmonae, which uses freshwater bryozoans as intermediate hosts. The most common host species for T. bryosalmonae is the bryozoan Fredericella sultana. The objective of this Master thesis was to create a probability distribution model for the presence of F. sultana in Switzerland from presence-only records of infected trout and from an analysis of various environmental variables.
The selected environmental variables describe both the local basin and the entire upstream catchment area. They represent climate, river slope, land cover, geology and pollutant release facilities. All of the data were first processed with a GIS and Matlab in order to compute the required environmental variables. The probability distribution was then created with MaxEnt, a Species Distribution Model (SDM) that combines observations of a species' presence with environmental variables that could affect the suitability of the species' habitat. Various runs were performed in order to evaluate the different modelling possibilities. All of the runs performed significantly better than the random model, and the AUC was high in all cases. The predicted probability of presence of F. sultana is higher near the large cities of Switzerland, which is consistent with the most important variable: the percentage of built-up area in the upstream basins. The other important variables were the mean river slope, the mean altitude (local and upstream) and the percentage of alluvial rocks in the upstream basins.
Acknowledgements
My first thanks go to the people who made this work possible, my supervisors Enrico Bertuzzo and Renato Casagrandi. Thank you to Enrico Bertuzzo for always being available and present for every new idea, doubt or problem. Thank you to Renato Casagrandi for giving me the opportunity to carry out my thesis abroad and for encouraging me in moments of uncertainty.
A huge thank you goes to my parents for their continuous and unconditional encouragement. Thank you for always believing in me; you have been fundamental. Thanks also to Elena, who has put up with me all these years, even when I used the excuse "well, I'm at university... things are harder here"; now it's your turn!
Thanks to Martin for always being there and helping me with every little difficulty; you were a point of reference for me. Thanks to Marco, coffee companion at Sat, fount of Matlab wisdom and trusted friend. Thanks also to Bea, Iris and Gib, not just colleagues but friends, always ready to support and encourage me; without you these five years would have been much harder. Thanks to Dani, Simo, Umbe, Pietro and Rebbi, companions in this wonderful adventure, always able to make me smile.
Contents
Acknowledgements
1 Introduction
  1.1 A general overview
  1.2 Methods
2 Data
  2.1 Subdivision of Switzerland in drainage basins
    2.1.1 PKD presence data
    2.1.2 Altitude
    2.1.3 Temperature
    2.1.4 Geology
    2.1.5 Land cover
    2.1.6 Swiss pollutant register
  2.2 Data Analysis
    2.2.1 Drainage basins
    2.2.2 Altitude
    2.2.3 River slope
    2.2.4 Temperature
    2.2.5 Geology
    2.2.6 Land cover
    2.2.7 Swiss pollutant register
3 Maxent
  3.1 Introduction to Maxent
  3.2 Explanation of Maxent
4 Results
  4.1 Data used in Maxent
  4.2 Correlation
  4.3 The runs
    4.3.1 Run 1
      4.3.1.1 Analysis of omission/commission
      4.3.1.2 Response curves
      4.3.1.3 Analysis of variable contributions
    4.3.2 Run 2
      4.3.2.1 Analysis of omission/commission
      4.3.2.2 Analysis of variable contributions
    4.3.3 Run 3 and run 4
    4.3.4 Run 5
      4.3.4.1 Analysis of omission/commission
      4.3.4.2 Response curves
      4.3.4.3 Analysis of variable contributions
    4.3.5 Run 6
    4.3.6 Run 7
    4.3.7 Run 8
    4.3.8 Run 9
    4.3.9 Run 10
    4.3.10 Run 11
5 Discussion and conclusions
  5.1 Discussion
      5.1.0.1 Analysis of the data
    5.1.1 Correlation
      5.1.1.1 Local variables
      5.1.1.2 Up-stream variables
    5.1.2 The runs
      5.1.2.1 Run 1
      5.1.2.2 Run 2
      5.1.2.3 Run 3 and 4
      5.1.2.4 Run 5
      5.1.2.5 Run 6
      5.1.2.6 Run 7
      5.1.2.7 Run 8
      5.1.2.8 Run 9, 10, 11
    5.1.3 All the runs
  5.2 Conclusion and future work
List of Tables
2.1 Data sources PKD
2.2 Extract from the legend of the lithology
2.3 Data source land cover
2.4 Geology legend
4.1 Environmental variables for the distribution of Fredericella sultana
5.1 Percent contribution of the most important variables for all the different runs
List of Figures
1.1 T. bryosalmonae life cycle [1]
1.2 Map of Switzerland
2.1 Basic geometry
2.2 Zoom of basic geometry
2.3 Example of selection of all the upstream basins
2.4 Example of selection of all the downstream basins
2.5 Map presence PKD
2.6 Altitude
2.7 Mean temperature in January
2.8 Example of a system of basins and its adjacency matrix
2.9 Adjacency matrix
2.10 Coloured adjacency matrix with coordinates
2.11 Adjacency matrix with coordinates and lakes
2.12 Workflow altitude
2.13 Comparison between the raster of the polygons and the DTM
2.14 Mean altitude
2.15 Mean upstream altitude
2.16 Plot log flow accumulation vs. log river slope
2.17 Mean flow accumulation
2.18 Log mean flow accumulation
2.19 Log mean river slope
2.20 Log mean drained river slope
2.21 How Zonal Statistics works [2]
2.22 Mean monthly temperature January 2013
2.23 Geology
2.24 Zoom of geology before and after dissolve
2.25 How Statistic Tabulate Intersection works [2]
2.26 Extract of the geology table
2.27 Percentage of different classes of geology
2.28 Land cover
2.29 Percentage of different classes of land cover
2.30 Pollutant releases
2.31 Waste and waste water management facilities
4.1 Extract of the csv file used for Maxent
4.2 Correlation matrices for the variables
4.3 Omission and predicted area for F. sultana - run 1
4.4 ROC for Fredericella sultana - run 1
4.5 Responding curves - run 1
4.6 Importance of different variables - run 1
4.7 Jackknife - run 1
4.8 Omission and predicted area for Fredericella sultana - run 2
4.9 ROC - run 2
4.10 Thresholds table - run 2
4.11 Importance of different variables - run 2
4.12 Jackknife - regularized training gain - run 2
4.13 Jackknife - run 2
4.14 Responding curves run 3 and 4
4.15 Average omission and predicted area for F. sultana - run 5
4.16 ROC - run 5
4.17 Responding curves for variable 39 - run 5
4.18 Jackknife - run 5
4.19 Importance of the most important variables - run 2 and 6
4.20 Probability of presence F. sultana - run 6
4.21 Probability of presence Fredericella sultana - run 8
4.23 Probability of presence F. sultana - run 10
4.22 Importance of different variables - run 10
4.24 Importance of different variables - run 11
4.25 Probability of presence F. sultana - run 11
5.1 Response curve of variable 39
5.2 ROC for Fredericella sultana - run 7
5.3 Responding curves variable 39 - run 7
5.4 Probability of presence F. sultana - run 6 and 8
5.5 Response curve of variable 39 - run 9
5.6 Response curve of variable 4 - run 2
Chapter 1
Introduction
1.1 A general overview
Freshwater ecosystems are among the most important ecosystems on Earth: it is estimated that 40% of all species in the world have their origins in freshwater ecosystems. These habitats are also home to many other organisms, such as aquatic plants, invertebrates and amphibians [3]. However, they are also one of the most threatened natural resources, and human activity plays an important role in this process. The degradation of freshwater ecosystems has important impacts on the biodiversity of species; it can change water quality and cause disease. The increased presence of disease threatens aquatic animal health and can change the resilience and biodiversity of a whole population. Proliferative Kidney Disease (PKD) is one of the most important parasitic diseases of salmonid populations in Europe and North America [4]. It causes substantial economic losses to fish farms and has a significant impact on populations of wild fish [5]. It causes a massive proliferation of the interstitial kidney tissue, anemia, ascites, exophthalmos and apathy [1]. The causative agent of PKD is the myxozoan Tetracapsuloides bryosalmonae, a multicellular endoparasite of freshwater bryozoans and salmonid fish. Its life cycle is based on the exploitation of vertebrate and invertebrate hosts. The most common host species for T. bryosalmonae are bryozoans [6]. Bryozoan colonies can be mistaken for a mat of moss, and they are found in environments with scarce light, for example under stones or logs [7]. They are a source of food for many species of fish and a microhabitat for small invertebrates. The dispersal of bryozoans is facilitated by statoblasts. Their colonies expand during summer and regress to inactive, asexual hibernating stages (statoblasts) in the fall. PKD appears most often in summer, when the water reaches temperatures of around 15 degrees C or more [4].
Box: Tetracapsuloides bryosalmonae life cycle
T. bryosalmonae is a myxozoan parasite of salmonid fish and the causative agent of PKD. PKD has been known since the 1900s, but it was only discovered in 1985 that myxozoans were its cause [8] and in 1999 that freshwater bryozoans were the invertebrate hosts [5]. Myxozoa are a group of very small endoparasites of vertebrate and invertebrate animals in aquatic environments, usually ranging from 10 µm to 20 µm in size. There are around 1300 species of myxozoans and most of them have a two-host life cycle involving, for example, fish, annelid worms or bryozoans. They are multicellular organisms, and studies have shown that they originated from cnidarians [9]. The life cycle of T. bryosalmonae is still under study, but the identification of bryozoans as intermediate hosts has brought important progress in understanding it. The first stage of the infection caused by T. bryosalmonae is a covert infection, in which single-cell stages are present in the bryozoans. Subsequently, multicellular sacs develop from the single-cell stages and multiply in the cavity of the bryozoan colonies (overt infection). The mature sacs contain spores, each with two internal amoeboid cells and four polar capsules [10]. The spores released by the bryozoans infect trout via filaments contained in the polar capsules, which enter the fish through the gills or skin. The infection in fish is caused by the amoeboid cells of the spores, which reach the vascular system. The bryozoans can cycle from covert to overt infection [1].
Figure 1.1: T. bryosalmonae life cycle [1]
T. bryosalmonae reaches the kidneys and spleen, causing inflammatory responses and damage to kidney tissues. Some of these spores reach the lumen of the kidney tubules and are released in the urine of the trout. In the case of brown and brook trout, these spores are infective to bryozoans [1]. Laboratory experiments show that infection by T. bryosalmonae alone can lead to the death of the fish, without necessarily involving a secondary infection [11].
Fredericella sultana (Blumenbach) is the most common bryozoan responsible for PKD. It is a freshwater bryozoan typical of lotic and lentic habitats [12]. It has been found in Europe, North America, Australia and New Zealand [13]. F. sultana prefers lakes that are rather low in altitude and have rich vegetation with plant species typical of eutrophic environments, gyttja sediments, stony "hard" shores, some wave action, a high calcium content and slightly colored water. On the other hand, F. sultana avoids lakes with pH below 5.4, ponds, ditches and mires, and dystrophic lakes surrounded by Sphagnum bogs and with dy sediments [7]. The objective of this study is to create a probability distribution model for the presence of F. sultana in all of Switzerland from presence-only records in rivers and from an analysis of various environmental variables.
This study covers a large area (about 40,000 km²): as stated above, the whole Swiss territory. Figure 1.2 shows the map of Switzerland with its cantons and cities. In Switzerland, PKD is the most prevalent disease in fish [4].
Figure 1.2: Map of Switzerland
1.2 Methods
In this study, the distribution of F. sultana is studied in order to predict the presence of PKD. Unfortunately, no data on the presence of F. sultana are available for Switzerland, but it is known that F. sultana is the most common intermediate host species for T. bryosalmonae. The available data are records of the presence or absence of infected trout collected across the whole Swiss territory.
The spatial scale of analysis is that of the drainage basin: the whole territory of Switzerland is divided into drainage basins and partial drainage basins (their mean area is 2 km²).
Given this spatial scale and the limited mobility of trout (especially in spring, the season when trout contract the disease), the mobility of the trout can be neglected. For this reason, the presence of F. sultana is inferred from the presence of infected trout.
The environmental variables selected for this study describe both the local basin and the entire upstream catchment area. The local variables refer to the considered basin, while the upstream variables take into account all the basins that drain into the considered basin. These environmental variables represent climate, river slope, land cover, geology and pollutant release facilities. To represent the climate, altitude was selected because in previous studies [7] it was a discriminant variable for the presence of F. sultana and because it has a relevant impact on local climate conditions.
The other variable selected to represent climate was the mean monthly temperature of 2013, because temperature has a strong impact on climate and on the habitat of many species [14]. To represent the river characteristics, river slope and flow accumulation were calculated. Geological and land cover characteristics may have an impact on water quality [15], [16]. For land cover, 6 classes were considered: rocks, agriculture, built up, forest, glacier and lakes/water. For geology, 6 further classes were studied: alluvial rocks, peat/loam, sedimentary rocks, sand/gravel, granite and gneiss. This classification was made according to the chemical and physical properties of the rocks that can influence water quality [16]. To account for human impact, land cover and pollutant release facilities were investigated. All of these environmental variables were also calculated over all the upstream basins of the considered basin.
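The distinction between local and upstream variables can be illustrated with a minimal sketch. It assumes, purely for illustration, that an upstream variable is an area-weighted mean over the considered basin and every basin draining into it; the basin names, areas, built-up fractions and the `upstream_of` mapping are hypothetical toy values, not data from this study.

```python
# Hypothetical toy data: basin id -> (area in km2, local built-up fraction).
basins = {
    "A": (2.0, 0.10),
    "B": (1.0, 0.40),
    "C": (3.0, 0.00),
}

# Hypothetical drainage structure: basin id -> all basins draining into it.
# Here A drains into B, and B drains into C.
upstream_of = {"A": set(), "B": {"A"}, "C": {"A", "B"}}


def upstream_mean(basin_id, include_self=True):
    """Area-weighted mean of the local variable over the upstream basins.

    Whether the considered basin itself is part of its upstream set is an
    assumption of this sketch (controlled by include_self).
    """
    ids = set(upstream_of[basin_id])
    if include_self:
        ids.add(basin_id)
    total_area = sum(basins[i][0] for i in ids)
    weighted_sum = sum(basins[i][0] * basins[i][1] for i in ids)
    return weighted_sum / total_area


for name in basins:
    local_value = basins[name][1]
    print(name, "local:", local_value, "upstream:", round(upstream_mean(name), 3))
```

In this toy network the local built-up fraction of basin C is zero, while its upstream value is not, because basins A and B drain into it. This is the kind of difference the upstream variables are meant to capture.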
Chapter 2
Data
2.1 Subdivision of Switzerland in drainage basins
To analyze the distribution of F. sultana over the whole territory of Switzerland, the environmental variables were linked with the hydrology of Switzerland. The hydrology was described with the dataset Basisgeometrie (basic geometry), provided by OFEV (Office Fédéral de l'Environnement). This file contains the subdivision of Switzerland into drainage basins: it is a mosaic of polygons, each corresponding to a partial drainage basin, that covers the whole territory. The mean surface area of these polygons is 2 km². Drainage basins are either partial or complete. Each partial drainage basin belongs to a unique complete drainage basin, made of all the basins that drain through the same outlet, and each complete drainage basin can be selected with a simple query. Figure 2.1 shows the basic geometry of this file, and Figure 2.2 shows a zoom of it.
Figure 2.1: Basic geometry
Among the attributes of this file, H1 and H2 are two auxiliary codes developed by OFEV to help the user represent the hierarchical structure of the basins. They correspond to the "right" and "left" values of a dataset, according to a hierarchical model. With these two parameters it is possible to select the complete drainage basin of each partial drainage basin. They cannot serve as a unique key for a polygon, because they are not fixed values: they change with each new version of the subdivision. All the partial drainage basins $T_n$ that drain through the outlet of basin $T_i$, and that together form the complete drainage basin of $T_i$, satisfy:
$$H1_{T_i} < H1_{T_n} < H2_{T_i}$$
where:
$H1_{T_i}$: value of H1 of the partial drainage basin $T_i$
$H2_{T_i}$: value of H2 of the partial drainage basin $T_i$
$H1_{T_n}$: value of H1 of each of the partial drainage basins $T_n$
Figure 2.2: Zoom of basic geometry
With a GIS it is possible to select all the basins that drain through the outlet of the selected basin (Figure 2.3), using the following query:
H1 ≥ P1 AND H1 < P2
where P1 and P2 are the values of H1 and H2, respectively, of the selected basin.
The opposite is also possible: all the basins downstream from the selected basin (Figure 2.4) can be found with the following query:
H1 ≤ P1 AND H2 ≥ P2
As before, P1 and P2 are the values of H1 and H2 of the selected polygon (Office fédéral de l'environnement OFEV, Subdivision de la Suisse en bassins versants (Bassins versants Suisse)).
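The two queries can be tried out on a small example. The sketch below is only an illustration: the four basins and their H1 and H2 values are hypothetical, chosen so that the interval (H1, H2) of each basin contains the intervals of all the basins draining into it, and the `Basin` type and function names are not part of the OFEV dataset.

```python
from typing import List, NamedTuple


class Basin(NamedTuple):
    name: str
    h1: int
    h2: int


# Hypothetical toy network: T1 and T2 drain into T3, and T3 drains into T4.
# The H1/H2 values are invented so that intervals nest accordingly.
BASINS = [
    Basin("T1", 3, 4),
    Basin("T2", 5, 6),
    Basin("T3", 2, 7),
    Basin("T4", 1, 8),
]


def upstream(selected: Basin, basins: List[Basin]) -> List[str]:
    """Basins draining through the selected basin: H1 >= P1 AND H1 < P2."""
    p1, p2 = selected.h1, selected.h2
    return [b.name for b in basins if b.h1 >= p1 and b.h1 < p2]


def downstream(selected: Basin, basins: List[Basin]) -> List[str]:
    """Basins downstream of the selected basin: H1 <= P1 AND H2 >= P2."""
    p1, p2 = selected.h1, selected.h2
    return [b.name for b in basins if b.h1 <= p1 and b.h2 >= p2]


t1, t2, t3, t4 = BASINS
print(upstream(t3, BASINS))    # basins draining through T3
print(downstream(t1, BASINS))  # basins on the path from T1 to the outlet
```

As written, both queries also return the selected basin itself, because the comparisons with P1 are not strict. If only the other basins are wanted, the selected polygon has to be excluded afterwards.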
Figure 2.3: Example of selection of all the upstream basins
2.1.1 PKD presence data
The PKD presence data were obtained from OFEV (Office Fédéral de l'Environnement). They contain information about PKD and the names of the water bodies where PKD was detected in fish. The dataset is a point shapefile with 504 records, of which 236 are positive for PKD. Table 2.1 shows the characteristics of this dataset, while Figure 2.5 shows the distribution of the PKD presence data. The attribute Befund (German for "finding") records whether or not PKD was present in the fish analyzed.
Figure 2.4: Example of selection of all the downstream basins
Figure 2.5: Map presence PKD
Data type                       Shapefile Feature Class
Shapefile                       pkd_06.shp
Geometry Type                   Point
Coordinates have Z values       Yes
Coordinates have measures       Yes
Projected Coordinate System     CH1903_LV03
Projection                      Hotine_Oblique_Mercator_Azimuth_Center
False_Easting                   600000,00000000
False_Northing                  200000,00000000
Scale_Factor                    1,00000000
Azimuth                         90,00000000
Longitude_Of_Center             7,43958333
Latitude_Of_Center              46,95240556
Linear Unit                     Meter
Geographic Coordinate System    GCS_CH1903
Datum                           D_CH1903
Prime Meridian                  Greenwich
Angular Unit                    Degrees
Table 2.1: Data sources PKD
2.1.2 Altitude
Altitude was represented with the DTM (digital terrain model) of Switzerland. The DTM is a raster file with a resolution of 25 m; it is shown in Figure 2.6.
2.1.3 Temperature
The temperature data were obtained from MeteoSwiss. The variable used in this study is the temperature two meters above ground, averaged over calendar months, in degrees Celsius. The monthly means were calculated by averaging daily mean values, which were in turn calculated from automatic 10-minute measurements taken day and night at about 80 measurement stations. MeteoSwiss interpolated these station measurements spatially, with a method that gives better results than a simple linear interpolation of temperature against height. This product should better reproduce temperature variations such as those from inversions over the Swiss Plateau or winter-time cold pools. However, some physical effects are not modelled and the interpolation accuracy varies in space. For example, the standard error ranges from 0.6 to 1.8 degrees in winter months and from 0.5 to 0.7 degrees in summer months (MeteoSwiss). The data were downloaded as a raster file in TIFF format. Figure 2.7 shows the mean temperature of January.
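The two-step averaging at a single station (10-minute measurements to daily means, then daily means to a monthly mean) can be sketched as follows. This is only an illustration with invented readings, and the function name is hypothetical; it does not reproduce the MeteoSwiss processing chain or its spatial interpolation.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def mean_of_daily_means(readings):
    """Average (timestamp, temperature) readings per day, then across days."""
    by_day = defaultdict(list)
    for timestamp, temperature in readings:
        by_day[timestamp.date()].append(temperature)
    daily_means = [mean(values) for values in by_day.values()]
    return mean(daily_means)


# Invented readings in degrees Celsius: a complete first day at -2.0 and
# only half a day of measurements at 4.0 on the second day.
start = datetime(2013, 1, 1)
readings = [(start + timedelta(minutes=10 * k), -2.0) for k in range(144)]
readings += [(start + timedelta(days=1, minutes=10 * k), 4.0) for k in range(72)]

monthly = mean_of_daily_means(readings)
pooled = mean(temperature for _, temperature in readings)
print(monthly, pooled)
```

Averaging the daily means gives each day the same weight even when some days have fewer measurements, so the result can differ from the mean of all pooled readings, as it does for these invented values.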
Figure 2.6: Altitude
Figure 2.7: Mean temperature in January
2.1.4 Geology
The geology data were produced by Swisstopo as the structured vector dataset GK500_V1_1_FR. It corresponds to the printed geological and tectonic map of Switzerland at 1:500 000. The part of the data used was PY_surfaces_base, a shapefile that contains information about the geological formations, the tectonic units and the aquifers. In this file there were 46 different classes for the lithology. Table 2.2 shows an extract from the legend of the lithology data.
12
CHAPTER 2. DATA
ID   LITHOLOGY
1    Roche meuble en general
2    Limon, argile, tourbe (tourbieres, marais)
3    Limon, argile, (loss, limon de pente, limon d’alteration)
4    Principalement blocs (eboulement)
5    Gravier limoneux, sable, limon, p.p. blocs (moraine)
6    Blocs, gravier grossier, sable (depot d’eboulis)
7    Gravier, sable, limon, p.p. blocs (cone de dejection)
8    Gravier et sable («Schotter»)
9    Gravier et sable, p.p. cimentes («Schotter»)
Table 2.2: Extract from the legend of the lithology
2.1.5
Land cover
The data used for the land cover was downloaded through GeoVITE (Geodata Visualization and Interactive Training Environment). The aim of GeoVITE is to provide easy-to-use online access to the most
important Swisstopo geodatasets. The data used was an extract of VECTOR200, a landscape model that
represents the features of the landscape of Switzerland in vector format. The shapefile had 11 different
land cover categories. Table 2.3 describes the data source of the file used for the land cover
variable.
2.1.6
Swiss pollutant register
SwissPRTR is the Swiss Pollutant Release and Transfer Register. It provides information about the
releases of pollutants and transfers of waste from facilities and diffuse sources. This data covers
different types of facilities, pollutants and waste treatment. The fields of activity of these facilities
are: animal and vegetable products from the food and beverage sector, chemical industries, energy
industries, mineral industries, paper and wood production and processing, production and processing
of metals, waste and waste water management, and other industrial activities.
2.2
Data Analysis
2.2.1
Drainage basins
All the data was processed in order to keep only the information necessary for the study, to validate the
spatial position and to compute some statistics. First, an adjacency matrix was created with MATLAB
in order to build a network of all the connections between the partial drainage basins. To
do that, the query in Equation 2.1 was used:
\[ H1 \le P1 \;\; \mathrm{AND} \;\; H2 \ge P2 \tag{2.1} \]
Data Type                      Shapefile Feature Class
Shapefile                      Primary surface land cover\VEC200_LandCover
Geometry Type                  Polygon
Coordinates have Z values      no
Coordinates have measures      no
Projected Coordinate System    CH1903_LV03
Projection                     Hotine_Oblique_Mercator_Azimuth_Center
False_Easting                  600000,00000000
False_Northing                 200000,00000000
Scale_Factor                   1,00000000
Azimuth                        90,00000000
Longitude_Of_Center            7,43958333
Latitude_Of_Center             46,95240556
Linear Unit                    Meter
Geographic Coordinate System   GCS_CH1903
Datum                          D_CH1903
Prime Meridian                 Greenwich
Angular Unit                   Degree
Table 2.3: Data source land cover
As explained before, H1 and H2 were attributes in the bassisgeometrie data, and P1 and P2 were the
values of H1 and H2 of the selected basin. The matrix was built so that the entry
$C_{ij}$ is one if basin $i$ is immediately upstream of basin $j$ (where $i$ is the row index and $j$ the
column index). Figure 2.8 shows an example of a system of basins and its adjacency matrix.
Figure 2.9 shows what the sparse matrix looks like: the points mark the cells of the
matrix that are different from zero, which in this case are equal to one.
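As a minimal sketch of the structure just described, the Python snippet below builds a sparse adjacency structure from a list of (upstream, downstream) basin pairs and expands it into a 0/1 matrix. The pair list, the toy network and the function names are hypothetical; in the thesis the pairs come from the H1/H2 query of Equation 2.1 and the matrix was built in MATLAB.

```python
def build_adjacency(n_basins, upstream_pairs):
    """Return a sparse adjacency structure C as a dict of sets.

    C[i] holds every j such that basin i is immediately upstream of basin j,
    i.e. the non-zero entries C_ij = 1 of row i.
    """
    adjacency = {i: set() for i in range(n_basins)}
    for i, j in upstream_pairs:
        adjacency[i].add(j)
    return adjacency


def to_dense(adjacency):
    """Expand the sparse structure into a dense 0/1 matrix (list of rows)."""
    n = len(adjacency)
    return [[1 if j in adjacency[i] else 0 for j in range(n)] for i in range(n)]


# Hypothetical toy network: basins 0 and 1 drain into 2, which drains into 3.
pairs = [(0, 2), (1, 2), (2, 3)]
C = to_dense(build_adjacency(4, pairs))
```

Storing only the non-zero entries mirrors the sparse plot of Figure 2.9, where each point is a cell equal to one.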
Figure 2.8: Example of a system of basins and its adjacency matrix
Figure 2.9: Adjacency matrix (sparsity plot, nz = 23440)
Next, the coordinates of each polygon centroid were calculated. The file bassisgeometrie also had the
attribute SEE, which was equal to one if the polygon was a lake or backwater, two if the polygon drains
into a polygon with value one, and zero in the normal case. Using this information, it was possible to
identify the lakes. Figure 2.11 shows the sparse matrix plotted with the centroid coordinates and the
lake information. To understand the structure of the matrix in Figure 2.9, the different
squares in it were plotted with different colours; the result is shown in Figure 2.10. The
different squares in Figure 2.9 represent different parts of Switzerland: the
polygons were, in fact, numbered according to their position in the Swiss territory. The
connections shown in blue in Figure 2.10 are some lakes; in Figure 2.9 they appear in the lower-right
corner, in the lower part and on the right side of the adjacency matrix.
Figure 2.10: Coloured adjacency matrix with coordinates (legend: first square, second square, third square, fourth square, extreme right values)
With the same original data, bassisgeometrie, a matrix was also created in which, for each polygon,
all the polygons that drain into it were recorded. This matrix was
used to calculate, for each polygon, the mean value of each environmental variable over all its upstream
basins.
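The upstream bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions, not the MATLAB code of the thesis: the function names and the toy data are hypothetical, the mean is unweighted (the thesis does not say whether basin area was used as a weight), and the considered basin itself is not counted among its upstream basins.

```python
def upstream_sets(n_basins, upstream_pairs):
    """For each basin, collect every basin that drains into it, directly or not."""
    direct = {j: set() for j in range(n_basins)}
    for i, j in upstream_pairs:          # basin i is immediately upstream of j
        direct[j].add(i)

    result = {}
    for basin in range(n_basins):
        seen, stack = set(), list(direct[basin])
        while stack:                     # depth-first walk up the network
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(direct[node])
        result[basin] = seen
    return result


def upstream_mean(values, upstream):
    """Unweighted mean of a variable over the upstream basins (None for headwaters)."""
    return {basin: (sum(values[u] for u in ups) / len(ups) if ups else None)
            for basin, ups in upstream.items()}


# Hypothetical toy network: basins 0 and 1 drain into 2, which drains into 3.
pairs = [(0, 2), (1, 2), (2, 3)]
ups = upstream_sets(4, pairs)
altitude = [2000.0, 1000.0, 600.0, 400.0]   # made-up mean altitudes in metres
mean_up = upstream_mean(altitude, ups)
```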
2.2.2
Altitude
Figure 2.11: Adjacency matrix with coordinates and lakes (legend: connections, lake)
Figure 2.12: Workflow altitude
With Matlab, the mean altitude was calculated for each polygon. To do this, the shapefile
bassisgeometrie was converted into a raster file with a cell size of 100 m. Each cell of this new file
contained the ID of the polygon located at that position in the bassisgeometrie file. The first
operation applied to the DTM was to fill the sinks, because anomalies are common in
digital terrain models. To be able to compare the two raster files, the DTM was resampled to
change the cell size from 25 m to 100 m; Figure 2.13 shows an example of the two raster files.
This change was made with ArcMap using the resample function. Using the clip function, the two
raster files were resized to have the same dimensions. Next, they were converted into ASCII format to
be processed in Matlab. With Matlab, the mean altitude for each polygon was then computed (Figure
2.14).
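The per-polygon mean described above amounts to a zonal mean over two aligned grids. The sketch below shows the idea in Python on a tiny made-up example; the function name, the toy grids and the use of -9999 as the no-data value are assumptions for illustration, and the thesis performed this step in Matlab on the ASCII rasters.

```python
from collections import defaultdict


def zonal_mean(id_grid, value_grid, nodata=-9999):
    """Mean of value_grid within each zone of id_grid (grids must be aligned)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for id_row, value_row in zip(id_grid, value_grid):
        for polygon_id, value in zip(id_row, value_row):
            if polygon_id == nodata or value == nodata:
                continue                 # skip cells outside any polygon or the DTM
            sums[polygon_id] += value
            counts[polygon_id] += 1
    return {polygon_id: sums[polygon_id] / counts[polygon_id] for polygon_id in sums}


# Hypothetical 2 x 3 rasters: polygon IDs and altitudes in metres.
ids = [[1, 1, 2],
       [1, 2, -9999]]
dtm = [[400.0, 500.0, 900.0],
       [600.0, 1100.0, 1000.0]]
means = zonal_mean(ids, dtm)
```

Clipping both rasters to the same extent and cell size, as described in the text, is what makes the cell-by-cell pairing valid.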
Figure 2.13: Comparison between the raster of the polygons and the DTM
Subsequently, the mean elevation of the upstream basins was computed for every basin in Matlab,
using the matrix with the information of all the drained basins, the mean altitude
of each basin and the raster derived from bassisgeometrie (Figure 2.15).
2.2.3
River slope
Figure 2.14: Mean altitude
Figure 2.15: Mean upstream altitude
The river slope was calculated with GIS using the Spatial Analyst tool (Surface toolset, Slope), starting
from the DTM of Switzerland. The resolution of each cell was 25 m. The flow accumulation was
also calculated from the DTM. To do this, the DTM was filled using the Hydrology toolset
in ArcMap 10.2 (Spatial Analyst toolbox, Hydrology toolset, Fill), and then the flow direction was
computed using the Hydrology toolset’s Flow Direction tool. Using the flow direction as input, the flow
accumulation was calculated with the Hydrology toolset’s Flow Accumulation tool. Matlab was used to calculate the mean
slope for each polygon. The shapefile bassisgeometrie was rasterized with a resolution
of 25 m; then the flow accumulation, the slope and the bassisgeometrie rasters were clipped to the
same dimensions so that the corresponding matrices in Matlab shared the same indices. The flow accumulation
was used to select the pixels and compute the mean slope: first, all the pixels in the bassisgeometrie
raster with the same ID were selected. Then, they were sorted by decreasing flow accumulation,
and the slope values of the first pixels were used to calculate the mean slope. Figure 2.16
shows the log of the flow accumulation vs the log of the river slope.
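The pixel selection just described can be sketched as follows. The number of pixels kept after sorting is not stated in the text, so `n_top` is a hypothetical parameter, as are the function name and the toy values; this is an illustration of the idea rather than the Matlab procedure used in the thesis.

```python
def mean_channel_slope(pixels, n_top):
    """Mean slope of the n_top pixels with the largest flow accumulation.

    pixels: list of (flow_accumulation, slope) tuples for one polygon.
    """
    if not pixels or n_top < 1:
        raise ValueError("need at least one pixel and n_top >= 1")
    ranked = sorted(pixels, key=lambda p: p[0], reverse=True)
    selected = ranked[:n_top]            # pixels most likely to lie on the river
    return sum(slope for _, slope in selected) / len(selected)


# Hypothetical pixels of one polygon: (flow accumulation, slope).
pixels = [(10, 30.0), (5000, 2.0), (120, 12.0), (9000, 1.0), (40, 20.0)]
slope = mean_channel_slope(pixels, n_top=2)
```

Ranking by flow accumulation restricts the average to the cells that carry the most drainage, so hillslope pixels do not inflate the river slope.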
Figure 2.16: Plot of log flow accumulation vs. log river slope (legend: data, fitted line)
Figures 2.17, 2.18, 2.19 and 2.20 show, for each polygon, the maps of the mean flow
accumulation, the log of the mean flow accumulation, the log of the mean river slope and the log of
the mean drained river slope, respectively.
2.2.4
Temperature
The mean temperature for each polygon was calculated with ArcMap 10.2 using the Spatial Analyst
Tools, Zonal Statistics. For each of the 12 files containing mean monthly temperature, the spatial coordinate system was added. The function Zonal Statistics as Table summarizes the values of a raster
within the zones of another dataset and reports the results as a table. In this case, the feature zone
data was bassisgeometrie, the zone field was the polygon ID, and the input value raster was the raster
containing the mean monthly temperature. Figure 2.21 shows how Zonal Statistics works.
Once the table with the mean monthly temperature was created, it was exported and used in Matlab.
Figure 2.22 shows the map of the mean monthly temperature of January 2013.
Figure 2.17: Mean flow accumulation
Figure 2.18: Log mean flow accumulation
Figure 2.19: Log mean river slope
Figure 2.20: Log mean drained river slope
Figure 2.21: How Zonal Statistics works [2]
Figure 2.22: Mean monthly temperature January 2013
2.2.5
Geology
For this study, the different types of geology were grouped in six classes: alluvial rocks, peat/loam,
sedimentary rocks, sand/gravel, granite and gneiss. In Figure 2.23 we can see the map of the different
classes of geology.
Figure 2.23: Geology
(a) Geology before dissolve
(b) Geology after dissolve
Figure 2.24: Zoom of geology before and after dissolve
Figure 2.25: How Statistic Tabulate Intersection works [2]
As said before, the classification was made according to the information found in the literature
about the chemical and physical properties of rocks that can influence water quality [16], and according to their
presence and importance in Switzerland. The chemical constituents usually associated with water
quality are Ca, Mg and SO4, because they form the principal solutes derived from rock in
most stream systems. Some physical attributes are also important: rock strength (uniaxial compressive
strength) and rock hydraulic conductivity [16]. To compute the percentage of the different
geology classes for each polygon, a dissolve was performed on the file PY_surfaces_base.
The dissolve was used to aggregate the features of this file according to the
classification described above. Figure 2.24a and Figure 2.24b show a zoom on the geology
map before and after the dissolve.
Once the dissolve was performed, the percentage of each geology class in each polygon was
calculated. It was computed using ArcMap 10.2 and the Analysis Tools, Statistics, Tabulate Intersection tool.
Figure 2.25 shows how Tabulate Intersection works.
Geology class   Name
0               Other
1               Alluvial rocks
2               Peat and loam
3               Sedimentary rocks
4               Sand and gravel
5               Granite
6               Gneiss
Table 2.4: Geology legend
In the studied case, the result is a table like the one in Figure 2.26, with a separate row for
each geology class present in each polygon. Table 2.4 gives the legend
of the geology map.
The table in Figure 2.26 was then exported to Matlab in order to group, for each polygon, the
percentages of the different geology classes. Next, with Matlab, maps of the percentage of
each geology class in each polygon were created and, as was done for the altitude, the percentages of
the different classes over the upstream basins were computed.
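The grouping step can be illustrated with a short Python sketch that turns a long table (one row per polygon and class) into one row of percentages per polygon. The row layout (polygon ID, class ID, intersected area), the function name and the toy numbers are assumptions; the real table is the one shown in Figure 2.26 and the thesis did this step in Matlab.

```python
def class_percentages(rows, n_classes=7):
    """Pivot (polygon_id, class_id, area) rows into percentages per class.

    Classes are numbered 0..n_classes-1, following the legend of Table 2.4.
    """
    areas = {}
    for polygon_id, class_id, area in rows:
        areas.setdefault(polygon_id, [0.0] * n_classes)[class_id] += area
    return {polygon_id: [100.0 * a / sum(per_class) for a in per_class]
            for polygon_id, per_class in areas.items()}


# Hypothetical intersection rows: (polygon ID, geology class, area).
rows = [(1, 1, 30.0), (1, 3, 70.0), (2, 5, 25.0), (2, 6, 75.0)]
pct = class_percentages(rows)
```

Each polygon ends up with a fixed-length vector, which is the shape needed to add the geology classes as separate columns of the variable table.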
Figure 2.27 shows the maps of the percentage of each geology class in each basin.
According to Figure 2.27a, the percentage of alluvial rocks is higher near the rivers; for
example, a higher percentage of alluvial rocks is clearly visible near the Rhone river. Figure
2.27b shows that the percentage of peat and loam is higher in the upper part of Switzerland. According to Figure 2.14, this area
lies at a lower altitude than the lower part of Switzerland.
As Figure 2.27e and Figure 2.27f show, the percentages of granite and gneiss are higher in the lower part
of Switzerland, which is the area with higher altitude due to the presence of the Alps.
Figure 2.26: Extract of the geology table
(a) Geology class 1
(b) Geology class 2
(c) Geology class 3
(d) Geology class 4
(e) Geology class 5
(f) Geology class 6
Figure 2.27: Percentage of different classes of geology
2.2.6
Land cover
The land cover was processed in a similar way to the geology. Starting from the
GeoVITE data, the land cover was divided into 6 classes: rocks, forest, built up, agriculture, glacier and
water/lakes. To calculate the percentage of land cover for each class, an intersect between the
land cover file and bassisgeometrie was performed. Then the percentages of the different classes
for each polygon were computed with Matlab. Figure 2.28 shows a map of all the
land cover classes together, while Figure 2.29 has one map for each class. The maps in
Figure 2.29 represent the percentage of each land cover class in each basin. As
Figure 2.29a shows, the percentage of rocks is higher in the lower part of Switzerland, the
area where the Alps are; this is consistent with the previous map of altitude (Figure 2.14) and the
maps of geology classes 5 and 6 (Figure 2.27e and Figure 2.27f). Figure 2.29a also shows
some areas in the lower part of Switzerland where the percentage of rocks is very low
although the surrounding areas have a very high percentage of rocks. This is explained by the presence
of glaciers, as shown in the map of land cover class 5 (Figure 2.29e). Figure 2.29c is the map of
the percentage of built-up area. This map correctly shows the cities
of Switzerland as the areas with the highest percentage of built-up area. Similarly, the
lakes are correctly represented in Figure 2.29f.
Figure 2.28: Land cover
(a) Land cover class 1
(b) Land cover class 2
(c) Land cover class 3
(d) Land cover class 4
(e) Land cover class 5
(f) Land cover class 6
Figure 2.29: Percentage of different classes of land cover
2.2.7
Swiss pollutant register
The Swiss pollutant register data contains 1427 different facilities in total, and in the first analysis they
were all considered. Then only the waste and waste water management activities were considered. The
original data was in CSV format; it was first processed with Excel in order to keep only the useful
information and to select the waste and waste water management facilities. Subsequently, the two files
obtained (the pollutant releases file and the waste and waste water file) were imported into ArcMap 10.2
to create two new layers. To do that, the two files were first converted into tables with the Conversion
Tools (Excel to Table). Next, with the Data Management Tools (Layers and Table Views, Make XY
Event Layer), two point layers were created. Subsequently, the Analysis Tools (Overlay, Spatial Join)
were used to join the attributes of the point layers with the file basisgeometrie according to their
spatial relationship. Once obtained, these two layers were processed in Matlab and added to the table
of all the variables previously computed. Figure 2.30 shows a map of all the facilities,
and Figure 2.31 shows only the waste and waste water management facilities.
Figure 2.30: Pollutant releases
Figure 2.31: Waste and waste water management facilities
With Matlab, three new variables were computed for each of the two new layers. The first variable was the presence or absence of a pollutant release facility, encoded as a binary value:
1 for presence, 0 for absence. Then, using the matrix with the information of all the drained
basins, a variable counting the pollutant release facilities in all the basins draining into the
considered basin was computed. The last variable is the number of pollutant
release facilities in the considered basin itself. These three variables were also calculated for the
waste and waste water management facilities.
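The three variables can be sketched in Python as below. The function name, the toy inputs and the representation of the drained-basin matrix as a dictionary of upstream sets are assumptions for illustration; the thesis computed these variables in Matlab.

```python
def facility_variables(facility_basins, upstream, n_basins):
    """Return (presence, upstream_count, local_count), one value per basin.

    facility_basins: basin index of each facility (one entry per facility).
    upstream: dict mapping each basin to the set of basins that drain into it.
    """
    local_count = [0] * n_basins
    for basin in facility_basins:
        local_count[basin] += 1
    presence = [1 if count > 0 else 0 for count in local_count]
    upstream_count = [sum(local_count[u] for u in upstream[basin])
                      for basin in range(n_basins)]
    return presence, upstream_count, local_count


# Hypothetical network: basins 0 and 1 drain into 2, which drains into 3.
upstream = {0: set(), 1: set(), 2: {0, 1}, 3: {0, 1, 2}}
# Two facilities in basin 0 and one in basin 2 (made-up data).
presence, n_upstream, n_local = facility_variables([0, 0, 2], upstream, 4)
```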
Chapter 3
Maxent
3.1
Introduction to Maxent
MaxEnt was created by Steven Phillips, Miro Dudik and Robert Schapire with support from AT&T
Labs-Research, Princeton University and the Center for Biodiversity and Conservation, American Museum of Natural History.
MaxEnt is a kind of Species Distribution Model (SDM). These models are numerical tools that combine observations of a species’ presence with environmental variables that could affect the
suitability of the environment for that species [17]. They are used to predict the distribution of species
across a landscape. SDMs are used in many different domains, covering terrestrial, freshwater and
marine ecosystems [18], and they are used to address questions concerning the ecology, biogeography
and conservation of species [19].
There are different methods to apply these models, but the biggest difference among them is the
kind of species data that they use: either both presence and absence records, or
presence records only.
Presence data are the most common and can be found, for example, in natural history museums or
in herbaria. Absence data are usually quite rare, and even when they are available they are in some cases of uncertain
value [20].
Presence-only models are considered very valuable, and research based on
museum data is widely used [21], although such data can have problems as well. For example,
a species may not have been detected in an area because some factor caused its local
extinction. This can create misleading patterns in the presence data, because the missing detection
suggests that the area has unsuitable environmental conditions for that species [22].
3.2
Explanation of Maxent
To explain how MaxEnt works from a statistical point of view, let us start with Bayes’ rule:
\[ \Pr(y = 1 \mid z) = \frac{f_1(z)\,\Pr(y = 1)}{f(z)} \]
Consider a set of locations where the species has been observed within a landscape of
interest, $L$. Let $y = 1$ denote the presence of the species and $y = 0$
its absence. The vector of environmental covariates, which represents
the environmental conditions, is called $z$. The probability density of the covariates across $L$ is $f(z)$; $f_1(z)$ is
the probability density of the covariates across the locations in $L$ where the species is present, and $f_0(z)$ the probability
density of the covariates where the species is absent. The quantity we would like to estimate is the probability of presence of the
species conditioned on the environment, $\Pr(y = 1 \mid z)$. It is possible to model $f_1(z)$ using the presence-only data, but $f_1(z)$ alone does not approximate the probability of presence. We can also model $f(z)$ using
presence/background data.
To calculate $\Pr(y = 1 \mid z)$, the probability of presence of the species conditioned on the
environment, using Bayes’ rule, we also need $\Pr(y = 1)$, the prevalence of the species in
the landscape [22]. Formally, however, this quantity cannot be estimated exactly from presence-only
data [23]. This is an important limitation of presence-only data. Absence data, however, can also
suffer from detection problems [24], so presence-absence data too can lead to a poor
estimate of prevalence.
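A small numeric illustration, with made-up density values, shows why the prevalence matters: the same ratio $f_1(z)/f(z)$ gives very different probabilities of presence depending on the assumed value of $\Pr(y = 1)$. The function name and all numbers below are hypothetical.

```python
def presence_probability(f1_z, f_z, prevalence):
    """Bayes' rule: Pr(y=1 | z) = f1(z) * Pr(y=1) / f(z)."""
    return f1_z * prevalence / f_z


# Same environment z (density ratio f1/f = 2), two assumed prevalences.
rare = presence_probability(0.4, 0.2, prevalence=0.1)     # about 0.2
common = presence_probability(0.4, 0.2, prevalence=0.3)   # about 0.6
```

Presence-only data constrain the ratio but not the prevalence, so the two results cannot be told apart without extra information.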
Another important limitation is that, in species distribution models based on presence-only data,
sample selection bias has a greater impact than it has on models that use presence-absence data
[25]. The data held in herbaria or museums are typically records of species’ occurrence collected
by individuals, and can therefore be concentrated around more accessible locations such as roads or rivers [26].
This data can also be autocorrelated when, for example, records are collected from nearby locations
within a small area. Additionally, the sampling intensity and the collection
methods can vary across the study area [27]. The bias problem can be reduced by using background
data with the same bias as the presence data [25].
In SDMs the environmental factors that are relevant for habitat suitability are the so-called “independent variables” or “covariates” of the statistical literature. In this Master’s thesis they
are temperature, land cover, geology, river slope, altitude and presence of pollutant release facilities.
A species distribution usually has a complex response to these factors, so it is recommended to use
nonlinear functions for these kinds of problems [28].
The function fitted by MaxEnt is defined over 6 feature classes: linear, product, quadratic, hinge,
threshold and categorical. The linear feature class corresponds to the variable itself; it constrains
the mean of the variable under the estimated distribution to be close to its mean at the sample
locations. The quadratic class corresponds to the square of the variable and imposes a constraint on the
variance: the variance of the variable under the estimated distribution should be close to its variance in the
sample. The product class contains the products of all pair-wise combinations of covariates, which makes
it possible to fit simple interactions between variables. The threshold class
represents a step in the fitted function, so that the response can differ below and above the threshold.
The hinge class is similar to the threshold class, but it allows a change in
the gradient of the response. The hinge class is usually not used together with the linear class because they are very similar;
a linear feature can be created from a hinge feature. The categorical class is a binary indicator that shows
whether a categorical variable belongs to one class or not [29]. In this study, even though some variables
represent different classes (geology or land cover classes), the categorical feature
class is not used, because the percentage of each class in each polygon is available. In short,
MaxEnt fits the model on features that are transformations of the environmental variables, which
makes it possible to model complex relationships between covariates.
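The feature classes for continuous variables can be written as simple transformations. The sketch below is an illustration only, not MaxEnt's internal code: the function names are hypothetical, the hinge is written in its forward form scaled to reach 1 at the maximum of the variable, and the knot positions are chosen by hand here whereas MaxEnt selects them during fitting.

```python
def linear(x):
    """Linear feature: the variable itself."""
    return x


def quadratic(x):
    """Quadratic feature: the square of the variable."""
    return x * x


def product(x1, x2):
    """Product feature: interaction between two variables."""
    return x1 * x2


def threshold(x, knot):
    """Threshold feature: a step from 0 to 1 at the knot."""
    return 1.0 if x > knot else 0.0


def hinge(x, knot, x_max):
    """Forward hinge feature: 0 below the knot, then linear up to 1 at x_max."""
    if x < knot:
        return 0.0
    return (x - knot) / (x_max - knot)


# Hypothetical January temperature of 5 degrees, with a knot at 0 and a maximum of 10.
features = [linear(5.0), quadratic(5.0), threshold(5.0, 0.0), hinge(5.0, 0.0, 10.0)]
```

A hinge with its knot at the minimum of the variable is a rescaled linear feature, which is why the two classes are rarely needed together.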
By default, MaxEnt uses each feature type only if enough training samples are available:
all feature types are used if there are at least 80 training samples. With
between 15 and 79 samples MaxEnt uses linear, quadratic and hinge features; with between 10
and 14 samples it uses linear and quadratic features; and below 10 samples it uses only linear
features [30].
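The sample-size rule stated above can be written as a small lookup. This is a sketch of the rule as described in the text, not MaxEnt's own code; the function name is hypothetical and the categorical class is left out because it applies only to categorical variables.

```python
def default_feature_classes(n_samples):
    """Feature classes enabled by default for a given number of training samples."""
    if n_samples >= 80:
        return ["linear", "quadratic", "product", "threshold", "hinge"]
    if n_samples >= 15:
        return ["linear", "quadratic", "hinge"]
    if n_samples >= 10:
        return ["linear", "quadratic"]
    return ["linear"]


# With several hundred presence records every feature class is available.
classes = default_feature_classes(236)
```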
As mentioned previously, the landscape of interest is called $L$, and $L_1$ is the subset of
$L$ where the species is present. The distribution of the covariates in the landscape is represented by a finite
number of sample points called the background sample. These can be given as a grid of pixels
(in ESRI ASCII grid format or Diva-GIS format) or in SWD (samples-with-data)
format, as in the case studied here. By default, MaxEnt uses 10,000 random samples from the
background locations, but this number can be modified.
Referring again to Bayes’ rule, MaxEnt uses the presence sample points
and the background samples to estimate the ratio $f_1(z)/f(z)$. It makes an estimate of $f_1(z)$ by choosing the one that is closest to
$f(z)$, the null model. Indeed, without any occurrence data we have no reason to expect the
species to prefer particular environments over others, so we would assume that it is distributed at random.
The distance of $f_1(z)$ from $f(z)$ is measured as the relative entropy of $f_1(z)$ with respect
to $f(z)$.
Minimizing the relative entropy can also be seen as maximizing the entropy of the “raw” distribution. The “raw” distribution is $\pi(x) = \Pr(x \mid y = 1)$, the probability distribution over the locations $x$.
It expresses the probability of finding the species in pixel $x$, given that the species is present
[30].
• Box entropy
The maximum-entropy principle states that the best approximation of an unknown probability distribution
is the one that respects all the constraints on it and, among the distributions satisfying them, has the maximum entropy,
i.e. the most unconstrained one [31].
The unknown probability distribution is $\pi$, defined over a set $X$ of sites in the study area.
Each element $x$ of $X$ can be considered as a point. Each of these points has a non-negative probability, and all
these probabilities sum to one. The approximation of the probability distribution is called $\hat{\pi}$ [32].
The entropy of $\hat{\pi}$ is:
\[ H(\hat{\pi}) = -\sum_{x \in X} \hat{\pi}(x) \ln \hat{\pi}(x) \]
The entropy is non-negative. It can be described as a “measure of how much ‘choice’
is involved in the selection of an event” [33]. A distribution with higher entropy involves more choices
and is less constrained. The maximum-entropy principle therefore says that no baseless
constraints should be applied to $\hat{\pi}$ [32].
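The entropy formula can be checked numerically with a few lines of Python. The two distributions below are made up for illustration; the uniform one has the largest possible entropy over four sites, $\ln 4$.

```python
import math


def entropy(probabilities):
    """Shannon entropy H = -sum p ln p, using the convention 0 ln 0 = 0."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]   # no site preferred over another
peaked = [0.97, 0.01, 0.01, 0.01]    # almost all mass on one site
h_uniform = entropy(uniform)
h_peaked = entropy(peaked)
```

The peaked distribution encodes a strong constraint and has lower entropy, which is the sense in which maximum entropy avoids unsupported assumptions.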
In MaxEnt some constraints are imposed so that the solution reflects the information in the presence records. For
example, if one of the analyzed covariates is the temperature in January, the mean temperature in
January under the estimate of $f_1(z)$ will be close to the mean temperature in January at the locations
where the species has been found [22].
To weight the contribution of each feature, coefficients are defined. The vector of
these coefficients is called $\beta$, while the vector of the features is called $h(z)$ [22].
MaxEnt looks for the coefficients $\beta$ that satisfy the constraints without
overfitting the data, since an overfitted model would have limited power of generalization. To avoid
this problem, MaxEnt allows an error bound to be set, that is, a maximum allowed deviation from the sample
feature means. The features are first rescaled to the range from 0 to 1, and then an
error bound $\lambda_j$ is computed for each feature. These error bounds could be estimated using
cross-validation, for example. However, to simplify model fitting there are default settings that were tuned and validated on different datasets [30]. The default parameters can be changed
by the user if necessary.
• Box Cross-validation
Cross-validation is a method used to resample data in order to train and test the generated models. It
is also called k-fold cross-validation because the data set is divided into k (usually 5 or 10) mutually
exclusive subsets of about the same size, called “folds”. To compute the
model performance, each fold is removed in turn, so that one fold is excluded and k-1
are retained. The model is fitted on the k-1 retained folds and used to predict the omitted one [34]. This
process is repeated k times so that each fold is used exactly once as validation data. Each fold is
also used k-1 times to fit the model, in different combinations with the other folds. The k
results can be averaged to obtain a more accurate estimate.
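A minimal sketch of the fold construction is given below, using only the Python standard library. The function name, the fixed random seed and the record count of 236 are illustrative choices, not settings taken from the thesis or from MaxEnt.

```python
import random


def k_fold_splits(n_records, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)          # random assignment to folds
    folds = [indices[i::k] for i in range(k)]     # k subsets of near-equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical use: 236 records split into 5 folds.
splits = list(k_fold_splits(236, 5))
```

Each record lands in exactly one test fold and in the training set of the other k-1 iterations, as described above.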
These error bounds, $\lambda_j$, regulate how focused or closely fitted the output distribution will be. The closeness of fit can be modified by changing the regularization multiplier (1.0 by default).
If the value is smaller, the output distribution will be more localized and will fit the
presence records more closely, but it may be overfitted. Conversely, a larger value will yield a more spread-out
prediction [29].
• Box Regularization
MaxEnt can overfit the training data; the regularization parameter is there to avoid this. Regularization affects how focused the output distribution is: the parameter allows us to smooth
the distribution or to make it fit the sample data more closely. Regularization is a frequently used approach
to model selection. It relaxes the constraints on the variables in order to trade off model fit against model
complexity. As said before, the regularization parameter is $\lambda_j$:
\[ \lambda_j = \lambda \sqrt{\frac{s^2[h_j]}{m}} \]
where $s^2[h_j]$ is the variance of feature $h_j$ over the $m$ presence sites, and $\lambda$ is the tuning parameter for that
feature class [22]. Regularization forces the model to focus on the most important features.
Such models are less affected by overfitting the training data because they have fewer parameters. To
find the MaxEnt probability distribution, the algorithm starts with all the feature weights at 0 (which gives the uniform distribution) and then
repeatedly adjusts them [32].
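The regularization formula of this box can be evaluated directly. In the sketch below the function name and the feature values are made up, and the variance is the sample variance with an $m - 1$ denominator, which is an assumption because the text does not specify the estimator.

```python
import math


def feature_regularization(feature_values, lam=1.0):
    """lambda_j = lam * sqrt(s^2[h_j] / m) over the m presence sites."""
    m = len(feature_values)
    mean = sum(feature_values) / m
    s2 = sum((v - mean) ** 2 for v in feature_values) / (m - 1)  # sample variance
    return lam * math.sqrt(s2 / m)


# Hypothetical feature already rescaled to [0, 1], observed at 3 presence sites.
few_sites = feature_regularization([0.0, 0.5, 1.0])
# The same spread of values observed at 12 sites gives a tighter error bound.
many_sites = feature_regularization([0.0, 0.5, 1.0] * 4)
```

The bound shrinks as the number of presence sites grows, so the constraints are enforced more strictly when more data support them.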
MaxEnt has three output formats: raw, cumulative and logistic. Logistic is the default output
and is the easiest to understand. It gives, for each pixel, a probability of presence between 0 and 1.
The values are rescaled in a nonlinear way in order to give a better interpretation. The probability
of presence depends on the details of the sampling design, such as for example the size of the plot.
It also depends on the arbitrary value imposed for the probability presence at sites with the “typical”
conditions for the species. Usually this value is set to 0.5, though it can be changed using the “default
prevalence” parameter. The raw output is the probability of presence (with range 0-1). The raw output
values are usually very small because the sum over all the cells used during the training is 1. In the
cumulative output format, the value for each grid cell is the sum of the probabilities of all grid cells
with lower or equal probability to the current grid cell, times 100. For example, if the value is 45 this
means that 45% of presences would be predicted as absences if we are using this value as a threshold
to create a presence-absence surface. The range of this output is between 0 and 100 [30].
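The definition of the cumulative format can be sketched in a few lines of Python. This is only an illustration of the definition given in the text, not MaxEnt's implementation; the function name `cumulative_output` and the four raw values are hypothetical.

```python
def cumulative_output(raw):
    """Convert raw values to the cumulative format (0-100).

    For each cell, sum the normalized raw values of all cells whose
    value is lower than or equal to that of the cell, times 100.
    """
    total = sum(raw)
    probs = [r / total for r in raw]  # make sure the values sum to 1
    return [100.0 * sum(q for q in probs if q <= p) for p in probs]


# Hypothetical raw output for a landscape of four cells.
raw = [0.1, 0.2, 0.3, 0.4]
print(cumulative_output(raw))
```

The cell with the highest raw value always receives a cumulative value of 100, and a cell with a cumulative value of 30 belongs to the set of least suitable cells that together hold 30% of the distribution.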
Chapter 4
Results
4.1 Data used in Maxent
As discussed previously, the species data were obtained from the Office fédéral de l’environnement (OFEV). They contain information about PKD and the names of the water bodies where PKD had been detected in fish. In total there are 504 records, of which 236 describe a positive presence of PKD. With this number of samples MaxEnt is able to use all the feature types, because the constraints on the minimum number of samples are respected. For this study only the records describing a positive presence of PKD were considered.
In total, 46 environmental variables were used; the process used to obtain them was described previously. The environmental layers are stored in SWD (samples-with-data) format. Figure 4.1 shows how the file of environmental variables is structured. The first column is ignored by MaxEnt; in this study it was set to 1. The second and third columns contain the geographic X and Y coordinates, respectively, of the center of each polygon. The remaining columns contain the environmental variables.
In Figure 4.1 we can see the environmental variables. First comes the mean altitude of each polygon, followed by 6 classes of geology (each cell contains the percentage of the polygon covered by that geology class). Next come the 6 classes of land cover (again, each cell contains the percentage of the polygon covered by that land cover class). Columns 17 to 28 contain the mean monthly temperatures for each polygon (from January to December). Column 29 contains the mean river slope for each polygon. From column 30 onwards, the altitude, the classes of geology and land cover, and the mean river slope are repeated, this time calculated as the mean over all the polygons that drain into the considered polygon. All of these variables were treated as continuous. Finally, there are 6 variables concerning the pollutant release facilities. Table 4.1 lists all the environmental variables used in this study.
Figure 4.1: Extract of the csv file used for Maxent
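The column layout described above (an ignored first column, X and Y coordinates, then one column per environmental variable) can be read with a short Python sketch. The header names `species`, `x` and `y`, the function name `read_swd` and all numeric values are assumptions for illustration; only the variable labels `altitu`, `geo_01` and `Rslope` are taken from the labels of Figure 4.2.

```python
import csv
import io

# Hypothetical two-row extract with only three of the variables.
SAMPLE_SWD = (
    "species,x,y,altitu,geo_01,Rslope\n"
    "1,600000.0,200000.0,540.0,35.5,1.2\n"
    "1,610000.0,205000.0,1320.0,0.0,4.8\n"
)


def read_swd(text):
    """Parse SWD-style text into variable names and per-polygon records."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    var_names = header[3:]  # columns 4 onwards are the variables
    records = []
    for row in reader:
        if not row:
            continue
        records.append({
            "species": row[0],  # first column, ignored by the model
            "x": float(row[1]),
            "y": float(row[2]),
            "values": dict(zip(var_names, map(float, row[3:]))),
        })
    return var_names, records


names, records = read_swd(SAMPLE_SWD)
print(names)
print(records[0]["values"]["altitu"])
```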
4.2 Correlation
To detect linear correlation among the variables, the matrix R was calculated with Matlab. R is the matrix of correlation coefficients calculated from an input matrix X, which in our case was the matrix containing all the local variables. Each coefficient ranges between -1 and 1, where 1 means perfect positive correlation, 0 no correlation, and -1 perfect negative correlation. The matrix R is related to the covariance matrix C by:
$$R(i, j) = \frac{C(i, j)}{\sqrt{C(i, i) \cdot C(j, j)}}$$
Each coefficient of this matrix is defined as the covariance of the two variables divided by the product of their standard deviations. The matrix R is symmetric and, as we can see in Figure 4.2, all the elements on its diagonal are equal to 1. In the matrices the variables are ordered as in Table 4.1. For this study the correlation matrix was computed separately for the local variables and for the upstream variables; they are shown in Figure 4.2a and Figure 4.2b, respectively.
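The relation between R and C can be illustrated with a small pure-Python sketch. It is not the Matlab code used in this study; the function names and the altitude and temperature values are hypothetical, and the sample covariance (n - 1 denominator) is an assumption.

```python
import math


def covariance_matrix(columns):
    """Sample covariance matrix C of a list of equally long columns."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    k = len(columns)
    return [
        [
            sum((columns[i][t] - means[i]) * (columns[j][t] - means[j])
                for t in range(n)) / (n - 1)
            for j in range(k)
        ]
        for i in range(k)
    ]


def correlation_matrix(columns):
    """R(i, j) = C(i, j) / sqrt(C(i, i) * C(j, j))."""
    c = covariance_matrix(columns)
    k = len(c)
    return [
        [c[i][j] / math.sqrt(c[i][i] * c[j][j]) for j in range(k)]
        for i in range(k)
    ]


# Hypothetical polygons: temperature falls linearly with altitude.
altitude = [400.0, 800.0, 1200.0, 1600.0]
temperature = [10.0, 8.0, 6.0, 4.0]
for row in correlation_matrix([altitude, temperature]):
    print(row)
```

In this toy example the off-diagonal coefficient is close to -1, the kind of strong negative correlation that a dark cell in Figure 4.2 represents.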
4.3 The runs
Several runs were performed with MaxEnt; this section explains the different runs and the parameters used. The html output of MaxEnt contains information about the model performance, the importance of each variable and its influence. It also reports the chosen output settings and links to the data files used.
4.3.1 Run 1
For the first run, the first 39 variables were used and all the default parameters were kept.
4.3.1.1 Analysis of omission/commission
The html output of MaxEnt begins with two graphs and a table that evaluate model performance and bias. The first graph shows the omission rate and predicted area for the species, which in this study is called 1 for simplicity. The omission rate, in Figure 4.3, is the fraction of the test localities that fall into pixels not predicted as suitable for the species, and the predicted area is the fraction of all the pixels that are predicted as suitable for the species (Phillips et al. [32]). This plot shows the relationship between the predicted values of occurrence probability (in this case for the training samples) and the proportion of occurrences selected. In other words, it shows how the training omission and the predicted area vary as the cumulative threshold changes. The predicted omission rate is a straight line, by definition of the cumulative output format.
Table 4.1: Environmental variables for the distribution of Fredericella sultana
[Figure 4.2a, Matrix R for local variables. Variables in axis order: altitu, geo_01, geo_02, geo_03, geo_04, geo_05, geo_06, land01, land02, land03, land04, land05, land06, temp01 to temp12, Rslope. Color scale from -1 to 1.]
(a) Correlation matrix for the local variables
[Figure 4.2b, Matrix R for the upstream variables. Variables in axis order: D_alti, D_geo1, D_geo2, D_geo3, D_geo4, D_geo5, D_geo6, Dland1, Dland2, Dland3, Dland4, Dland5, Dland6, Dslope, P_poll, ND_pol, N_poll, Pwaste, NDwast, Nwaste. Color scale from -1 to 1.]
(b) Correlation matrix for the upstream variables
Figure 4.2: Correlation matrices for the variables
Figure 4.3: Omission and predicted area for F. sultana- run 1
The second plot is the receiver operating characteristic (ROC) curve, shown in Figure 4.4. It illustrates how well the model performs in predicting occurrences compared to a random selection of points. An important advantage of ROC analysis is that the area under the ROC curve provides a single measure of model performance that is independent of the threshold. The random model is represented by a straight line, the bisector. The perfect model, on the other hand, would appear as a right angle with its corner at the top left of the plot. A good curve maximizes sensitivity for low values of the false-positive fraction. The higher the area under the curve (AUC), the better the model is performing. The AUC ranges between 0 and 1. An AUC close to 0.5 indicates that the model is not much better than a random model (whose AUC is 0.5), while 1 is the AUC of the perfect model. Values below 0.5 are also possible; in that case the model is performing worse than the random model.
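The AUC can be understood as the probability that a randomly chosen presence site receives a higher score than a randomly chosen comparison site. The sketch below computes it by direct pairwise comparison; the function name `auc` and the scores are hypothetical, and the use of background points as the comparison set is an assumption that reflects common practice with presence-only data.

```python
def auc(presence_scores, background_scores):
    """AUC as the fraction of (presence, background) pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(presence_scores) * len(background_scores))


# Hypothetical model scores.
presence = [0.9, 0.8, 0.4]
background = [0.5, 0.3, 0.2, 0.1]
print(auc(presence, background))
```

A model that gives every presence site a higher score than every background site reaches 1, while a model that cannot separate them at all stays around 0.5.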
Figure 4.4: ROC for Fredericella sultana-run 1
After that, the html file contains the thresholds table. This table provides some common thresholds and the corresponding omission rates used to separate “suitable” from “unsuitable” conditions. Sensitivity, also called the true positive rate, measures the proportion of positives that are correctly identified as positive; specificity, also called the true negative rate, measures the proportion of negatives that are correctly identified as negative.
In Maxent, thresholds are used when we want to display the output in a discrete way, as suitable and unsuitable habitat. To do that it is necessary to choose which threshold value is best; the threshold is the minimum probability at which a habitat is considered suitable. There is no fixed rule for choosing the best threshold: the decision requires an understanding of the species considered and of the objective of the map (Young, 2011). In this case study no threshold was chosen, because we are interested in creating a probability map of the presence of Fredericella sultana and not just a suitable-unsuitable habitat map.
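The quantities defined above can be computed for any threshold as in the following sketch. It assumes, for illustration only, that true presences (label 1) and true absences (label 0) are both known, which is not the case for the presence-only data of this study; the function name and the scores are hypothetical.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (sensitivity, specificity, omission rate) at a threshold.

    A site is predicted suitable when its score is >= threshold.
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, 1.0 - sensitivity


# Hypothetical scores for three presences and three absences.
scores = [0.9, 0.7, 0.4, 0.2, 0.6, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(rates_at_threshold(scores, labels, threshold=0.5))
```

The omission rate is the complement of the sensitivity, so raising the threshold increases omission while shrinking the predicted area.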
4.3.1.2 Response curves
It is usually interesting to understand how each variable influences the prediction of MaxEnt, which variables have the greatest influence on the model, and how these variables influence the presence of the species.
To show how the prediction depends on the variables, the output of MaxEnt provides a set of plots, Figure 4.5. The first set, in Figure 4.5a, shows how the prediction changes as each environmental variable is varied while all the other variables are kept at their average sample value. The x-axis of each plot shows the value of the analyzed variable and the y-axis the predicted probability of suitable conditions when all other variables are set to their average values.
The second set of response curves, in Figure 4.5b, represents the models created using only one variable at a time. There are as many models as variables considered.
4.3.1.3 Analysis of variable contributions
There are two methods which can be used to understand the importance of each variable. First, MaxEnt gives the percent contribution of each variable to the final model. To do this, it records how much the overall model gain improves when small changes are made to the coefficients of the features of that particular variable. At the end of the run, all of these small changes are taken into account to compute each variable's share of the total contribution. However, if the variables are strongly correlated, here too the result can be difficult to interpret.
Figure 4.6 shows the output of this method.
As we can see in Figure 4.6, the biggest contribution (25.1%) comes from variable number 39, which represents the percentage of class 3 land cover (built up) in the polygons that drain into the considered polygon. Next, variable 29, the mean river slope, shows the second largest contribution with 12.8%. After that we have variable number 4, the mean altitude, with a percent contribution of 8.7%. A similar percent contribution (8.5%) is that of variable 26, the mean temperature in October.
Another method to compute the importance of each variable is the jackknife approach. In this method MaxEnt excludes one variable at a time in order to collect information about the importance of each variable in explaining the species distribution and about the uniqueness of the information it provides. To do that, MaxEnt first creates a set of models, each excluding one variable in turn, and then a set of models each using one variable in isolation. Figure 4.7 shows the results of the jackknife analysis.
The dark blue bars show how well the model performs using only that variable. The red bar shows the performance of the model built with all the variables. The light blue bars show the performance of the model built without the considered variable.
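The structure of the jackknife analysis can be sketched independently of the model being fitted. In the sketch below, `score` is a stand-in for "train a model on these variables and return its gain"; it is a hypothetical additive toy function, not a MaxEnt call, and the variable names and gains are invented.

```python
def jackknife(variables, score):
    """Return the three quantities shown as bars in a jackknife plot.

    full:    score of the model with all variables (red bar)
    without: score with each variable left out (light blue bars)
    only:    score with each variable alone (dark blue bars)
    """
    full = score(tuple(variables))
    without = {
        v: score(tuple(u for u in variables if u != v)) for v in variables
    }
    only = {v: score((v,)) for v in variables}
    return full, without, only


# Toy stand-in for model training: each variable adds a fixed gain.
TOY_GAIN = {"var39": 0.5, "var29": 0.25, "var04": 0.125}


def toy_score(subset):
    return sum(TOY_GAIN[v] for v in subset)


full, without, only = jackknife(list(TOY_GAIN), toy_score)
print(full, without["var39"], only["var39"])
```

With a real model the scores are not additive: a variable whose information is shared with others barely lowers the score when removed, even if it scores well on its own.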
4.3.2 Run 2
For this run the inputs were the same as for Run 1, but the default setting for the “random test percentage” was changed to 25. This setting tells the program to randomly set aside 25% of the sample records for testing, which allows MaxEnt to perform some basic statistical analysis.
The main differences between this run and the previous one are described below. The main results are the same, but more statistical analyses are available.
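Setting aside a random 25% of the records can be pictured with the short sketch below. It is only an illustration of the idea: the function name, the fixed seed and the rounding rule are assumptions, and MaxEnt's own sampling may differ in detail.

```python
import random


def random_test_split(records, test_percentage=25, seed=0):
    """Randomly split records into (training, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_percentage / 100)
    return shuffled[n_test:], shuffled[:n_test]


# The 236 positive records of this study, represented by their indices.
training, test = random_test_split(range(236), test_percentage=25)
print(len(training), len(test))
```

With 236 presence records, a 25% test share corresponds to 59 test records and 177 training records.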
4.3.2.1 Analysis of omission/commission
The first difference is in the first plot, Figure 4.8, which contains one more line: the omission on test samples. The test omission line can sometimes lie below the predicted omission line, although this is not the case in this study. Such a result could be due to dependence between the test and the training data.
(a) Marginal responding curves- run1
(b) Single variable responding curves-run 1
Figure 4.5: Responding curves- run1
Figure 4.6: Importance of different variables-run 1
Figure 4.7: Jackknife-run 1
Figure 4.8: Omission and predicted area for Fredericella sultana- run 2
The second difference is in the ROC plot. Here too, in Figure 4.9, there is one more line, that of the test data. Previously the training and test lines overlapped, whereas here the data were divided into two parts, one for training and one for testing. The training line usually has a higher AUC than the test line. The training line (the red one) represents the “fit” of the model to the training data. The test line (the blue one), on the other hand, shows the fit of the model to the testing data, and it is used to assess the real predictive power of the model.
The next difference is in the table of thresholds and omission rates, Figure 4.10. The table now also contains the test omission rate and the P-value, because some of the sample records were used for training and others for testing. The P-value is the result of a hypothesis test: it is the probability of obtaining a result equal to or “more extreme” than the observed one, assuming that the null hypothesis is true. The P-value is compared with a significance level, usually 5% or 1%. If the P-value is equal to or smaller than the significance level, the null hypothesis is rejected in favor of the alternative hypothesis.
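To make the idea of the P-value concrete, the sketch below assumes a one-sided binomial test: under the null hypothesis each test point falls in the predicted area with a probability equal to the fraction of the study area predicted as suitable. The choice of this test, the function name and the numbers are assumptions for illustration; the text above does not specify how MaxEnt computes its P-values.

```python
from math import comb


def binomial_p_value(n_test, n_predicted, area_fraction):
    """P(at least n_predicted of n_test points fall in the predicted area)
    when each point does so independently with probability area_fraction."""
    return sum(
        comb(n_test, k)
        * area_fraction ** k
        * (1.0 - area_fraction) ** (n_test - k)
        for k in range(n_predicted, n_test + 1)
    )


# Hypothetical case: 50 of 59 test points fall in an area covering 30%
# of the pixels.
p = binomial_p_value(n_test=59, n_predicted=50, area_fraction=0.3)
print(p < 0.01)
```

A very small value means that a random prediction with the same predicted area would almost never capture so many test points.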
4.3.2.2 Analysis of variable contributions
Another difference is in the analysis of the variable contributions: the most important variables are now slightly different, as shown in Figure 4.11. The most important one is still the percentage of land cover of class 3 in the polygons that drain into the considered polygon, now with a contribution of 28%. The second and third most important variables are switched: the mean altitude has a contribution of 15.5% and the mean river slope has a contribution of 10.2% (in Run 1 they were 8.7% and 12.8% respectively). Then comes variable 30, the mean altitude of the polygons that drain into the considered polygon, with a percent contribution of 9.7% (previously 6.5%). Variable 26, the mean temperature in October, loses importance, dropping from 8.5% to 1.5%.
Figure 4.9: ROC- run 2
Figure 4.10: Thresholds table-run 2
The jackknife of regularized training gain is also slightly different, but the environmental variable with the highest gain when used in isolation, and the one that decreases the gain the most when omitted, is still number 39, as shown in Figure 4.12.
The biggest difference in the jackknife analysis is the presence of 2 additional plots: one uses test gain, Figure 4.13b, and the other uses AUC in place of training gain, Figure 4.13a. Comparing these 3 plots (Figure 4.13 and Figure 4.12) can be very useful and can provide additional information. In this case all 3 plots show that the most effective single variable for predicting the distribution of the occurrence data is variable 39. We can also see that variable 19, the mean temperature in March, increases the test gain and the AUC even though it was barely used in the model built with all the variables (0.2%).
We can also see that in the test gain and AUC plots some of the light blue bars are longer than the red one, especially for variable 30 (the mean altitude of the polygons that drain into the considered polygon).
4.3.3 Run 3 and run 4
In these runs all the parameters were kept at the same values as in the previous run, with the only exception of the feature class. Instead of the “Auto features” used in the previous 2 runs, “Threshold features” were selected in run 3 and “Hinge features” in run 4.
The relevant difference is in the response curves. Figure 4.14a shows the response curves of the run with threshold features, and Figure 4.14b those of the run with hinge features.
4.3.4 Run 5
In this run all the parameters were set as in run 1, except that the “replicates” parameter was set to 10.
The “replicates” parameter is used to perform multiple runs for the same species. In this analysis the form of replication used was cross-validation (the default). In this case the occurrence data are randomly divided into a number of mutually exclusive subsets of equal size, called “folds”. The model performance is evaluated by removing each subset in turn: the omitted subset is used for evaluation and all the others are used to fit the model. This run produces 11 html pages, one for each subset and one summarizing the statistical information for the cross-validation.
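The division into folds described above can be sketched as follows. The function name and the fixed seed are hypothetical, and the way records are assigned to folds is an assumption; the sketch only shows that every record is used for evaluation exactly once.

```python
import random


def k_fold_indices(n_records, k, seed=0):
    """Return k (training, test) index pairs with disjoint test folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # sizes differ by at most 1
    splits = []
    for fold in folds:
        held_out = set(fold)
        training = [i for i in indices if i not in held_out]
        splits.append((training, fold))
    return splits


# 236 presence records and 10 replicates, as in run 5.
splits = k_fold_indices(236, k=10)
print([len(test) for _, test in splits])
```

Since 236 is not a multiple of 10, the folds in this sketch contain 23 or 24 records each rather than exactly the same number.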
4.3.4.1 Analysis of omission/commission
In the html file that summarizes the statistical information, the first plot, Figure 4.15, shows the test omission rate and predicted area as a function of the cumulative threshold, averaged over the replicate runs.
The mean omission on test data is represented in light blue, and the mean omission plus or minus one standard deviation in yellow. The mean predicted area is represented in red and its standard deviation in blue.
Then there is the plot of the receiver operating characteristic (ROC) curve, Figure 4.16, again averaged over the replicate runs. The average test AUC is 0.895 and the standard deviation is 0.028.
Figure 4.11: Importance of different variables-run 2
Figure 4.12: Jackknife-regularized training gain-run 2
(a) Jackknife-AUC- run2
(b) Jackknife-test gain-run 2
Figure 4.13: Jackknife- run2
(a) Responding curves- run 3
(b) Responding curves- run 4
Figure 4.14: Responding curves run 3 and 4
Figure 4.15: Average omission and predicted area for F. sultana-run 5
Figure 4.16: ROC- run 5
4.3.4.2 Response curves
Then we have the response curves: the single-variable response curves and the marginal response curves. We can notice that the single-variable response is, in general, less variable than the marginal one. For variable 39, Figure 4.17a shows the single-variable response and Figure 4.17b the marginal one.
4.3.4.3 Analysis of variable contributions
An additional difference from the other runs is in the jackknife test, Figure 4.18. Here the environmental variable that decreases the gain the most when omitted is the mean river slope (number 29), which therefore appears to carry the most information that is not present in the other variables. In all the previous runs it was variable 39.
4.3.5 Run 6
In this run the mean river slope of all the polygons that drain into the considered polygon was added (as variable number 43), and the variables concerning the mean monthly temperature were removed, with the only exception of the mean temperature of May. All the default settings were maintained except the random test percentage, which was set to 25.
There are no important differences, but we can notice that the new variable has a contribution of 5.6%. Figure 4.19 shows an extract of the 2 tables of variable contributions for run 2 and run 6.
The map in Figure 4.20 was obtained from the output of MaxEnt for run 6.
The data, in csv format, were imported into ArcMap 10.2 and converted into a layer using the Data Management Tools (Layers and Table Views - Make XY Event Layer). This produces a point layer in which each point carries the predicted probability of presence computed by MaxEnt. The value of each point was then added to the corresponding polygon of the bassisgeometrie layer by means of a Spatial Join, available in the Analysis Tools of ArcMap 10.2.
4.3.6 Run 7
This run was performed with the same input as run 6 but with the replicates setting at 10, so that cross-validation was performed. The results are very similar to those of the cross-validation obtained in run 5.
4.3.7 Run 8
In this run the “regularization multiplier” parameter was changed. This parameter allows us to change how focused the output distribution is. The default value is 1.0; the smaller the parameter, the more localized the output distribution and the closer the fit to the given presence records. This can lead to overfitting of the data and to a loss of the model's ability to generalize to other data. A larger value of this parameter produces a more spread-out distribution. For this run the parameter was set to 3, and the other inputs were the same as in run 6.
The map in Figure 4.21 was obtained from the output of MaxEnt for run 8, following the same procedure as for run 6.
(a) Single-variable response curve- run 5
(b) Marginal response curve- run 5
Figure 4.17: Responding curves for variable 39- run 5
Figure 4.18: Jackknife-run 5
(a) Importance of the most important variables- run 2 (b) Importance of the most important variables- run 6
Figure 4.19: Importance of the most important variables- run 2 and 6
Figure 4.20: Probability of presence F. sultana- run 6
Figure 4.21: Probability of presence Fredericella sultana- run 8
4.3.8 Run 9
For this run a new variable (number 44) was added, concerning the pollutant release facilities in Switzerland. The variables concerning the mean monthly temperature were removed, with the only exception of the mean temperature of July (number 23), because the mean monthly temperatures were strongly correlated with each other and were not very relevant in the previous runs. The new variable (number 44) was treated as categorical because it can only take the values 1 or 0, for presence or absence respectively. All the default parameters were kept except the “random test percentage”, which was set to 25 in order to use 25% of the sample records for testing.
There are no important differences compared with the other runs: as always, the variable with the biggest percent contribution is number 39, followed by variables 4, 29, 31 and 30. The new variable has a contribution of 0.8% and a permutation importance of 0.
4.3.9 Run 10
For this run all the previous settings were kept, but 2 new variables were introduced. The first was the number of pollutant release facilities in all the basins that drain into the considered basin (number 45). The second was the number of pollutant release facilities in the considered basin (number 46). These 2 variables were added because the presence of F. sultana might depend on the number of pollutant release facilities and not just on their presence.
Here too there are no big differences between this run and the previous one, but the importance of the variables has changed slightly. After the usual important variables there is now also variable 45, the number of pollutant release facilities in the drained basins. Variable 44, on the other hand, has lost all of its percent contribution, as shown in Figure 4.22.
The map in Figure 4.23 was obtained from the output of MaxEnt for run 10, in the same way as for runs 6 and 8.
Figure 4.23: Probability of presence F. sultana- run 10
4.3.10 Run 11
For this run all the parameters were the same as in run 10, but 3 new variables were introduced: the presence or absence of waste and waste water management facilities, the number of these facilities in the basins that drain into the considered basin, and the number of these facilities in the considered basin (variables 47, 48 and 49, respectively). As in run 10, the variable concerning the presence or absence of waste and waste water management facilities was treated as categorical.
Here too there are no very important differences: the variables with the biggest percent contribution are still variables 39, 4, 29, 30, 43 and 31. In this run variable 45 (the number of pollutant release facilities in all the basins that drain into the considered basin) loses a little of its percent contribution, going from 5.8% to 4.9%. The percent contribution of the new variables is very low, and variable 47 is not even used, as shown in Figure 4.24.
The map in Figure 4.25 was obtained from the output of MaxEnt for run 11, in the same way as for runs 6, 8 and 10.
Figure 4.22: Importance of different variables-run 10
Figure 4.24: Importance of different variables-run 11
Figure 4.25: Probability of presence F. sultana- run 11
Chapter 5
Discussion and conclusions
5.1 Discussion
5.1.0.1 Analysis of the data
In all the maps it is possible to notice some large basins, for example the one below Lac Léman or the one near the Canton of Ticino, that show a single value for each variable. These are polygons on the border of Switzerland which the basissisgeometrie data treat as single basins even though they are composed of several basins. For this reason, the values of these basins should not be regarded as real values.
In the geology and land cover maps, on the other hand, one part is missing (at the top right) because it lies outside Switzerland and the data used covered only Switzerland. In any case, this is not a significant limitation, because the missing part is outside our study area and is small compared with the area analyzed.
5.1.1 Correlation
5.1.1.1 Local variables
As we can see in Figure 4.2a, which represents the correlation matrix of the local variables, there is a strong positive correlation among all the mean monthly temperature variables, as would be expected for a given location. There is another strong, but this time negative, correlation between the mean altitude and the mean monthly temperatures. A negative correlation means that one variable decreases as the other increases; as we would expect, temperature decreases as altitude increases. This is why most of the mean monthly temperature variables were excluded in the last runs.
We can also notice a strong negative correlation between land cover class 1 (rocks) and the mean monthly temperatures: where temperatures are low, the percentage of rocks is higher. This agrees with what we would expect, since places with low temperatures tend to lie at high altitude (negative correlation between temperature and altitude) and therefore largely in mountainous regions. Consistently with this negative correlation, there is also a positive correlation between mean altitude and land cover class 1: as said before, the percentage of rocks increases with altitude because of the mountainous regions of Switzerland.
Furthermore, altitude and river slope are positively correlated: as the altitude increases, river slopes increase. Accordingly, the river slope is negatively correlated with the mean monthly temperatures and positively correlated with land cover class 1 (rocks).
5.1.1.2 Up-stream variables
Figure 4.2b shows the correlation matrix of the up-stream variables. A strong positive correlation can be observed among all the variables concerning the geology and the land cover of the up-stream basins.
There are also positive correlations among the variables concerning the pollutant release facilities. For example, there is a strong correlation between the number of waste water management facilities in the considered polygon and the number of pollutant release facilities in the polygons that drain into the considered polygon. Because all these variables are derived from the same data, they are not independent by construction, so a positive correlation is expected.
5.1.2 The runs
5.1.2.1 Run 1
As we can see in Figure 4.4, the AUC for run 1 is relatively high (0.957), which suggests that the model is performing well.
In all the runs, the response curves can be hard to interpret if the variables are strongly correlated. The marginal curves show the effect of changing exactly one variable, whereas the model may rely on variables that change together. The single-variable response curves, on the other hand, can be easier to interpret when the variables are strongly correlated. An upward trend in a curve indicates a positive association, whereas a downward trend indicates a negative one. The strength of these relationships is given by the magnitude of these changes.
Figure 4.7 shows the jackknife test of variable importance. The longest dark blue bars, which correspond to the most important variables, are those of variables 39, 29, 4 and 26. Analyzing the light blue bars, we can see that no variable contains a substantial amount of useful information that is not already contained in the other variables: all the light blue bars have more or less the same length, so excluding any one of these variables does not reduce the training gain by much. If one of the light blue bars were longer than the red one, it would mean that the model without that variable performed even better than the model built with all the variables. This could be caused by a variable that leads to overfitting or to a lack of generality in the test runs.
5.1.2.2 Run 2
In run 2 the AUC for the training data (0.973) is even higher than in run 1. As in the previous run, this large AUC indicates that the model is performing well. The test line, on the other hand, shows the fit of the model to the testing data and therefore tests the real predictive power of the model. In this case the AUC of the test data is 0.875, which is lower than the AUC found for the training data but still much higher than the AUC of the random model (0.5). This is to be expected, because the model was built starting from the training data, which are therefore well represented.
Analyzing the table of thresholds, Figure 4.10, we can see that the p-values are very small, which further confirms that the model predicts the test points better than a random prediction.
In the jackknife test (test gain and AUC plots), shown in Figures 4.13b and 4.13a, some of the light blue bars are longer than the red ones, especially for variable 30 (the mean altitude of the polygons that drain into the considered polygon). This means that the predictive performance improves when this variable is not used. On the other hand, the model built with variable 8 alone, the percentage of the geology class sand and gravel, has a negative test gain, which means that it is worse than a null model at predicting the distribution of occurrences.
5.1.2.3
Run 3 and 4
Analyzing these response curves, we can see that the profiles look similar, with some differences
due to the different feature types. The threshold feature gives a function characterized by steps;
the hinge feature is similar to the threshold feature but allows changes in the gradient of the
response. The result is a step function for the threshold feature and a piece-wise linear function
for the hinge feature: the response curve of variable 39 using only hinge features is a sequence of
connected line segments. Figure 5.1 zooms in on the profiles of the most used variable, variable 39,
obtained first with the threshold features, then with the hinge features and finally with all
features (Figures 5.1a, 5.1b and 5.1c, respectively). Using all classes together makes it possible
to model complex responses. These figures show how the probability of presence of F. sultana
increases when the percentage of built up areas in all the basins that drain into the considered
basin is around 20%. However, this result is difficult to interpret for two reasons. First, this
variable is strongly correlated with other up-stream variables of geology and land cover. Second, it
is necessary to consider the possible range that this variable can assume: it is almost impossible
for the percentage of built up area of the up-stream basins to be 100%.
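The difference between the two feature types can be illustrated with a minimal sketch. It assumes the usual Maxent-style definitions (a threshold feature is a 0/1 step at a knot, a forward hinge feature is zero up to a knot and then rises linearly to one at the maximum of the variable). The knots, the weights and the function names are hypothetical and are not taken from the runs discussed here.

```python
def threshold_feature(x, knot):
    """Step feature: 0 below the knot, 1 at or above it."""
    return 1.0 if x >= knot else 0.0


def hinge_feature(x, knot, x_max):
    """Forward hinge: 0 up to the knot, then linear, reaching 1 at x_max."""
    if x <= knot:
        return 0.0
    return (x - knot) / (x_max - knot)


def response(x, features, weights):
    """Weighted sum of feature values (the exponent of a Maxent-style model)."""
    return sum(w * f(x) for f, w in zip(features, weights))


# Hypothetical knots and weights for a variable expressed in percent (0-100).
steps = [lambda x: threshold_feature(x, 10.0), lambda x: threshold_feature(x, 20.0)]
hinges = [lambda x: hinge_feature(x, 10.0, 100.0), lambda x: hinge_feature(x, 20.0, 100.0)]
weights = [1.5, -1.0]

# The first response jumps at the knots; the second changes slope at the knots.
for x in (5.0, 15.0, 25.0, 60.0):
    print(x, response(x, steps, weights), round(response(x, hinges, weights), 3))
```

With threshold features the summed response can only jump at the knots, whereas with hinge features it changes slope at the knots, which gives the connected line segments seen in the hinge-only curve.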
5.1.2.4
Run 5
Cross-validation was performed during this run, so all the graphics show some additional statistics.
For example, the graphic of the average omission and predicted area and the ROC curve report a mean
value and the standard deviation. For this run the mean AUC is a bit lower than in the previous
runs: it is now 0.895 (in any case still higher than that of the random model). This value is similar
to the AUC obtained for the test data in run 2 (0.875), because here too some of the data are held
out to test the model.
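As a reminder of what these numbers summarize, the following minimal sketch computes the AUC as the probability that a randomly chosen presence point receives a higher score than a randomly chosen background point, and then the mean and standard deviation over folds. The scores and the two folds are made up for illustration only; they are not the data of run 5, and the function name is hypothetical.

```python
from statistics import mean, stdev


def auc(presence_scores, background_scores):
    """Share of (presence, background) pairs ranked correctly; ties count as 0.5."""
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(presence_scores) * len(background_scores))


# Made-up model scores for two cross-validation folds:
# (scores at held-out presence points, scores at background points).
folds = [
    ([0.9, 0.8, 0.4], [0.1, 0.3, 0.5, 0.2]),
    ([0.7, 0.25, 0.9], [0.2, 0.65, 0.1, 0.3]),
]

fold_aucs = [auc(p, b) for p, b in folds]
print("AUC per fold:", [round(a, 3) for a in fold_aucs])
print("mean:", round(mean(fold_aucs), 3), "std:", round(stdev(fold_aucs), 3))
```

A model that ranks presence and background points at random obtains 0.5 under this definition, which is the reference value used throughout this chapter.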
5.1.2.5
Run 6
In this run a new variable was introduced, the mean river slope for the up-stream polygons. This
variable has the same magnitude of importance as the five most important variables. Figure 4.20
represents the probability map obtained with this run. The probability of finding F. sultana is
higher in the area that runs from Genève through Lausanne, Bern and Zurich to St Gallen. There is
also a high probability of finding F. sultana in the Canton of Ticino, especially near Lugano, and
the probability of presence is likewise high in the area of Basel and the Jura. In the Alps, by
contrast, the probability of finding F. sultana is low, with the sole exception of the areas of Sion
and Monthey. The probability of finding F. sultana thus appears to be high near the large cities,
which is consistent with the results of the variable importance test, where the most important
variable is the percentage of built up areas in the upstream basins.
(a) Response curve of variable 39 using threshold feature - run 3
(b) Response curve of variable 39 using hinge feature - run 3
(c) Response curve of variable 39 using all features - run 2
Figure 5.1: Response curve of variable 39
Figure 5.2: ROC for Fredericella sultana - run 7
5.1.2.6
Run 7
In this run the mean AUC value is higher than the one obtained in run 5, increasing from 0.895 to
0.905. This increase could be attributed to the addition of variable 43 to the model. The graphic
from this run, for which cross-validation was also computed, is shown in Figure 5.2.
Additionally, the single-variable response curves are less variable than the marginal ones. As an
example, the single-variable and marginal response curves of variable 39 are shown in Figures 5.3b
and 5.3a, respectively.
Figure 5.3 shows how the probability of presence of F. sultana increases when the percentage of
built up areas in all the basins that drain into the considered basin is around 20%. However, as
said before, this result is difficult to interpret for two reasons. First, this variable is strongly
correlated with other up-stream variables of geology and land cover. Second, it is necessary to
consider the possible range that this variable can assume: it is unlikely that 100% of the area of
the up-stream basins is built up.
(a) Marginal response curve of variable 39 - run 7
(b) Single-variable response curve of variable 39 - run 7
Figure 5.3: Response curves of variable 39 - run 7
5.1.2.7
Run 8
In this run the “regularization multiplier” parameter was changed; the result is shown in Figure 4.21.
Comparing this map with the map obtained in run 6, we can see that the distribution obtained in the
current run is more spread out than the one in run 6, as seen in Figure 5.4. In this run there are
more areas with a higher probability of presence of F. sultana, while the areas with the highest
probability of presence are the same as in run 6. In run 8 the areas with a probability of presence
of around 0.5 to 0.6 are the ones that increase the most.
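A toy sketch can suggest why the regularization setting affects how spread out the prediction is. The text does not state whether the multiplier was raised or lowered; a more spread-out map is what one would expect from a larger value, because a stronger penalty shrinks the feature weights and pulls the distribution toward the uniform one. The sketch below assumes a single feature, a hypothetical fitted weight, and soft-thresholding as a stand-in for refitting the model with a larger penalty. It is not the Maxent fitting procedure.

```python
import math


def gibbs(feature_values, weight):
    """Maxent-style distribution over cells: q(x) proportional to exp(weight * f(x))."""
    raw = [math.exp(weight * f) for f in feature_values]
    total = sum(raw)
    return [r / total for r in raw]


def entropy(dist):
    """Shannon entropy (natural log); larger means a more spread-out distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def soft_threshold(weight, penalty):
    """Shrink a weight toward zero by the penalty (assumed stand-in for refitting)."""
    return math.copysign(max(abs(weight) - penalty, 0.0), weight)


# Hypothetical values of one feature on five map cells and a hypothetical fitted weight.
cells = [0.0, 0.1, 0.2, 0.6, 1.0]
fitted_weight = 4.0

for penalty in (0.0, 1.0, 3.0):
    q = gibbs(cells, soft_threshold(fitted_weight, penalty))
    print(penalty, [round(p, 3) for p in q], round(entropy(q), 3))
```

In this one-feature setting the entropy grows as the weight shrinks, so the probability mass moves from the few best cells toward the intermediate ones, which is qualitatively the pattern described for run 8.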
5.1.2.8
Run 9, 10, 11
For these runs, the interesting point is the response curves of the newly introduced variables.
They differ from the others because the variables were set as categorical. Figures 5.5a and 5.5b
show the two response curves of variable 44 for run 9, the marginal one and the single-variable
one, respectively.
As can be seen in Figure 5.5, the probability of finding the species is higher if there are
pollutant release facilities in the considered polygon, but the limits mentioned in the discussion
of run 7 have to be considered.
The variables concerning pollutant releases do not influence the model, because they are too
general. For example, the presence of pollutant release facilities is not influential because these
data do not account for the size of the facilities or for the amount and types of pollutants
released. To improve these data, the population density should also be included. Likewise, the
number of pollutant release facilities and the number of waste or waste water management facilities
in the up-stream basins are not influential because they are too general and because the dispersion
of the pollutants is not considered. The fact that the model does not use these overly general data
sets is evidence of the power of the model: it does not use data that are not useful for the
prediction.
5.1.3
All the runs
In all the runs the AUC is above 0.5 (the value of the random model); more precisely, all the values
are above 0.872. This suggests that the models are performing well.
According to all the different runs, the most important variable is variable 39, the percentage of
land cover class 3 (built up) of all the polygons that drain into the considered polygon. The other
important variables are variables 29, 4, 30, 32, 31 and 43 (respectively mean river slope, mean
altitude, mean altitude up-stream, percentage of geology class 1, alluvial rocks of all the polygons that
drain in the considered polygon, and mean river slope up-stream). Table 5.1 lists the most important
variables for all the runs. The river slope has a percent contribution between 5.6 and 13.8, the
mean altitude between 0.6 and 17.5, and the mean altitude of the up-stream basins between 0.8 and
10.1.
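The range quoted for the river slope can be checked directly against the entries of Table 5.1, as in the minimal sketch below. The dictionary is a transcription of the table, which lists only the six largest contributions per run, so it can confirm a range only for a variable that appears in every row (as the river slope does). For variables that drop out of the top six in some runs, such as the mean altitude, the lower bounds quoted in the text cannot be recovered from the table. The helper name is hypothetical.

```python
# Top-six entries per run as (variable, percent contribution), transcribed from Table 5.1.
table_5_1 = {
    "run 1": [("39-Dland3", 25.1), ("29-Rslope", 12.8), ("4-altitu", 8.7),
              ("26-temp10", 8.5), ("30-D_alti", 6.5), ("32-D_geo2", 5.3)],
    "run 2": [("39-Dland3", 28.0), ("4-altitu", 15.5), ("29-Rslope", 10.2),
              ("30-D_alti", 9.7), ("32-D_geo2", 5.7), ("31-D_geo1", 5.4)],
    "run 3": [("39-Dland3", 31.7), ("4-altitu", 17.5), ("29-Rslope", 11.5),
              ("30-D_alti", 10.1), ("32-D_geo2", 4.7), ("31-D_geo1", 3.8)],
    "run 4": [("39-Dland3", 45.9), ("31-D_geo1", 12.4), ("37-Dland1", 6.5),
              ("29-Rslope", 5.6), ("32-D_geo2", 5.5), ("36-D_geo6", 4.1)],
    "run 5": [("39-Dland3", 27.9), ("29-Rslope", 13.8), ("4-altitu", 10.2),
              ("26-temp10", 9.8), ("30-D_alti", 7.6), ("32-D_geo2", 3.8)],
    "run 6": [("39-Dland3", 28.6), ("4-altitu", 15.9), ("29-Rslope", 9.8),
              ("30-D_alti", 7.4), ("31-D_geo1", 7.0), ("43-Dslope", 5.6)],
    "run 7": [("39-Dland3", 45.9), ("31-D_geo1", 12.4), ("37-Dland1", 6.5),
              ("29-Rslope", 5.6), ("32-D_geo2", 5.5), ("36-D_geo6", 4.1)],
    "run 8": [("39-Dland3", 40.7), ("4-altitu", 14.1), ("29-Rslope", 10.3),
              ("31-D_geo1", 4.9), ("32-D_geo2", 3.6), ("30-D_alti", 3.2)],
    "run 9": [("39-Dland3", 28.7), ("4-altitu", 15.7), ("29-Rslope", 9.6),
              ("31-D_geo1", 7.8), ("30-D_alti", 6.0), ("32-D_geo2", 5.6)],
    "run 10": [("39-Dland3", 27.6), ("4-altitu", 15.6), ("29-Rslope", 9.4),
               ("31-D_geo1", 6.1), ("30-D_alti", 5.8), ("45-ND_pol", 5.8)],
    "run 11": [("39-Dland3", 26.7), ("4-altitu", 14.9), ("29-Rslope", 8.5),
               ("30-D_alti", 6.4), ("43-Dslope", 6.0), ("31-D_geo1", 5.3)],
}


def contribution_range(table, variable):
    """Return (min, max, number of runs) for a variable among the listed entries."""
    values = [pct for row in table.values() for name, pct in row if name == variable]
    return min(values), max(values), len(values)


for name in ("39-Dland3", "29-Rslope", "4-altitu"):
    print(name, contribution_range(table_5_1, name))
```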
As said before, the mean altitude is among the most important variables. The response curves of
this variable are shown in Figure 5.6. With all the limits and difficulties of interpretation of
these curves, it can be noticed that the probability of presence of F. sultana is higher at rather
low altitudes. In both curves, Figure 5.6a and Figure 5.6b, there is a drastic change of probability
around an altitude of 800 m above sea level. This corresponds
(a) Probability of presence F. sultana - run 6
(b) Probability of presence F. sultana - run 8
Figure 5.4: Probability of presence F. sultana - run 6 and 8
(a) Marginal response curve of variable 39 - run 9
(b) Single-variable response curve of variable 39 - run 9
Figure 5.5: Response curve of variable 39 - run 9
Variable and percent contribution (perc. cont), in decreasing order within each run
Run      1st               2nd               3rd               4th               5th               6th
run 1    39-Dland3 25.1    29-Rslope 12.8    4-altitu 8.7      26-temp10 8.5     30-D_alti 6.5     32-D_geo2 5.3
run 2    39-Dland3 28      4-altitu 15.5     29-Rslope 10.2    30-D_alti 9.7     32-D_geo2 5.7     31-D_geo1 5.4
run 3    39-Dland3 31.7    4-altitu 17.5     29-Rslope 11.5    30-D_alti 10.1    32-D_geo2 4.7     31-D_geo1 3.8
run 4    39-Dland3 45.9    31-D_geo1 12.4    37-Dland1 6.5     29-Rslope 5.6     32-D_geo2 5.5     36-D_geo6 4.1
run 5    39-Dland3 27.9    29-Rslope 13.8    4-altitu 10.2     26-temp10 9.8     30-D_alti 7.6     32-D_geo2 3.8
run 6    39-Dland3 28.6    4-altitu 15.9     29-Rslope 9.8     30-D_alti 7.4     31-D_geo1 7       43-Dslope 5.6
run 7    39-Dland3 45.9    31-D_geo1 12.4    37-Dland1 6.5     29-Rslope 5.6     32-D_geo2 5.5     36-D_geo6 4.1
run 8    39-Dland3 40.7    4-altitu 14.1     29-Rslope 10.3    31-D_geo1 4.9     32-D_geo2 3.6     30-D_alti 3.2
run 9    39-Dland3 28.7    4-altitu 15.7     29-Rslope 9.6     31-D_geo1 7.8     30-D_alti 6       32-D_geo2 5.6
run 10   39-Dland3 27.6    4-altitu 15.6     29-Rslope 9.4     31-D_geo1 6.1     30-D_alti 5.8     45-ND_pol 5.8
run 11   39-Dland3 26.7    4-altitu 14.9     29-Rslope 8.5     30-D_alti 6.4     43-Dslope 6       31-D_geo1 5.3
Table 5.1: Percent contribution of the most important variables for all the different runs
with the results obtained in the probability maps, where the probability of presence of F. sultana was
higher near the big cities (Zurich is at an altitude of 408 m, Lausanne at 495 m and Bern at 542 m).
5.2
Conclusion and future work
All the runs performed significantly better than the random model. The threshold-independent ROC
analysis also showed a considerably better performance than the random model, and the area under
the ROC curve (AUC) was high in all the runs. All the runs produced reasonable predictions of the
potential distribution of F. sultana. The predicted probability of presence of F. sultana is higher
near the big cities of Switzerland, which is consistent with the results of the variable importance
test, where the most important variable is the percentage of built up areas in the upstream basins.
In all the probability maps, the diagonal area that runs from Genève to St. Gallen has the highest
probability of finding F. sultana.
To evaluate the prediction of the potential distribution of F. sultana, a survey of the true
presence of F. sultana is needed. Such a study could aim to find out whether F. sultana is also
present in the places where the infected trout were found.
(a) Marginal response curve of variable 4 - run 2
(b) Single-variable response curve of variable 4 - run 3
Figure 5.6: Response curve of variable 4 - run 2
The species data analyzed in the current study could be affected by bias, because we have assumed
that F. sultana is present only in the places where the trout were found, whereas it is possible
that F. sultana also has suitable habitat where the trout are not present. For this reason it would
be interesting to carry out a survey to detect the real presence of F. sultana.
Bibliography
[1] Beth Okamura, Hanna Hartikainen, Heike Schmidt-Posthaus, and Thomas Wahli. Life cycle
complexity, environmental change and the emerging status of salmonid proliferative kidney disease. Freshwater Biology, 56:735–753, 2011. ISSN 00465070.
[2] Esri. ArcGIS Resources.
[3] Shahid Naeem, F S Chapin III, Robert Costanza, Paul R Ehrlich, Frank B Golley, David U
Hooper, J H Lawton, Robert V O'Neill, Harold A Mooney, Osvaldo E Sala, Amy J Symstad,
and David Tilman. Issues in Ecology. Issues in Ecology, 4:1–12, 1999. ISSN 1092-8987.
[4] R Hoffmann, S van de Graaff, F Braun, W Körting, H Dangschat, and D Manz. Proliferative
kidney disease in salmonid fish. Berliner und Munchener tierarztliche Wochenschrift, 97:288–
291, 1984. ISSN 09598030.
[5] C L Anderson, E U Canning, and B Okamura. Molecular data implicate bryozoans as hosts
for PKX (phylum Myxozoa) and identify a clade of bryozoan parasites within the Myxozoa.
Parasitology, 119 (Pt 6):555–561, 1999. ISSN 00311820.
[6] B. Okamura and T. S. Wood. Bryozoans as hosts for Tetracapsula bryosalmonae, the PKX organism. Journal of Fish Diseases, 25:469–475, 2002. ISSN 01407775.
[7] Karen Anna Okland and Jan Okland. Freshwater bryozoans (Bryozoa) of Norway II: Distribution
and ecology of two species of Fredericella. Hydrobiologia, 459:103–123, 2001. ISSN 0018-8158.
[8] M L Kent and R P Hedrick. Development of the PKX myxosporean in rainbow trout Salmo
gairdneri. Diseases of Aquatic Organisms, 1(1924):169–182, 1986.
[9] Eva Jiménez-Guri, Hervé Philippe, Beth Okamura, and Peter W H Holland. Buddenbrockia is a
cnidarian worm. Science (New York, N.Y.), 317(5834):116–118, 2007.
[10] E U Canning, A Curry, S W Feist, M Longshaw, and B Okamura. A new class and order of myxozoans to accommodate parasites of bryozoans with ultrastructural observations on Tetracapsula
bryosalmonae (PKX organism). The Journal of eukaryotic microbiology, 47(5):456–468, 1999.
ISSN 1066-5234.
[11] Kathrin Bettge, Thomas Wahli, Helmut Segner, and Heike Schmidt-Posthaus. Proliferative kidney disease in rainbow trout: Time- and temperature-related renal pathology and parasite distribution. Diseases of Aquatic Organisms, 83:67–76, 2009. ISSN 01775103.
[12] Timothy S. Wood, Lisa J. Wood, Gaby Geimer, and Jos Massard. Freshwater bryozoans of New
Zealand: A preliminary survey. New Zealand Journal of Marine and Freshwater Research, 32
(March 2014):639–648, 1998. ISSN 0028-8330.
[13] Timothy S. Wood. Reappraisal of Australian freshwater bryozoans with two new species of
Plumatella (Ectoprocta : Phylactolaemata). Invertebrate Systematics, 12(2):257, 1998. ISSN
1445-5226.
[14] Dean Jacobsen, Rikke Schultz, and Andrea Encalada. Structure and diversity of stream invertebrate assemblages : the influence of temperature with. Freshwater Biology, 38:247–261, 1997.
ISSN 0046-5070.
[15] R. P. Smart, C. Soulsby, M. S. Cresser, A. J. Wade, J. Townend, M. F. Billett, and S. Langan.
Riparian zone influence on stream water chemistry at different spatial scales: A GIS-based modelling approach, an example for the Dee, NE Scotland. Science of the Total Environment, 280
(1-3):173–193, 2001.
[16] John R. Olson and Charles P. Hawkins. Predicting natural base-flow stream water chemistry in
the western United States. Water Resources Research, 48(2):1, 2012. ISSN 00431397.
[17] T L Root. Environmental factors associated with avian distributional limits. J. Biogeogr, 15(3):
489–505, 1988.
[18] Jane Elith and John R. Leathwick. Species Distribution Models: Ecological Explanation and
Prediction Across Space and Time. Annual Review of Ecology, Evolution, and Systematics, 40:
677–697, 2009. ISSN 1543-592X.
[19] A. Townsend Peterson. Uses and requirements of ecological niche models and related distributional models. Biodiversity Informatics, 3:59–72, 2006. ISSN 15469735.
[20] R P Anderson. Real vs. artefactual absences in species distributions: Tests for Oryzomys albigularis (Rodentia: Muridae) in Venezuela. Journal of Biogeography, 30:591–605, 2003. ISSN
1365-2699.
[21] Catherine H. Graham, Simon Ferrier, Falk Huettman, Craig Moritz, and A. Townsend Peterson.
New developments in museum-based informatics and applications in biodiversity analysis.
Trends in Ecology and Evolution, 19(9):497–503, 2004. ISSN 01695347.
[22] Jane Elith, Steven J. Phillips, Trevor Hastie, Miroslav Dudík, Yung En Chee, and Colin J. Yates.
A statistical explanation of MaxEnt for ecologists. Diversity and Distributions, 17:43–57, 2011.
ISSN 13669516.
[23] Gill Ward, Trevor Hastie, Simon Barry, Jane Elith, and John R. Leathwick. Presence-only data
and the EM algorithm. Biometrics, 65:554–563, 2009. ISSN 0006341X.
[24] Weidong Gu and Robert K. Swihart. Absent or undetected? Effects of non-detection of species
occurrence on wildlife-habitat models. Biological Conservation, 116:195–203, 2004. ISSN
00063207.
[25] Steven J. Phillips, Miroslav Dudík, Jane Elith, Catherine H. Graham, Anthony Lehmann, John
Leathwick, and Simon Ferrier. Sample selection bias and presence-only distribution models:
Implications for background and pseudo-absence data. Ecological Applications, 19(1):181–197,
2009. ISSN 10510761.
[26] Sushma Reddy and Liliana M. Dávalos. Geographic sampling bias and its implications for conservation priorities in Africa. pages 1719–1727, 2003.
[27] R P Anderson. Evaluating predictive models of species’ distributions: criteria for selecting optimal models. Ecological Modelling, 162:211–232, 2003.
[28] M.P Austin. Spatial prediction of species distribution: an interface between ecological theory
and statistical modelling. Ecological Modelling, 157:101–118, 2002. ISSN 03043800.
[29] Steven Phillips. A Brief Tutorial on Maxent. AT&T Research, pages 1–38, 2008.
[30] Steven J. Phillips and Miroslav Dudík. Modeling of species distributions with Maxent: New
extensions and a comprehensive evaluation. Ecography, 31(2):161–175, 2008.
[31] E. Jaynes. Information Theory and Statistical Mechanics. Physical Review, 106(4):620–630,
1957. ISSN 0031-899X.
[32] Steven J. Phillips, Robert P. Anderson, and Robert E. Schapire. Maximum entropy modeling
of species geographic distributions. Ecological Modelling, 190(3-4):231–259, 2006. ISSN
03043800.
[33] C. E. Shannon. A Mathematical Theory of Communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5(1):3, 2001.
[34] Ron Kohavi. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model
Selection. In International Joint Conference on Artificial Intelligence, volume 14, pages 1137–
1143. Citeseer, 1995.