FSU (Multimodel Superensemble)
Evaluation of 2014 FSU-MMSE Stream 1.5 Candidate
23 June 2014
TCMT Stream 1.5 Analysis Team: Louisa Nance, Mrinal Biswas, Barbara Brown, Tressa
Fowler, Paul Kucera, Kathryn Newman, and Christopher Williams
Data Manager: Mrinal Biswas
Overview
The Florida State University (FSU) Multi-Model Super Ensemble (MMSE) is a research ensemble that predicts hurricane track and intensity by assigning variable weights to member models based on their previous performance. The evaluation of FSU-MMSE focused
primarily on a direct comparison between FSU-MMSE and each of last year’s top-flight models
and the operational consensus. A direct comparison between FSU-MMSE and the operational
consensus was chosen over considering the impact of including FSU-MMSE in the operational
consensus because FSU-MMSE is already a multi-model consensus based on both operational
and HFIP experimental models. Given that all aspects of the evaluation are based on homogeneous samples for each type of analysis, the number of cases varied depending on the availability of the
specific operational baseline. Table 1 contains descriptions of the configurations used in the
evaluation that are associated with FSU-MMSE, as well as their corresponding ATCF IDs.
Table 2 contains a summary of the baselines used to evaluate FSU-MMSE. Definitions of the
operational baselines and their corresponding ATCF IDs can be found in the “2014 Stream 1.5
Methodology” write-up. Note that only early versions of all model guidance were considered in
this analysis. Cases were aggregated over ‘land and water’ for track metrics; ‘land and water’
and ‘water only’ for intensity metrics. Except when noted, results are for aggregations over both
land and water.
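As an illustration only (not part of the TCMT evaluation), the following minimal Python sketch shows one way a performance-weighted consensus can be formed; the member forecasts, past errors, and inverse-error weighting are hypothetical and are not the actual FSU weighting scheme.

import numpy as np

# Hypothetical intensity forecasts (kt) from three member models for one lead time.
member_forecasts = np.array([85.0, 92.0, 78.0])

# Hypothetical mean absolute errors (kt) of each member over a prior training period.
past_errors = np.array([12.0, 8.0, 15.0])

# Weight each member inversely to its past error, then normalize the weights to sum to 1.
weights = 1.0 / past_errors
weights /= weights.sum()

# Weighted consensus: members with smaller historical errors contribute more.
consensus = float(np.sum(weights * member_forecasts))
print(f"weights = {np.round(weights, 3)}, consensus = {consensus:.1f} kt")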
Inventory
The FSU team delivered 314 retrospective Multi-Model Super Ensemble (hereafter referred to as
MMSI) forecasts for 33 storms in the Atlantic (AL) basin for the 2011-2013 hurricane seasons.
No forecast data were available for 33 of these cases, which reduced the number of cases to 281.
When generating the interpolated or early model versions, both the CARQ record and storm
information from the National Hurricane Center (NHC) Best Track (BT) must be available for
each case. CARQ or BT were not available for 14 cases. Furthermore, the storm was not
classified as tropical or subtropical at the initial time of the early model version for 12 additional
cases. Given the NHC requirement that a storm be classified as tropical or subtropical in order for its forecasts to be verified, the total sample used in this analysis consisted of 255 cases in the AL basin.
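The case accounting above amounts to simple bookkeeping (a sketch using the counts quoted in this paragraph):

# Retrospective AL-basin cases, 2011-2013, as reported above.
delivered = 314
missing_forecast_data = 33      # no forecast data available
missing_carq_or_bt = 14         # CARQ record or Best Track unavailable
not_tropical_at_init = 12       # not tropical/subtropical at the early-model initial time

verifiable = delivered - missing_forecast_data - missing_carq_or_bt - not_tropical_at_init
print(verifiable)               # 255 cases used in the analysis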
Top-Flight Models
Atlantic Basin
Track Analysis
The MMSI mean track errors were statistically indistinguishable from those for the global top-flight models and the variable track consensus TVCA, whereas MMSI improved upon the
regional HWFI mean track errors at all lead times (Fig. 1). These statistically significant (SS)
improvements ranged from 14-23% (Table 3). The frequency of superior performance (FSP),
which does not take into account the magnitude of the error differences, corroborated the mean
error analyses for comparisons between MMSI and all of the baselines (not shown). A
comparison of the MMSI and TVCA track error distributions (Fig. 2) revealed that while MMSI
mean track errors were not statistically distinguishable from those for TVCA, the largest track
errors associated with MMSI were substantially larger than the largest TVCA outliers (large outliers from 48 to 108 h). While the disparity between the largest outliers for the other three
baselines and these large MMSI outliers was not as great, the pairwise difference distributions
(not shown) indicated MMSI performed substantially worse than all four baselines for this
particular case, which corresponded to Philippe (AL172011).
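For readers unfamiliar with the pairwise-difference statistics referenced throughout this report, the sketch below illustrates the general idea under simplifying assumptions (matched error pairs treated as independent and a plain t-interval); it is not the Stream 1.5 methodology code, which is documented in the separate methodology write-up.

import numpy as np
from scipy import stats

def pairwise_improvement(candidate_errors, baseline_errors, alpha=0.05):
    """Mean pairwise difference (baseline - candidate), percent improvement, and a 95% CI."""
    candidate = np.asarray(candidate_errors, dtype=float)
    baseline = np.asarray(baseline_errors, dtype=float)
    diffs = baseline - candidate                      # positive values favor the candidate
    mean_diff = diffs.mean()
    improvement_pct = 100.0 * mean_diff / baseline.mean()
    # Simple t-based confidence interval on the mean pairwise difference.
    se = diffs.std(ddof=1) / np.sqrt(diffs.size)
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, diffs.size - 1)
    ci = (mean_diff - t_crit * se, mean_diff + t_crit * se)
    significant = ci[0] > 0.0 or ci[1] < 0.0          # SS if the CI excludes zero
    return mean_diff, improvement_pct, ci, significant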
A comparison of MMSI’s performance with the three top-flight models and the operational
variable track consensus through a rank frequency analysis (Fig. 3) indicated MMSI was less
likely than random to perform worst at most lead times. It was also more likely to rank 2nd or 3rd
at intermediate lead times, but the most prominent signature was for MMSI to outperform at least
one baseline at most lead times.
Intensity Analysis
The direct comparisons between the MMSI absolute intensity errors and those for the top-flight
intensity models led to varying degrees of SS improvements (Fig. 4, Table 4). The pairwise
difference tests showed five SS improvements at longer lead times (72-120 h) of 18 to 35% for
the comparison with HWFI, five SS improvements at more intermediate lead times (36-72 h and
96 h) of 15-21% for the comparison with LGEM, and two SS improvements at 36-48 h of 13-16% for the comparison with DSHP. Note that all of the non-SS differences for these
comparisons are also positive except at 12 h. Conversely, the direct comparison with the
operational fixed consensus for intensity led to one SS degradation at 12 h with the performance
being statistically indistinguishable at all other lead times. The non-SS differences for this
comparison were negative or zero except at the longest lead times. Limiting the sample to cases
over water only had little to no impact on the comparisons with HWFI, LGEM and ICON.
Aggregating over water only produced one additional SS improvement for the HWFI
comparison, reduced the lead times with SS improvements over LGEM by one and did not
change the number of SS differences for the ICON comparison. In contrast, the comparison with
DSHP for over water only cases produced a SS degradation at 12 h and only one SS
improvement, which occurred at a longer lead time than either SS improvement for the land and
water sample. While the numbers and timing of SS differences were slightly different between
the two samples, the general behavior for the non-SS differences did not change except that the
ICON comparison produced a few more positive non-SS differences at longer lead times.
The absolute intensity error distributions revealed some interesting differences between the
performance of MMSI and that of the top-flight models for intensity and ICON (Fig. 5). At
shorter lead times (12-48 h), the largest outliers associated with the MMSI distributions, which
were associated with Hurricane Irene (AL092011), were substantially larger than those for the
top-flight models, as well as ICON. In contrast, at longer lead times (60-96 h), MMSI was able
to substantially improve upon the top-flight guidance for the cases where the top-flight models
produced the largest errors, but in this case, ICON was already able to substantially improve
upon the individual guidance provided by the top-flight models.
The frequency of superior performance (FSP) technique, which does not take into consideration
the magnitude of the error differences, yielded results that were consistent with those for the
pairwise difference tests, except for the timing and number of SS differences in performance
(Fig. 6). For the HWFI comparison, MMSI outperformed HWFI for lead times starting at 84 h,
whereas the mean absolute intensity errors were statistically distinguishable starting at 72 h. The
FSP analysis led to MMSI improvements upon LGEM for four lead times, where two of these
lead times corresponded to lead times for which the mean error analysis also produced SS
improvements (60-72 h), but two of these lead times corresponded to lead times for which the
mean error analysis produced statistically indistinguishable results (24 and 120 h). The
comparison with DSHP simply led to one less lead time for which the results were statistically
distinguishable. In terms of FSP, MMSI and ICON were evenly matched for all lead times.
Limiting the sample to cases over water only (not shown) produced one more lead time for which
MMSI outperformed HWFI, reduced the number of lead times at which MMSI outperformed
LGEM to two and evenly matched performance at all lead times for the DSHP and ICON
comparisons.
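A minimal sketch of an FSP-style calculation is given below, using the 1-kt tie threshold noted in the Figure 6 caption; restricting the denominator to non-tied cases is an assumption made here for illustration, and the actual Stream 1.5 implementation may differ.

import numpy as np

def frequency_of_superior_performance(candidate_errors, baseline_errors, tie_threshold=1.0):
    """Fraction of non-tied cases in which the candidate has the smaller absolute error."""
    diffs = (np.abs(np.asarray(baseline_errors, dtype=float))
             - np.abs(np.asarray(candidate_errors, dtype=float)))
    ties = np.abs(diffs) < tie_threshold   # absolute differences under 1 kt count as ties
    wins = diffs[~ties] > 0.0              # candidate error strictly smaller than baseline
    return float(wins.mean()) if wins.size else float("nan")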
A comparison of MMSI’s intensity performance to that of the three top-flight models and the
operational fixed consensus (Fig. 7) indicated MMSI was more likely to have the smallest errors,
i.e., rank 1st, than would be expected based on random forecasts at 12 to 36 h, 60 h and 120 h.
When all cases with ties (i.e., the same intensity error for MMSI and at least one other model)
awarded MMSI with the best rank, the proportion of best rankings increased substantially at
most lead times (shown in solid black numbers). Conversely, MMSI was also more likely than
random to have the largest errors, i.e., rank 5th, at 12 h, and it was less likely to have the largest
errors at longer lead times (72-120 h). It was also less likely to rank 3rd at 12-36 h and 60 h and
more likely to rank 3rd at 96 h. The overall signature of the rankings for the water only sample
was generally consistent with that for the land and water sample (not shown).
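The rank-frequency tabulation can be sketched as follows, assuming a hypothetical table of matched absolute intensity errors with one column per model; assigning tied models the better (lower) rank corresponds to the variant in which the candidate is awarded the best rank for all ties (the solid black numbers in Fig. 7).

import pandas as pd

def rank_frequencies(errors, candidate="MMSI"):
    """errors: DataFrame with one column per model and one row per verification case."""
    # Rank 1 = smallest absolute error in each row; with method="min", tied models all
    # receive the better (lower) rank, i.e., ties are broken in the candidate's favor.
    ranks = errors.rank(axis=1, method="min")
    return ranks[candidate].value_counts(normalize=True).sort_index()

# Hypothetical usage with five models at a single lead time:
# freq = rank_frequencies(pd.DataFrame({"MMSI": [...], "DSHP": [...], "LGEM": [...],
#                                       "HWFI": [...], "ICON": [...]}))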
Overall Evaluation
The direct comparisons between FSU-MMSE and the global top-flight models and operational consensus guidance indicated that FSU-MMSE was able to improve upon the track guidance, but only relative to the operational HWRF (improvements of 14-23%), and upon the intensity guidance, but only relative to the individual top-flight models (improvements of 13-35%). For track, the rank frequency analysis showed only that FSU-MMSE was less likely than random to produce the largest errors. For intensity, the same analysis showed some positive signs of FSU-MMSE outperforming the top-flight models and the fixed consensus (it was more likely than random to produce forecasts with the smallest errors at a number of lead times), but this positive signature was dampened by the fact that FSU-MMSE was also more likely than random to produce the largest errors at one of those lead times. Given that FSU-MMSE is essentially a weighted consensus and it was not able to show SS improvement over the guidance provided by the variable consensus for track or the operational fixed consensus for intensity, the results were not favorable for selecting FSU-MMSE to provide explicit track or intensity guidance in the Atlantic basin.
Table 1: Description of the various FSU-MMSE related configurations used in this evaluation, as well as their assigned ATCF IDs.

ID      Description of Configuration
MMSE    Late model version (non-interpolated)
MMSI    Early model version (interpolated, adjustment window 0 to 12 h, basically time-lag only)
Table 2: Summary of baselines used for evaluation of FSU-MMSE for the specified metrics (a bullet indicates the baseline was used for that variable and aggregation).

ID      Track (land and water)    Intensity (land and water)    Intensity (water only)
EMXI    ●
GFSI    ●
HWFI    ●                         ●                             ●
TVCA    ●
DSHP                              ●                             ●
LGEM                              ●                             ●
ICON                              ●                             ●
Table 3: Inventory of statistically significant (SS) pairwise differences for track stemming from the comparison of MMSI and each
individual top-flight model and the operational variable consensus for track in the AL basin. See 2014 Stream 1.5 methodology write-up for description of entries.
Table 4: Inventory of statistically significant (SS) pairwise differences for intensity stemming from the comparison of MMSI and
each individual top-flight model and the operational fixed consensus for intensity in the AL basin. See 2014 Stream 1.5 methodology
write-up for description of entries.
Figure 1: Mean track errors (MMSI-red, baselines-black) and mean pairwise differences (blue) with 95% confidence intervals with
respect to lead time for EMXI and MMSI (top left panel), GFSI and MMSI (top right panel), HWFI and MMSI (bottom left panel),
and TVCA and MMSI (bottom right panel) in the Atlantic basin.
Figure 2: Track error distributions with respect to lead time for MMSI and TVCA in the Atlantic
basin.
Figure 3: Rankings with 95% confidence intervals for MMSI compared to the three top-flight
models and variable operational consensus for track guidance with respect to lead time.
Aggregations are for land and water (top panel) in the Atlantic basin. The grey horizontal line
highlights the 20% frequency for reference. Black numbers indicate the frequencies of the first
and fifth rankings where the candidate model was assigned the better (lower) ranking for all ties.
Figure 4: Mean absolute intensity errors (MMSI-red, baselines-black) and mean pairwise differences (blue) with 95% confidence
intervals with respect to lead time for DSHP and MMSI (top left panel), LGEM and MMSI (top right panel), HWFI and MMSI
(bottom left panel), and ICON and MMSI (bottom right panel) in the Atlantic basin.
Figure 5: Intensity error distributions with respect to lead time for DSHP and MMSI (top left panel), LGEM and MMSI (top right
panel), HWFI and MMSI (bottom left panel) and ICON and MMSI (bottom right panel) in the Atlantic basin.
Figure 6: Frequency of superior performance (FSP) with 95% confidence intervals for intensity error differences stemming from the
comparison of DSHP and MMSI (top left panel) and LGEM and MMSI (top right panel), HWFI and MMSI (bottom left panel), and
ICON and MMSI (bottom right panel) with respect to lead time for cases in the Atlantic basin. Ties are defined as cases for which the
difference was less than 1 kt.
Figure 7: Rankings with 95% confidence intervals for MMSI compared to the three top-flight models and fixed operational consensus
for intensity guidance with respect to lead time. Aggregations are for land and water in the Atlantic basin. The grey horizontal line
highlights the 20% frequency for reference. Black numbers indicate the frequencies of the first and fifth rankings where the candidate
model was assigned the better (lower) ranking for all ties.