evmix (Version 0.2.11): An R package for Extreme Value Mixture Modelling, Threshold Estimation and Boundary Corrected Kernel Density Estimation

The evmix package for R aims to provide a suite of tools for extreme value threshold estimation and uncertainty quantification, following the review paper by Scarrott and MacDonald (2013) and comparative simulation study by Hu (2013). At this time, it provides almost all of the univariate stationary extreme value mixture models proposed in the current literature. Similar to the evd package the classic quartet of model fit diagnostics (probability, quantile, return level and density plots) are implemented. The following threshold diagnostics are implemented:

  • mean residual life plot
  • threshold stability plots
  • Hill plot and it's variants (altHill, smooHill, altsmooHill)
  • Pickand's plot

The package also includes a wide range of kernel density estimators, with cross-validation likelihood estimation. In particular, there are few publicly available boundary corrected kernel density estimation functions for R at this time, i.e. designed to cope with populations on bounded support (lower/upper or both). We have implemented a wide range of boundary corrected kernel density estimators, see full list below.

For each mixture model the usual distribution functions (density, cumulative distribution function, quantile and random number generation), likelihood, negative log-likelihood and maximum likelihood estimation (MLE) fitting routines are provided. The usual prefixes are used for these functions (d, p, q, r, l, nl and f respectively). For most mixture models these functions can take vectorised parameter inputs, allowing for evaluation of nonstationary forms of the mixture models. However, at this time the likelihood and MLE fitting functions can only take single parameter inputs.

Be Careful! Default Settings May Not Be Optimal...

By default the MLE based fitting procedures use black-box numerical optimisation of the complete likelihood (for all the model parameters), and initial values are calculated from the sample data wherever possible. Standard black-box optimisers of the complete log-likelihood of the the extreme value mixture models are very sensitive to the initial parameter values as there are often many local modes, which is an inherent and unavoidable feature of such models. They are particularly sensitive to the the initial value for the threshold.

The initial value for the threshold is set to be the 90% sample quantile of the data. The black-box optimisation of the complete likelihood may get stuck in a local mode close to this initial value. If one really needs to maximise the complete likelihood then you could attempt to overcome this problem by either using alternative optimisers (e.g. SANN is an option in the fitting functions) or manually try out different initial parameter vectors.

Profile likelihood estimation of the threshold using a grid search over potential thresholds seemingly overcomes the multi-modality problem, as the source of the problem seems to be the threshold parameter. A sequence of thresholds to be considered is specified using the "useq" fitting function input ("ulseq" and "urseq" for lower and upper tail thresholds in the two-tailed models). The threshold which maximises the profile likelihood can be either used as an initial value in the optimisation (default behaviour when "fixedu = FALSE"), or treated as a fixed threshold (when "fixedu = TRUE"). Given our current experience, we suggest that profile likelihood estimation of the threshold, which is then fixed, is the estimation approach of choice when using likelihood inference. We are considering changing this to the default approach, feedback on your experiences would be appreciated.

The user can also pre-choose the threshold as is used in the classical "fixed threshold approach" for extreme value threshold modelling. This is achieved by providing a single threshold in the "useq" input and set "fixedu=TRUE".

Downloads and Documentation

The current release version for is available here on CRAN.

There is a basic User Guide and full Reference Manual. A peer reviewed paper outlining the features in more detail is available from the Journal of Statistical Software here. All these are provided in the package directories (accessed via html help system for package in R).

Yang Hu wrote the original code in R and carried out a detailed simulation study all as part of his Masters thesis which is also available from here. Some of the code is based on functions written for MATLAB by Anna MacDonald who's thesis is also available here.

A published review of threshold estimation and uncertainty quantification techniques is available here which includes such mixture models: Scarrott and MacDonald (2012). A Review of Extreme Value Threshold Estimation and Uncertainty Quantification, REVSTAT 10(1), 33-60.

The first public showcase of the package was at the EVA2013 conference with the presentation here, which includes some extra examples of functionality and advice for practitioners and mixture model developers. The code in this presentation was included using the listing package for LaTeX can be simply copied straight into R or is available as a R script.

Version 0.2.0 was presented at the NZSA2013 conference with the presentation here, which includes some extra examples of functionality and advice for practitioners and mixture model developers and here is the R script.

Some new features (P-spline density estimator) added in version 0.2.4 was and invited presentation at the ERCIM2014 conference with the presentation here.

We created and continue to maintain the package in RStudio using Roxygen2 for some automation of the documentation creation. The raw source files used to create the current package are available here, in case these would of use to anyone wanting to create their own package using these tools.

Please Get in Touch...

We are really interested to hear about usage of the package and your experiences with it. So please get in touch with the maintainer Carl Scarrott

Future Updates

Future updates will likely include some Bayesian inference functionality using MCMC, thus avoiding the challenges associated with optimisation of the likelihood which can have lots of local modes as mentioned above.

Current Functionality

All functions are visible to the user, so no functionality is hidden. The functions which are designed for the user to call directly are fully documented and have detailed input error checking. However, some functions are designed for internal use and so are undocumented (and no checking of user inputs is carried out). These are fully user visible and useable, but the user must know what they are doing before using these functions. All functions are currently written as raw R code, so there is an inherent performance penalty for this. We decided to go with fully transparent code for the average user. If you are dealing with large datasets you should look at reformulating the code in a lower level language to make the computations more efficient.

The following standard parametric extreme value mixture models, with bulk model upto threshold and GPD above threshold are provided:

  • normal bulk and GPD tail (Behrens, et al, 2004; Cabras and Castellanos, 2011)
  • gamma bulk and GPD tail (Behrens, et al, 2004; Zheng et al. ,2014)
  • Weibull bulk and GPD tail (Behrens, et al, 2004; Cabras and Castellanos ,2011)
  • log-normal bulk and GPD tail (Solari and Losada, 2004; Scollnik, 2007; Nadarajah and Baka, 2012)
  • beta bulk and GPD tail (MacDonald, 2012)
  • normal bulk and GPD for upper and lower tail (Zhao, et al, 2010; So and Chan, 2013; Oumow et al, 2012; Mendes and Lopes, 2004)

The gamma bulk includes the exponential (Wong and Li, 2010) as a special case. Various authors (e.g. Teodorescu and Vernicu, 2013; Scollnik, 2007; Nadarajah and Baka, 2012) have considered many of the above parametric mixtures with the GPD tail constrained to be a Pareto type tail (positive shape parameter, xi > 0).

The following semi-parametric extreme value mixture model is provided:

  • mixture of normals bulk and GPD tail
  • mixture of gammas bulk and GPD tail (Nascimento, Gamerman and Lopes, 2011)

which uses the EM algorithm for iterative estimation of the mixture weights and component parameters. Again, the mixture of gammas includes the mixture of exponentials as a special case (Lee et al, 2012). The standalone "mixture of gammas" model (no extreme value tail modelling) has also been implemented using the EM algorithm.

The following nonparametric extreme value mixture models are provided:

  • kernel density estimate for bulk and GPD tail, using cross-validation likelihood used for KDE component (MacDonald et al., 2011; Gonzalez et al. ,2013))
  • P-splines density estimate for bulk and GPD tail
  • boundary corrected kernel density estimate for bulk and GPD tail for populations with bounded support (MacDonald, 2012 and MacDonald et al., 2013)
  • kernel density estimate for bulk and GPD for upper and lower tails (MacDonald et al., 2011)
  • P-splines density estimate for bulk and GPD for upper and lower tails

Additional versions of the above models have also been implemented with an extra constraint of continuity at the threshold.

The following nonstandard parametric extreme value mixture models are also implemented:

  • hybrid Pareto, with continuity and continuity in first derivative at threshold (Carreau and Bengio, 2008)
  • hybrid Pareto, with only continuity at threshold (Hu, 2013)
  • dynamically weighted mixture
  • interval transitions function mixture model with normal or Weibull and single GPD tail, and normal with both GPD tails (Holden and Haug, 2009)

Users should be aware that the hybrid Pareto based models have some unexpected properties and frequent performance issues in practice, as they do not include the tail fraction scaling of the GPD tail components, which is usually included in classical extreme value tail modelling approaches.

Generic kernel density estimation (without tail modelling) using a normal density for kernels is provided. A wide range of boundary correction kernel density estimation methods are provided (with and without tail modelling):

  • simple boundary correction method using generalised jacknifing (Jones, 1993)
  • cut and normalise (Gasser and Muller, 1979)
  • renormalisation (first order bias correction of Diggle, 1985)
  • reflection (Boneva, Kendall and Stefanov, 1971)
  • log transformation (Marron and Ruppert, 1992)
  • beta kernels (and modified version of Chen, 1999)
  • gamma kernels (and modified version of Chen, 2000)
  • copula kernels (Jones and Henderson, 2007)

The simple boundary correction method may provide negative density estimates, so the non-negative density transformation method of Jones and Foster (1996) is applied by default. The simple boundary correction method, beta and gamma boundary corrected KDEs also require normalisation to unity, which is also applied by default. Cross-validation likelihood is used in the MLE fitting for all the KDE based models. However, we have found this has a positive bias in the bandwidth estimates for generalised jacknifing, leading to oversmoothing. We use an alternative like the reflection or renormalisation method in the MLE of the bandwidth in this case.

A P-splines based nonparametric density estimator (based on the approach of Eilers and Marx, 1998 in Statistical Science) is also provided for wider usage for non-extreme value problems.

All the implemented extreme value mixture models, except the non-standard ones listed above (hybrid Pareto, dynamically weighted mixture and interval transition models), use the GPD as a conditional model (i.e. describe the exceedances, conditional on them being above the threshold). Therefore, when used in the mixture model the required unconditional GPD is obtained by application of the probability of being above the threshold, called the tail fraction. All the relevant mixture model functions permit the user to choose between treating the tail fraction as an extra parameter to be estimated (for which the MLE is the sample proportion of exceedances) or to parameterise it according to the bulk model parameters. See Yang Hu's thesis above for definitions and discussion.

The diagnostic plots and base GPD distribution functions are heavily based on those of the evd package, so the syntax is reasonably consistent and users can safely interchange most code for these.

Caveat

The package is still in its early developmental phase, so you can expect some changes in the functionality, and full backward compatibility is not guaranteed. We will continue to add functionality in each release. If you find bugs then do let us know (carl.scarrott@canterbury.ac.nz), but always provide a reproducible example.

If you wish to extend the functionality of the package with your own relevant code, then you are welcome to submit it to us and we try our best to include it, with acknowledgment of authorship of course!

Various researchers have submitted bug reports which are appreciated and acknowledged in the help files, keep them coming...