Abstract
A mixture of multiple scaled generalized hyperbolic distributions (MMSGHDs) is introduced. Then, a coalesced generalized hyperbolic distribution (CGHD) is developed by joining a generalized hyperbolic distribution with a multiple scaled generalized hyperbolic distribution. After detailing the development of the MMSGHDs, which arises via implementation of a multi-dimensional weight function, the density of the mixture of CGHDs is developed. A parameter estimation scheme is developed using the ever-expanding class of MM algorithms and the Bayesian information criterion is used for model selection. The issue of cluster convexity is examined and a special case of the MMSGHDs is developed that is guaranteed to have convex clusters. These approaches are illustrated and compared using simulated and real data. The identifiability of the MMSGHDs and the mixture of CGHDs are discussed in an appendix.
Similar content being viewed by others
References
Aitken, A.C. (1926). A series formula for the roots of algebraic and transcendental equations. Proceedings of the Royal Society of Edinburgh, 45, 14–22.
Altman, E. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. Journal of Finance, 23(4), 589–609.
Andrews, J.L., & McNicholas, P.D. (2011a). Extending mixtures of multivariate t-factor analyzers. Statistics and Computing, 21(3), 361–373.
Andrews, J.L., & McNicholas, P.D. (2011b). Mixtures of modified t-factor analyzers for model-based clustering, classification, and discriminant analysis. Journal of Statistical Planning and Inference, 141(4), 1479–1486.
Andrews, J.L., & McNicholas, P. (2012). Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions. Statistics and Computing, 22(5), 1021–1029.
Azzalini, A., Browne, R.P., Genton, M.G., McNicholas, P.D. (2016). On nomenclature for, and the relative merits of, two formulations of skew distributions. Statistics and Probability Letters, 110, 201–206.
Baek, J., & McLachlan, G.J. (2011). Mixtures of common t-factor analyzers for clustering high-dimensional microarray data. Bioinformatics, 27, 1269–1276.
Banfield, J.D., & Raftery, A.E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49(3), 803–821.
Barndorff-Nielsen, O. (1978). Hyperbolic distributions and distributions on hyperbolae. Scandinavian Journal of Statistics, 5(3), 151–157.
Barndorff-Nielsen, O., Kent, J., Sørensen, M. (1982). Normal variance-mean mixtures and z distributions. International Statistical Review / Revue Internationale de Statistique, 50(2), 145–159.
Böhning, D., Dietz, E., Schaub, R., Schlattmann, P., Lindsay, B. (1994). The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family. Annals of the Institute of Statistical Mathematics, 46, 373–388.
Browne, R.P., & McNicholas, P.D. (2014). Estimating common principal components in high dimensions. Advances in Data Analysis and Classification, 8(2), 217–226.
Browne, R.P., & McNicholas, P.D. (2015). A mixture of generalized hyperbolic distributions. Canadian Journal of Statistics, 43(2), 176–198.
Celeux, G., & Govaert, G. (1995). Gaussian parsimonious clustering models. Pattern Recognition, 28(5), 781–793.
Charytanowicz, M., Niewczas, J., Kulczycki, P., Kowalski, P.A., Łukasik, S., żak, S. (2010). A complete gradient clustering algorithm for features analysis of x-ray images. In Piȩtka, E., & Kawa, J. (Eds.) Information Technologies in Biomedicine, (Vol. 2 pp. 15–24). Berlin: Springer.
Cook, R.D., & Weisberg, S. (1994). An Introduction to Regression Graphics. New York: Wiley.
Cormack, R.M. (1971). A review of classification (with discussion). Journal of the Royal Statistical Society: Series A, 34, 321–367.
Debreu, G., & Koopmans, T.C. (1982). Additively decomposed quasiconvex functions. Mathematical Programming, 24(1), 1–38.
Demarta, S., & McNeil, A.J. (2005). The t copula and related copulas. International Statistical Review, 73(1), 111–129.
Dempster, A.P., Laird, N.M., Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1), 1–38.
Flury, B., & Riedwyl, H. (1988). Multivariate Statistics: A Practical Approach. London: Chapman & Hall.
Forbes, F., & Wraith, D. (2014). A new family of multivariate heavy-tailed distributions with variable marginal amounts of tailweights: Application to robust clustering. Statistics and Computing, 24(6), 971–984.
Fraley, C., & Raftery, A.E. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458), 611–631.
Franczak, B.C., Browne, R.P., McNicholas, P.D. (2014). Mixtures of shifted asymmetric Laplace distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6), 1149–1157.
Franczak, B.C., Tortora, C., Browne, R.P., McNicholas, P.D. (2015). Unsupervised learning via mixtures of skewed distributions with hypercube contours. Pattern Recognition Letters, 58(1), 69–76.
Gallaugher, M.P.B., & McNicholas, P.D. (2018). Finite mixtures of skewed matrix variate distributions. Pattern Recognition, 80, 83–93.
Gallaugher, M.P.B., & McNicholas, P.D. (2019a). On fractionally-supervised classification: weight selection and extension to the multivariate t-distribution. Journal of Classification 36. In press.
Gallaugher, M.P.B., & McNicholas, P.D. (2019b). Three skewed matrix variate distributions. Statistics and Probability Letters, 145, 103–109.
Ghahramani, Z., & Hinton, G.E. (1997). The EM algorithm for factor analyzers Technical Report CRG-TR-96-1. Toronto: University Of Toronto.
Gneiting, T. (1997). Normal scale mixtures and dual probability densities. Journal of Statistical Computation and Simulation, 59(4), 375–384.
Hennig, C. (2015). What are the true clusters? Pattern Recognition Letters, 63, 53–62.
Holzmann, H., Munk, A., Gneiting, T. (2006). Identifiability of finite mixtures of elliptical distributions. Scandinavian Journal of Statistics, 33, 753–763.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.
Hunter, D.R., & Lange, K. (2000). Quantile regression via an MM algorithm. Journal of Computational and Graphical Statistics, 9(1), 60–77.
Karlis, D., & Santourian, A. (2009). Model-based clustering with non-elliptically contoured distributions. Statistics and Computing, 19(1), 73–83.
Kent, J.T. (1983). Identifiability of finite mixtures for directional data. The Annals of Statistics, 11, 984–988.
Kiers, H.A. (2002). Setting up alternating least squares and iterative majorization algorithms for solving various matrix optimization problems. Computational Statistics and Data Analysis, 41(1), 157–170.
Kotz, S., Kozubowski, T.J., Podgorski, K. (2001). The Laplace Distribution and Generalizations: A Revisit with Applications to Communications, Economics, Engineering, and Finance 1st edn: Burkhauser Boston.
Kotz, S., & Nadarajah, S. (2004). Multivariate t-distributions and their applications. Cambridge: Cambridge University Press.
Lee, S.X., & McLachlan, G.J. (2013a). EMMIXuskew: fitting unrestricted multivariate skew t Mixture Models. R package version 0.11–5.
Lee, S.X., & McLachlan, G.J. (2013b). On mixtures of skew normal and skew t-distributions. Advances in Data Analysis and Classification, 7(3), 241–266.
Lee, S.X., & McLachlan, G.J. (2014). Finite mixtures of multivariate skew t-distributions: some recent and new results. Statistics and Computing, 24(2), 181–202.
Lin, T.I. (2009). Maximum likelihood estimation for multivariate skew normal mixture models. Journal of Multivariate Analysis, 100(2), 257–265.
Lin, T.I. (2010). Robust mixture modeling using multivariate skew t distributions. Statistics and Computing, 20(3), 343–356.
Lin, T.-I., McNicholas, P.D., Hsiu, J.H. (2014). Capturing patterns via parsimonious t mixture models. Statistics and Probability Letters, 88, 80–87.
Lindsay, B. (1995). Mixture models: Theory, geometry and applications. In NSF-CBMS Regional Conference Series in Probability and Statistics, Vol. 5. California: Institute of Mathematical Statistics: Hayward.
McLachlan, G.J., & Peel, D. (2000). Mixtures of factor analyzers. In Proceedings of the Seventh International Conference on Machine Learning (pp. 599–606). San Francisco: Morgan Kaufmann.
McLachlan, G.J., Bean, R.W., Jones, L. B. -T. (2007). Extension of the mixture of factor analyzers model to incorporate the multivariate t-distribution. Computational Statistics and Data Analysis, 51(11), 5327–5338.
McLachlan, G.J., & Krishnan, T. (2008). The EM Algorithm and Extensions. New York: Wiley.
McNeil, A.J., Frey, R., Embrechts, P. (2005). Quantitative risk management: concepts, techniques and tools. Princeton: Princeton University Press.
McNicholas, P.D. (2016a). Mixture Model-Based Classification. Boca-Raton: Chapman & Hall/CRC press.
McNicholas, P.D. (2016b). Model-based clustering. Journal of Classification, 33 (3), 331–373.
McNicholas, P.D., Murphy, T.B., McDaid, A.F., Frost, D. (2010). Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models. Computational Statistics and Data Analysis, 54(3), 711–723.
McNicholas, S.M., McNicholas, P.D., Browne, R.P. (2017). A mixture of variance-gamma factor analyzers. In Ahmed, S. E. (Ed.) Big and Complex Data Analysis: Methodologies and Applications (pp. 369–385). Cham: Springer International Publishing.
Murray, P.M., Browne, R.B., McNicholas, P.D. (2014). Mixtures of skew-t factor analyzers. Computational Statistics and Data Analysis, 77, 326–335.
Murray, P.M., Browne, R.B., McNicholas, P.D. (2017). Hidden truncation hyperbolic distributions, finite mixtures thereof, and their application for clustering. Journal of Multivariate Analysis, 161, 141–156.
Niculescu, C., & Persson, L. (2006). Convex Functions and Their Applications. New York: Springer.
Ortega, J.M., & Rheinboldt, W.C. (1970). Iterative Solutions of Nonlinear Equations in Several Variables. New York: Academic Press.
Peel, D., & McLachlan, G.J. (2000). Robust mixture modelling using the t distribution. Statistics and Computing, 10(4), 339–348.
Pesevski, A., Franczak, B.C., McNicholas, P.D. (2018). Subspace clustering with the multivariate-t distribution. Pattern Recognition Letters, 112(1), 297–302.
R Core Team. (2017). R: A Language and Environment for Statistical Computing Vienna. Austria: R Foundation for Statistical Computing.
Rand, W.M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336), 846–850.
Rockafellar, R.T., & Wets, R.J.B. (2009). Variational Analysis. New York: Springer.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.
Steane, M.A., McNicholas, P.D., Yada, R. (2012). Model-based classification via mixtures of multivariate t-factor analyzers. Communications in Statistics – Simulation and Computation, 41(4), 510–523.
Steinley, D. (2004). Properties of the Hubert-Arable adjusted Rand index. Psychological methods, 9(3), 386.
Tang, Y., Browne, R.P., McNicholas, P.D. (2018). Flexible clustering of high-dimensional data via mixtures of joint generalized hyperbolic distributions. Stat, 7 (1), e177.
Tipping, M.E., & Bishop, C.M. (1999). Mixtures of probabilistic principal component analysers. Neural Computation, 11(2), 443–482.
Tortora, C., Franczak, B.C., Browne, R.P., McNicholas, P.D. (2014). Mixtures of multiple scaled generalized hyperbolic distributions. arXiv:1403.2332v1.
Tortora, C., Browne, R.P., Franczak, B.C., McNicholas, P.D. (2017). MixGHD: model based clustering, classification and discriminant analysis using the mixture of generalized hyperbolic distributions. R package version 2.1.
Vrbik, I., & McNicholas, P.D. (2012). Analytic calculations for the EM algorithm for multivariate skew-mixture models. Statistics and Probability Letters, 82(6), 1169–1174.
Vrbik, I., & McNicholas, P.D. (2014). Parsimonious skew mixture models for model-based clustering and classification. Computational Statistics and Data Analysis, 71, 196–210.
Vrbik, I., & McNicholas, P.D. (2015). Fractionally-supervised classification. Journal of Classification, 32(3), 359–381.
Wei, Y., Tang, Y., McNicholas, P.D. (2019). Mixtures of generalized hyperbolic distributions and mixtures of skew-t distributions for model-based clustering with incomplete data. Computational Statistics and Data Analysis, 130, 18–41.
Wraith, D., & Forbes, F. (2015). Clustering using skewed multivariate heavy tailed distributions with flexible tail behaviour. arXiv:http://arXiv.org/abs/1408.0711.
Yakowitz, S.J., & Spragins, J. (1968). On the identifiability of finite mixtures. Annals of Mathematical Statistics, 39, 209–214.
Acknowledgements
The authors are grateful to two anonymous reviewers and the Editor for helpful comments that have improved this manuscript. This work was supported by a grant-in-aid from Compusense Inc., a Collaborative Research and Development Grant from the Natural Sciences and Engineering Research Council of Canada, and the Canada Research Chairs program. The work on the introduction of the MMSGHD presented herein was first made publicly available as an arXiv preprint (Tortora et al. 2014).
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1: Parameter Estimation
We use the EM algorithm to estimate the parameters of the MCGHDs. The EM algorithm belongs to a larger class of algorithms known as MM algorithms (Ortega and Rheinboldt 1970; Hunter and Lange 2000) and is well-suited for problems involving missing data. “MM” stands for “minorize-maximize” or “majorize-minimize,” depending on the purpose of the algorithm; in the EM context, the minorizing function is the expected value of the complete-data log-likelihood. The EM algorithm iterates between two steps, an E-step and a M-step, and has been used to estimate the parameters of mixture models in many experiments (McLachlan and Krishnan 2008). On each E-step, the expected value of the complete-data log-likelihood, \(\mathcal {Q}\), is calculated and on each M-step is maximized with respect to πg, μg, Φg, αg, ωg, λg, ω0g, λ0g, ϖg. However, in each M-step \(\mathcal {Q}\) increases with respect to Γg rather than maximize; accordingly, the algorithm is formally a generalized EM (GEM) algorithm. For our MCGHDs, there are four sources of missing data: the latent variable W0ig, the multi-dimensional weight variable Δwig, the group component indicator labels zig, and inner component labels uig, for i = 1, … , n and g = 1, … , G. For each observation i, zig = 1 if observation i is in component g and zig = 0 otherwise. Similarly, for each observation i, uig = 1 if observation i, in component g, is distributed generalized hyperbolic and uig = 0 if observation i, in component g, is distributed multiple scaled generalized hyperbolic. It follows that the complete-data log-likelihood for the MCGHDs is given by
where ϕp(⋅) represents a p-dimensional Gaussian density function, ϕ1(⋅) is a unidimensional Gaussian density function, and h(⋅) is the density of a GIG distribution given in Eq. 15.
We are now prepared to outline the calculations for our GEM algorithm for the MCGHDs. On the E-step, the expected value of the complete-data log-likelihood,\( \mathcal {Q}\), is computed by replacing the sufficient statistics of the missing data by their expected values. For each component indicator label zig and inner component label uig, for i = 1, … , n and g = 1, … , G, we require the expectations
and
where fCGHD is given in Eq. 20, fGHD is given in Eq. 10 and fMSGHD is given in Eq. 17. For the latent variable W0ig, we use the expected value given in Browne and McNicholas (2015). The authors show that, given the density in Eq. 15, the following is true
For the MCGHDs, the maximization of \(\mathcal {Q}\) requires the expected values of W0ig, \(W^{-1}_{0ig}\), and log W0ig, i.e.,
where \(d_{g}= \omega _{0g}+\boldsymbol {\alpha }_{g}^{\prime } (\boldsymbol {\Gamma }_{g}\boldsymbol {\Phi }_{g}\boldsymbol {\Gamma }^{\prime }_{g})^{-1} \boldsymbol {\alpha }_{g}\) and \(e_{ig}= \omega _{0g}+\delta (\mathbf {x}_{i}, \boldsymbol {\mu }_{g} \mid \boldsymbol {\Gamma }_{g}\boldsymbol {\Phi }_{g}\boldsymbol {\Gamma }^{\prime }_{g})\).
The maximization of \(\mathcal {Q}\) also requires the expected values of the multidimensional weight variables Δwig, \(\boldsymbol {\Delta }_{\mathbf {w} ig}^{-1}\), and log Δwig. Given the density in Eq. 17, it follows that
Each multidimensional weight variable is replaced by its expected value and so we need to compute E1ig = diag{E1i1g, … , E1ipg}, E2ig = diag{E2i1g, … , E2ipg}, and E3ig = diag{E3i1g, … , E3ipg}, where
\(\bar {d}_{jg} = \omega _{jg} + \alpha _{jg}^{2}{\Phi }_{jg}^{-1}\) and \(\bar {e}_{ijg} =\omega _{jg} + ([\mathbf {x}_{i} -\boldsymbol {\mu }_{g}]_{j} )^{2}/\phi _{jg}\). Let \(n_{g}={\sum }_{i = 1}^{n} \hat z_{ig}\), \(A_{g}=(1/n_{g}){\sum }_{i = 1}^{n} \hat z_{ig}a_{ig}\), \(B_{g}=(1/n_{g}){\sum }_{i = 1}^{n} \hat z_{ig}b_{ig}\), \(C_{g}=(1/n_{g}){\sum }_{i = 1}^{n} \hat z_{ig}c_{ig}\), \({\bar {E}}_{1jg}=(1/n_{g}){\sum }_{i = 1}^{n} \hat z_{ig}{ E}_{1ijg}\), \({\bar {E}}_{2jg}=(1/n_{g}){\sum }_{i = 1}^{n} \hat z_{ig}{ E}_{2ijg}\), and \({\bar {E}}_{3jg}=(1/n_{g}){\sum }_{i = 1}^{n} \hat z_{ig} { E}_{3ijg}\).
In the M-step, we maximize the expected value of the complete-data log-likelihood with respect to the model parameters. The mixing proportions and inner mixing proportions are updated via \(\hat {\pi }_{g}=n_{g}/n\) and \(\hat {\varpi }_{g}={{\sum }_{i = 1}^{n} \hat {u}_{ig} \hat {z}_{ig}}/{n_{g}}\), respectively. The elements of the location parameter μg and skewness parameter αg are replaced with
respectively, where \([\boldsymbol {\Gamma }_{g}^{\prime } \mathbf {x}_{i}]_{j}\) is the j th element of the matrix \(\boldsymbol {\Gamma }_{g}^{\prime } \mathbf {x}_{i}\), \(s_{1ijg}= \hat {u}_{ig}a_{ig}+\left (1- \hat {u}_{ig} \right ) { E}_{1ijg}\), \(s_{2ijg} =\hat {u}_{ig}b_{ig}+\left (1- \hat {u}_{ig} \right ) { E}_{2ijg}\), \(\bar {s}_{1jg}= 1/n_{g}{\sum }_{i = 1}^{n} \hat {z}_{ig}s_{1ijg},\bar {s}_{2jg}= 1/n_{g}{\sum }_{i = 1}^{n} \hat {z}_{ig}s_{2ijg}\). The diagonal elements of the matrix Φg are updated using
To update the component eigenvector matrices Γg, we wish to minimize the objective function
with respect to Γg, where \(\mathbf V_{ig} = \hat {u}_{ig} b_{ig}\mathbf {I}_{p} + (1-\hat {u}_{ig})\textbf {E}_{2ig}\). We employ an optimization routine that uses two simpler majorization-minimization algorithms. Our optimization routine exploits the convexity of the objective function in Eq. 24, providing a computationally stable algorithm for estimating Γg. Specifically, we follow Kiers (2002) and Browne and McNicholas (2014) and use the surrogate function
where C is a constant that does not depend on Γg, r ∈ {1, 2} is an index, and the matrices Frg are defined in Eqs. 26 and 27.
Therefore, on each M-step, we calculate either
or
where α1ig is the largest eigenvalue of the diagonal matrix \( \boldsymbol {\Phi }_{g}^{-1}\mathbf V_{ig}\), and α2ig is equal to \(\hat {z}_{ig}\mathbf {x}_{i}^{\prime } \mathbf {x}_{i}\), which is the largest eigenvalue of the rank-1 matrix \(\hat {z}_{ig}\mathbf {x}_{i} \mathbf {x}_{i}^{\prime }\). Following this, we compute the singular value decomposition of Frg given by
It follows that our update for Γg is given by
The p-dimensional concentration and index parameters, i.e., ωg and λg, are estimated by maximizing the function
This leads to
and
where the superscript “prev” denotes that the estimate from the previous iteration is used. The univariate parameters ω0g and λ0g are estimated by maximizing the function
giving
and
Our GEM algorithm is iterated until convergence, which is determined using the Aitken acceleration (Aitken 1926). Formally, the Aitken acceleration is given by
where l(k) is the value of the log-likelihood at the iteration k and
is an asymptotic estimate of the log-likelihood on iteration k + 1. The algorithm can be considered to have converged when \(l^{(k + 1)}_{\infty }-l^{(k)}< \epsilon \), provided this difference is positive (Böhning et al. 1994; Lindsay 1995; McNicholas et al. 2010). Herein, we set 𝜖 = 0.01. When the algorithm converges, we compute the maximum a posteriori (MAP) classification values using the posterior \(\hat {z}_{ig}\), where \(\text {MAP}\left \{\hat {z}_{ig}\right \}= 1\) if \(g=\arg \max _{h}\left \{\hat {z}_{ih}\right \}\), and \(\text {MAP}\left \{\hat {z}_{ig}\right \}= 0\) otherwise.
Appendix 2: Quasi-Concavity of the cMSGHD
In essence, we might want to consider only densities whose contours contain a set of points that are convex. Formally, such densities are quasi-concave. Extensive details on quasi-concavity, quasi-convexity, and related notions are given by Niculescu and Persson (2006) and Rockafellar and Wets (2009).
Definition 1
A function f(x) is quasi-concave if each upper-level set Uα(f) = {x|f(x) ≥ α} is convex, for \(\alpha \in \mathbb {R}\).
Definition 2
A function f(x) is quasi-convex if each sub-level set Sα(f) = {x|f(x) ≤ α} is convex, for \(\alpha \in \mathbb {R}\).
Lemma 1
The class of elliptical distributions, whose density functions have theform
are quasi-concave if the generator function, g, is monotonic non-increasing.
Proof
Result follows from the fact that if the function δ (x, μ|Σ) is convex since Σ is positive definite and the function g is monotonic non-increasing, then the function f(x) is quasi-concave. □
Theorem 1
The generalized hyperbolic distribution (GHD) is quasi-concave.
Proof
It is straightforward to show that the function
is convex, where a and b are positive constants, and δ (x, μ|Σ) is the Malahanobis distance between x and μ. Let τ = λ − p/2. Then, the function
where \(z\in \mathbb {R}^{+}\), and Kτ is the modified Bessel function of the third kind with index τ, is monotonic decreasing (or non-increasing) because the first derivative
is negative for all \(\tau \in \mathbb {R}\) and z > 0. In addition to being monotonic decreasing, k(z) is convex for τ < 1/2, concave and convex (linear) for τ = 1/2, and concave for τ > 1/2. Because k(z) is a monotonic function, it satisfies the criteria for quasi-convexity and quasi-concavity, so it is simultaneously quasi-convex and quasi-concave. In this context, monotone functions are also known as quasi-linear or quasi-montone.
Recall that if the function U is quasi-convex and the function g is decreasing, then the function f(x) = g(U(x)) is quasi-concave. It follows that the composition k(h(x)) is quasi-concave. Consider the skewness part of the GHD density function, i.e., a(x) = −(x −μ)′Σ− 1α, which is a linear function. It follows that the function
is also quasi-concave, and the result follows from the fact that Eq. 30 is proportional to the density of the GHD. □
Theorem 2
The convex multiple scaled generalized hyperbolic distribution(cMSGHD) is quasi-concave. In other words, the multiple scaledgeneralized hyperbolic distribution (MSGHD) is quasi-concave provided thatλj > 1 for allj = 1, … , p.
Proof
A p-dimensional multiple scaled distribution is a product of p independent univariate densities. The density of the MSGHD has form
where g1(xj|θj) is the density of the univariate hyperbolic distribution with parameters θj, j = 1, … , p. From Theorem 1, log g1(xj|θj) is a concave function for τj > 1/2, i.e., for λj > 1 (because p = 1). Therefore, the function
is concave provided that λj > 1 for all j = 1, … , p. Therefore, the function
is quasi-concave provided that λj > 1 for all j = 1, … , p. □
Note that addition does not preserve quasi-convexity or quasi-concavity. The sum of two quasi-convex functions defined on different domains will be quasi-concave if they are additively decomposed (see Debreu and Koopmans 1982). Debreu and Koopmans (1982) give necessary and sufficient conditions for the sum f of a set of functions f1, … , fm to be additively decomposed. These conditions depend on the convexity index c(f) in which f is quasi convex if and only if either of the following hold: (i) c(fi) ≥ 0 for every i, or (ii) c(fj) < 0 for some j, c(fi) > 0 for every i ≠ j, and \({\sum }_{i = 1}^{m} \frac {1}{c(f_{i})} \le 0\). For differentiable functions, the convexity index satisfies the inequality f″(x)/[f′(x)]2 ≥ c(f).
We have that a sufficient condition for the MSGHD to be quasi-concave is that all λj > 1. Furthermore, a sufficient condition for the MSGHD not to be quasi-concave is that all λj < 1 and finite. Interestingly, this means the multiple scaled t-distribution cannot provide convex level sets for any finite degrees of freedom. For large degrees of freedom, the multiple scaled t-distribution will behave similarly to a normal distribution near the mode; however, as one moves away from the mode, non-convex contours will be encountered. Finally, note that Debreu and Koopmans (1982) give necessary and sufficient conditions that suggest a quasi-concave MSGHD with some λj positive and others negative is possible, but going this route would greatly complicate the estimation procedure.
Appendix 3: Finite Mixture Identifiability
In this section, we consider the notion of identifiability for finite mixtures of MSGHDs and coalesced generalized hyperbolic distributions (CGHDs). Herein, we take the term identifiability to mean finite mixture identifiability.
3.1 3.1 Background
Holzmann et al. (2006) prove identifiability of finite mixtures of elliptical distributions. They state that “finite mixtures are said to be identifiable if distinct mixing distributions with finite support correspond to distinct mixtures.” A finite mixture of the densities fp(x|Ψ1), … , fp(x|ΨG) is identifiable if the family \(\left \{ f_{p}(\mathbf {x}|\boldsymbol {\Psi }) : \boldsymbol {\Psi } \in \mathcal {A}^{p} \right \}\) is linearly independent. The founding work on finite mixture identifiability is by Yakowitz and Spragins (1968), who state that this linear independence is a necessary and sufficient condition for identifiability.
The GHD can be expressed as a normal variance-mean mixture. The stochastic relationship of the normal variance-mean mixture is given by
where \(\mathbf {U} \backsim \mathcal {N}_{p}(\mathbf {0}, \boldsymbol {\Sigma })\) and W, independent of U, is a positive univariate random variable with density h(w|θ). Browne and McNicholas (2015) proved identifiability for finite mixtures of GHDs through additivity of disjoint sets of identifiable distributions.
Definition 3
In the present context, a finite mixture of the multiple scale distributions f(x|θ1), … , f(x|θG) is identifiable if
for \(\mathbf {x} \in \mathbb {R}^{p}\), where G is a positive integer, \({\sum }_{g = 1}^{G} \pi _{g} = {\sum }_{g = 1}^{G} \pi _{g}^{\star } = 1\) and \(\pi _{g}, \pi _{g}^{\star } > 0\) for g = 1, … , G, implies that there exists a permutation σ such that (πg, θg) = (πσ(g), θσ(g)) for all g.
Browne and McNicholas (2015) prove identifiability for normal variance-mean mixtures, which includes the generalized hyperbolic. Here, we view the results from a different vantage point to illustrate the concepts required for the identifiability of the multiple scaled distributions. We begin by noting the characteristic function for the generalized hyperbolic arises from the characteristic function of the normal variance-mean mixture,
where
The characteristic function for the generalized hyperbolic is
In the context of a coalesced distribution, with a eigen-decomposed scale matrix, the characteristic function is
Now, we let v = tz and obtain
To prove identifiability of the generalized hyperbolic, we could now use the results from Browne and McNicholas (2015) and Yakowitz and Spragins (1968, p. 211) that implies there exists z such that the tuple \((\mathbf {z}^{\prime } \boldsymbol {\Sigma }_{g} \mathbf {z}, \boldsymbol {\beta }_{g}^{\prime } \mathbf {z}, \mathbf {z}^{\prime }\boldsymbol {\mu }_{g})\), where \(\boldsymbol {\Sigma }_{g}= \boldsymbol {\Gamma }_{g} \boldsymbol {\Phi }_{g} \boldsymbol {\Gamma }_{g}^{\prime }\) is unique for all g = 1, … , G, allows a reduction to the univariate case. Now, we rewrite the term z′Σgz as
which implies the tuple
is unique for all g = 1, … , G. A similar argument indicates there exists a z such that the tuple
is unique. In fact, a more general statement indicates that there exists a z such that the tuple
is unique for monotonic \(\varphi : \mathbb {R}^{+} \mapsto \mathbb {R}^{+} \). Deriving this unique set of tuples facilitates the reduction to the univariate case. This is useful because the univariate generalized hyperbolic density is identifiable (see Browne and McNicholas2015).
3.2 3.2 Identifiability of a Finite Mixture of Multiple Scaled Distributions
For a multiple scaled distribution, we only need to find a single direction where the distribution is finite mixture identifiable because, as noted in Remark 2 of Kent (1983), a distribution might be non-identifiable on a subset of \(\mathbb {R}^{p}\) but identifiability can endure over \(\mathbb {R}^{p}\). In other words, for a distribution to be non-identifiable, a linear combination has to be equal to zero for all \(x \in \mathbb {R}^{p}\). This is illustrated by the example given in Kent (1983):
“the polynomials P(x1, x2) = 1 and \(P(x_{1}, x_{2}) = ({x_{1}^{2}}+ x_{2})^{3}\), \(x\in \mathbb {R}^{2}\), are equal on the unit circle, but are not the same on all of \(\mathbb {R}^{2}\).”
As a consequence, if a multivariate distribution is identifiable in some direction then it is identifiable over \(\mathbb {R}^{p}\).
To begin, consider that if there is at one least direction or column of Γg that is equal across g = 1, … , G, then the identifiability of a multiple scaled distribution follows from the identifiability of the univariate distribution. Whereas if one column of Γg is unequal, that implies, by the nature of orthonormal matrices, that two columns of Γg are unequal. We will now illustrate how the bivariate multiple scaled distribution is identifiable, which implies identifiability for finite p.
When Γg differ, the identifiability of the multiple scaled distribution depends on the behavior of the multiple scaled distribution’s density and moment generating functions when we consider moving along directions other than the columns of Γg. For example, a bivariate multiple scaled t-distribution behaves (by definition) like a t-distribution with ν1 and ν2 degrees of freedom along each of it’s principal axes, but along any other direction, a bivariate multiple scaled t-distribution behaves asymptotically like a t-distribution with ν1 + ν2 degrees of freedom.
Consider the following three orthonormal matrices in the context of an eigen-decomposition of a matrix;
If we have equal eigenvalues then we cannot distinguish between Γ1 and Γ2. In the same way, if we have the same distribution along the first and second axis, we cannot distinguish between them. However, if we have eigenvalue ordering we can distinguish between Γ1 and Γ2, but eigenvalue ordering will not allow us to distinguish between Γ1 and Γ3, since they yield the same basis or set of directions. Therefore, in general, Γ is unique up to multiplication by
One way to establish uniqueness is to require the largest value of each column of Γ to be positive. An equivalent requirement is for Γ1 ≠ Γ2 which requires that
for j = 1, … , p, \(\mathbf {z} \in \mathbb {R}^{P}\), z ≠ 0p and R is a set of diagonal matrices such that diag(R) = (± 1, … , ± 1) excluding the identity matrix. Note that [a]j denotes the j th element of the vector a. However, if we had two orthonormal matrices such that \(\boldsymbol {\Gamma }_{1}^{\prime } \boldsymbol {\Gamma }_{2} = \mathbf {I}\), then Γ1 = Γ2. If \(\boldsymbol {\Gamma }_{1}^{\prime } \boldsymbol {\Gamma }_{2} = \mathbf {R}\), then our orthonormal condition amounts to \( \boldsymbol {\Gamma }_{1}^{\prime } = \boldsymbol {\Gamma }_{2}\) or equivalently, for all directions \(\mathbf {z} \in \mathbb {R}^{P}\) and z ≠ 0p
This prevents the j th column of Γ2 from being in the opposite direction of the j th column of Γ1. This form of the condition is easier to incorporate into the identifiability illustration.
In the MSGHD, if we consider moving the amount t in a direction z, which entails setting x = tz, we can write the density as
Note, if z is equal to the k th eigenvector, which is the k th column of Γ, then the density reduces to
where
Therefore, the density is simply proportional to
First, note that if the parameterizations are one-to-one, then if one parameterization is shown to be identifiable, the others are identifiable as well. Similar to Browne and McNicholas (2015), we let δj = βj/Φj, \(\alpha _{j} = \sqrt { \omega _{j}/{\Phi }_{j} + {\beta _{j}^{2}}/{{\Phi }_{j}^{2}}} \) and \(\kappa _{j} = \sqrt {{\Phi }_{j} \omega _{j}} \), where αj ≥|δj|. Under this reparameterization, we now have
For large z, the Bessel function can approximated by
which yields, using the alternative parameterization,
If z is not equal to the k th eigenvector, than, using the reparameterization given in Eq. 38, we have
The characteristic function for a multiple scaled distribution can be written as
which, under the alternative parameterization from Eq. 38, becomes
Now if we consider moving t in the direction z
and, for large t, the characteristic function is
Therefore, from the condition given in Eq. 34, there exists z such that the tuple \(({\sum }_{j = 1}^{P} \kappa _{j} \left | [ \boldsymbol {\Gamma }^{\prime } \mathbf {z} ]_{j} \right |, \mathbf {z}^{\prime } \boldsymbol {\Gamma } \boldsymbol {\mu } )\) is unique for all g = 1, … , G and reduces to the univariate hyperbolic distribution, which is identifiable.
3.3 3.3 Identifiability of the Coalesced Generalized Hyperbolic Distribution
To prove the identifiability of the CGHD, we only need to show that two sets of distributions, the multiple scaled and the generalized hyperbolic distribution are disjoint. Consider moving along the k th eigenvalue such that (λk, κk) is distinct from (λ0, κ0) and the proof easily follows from the identifiability of the univariate generalized hyperbolic distribution.
Rights and permissions
About this article
Cite this article
Tortora, C., Franczak, B.C., Browne, R.P. et al. A Mixture of Coalesced Generalized Hyperbolic Distributions. J Classif 36, 26–57 (2019). https://doi.org/10.1007/s00357-019-09319-3
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00357-019-09319-3