Comparison between K–Means and Hierarchical Clustering of Dependent and Independent Data Generated from Multivariate Gaussian Copula Function

Francesca Marta Lilja Di Lascio
Department of Statistics, University of Bologna, Italy

Warren J. Ewens
Department of Biology, University of Pennsylvania, USA

We use the multivariate Gaussian copula function ([3]) to evaluate how well the K–means algorithm ([2]) and the hierarchical method ([1]) identify clusters that correspond to the marginal probability functions, which are linked by the dependence structure of their joint distribution function through the copula.
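
For reference, the standard definition of the Gaussian copula and its link to the joint distribution (Sklar's theorem, see [3]) can be written as below. The symbols $d$, $R$, $\Phi$, $\Phi_R$ and $F_j$ are notation introduced here for illustration and do not appear in the abstract.

```latex
C_R(u_1,\dots,u_d) = \Phi_R\bigl(\Phi^{-1}(u_1),\dots,\Phi^{-1}(u_d)\bigr),
\qquad
F(x_1,\dots,x_d) = C_R\bigl(F_1(x_1),\dots,F_d(x_d)\bigr)
```

Here $\Phi$ is the standard normal distribution function, $\Phi_R$ is the $d$-variate standard normal distribution function with correlation matrix $R$, and $F_1,\dots,F_d$ are the margins of the joint distribution $F$.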

For both K–means and hierarchical clustering, we run simulations that vary (1) the sample size (small and large), (2) the value of the dependence parameter of the copula function, (3) the parameter values of the margins (well–separated, overlapping and distinct margins) and, finally, (4) the type of dispersion matrix (unstructured and exchangeable).
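
One arm of such a design can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used in the study: it assumes normal margins, a single non-negative exchangeable correlation `rho`, and a hypothetical helper name `sample_gaussian_copula`. The unstructured dispersion matrix is not covered, and the scenario values at the end are made up for the example.

```python
import random
from statistics import NormalDist


def sample_gaussian_copula(n, rho, margins, seed=0):
    """Draw n observations whose dependence follows a Gaussian copula.

    Assumptions (illustrative only): exchangeable correlation with
    0 <= rho <= 1, and each margin is a statistics.NormalDist.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("this sketch needs 0 <= rho <= 1")
    rng = random.Random(seed)
    std_normal = NormalDist()
    eps = 1e-12  # keeps u strictly inside (0, 1) so inv_cdf is defined
    rows = []
    for _ in range(n):
        # A shared factor gives every pair of components correlation rho.
        shared = rng.gauss(0.0, 1.0)
        row = []
        for margin in margins:
            z = rho ** 0.5 * shared + (1.0 - rho) ** 0.5 * rng.gauss(0.0, 1.0)
            u = min(max(std_normal.cdf(z), eps), 1.0 - eps)  # copula scale
            row.append(margin.inv_cdf(u))  # impose the chosen margin
        rows.append(row)
    return rows


# Hypothetical scenario: small sample, moderate dependence, well-separated margins.
well_separated = [NormalDist(0, 1), NormalDist(10, 1), NormalDist(20, 1)]
data = sample_gaussian_copula(n=50, rho=0.5, margins=well_separated, seed=1)
```

Each row of `data` is one draw from the joint distribution and each column follows one margin, so a clustering method can then be asked to recover which margin each value came from.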

We evaluate the performance of the two clustering methods by means of (1) the difference between the ‘real’ value of the dependence parameter and its value after clustering, (2) the percentage of iterations in which the number of observations in each cluster differs from the ‘real’ one, and (3) the ability to identify the exact probability model of the margins.
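
Criterion (2) can be computed as sketched below. The function name `pct_runs_with_wrong_sizes` is hypothetical, and the sketch assumes that cluster labels are arbitrary, so it compares the sorted cluster sizes of each iteration with the sorted ‘real’ sizes; the abstract does not say how labels were matched.

```python
from collections import Counter


def pct_runs_with_wrong_sizes(label_runs, true_sizes):
    """Percentage of iterations whose cluster sizes differ from the true ones.

    label_runs: one list of cluster labels per simulation iteration.
    true_sizes: the 'real' number of observations in each cluster.
    Assumption: labels carry no meaning, so only the multiset of sizes is compared.
    """
    target = sorted(true_sizes)
    wrong = sum(
        1 for labels in label_runs
        if sorted(Counter(labels).values()) != target
    )
    return 100.0 * wrong / len(label_runs)


# Toy input: four iterations, two clusters that should each hold two observations.
runs = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0], [0, 1, 1, 1]]
result = pct_runs_with_wrong_sizes(runs, true_sizes=[2, 2])  # 50.0
```

An iteration in which a cluster comes back empty yields fewer size counts than expected and is therefore counted as wrong.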

We find that the hierarchical method works well if the margins are clearly distinct, irrespective of cluster size, while K–means works well only if the sample is small. The performance of both clustering methods is independent of the dispersion structure.


[1] Everitt, B., (1993). Cluster Analysis (3rd ed.), London, E. Arnold, NY, Halsted Press.
[2] Hartigan, J.A., and Wong, M.A., (1979). Algorithm AS 136: A K–Means Clustering Algorithm, Applied Statistics, vol. 28, n. 1, pp. 100–108.
[3] Nelsen, R.B., (2006). An Introduction to Copulas, New York, Springer.

Session 1a, Statistical Methodology: 11:30 — 11:50, Room 446
