# Abstracts

### Comparison between K–Means and Hierarchical Clustering of Dependent and Independent Data Generated from Multivariate Gaussian Copula Function

**Francesca Marta Lilja Di Lascio**

Department of Statistics, University of Bologna, Italy

**Warren J. Ewens**

Department of Biology, University of Pennsylvania, USA

We use the multivariate Gaussian copula function ([3]) to evaluate the ability of the K–means algorithm ([2]) and
the hierarchical method ([1]) to identify clusters corresponding to the marginal probability functions that the
copula links, through its dependence structure, into their joint distribution function.

For both K–means and hierarchical clustering, we run simulations that vary (1) the sample size (small and
large), (2) the value of the dependence parameter of the copula function, (3) the parameters of the margins
(well-separated, overlapping and distinct margins) and, finally, (4) the type of dispersion matrix (unstructured
and exchangeable).
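
As an illustration of the simulation design above, the following minimal sketch draws dependent data from a Gaussian copula with an exchangeable dispersion matrix. It is not the authors' code: the function names, the choice of Gaussian margins, the means 0, 3 and 6, and the dependence parameter 0.5 are all assumptions made for the example.

```python
import numpy as np
from statistics import NormalDist


def exchangeable_corr(k, rho):
    """k x k dispersion matrix with every off-diagonal entry equal to rho."""
    return (1.0 - rho) * np.eye(k) + rho * np.ones((k, k))


def sample_gaussian_copula(n, corr, margins, rng):
    """Draw n rows whose dependence follows a Gaussian copula with matrix corr.

    margins is a list of statistics.NormalDist objects, one per column
    (an assumption of this sketch; any invertible CDF would do).
    Returns the copula sample u (uniform margins) and the data x.
    """
    k = corr.shape[0]
    chol = np.linalg.cholesky(corr)
    z = rng.standard_normal((n, k)) @ chol.T        # correlated standard normals
    u = np.vectorize(NormalDist().cdf)(z)           # probability integral transform
    u = np.clip(u, 1e-12, 1.0 - 1e-12)              # keep inv_cdf inside (0, 1)
    x = np.empty_like(z)
    for j, margin in enumerate(margins):
        x[:, j] = np.vectorize(margin.inv_cdf)(u[:, j])  # impose the j-th margin
    return u, x


# Hypothetical setting: three margins with separated means, rho = 0.5.
rng = np.random.default_rng(0)
corr = exchangeable_corr(3, 0.5)
margins = [NormalDist(0.0, 1.0), NormalDist(3.0, 1.0), NormalDist(6.0, 1.0)]
u, x = sample_gaussian_copula(2000, corr, margins, rng)
```

An unstructured dispersion matrix would replace `exchangeable_corr` with any valid correlation matrix whose off-diagonal entries are allowed to differ.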

We evaluate the performance of the two clustering methods under study by means of (1) the difference
between the ‘real’ value of the dependence parameter and its value after clustering, (2) the percentage of
iterations in which the number of observations in each cluster differs from the ‘real’ one, and (3) the
ability to identify the exact probability model of the margins.
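
Criterion (2) can be sketched as follows. This is one possible reading, not the authors' implementation: the function names are hypothetical, and the sketch assumes that cluster sizes are compared after sorting, so that arbitrary label names do not count as a mismatch.

```python
import numpy as np


def cluster_sizes(labels):
    """Sorted cluster sizes, so the comparison ignores label names."""
    _, counts = np.unique(labels, return_counts=True)
    return sorted(counts.tolist())


def size_mismatch_percentage(true_labels, predicted_runs):
    """Percentage of simulation runs whose cluster sizes differ from the true ones."""
    target = cluster_sizes(true_labels)
    mismatches = sum(cluster_sizes(run) != target for run in predicted_runs)
    return 100.0 * mismatches / len(predicted_runs)


# Toy example: four hypothetical runs against two true clusters of size 3.
true_labels = [0, 0, 0, 1, 1, 1]
predicted_runs = [
    [1, 1, 1, 0, 0, 0],  # same sizes, labels swapped
    [0, 0, 1, 1, 1, 1],  # sizes 2 and 4
    [0, 0, 0, 1, 1, 1],  # identical
    [0, 0, 0, 0, 0, 1],  # sizes 5 and 1
]
mismatch = size_mismatch_percentage(true_labels, predicted_runs)
```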

We find that the hierarchical method works well if the margins are clearly distinct, irrespective of cluster
size, while K–means works well only if the sample is small. The performance of both clustering methods does
not depend on the dispersion structure.

References:

- [1] Everitt, B. (1993). Cluster Analysis (3rd ed.). London: E. Arnold; New York: Halsted Press.
- [2] Hartigan, J.A., and Wong, M.A. (1979). Algorithm AS 136: A K–Means Clustering Algorithm. Applied
Statistics, vol. 28, n. 1, pp. 100–108.
- [3] Nelsen, R.B. (2006). An Introduction to Copulas. New York: Springer.

**Session 1a**, Statistical Methodology: 11:30–11:50, Room 446
