Clustering of mixed data

This page provides access to material for benchmarking "ready-to-use" clustering methods for mixed data. It accompanies a paper recently published in Nature Scientific Reports, entitled

"Head-to-head comparison of clustering methods for heterogeneous data. A simulation-driven benchmark"

by

Gregoire Preud’homme (1, 2), Kevin Duarte (1), Kevin Dalleau (3), Claire Lacomblez (1), Emmanuel Bresso (3), Malika Smaïl-Tabbone (2, 3), Miguel Couceiro (3), Marie-Dominique Devignes (2, 3), Masatake Kobayashi (1, 2), Olivier Huttin (1, 2), João Pedro Ferreira (1, 2), Faiez Zannad (1, 2), Patrick Rossignol (1, 2), Nicolas Girerd (1, 2)

  1. Université de Lorraine, Centre d’Investigations Cliniques Plurithématique 1433, INSERM 1116, CHRU de Nancy, France.
  2. F-CRIN INI-CRCT Cardiovascular and Renal Clinical Trialists Network, France.
  3. Université de Lorraine, CNRS, Inria Nancy Grand-Est, LORIA, UMR 7503, Vandoeuvre-lès-Nancy, France.

Simulated datasets

For each tested design, 1000 simulated datasets were generated (see the paper for the method used). Each set of 1000 datasets can be downloaded as a single .rds file (formatted for use with R packages), named according to the design parameters.
Unless a design varies them, the default parameters are as follows:

  • Population size: 300;
  • Number of clusters: 6;
  • Number of continuous variables: 4;
  • Degree of relevance of continuous variables: mild;
  • Proportion of relevant continuous variables: 100% (for a total of 4 variables);
  • Total number of categorical variables: 4;
  • Degree of relevance of categorical variables: mild;
  • Proportion of relevant categorical variables: 100% (for a total of 4 variables).
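
As an illustration of how these defaults combine with a design that varies one parameter, the list above can be written as a small configuration object. This is only a sketch in Python: the class name DesignParameters, its field names, and the example value 600 are assumptions made for illustration. They are not taken from the paper or from the downloadable files.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DesignParameters:
    """Default simulation parameters, as listed on this page.

    Field names are hypothetical; only the default values come from the page.
    """
    population_size: int = 300
    n_clusters: int = 6
    n_continuous: int = 4
    continuous_relevance: str = "mild"
    prop_relevant_continuous: float = 1.0   # 100%
    n_categorical: int = 4
    categorical_relevance: str = "mild"
    prop_relevant_categorical: float = 1.0  # 100%

    def n_relevant_continuous(self) -> int:
        # Number of continuous variables that carry cluster information.
        return round(self.n_continuous * self.prop_relevant_continuous)

    def n_relevant_categorical(self) -> int:
        # Number of categorical variables that carry cluster information.
        return round(self.n_categorical * self.prop_relevant_categorical)


DEFAULTS = DesignParameters()

# A design changes one parameter and keeps every other default.
# The value 600 is purely illustrative, not a value used in the benchmark.
variant = replace(DEFAULTS, population_size=600)

print(DEFAULTS)
print(variant)
```

Using a frozen dataclass with replace keeps the defaults immutable, so each design is an explicit copy that differs from the defaults only in the parameter being varied.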

In the table below, each design is listed with the values taken by its varying parameter; each value links to the corresponding dataset for download (average size: 10 MB).

Design number | Varying parameter | Value 1 | Value 2 | Value 3
Design 1 | Population size
Design 2 | Number of clusters
Design 3 | Number of continuous variables with 2 categorical variables
Design 3bis | Number of continuous variables with 4 categorical variables
Design 3ter | Number of continuous variables with 8 categorical variables
Design 4 | Degree of relevance of continuous variables
Design 5 | Degree of relevance of categorical variables
Design 6 | Proportion of relevant categorical variables for a total of 10 categorical variables
Design 7 | Proportion of relevant categorical variables for a total of 10 categorical variables

Tested algorithms

Distance-based

Model-based