validclust package

Submodules

validclust.indices module

validclust.indices.cop(data, dist, labels)

Calculate the COP cluster validity index (CVI)

See Gurrutxaga et al. (2010) for details on how the index is calculated. [1]

Parameters:
  • data (array-like, shape = [n_samples, n_features]) – The data that was clustered.
  • dist (array-like, shape = [n_samples, n_samples]) – A distance matrix containing the pairwise distances between observations.
  • labels (array-like, shape = [n_samples]) – The cluster label of each observation.
Returns:

The COP index.

Return type:

float

References

[1] Gurrutxaga, I., Albisua, I., Arbelaitz, O., Martín, J., Muguerza, J., Pérez, J., Perona, I. (2010). SEP/COP: An efficient method to find the best partition in hierarchical clustering based on a new cluster validity index. Pattern Recognition, 43(10), 3364-3373. DOI: 10.1016/j.patcog.2010.04.021.

Examples

>>> from sklearn.cluster import k_means
>>> from sklearn.metrics import pairwise_distances
>>> from sklearn.datasets import load_iris
>>> from validclust import cop
>>> data = load_iris()['data']
>>> _, labels, _ = k_means(data, n_clusters=3)
>>> dist = pairwise_distances(data)
>>> cop(data, dist, labels)
0.133689170400615
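
The signature takes both data and dist because the index, as it is commonly summarized in the literature, combines a centroid-based cohesion term with a distance-matrix-based separation term. The block below is a minimal pure-Python sketch of that summary, not the library's implementation: the name cop_index is hypothetical, Euclidean distance to the centroid is assumed, and at least two clusters are assumed so that every cluster has outside points.

```python
from math import dist as euclidean  # Python 3.8+


def cop_index(data, dist, labels):
    """Size-weighted mean of cohesion/separation ratios (illustrative sketch)."""
    n = len(labels)
    total = 0.0
    for c in set(labels):
        members = [i for i in range(n) if labels[i] == c]
        others = [i for i in range(n) if labels[i] != c]
        # Centroid of the cluster, computed feature by feature.
        centroid = [sum(col) / len(members)
                    for col in zip(*(data[i] for i in members))]
        # Cohesion: mean distance from the cluster's members to its centroid.
        intra = sum(euclidean(data[i], centroid) for i in members) / len(members)
        # Separation: for each outside point take the distance to its farthest
        # member of the cluster, then keep the smallest of those distances.
        inter = min(max(dist[i][j] for j in members) for i in others)
        total += len(members) * intra / inter
    return total / n


# Four one-dimensional points forming two well-separated clusters.
data = [[0.0], [2.0], [10.0], [12.0]]
dist = [[abs(a[0] - b[0]) for b in data] for a in data]
labels = [0, 0, 1, 1]
print(cop_index(data, dist, labels))  # 0.1
```

In this toy case each cluster has a mean centroid distance of 1 and a separation of 10, so both ratios are 0.1 and so is their weighted mean.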
validclust.indices.dunn(dist, labels)

Calculate the Dunn cluster validity index (CVI)

See Dunn (1973) for details on how the index is calculated. [2]

Parameters:
  • dist (array-like, shape = [n_samples, n_samples]) – A distance matrix containing the pairwise distances between observations.
  • labels (array-like, shape = [n_samples]) – The cluster label of each observation.
Returns:

The Dunn index.

Return type:

float

References

[2] Dunn, J. C. (1973). Well-Separated Clusters and Optimal Fuzzy Partitions. Journal of Cybernetics, 4(1), 95-104. DOI: 10.1080/01969727408546059.

Examples

>>> from sklearn.cluster import k_means
>>> from sklearn.metrics import pairwise_distances
>>> from sklearn.datasets import load_iris
>>> from validclust import dunn
>>> data = load_iris()['data']
>>> _, labels, _ = k_means(data, n_clusters=3)
>>> dist = pairwise_distances(data)
>>> dunn(dist, labels)
0.09880739332808611
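
For intuition about what the returned number measures, the following pure-Python sketch computes the classic form of the Dunn index: the smallest distance between points in different clusters divided by the largest distance between points in the same cluster. It is an illustration under that assumption, not the library's code; the name dunn_index is hypothetical, and the sketch assumes at least one cluster contains two distinct points so that the denominator is nonzero.

```python
from itertools import combinations


def dunn_index(dist, labels):
    """Smallest between-cluster distance over largest cluster diameter."""
    min_separation = float("inf")  # closest pair of points in different clusters
    max_diameter = 0.0             # farthest pair of points in the same cluster
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            max_diameter = max(max_diameter, dist[i][j])
        else:
            min_separation = min(min_separation, dist[i][j])
    return min_separation / max_diameter


# Four points on a line at 0, 1, 5 and 6, split into two clusters.
points = [0.0, 1.0, 5.0, 6.0]
dist = [[abs(a - b) for b in points] for a in points]
labels = [0, 0, 1, 1]
print(dunn_index(dist, labels))  # 4.0
```

Larger values indicate compact, well-separated clusters: here the clusters are 4 units apart and each is 1 unit wide.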

validclust.validclust module

class validclust.validclust.ValidClust(k, indices=['silhouette', 'calinski', 'davies', 'dunn'], methods=['hierarchical', 'kmeans'], linkage='ward', affinity='euclidean')

Bases: object

Validate clustering results

Parameters:
  • k (int or list of int) – The number of clusters to partition your data into.
  • indices (str or list of str, optional) – The cluster validity indices to calculate. Acceptable values include ‘silhouette’, ‘calinski’, ‘davies’, ‘dunn’, and ‘cop’. You can use a three-character abbreviation for these values as well. For example, you could specify indices=['cal', 'dav', 'dun'].
  • methods (str or list of str, optional) – The clustering algorithm(s) to use. Acceptable values are ‘hierarchical’ and ‘kmeans’.
  • linkage ({'ward', 'complete', 'average', 'single'}, optional) – Which linkage criterion to use for hierarchical clustering. See the sklearn docs for more details.
  • affinity ({'euclidean', 'l1', 'l2', 'manhattan', 'cosine'}, optional) – The metric used to compute the linkage for hierarchical clustering. Note that you must specify affinity='euclidean' when linkage='ward'. See the sklearn docs for more details.
score_df

A Pandas DataFrame with the computed cluster validity index values.

Type: DataFrame
fit(data)

Fit the clustering algorithm(s) to the data and calculate the CVI scores

Parameters: data (array-like, shape = [n_samples, n_features]) – The data to cluster.
Returns: The fitted ValidClust object itself (self), whose score_df attribute contains the calculated CVI scores.
Return type: ValidClust
fit_predict(data)

Fit the clustering algorithm(s) to the data, calculate the CVI scores, and return them

Parameters: data (array-like, shape = [n_samples, n_features]) – The data to cluster.
Returns: A Pandas DataFrame with the computed cluster validity index values (self.score_df).
Return type: DataFrame
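
A short usage sketch for the class, using only the constructor parameters and the fit_predict method documented above. The helper name score_blobs, the synthetic make_blobs data, and the chosen k values are assumptions for illustration; the imports sit inside the function so that defining it does not require the packages to be installed.

```python
def score_blobs(k_values=(2, 3, 4, 5)):
    """Cluster synthetic blobs with ValidClust and return the CVI score table."""
    # Imported lazily so that defining this helper does not require the packages.
    from sklearn.datasets import make_blobs
    from validclust import ValidClust

    # Synthetic data, chosen only for illustration.
    data, _ = make_blobs(n_samples=200, centers=4, n_features=5, random_state=0)
    vclust = ValidClust(
        k=list(k_values),
        indices=["silhouette", "dunn"],
        methods=["kmeans"],
    )
    # fit_predict returns the same DataFrame that is stored in vclust.score_df.
    return vclust.fit_predict(data)
```

Calling score_blobs() should return the score DataFrame for the requested cluster counts. Use fit(data) instead of fit_predict(data) when you want the fitted object back, for example to call plot() on it.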
plot()

Plot normalized CVI scores in a heatmap

The CVI scores are normalized along each method/index pair using the max norm. Note that, because the scores are normalized along each method/index pair, you should compare the colors of the cells in the heatmap only within a given row. You should not, for instance, compare the color of the cells in the “kmeans, dunn” row with those in the “kmeans, silhouette” row.

Returns: Nothing is returned. Instead, a plot is rendered using a graphics backend.
Return type: None
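
To see why colors are comparable only within a row, the sketch below applies a max-norm scaling to each method/index pair separately. It assumes that "max norm" means dividing each row by its largest absolute value; the function name max_normalize_rows is hypothetical and the scores are invented for illustration, so this is not the library's plotting code.

```python
def max_normalize_rows(rows):
    """Divide each row of scores by that row's largest absolute value."""
    normalized = {}
    for name, scores in rows.items():
        peak = max(abs(s) for s in scores)
        # Guard against an all-zero row to avoid dividing by zero.
        normalized[name] = [s / peak if peak else 0.0 for s in scores]
    return normalized


# Invented scores for k = 2, 3, 4 (illustration only).
scores = {
    ("kmeans", "silhouette"): [0.5, 0.25, 0.125],
    ("kmeans", "dunn"): [0.02, 0.04, 0.01],
}
normalized = max_normalize_rows(scores)
print(normalized[("kmeans", "silhouette")])  # [1.0, 0.5, 0.25]
print(normalized[("kmeans", "dunn")])        # [0.5, 1.0, 0.25]
```

After scaling, every row peaks at 1.0 regardless of the index's original range, so a cell's color says how a score compares with the other values of k in the same row and nothing about how it compares with a different index.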