validclust package

Submodules

validclust.indices module

validclust.indices.cop(data, dist, labels)

Calculate the COP cluster validity index (CVI)

See Gurrutxaga et al. (2010) for details on how the index is calculated. [1]

Parameters:

data (array-like, shape = [n_samples, n_features]) – The data to cluster.
dist (array-like, shape = [n_samples, n_samples]) – A distance matrix containing the pairwise distances between observations.
labels (array [n_samples]) – The cluster labels for each observation.

Returns:	The COP index.
Return type:	float

References

[1]	Gurrutxaga, I., Albisua, I., Arbelaitz, O., Martín, J., Muguerza, J., Pérez, J., Perona, I. (2010). SEP/COP: An efficient method to find the best partition in hierarchical clustering based on a new cluster validity index. Pattern Recognition, 43(10), 3364-3373. DOI: 10.1016/j.patcog.2010.04.021.

Examples

>>> from sklearn.cluster import k_means
>>> from sklearn.metrics import pairwise_distances
>>> from sklearn.datasets import load_iris
>>> from validclust import cop
>>> data = load_iris()['data']
>>> _, labels, _ = k_means(data, n_clusters=3)
>>> dist = pairwise_distances(data)
>>> cop(data, dist, labels)
0.133689170400615

validclust.indices.dunn(dist, labels)

Calculate the Dunn cluster validity index (CVI)

See Dunn (1974) for details on how the index is calculated. [2]

Parameters:

dist (array-like, shape = [n_samples, n_samples]) – A distance matrix containing the pairwise distances between observations.
labels (array [n_samples]) – The cluster labels for each observation.

Returns:	The Dunn index.
Return type:	float

References

[2]	Dunn, J. C. (1974). Well-Separated Clusters and Optimal Fuzzy Partitions. Journal of Cybernetics, 4(1), 95-104. DOI: 10.1080/01969727408546059.

Examples

>>> from sklearn.cluster import k_means
>>> from sklearn.metrics import pairwise_distances
>>> from sklearn.datasets import load_iris
>>> from validclust import dunn
>>> data = load_iris()['data']
>>> _, labels, _ = k_means(data, n_clusters=3)
>>> dist = pairwise_distances(data)
>>> dunn(dist, labels)
0.09880739332808611
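
The reference above gives the full definition of the index. As a rough illustration only, the sketch below uses one common formulation of the Dunn index: the smallest distance between observations in different clusters divided by the largest distance between observations in the same cluster. The function name `dunn_sketch` and the toy data are assumptions made for this example; this is not validclust's implementation, and its results may differ in detail.

```python
from itertools import combinations


def dunn_sketch(dist, labels):
    """Smallest between-cluster distance over largest within-cluster distance.

    Hypothetical helper for illustration; `dist` is a square distance matrix
    (nested lists or an array) and `labels` holds one cluster label per row.
    """
    min_between = float("inf")
    max_within = 0.0
    # Visit every unordered pair of observations exactly once.
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            max_within = max(max_within, dist[i][j])
        else:
            min_between = min(min_between, dist[i][j])
    # Note: raises ZeroDivisionError if every cluster holds a single point.
    return min_between / max_within


# Four points on a line at 0, 1, 10, and 12, split into two clusters.
points = [0.0, 1.0, 10.0, 12.0]
dist = [[abs(a - b) for b in points] for a in points]
labels = [0, 0, 1, 1]
print(dunn_sketch(dist, labels))  # 9 / 2 = 4.5
```

Under this formulation, larger values indicate clusters that are compact and far apart, which is why the tight, well-separated toy clusters score well above 1.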

validclust.validclust module

class validclust.validclust.ValidClust(k, indices=['silhouette', 'calinski', 'davies', 'dunn'], methods=['hierarchical', 'kmeans'], linkage='ward', affinity='euclidean')

Bases: object

Validate clustering results

Parameters:

k (int or list of int) – The number of clusters to partition your data into.
indices (str or list of str, optional) – The cluster validity indices to calculate. Acceptable values include ‘silhouette’, ‘calinski’, ‘davies’, ‘dunn’, and ‘cop’. You can use a three-character abbreviation for these values as well. For example, you could specify indices=['cal', 'dav', 'dun'].
methods (str or list of str, optional) – The clustering algorithm(s) to use. Acceptable values are ‘hierarchical’ and ‘kmeans’.
linkage ({'ward', 'complete', 'average', 'single'}, optional) – Which linkage criterion to use for hierarchical clustering. See the sklearn docs for more details.
affinity ({'euclidean', 'l1', 'l2', 'manhattan', 'cosine'}, optional) – The metric used to compute the linkage for hierarchical clustering. Note that you must specify affinity='euclidean' when linkage='ward'. See the sklearn docs referenced above for more details.

score_df

A Pandas DataFrame with the computed cluster validity index values.

Type:	DataFrame

fit(data)

Fit the clustering algorithm(s) to the data and calculate the CVI scores

Parameters:	data (array-like, shape = [n_samples, n_features]) – The data to cluster.
Returns:	A `ValidClust` object whose `score_df` attribute contains the calculated CVI scores.
Return type:	self

fit_predict(data)

Fit the clustering algorithm(s) to the data, calculate the CVI scores, and return them

Parameters:	data (array-like, shape = [n_samples, n_features]) – The data to cluster.
Returns:	A Pandas DataFrame with the computed cluster validity index values (`self.score_df`).
Return type:	DataFrame

plot()

Plot normalized CVI scores in a heatmap

The CVI scores are normalized within each method/index pair (one row of the heatmap) using the max norm. Because the normalization is applied per row, you should compare the colors of the cells only within a given row. You should not, for instance, compare the color of the cells in the “kmeans, dunn” row with those in the “kmeans, silhouette” row.

Returns:	Nothing is returned. Instead, a plot is rendered using a graphics backend.
Return type:	None
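
To make the per-row normalization described for plot() concrete, here is a minimal, dependency-free sketch. It assumes that "max norm" means dividing every score in a row by the largest absolute value in that row; the function name `max_normalize` and the scores are made up for illustration and are not taken from validclust or from any real clustering run.

```python
def max_normalize(scores):
    """Divide each row of scores by that row's largest absolute value.

    Hypothetical helper for illustration; `scores` maps a (method, index)
    pair to a list of CVI values, one per candidate number of clusters.
    """
    normalized = {}
    for row_key, values in scores.items():
        peak = max(abs(v) for v in values)
        normalized[row_key] = [v / peak for v in values]
    return normalized


# Made-up scores for k = 2, 3, 4; the two indices live on very different scales.
scores = {
    ("kmeans", "silhouette"): [0.5, 0.25, 0.125],
    ("kmeans", "calinski"): [200.0, 400.0, 100.0],
}
normalized = max_normalize(scores)
print(normalized[("kmeans", "silhouette")])  # [1.0, 0.5, 0.25]
print(normalized[("kmeans", "calinski")])    # [0.5, 1.0, 0.25]
```

After normalization every row peaks at 1.0 regardless of its original scale, so a cell's value says how a given k compares with the other k values in the same row and nothing about how one index compares with another.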