validclust package
Submodules
validclust.indices module
- validclust.indices.cop(data, dist, labels)
Calculate the COP CVI
See Gurrutxaga et al. (2010) for details on how the index is calculated. [1]
Parameters: - data (array-like, shape = [n_samples, n_features]) – The data to cluster.
- dist (array-like, shape = [n_samples, n_samples]) – A distance matrix containing the distances between each pair of observations.
- labels (array [n_samples]) – The cluster labels for each observation.
Returns: The COP index.
Return type: float
References
[1] Gurrutxaga, I., Albisua, I., Arbelaitz, O., Martín, J., Muguerza, J., Pérez, J., Perona, I. (2010). SEP/COP: An efficient method to find the best partition in hierarchical clustering based on a new cluster validity index. Pattern Recognition, 43(10), 3364-3373. DOI: 10.1016/j.patcog.2010.04.021.
Examples
>>> from sklearn.cluster import k_means
>>> from sklearn.metrics import pairwise_distances
>>> from sklearn.datasets import load_iris
>>> from validclust import cop
>>> data = load_iris()['data']
>>> _, labels, _ = k_means(data, n_clusters=3)
>>> dist = pairwise_distances(data)
>>> cop(data, dist, labels)
0.133689170400615
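To make the quantity concrete, here is a minimal NumPy sketch of the COP index as it is commonly stated from Gurrutxaga et al. (2010): for each cluster, the mean distance of its members to the centroid is divided by the smallest "farthest-member" distance seen from any outside point, and the ratios are averaged with cluster-size weights. This is an illustration, not validclust's source code. The name `cop_sketch` is hypothetical, and the sketch assumes Euclidean centroids and at least two clusters.

```python
import numpy as np


def cop_sketch(data, dist, labels):
    """Illustrative COP computation (hypothetical helper, not the library's code)."""
    data = np.asarray(data, dtype=float)
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)

    total = 0.0
    for c in np.unique(labels):
        inside = labels == c
        members = data[inside]

        # Cohesion: mean Euclidean distance from each member to the centroid.
        centroid = members.mean(axis=0)
        intra = np.linalg.norm(members - centroid, axis=1).mean()

        # Separation: for each outside point, find its farthest member,
        # then keep the smallest of those distances.
        inter = dist[np.ix_(~inside, inside)].max(axis=1).min()

        # Weight the ratio by the cluster size.
        total += inside.sum() * intra / inter

    return total / len(labels)


# Toy data: four points on a line, split into two clusters.
toy_data = np.array([[0.0], [1.0], [10.0], [12.0]])
toy_dist = np.abs(toy_data - toy_data.T)
toy_labels = np.array([0, 0, 1, 1])
print(cop_sketch(toy_data, toy_dist, toy_labels))  # approximately 0.0705
```

On the toy data the two clusters contribute 2 * 0.5 / 10 and 2 * 1 / 11, which are summed and divided by the four observations.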
- validclust.indices.dunn(dist, labels)
Calculate the Dunn CVI
See Dunn (1973) for details on how the index is calculated. [2]
Parameters: - dist (array-like, shape = [n_samples, n_samples]) – A distance matrix containing the distances between each observation.
- labels (array [n_samples]) – The cluster labels for each observation.
Returns: The Dunn index.
Return type: float
References
[2] Dunn, J. C. (1973). Well-Separated Clusters and Optimal Fuzzy Partitions. Journal of Cybernetics, 4(1), 95-104. DOI: 10.1080/01969727408546059.
Examples
>>> from sklearn.cluster import k_means
>>> from sklearn.metrics import pairwise_distances
>>> from sklearn.datasets import load_iris
>>> from validclust import dunn
>>> data = load_iris()['data']
>>> _, labels, _ = k_means(data, n_clusters=3)
>>> dist = pairwise_distances(data)
>>> dunn(dist, labels)
0.09880739332808611
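For intuition, the Dunn index in its classic form is the smallest distance between observations in different clusters divided by the largest distance between observations in the same cluster. The sketch below computes that ratio directly from a distance matrix with NumPy. It is an illustration under that textbook definition, not validclust's implementation. The name `dunn_sketch` is hypothetical, and the code assumes at least two clusters and at least one cluster with more than one member.

```python
import numpy as np


def dunn_sketch(dist, labels):
    """Illustrative Dunn index (hypothetical helper, not the library's code)."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)

    # Largest distance between two observations that share a cluster.
    max_diameter = max(
        dist[np.ix_(labels == c, labels == c)].max() for c in np.unique(labels)
    )

    # Smallest distance between two observations in different clusters.
    different_cluster = labels[:, None] != labels[None, :]
    min_separation = dist[different_cluster].min()

    return min_separation / max_diameter


# Toy data: four points on a line at 0, 1, 10 and 12.
points = np.array([0.0, 1.0, 10.0, 12.0])
toy_dist = np.abs(points[:, None] - points[None, :])
toy_labels = np.array([0, 0, 1, 1])
print(dunn_sketch(toy_dist, toy_labels))  # 9 / 2 = 4.5
```

Here the closest pair across clusters is (1, 10) and the widest cluster spans (10, 12), so compact, well-separated clusters yield a larger value.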
validclust.validclust module
- class validclust.validclust.ValidClust(k, indices=['silhouette', 'calinski', 'davies', 'dunn'], methods=['hierarchical', 'kmeans'], linkage='ward', affinity='euclidean')
Bases: object
Validate clustering results
Parameters: - k (int or list of int) – The number of clusters to partition your data into.
- indices (str or list of str, optional) – The cluster validity indices to calculate. Acceptable values include ‘silhouette’, ‘calinski’, ‘davies’, ‘dunn’, and ‘cop’. You can use a three-character abbreviation for these values as well. For example, you could specify indices=['cal', 'dav', 'dun'].
- methods (str or list of str, optional) – The clustering algorithm(s) to use. Acceptable values are ‘hierarchical’ and ‘kmeans’.
- linkage ({'ward', 'complete', 'average', 'single'}, optional) – Which linkage criterion to use for hierarchical clustering. See the sklearn docs for more details.
- affinity ({'euclidean', 'l1', 'l2', 'manhattan', 'cosine'}, optional) – The metric used to compute the linkage for hierarchical clustering. Note that you must specify affinity='euclidean' when linkage='ward'. See the sklearn docs referenced above for more details.
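The three-character abbreviations accepted by the indices parameter can be pictured as a prefix lookup against the five full names. The sketch below is one way such a lookup could work. The helper `expand_index_names` and the constant `KNOWN_INDICES` are hypothetical names introduced for illustration, and the real class may resolve abbreviations differently.

```python
# Full index names as listed in the parameter description above.
KNOWN_INDICES = ['silhouette', 'calinski', 'davies', 'dunn', 'cop']


def expand_index_names(indices):
    """Hypothetical helper: map three-character abbreviations to full names."""
    # Accept a single string as well as a list of strings.
    if isinstance(indices, str):
        indices = [indices]

    by_prefix = {name[:3]: name for name in KNOWN_INDICES}
    expanded = []
    for value in indices:
        if value in KNOWN_INDICES:
            expanded.append(value)
        elif value in by_prefix:
            expanded.append(by_prefix[value])
        else:
            raise ValueError(f"unknown cluster validity index: {value!r}")
    return expanded


print(expand_index_names(['cal', 'dav', 'dun']))  # ['calinski', 'davies', 'dunn']
print(expand_index_names('silhouette'))  # ['silhouette']
```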
- score_df
A Pandas DataFrame with the computed cluster validity index values.
Type: DataFrame
- fit(data)
Fit the clustering algorithm(s) to the data and calculate the CVI scores
Parameters: data (array-like, shape = [n_samples, n_features]) – The data to cluster.
Returns: A ValidClust object whose score_df attribute contains the calculated CVI scores.
Return type: self
- fit_predict(data)
Fit the clustering algorithm(s) to the data and calculate the CVI scores
Parameters: data (array-like, shape = [n_samples, n_features]) – The data to cluster.
Returns: A Pandas DataFrame with the computed cluster validity index values (self.score_df).
Return type: DataFrame
- plot()
Plot normalized CVI scores in a heatmap
The CVI scores are normalized along each method/index pair using the max norm. Note that, because the scores are normalized along each method/index pair, you should compare the colors of the cells in the heatmap only within a given row. You should not, for instance, compare the color of the cells in the “kmeans, dunn” row with those in the “kmeans, silhouette” row.
Returns: Nothing is returned. Instead, a plot is rendered using a graphics backend.
Return type: None
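The row-wise max-norm scaling described for the heatmap can be sketched with NumPy as follows. The score values and row meanings are made up for illustration, `max_normalize_rows` is a hypothetical name, and dividing by the largest absolute value in each row is an assumption about what "max norm" means here.

```python
import numpy as np

# Hypothetical scores: one row per method/index pair, one column per value of k.
scores = np.array([
    [0.2, 0.4, 0.8],       # e.g. a "kmeans, silhouette" row
    [50.0, 100.0, 25.0],   # e.g. a "kmeans, calinski" row
])


def max_normalize_rows(scores):
    """Divide each row by its largest absolute value."""
    scores = np.asarray(scores, dtype=float)
    row_max = np.abs(scores).max(axis=1, keepdims=True)
    return scores / row_max


normalized = max_normalize_rows(scores)
print(normalized)
```

After scaling, every row peaks at 1 regardless of the index's native scale, which is why cell colors are comparable only within a row and not between rows.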