A powerful density-based clustering library for discovering patterns in 2D data
ClusterDC
Overview
ClusterDC is a powerful density-based clustering library tailored for identifying clusters in two-dimensional embedding spaces. It is fast, robust, flexible, and data-driven. Initially created to address the clustering challenges faced by geochemists, it has evolved into a comprehensive toolkit for data analysis, visualization, and clustering that can be applied across multiple domains.
ClusterDC helps in analyzing two-dimensional embeddings of multivariate data, such as multielement assay datasets, to identify meaningful patterns and groups. At its core, ClusterDC leverages advanced Kernel Density Estimation (KDE) techniques to accurately model the underlying density distribution of your data. This density-based approach excels at identifying natural clusters of arbitrary shapes and varying densities, making it particularly effective for real-world data with complex structures.
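To make the density idea concrete, here is a minimal sketch of a fixed-bandwidth 2D Gaussian kernel density estimate in plain Python. It is not ClusterDC's implementation; the function name `gaussian_kde_2d`, the toy points, and the bandwidth value are assumptions chosen only for illustration.

```python
import math

def gaussian_kde_2d(points, query, bandwidth):
    """Density estimate at `query` from 2D `points` (isotropic Gaussian kernel)."""
    h2 = bandwidth ** 2
    # Normalisation of a 2D Gaussian, averaged over all sample points.
    norm = 1.0 / (2.0 * math.pi * h2 * len(points))
    total = 0.0
    for x, y in points:
        d2 = (query[0] - x) ** 2 + (query[1] - y) ** 2
        total += math.exp(-0.5 * d2 / h2)
    return norm * total

# Two tight groups of points (hypothetical toy data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (3.0, 3.0), (3.1, 3.0), (3.0, 3.1)]

inside = gaussian_kde_2d(points, (0.05, 0.05), bandwidth=0.3)
between = gaussian_kde_2d(points, (1.5, 1.5), bandwidth=0.3)
print(inside > between)  # density is higher inside a group than in the valley
```

Density-based clustering relies on this contrast: groups appear as high-density regions separated by low-density valleys.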
ClusterDC operates only on two-dimensional data, so high-dimensional datasets must first be reduced to 2D using dimension reduction techniques. Based on extensive testing across numerous projects, we strongly recommend PaCMAP for its robust non-linear dimensionality reduction capabilities that effectively preserve cluster structures. LocalMAP, an improved version of PaCMAP, is another excellent option we're currently evaluating. While UMAP and t-SNE are also compatible with ClusterDC, our experience shows that PaCMAP and LocalMAP typically produce superior clustering results by maintaining better separation between natural data groups.
The library's advanced KDE implementation offers multiple kernel functions and bandwidth selection methods, including adaptive local bandwidths that automatically adjust to variations in your data density. This provides superior performance compared to traditional clustering methods, especially for datasets with irregular distributions, outliers, or varying cluster densities. We recommend starting with the local bandwidth KDE to understand the fine-grained clusters in your data before lumping them into larger ones. On a first attempt, use the max_clusters option to see the maximum number of clusters that the local bandwidth KDE can generate.
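As a rough illustration of what an adaptive local bandwidth means, the sketch below gives each point its own bandwidth equal to the distance to its k-th nearest neighbour, so dense regions get narrow kernels and sparse regions get wide ones. This is a generic sample-point adaptive KDE written for this README, not ClusterDC's internal code; `knn_bandwidths`, `adaptive_kde_2d`, and the toy data are hypothetical.

```python
import math

def knn_bandwidths(points, k):
    """Per-point bandwidth: distance to the k-th nearest neighbour."""
    bandwidths = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        bandwidths.append(dists[k - 1])
    return bandwidths

def adaptive_kde_2d(points, bandwidths, query):
    """Gaussian KDE in which every sample point uses its own bandwidth."""
    total = 0.0
    for p, h in zip(points, bandwidths):
        d2 = math.dist(p, query) ** 2
        total += math.exp(-0.5 * d2 / h ** 2) / (2.0 * math.pi * h ** 2)
    return total / len(points)

# A tightly packed group and a widely spread group (toy data).
dense = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
sparse = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (6.0, 6.0)]
points = dense + sparse

h = knn_bandwidths(points, k=2)
density_dense = adaptive_kde_2d(points, h, (0.05, 0.05))
density_sparse = adaptive_kde_2d(points, h, (5.5, 5.5))
print(h)
print(density_dense > density_sparse)
```

With a single global bandwidth, one of the two groups would be either over-smoothed or fragmented; per-point bandwidths avoid that trade-off.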
While originally focused on geological applications, ClusterDC can be used in many fields beyond geosciences, such as environmental engineering, biological sciences, financial analysis, and other natural sciences. These areas often struggle with clustering due to natural variations and complex real-world phenomena that ClusterDC's density-based approach handles effectively.
For more information about how the core algorithm works, please refer to the publication:
Meyrieux, M., Hmoud, S., van Geffen, P., Kaeter, D. CLUSTERDC: A New Density-Based Clustering Algorithm and its Application in a Geological Material Characterization Workflow. Nat Resour Res (2024). https://doi.org/10.1007/s11053-024-10379-5
The publication presents case studies demonstrating the application of ClusterDC in geological contexts, showing how the algorithm supports the characterization of geological material types based on multi-element geochemistry.
Installation
ClusterDC can be installed using pip:
pip install clusterdc
If you're using a conda environment, you may want to install certain dependencies with conda first:
conda install -y numpy=1.24.3 matplotlib=3.7.1
pip install clusterdc
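To confirm that the installation succeeded, a small check such as the following can be run. It uses only the standard library; the helper name `installed_version` is an assumption made for this sketch.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

print(installed_version("clusterdc"))  # None means the package is not installed
```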
Key Components
ClusterDC now consists of three main classes that work together to provide a complete data analysis pipeline:
1. Data Class
A comprehensive data handling utility that simplifies loading, processing, analyzing, and visualizing data from various sources:
- Versatile data loading from local files, URLs, or built-in datasets
- Support for multiple formats including CSV, Excel, JSON, and HTML tables
- Data summarization with comprehensive statistics
- Rich visualization capabilities including:
  - Scatter plots with customizable markers and coloring
  - Density-colored scatter plots for pattern identification
  - Combined visualizations showing original and sampled data
  - Customizable plotting parameters (colors, sizes, transparency)
  - Plot saving in various formats with adjustable resolution
- Advanced sampling techniques including density-based sampling for large datasets
- Seamless integration with KDE and clustering components
2. KDE (Kernel Density Estimation) Class
An advanced implementation of kernel density estimation with multiple bandwidth selection methods:
- Multiple kernel functions (Gaussian, Epanechnikov, Laplacian)
- Automatic bandwidth selection using Bayesian optimization
- Adaptive local bandwidth estimation based on k-nearest neighbors
- Global bandwidth with anisotropic covariance estimation
- Rule-of-thumb methods (Scott's rule and Silverman's rule, which give the same bandwidth for 2D datasets)
- High-performance implementation with memory optimization
- Visualization tools for analyzing density distributions
- Performance benchmarking capabilities
- Saving and loading of KDE models to save time
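The equivalence of Scott's and Silverman's rules in 2D can be checked numerically. The sketch below assumes the common multivariate forms of both bandwidth factors (the ones used by SciPy's `gaussian_kde`); whether ClusterDC uses exactly these forms is an assumption.

```python
import math

def scott_factor(n, d):
    """Scott's rule: n ** (-1 / (d + 4))."""
    return n ** (-1.0 / (d + 4))

def silverman_factor(n, d):
    """Silverman's rule: (n * (d + 2) / 4) ** (-1 / (d + 4))."""
    return (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))

n = 1000
# In 2D, (d + 2) / 4 equals 1, so both factors reduce to n ** (-1 / 6).
same_in_2d = math.isclose(scott_factor(n, 2), silverman_factor(n, 2))
# In 1D, the two rules give different factors.
same_in_1d = math.isclose(scott_factor(n, 1), silverman_factor(n, 1))
print(same_in_2d, same_in_1d)
```

Since ClusterDC works only on 2D data, choosing 'scott' or 'silverman' should therefore lead to the same rule-of-thumb bandwidth under these definitions.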
3. ClusterDC Class
The core clustering algorithm that identifies natural clusters based on density patterns:
- Automatic selection of the optimal number of clusters using gap analysis
- Manual specification of desired number of clusters
- Rich visualization tools for cluster interpretation
- Comprehensive separability analysis
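The idea behind gap-based selection can be sketched with a generic largest-gap heuristic: rank a list of scores, look for the biggest drops between consecutive values, and keep everything above the chosen drop. This is only an illustration of the concept, not ClusterDC's actual procedure; the function `clusters_from_gaps`, its `gap_order` argument, and the scores are hypothetical.

```python
def clusters_from_gaps(scores, gap_order=1):
    """Count the items ranked above the `gap_order`-th largest gap in `scores`."""
    ranked = sorted(scores, reverse=True)
    # Gap i separates ranked[i] from ranked[i + 1].
    gaps = [(ranked[i] - ranked[i + 1], i) for i in range(len(ranked) - 1)]
    gaps.sort(key=lambda gap: gap[0], reverse=True)
    _, index = gaps[gap_order - 1]
    return index + 1

# Hypothetical separability-like scores for six candidate clusters.
scores = [0.95, 0.91, 0.88, 0.40, 0.35, 0.05]

print(clusters_from_gaps(scores, gap_order=1))  # largest gap: between 0.88 and 0.40
print(clusters_from_gaps(scores, gap_order=2))  # second largest: between 0.35 and 0.05
```

In this toy example the largest gap keeps three candidates and the second-largest gap keeps five, which shows how a lower-ranked gap yields a finer partition.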
Getting Started
Here's a simple example of using the ClusterDC library:
Data Loading and Processing
The Data class provides versatile data handling capabilities:
from clusterdc import Data
# Initialize data handler
data = Data()
# Load from various sources
df = data.read_file('training_data')  # built-in dataset, fetched from the ClusterDC GitHub repository
# df2 = data.read_file('https://example.com/data.csv')
# df3 = data.read_file('excel_file.xlsx', sheet_name='Sheet1')
# Get comprehensive summary statistics
summary = data.get_summary()
print(summary)
# Perform dimension reduction if needed
# (e.g., using PaCMAP, LocalMAP, UMAP, or t-SNE).
# This step depends on your specific needs.
# The training data is already in a 2D embedding space,
# so no dimension reduction is needed here.
# Create visualization
data.plot_scatter('PaCMAP_X', 'PaCMAP_Y', labels='category',
                  title='Data Visualization', save_path='scatter.png')
# Perform density-based sampling
estimated_densities = data.estimate_density(['PaCMAP_X', 'PaCMAP_Y'])
sampled_data = data.sample(n_samples=1000, method='density')
# Create comparison visualization
result_df = data.plot_density_samples('PaCMAP_X', 'PaCMAP_Y', n_samples=500,
                                      return_samples=True)
Once the data is imported, summarized, and visualized, the next step is to model its kernel density estimate.
Kernel Density Estimation
The KDE class offers powerful density estimation:
from clusterdc import KDE
# Initialize KDE with options
kde = KDE(
    data=df[['PaCMAP_X', 'PaCMAP_Y']],
    kernel_types=['gaussian', 'epanechnikov'],
    n_iter=50,
    k_limits=(1, 40)  # as percentage of data points
)
# Fit KDE with different methods
kde.fit(method='scott') # or 'silverman', 'local_bandwidth', 'global_bandwidth'
# Get density values
point_densities = kde.get_point_densities()
grid_densities = kde.get_grid_densities()
# Visualize results
kde.plot_results()
# Analyze optimization results
kde.plot_optimization_progress()
kde.print_optimization_report()
# Benchmark performance for larger datasets
results, model, predictions = KDE.benchmark_and_predict(
    data_input=df,
    target_size=100000,
    method='scott'
)
The KDE class also offers local and global bandwidths optimized with Bayesian optimization:
from clusterdc import KDE
# Initialize KDE with options
kde = KDE(
    data=df[['PaCMAP_X', 'PaCMAP_Y']],
    kernel_types=['gaussian', 'epanechnikov'],
    n_iter=50,
    k_limits=(1, 40)  # as percentage of data points
)
# Fit KDE with different methods
kde.fit(method='local_bandwidth')  # find the best local bandwidths using Bayesian optimization
# Get density values
point_densities = kde.get_point_densities()
grid_densities = kde.get_grid_densities()
# Visualize results
kde.plot_results()
# Analyze optimization results
kde.plot_optimization_progress()
kde.print_optimization_report()
ClusterDC
After the data density has been estimated with KDE, ClusterDC identifies clusters in the data based on that density estimate.
from clusterdc import ClusterDC
# Create ClusterDC object with 2D data
cluster_dc = ClusterDC(
    data=df,
    columns=['PaCMAP_X', 'PaCMAP_Y'],  # Specify which columns to use
    kde_method='scott',                # Use Scott's rule for bandwidth
    gap_order=1                        # Use first major gap for cluster selection
)
# Run clustering
assignments, density_info = cluster_dc.run_clustering()
# Visualize results
cluster_dc.plot_results(assignments, density_info)
# Find optimal number of clusters
optimal_clusters = cluster_dc.find_optimal_clusters()
print(f"Optimal number of clusters: {optimal_clusters}")
# Get cluster assignments
cluster_df = cluster_dc.get_cluster_assignments()
Advanced Clustering
The ClusterDC class provides sophisticated clustering capabilities:
from clusterdc import ClusterDC
# Create ClusterDC with options
cluster_dc = ClusterDC(
    data=df,
    columns=['PaCMAP_X', 'PaCMAP_Y'],  # Must match the column names in df
    levels=50,                         # Number of contour levels
    min_point=5,                       # Minimum points per cluster
    gap_order='max_clusters',          # Get maximum possible clusters
    kde_method='local_bandwidth'       # Use adaptive bandwidth
)
# Run clustering
assignments, density_info = cluster_dc.run_clustering()
# Analyze separability between clusters
cluster_dc.plot_separability(save_path='separability.png')
# Find optimal number of clusters automatically
optimal_clusters = cluster_dc.find_optimal_clusters(
    method='direct_gap',
    save_path='optimal_clusters.png'
)
# Save clustering model for later use
cluster_dc.save_clustering('my_clustering_model.cdc')
# Load previously saved model
loaded_model = ClusterDC.load_clustering('my_clustering_model.cdc')
Example Notebook
To see detailed examples of how to use the library, please refer to the provided Jupyter Notebook files in the examples directory. These notebooks demonstrate the usage of the functions with sample data and provide visualizations of the analysis and clustering results. They serve as practical guides to help you get started with ClusterDC and understand its various capabilities.
Upcoming Improvements
We're actively working on enhancing ClusterDC with:
- More advanced visualization capabilities
- Extended documentation and tutorials
Contact
If you have any questions or feedback, please feel free to contact:
- Samer Hmoud: geo.samer.hmoud@gmail.com
- Maximilien Meyrieux: maximilien.meyrieux@gmail.com
Attribution
If you use ClusterDC in your work, please include the following attribution:
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgements
We would like to acknowledge:
- The PaCMAP team for providing the PaCMAP dimension reduction algorithm, which is useful for reducing the dimensionality of the data before applying ClusterDC. For more information, refer to the PaCMAP & LocalMAP GitHub repository.
- The ClusterDV team for the development of the ClusterDV MATLAB code and the synthetic datasets provided with it. ClusterDC was developed as an extension of ClusterDV to overcome its limitations in processing large datasets. For more details on ClusterDV, see the ClusterDV GitHub repository.
Please refer to the following references for more information: