
A powerful density-based clustering library for discovering patterns in 2D data


ClusterDC

Overview

ClusterDC is a powerful density-based clustering library tailored for identifying clusters in two-dimensional embedding spaces. It is fast, robust, flexible, and data-driven. Initially created to address the clustering challenges faced by geochemists, it has evolved into a comprehensive toolkit for data analysis, visualization, and clustering that can be applied across multiple domains.

ClusterDC helps in analyzing two-dimensional embeddings of multivariate data, such as multielement assay datasets, to identify meaningful patterns and groups. At its core, ClusterDC leverages advanced Kernel Density Estimation (KDE) techniques to accurately model the underlying density distribution of your data. This density-based approach excels at identifying natural clusters of arbitrary shapes and varying densities, making it particularly effective for real-world data with complex structures.
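To make the density-based idea concrete, here is a minimal illustration (using SciPy's `gaussian_kde`, not ClusterDC's own implementation) of how a KDE models the density of a 2D embedding: density is high inside natural clusters and low in the gaps between them.

```python
# Minimal KDE sketch on synthetic 2D "embedding" data (SciPy, not ClusterDC).
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Two synthetic clusters with different spreads.
cluster_a = rng.normal(loc=[0, 0], scale=0.5, size=(200, 2))
cluster_b = rng.normal(loc=[4, 4], scale=1.0, size=(200, 2))
points = np.vstack([cluster_a, cluster_b])

# gaussian_kde expects shape (n_dims, n_points).
kde = gaussian_kde(points.T)

# Density is higher at a cluster centre than in the gap between clusters.
density_centre = kde([[0.0], [0.0]])[0]
density_gap = kde([[2.0], [2.0]])[0]
assert density_centre > density_gap
```

A density-based clusterer exploits exactly this contrast: cluster cores sit at density peaks, and cluster boundaries follow the low-density valleys between them.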

ClusterDC operates only on two-dimensional data, so high-dimensional datasets must first be reduced to 2D using dimension reduction techniques. Based on extensive testing across numerous projects, we strongly recommend PaCMAP for its robust non-linear dimensionality reduction capabilities that effectively preserve cluster structures. LocalMAP, an improved version of PaCMAP, is another excellent option we're currently evaluating. While UMAP and t-SNE are also compatible with ClusterDC, our experience shows that PaCMAP and LocalMAP typically produce superior clustering results by maintaining better separation between natural data groups.

The library's advanced KDE implementation offers multiple kernel functions and bandwidth selection methods, including adaptive local bandwidths that automatically adjust to variations in your data density. This provides superior performance compared to traditional clustering methods, especially for datasets with irregular distributions, outliers, or varying cluster densities. We recommend starting with local-bandwidth KDE to understand the fine-grained clusters in your data before merging them into larger groups. On a first pass, use the max_clusters option to see the maximum number of clusters that local-bandwidth KDE can resolve.
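The adaptive local-bandwidth idea can be sketched as follows (this is an illustration of the general technique, not ClusterDC's internal code): each point's bandwidth is derived from the distance to its k-th nearest neighbour, so dense regions get narrow kernels and sparse regions get wide ones.

```python
# Sketch of an adaptive local-bandwidth Gaussian KDE (illustrative only).
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
dense = rng.normal([0, 0], 0.3, size=(300, 2))
sparse = rng.normal([5, 5], 1.5, size=(50, 2))
points = np.vstack([dense, sparse])

k = 10
tree = cKDTree(points)
# Distance to the k-th nearest neighbour (column 0 is the point itself).
knn_dist = tree.query(points, k=k + 1)[0][:, -1]
bandwidths = knn_dist  # one local bandwidth per sample point

def adaptive_kde(query, samples, h):
    """Average of 2D Gaussian kernels, one per sample, each with its own bandwidth."""
    diff = query[None, :] - samples            # (n, 2) offsets
    sq = np.sum(diff**2, axis=1)               # squared distances
    norm = 1.0 / (2 * np.pi * h**2)            # 2D Gaussian normalisation
    return np.mean(norm * np.exp(-sq / (2 * h**2)))

# Bandwidths adapt: small in the dense cluster, large in the sparse one.
assert bandwidths[:300].mean() < bandwidths[300:].mean()
density_dense = adaptive_kde(np.array([0.0, 0.0]), points, bandwidths)
density_empty = adaptive_kde(np.array([2.5, 2.5]), points, bandwidths)
assert density_dense > density_empty
```

Because the kernel width follows local point spacing, the estimate stays sharp inside tight clusters without washing out sparse but genuine ones.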

While originally focused on geological applications, ClusterDC can be used in many fields beyond geosciences, such as environmental engineering, biological sciences, financial analysis, and other natural sciences. These areas often struggle with clustering due to natural variations and complex real-world phenomena that ClusterDC's density-based approach handles effectively.

For more information about how the core algorithm works, please refer to the publication:
Meyrieux, M., Hmoud, S., van Geffen, P., Kaeter, D. CLUSTERDC: A New Density-Based Clustering Algorithm and its Application in a Geological Material Characterization Workflow. Nat Resour Res (2024). https://doi.org/10.1007/s11053-024-10379-5

The publication presents case studies demonstrating the application of ClusterDC in geological contexts, showing how the algorithm supports the characterization of geological material types based on multi-element geochemistry.

3D plot of the Kernel Density Estimation

Contour plot of the Kernel Density Estimation

Installation

ClusterDC can be installed using pip:

pip install clusterdc

If you're using a conda environment, you may want to install certain dependencies with conda first:

conda install -y numpy=1.24.3 matplotlib=3.7.1
pip install clusterdc

Key Components

ClusterDC now consists of three main classes that work together to provide a complete data analysis pipeline:

1. Data Class

A comprehensive data handling utility that simplifies loading, processing, analyzing, and visualizing data from various sources:

  • Versatile data loading from local files, URLs, or built-in datasets
  • Support for multiple formats including CSV, Excel, JSON, and HTML tables
  • Data summarization with comprehensive statistics
  • Rich visualization capabilities including:
    • Scatter plots with customizable markers and coloring
    • Density-colored scatter plots for pattern identification
    • Combined visualizations showing original and sampled data
    • Customizable plotting parameters (colors, sizes, transparency)
    • Plot saving in various formats with adjustable resolution
  • Advanced sampling techniques including density-based sampling for large datasets
  • Seamless integration with KDE and clustering components
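One plausible density-based sampling scheme works as follows (ClusterDC's own method may differ in detail): estimate each point's density with a KDE, then draw a weighted sample so that high-density core points are more likely to be kept.

```python
# Hedged sketch of density-weighted downsampling (not ClusterDC's exact method).
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(7)
points = np.vstack([
    rng.normal([0, 0], 0.5, size=(800, 2)),
    rng.normal([4, 4], 0.5, size=(800, 2)),
])

# Per-point density estimates, normalised to sampling probabilities.
densities = gaussian_kde(points.T)(points.T)
weights = densities / densities.sum()

idx = rng.choice(len(points), size=200, replace=False, p=weights)
sampled = points[idx]
print(sampled.shape)  # (200, 2)
```

Weighted sampling like this keeps the cluster cores well represented while cutting the total point count, which is what makes KDE and clustering tractable on large datasets.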

2. KDE (Kernel Density Estimation) Class

An advanced implementation of kernel density estimation with multiple bandwidth selection methods:

  • Multiple kernel functions (Gaussian, Epanechnikov, Laplacian)
  • Automatic bandwidth selection using Bayesian optimization
  • Adaptive local bandwidth estimation based on k-nearest neighbors
  • Global bandwidth with anisotropic covariance estimation
  • Rule-of-thumb methods (Scott's rule and Silverman's rule, which are equivalent for 2D data)
  • High-performance implementation with memory optimization
  • Visualization tools for analyzing density distributions
  • Performance benchmarking capabilities
  • Saving and loading of fitted KDE models to avoid recomputation
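The Scott/Silverman equivalence noted above is easy to verify: Scott's factor is n**(-1/(d+4)) and Silverman's is (n*(d+2)/4)**(-1/(d+4)), so with d = 2 the (d+2)/4 term equals 1 and the two coincide. SciPy's `gaussian_kde` exposes both factors directly:

```python
# Verifying that Scott's and Silverman's bandwidth factors match for 2D data.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
data_2d = rng.normal(size=(2, 500))  # gaussian_kde wants (n_dims, n_points)

kde = gaussian_kde(data_2d)
assert np.isclose(kde.scotts_factor(), kde.silverman_factor())
```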

3. ClusterDC Class

The core clustering algorithm that identifies natural clusters based on density patterns:

  • Automatic selection of the optimal number of clusters using gap analysis
  • Manual specification of desired number of clusters
  • Rich visualization tools for cluster interpretation
  • Comprehensive separability analysis
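The gap idea behind automatic cluster-count selection can be sketched like this (ClusterDC's exact criterion may differ; the scores below are hypothetical): rank candidate density peaks by a saliency score and cut at the largest gap in the sorted scores, keeping only the peaks above it.

```python
# Hedged sketch of gap analysis for choosing the number of clusters.
import numpy as np

# Hypothetical saliency scores for 8 candidate density peaks, sorted descending.
saliency = np.array([0.95, 0.90, 0.88, 0.30, 0.28, 0.25, 0.22, 0.20])

gaps = saliency[:-1] - saliency[1:]      # drop between consecutive peaks
first_gap = int(np.argmax(gaps)) + 1     # peaks kept above the largest gap
print(first_gap)  # 3 -> three well-separated clusters
```

The `gap_order` parameter shown in the examples below selects which gap to cut at: the first (largest) gap yields the coarsest grouping, while later gaps split the data into progressively more clusters.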

Getting Started

Here's a simple example of using the ClusterDC library:

Data Loading and Processing

The Data class provides versatile data handling capabilities:

from clusterdc import Data

# Initialize data handler
data = Data()

# Load from various sources
df1 = data.read_file('training_data') # built-in dataset fetched from the ClusterDC GitHub repository
# df2 = data.read_file('https://example.com/data.csv')
# df3 = data.read_file('excel_file.xlsx', sheet_name='Sheet1')

# Get comprehensive summary statistics
summary = data.get_summary()
print(summary)

# Perform dimension reduction if needed
# (e.g., using PaCMAP, LocalMAP, UMAP, t-SNE, etc.)
# This step depends on your specific needs.
# For this training data, no dimension reduction is needed:
# it is already in a 2D embedding space

# Create visualization
data.plot_scatter('PaCMAP_X', 'PaCMAP_Y', labels='category', 
                  title='Data Visualization', save_path='scatter.png')

# Perform density-based sampling
estimated_densities = data.estimate_density(['PaCMAP_X', 'PaCMAP_Y'])
sampled_data = data.sample(n_samples=1000, method='density')

# Create comparison visualization
result_df = data.plot_density_samples('PaCMAP_X', 'PaCMAP_Y', n_samples=500, 
                                      return_samples=True)

Once the data has been imported, summarized, and visualized, the next step is to model the kernel density estimate.

Kernel Density Estimation

The KDE class offers powerful density estimation:

from clusterdc import KDE

# Initialize KDE with options
kde = KDE(
    data=df[['PaCMAP_X', 'PaCMAP_Y']],
    kernel_types=['gaussian', 'epanechnikov'],
    n_iter=50,
    k_limits=(1, 40)  # as percentage of data points
)

# Fit KDE with different methods
kde.fit(method='scott')  # or 'silverman', 'local_bandwidth', 'global_bandwidth'

# Get density values
point_densities = kde.get_point_densities()
grid_densities = kde.get_grid_densities()

# Visualize results
kde.plot_results()

# Analyze optimization results
kde.plot_optimization_progress()
kde.print_optimization_report()

# Benchmark performance for larger datasets
results, model, predictions = KDE.benchmark_and_predict(
    data_input=df,
    target_size=100000,
    method='scott'
)

The KDE class also offers locally and globally optimized bandwidths using Bayesian optimization:

from clusterdc import KDE

# Initialize KDE with options
kde = KDE(
    data=df[['PaCMAP_X', 'PaCMAP_Y']],
    kernel_types=['gaussian', 'epanechnikov'],
    n_iter=50,
    k_limits=(1, 40)  # as percentage of data points
)

# Fit KDE with different methods
kde.fit(method='local_bandwidth') # find the best localized bandwidths using Bayesian optimization

# Get density values
point_densities = kde.get_point_densities()
grid_densities = kde.get_grid_densities()

# Visualize results
kde.plot_results()

# Analyze optimization results
kde.plot_optimization_progress()
kde.print_optimization_report()

ClusterDC

After estimating the data density with KDE, ClusterDC identifies clusters in the data based on the estimated density.

from clusterdc import ClusterDC

# Create ClusterDC object with 2D data
cluster_dc = ClusterDC(
    data=df,
    columns=['PaCMAP_X', 'PaCMAP_Y'],  # Specify which columns to use
    kde_method='scott',  # Use Scott's rule for bandwidth
    gap_order=1  # Use first major gap for cluster selection
)

# Run clustering
assignments, density_info = cluster_dc.run_clustering()

# Visualize results
cluster_dc.plot_results(assignments, density_info)

# Find optimal number of clusters
optimal_clusters = cluster_dc.find_optimal_clusters()
print(f"Optimal number of clusters: {optimal_clusters}")

# Get cluster assignments
cluster_df = cluster_dc.get_cluster_assignments()

Advanced Clustering

The ClusterDC class provides sophisticated clustering capabilities:

from clusterdc import ClusterDC

# Create ClusterDC with options
cluster_dc = ClusterDC(
    data=df,
    columns=['x', 'y'],
    levels=50,  # Number of contour levels
    min_point=5,  # Minimum points per cluster
    gap_order='max_clusters',  # Get maximum possible clusters
    kde_method='local_bandwidth'  # Use adaptive bandwidth
)

# Run clustering
assignments, density_info = cluster_dc.run_clustering()

# Analyze separability between clusters
cluster_dc.plot_separability(save_path='separability.png')

# Find optimal number of clusters automatically
optimal_clusters = cluster_dc.find_optimal_clusters(
    method='direct_gap',
    save_path='optimal_clusters.png'
)

# Save clustering model for later use
cluster_dc.save_clustering('my_clustering_model.cdc')

# Load previously saved model
loaded_model = ClusterDC.load_clustering('my_clustering_model.cdc')

Example Notebook

To see detailed examples of how to use the library, please refer to the provided Jupyter Notebook files in the examples directory. These notebooks demonstrate the usage of the functions with sample data and provide visualizations of the analysis and clustering results. They serve as practical guides to help you get started with ClusterDC and understand its various capabilities.

Upcoming Improvements

We're actively working on enhancing ClusterDC with:

  • More advanced visualization capabilities
  • Extended documentation and tutorials

Contact

If you have any questions or feedback, please feel free to contact:

Attribution

If you use ClusterDC in your work, please include the following attribution:

Meyrieux, M., Hmoud, S., van Geffen, P., Kaeter, D. CLUSTERDC: A New Density-Based Clustering Algorithm and its Application in a Geological Material Characterization Workflow. Nat Resour Res (2024). https://doi.org/10.1007/s11053-024-10379-5

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

We would like to acknowledge:

  • The PaCMAP team for providing the PaCMAP dimension reduction algorithm, which is useful for reducing the dimensionality of the data before applying ClusterDC. For more information, refer to the PaCMAP & LocalMAP GitHub repository.

  • The ClusterDV team for the development of the ClusterDV MATLAB code and the synthetic datasets provided with it. ClusterDC was developed as an extension of ClusterDV to overcome its limitations in processing large datasets. For more details on ClusterDV, see the ClusterDV GitHub repository.

