Introduction

https://img.shields.io/pypi/v/bmdcluster.svg https://img.shields.io/travis/csprock/bmdcluster.svg Documentation Status https://img.shields.io/badge/License-MIT-yellow.svg

A Python implementation of the Binary Matrix Decomposition algorithm for clustering binary data.

bmdcluster

This packages contains an implementation of the Binary Matrix Decomposition (BMD) algorithm for clustering a binary matrix as presented by Li [1] and Li & Zhu [2]. BMD solves the two-sided clustering problem of clustering both the data and the features.

The algorithm works by decomposing a binary matrix \(W \in \mathbb{Z}^{n \times m}\) with \(n\) points and \(m\) features into \(\hat{W}=AXB^T\) where \(A \in \mathbb{Z}^{n \times c}\) and \(B \in \mathbb{Z}^{m \times k}\) are cluster membership indicator matrices for the data and features with \(c\) data clusters and \(k\) feature clusters. \(X \in \mathbb{Z}^{c \times k}\) is a cluster representation matrix that encodes the relationship between the data and feature clusters. The algorithm alternatingly solves for these matrices by minimizing

\[\begin{equation} \| W - AXB^T \|^2 \end{equation}\]

Two variants of the algorithm are implemented. The first method assumes the data matrix is block-diagonal, i.e. there is a one-to-one correspondence between the data and feature clusters, which is equivalent to setting \(X = I\). The second method is a general method with no restrictions on the matrix structure, i.e. a group of points can be associated with several clusters of features and vice versa. See [1] for details.

The cluster assignments of the data and features are encoded in indicator matrices that are randomly initialized at the start of the clustering procedure. [2] recommends bootstrapping a small subset of the data to get initial data cluster assignments. These are then used to seed the data indicator matrix at the beginning of the clustering procedure when used on the full dataset.

bmdcluster supports this method of initializing the data clusters as well as random initialization. Users also have the option initialize the feature cluster indicator matrix to the identity matrix, which corresponds to putting each feature in its own cluster at initialization.

Credits

[1](1, 2) Li, T. (2005, August). A general model for clustering binary data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining (pp. 188-197). ACM.
[2](1, 2) Li, T., & Zhu, S. (2005, April). On clustering binary data. In Proceedings of the 2005 SIAM International Conference on Data Mining (pp. 526-530). Society for Industrial and Applied Mathematics.

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.