Improved Cluster Identification and Visualization in High-Dimensional Data Using Self-Organizing Maps

Title: Improved Cluster Identification and Visualization in High-Dimensional Data Using Self-Organizing Maps
Publication Type: Conference Paper and Presentation
Year of Publication: 2011
Authors: Manukyan, N, Eppstein, MJ, Rizzo, DM
Conference Name: EOS Transactions, American Geophysical Union Fall Meeting
Date Published: 2011/12
Conference Location: San Francisco, CA

A Kohonen self-organizing map (SOM) is a type of unsupervised artificial neural network that produces a self-organized projection of high-dimensional data onto a low-dimensional feature map, wherein vector similarity is implicitly translated into topological closeness, enabling clusters to be identified. In recently published work [1], 209 microbial variables from 22 monitoring wells around the leaking Schuyler Falls Landfill in Clinton, NY [2] were analyzed using a multi-stage non-parametric process to explore how microbial communities may act as indicators for the gradient of contamination in groundwater. The final stage of that analysis used a weighted SOM to identify microbial signatures in this high-dimensional data set that correspond to clean, fringe, and contaminated soils. The resulting clusters were visualized with the standard unified distance matrix (U-matrix). However, while the results of this analysis were very promising, the visualized boundaries between clusters in the SOM were indistinct and required manual and somewhat arbitrary identification. In this contribution, we introduce (i) a new cluster reinforcement (CR) phase, run after traditional SOM training, that automatically sharpens cluster boundaries, and (ii) a new boundary matrix (B-matrix) approach for visualizing the resulting cluster boundaries. The CR phase differs from standard SOM training in several ways, most notably by using a feature-based neighborhood function rather than a topologically based neighborhood function. In contrast to the U-matrix, the B-matrix can be directly superimposed on heat maps of the individual features (as output by the SOM) using grid lines whose thickness corresponds to inter-cluster distances. By thresholding the displayed lines, one obtains hierarchical control of the visual level of cluster resolution.
We first illustrate the advantages of these methods on a small synthetic test case, and then apply them to the Schuyler Falls Landfill data to demonstrate how the proposed methods facilitate automatic identification and visualization of clusters in real-world, high-dimensional biogeochemical data with complex relationships. The proposed methods are quite general and are applicable to a wide range of geophysical problems.
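
For readers unfamiliar with the baseline that the abstract builds on, the following is a minimal sketch of a standard Kohonen SOM with a topological (grid-distance) Gaussian neighborhood, together with a simple U-matrix computed as the mean distance from each node to its four grid neighbors. It does not implement the CR phase or the B-matrix, whose details are not given in the abstract. The function names (train_som, best_matching_unit, u_matrix), the linear decay schedules, the default parameters, and the synthetic two-cluster data are all assumptions made for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return (row, col) of the node whose weight vector is closest to x."""
    best, best_d = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = math.dist(w, x)
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(data, rows, cols, epochs=40, lr0=0.5, sigma0=None, seed=0):
    """Train a standard SOM; returns weights[r][c] as a list of floats."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = sigma0 or max(rows, cols) / 2.0
    # Initialise every node with a copy of a randomly chosen sample.
    weights = [[list(rng.choice(data)) for _ in range(cols)]
               for _ in range(rows)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # assumed linear decay
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # assumed floor of 0.5
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    # Topological neighborhood: distance measured on the grid.
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                    w = weights[r][c]
                    for k in range(dim):
                        w[k] += lr * h * (x[k] - w[k])
            step += 1
    return weights


def u_matrix(weights):
    """Mean distance from each node to its 4-connected grid neighbors."""
    rows, cols = len(weights), len(weights[0])
    u = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = [math.dist(weights[r][c], weights[nr][nc])
                     for nr, nc in ((r - 1, c), (r + 1, c),
                                    (r, c - 1), (r, c + 1))
                     if 0 <= nr < rows and 0 <= nc < cols]
            u[r][c] = sum(dists) / len(dists) if dists else 0.0
    return u


if __name__ == "__main__":
    # Hypothetical synthetic data: two well-separated clusters in 3-D.
    rng = random.Random(1)
    data = ([[rng.gauss(0.0, 0.3) for _ in range(3)] for _ in range(20)] +
            [[rng.gauss(5.0, 0.3) for _ in range(3)] for _ in range(20)])
    som = train_som(data, rows=5, cols=5)
    for row in u_matrix(som):
        print(" ".join(f"{v:5.2f}" for v in row))
```

In a U-matrix like this, larger values mark nodes whose neighbors have dissimilar weight vectors, which is where cluster boundaries tend to appear. Because the values vary gradually across the map, a boundary still has to be chosen by eye or by an ad hoc threshold, which is the limitation the abstract describes.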

Refereed Designation: Refereed