Two complementary Python repositories for training and analyzing Growing Hierarchical Self-Organizing Maps, focusing on design decisions, trade-offs, and practical research workflows.
On Building Tools for Hierarchical Self-Organization
Machine learning tools often force a choice: usability or transparency. You get a clean API or you get visibility into what happens under the hood. The two repositories I have been working on—ghsom-py and ghsom-toolkits—represent an attempt to avoid this trade-off.
The Core Implementation
GHSOM, or Growing Hierarchical Self-Organizing Map, is not a new algorithm. Rauber, Merkl, and Dittenbach introduced it in 2002 as a way to let self-organizing maps grow both horizontally, by adding rows and columns to a map, and vertically, by spawning child maps beneath neurons that still represent their data poorly. The map expands when the data demands it, creating a hierarchy that reflects natural structure. This matters because real data rarely fits neatly into a fixed grid.
The ghsom-py repository contains a pure Python implementation. It depends only on NumPy, which keeps the install minimal. The design prioritizes clarity over performance tricks: you can follow the training loop without diving into C extensions. Parallel training is supported as well.
Key parameters control growth. The t1 threshold governs horizontal growth: a map keeps adding neurons until its mean quantization error falls below a fraction t1 of its parent unit's error. The t2 threshold governs vertical growth: a neuron spawns a child map when its quantization error stays above a fraction t2 of the error over the whole dataset. These values shape the hierarchy, and they require thought. There is no automatic tuning. This is intentional. The algorithm should not hide its decisions.
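A minimal usage sketch, assuming a hypothetical API (the import path, class name, and parameter names are illustrative, not the package's exact interface):

```python
import numpy as np

from ghsom import GHSOM  # hypothetical import path

data = np.random.rand(1000, 16)  # toy dataset: 1000 samples, 16 features

# t1 governs breadth: lower values let each map grow wider before stopping.
# t2 governs depth: lower values spawn child maps more eagerly.
model = GHSOM(t1=0.1, t2=0.01)
model.fit(data)

root = model.root  # each node holds a trained map and optional children
```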
A callback system tracks training progress. You can log metrics, visualize growth, or integrate with Weights & Biases.
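Continuing the sketch above, a callback that forwards growth events to Weights & Biases might look like this (the `callbacks` parameter and the event's fields are assumptions about the interface):

```python
import wandb

def log_growth(event):
    # Assumed event fields: current depth, grid size, and map error.
    wandb.log({"depth": event.depth,
               "units": event.rows * event.cols,
               "mqe": event.mean_quantization_error})

wandb.init(project="ghsom-experiments")
model = GHSOM(t1=0.1, t2=0.01, callbacks=[log_growth])
model.fit(data)
```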
The Analysis Layer
Training a model is one task. Understanding it is another. The ghsom-toolkits repository addresses this second problem. It started as analysis scripts for a music generation project during my PhD, then grew into a separate package.
Visualization begins with hierarchy trees. Graphviz renders the structure. Node size shows cluster populations. Color indicates depth. You can see which branches grew dense and which remained sparse.
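A rough sketch of the rendering idea using pydot (the node dictionary and traversal are invented for illustration; the toolkit's own data structures will differ):

```python
import pydot

def add_subtree(graph, node, depth=0):
    # Node size scales with cluster population; fill darkens with depth.
    size = 0.5 + 0.1 * node["population"] ** 0.5
    gray = max(10, 90 - 20 * depth)
    graph.add_node(pydot.Node(node["id"], width=size, height=size,
                              style="filled", fillcolor=f"gray{gray}"))
    for child in node.get("children", []):
        add_subtree(graph, child, depth + 1)
        graph.add_edge(pydot.Edge(node["id"], child["id"]))

tree = {"id": "root", "population": 100, "children": [
    {"id": "a", "population": 60},
    {"id": "b", "population": 40}]}

graph = pydot.Dot(graph_type="digraph")
add_subtree(graph, tree)
graph.write_png("hierarchy.png")  # requires Graphviz installed
```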
Heatmaps reveal internal states. Weight vectors show what each neuron learned. U-Matrices display distances between neighboring neurons. Activation maps trace how specific samples flow through the hierarchy. These are standard tools for SOMs, but here they must handle the hierarchical case: every node in the tree carries its own map, so each plot is computed per node.
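For any single map in the hierarchy, the U-Matrix reduces to a few lines of NumPy. This is the textbook definition (mean distance from each neuron to its grid neighbors), not the toolkit's exact code:

```python
import numpy as np

def u_matrix(weights):
    """U-Matrix for one map: weights has shape (rows, cols, dim)."""
    rows, cols, _ = weights.shape
    umat = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = [np.linalg.norm(weights[r, c] - weights[r + dr, c + dc])
                     for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                     if 0 <= r + dr < rows and 0 <= c + dc < cols]
            umat[r, c] = np.mean(dists)  # high values mark cluster borders
    return umat
```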
The toolkit includes an interactive dashboard built with Dash. You can explore models in a browser. Select a node. Highlight its subtree. Watch how samples activate different levels. This is useful for presentations and for building intuition.
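The dashboard pattern is simple at its core. A stripped-down skeleton (the layout and node ids are placeholders, not the toolkit's actual components):

```python
from dash import Dash, Input, Output, dcc, html

app = Dash(__name__)
app.layout = html.Div([
    dcc.Dropdown(id="node", value="root",
                 options=["root", "root/0", "root/1"]),  # placeholder ids
    html.Pre(id="details"),
])

@app.callback(Output("details", "children"), Input("node", "value"))
def show_node(node_id):
    # The real dashboard would render weights, U-Matrices, and
    # activations for the selected node; here we just echo it.
    return f"Selected node: {node_id}"

if __name__ == "__main__":
    app.run(debug=True)
```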
Comparison tools matter because GHSOM results depend on thresholds. You can train three models—loose, medium, tight—and inspect the differences. Radar charts show multiple metrics at once. Silhouette scores and Davies-Bouldin indices quantify cluster quality. The toolkit can generate HTML reports that combine figures with metrics.
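Both metrics come straight from scikit-learn once each model's leaf assignments are flattened to labels. A sketch, with random labels standing in for three trained models:

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score, silhouette_score

rng = np.random.default_rng(0)
data = rng.random((300, 8))

# Placeholder labels; in practice these come from each model's leaf nodes.
runs = {"loose": rng.integers(0, 3, 300),
        "medium": rng.integers(0, 6, 300),
        "tight": rng.integers(0, 12, 300)}

for name, labels in runs.items():
    # Silhouette: higher is better. Davies-Bouldin: lower is better.
    print(f"{name}: silhouette={silhouette_score(data, labels):.3f}, "
          f"davies_bouldin={davies_bouldin_score(data, labels):.3f}")
```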
Design Decisions
The split into two packages reflects a practical concern. Training requires few dependencies. Analysis does not. Matplotlib, pydot, and Dash add weight. Separating them lets users install only what they need.
The APIs mirror each other. You adapt a trained model with one function call. Visualization routines accept the adapted format. This reduces friction. It also makes the division between training and analysis explicit.
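In outline, assuming illustrative names for the adapter and plotting functions:

```python
import numpy as np

from ghsom import GHSOM                      # hypothetical import path
from ghsom_toolkits import adapt, plot_tree  # names are illustrative

data = np.random.rand(500, 16)

model = GHSOM(t1=0.1, t2=0.01)
model.fit(data)

adapted = adapt(model)                  # one call bridges the two packages
plot_tree(adapted, out="hierarchy.png")
```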
Documentation lives in both repositories. The core package explains parameters and algorithm details. The toolkit shows usage patterns. This separation prevents duplication while keeping each document focused.
Limitations and Trade-offs
Pure Python runs slower than optimized C. The implementation accepts this. Speed matters less than clarity for many research applications. When performance becomes critical, users can export weights to faster tools.
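One plausible escape hatch, sketched with invented accessor names, is to flatten the hierarchy into plain NumPy arrays and save them once:

```python
import numpy as np

# `iter_nodes` and `node.path` are stand-ins for however the model
# exposes its tree; each node's weights become one array in the archive.
arrays = {node.path: node.weights for node in model.iter_nodes()}
np.savez_compressed("ghsom_weights.npz", **arrays)

# Any tool that reads .npz files can pick up from here.
```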
The visualization layer assumes models fit in memory. Very large hierarchies might require different approaches. The current tools work for datasets up to medium scale.
The GHSOM algorithm itself has constraints. It works best when clusters have clear spatial structure. High-dimensional sparse data presents challenges. The tools do not pretend otherwise.
Why This Matters
Self-organizing maps have fallen out of fashion. Deep networks dominate. Yet hierarchical clustering remains useful for exploratory analysis. You can inspect every node. The structure is interpretable. This matters for scientific applications where understanding beats raw performance.
The repositories provide a complete workflow. You can train models, visualize them, compare variants, and generate reports. The code is testable and documented. It does not try to be revolutionary. It tries to be reliable.
Looking Forward
There are clear next steps. Three-dimensional visualizations could help with complex hierarchies. Jupyter integration would streamline notebook workflows. Network graph layouts might reveal different structures.
The goal is not to build the most comprehensive toolkit. It is to create something that researchers can use, understand, and modify. The source code is the real documentation. Short functions. Clear variable names. Explicit parameters.
If you work with hierarchical clustering, try these tools. They are not perfect. They are simply honest about what they do.