Clustering is one of the most important techniques in unsupervised learning because it reveals the natural structure within data without requiring labelled examples. By identifying groups of similar observations, clustering helps uncover hidden patterns, summarise complex datasets, and distinguish normal behaviour from unusual behaviour. This makes it particularly valuable in anomaly detection, where labelled anomalies are often rare or unavailable.

Clustering is the process of grouping data points so that points within the same group are more similar to each other than to those in other groups. Because it lets the data reveal its own structure rather than relying on labels, observations that do not fit the normal patterns become easier to identify. Grouping the data is the straightforward part; the real challenge is choosing the clustering algorithm that best matches the characteristics of the data.

Earlier posts in this series explained how to detect subsurface anomalies without labelled training data - generating synthetic survey signals from the magnetic dipole model and then reducing each sample to a compact feature vector. This post picks up from there. Now that we have a feature space, the next question is: which unsupervised clustering algorithm best separates anomaly from non-anomaly, and why?

Three algorithms are among the most widely used in unsupervised clustering: K-Means, Gaussian Mixture Models, and DBSCAN. Each rests on a different assumption about what a cluster is. Understanding those assumptions is the whole of the decision, because an inappropriate assumption can produce clusters that appear well defined but fail to represent the true structure of the data.

K-Means: Simple, Fast, and Quietly Demanding

K-Means is often the first clustering algorithm we encounter, and for good reason. It is fast, simple to implement, and easy to understand. We specify the number of clusters, the algorithm places that many centroids, assigns each data point to its nearest centroid, updates the centroid locations, and repeats this process until the assignments no longer change. When the data form well-separated, roughly spherical clusters of similar size and density, K-Means performs remarkably well.
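The loop described above is short enough to write out by hand. The following is a minimal sketch in NumPy, assuming random data points as the initial centroids and Euclidean distance; the function name `kmeans` and the toy blobs are illustrative choices, not something taken from the research.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-Means loop: assign, update, repeat until assignments are stable."""
    rng = np.random.default_rng(seed)
    # Place k centroids on randomly chosen data points (one common initialisation).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments no longer change
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Toy data (an assumption for illustration): two compact, well-separated blobs.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal([0.0, 0.0], 0.3, size=(50, 2)),
    rng.normal([5.0, 5.0], 0.3, size=(50, 2)),
])
labels, centroids = kmeans(X, k=2)
print(np.bincount(labels, minlength=2), centroids.round(2))
```

In practice a library implementation such as scikit-learn's `KMeans` is the safer choice, since it adds better initialisation and repeated restarts; the sketch only makes the assign-and-update cycle visible.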

The demands it makes are quieter than they first appear. Before the algorithm can begin, we must specify the number of clusters. In a controlled experiment, where there are two known classes - anomaly and non-anomaly - that is a reasonable choice. In a real survey, where the number of distinct structures is unknown, it is a guess that shapes the entire result. If we specify two clusters when the data naturally contains three, K-Means will force the data to fit our choice, splitting or merging the underlying structure to satisfy the number we imposed.
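To see what imposing the wrong count looks like, here is a small sketch using scikit-learn's `KMeans`. The three synthetic groups and their positions are assumptions made for illustration; the point is only that asking for two clusters returns two, whatever the data contains.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (an assumption): three compact groups, 40 points each.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [5.0, 0.0], [2.5, 4.0]])
X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in centres])
true_group = np.repeat([0, 1, 2], 40)

# Ask for two clusters even though the data was generated with three groups.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Majority K-Means label within each of the three generated groups.
majority = [
    int(np.bincount(labels[true_group == g], minlength=2).argmax())
    for g in range(3)
]
print("distinct labels returned:", sorted(set(labels.tolist())))
print("majority label per generated group:", majority)
```

With only two labels available for three groups, at least two groups must share one, and nothing in the output signals that a merge took place.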

K-Means also assumes that clusters are roughly spherical and of similar size. Magnetic anomaly responses rarely satisfy these assumptions. Signatures vary widely with object strength, depth, and orientation, producing a feature distribution that is uneven and irregularly shaped. And critically, K-Means has no concept of an outlier. Every point is assigned to a cluster, including the noisy, ambiguous points that should arguably belong to nothing. In anomaly detection, those observations are often the most informative, and forcing them into an existing cluster can obscure the very anomalies we are trying to detect.
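The absence of an outlier category is easy to demonstrate. The sketch below, again with scikit-learn's `KMeans`, adds one hypothetical stray point midway between two toy groups and checks what label it receives; the data and the position of the stray point are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
blobs = np.vstack([
    rng.normal([0.0, 0.0], 0.3, size=(50, 2)),
    rng.normal([5.0, 5.0], 0.3, size=(50, 2)),
])
stray = np.array([[2.5, 2.5]])  # hypothetical ambiguous point between the groups
X = np.vstack([blobs, stray])   # the stray point is the last row

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_

# Distance from each point to the centroid it was assigned to.
dist_to_centroid = np.linalg.norm(X - km.cluster_centers_[labels], axis=1)

print("label given to the stray point:", labels[-1])
print("its distance to that centroid:", dist_to_centroid[-1].round(2))
print("largest distance among the blob points:", dist_to_centroid[:-1].max().round(2))
```

The stray point receives an ordinary cluster label. Its distance to the centroid could be used as a post-hoc outlier score, but that is an extra step we would have to bolt on; K-Means itself never reports it.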

Gaussian Mixture Models: More Flexible, Same Blind Spot

A Gaussian Mixture Model, or GMM, relaxes some of the assumptions that make K-Means rigid. Rather than assigning each data point exclusively to a single cluster, it models the data as a mixture of several Gaussian distributions and estimates the probability that each point belongs to each cluster. Because each Gaussian has its own shape and orientation, GMM can represent elongated, tilted and overlapping clusters that K-Means cannot. In that sense, it is a more flexible and expressive extension of the same basic idea.

That added flexibility helps overcome the shape limitation, but GMM inherits two limitations of K-Means that matter for geophysical data. First, we must still specify the number of mixture components in advance, so the uncertainty about how many distinct structures exist does not disappear. Second, like K-Means, GMM has no native notion of noise or outliers. It assigns a probability to every point, including the genuine anomalies. Unless we introduce an explicit probability threshold or rejection criterion, every point is ultimately assigned to the component with the highest probability rather than being identified as noise.
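A short sketch with scikit-learn's `GaussianMixture` shows both behaviours: soft membership for every point, and the kind of rejection rule we would have to add ourselves. The toy data, the isolated point, and the 2% cut-off are all assumptions chosen for illustration, not values from the research.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X = np.vstack([
    rng.normal([0.0, 0.0], 0.3, size=(50, 2)),
    rng.normal([6.0, 0.0], 0.3, size=(50, 2)),
    [[3.0, 4.0]],  # hypothetical isolated point, placed last
])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

proba = gmm.predict_proba(X)        # soft membership, each row sums to 1
hard = gmm.predict(X)               # highest-probability component for every point
log_density = gmm.score_samples(X)  # log-likelihood of each point under the mixture

# Rejection rule (an assumption, not part of GMM itself): flag the least likely 2%.
threshold = np.percentile(log_density, 2)
rejected = log_density < threshold

print("component given to the isolated point:", hard[-1])
print("points flagged by the threshold:", np.flatnonzero(rejected))
```

Note that a percentile cut-off always flags a fixed share of the data, whether or not anything unusual is present. Choosing that cut-off is a separate modelling decision layered on top of the mixture.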

There is a further limitation. GMM assumes that each cluster can be modelled by a Gaussian distribution. When that assumption holds, the model is both elegant and effective. When it does not, and feature spaces derived from real geophysical survey data rarely consist of clean Gaussian clusters, the resulting fit can be misleading in ways that are difficult to recognise by eye. A model that always returns a plausible answer is not the same as one that returns a correct one.

DBSCAN: Density as the Definition of a Cluster

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) starts from a different premise. It does not ask how many clusters there are, and it does not assume that clusters have any particular shape. It defines a cluster as a region where points are packed densely together, separated from other clusters by regions of sparser space. Points in dense neighbourhoods are grouped into clusters; points that lie alone in sparse regions are labelled as noise and assigned to no cluster.

That single design choice addresses both weaknesses of the centroid-based methods. First, we do not need to specify the number of clusters in advance; DBSCAN discovers however many dense regions exist. Second, and more important for anomaly detection, it has an explicit category for noise. A point that does not belong to any dense region is not forced into a cluster simply because every point must be assigned; it is set aside as an outlier, which is often precisely the signal of interest.
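Both properties can be seen in a few lines with scikit-learn's `DBSCAN`, which marks noise with the label -1. The toy data, the two isolated points, and the parameter values are assumptions picked to suit this example's scale.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(4)
X = np.vstack([
    rng.normal([0.0, 0.0], 0.3, size=(50, 2)),
    rng.normal([5.0, 5.0], 0.3, size=(50, 2)),
    [[2.5, 2.5], [10.0, -3.0]],  # hypothetical isolated points, placed last
])

# eps and min_samples are illustrative values chosen for this toy scale.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

n_clusters = len(set(labels) - {-1})  # scikit-learn marks noise with the label -1
print("clusters found:", n_clusters)
print("labels of the two isolated points:", labels[-2:])
```

No cluster count was supplied, and the isolated points are returned as noise rather than being attached to the nearest group.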

DBSCAN is governed by two parameters: the neighbourhood radius, epsilon (ε), sets how close two points must be to count as neighbours, and the minimum number of points (MinPts) sets how many neighbours are required to define a dense region. Together they encode what density means for the data. This is where the method asks for care rather than guesswork. Both parameters must be tuned to the scale of the feature space, and the right values depend on the dataset and its density.
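One widely used starting point for the radius is the sorted k-distance curve: compute each point's distance to its MinPts-th nearest neighbour and look for the value where the curve bends sharply upward. The sketch below illustrates that general heuristic and is not the procedure used in the research. The toy data and the 90th-percentile shortcut are assumptions, standing in for a judgement that is normally made by eye.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(5)
X = rng.normal([0.0, 0.0], 1.0, size=(100, 2))  # toy stand-in for a feature space

min_pts = 5  # scikit-learn's min_samples counts the point itself

# Distance from each point to its min_pts-th nearest neighbour (itself included).
distances, _ = NearestNeighbors(n_neighbors=min_pts).fit(X).kneighbors(X)
k_dist = np.sort(distances[:, -1])

# Reading the "elbow" of the sorted curve is normally done by eye; the 90th
# percentile below is only an illustrative stand-in for that judgement.
eps = float(np.percentile(k_dist, 90))

db = DBSCAN(eps=eps, min_samples=min_pts).fit(X)
print("eps:", round(eps, 3))
print("core points:", len(db.core_sample_indices_), "of", len(X))
print("noise points:", int((db.labels_ == -1).sum()))
```

A point is a core point exactly when its k-distance is no larger than the radius, so picking a value on this curve is a direct statement about what share of the data we are willing to call dense.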

The VE3 AI Research framework studied exactly this. Across datasets containing 20 to 300 samples, the analysis found that as the number of points increased, the optimal epsilon (ε) generally decreased while the optimal minimum number of points (MinPts) generally increased, a combination that preserved genuine structure and avoided merged clusters. Smaller datasets clustered cleanly with a larger radius and lower MinPts values; larger, denser datasets needed a tighter radius and higher MinPts values. The behaviour was systematic, not arbitrary, which is what makes it a practical guide for parameter selection rather than a process of trial and error.
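The direction of the epsilon trend has a simple geometric reading: more samples drawn from the same region sit closer together, so typical neighbour distances shrink. The sketch below does not reproduce the framework's analysis or its data. It uses synthetic Gaussian samples (an assumption) and a hypothetical helper, `median_k_distance`, to show how the scale of neighbour distances changes with sample count. It says nothing about the MinPts trend.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def median_k_distance(X, k):
    """Median distance to the k-th nearest neighbour, counting the point itself."""
    distances, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
    return float(np.median(distances[:, -1]))

rng = np.random.default_rng(6)
k = 5  # held fixed here purely so the three sample sizes are comparable
medians = {}
for n in (20, 100, 300):
    X = rng.normal([0.0, 0.0], 1.0, size=(n, 2))  # same distribution, more samples
    medians[n] = median_k_distance(X, k)
    print(n, "samples -> median k-distance", round(medians[n], 3))
```

A radius tuned on the sparsest sample would be generous on the densest one, which is one plausible reason a tighter radius is needed there to avoid merging distinct structures.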

Why Density Fits the Physics

The case for DBSCAN here is not that it is fashionable or generally superior. It is that its core assumption matches the structure of the problem. Non-anomaly samples, drawn from quiet background, cluster tightly: their statistical features resemble one another, forming a dense region. Anomaly samples, shaped by varying object strength, depth, and orientation, are more dispersed throughout the feature space, while ambiguous, noisy points often lie between dense regions and belong clearly to neither cluster.

An algorithm built around density and noise reads that landscape correctly. It groups the tight background region, identifies irregularly shaped regions of higher density without requiring them to be spherical, and sets the ambiguous points aside instead of forcing them into a cluster. K-Means and GMM, built around centroids and fixed cluster counts, would impose a structure the data does not have, and do so confidently, which is what makes the resulting clusters potentially misleading.
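A toy version of that landscape makes the contrast concrete. The sketch below runs all three algorithms from scikit-learn on the same two-dimensional data: one tight group and one scattered group. The data is a synthetic stand-in and not the framework's feature space, and every parameter value is an illustrative assumption.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
background = rng.normal([0.0, 0.0], 0.2, size=(150, 2))  # tight, dense group
dispersed = rng.uniform(3.0, 9.0, size=(25, 2))          # scattered group
X = np.vstack([background, dispersed])

results = {
    "K-Means": KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X),
    "GMM": GaussianMixture(n_components=2, random_state=0).fit(X).predict(X),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5).fit_predict(X),  # illustrative values
}

for name, labels in results.items():
    n_noise = int((labels == -1).sum())   # only DBSCAN can produce the label -1
    n_groups = len(set(labels) - {-1})
    print(f"{name}: {n_groups} clusters, {n_noise} points left unassigned")
```

K-Means and GMM return exactly the two groups they were asked for and assign every point to one of them. DBSCAN is free to leave sparse points unassigned, which is the behaviour the argument above depends on.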

This is also why the framework's feature-selection results fit the same logic. The research tested standard deviation as a feature and found it added noise rather than discrimination, increasing the number of points labelled as noise by DBSCAN. Removing it sharpened the density structure on which DBSCAN depends. The lesson runs both ways: the algorithm should match the data, and the features should be chosen to reveal that structure as clearly as possible.

Choosing in Practice

None of this makes K-Means or GMM bad algorithms. Each is the right tool under the right conditions, and a practical comparison is more useful than a verdict.

Choose K-Means when we know roughly how many clusters to expect, the groups are compact and similar in size, and outliers are not the point of the exercise. It is fast and entirely adequate for well-behaved data.

Choose a GMM when we expect clusters to overlap or exhibit elongated shapes, when soft, probabilistic membership is preferred over hard assignment, and when the data within each group can be reasonably modelled by a Gaussian distribution.

Choose DBSCAN when we do not know the number of clusters in advance, cluster shapes are irregular, and identifying outliers is part of the goal. This describes subsurface anomaly detection closely.

The reason DBSCAN anchors this research is that all three of those conditions hold. The number of structures is not known ahead of time, the anomaly feature space is irregular by nature, and the points labelled as noise are not a nuisance to be absorbed but a signal to be surfaced. When the assumptions of the algorithm align with the underlying structure of the data, the clustering produces meaningful results. When they do not, it may produce convincing clusters that do not reflect reality.

The Algorithm Is a Choice About Assumptions

It is tempting to treat the choice of clustering algorithm as a technical detail to settle late and quickly. It is quite the opposite. Each algorithm carries a built-in assumption about what a cluster is: convex and counted for K-Means, Gaussian and counted for GMM, or dense and uncounted for DBSCAN. Choosing the algorithm is choosing which assumption to impose on your data.

For subsurface anomaly detection built on synthetic, physics-derived features, density-based clustering is the approach that fits. It discovers structure without being told how much to find, tolerates the irregular shapes that real anomaly responses produce, and treats genuine outliers as distinct from the underlying background. That alignment between method and data is what turns a clustering result into a reliable first pass, and a reliable first pass is the foundation on which everything else is built.

Read the full methodology - including the complete DBSCAN parameter analysis and PCA cluster visualisations across 20 to 300 sample datasets.