In the first post of this series, we introduced the idea that subsurface anomaly detection models can be trained using synthetic data generated from physical simulations, eliminating the need for labelled excavation examples. This post explores the rationale and validation behind that approach.

Among geophysical survey practitioners, this idea typically triggers one of two reactions. The first is scepticism: synthetic data can sound like a euphemism for fabricated data, raising concerns that a model might perform well in simulation yet fail under real-world conditions. The second is curiosity: if the approach truly works, it addresses one of the most costly and persistent challenges in the field.

This post is for both audiences. It explains how synthetic data generation is used for magnetic anomaly detection, why the approach is grounded in established physical principles rather than assumptions, how anomaly and non-anomaly samples are generated, and where the limitations genuinely lie.

What "Synthetic Data" Actually Means Here

The phrase synthetic data can be misleading. In many machine learning applications, it refers to data samples generated by another model. In magnetic anomaly detection, however, synthetic data comes from physics-based simulations of magnetic fields and sensor responses. It is not invented by AI; it is computed from known physical laws. That distinction is key.

Here, synthetic data is generated from the same physical equation that geophysicists have relied on for decades to interpret survey data. The magnetic dipole model describes how a magnetised object perturbs the surrounding magnetic field. Unlike a statistical model trained on examples, it is a physics-based formulation derived from first principles and grounded in the behaviour of real magnetic fields.

Generating a synthetic anomaly sample is not an exercise in statistical imagination. Given an object's magnetic moment, depth, orientation, sensor geometry, and measurement noise, the magnetic field response can be computed directly from the governing equations. The resulting signal is therefore a physically consistent prediction of what a magnetometer would measure. In that sense, the process is closer to a flight simulator than a deepfake: the realism comes from modelling the underlying physics, not from mimicking observed examples.

The Magnetic Dipole Model in Plain Terms

The magnetic dipole model describes how a buried metal object disturbs the Earth's magnetic field. In simple terms, the object behaves like a small magnet with a particular strength and direction.

The disturbance weakens rapidly with distance: the field of a dipole falls off with the cube of the distance from the object. A magnetometer passing overhead records this disturbance as a magnetic anomaly, whose size and shape depend on the object's strength, depth, and orientation.

Because these relationships are governed by physics, they can be used to generate realistic synthetic training data.
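As a minimal sketch of that relationship, the standard point-dipole field formula can be written in a few lines of Python. The function name dipole_field and the unit conventions (moment in A*m^2, distances in metres, field in tesla) are choices made for this illustration, not taken from the framework itself.

```python
import numpy as np

MU_0 = 4e-7 * np.pi  # vacuum permeability, T*m/A


def dipole_field(moment, offset):
    """Magnetic flux density (tesla) of a point dipole.

    moment: dipole moment vector in A*m^2, shape (3,)
    offset: vector(s) from the dipole to the observation point in metres,
            shape (..., 3)
    """
    moment = np.asarray(moment, dtype=float)
    offset = np.asarray(offset, dtype=float)
    dist = np.linalg.norm(offset, axis=-1, keepdims=True)
    unit = offset / dist
    m_dot_r = np.sum(unit * moment, axis=-1, keepdims=True)
    # B = mu_0 / (4 pi) * (3 (m . r_hat) r_hat - m) / |r|^3
    return MU_0 / (4.0 * np.pi) * (3.0 * m_dot_r * unit - moment) / dist**3


# Doubling the distance along the dipole axis weakens the field eightfold.
moment = np.array([0.0, 0.0, 1.0])
near = dipole_field(moment, [0.0, 0.0, 1.0])
far = dipole_field(moment, [0.0, 0.0, 2.0])
ratio = near[2] / far[2]
```

The ratio of the two on-axis field values illustrates the inverse-cube fall-off: moving the sensor from one metre to two metres away reduces the signal by a factor of eight.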

Along a survey line, the anomaly has a characteristic profile: the signal grows as the sensor approaches the object, peaks or dips near it, and then fades away. Its shape is controlled by a handful of physical parameters, which are unknown in a real survey but can be set exactly in a simulation:

Magnetic moment - how strongly the object is magnetised, and in what direction.

Burial depth - deeper objects produce weaker, broader signatures.

Orientation - the same object at a different angle produces a measurably different signature.

Sensor altitude - how high the magnetometer flies, which changes signal strength.

Measurement noise - the random variation every real sensor adds, simulated with Gaussian noise.

In a real survey, these parameters are unknown or only partly known, and their effects are tangled together in a single measured signal. In a simulation, you set each one deliberately, which means you know the ground truth, because you defined it. That is the whole principle behind synthetic data generation for anomaly detection.
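One way to picture "setting each parameter deliberately" is a small record of scene parameters drawn at random. The sketch below is illustrative only: the names SceneParams, sample_scene, and moment_vector are hypothetical, orientation is split into two angles for convenience, and every numeric range is an assumption rather than a value from the framework.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SceneParams:
    """One simulated scene; every field is known ground truth."""
    moment_am2: float        # magnetic moment strength, A*m^2
    inclination_deg: float   # moment dip below horizontal
    declination_deg: float   # moment azimuth measured from the x axis
    depth_m: float           # burial depth below the surface
    altitude_m: float        # sensor height above the surface
    noise_std_nt: float      # Gaussian noise standard deviation, nT


def sample_scene(rng):
    """Draw one parameter set. All ranges are illustrative assumptions."""
    return SceneParams(
        moment_am2=rng.uniform(0.1, 10.0),
        inclination_deg=rng.uniform(-90.0, 90.0),
        declination_deg=rng.uniform(0.0, 360.0),
        depth_m=rng.uniform(0.2, 3.0),
        altitude_m=rng.uniform(0.5, 2.0),
        noise_std_nt=rng.uniform(0.1, 1.0),
    )


def moment_vector(scene):
    """Moment as an (x, y, z) vector, with z pointing up."""
    inc = np.radians(scene.inclination_deg)
    dec = np.radians(scene.declination_deg)
    direction = np.array([
        np.cos(inc) * np.cos(dec),
        np.cos(inc) * np.sin(dec),
        -np.sin(inc),  # positive inclination dips downward
    ])
    return scene.moment_am2 * direction


rng = np.random.default_rng(seed=0)
scenes = [sample_scene(rng) for _ in range(5)]
```

Because each scene is drawn from explicit ranges, the label and every physical parameter behind a sample can be stored alongside it.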

Generating an Anomaly Sample

To generate a synthetic anomaly sample, we first define the object: its magnetic moment, orientation, and burial depth. We then define the survey geometry. In our framework, a three-sensor magnetometer array moves along a linear track at a fixed altitude, recording measurements at discrete positions.

For each sensor position, the dipole equation is used to compute the magnetic field generated by the object. This field is projected onto the Earth's field direction to produce the Total Magnetic Intensity (TMI) anomaly that a real magnetometer would measure, a standard approximation when the object's field is much weaker than the Earth's. Gaussian noise is then added to mimic measurement uncertainty.
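The steps above can be sketched as follows. This is an illustration under stated assumptions, not the framework's implementation: the function names, the track length, the number of positions, the sensor spacing of 0.5 m, and the coordinate convention (x along track, y across track, z up, object at the origin) are all hypothetical.

```python
import numpy as np

MU_0_OVER_4PI = 1e-7  # mu_0 / (4 pi), in T*m/A


def dipole_field(moment, offset):
    """Dipole flux density in tesla at offset(s) of shape (..., 3)."""
    dist = np.linalg.norm(offset, axis=-1, keepdims=True)
    unit = offset / dist
    m_dot_r = np.sum(unit * moment, axis=-1, keepdims=True)
    return MU_0_OVER_4PI * (3.0 * m_dot_r * unit - moment) / dist**3


def simulate_anomaly_pass(moment, depth, altitude, earth_dir, noise_std_nt,
                          rng, track_x=None, sensor_y=(-0.5, 0.0, 0.5)):
    """Return a (positions, sensors) array of noisy TMI anomaly values in nT."""
    if track_x is None:
        track_x = np.linspace(-10.0, 10.0, 101)  # assumed track, metres
    moment = np.asarray(moment, dtype=float)
    earth_dir = np.asarray(earth_dir, dtype=float)
    earth_dir = earth_dir / np.linalg.norm(earth_dir)

    # Sensor positions relative to the object, shape (positions, sensors, 3).
    x, y = np.meshgrid(track_x, np.asarray(sensor_y, dtype=float),
                       indexing="ij")
    z = np.full_like(x, altitude + depth)
    offset = np.stack([x, y, z], axis=-1)

    field_t = dipole_field(moment, offset)
    # Project onto the Earth's field direction and convert tesla to nT.
    tmi_nt = (field_t @ earth_dir) * 1e9
    noise = rng.normal(0.0, noise_std_nt, size=tmi_nt.shape)
    return tmi_nt + noise


rng = np.random.default_rng(seed=1)
earth_dir = np.array([0.0, 0.0, -1.0])  # assumed vertical field, pointing down
moment = 1.0 * earth_dir                # 1 A*m^2, aligned with the Earth's field
sample = simulate_anomaly_pass(moment, depth=1.0, altitude=1.0,
                               earth_dir=earth_dir, noise_std_nt=0.5, rng=rng)
label = 1  # anomaly present by construction
```

Each call produces one multi-channel pass; the label is attached at generation time because the presence of the object is set by the simulation.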

The result is a multi-channel time series that closely resembles a real survey pass over a buried object. By repeating this process with different combinations of object properties and survey conditions, we can generate an unlimited number of anomaly samples, each automatically labelled because the presence of an object is known by design.

The Half That Gets Forgotten: Non-Anomaly Samples

A detector trained only on anomalies is of little practical use. To operate in the field, it must learn the difference between a genuine target and background ground that contains nothing of interest. That means the dataset needs non-anomaly samples as well, and generating realistic background data is just as important as generating the anomalies themselves.

A non-anomaly sample is not a flat line. Real background measurements contain natural variations in the Earth's magnetic field, geological effects, and sensor noise. If the simulated background is too clean, a detector may perform well in testing but fail in the field. The negative class must be realistic enough that distinguishing it from a genuine anomaly remains a meaningful challenge.
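To make the point concrete, here is one possible way to build a background pass that is not a flat line. The background model is purely an assumption for illustration (a linear regional trend, a smooth low-frequency term standing in for geology, and per-sensor Gaussian noise); the post does not specify how the framework models background, and the function name and amplitudes are hypothetical.

```python
import numpy as np


def simulate_background_pass(rng, n_positions=101, n_sensors=3,
                             trend_nt=2.0, geology_nt=1.0, noise_std_nt=0.5):
    """Return a (positions, sensors) background pass in nT with no target."""
    x = np.linspace(-1.0, 1.0, n_positions)  # normalised track coordinate

    # Regional trend: a gentle slope shared by all sensors.
    trend = rng.uniform(-trend_nt, trend_nt) * x

    # Smooth variation standing in for geology: one low-frequency sinusoid
    # with random amplitude, wavelength, and phase.
    cycles = rng.uniform(0.5, 2.0)
    phase = rng.uniform(0.0, 2.0 * np.pi)
    geology = rng.uniform(0.0, geology_nt) * np.sin(np.pi * cycles * x + phase)

    shared = (trend + geology)[:, None]  # same ground under every sensor
    noise = rng.normal(0.0, noise_std_nt, size=(n_positions, n_sensors))
    return shared + noise


rng = np.random.default_rng(seed=2)
background = simulate_background_pass(rng)
label = 0  # no object present by construction
```

A richer background model would make the negative class harder, which is the goal: the closer the background amplitudes come to those of weak anomalies, the more meaningful the separation task becomes.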

Our framework uses a balanced dataset, with roughly equal numbers of anomaly and non-anomaly samples, to avoid biasing the clustering stage toward either class. This is where credible synthetic data separates from the misleading kind. Generating a clear anomaly is relatively straightforward; generating realistic background data that is genuinely difficult to distinguish from an anomaly is the real challenge.

Turning Raw Signal into Usable Features

A raw multi-sensor time series contains a lot of information, but it is not an efficient input for clustering. Instead, each sample is reduced to a compact feature vector. For every sensor channel, we extract simple statistical descriptors from the TMI signal, including the maximum, minimum, mean, and peak-to-peak amplitude. These features capture the key characteristics of the anomaly: strong, shallow targets tend to produce larger amplitudes, while weak or absent targets do not.
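The reduction described above might look like the sketch below, which assumes each sample is an array of shape (positions, sensors). The function name extract_features and the ordering of the features in the output vector are choices made for this example.

```python
import numpy as np


def extract_features(sample):
    """Reduce a (positions, sensors) TMI pass to one flat feature vector.

    Per channel: maximum, minimum, mean, and peak-to-peak amplitude.
    """
    sample = np.asarray(sample, dtype=float)
    maximum = sample.max(axis=0)
    minimum = sample.min(axis=0)
    mean = sample.mean(axis=0)
    peak_to_peak = maximum - minimum
    return np.concatenate([maximum, minimum, mean, peak_to_peak])


# A toy three-position, three-sensor pass.
toy = np.array([[0.0, 1.0, 2.0],
                [3.0, -1.0, 2.0],
                [0.0, 0.0, 2.0]])
features = extract_features(toy)
```

With three sensors and four descriptors per channel, each pass collapses to a 12-element vector regardless of how many positions were recorded.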

Not every feature improves performance. Standard deviation, for example, was evaluated but ultimately discarded because it increased ambiguity rather than separation between classes. Removing it reduced the number of samples that could not be confidently assigned to a cluster. The lesson is simple: feature selection should be driven by the underlying physics, not by the assumption that more features automatically lead to better results.

The result is a compact feature vector for each sample, with all features standardised to a common scale so that no single descriptor dominates the analysis. This reduced representation becomes the input to the clustering algorithm, a topic we explore in the next post in this series.
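Standardising to a common scale is commonly done with a z-score per feature. The sketch below assumes that approach; the post does not say which scaling the framework uses, and the function name standardise is hypothetical.

```python
import numpy as np


def standardise(features):
    """Z-score each feature column of a (samples, features) matrix.

    Returns the scaled matrix plus the mean and standard deviation used,
    so the same scaling can later be applied to new data.
    """
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    # Leave constant columns at zero instead of dividing by zero.
    std = np.where(std == 0.0, 1.0, std)
    return (features - mean) / std, mean, std


# Two samples, three features on very different scales.
raw = np.array([[1.0, 100.0, 5.0],
                [3.0, 300.0, 5.0]])
scaled, mean, std = standardise(raw)
```

After scaling, the first two columns carry equal weight despite their hundredfold difference in raw magnitude, and the constant third column contributes nothing.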

Where Synthetic Data Stops and Reality Begins

Synthetic data generation is powerful, but it is not a complete solution.

The magnetic dipole model is an approximation. Real objects have complex shapes and magnetisation patterns that can produce responses that differ from an ideal dipole, especially at close range.

Real survey environments are also difficult to model perfectly. Geological variability, cultural interference, sensor drift, calibration errors, and platform-motion artefacts all introduce effects that are only approximated in simulation.

These limitations do not undermine the approach; they define its role. Synthetic data is highly effective for preliminary anomaly detection, helping identify likely targets from large volumes of survey data. The most practical path forward is a hybrid one: use physics-based synthetic data for pre-training, then refine the model with labelled field data as it becomes available.

The Data Problem, Solved at the Source

The labelled-data problem has remained difficult because labels are traditionally obtained through ground truth verification after excavation. This creates a fundamental bottleneck: anomaly detection systems require labelled examples to learn, but those labels only become available once the target has already been identified. Synthetic data generation breaks this dependency. By simulating the physical response of buried objects, it produces fully labelled anomaly and non-anomaly samples directly from the underlying physics, enabling large-scale dataset generation without excavation campaigns, test ranges, or extensive historical survey archives.

Read the full methodology - including the complete dipole formulation, feature analysis, and experimental results.