Enterprise document automation has spent the last three years quietly running into a wall that the demos never show. The wall is not the language model, the extraction engine, or the orchestration layer. It is the moment before any of that runs, when a real-world submission arrives as a single PDF containing eight different documents in non-standard order. Solving that pre-processing step well, with mathematics rather than heuristics, is what makes the rest of the pipeline trustworthy.
The character-count failure mode
Most production Retrieval-Augmented Generation systems split incoming documents into chunks of a fixed character or token length. A common default is 1,000 characters with a 200-character overlap, partly because it is computationally cheap and partly because every popular framework ships with it as the out-of-the-box behaviour. The cost only becomes visible later. When a chunk boundary lands halfway through a topic, the resulting vector embedding represents a blend of two unrelated meanings rather than either one cleanly. Retrieval against that diluted embedding returns plausible-looking but semantically blurred passages, and downstream reasoning quietly degrades. The failure is not catastrophic; it is statistical, and it is precisely the kind of degradation that does not surface in unit tests but does surface when a regulator asks why a particular decision was made.
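To make the failure concrete, here is a minimal sketch of a fixed-length splitter with overlap, plus a helper that reports which chunks contain text from both sides of a known seam. The function names `fixed_chunks` and `straddling_chunks` are illustrative, not taken from any particular framework, and the two-document "bundle" is synthetic.

```python
def fixed_chunks(text: str, size: int = 1000, overlap: int = 200) -> list[tuple[int, int]]:
    """Return (start, end) character offsets of fixed-length chunks with overlap."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    spans = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        spans.append((start, end))
        if end == len(text):
            break  # the last chunk has reached the end of the text
        start += step
    return spans


def straddling_chunks(spans: list[tuple[int, int]], seam: int) -> list[tuple[int, int]]:
    """Chunks that contain characters from both sides of the seam offset."""
    return [(s, e) for s, e in spans if s < seam < e]


# Synthetic bundle: two 1,500-character "documents" glued together.
bundle = "A" * 1500 + "B" * 1500
spans = fixed_chunks(bundle)
mixed = straddling_chunks(spans, seam=1500)
print(spans)  # [(0, 1000), (800, 1800), (1600, 2600), (2400, 3000)]
print(mixed)  # [(800, 1800)]
```

In this toy case the chunk spanning offsets 800 to 1,800 holds 700 characters of the first document and 300 of the second, so its embedding would describe neither document on its own.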
Why fixed splitters and template separators fail on bundles
The bundled-document case makes the failure mode acute. A regulatory submission, a commercial loan packet, or a specialty insurance slip pack is not a single document with topic drift inside it; it is a set of structurally distinct documents glued together. Fixed-length splitters cut blindly across those structural seams. Template-based separators (barcode pages, keyword sentinels, page-count rules) work until the first counterparty changes their layout, scans the pack out of order, or omits a separator entirely. In production we see all three constantly. Worse, deterministic Intelligent Document Processing platforms enforce hard limits (MuleSoft IDP, for instance, caps files at 50 pages and 10MB), so a 120-page bundled submission cannot be fed in whole even if the separator logic worked. The pre-processing layer has to fracture the bundle correctly, and it has to do so without prior knowledge of its structure.
The algorithm
Semantic splitting replaces structural heuristics with a measurement of meaning. The bundle is first tokenised at sentence granularity using a standard segmentation library; sub-sentence chunking introduces noise without resolution gain. The algorithm then applies a sliding window across consecutive sentence groups. A window of six sentences, advanced one sentence at a time, is a defensible starting point. Each window is divided into two halves, and an embedding model produces a high-dimensional vector representation of each half.
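The windowing step can be sketched as follows. This is an illustration under assumptions rather than a reference implementation: it takes sentences that have already been segmented, the name `window_halves` is hypothetical, and it stops short of the embedding call, which depends on the model chosen. Each window yields the index of the sentence that would start a new document if the seam fell in the middle of that window.

```python
from typing import Iterator, Sequence


def window_halves(
    sentences: Sequence[str], window: int = 6
) -> Iterator[tuple[int, str, str]]:
    """Slide a window one sentence at a time and yield
    (candidate_boundary_index, leading_half_text, trailing_half_text)."""
    if window < 2 or window % 2:
        raise ValueError("window must be an even number >= 2")
    half = window // 2
    for start in range(len(sentences) - window + 1):
        leading = " ".join(sentences[start:start + half])
        trailing = " ".join(sentences[start + half:start + window])
        # The candidate boundary is the first sentence of the trailing half.
        yield start + half, leading, trailing


sentences = [f"s{i}." for i in range(8)]
for boundary, leading, trailing in window_halves(sentences):
    print(boundary, "|", leading, "|", trailing)
# 3 | s0. s1. s2. | s3. s4. s5.
# 4 | s1. s2. s3. | s4. s5. s6.
# 5 | s2. s3. s4. | s5. s6. s7.
```

A bundle shorter than the window produces no windows at all, so very short inputs need separate handling.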
The divergence between the two halves is measured using cosine similarity, the standard measure of the angle between vectors in embedding space, defined as cos(θ) = (A · B) / (‖A‖ ‖B‖), where A and B are the embedding vectors of the leading and trailing halves of the window. Semantic distance is then taken as 1 minus that value. Across a homogeneous section of a single document the distance is low; across a structural seam between two distinct documents (a catch certificate giving way to a commercial invoice, an applicant's filed accounts giving way to their business plan) it rises sharply.
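As a small worked illustration of that formula, the sketch below computes cosine similarity and the derived distance on plain Python lists. The toy two-dimensional vectors stand in for real embeddings, and the function names are illustrative.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """cos(theta) = (A . B) / (||A|| ||B||)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def semantic_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Distance as used in the text: 1 minus cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


print(semantic_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0 (same direction)
print(semantic_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
```

Because cosine similarity ranges from -1 to 1, this distance ranges from 0 to 2, with 0 meaning the two halves point in the same direction.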
What converts those rises into useful boundaries is thresholding. Tracking the distance series across the full bundle produces a noisy curve with clear peaks at every structural transition. Setting the boundary threshold at the 95th percentile of the bundle's own distance distribution, rather than at an absolute global value, calibrates the algorithm to the document's intrinsic variance. Peaks crossing that threshold are taken as logical document boundaries. The bundle is fractured at those points, each resulting sub-document is independently classified, and the pipeline downstream receives a stream of cleanly separated files at sizes the extraction engine can handle.
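One way to sketch the thresholding step is shown below. It assumes the distance series has already been computed, uses a linearly interpolated percentile, and treats a "peak" as a local maximum strictly above the threshold. Those last two choices are assumptions made for illustration; the text does not prescribe a percentile method or a peak definition.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Linearly interpolated percentile, q in [0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= q <= 100.0:
        raise ValueError("q must be between 0 and 100")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def find_boundaries(distances: Sequence[float], q: float = 95.0) -> list[int]:
    """Indices of local maxima that rise strictly above the q-th percentile
    of the bundle's own distance distribution."""
    threshold = percentile(distances, q)
    boundaries = []
    for i, d in enumerate(distances):
        left = distances[i - 1] if i > 0 else -math.inf
        right = distances[i + 1] if i + 1 < len(distances) else -math.inf
        # On a flat plateau, only the last point counts as the peak.
        if d > threshold and d >= left and d > right:
            boundaries.append(i)
    return boundaries


# Synthetic series: flat background with two sharp rises.
series = [0.1] * 40
series[7], series[14] = 0.9, 0.8
print(find_boundaries(series))  # [7, 14]
```

The returned indices are positions in the distance series; mapping them back to sentence positions depends on how the windows were indexed when the series was built.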
Implementation considerations
Three choices materially affect performance. The embedding model has to be modern enough to capture domain-specific semantic shifts: sentence-transformer models in the 384-to-1024-dimension range are the practical sweet spot, with domain-tuned variants outperforming general models on technical document corpora. Window size trades sensitivity against noise: a six-sentence window catches abrupt transitions cleanly but smooths over gentle topic drift within a long document, which is usually desirable for separation but should be tested against representative bundles. The 95th percentile threshold is a sensible default, but because a percentile flags a roughly fixed share of windows regardless of content, it tends to produce too many boundaries for bundles with very few documents and too few for highly heterogeneous bundles; calibrating it empirically on a labelled validation set is worth the afternoon it takes. None of these decisions is irreversible. The algorithm tolerates re-running with different parameters far better than a template-based system tolerates a single new layout.
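A calibration pass of that kind might look like the sketch below. It is hypothetical throughout: the scoring rule (F1 with a one-position tolerance and greedy matching), the candidate percentiles, and the `split` callable that turns a distance series and a percentile into predicted boundaries are all assumptions chosen for illustration, not part of the approach described above.

```python
from typing import Callable, Sequence


def boundary_f1(
    predicted: Sequence[int], expected: Sequence[int], tolerance: int = 1
) -> float:
    """F1 score where a predicted boundary counts as a hit if it lies within
    `tolerance` positions of a not-yet-matched labelled boundary."""
    if not predicted and not expected:
        return 1.0
    remaining = sorted(expected)
    hits = 0
    for p in sorted(predicted):
        match = next((e for e in remaining if abs(e - p) <= tolerance), None)
        if match is not None:
            remaining.remove(match)
            hits += 1
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(expected)
    return 2 * precision * recall / (precision + recall)


def best_percentile(
    bundles: Sequence[tuple[Sequence[float], Sequence[int]]],
    split: Callable[[Sequence[float], float], Sequence[int]],
    candidates: Sequence[float] = (85.0, 90.0, 95.0, 97.5, 99.0),
) -> float:
    """Return the candidate percentile with the highest mean F1 over
    labelled (distance_series, true_boundaries) pairs."""
    if not bundles:
        raise ValueError("need at least one labelled bundle")

    def mean_f1(q: float) -> float:
        scores = [boundary_f1(split(d, q), labels) for d, labels in bundles]
        return sum(scores) / len(scores)

    return max(candidates, key=mean_f1)
```

Passing the boundary finder in as `split` keeps the sweep independent of how boundaries are detected, so the same loop can be reused when comparing window sizes or embedding models.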
Production pattern: PromptX in front of MuleSoft IDP
In the architecture this approach was developed for, PromptX runs as a semantic pre-processor immediately upstream of MuleSoft IDP. Bundles arrive over any MuleSoft connector (SFTP, monitored mailbox, government portal export) and are handed to PromptX before any extraction is attempted. PromptX performs the embedding, the sliding-window divergence analysis, and the threshold-based separation, then passes each isolated sub-document back into MuleSoft IDP with the appropriate extraction template. The 50-page and 10MB limits are never breached because the bundle is never presented to IDP whole; the IDP layer continues to do what it does well, deterministic field extraction with measurable confidence scores, on inputs it is structurally suited to handle. Salesforce Agentforce then orchestrates downstream validation and human review, with every separation decision logged against the case alongside the extraction outcomes.
What this enables
The practical effect is end-to-end automation of a class of input that previously required manual triage at the front of the pipeline. Bundled regulatory submissions, multi-document loan packets, complex claim files and specialty slip packs become processable at machine speed without sacrificing the auditability of where each decision came from. The bottleneck moves from "can we even get this into the extractor" to "is the extracted reasoning right", which is a much better problem to have.





