The Dictionary Collision Effect in Computational Decipherment
Abstract
Computational decipherment routinely uses dictionary hit rate as a success metric: decode unknown symbols into short strings and count how many appear in a reference dictionary. We show that this metric is systematically broken. When decoded strings are short (2 to 4 characters) and dictionaries are large, chance collisions produce matches at rates approaching those of genuine decodings, inflating apparent results by an order of magnitude or more (and by unbounded ratios when the true signal is zero). We introduce a four-category token classification (signal, shared hit, anti-signal, shared miss) computed by comparing decoded output against null corpora with matched character statistics. The framework recovers a calibrated net-signal metric that correctly identifies wrong-language evaluations as noise (negative net signal) where five standard alternatives all fail, reporting 18.7 to 19.0 percent "signal" on a Hebrew dictionary against correctly decoded Latin plaintext. The code, data, and figures are available at github.com/mruckman1/signal-isolation-paper.
The method was developed during analysis of the Voynich Manuscript (Ruckman, 2026) but applies to any computational decipherment pipeline that uses dictionary matching. A reference implementation is published as the dictcollision package on PyPI.
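The four-category classification described above can be sketched in a few lines. This is an illustrative reading, not the dictcollision API: the function names, the pairing of real and null tokens, and the exact net-signal formula (signal rate minus anti-signal rate) are assumptions for exposition. The idea is that each real decoded token is compared with a token from a null corpus of matched character statistics; a dictionary hit only in the real decode counts as signal, a hit in both as a shared hit, a hit only in the null as anti-signal, and a miss in both as a shared miss.

```python
def classify_tokens(decoded, null_decoded, dictionary):
    """Classify paired (real, null) tokens into the four categories.

    decoded      : tokens from the candidate decipherment
    null_decoded : same-length token stream from a null corpus with
                   matched character statistics
    dictionary   : set of reference-dictionary words
    """
    counts = {"signal": 0, "shared_hit": 0, "anti_signal": 0, "shared_miss": 0}
    for real, null in zip(decoded, null_decoded):
        real_hit = real in dictionary
        null_hit = null in dictionary
        if real_hit and null_hit:
            counts["shared_hit"] += 1     # chance-level match: both streams hit
        elif real_hit:
            counts["signal"] += 1         # hit only in the real decode
        elif null_hit:
            counts["anti_signal"] += 1    # hit only in the null stream
        else:
            counts["shared_miss"] += 1    # neither stream hits
    return counts


def net_signal(counts):
    """Calibrated net signal: excess of real-only hits over null-only hits."""
    n = sum(counts.values())
    return (counts["signal"] - counts["anti_signal"]) / n if n else 0.0
```

Under this reading, a wrong-language evaluation tends to produce anti-signal at least as often as signal, driving the net-signal metric to zero or below even while the raw hit rate stays high.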