The
Zodiac Killer's unsolved z340: not your vanilla substitution
cipher
By the way have you cracked the last cipher I sent you? (The
Zodiac Killer, 20 Apr 1970)
**Edit 2019: an updated version of this got published here: [html
link].**
To this day, the Zodiac Killer's second major cipher, z340,
remains unsolved. In this post, I argue that z340 is not a
classical substitution cipher and that it could well be a fake
cipher. These hypotheses have been put forward before, but I
provide substantial evidence for them, mainly through the analysis
of mean squared distances for ngrams. This analysis is applied
across various ciphers: z340, z408, which is the other major
cipher by the Zodiac Killer, twenty generated ciphers based on
real texts, and twenty fake ciphers based on random letter
sequences. The results show that z340 behaves unlike other true
substitution ciphers. Instead, it associates with the fake
ciphers. However, it remains a possibility that z340 is a more
intricate cipher, e.g. a Vigenère cipher.
Introduction
Back in around 2012, I learnt about an intriguing, to this day
unsolved cipher: The Zodiac Killer's z340. Here is what the
beginning of the cipher looks like:

You can find the full cipher here [image
link]. The cipher has gripped people's imagination for
almost 50 years now. Various solutions were suggested, some more
credible [html
link] than others [html
link]. And there are some pretty decent analyses out there [html
link]. Arguably, however, the cipher remains unsolved [html
link]. In my view, the reason for why there has been so
little progress on the cipher is that the focus was too much on
trying to solve it, instead of analysing whether or not z340 is a
"bona fide" cipher in the first place.
The z340 cipher
z340 consists of 20 rows with 17 characters each, making use of a
set of 63 characters. It is the Zodiac's second and also his last
major cipher, following z408, which has been cracked by the
Hardens shortly after publication [html
link]. Simple descriptives make z340 look like a solvable
cipher. At a first glance, the character frequencies look
promising (more on character frequency: [html
link]) and there are some bigrams and two trigrams (these
are character ngrams; more on ngrams: [html
link]). The PLUS character has a relative
frequency of about 7%, suggesting that it should probably be
either "E", "T", "A", or, "O". Further, the PLUS occurs
in various bigrams. In fact, PLUS-PLUS is one
of the most frequent bigrams. In theory, these facts
alone should give us a massive head start to solving z340.
 
Please ignore the msd's for
now; we'll come back to them shortly.
A further boost comes from the fact that z340 has a lot in common
with z408 – and z408 has been cracked. z408 is also of the form
17*x, in this case 17*24 characters. It has a character set of 54
characters, 47 of which it shares with z340. While the Hardens
deciphered most of z408, the last 18 characters remain
undeciphered. As per the Hardens' Solution, z408 is a one-to-many
substitution cipher. It is also worth noting that the Zodiac
Killer mapped "E" to "E" and "R" to reversed "R", which is
sometimes interpreted as a double bluff ("'E' as 'E' would be too
easy, it must be something else"). Choosing high-frequency letters
for this ruse indicates that the Zodiac Killer was aware of
English letter frequencies. Here is a snippet of the z408 cipher,
taken from Wikisource [html
link]:

So altogether, z340 looks pretty doable and yet, it hasn't been
solved. The Zodiac Killer seemed to enjoy that people struggled to
crack the cipher. He taunted the investigators with it, which is
the opening quote above [html
link]. However, this also bothered me about z340 right from
the beginning: It looks almost too easy. If z340 is just
another z408, which has been deciphered within about a week after
its publication, then why bother to encrypt z340 in the first
place? So, the assumption is that the Zodiac Killer must have
stepped up his game big time.
Several hypotheses have been put forward, trying to explain why
z340 has been so difficult to decipher. Among them:
1) There is some trick to the cipher. For instance, the text is
mirrored, certain characters have to be disregarded, etc.
2) The cipher contains a obscure message that make it hard to
decipher.
3) It's a many-to-many cipher.
4) The cipher does not contain any message, it's a ruse.
5) It is something more advanced, e.g. a Vigenère cipher.
I will address these points one by one. However, let's first look
at the key metric that I use throughout this post: mean squared
distances for ngrams.
Mean squared distances for ngrams
I decided to use a metric based on letter/character ngrams
(throughout this post, whenever I speak of ngrams, I refer to
ngrams of letters/characters). To give a quick example, consider
the bigrams of the word "dadaism": "da", "ad", "da", "ai", "is",
and "sm". ngrams used on ciphers are a great measure of how much
language is present/accessible. If ngrams are present to an
expected degree, then there is language underlying the cipher. If
ngrams are not present to an expected degree, then there is either
no language underlying the cipher or the encryption of the cipher
is so strong that ngrams do not surface sufficiently. Further, a
nice feature of ngrams is that the direction of the cipher is
irrelevant. If a cipher is mirrored, then ngrams will surface just
as much. If you have any doubts whether ngrams are a good basis
for cipher analysis or wonder what "to an expected degree" means,
then please bear with me. Below, these points will become clearer.
Any ngram occurring more than once is valuable information, thus
measuring an ngram's distance from 1 is a plausible metric. And
the higher the frequency, the more valuable an ngram becomes,
arguably in a non-linear way. Thus, squaring the distance makes
sense. We can do this for all ngrams and then take the average.
This is what the mean squared distance from one metric
(msd) captures. The following figure, based on hypothetical data,
illustrates the msd.

The formula is as follows: 
N.B.: Alternatively, one could use the distance from 0, which, in
my view, works equally well. There are pros and cons to both. Some
transformation of the msd, e.g. taking the log or the square root
could also be useful. However, for our purposes, the bare msd is
sufficient.
We already saw the bigram msd and trigram msd for z340 in the
figure above. It can be hard to make sense of what the values mean.
So, where it becomes really interesting is when we start comparing
a cipher's msd to a meaningless baseline. Such a Δ between
msd's could be extremely insightful. For example, the
bigram msd for z340 is 0.097. If we had the msd of a meaningless
baseline, then the Δ between the two would be valuable information
about the meaningfulness of z340.
The problem with the msd is that there are difficulties with
comparing msd's across data sets. This is because the length of a
text influences the msd, as well as the length of a text's
character set. A 400 character cipher using 10 characters will see
a different ngram distribution to a 100 character cipher using 40
characters. So, ideally, we would compare any given cipher to
something that 1) is of the same length, 2) uses a character set
of the same length, and 3) is clearly meaningless. 3) allows us to
detect whether meaning is present in the target.
Luckily, we have such a baseline for any data set! We can read in
the same cipher in different ways, e.g. horizontally vs
vertically. This way, we have a baseline of the same length, using
the same character set, and we know that at least one of the two
interpretations has to be meaningless. If there is no substantial
difference in msd's, then we know that both interpretations of the
cipher are meaningless: The Δ in msd's between a meaningless text
and another meaningless text is approximating 0. If there is a
considerable difference, then we know that the interpretation with
the higher msd consists of more language underlyingly and hence,
this is the right way to interpret the cipher.
This is somewhat abstract and will become clearer soon. Let's
return to z340.
z340 vs z408
When I tried to solve z340, assuming that it's a one-to-many
substitution cipher that goes left-to-right, I didn't get
anywhere. So, I took a step back and compared z340 to z408.
Comparing character frequencies and ngrams give interesting
results.
 
First, the PLUS has a relative frequency of 7% in z340,
which in itself is suspicious. Second, note how z408's character
frequencies make a nice S-shaped tail, this is the relatively
sharp drop towards the end, while z340 behaves like a step
function. Further, z408 has a lot more bi- and trigrams. Something
fishy is going on.
This is where the analysis of Δ-msd's provides important insights.
Compare the |Δ-msd| for z340 vs |Δ-msd| for z408. (Note that in
the following, we will always look at absolute Δ's.)
 
What these graphs tell us is that if you read in z408
horizontally, then you get a certain ngram distribution (the pink
line). If you read it in vertically, then that distribution
collapses (the dark pink triangles). This contrasts to z340, where
horizontally vs vertically makes little difference. Basically,
what this means is that interpreting z340 horizontally is as
pointless as doing so vertically. We will back this up further
below by comparing z340 to other fake ciphers.
z408 gives us an idea what the |Δ-msd|'s should look like for a
meaningful one-to-many cipher vs a meaningless one-to-many cipher.
Let's plot this in a two dimensional space.

Further ways to read in z340
It doesn't matter whether or not the cipher is written left to
right or right to left; this also holds for reading in lines top
to bottom vs bottom to top. An ngram analysis would still capture
this. These symmetries also hold when we read in the cipher
vertically; top to bottom vs bottom to top would be equally
reflected in the ngram analysis.
This already covers a lot of ground in terms of different ways to
read in the cipher. I have added a few more, viz. only reading in
every other character (these are the "alternating 1" and
"alternating 2" conditions in the graph below), skipping every
third character ("two-one, shift 1-3"), ignoring the PLUS
("no plus"), randomly splitting the PLUS into two
characters ("plus split"), reading in the cipher diagonally
("diagonally"), and reading in the cipher from the outer edge
spiraling towards the centre ("spiral"). Each one is compared to
their vertical counterparts, except the diagonal condition, which
is compared to the other diagonal, and the spiral, which is
compared to z340 read-in normally. The |Δ-msd|'s come out as
follows:

It is also worth noting that some other tricks, e.g. skipping
every fourth character, alternating real words with gibberish,
etc., will lead to differences in |Δ-msd|'s, and thus, such tricks
could not break this analysis. So, unless I have missed a really
clever trick, I don't think that different ways of reading in the
cipher is the key to making sense of z340.
Contextualising
We saw how z340 compares to z408; but some more context would be
helpful. To this end, I have created forty more substitution
ciphers, twenty based on real texts and twenty based on random
letter sequences. The twenty real texts, ""true ciphers", are
based on various sources: the first letter attributed to the
Zodiac Killer [html
link], the Ramsey letter sent to the police in 1996 [html
link], the "Dear Boss" letter attributed to Jack the Ripper
[html
link], the Unabomber Manifesto [html
link], the Genesis [html
link], Karl Marx' The Capital [html
link], Message 5 by the Shadow Brokers [html
link], the Nano Whitepaper [html
link], the Declaration of Independence [html
link], the user guide of the Nokia 3310 [html
link]. The lengths of the ciphers range from 306 to 459
characters. The twenty random letter sequences, "fake ciphers",
were created with a Python script and observe letter frequency in
English. They, too, range from 306 to 459 characters. All files,
scripts, and further notes can be found on GitHub [html
link]. In a next step, I have encrypted the forty texts,
using a similar encryption to z408. The character frequencies for
the unencrypted and encrypted texts are as follows, with z408 and
z340 added as points of comparison. "z408 (H. S.)" in the left
graph is the Hardens' Solution.
 
I then calculated the |Δ-msd|'s for each of the forty ciphers,
where the target was the cipher read in horizontally and the
baseline was the cipher read in vertically. I also did this for
the unencrypted texts ("plain text" in the right graph), to add
some more context. The results are as follows:
 
We can see that z340 associates with the fake ciphers.
Interestingly, we can also see that the other true ciphers form
their own cluster, away from z408, implying that the encryption
mechanism used for them is a lot stronger than the encryption
mechanism of z408. I think that the main reason for this is that
the script is in its one-to-many mappings pseudo-random, meaning
it's mimicking true randomness pretty well [html
link]. It is likely that the Zodiac Killer did not reach
such levels of randomness, which will be reflected in the
ngrams.
This interpretation of z340 as a fake cipher is backed up by a
K-means clustering analysis, as shown in the graph below. K-means
clustering is a basic categorisation algorithm [html
link]. The nice thing is that we know the number of clusters
that we wish to see: two – one cluster for the fake ciphers and
one for the true ciphers.

The K-means analysis was done with Python and graphed with R. By
and large, the learning algorithm clusters the ciphers as one
would expect. There are a few mis-categorisations, though.
Crucially, the learning puts z340 in the same cluster as the other
fake ciphers, this is the cluster near the origin, and not in the
cluster of the other true ciphers, the other cluster further out.
I also ran an SVM, which produces similar results (see my GitHub,
[html
link]).
Word salad
One might speculate that maybe z340 is some kind of weird word
salad and that is why the |Δ-msd|'s come out so low. But here's
the thing: Even random word sequences follow the rules of English
phonology, something that should be reflected in the ngram
analysis. To illustrate this point, I analysed twelve more data
sets, four real texts, based on the Wikipedia article on Charles
Manson ("real text 21" to "real text 24", [html
link]), four texts consisting of random word sequences
("random words 1" to "random words 4"), sampled from the same
article, and four texts consisting of random letter sequences,
similar to the random data sets from above ("random letters 21" to
"random letters 24"). All texts were 400 characters long. They
were encrypted in the same way as the previous data sets. Here is
a bigram analysis of the encrypted texts.

The word salad ciphers behave very similar to the text ciphers and
unlike the random letter sequence ciphers. So, this is also not
the key to z340.
Many-to-many ciphers
What if z340 is a many-to-many cipher? Could that be the reason
for the results above? First, a quick reminder what many-to-many
ciphers are. In a one-to-one cipher, any plain text letter is
mapped to exactly one cipher character, e.g. "A" to "Α", "B" to
"Β", etc. Critically, each cipher character is only used once. In
a one-to-many cipher, plain text letters might be mapped to more
than one cipher character, e.g. "A" could be "Α" and "Ω", "B"
could be "Β" and "Ψ", etc. Here, too, each cipher character is
only used for one plain text character. That is, "Α" is always
"A". In a many-to-many cipher, this restriction is lifted. It
could for instance be the case that "A" maps to "Α" and "Ω" and
"B" maps to "Β" – but also "Ω". This way, it becomes a lot harder
to decipher such a cipher.
Many-to-many ciphers come in degrees. Strong many-to-many ciphers
are very hard, sometimes virtually impossible to crack. I wondered
how many-to-many ciphers would come out with respect to |Δ-msd|'s,
so I encrypted the texts from above, using many-to-many mappings.
The encryption algorithm is pretty strong and arguably really,
really hard to crack, maybe at times even beyond crackable. The
character frequencies for the different ciphers are shown below. I
repeated the analysis from above and produced |Δ-msd|'s for each
data set. The results are as follows:
 
The clusters are less clear now; true and fake clusters
intermingle to some degree. The random many-to-many ciphers are in
the same space as before, which is unsurprising. The true
many-to-many ciphers are closer to the origin than the one-to-many
ciphers from above. Crucially: z340 still associates with the fake
ciphers and less so with the true text ciphers. This is backed up
by another K-means analysis and another SVM analysis, both of
which can be found on GitHub [html
link].
Again, many-to-many ciphers come in degrees and we see some
many-to-many ciphers that behave like fake ciphers. It is a
possibility that z340 is an extremely strong many-to-many cipher.
If this was the case, then it is questionable whether such a
strong many-to-many cipher could be cracked.
Vigenère ciphers
An intriguing way to create a many-to-many cipher is through a
Vigenère encryption, which makes use of repeated keyword-shifted
mappings [html
link]. This is not to be confused with a simple keyword
cipher [html
link], where the latter would be easily detected through the
ngram-based analysis above.
To illustrate Vigenère ciphers, consider a one-to-one substitution
'cipher' without a shift, where both the plain text and the cipher
use the Latin alphabet as their character sets. "HELLOWORLD" in
the plaintext would also be "HELLOWORLD" in the cipher. However,
now a keyword is added. Assume it is "BEE". "BEE" represents
repeated shifts of 1, 4, and 4 positions. For "HELLOWORLD", "H"
becomes "I" ("B", shift 1), "E" becomes "I" ("E", shift 4), the
first "L" becomes "P" ("E", shift 4), the second "L" becomes "M"
("B, shift 1), and so forth. In our example, plain text
"HELLOWORLD" becomes "IIPMSAPVPE". Note how plain text "L" now
maps to both "P" and "M" and how cipher text "P" maps to both
plain text "L" and "O". Thus, we created a many-to-many cipher
with a relatively simple trick applied to a 0-shift one-to-one
'cipher'.
In principle, a Vigenère encryption can be added on top of any
cipher, no matter whether it's a one-to-one cipher, a one-to-many
cipher, or a many-to-many cipher. However, in my view, it is
likely that the ngram-based analysis from above would still pick
up on the underlying information. As to why, see here [html
link]. This depends on the length of the keyword, though. A
Vigenère cipher with a keyword of two or three characters should
behave like a true cipher in the analysis above. Longer keywords
make things more difficult as they increase the number of possible
mappings. This is something I wish to follow up on.
Either way, deciphering a Vigenère cipher can be "fiendishly
tricky", as asetniop puts it in his comment here [html
link]. One would have to find the right keyword, with
hundred of thousands of words to choose from, and find the right
order of the cipher character set – because an ordered character
set is the underlying assumption of a Vigenère encryption. But for
z340's character set, it is not clear what the order would look
like. Which character follows the "A"? "B" and then "C" seems
plausible, but it's not a given. And what comes next? "Reversed C"
or "D"? Further, what happens to all the non-Latin letters like
"THETA" and the different boxes, etc.? And even if we had all that
right, then we could still not be sure that we have covered the
entire cipher character set, as asetniop pointed out. For
instance, maybe the original character set included a SIGMA, but
it just happened that it never got used.
Discussion and concluding remarks
We saw how the Zodiac Killer's z340 compares to his other major
cipher, z408, and to other true and fake ciphers. The results show
that z340 behaves like fake ciphers and unlike true ciphers. The
difficulties with z340 are also not caused by some more-or-less
obvious trick of how to read in the cipher, e.g. having the cipher
mirrored. Further, even if the cipher was just some obscure word
salad, this should have surfaced in the analysis above.
So, z340 is not your classical substitution cipher. If you
approach it as such, chances are that you won't solve it. What is
going then? There is a possibility that z340 is a very strong
many-to-many cipher. If z340 was a classical many-to-many cipher,
then it would have to make use of ultra-strong mappings –
otherwise the analysis above would have probably detected
something. However, considering that the first cipher, z408, was
fairly weak, this seems unlikely.
Another possibility is that the Zodiac Killer made use of a
Vigenère encryption or some other, strong trick, e.g.
transposition. In case of a Vigenère encryption, there is a good
chance that underlying information would be reflected in the
analysis above, certainly for shorter keywords. This is something
I wish to follow up on. It is possible that z340 is a very strong
Vigenère cipher with a long keyword or uses some other,
not-so-standard encryption trick that cannot be detected by an
ngram-based analysis. This is indeed a possibility – but one has
to keep in mind that z340 was created before computers were
readily accessible to the public.
Another possibility is that the cipher is a ruse. Serial killers
tend to be subject to hubris, as Douglas points out in his
excellent book on the subject [html
link]. It must have been a shock to the Zodiac Killer when
he learnt that his z408 cipher had been deciphered so quick, by
two hobby cryptographers. To prove intellectual superiority, he
needed another cipher, one so strong that couldn't be deciphered,
certainly not as easily as z408. I think it is a real possibility
that the Zodiac Killer was unable to produce such a cipher and
that he chose to take another route: cheating. It is a route the
Zodiac Killer has taken before, as he promised to reveal his name
in his first cipher, which turned out to be a lie. Thus, it could
well be that to keep face, the Zodiac Killer has created a fake
cipher, one that no-one could ever decrypt.
Notes
All files can be found on GitHub [html
link]. I enjoyed the project and wanted a write up. If you
really want to donate something, then feel free to send some
amount of Nanos. I think that Nanos are truly great [html
link]. Here are my address and QR code:
xrb_39s7e8mu4wbuo77k3idfgmquep3cmkb4r56mf8rnfq6r99q9gxzsnzzj9k9x
Alternatively, here is my Bitcoin address:
14nnVvewsrvE99sm4xGvqGvy4UiB8pdQa9
Addendum and acknowledgements
An earlier version compared z340 to only four true ciphers and
four fake ciphers, which seemed a bit low. The results were very
similar – in fact, the previous results showed an even clearer
divide between true and fake ciphers. The K-means analysis'
outcome with respect to z340 was the same. Also, the first version
did not include the SVM analysis, which I added some kind of
'linear sanity check'. Further, I made some adjustments and edits,
taking feedback into account. The edits even include the title.
However, the original post is still up on Github [html
link].
The spiral-analysis was added post publication to work in feedback
by Sasja. Both Reddit user asetniop and Sasja pointed me to
Vigenère encryptions. asetniop's comment and explanation can be
found here [html
link]. To address this, I've added the section "Vigenère
ciphers".
Jarl from www.zodiackillersite.com [html
link] got in touch after reading the present entry. Jarl
argued that there is a possibility that z340 is a transposition
cipher [html
link]. I agree and in my view, a sensible follow-up would be
to write a script that automatically tests as many transposition
combinations for their |Δ-msd|'s as possible.
tsj; originally posted on 16 Mar
2018
|
|
|