Stonehill College

Mathematical Experiments in Computer Science

Professor Shai Simonson

Cryptography Laboratory

Introduction

In this lab, you will experiment with cryptographic techniques that are the foundation for today’s e-commerce and internet communication.

In World War II the war behind the war was all about the British trying to intercept and decode German transmissions. (Check here for the contributions of the famous pioneer computer scientist, Alan Turing in this effort.).   Here is the recent apology from the British government.

A simple version of the codes used in World War II works like this.  The alphabet is shifted over by some number of letters, so that “a” might become “f”, and “b” would be “g”, etc., wrapping all the way around.  We could call that the "f"-shift.  The substitutions for the f-shift are shown below:

abcdefghijklmnopqrstuvwxyz
fghijklmnopqrstuvwxyzabcde

For example, the word “shai” is encoded into “xmfn” using the "f"-shift.  Note that the f-shift is nothing more than adding 5 to each letter’s ASCII value.  You need to be careful to wrap around appropriately modulo 26.

If that was all there was to encoding, then a code breaker would only have to try 25 possibilities to break the code.  A code breaker simply tries translating the message using each of the letter shifts until one translation looks like a real message.

The Germans were more clever in creating their encodings.  The trick they used (more or less) was to have each subsequent letter use a different shift.  After a while these shifts would cycle.  The sequence of shifts was represented by a code word consisting of letter shifts.  For example, if the codeword was “remarkable”, then the first letter of the message would shift by 17, the second letter by 4, …, the 7th letter by 0 etc.  The 11th letter would shift by 17 again like the first and the cycle would start to repeat.

For example, the first few characters of the sentence “We meet in New York for a rendezvous” would be encoded as “Ni ye…”.  Note that spaces between words were not actually transmitted or encoded.

Checkpoint:  Make sure you can encode the rest of the sentence yourself by hand.

For a codeword like “remarkable” of length ten, there are now 2610 different possible encodings.  The longer the sequence of the cycle, the more possibilities there are.

Breaking the Code

Once one knows the length of the cycle, there are a number of techniques for breaking the code without having to try all the possibilities. One idea is to divide the letters of the encoded message up into 10 groups, one for each shift in the cycle.  The 1st, 11th, 21st etc make the first group.  The 2nd, 22nd, 32nd etc. make the second group.  And so on.  The letters in each group are encoded using the same shift.

We try to decode the letters a group at a time.  For each group, we try all 26 possible shifts, but since the letters in each group are scattered throughout the message, we are unlikely to learn anything by doing this.  The chance of something looking familiar will be very small.

There is a way to learn something about the message nevertheless.  Every language has a characteristic statistical frequency for each letter.  For example, in English the letter “e” is the most common.  For a given group, we compare the frequencies of the letters in each of the 26 decodings to the expected frequencies.  Matching the two sets of frequencies helps identify the shift or at least narrows down the number of possibilities from 26 to just a few.

The Germans actually used a more complex version of cycles and shifts that required a more sophisticated computer-aided decoding effort.  For one thing, they allowed an arbitrary substitution of 26 letters rather than a single shift value.  See a description and a simulator of their Enigma machines.  This movie explains all the details about the Enigma hardware:   rotors, reflectors, and plugboard.  The end of the story was that the British were successful in their efforts to crack the Enigma code, and the Allies prevailed.

Program 1.  A cycle of shifts is described by a codeword like REMARKABLE, where each letter represents a shift, A is 0, B is 1, … and Z is 25.  Write a program that takes a text and a codeword, and produces the encrypted text.  You will need to use the % (mod) operator and perhaps an if-statement or two to make sure you wrap around appropriately.   You could keep the codeword in a circularly linked list, but that is not necessary.  For simplicity, you may assume that all letters are uppercase.  If there are spaces, you should simply leave them.
Program 2.  Modify your program so that it can decode as easily as it encodes. All you need to do is add an option to decode, and then do subtraction rather than addition.

Make a note of how symmetric this is!  This was the hallmark of cryptography since ancient times, until the last twenty years.  If you knew how to encode something, then you knew how to decode it.  You might be thinking “well duh!!”  However, with modern cryptography you can publicize the encoding scheme, and amazingly, still people will not be able to decode a message!

Determining the Cycle Length

An ingenious method to determine the cycle length was discovered by Philip Friedman whose paper is described by David Kahn in his seminal history of cryptography  as "the most important single publication in cryptology."   Friedman defines the index of coincidence, a statistical measure of text, that for normal English is about 6.6%, but for a random collection of letters is only about 3.8%.

To determine the index of coincidence of a particular piece of text, take the text, rotate it by some arbitrary number of places, and write the rotated text underneath the original text.  For example, the sentence below is rotated by 58 characters and the rotated version appears beneath it.

alanturingbreakscodeslikenobodysbusinessbuthispersonallifesadlybecameeverybodysbusiness
dysbusinessbuthispersonallifesadlybecameeverybodysbusinessalanturingbreakscodeslikenobo

Consider the number of places in which the same letter occurs in both strings of text.  In this example, there are 6 such "coincidences".  The index of coincidence is the ratio of coincidences to the total letters in the text.  In this  example, it is 6/87, which equals approximately 6.9%, a rate similar to the 6.6% of normal English text. You will not likely see 6.6% with every shift, but you will see 6.6% more surely if you average over all the shifts, ranging from 1 through the length_of_the_text - 1.  In the example above, if you shift once, twice, three times, etc., through 86, and average all the indices of coincidence, you will get a value quite close to 6.6%.

An important point to notice is that when a text is encrypted with a single letter codeword, the index of coincidence of the encrypted text is identical to that of the original.  This is because with a fixed single value shift, two letters match before the shift if and only if they match after the shift.  In contrast, the index of coincidence of text encrypted with a codeword of length greater than one, would likely be closer to the 3.8% you expect with random letters. The exception is when we happen to rotate the text by a multiple of the codeword's length.  In this case, the letters that are lined up underneath each other are encoded using the same shift, and therefore the usual English index of coincidence would be expected. 

Hence, to determine (or at least make a good guess at) the codeword length is to rotate the encrypted text by 1, 2, 3 etc. symbols, and examine the results looking for an index of coincidence closer to 6.6% than 3.8%. As we mentioned, you may not see the 6.6% with one particular codeword length rotation but you will likely see it if you consider an average over a larger number of rotations, each of which is a multiple of the codeword length.  Therefore: When considering a codeword of k symbols, pay special attention to rotations that are multiples of k.  You will get better results by looking for 6.6% in the average of these multiples

For example, say you have an encrypted text of 100 symbols, and you calculate an index of coincidence for each rotation from 1 through 99.  Furthermore, assume the data starts like this:

Rotation        Index of Coincidence
1                    1.6%
2                    3.2%
3                    5.4%
4                    8.1%
5                    3.3%
6                    7.2%
7                    2.1%
8                    1.9%
9                    8.3%
10                  4.9%
...                    ...
50                   2.7%

This fragment of the data seems to suggest that a likely codeword length is 3, because the average of rotations 3, 6, and 9 equals 6.6%.  To confirm this, you should consider the rest of the data, (3, 6, 9, ... 99), along with averages of multiples of other lengths, such as (2, 4, 6, ..., 98),  (4, 8, 12, ..., 96), (5, 10, 15, 95), etc.  The reasonable candidates for the actual codeword length, are those with multiples that average closest to 6.6%.  Remember, that since this method is statistical, it is not guaranteed to identify the cycle length -- just to point out likely candidates. 

Program 3.  Write a program that takes encrypted text and calculates the index of coincidence for rotations of 1 through the length_of_text - 1.  A chart like the one above should be printed.   Create test data by cutting some English text of at least 150 characters off the Internet, and encrypting the text with a codeword of your choice using Program 1.  Do nut use the example above "alanturingbreakscodeslie..." because it is too short to get very good data.

Options:  Note that you can write the program to generate the data physically by moving the text in a parallel array.  You can also do it without a second array, by using a double loop and comparing index i of the array with index i + shift (mod text_length), for shift equal 1 through 15.

Analysis:  Assuming that the codeword length is between two and ten inclusive, use your program to help you guess the exact codeword length by calculating the average index of coincidence for all multiples of codeword lengths two through say fifty, and searching for an average index of coincidence close to 6.6%.  In your report, include the original text, encrypted text, codeword, the data generated by your program, and your analysis of finding the codeword's length. Include the average index of coincidence for multiples of lengths 2 through 10.

Pubic Key Cryptography - A Revolution in Cryptography

Cryptographic methods do not have to be symmetrical!  Today a new kind of encoding is used which is called Public Key, one-way, or sometimes trapdoor cryptography.  In this new method, the whole world is able to encode information, but unless someone has more information, then they still cannot decode a message.

This new method is what allows e-commerce to flourish without fear of a security breach.  Here’s how this trapdoor technology is used.   Say I want to send my credit card to Amazon.com to buy a book, I encode my number with a publicly published method that anyone else could use, but only Amazon.com can decode it!

What if I don’t trust that I am actually talking to Amazon.com?  That is, I suspect that a thief posing as Amazon, sent me a fake encoding algorithm, and then they will decode it and get my number!  In that case, we do the process in reverse.  I ask Amazon to send me a message encoded with their secret encoding that anyone can decode with their public algorithm.  If I decode the message and it states, “Hi I am Amazon.com”, I know it had to come from them, because nobody else would have known how to encode it correctly.  This is called authentication and is described in nice detail by VeriSign, the largest digital signature provider, at http://www.verisign.com/docs/pk_intro.html.

The Mathematics Behind Public Key Cryptography

Ironically, this new extremely practical technology is based on one of the oldest branches of pure mathematics, famous for its beauty and scarcity of practical applications:  number theory.

The details of public-key cryptography are based on the RSA algorithm, by Rivest, Shamir, and Adelman.  We will review the mathematics here briefly here and end with a few experiments and programs.

Euclid’s Algorithm and Greatest Common Divisors

The greatest common divisor of two integers is the largest integer that divides both evenly.  For example, the greatest common divisor (gcd) of 24 and 15 is 3. One way to calculate the gcd of two numbers, a and b, is to simply try all the numbers starting from the smaller of the two given numbers down to one.  The first of these numbers that divides both a and b evenly, is the greatest common divisor.
Program 4.  Write a program to find the greatest common divisor by trying each number from the lower number and working downwards towards 1.

Program 5.  Write a program to find the greatest common divisor by listing the prime factors for each number, and taking the intersection of the two lists.   For example, if the two numbers are 24 and 40, then the prime factors of 24 are: 2, 2, 2, 3, and the prime factors of 40 are:  2, 2, 2, 5.  The intersection of these two lists is 2, 2, 2, so the gcd is 2x2x2 = 8. A nice way to do this is to store the prime factors sorted in two arrays, and then use an algorithm that resembles the "merging" algorithm. This takes time proportional to the sum of the lengths of the two lists of primes.  A less efficient algorithm would take time proportional to the product of the lengths of the two lists of primes.

Notes for program 5:  However you find the prime factors of a number, you will likely need a method to check whether or not a number n is prime.  To check if a number n is prime, it suffices to try divisors up to the square root of n. 

Although the methods used in both program 4 and 5 are intuitive and seem natural, neither is the best way to find greatest common divisors.  Both these algorithms take time proportional to the size of the two numbers. It will be important to us later to be able to find the greatest common divisor efficiently without factoring the given numbers.

Euclid (300 B.C.E.) invented a method to compute greatest common divisors that bears his name to this day.  Euclid's algorithm in time proportional to the number of digits in the two numbers - a tremendous improvement over Programs 4 and 5.  In Euclid’s algorithm, we use recursion.  He proved that assuming a>b, then the gcd(a,b) = gcd(b, a mod b).  An example is shown below.

a
b
   
123
28
28
11
11
6
6
5
5
1
1
0
When the smaller number, b, equals 0, then the answer is a.  In this case, the gcd(123, 28) = 1.
Programs 6a and 6b
6a. Write a recursive program to implement Euclid’s algorithm. 
6b. Write an iterative program to implement Euclid’s algorithm.

Experiment and Discuss.  Test your four programs (4, 5, 6a, 6b) on a variety (3 or 4) pairs of large numbers, time the programs, and discuss the results.  Make sure to use some very large (long int) sized numbers to see the true differences in time complexity.  Make sure not to time any printing or unrelated work that would inaccurately spoil the timing.  In Java, you can use System.nanoTime() (returns a long),  to get the current CPU time.  Calculating the time elapsed becomes a matter of subtracting the start time from the finish time.

Note that theoretically Euclid’s algorithm (whether recursive or iterative) takes time proportional to the number of digits in the numbers (log x), while the other two algorithms take time proportional to the numbers themselves (an exponential explosion in complexity). This theory should let you check whether your results make sense. Discuss the results of your experiments with respect to the theory.

Connection to Cryptography

It is not obvious that Euclid’s algorithm has a lot to do with RSA public-key cryptography, but it does.  You will need to be patient.

A theorem of Euclid that we will not prove here, states that if the gcd(a,b) = m, then there are two integers  x and y such that ax + by = m.  These values x and y can be computed constructively using Euclid’s algorithm.  Here is an example using the numbers 123 and 28 from the previous example.

123  =  28(4) + 11
28  =  11(2) + 6
11  =  6(1) + 5
6  =   5(1) + 1
The greatest common divisor is 1.  Now we work our way backwards:
1  = 6 – 5(1)
1  = 6 – (11 – 6(1))(1)  = 6(2) – 11(1)
1  =  (28–11(2))(2) – 11(1)   = 28(2) – 11(5)
1 = 28(2) – (123 – 28(4))(5) = 28(22) – 123(5)
Each step gives us the x and y for the pair of numbers on that level, until finally we arrive back at the original pair of numbers 28 and 123, where the x and y are 22 and –5 respectively.  Study this example, and try to understand exactly what is going on.
Checkpoint:  To make sure you get the idea above, compute by hand, the integers x and y, such that ax + by = gcd(a,b) for each pair (a, b) below:
(99, 101)    (10, 35)      (7, 12)       (36, 42)

The next program is only a few lines long, but you may still need a hint trying to figure out how to set up the recursion.

Program 7.   Write a recursive program to calculate the integers x and y, such that ax + by = gcd(a,b) for a given pair a and b.  This program is crucial for the RSA cryptography algorithm.  Your program should have two outputs, one where x is positive and y is negative, and one where x is negative and y is positive.  Test your program out on the numbers 233987973 and 41111687. 

Hints for program 7:   We did the code for this in class. 

 Public Key Cryptography – A First Attempt

This section describes a method that is a simple version of RSA, but has the drawback that the private key can be computed from the public key using program 6 that you wrote earlier.  Study this simple version first, because although it does not quite work because of the drawback mentioned, it will give you a chance to understand what is really going on before tackling RSA proper.

All the public key cryptographic methods encode one integer into another, so we assume that our messages are first converted to a sequence of integers.  The exact method of conversion is not trivial but it won’t concern us here.

To encode a number, we will need the public key.  This consists of two integers, for example 5 and 17.  The second integer must be prime, and the first must be relatively prime to the second integer minus one.  In this case, 17 is prime, and 5 is relatively prime to 16.  Recall that two numbers are relatively prime if and only if their greatest common divisor is equal to one.

Encoding and Decoding Numbers – Public and Private Keys

Now to encode a number, say 6, using this key.  All we do is calculate the remainder of 65 after dividing by 17.  You can check that this equals 7.  To decode 7 back into 6, we need to have the private key.  The private key also consist of two numbers, one of which is really public, namely the prime 17, and one of which is really private namely 13.  Then we calculate the remainder of 713 after dividing by 17, and we get 6.

Where does the 13 come from?  Why does this work?

Thirteen is the solution to the equation 5x = 1 mod 16.

What happened here is that we computed 65(13) mod 17 = 713 mod 17 =  6 mod 17.  Since 5(13) = 1 mod 16, we can write 5(13) = 16(4) + 1.  This means that that 616(4) + 1 = 6 mod 17, and equivalently 616(4) = 1 mod 17.  It turns out that this must happen as long as 5(13) = 1 mod 16, and 5 has no common factors with 16.  There is a theorem to justify this.

Fermat’s Little Theorem

Pierre de Fermat, was an amateur but great mathematician of the early 17th century.  Fermat’s Little Theorem, not to be confused with his more famous Last Theorem states:

If p is a prime, and is not divisible by p, then p evenly divides ap-1 - 1.  For example, if p equals 17, and a equals 6, then 17 evenly divides 616 – 1, or 616 = 1 mod 17.  The proof is left for the discrete math class.   It is elegant but irrelevant for this lab.  It is the result that we need.  If 616 = 1 mod 17, then certainly 616(4) = 1 mod 17, and our encoding and decoding will work.

Computing Inverses

It doesn’t matter if you understand all the mathematics here, but you do need to understand how to calculate this number 13.  Recall that thirteen was the solution to the equation, 5x = 1 mod 16.  That means we must find two numbers x and y such that 5x – 16y = 1.  Sound familiar?  It is just Euclid’s algorithm!

The numbers 5 and 13 are called inverses of each other mod 16.  That means they multiply together to equal one.  The computation of an inverse is just what you did in program 6.

Now What?  If a person wanted to calculate the private key from the public key, they could do it by just using your program 6 and calculating the inverse of the private key mod 16.  What’s wrong is that it is possible to calculate inverses quickly and efficiently. You wrote a program to do it yourself!  So if someone publishes a public key (5, 17) you could have your program compute the private key (13, 17).  That is not good enough.  We want the decoding to be harder than the encoding.  We do not want anyone to be able to find 13 so easily.  Only the person who holds this private key can decode.

 A Second Attempt at Public Key Cryptography

The first attempt described above is based on work by Diffie and Hellman in 1976 (see http://www.verisign.com/docs/pk_intro.html) and was known before the contributions of Rivest, Shamir, and Adelman (RSA) in 1977. The RSA contribution is described next.
Now we start with two prime numbers, p and q, say 2 and 17, and compute their product pq = 34.  We calculate (p-1)(q-1) = 16, and we choose a value that has no common factors with 16, say 5.  The public key becomes the pair of numbers 34 and 5.

Encoding and Decoding

The encoding and decoding is done just like before.  For example, to encode 6, we compute 65 mod 34 = 24 and to decode 24 we compute 2413 mod 34 = 6.  As before, thirteen is the solution to the equation 5x = 1 mod 16.

The difference between this and our first attempt is that previously, 16 was simply one less than the public prime, so that the equation, 5x = 1 mod 16, was easily constructed (and solved by program 6).  But now the only way to calculate 16, is to factor the number 34 into 2 and 17 and compute (2-1)(17-1).  The factoring part is hard!  Nobody knows how to factor numbers quickly.  The best method is proportional to the given numbers.  Contrast this with the calculation of the inverse that is exponentially faster, proportional only to the number of digits in the given numbers.

Of course, anybody can factor 34, but in practice the two primes that are chosen are on the order of 100 digits each.  This makes all currently known factoring algorithms take centuries.  If you come up with an efficient algorithm (like the one for inverses) that can factor numbers, you will be famous!

Checkpoint
a. Create your own public and private codes for doing RSA encrypting.  Make sure they satisfy all the appropriate constraints.
b. Assuming that each character in a message is represented by its ASCII value, encode the message “Too much work!”.

Program 8.  Write a program to encode numbers using RSA given the public key (e, m).  Recall that encoding a number w is done by calculating we mod m.  Make sure to use the fast modular exponentiation algorithm (Egyptian-style) that we discussed in class.  This allows you to deal with very large numbers efficiently.  Raising w to the eth power can be done with fast exponentiation in time proportional to log e.

Experiment and Discuss:  Replace the fast exponentiation algorithm with a standard exponentiation algorithm for we mod m that runs in time proportional to e, and test both programs on a few large examples.  Discuss the results.

Program 9.  Write a program that takes a public key (e, m),

Recall that program 7 has two outputs, and you need to find the positive value for d rather than the negative value.

Test your program on the puzzle below:

            Puzzle.  Cracking the UFO Message.
A public code is found etched on a rock on Mars:  (7, 1147).  The message {128, 1040, 129, 1144, 788, 735, 570, 875} is received from outer space on one of the billion machines running the Extraterrestrial Life Detector Screen Saver distributed among the world’s PC’s.  Assuming that this message was encrypted with the public code found on Mars, crack the code and decode the numbers.  (Note that this problem uses small primes, hence giving your program a fighting chance to succeed before we are all dead.)
Note that to use RSA on integers that are really large (which is the whole idea), you need to use a "Big Integer" class in order to avoid overflow.  You can get away with regular Java integers (int) for Challenge 3 and "long" Java integers for Challenge 4 below, but to proceed further on more realistic data, you would need a class that allows you to use integers of arbitrary size.  For C++ programs, here is one stack class that implements Big Integer and here is another.  For Java there is built-in BigInteger class.

For years, RSA published public challenges to factor some very large numbers (hundreds of digits).  Many of these were solved and prizes were paid.  Recently, they shut down the challenges.  Nobody knows exactly why.  Here is a nice explanation of the public RSA challenges.

Cryptographic Decoding Challenges

    Solve each of the challenges below, and describe how you managed to solve each one.

Challenge 1.

A code word less than six letters long gives an encoding of a well known cerebral song:  The code word is the name of a famous dog from the movie featuring the song.
BQHIERPVBZXOPORHASACNFLQHBYSKFBBZKBHAHASYZHKXFLQHBLIEHBBZKBHAHASKOBB
PWMVMVXHACNUAHLWWPXHAWGYBBBQHIERUSTBHHASKZBBVCEBBTBCGZRVTRTPKOBB

What is the original text? What is the code word?

Challenge 2.

A code word less than six letters long gives an encoding of a math joke about factorials, attributed incorrectly to Woody Allen:
BOQWMNWOOQSSFHQKASRLAGOWRAADWRKAONCIOHKJKCYHVYWHLLC

What is the original text?  What is the code word?

Challenge 3.

RSA encoding of nine ascii values, using public key (10555, 21971), gives:
16912    19531    20676    16912    6613    2348    17835    15770    15770

The original text is a famous cartoon hunter.  What is the original text?

Challenge 4.

RSA encoding using public key (5555551, 118513313) gives:
80217189    107242213    96490860    79543571    25953566

What are the original numbers?

Enrichment Essay

You should watch the movies Breaking the Code with Derek Jacobi and The Imitation Game with Benedict Cumberbatch, both showing, in different ways, the role of Alan Turing in the British decryption of German codes in World War II.   We will screen at least one of these in class, and you should watch the other on your own.  We will also run some friendly WWII style cryptographic competitions with prizes for the winners.

Write your impressions of the two movies. Which movie did you like better and why?   Which movie did you feel was more authentic and why?

Short Answer Movie Quiz

1.  In which of the movies does Turing name the computer after his teenage boyfriend Christopher?

2.  In which of the movies does "Pat" earn her cryptography job by taking a test and beating out all the men?

3.  In which of the movies does Turing chain his coffee cup to the radiator?

4.  In Breaking the Code, when Pat tells Alan Turing that she loves him, there is an awkward silence after which Turing grabs a pine cone to change the subject.  What does he point out to Pat about the pine cone?

5.  In Breaking the Code, Turing's government handler explains to him that he can't just go on ignoring the effect he has on other people or the effect they have on him.  He warns Turing to keep his private life to himself so as not to raise red flags or offend anyone.  What do we learn about the government official later in the film?

6.  In The Imitation Game, does Turing actually marry Pat?

7.  What happens to Christopher?

8.  In both movies, Turing writes directly to Winston Churchill - the prime minister of England - to ask for more support for his cryptography project.  How does each movie portray the reactions of his government handler to this?

9.  In Breaking the Code, when Turing first meets Ron, the young man with whom he eventually has an affair, Turing tells him that he has just seen the film Snow White about a princess who falls into a deep sleep after eating a poison apple, but in the end lives happily ever after and marries a prince.  Ron replies that he "likes a thriller - a good thriller."  What is the significance of the particular Disney film in that scene?

10. In which of the movies do we meet any of Turing's parents, and if so, which one(s)?