The generation of random numbers is too important to be left to chance.
Everything we do to achieve privacy and security in the computer age depends on random numbers.
In February 2012, two groups of researchers revealed that large numbers of RSA encryption keys that are actively used on the Internet can be cracked because the random numbers used to generate these keys were not random enough. Here, I offer a puzzle in which you will identify and crack RSA keys that are vulnerable in the same way — using a slightly simpler version of the same technique. This should provide a clearer understanding of how this problem arises and how it can be exploited, as well as more familiarity with the RSA algorithm and Euclid's algorithm.
This challenge is aimed at novice programmers who want to practice their programming skills with some real-world cryptography, and at anyone who was interested in Lenstra et al. and Heninger et al.'s research but found the math involved daunting or unfamiliar.
I'll start with an explanation about RSA to put these attacks in some context and explain some of the steps involved in solving this challenge. If you're already familiar with the RSA algorithm and with the details of these attacks, you could skip directly to the challenge part.
RSA is an important encryption technique first publicly invented by Ron Rivest, Adi Shamir, and Leonard Adleman in 1978. RSA is based on the fact that there is only one way to break a given integer down into a product of prime numbers, and a so-called trapdoor problem associated with this fact. It's easy to fall through a trap door, but pretty hard to climb up through it again; remember what the Sybil said:
[...] Sate sanguine divom,
Tros Anchisiade, facilis descensus Averno;
noctes atque dies patet atri ianua Ditis;
sed revocare gradum superasque evadere ad auras,
hoc opus, hic labor est.
Aeneid VI, 125-9
O goddess-born of great Anchises' line,
The gates of hell are open night and day;
Smooth the descent, and easy is the way:
But to return, and view the cheerful skies,
In this the task and mighty labor lies.
John Dryden translation
The particular problem at work is that multiplication is pretty easy to do, but reversing the multiplication — in the form of factoring — is apparently pretty hard. Multiplying given numbers p and q together to calculate p × q = n is pretty straightforward; even elementary school children are taught to do it quickly and accurately by long multiplication.
For example, if I asked you to multiply
17477852958781876547 × 15241555427044345769 = ?
you would probably need only a matter of minutes with pencil and paper, or a tiny fraction of a second with a computer, to answer
17477852958781876547 × 15241555427044345769 = 266389664617004986624097978187739779643
What if we turn the question around and ask what two integers would have to be multiplied together to yield a given number? For example,
? × ? = 281512008712700373730275954373439628511
Suddenly, this task is remarkably difficult; you'd have little hope of ever getting the answer by guessing with pencil and paper, and even with a computer you might have to write a new program to find the answer, and it might take a noticeable amount of time to do so.
However, if I reveal that the answer is
17627949842247424607 × 15969639761398924673 = 281512008712700373730275954373439628511
you can pretty easily check that this is right. If you need more convincing that it seems to be vastly harder to go in one direction than the other, try these two problems without answers:
18324413790489873917 × 16753053665517997013 = ?
? × ? = 231780035703803150278713557377368328099
It seems as though this pattern continues, and factoring gets drastically harder as the numbers involved get larger, while multiplication remains easy and quick. In fact, there is also a longstanding public challenge to see what the public state-of-the art in factorization is; the largest number from this challenge that we know has been successfully factored is called RSA-768:
This number was factored in 2009.
The RSA algorithm requires a user to generate a key-pair, made up of a public key and a private key, using this asymmetry. Descriptions of RSA often say that the private key is a pair of large prime numbers (p, q), while the public key is their product n = p × q.
This is almost right; in reality there are also two numbers called d and e involved; e, which is used for encryption, is usually 65537, while d, which is used for decryption, is calculated from e, p, and q. The way we calculate d is straightforward to do but complicated to explain, and isn't really necessary for our purposes here. (Below, I'll give you some source code and a web app to automate the process so that you don't have to worry about it while solving the puzzle.) The end result is that d can be used to decrypt what e and n encrypt (using an operation called modular exponentiation), but only the private key's owner knows or can calculate d.
n is, indeed, published as part of the public key. Not only is the public allowed to know this number, but for security reasons the public key holder normally wants everyone to know it, to prevent someone else from maliciously impersonating them by giving out some other value of n. We can go out and look at various people's values of n; for example, the one for Gmail's web site is
while the one for PayPal's site is
19907789316973232571709388139491358666929454999402440962751418618453573264955089051420528817897891333335369381181336604286912201113757821941217467558226280838933785489616679977332178809112857355692902776754821604616983317242449495394362748115124468692900781573694743868331219952099224674513107154770709045556420165907683219901455612366432477179986702075209386343768097176233191423040304881401194366954739922445951906916747808916845352010407952710079897810642551273099204890952309865459288708489866066485327182895626969651191379209124433725745943676165198426636198495665286419097106110163852891072181715030605705150303and the one for my personal PGP key is
(Yow! My key is pretty long, but that doesn't mean my personal e-mail is more secure overall than PayPal, because this won't do me much good if someone breaks into my computer's operating system. It just may mean I'm being extravagant. Gene Spafford once remarked that using such strong encryption on the Internet could be "the equivalent of arranging an armored car to deliver credit-card information from someone living in a cardboard box to someone living on a park bench". My computer could be the equivalent of the park bench.)
It's reasonable to say that the security of RSA is based on keeping p and q secret, and on its being just too hard to figure them out from n by factorizing it. If you can find the factors of any of the last three numbers above, you can break the security of that key and of some of the private data that it protects. Notice that if someone turned up claiming to have broken one of these keys, you could easily check their solution was right by multiplying the factors.
Although factorization seems like a very hard problem, there's a different problem that's much easier — finding the greatest common divisor ("gcd") of two numbers. This means the largest integer that both of the numbers are divisible by.
A few examples:
gcd(6, 9) = 3
gcd(10, 12) = 2
gcd(10, 30) = 10
gcd(100, 144) = 4
gcd(3, 7) = 1
What's gcd(15, 18)? What's gcd(1000, 3)?
One interesting observation is that gcd(p, q) is always 1 if p and q are prime. However, gcd(x, y) can also be 1 if x and y are not prime but just don't have any divisors in common. For example, gcd(16, 27) = 1 because there is no other number that 16 and 27 are both divisible by. So 16 and 27 are called relatively prime to each other.
There is a very ancient and very fast method for calculating gcd, even for extremely large numbers. I'll describe this method in detail in another section below. One important fact about it is that, even though factorization seems very hard and slow, calculating gcd is very fast and easy. What does that mean for cracking RSA keys?
Suppose there are four different primes, a, b, c, and d. The first two are used in one key, in the public value n1 = a × b. The other two are used in another key, in the public value n2 = c × d.
What is gcd(n1, n2)?
n1 and n2 must be relatively prime to each other. (There can't be any number other than 1 that both of them are divisible by, because if there were such a number, it would have to be one of the four primes a, b, c, or d... but n1 isn't divisible by c or d, and n2 isn't divisible by a or b. If this is hard for you to see, you might try reading more about the uniqueness of prime factorizations, which this explanation has assumed without being very explicit about it.)
So, gcd(n1, n2) = 1 and this hasn't given us any new information about the values of a, b, c, and d. And that's a good thing for the security of these RSA keys, because it provides no shortcut to factorizing them!
OK, what if we somehow re-used a prime between two different RSA keys?
In this scenario, there are now only three different primes a, b, and c. Somehow, b has been re-used in two different keys, so the public values are n1 = a × b and n2 = b × c. In this case, the re-use of a prime number across keys turns out to be extremely significant, and extremely bad for the security of those keys.
The security problem comes in if someone comes across both public keys and, looking at the public values n1 and n2, decides out of curiosity to calculate gcd(n1, n2). This time, the result is not 1, but rather b, because both n1 and n2 are evenly divisible by b!
Noticing this leads quickly to cracking both keys, because now it's easy to calculate a = n1 / b and c = n2 / b. That reveals both of the secret prime factors of both keys, which is enough to derive a complete private key for each and start decrypting encrypted messages. Whoops!
Once again, the "normal" case gives gcd(ab, cd) = 1 and reveals nothing extra, but the "shared prime" case gives gcd(ab, bc) = b, which leads to discovering both factors, revealing the private keys from the public keys. This means that using a prime in one's RSA key that someone else has already used in their RSA key is a very bad security failing.
But we normally choose these prime numbers "at random", so what are the odds that this would happen by chance? It's astronomically unlikely with modern key sizes.
The two primes that go into a 1024-bit RSA key are generally both 512 bits long. (If you multiply a j-digit number by a k-digit number, you can expect the answer to be around j+k digits long. Likewise with a j-bit and a k-bit number. This is based on the idea that bj × bk = b(j+k).)
How many different 512-bit primes are there? A theorem called the Prime Number Theorem can be used to make a good estimate. It indicates that the fraction of numbers around the size of 2512 that are prime is around 1/(512 ln 2)=0.0028... or around 0.28%. We can also test this experimentally with a random sample, which shows around the same result. Note that this includes all 512-bit even numbers, which are never prime, so about 0.6% of odd 512-bit numbers are prime.
One way to get an experimental estimate of the fraction of 512-bit numbers that are prime is
#!/usr/bin/env python import gmpy def experiment(n): total = 0 for i in range(n): p = gmpy.rand("next", 2**512) if gmpy.is_prime(p): total += 1 return total
A more condensed version:
python -c 'import gmpy; print len([1 for i in xrange(1000000) if gmpy.is_prime(gmpy.rand("next", 2**512))])'
Anyway, this suggests that there are somewhere between 2503 and 2504 512-bit primes (though the number will also depend on what we mean by a 512-bit prime, like whether it stops being one if it has too many leading zeroes). That is a vast number, and it's incredibly implausible to envision human beings ever choosing the same one of them twice by accident. (You can check this by doing the birthday paradox calculation, even assuming that we generate quadrillions of public keys instead of millions or billions.)
As a "sanity check", Arjen Lenstra et al. tried a gcd-calculating trick out on several million real keys (mostly those gathered by the EFF SSL Observatory) and cracked about 13000 of them. This led to a New York Times report emphasizing that this could be a serious flaw in the way RSA is used: about 0.2% of all keys seen on the Internet seem to be vulnerable.
As the Times reported, Lenstra's group concluded that the problem is that some users of RSA have faulty random number generators:
The system requires that a user first create and publish the product of two large prime numbers, in addition to another number, to generate a public “key.” The original numbers are kept secret. To encrypt a message, a second person employs a formula that contains the public number. In practice, only someone with knowledge of the original prime numbers can decode that message.
For the system to provide security, however, it is essential that the secret prime numbers be generated randomly. The researchers discovered that in a small but significant number of cases, the random number generation system failed to work correctly.
So we need to be able to choose prime numbers randomly. That, in turn, assumes that whatever software originally generated the RSA keys has the ability to do this. It's easy to find cases where things that ought to be random are relatively predictable; I remember gambling and adventure games on the old Apple ][ computer could be extremely predictable. Just after turning on the computer and starting a game, the first hand of Blackjack or the first storm at sea always came out exactly the same. In fact, we have just seen another more modern example of this: the random numbers generated by gmpy in the "how many 512-bit primes?" Python program above are the same every time you run the program! If we used that program to generate RSA keys, the same keys would be generated every time too.
If random number generators are failing to produce truly unpredictable numbers, this can produce serious weaknesses in cryptography, because an attacker may have various ways to guess "secret" key values, or at least narrow down the possibilities dramatically. Most computer systems today generate random numbers not primarily by measuring a physical quantity like radio static or lava lamp patterns but rather by using some sort of formula that gets fed with some (ideally) unpredictable value called a "seed". To get truly unpredictable numbers, we need truly unpredictable seeds from a large enough pool of possibilities. (The phenomenon where gmpy generates the same random numbers every time is an extreme case, where we haven't even attempted to give the random number generator a seed and it hasn't tried to create one from its environment.)
Security problems with inadequately or inappropriately sseded random number generators have arisen before; in 1996, Ian Goldberg and David Wagner demonstrated a practical way of attacking the SSL implementation in Netscape Navigator because of predictable random-number generators. In 2008, the Debian Project revealed that one of its developers had mistakenly removed code from OpenSSL in 2006 that catastrophically weakened the random number generator (causing RSA keys generated on Debian systems during that period to be limited to a pool of just 32,768 possibilities per key size! ... all of which an attacker could generate at home).
The idea that poor random number generators would make a collection of RSA keys jointly vulnerable to gcd, even though no individual key appears vulnerable in isolation, had been published as early as 1999 as a critique of RSA, but perhaps not experimentally demonstrated.
As Lenstra's group published its results, another group of researchers led by Nadia Heninger was simultaneously exploring the RSA public keys used on the Internet, and came up with similar results — likewise using a greatest-common-divisor method to find common factors. Heninger's group found a more specific explanation for how the weak keys came to be; they were able to track down specific devices that have been habitually generating them! The devices in question "include[d] firewalls, routers, VPN devices, remote server administration devices, printers, projectors, and VOIP phones" from "more than thirty different manufacturers". (Many of these devices have web interfaces, and security-conscious users would far prefer to log in and configure them over HTTPS than insecure HTTP — hence the devices need to have their own public keys to use with the HTTPS connections.) The manufacturers in question were advised about the problem.
These devices don't have keyboards and mice that they can use to help seed their random number generators (as PCs often do, looking at the precise, unpredictable timing of user input), and they might be programmed to generate RSA keys automatically the first time they're turned on, perhaps without seeding the random number generator properly, or without having much with which to seed it. (Random number generators can be securely seeded with other kinds of hardware, too, but many device makers haven't realized this or haven't made provisions for taking advantage of the randomness that occurs in certain kinds of circuits.) Without proper seeding, the number of different primes they can conceivably generate is greatly reduced, and the chance that two devices will independently generate the same prime goes way up — an effect that eventually becomes visible when a large number of public keys are collected and compared.
The paper from Heninger's group is now available; it's called "Mining Your Ps and Qs: Detection of Widespread Weak Keys in Network Devices".
I mentioned above that calculating gcd is very fast. One useful way to calculate gcd — even for huge numbers — is with Euclid's algorithm, which Donald Knuth calls "the oldest nontrivial algorithm that has survived to the present day".
Euclid considered the problem of finding the gcd of two integers in Elements, book VII, proposition 2, and gave a very practical way to do it:
Δύο ἀριθμῶν δοθέντων μὴ πρώτων πρὸς ἀλλήλους τὸ μέγιστον αὐτῶν κοινὸν μέτρον εὑρεῖν.
To find the greatest common measure of two given numbers (which are) not prime to one another.
Ἔστωσαν οἱ δοθέντες δύο ἀριθμοὶ μὴ πρῶτοι πρὸς ἀλλήλους οἱ ΑΒ, ΓΔ. δεῖ δὴ τῶν ΑΒ, ΓΔ τὸ μέγιστον κοινὸν μέτρον εὑρεῖν.
Let AB and CD be the two given numbers (which are) not prime to one another. So it is required to find the greatest common measure of AB and CD.
Εἰ μὲν οὖν ὁ ΓΔ τὸν ΑΒ μετρεῖ, μετρεῖ δὲ καὶ ἑαυτόν, ὁ ΓΔ ἄρα τῶν ΓΔ, ΑΒ κοινὸν μέτρον ἐστίν. καὶ φανερόν, ὅτι καὶ μέγιστον· οὐδεὶς γὰρ μείζων τοῦ ΓΔ τὸν ΓΔ μετρήσει.
In fact, if CD measures AB, CD is thus a common measure of CD and AB, (since CD) also measures itself. And (it is) manifest that (it is) also the greatest (common measure). For nothing greater than CD can measure CD.
Εἰ δὲ οὐ μετρεῖ ὁ ΓΔ τὸν ΑΒ, τῶν ΑΒ, ΓΔ ἀνθυφαιρουμένου ἀεὶ τοῦ ἐλάσσονος ἀπὸ τοῦ μείζονος λειφθήσεταί τις ἀριθμός, ὃς μετρήσει τὸν πρὸ ἑαυτοῦ. μονὰς μὲν γὰρ οὐ λειφθήσεται· εἰ δὲ μή, ἔσονται οἱ ΑΒ, ΓΔ πρῶτοι πρὸς ἀλλήλους· ὅπερ οὐχ ὑπόκειται. λειφθήσεταί τις ἄρα ἀριθμὸς, ὃς μετρήσει τὸν πρὸ ἑαυτοῦ. [...]
But if CD does not measure AB then some number will remain from AB and CD, the lesser being continually subtracted, in turn, from the greater, which will measure the (number) preceding it. For a unit will not be left. But if not, AB and CD will be prime to one another [Prop. 7.1]. The very opposite thing was assumed. Thus, some number will remain which will measure the (number) preceding it. [...]
(text, figure, and translation from Richard Fitzpatrick's Euclid)
Euclid's core observation is that if a ≥ b and b > 1, gcd(a, b) = gcd(a-b, b). (Wikipedia calls this "the principle that the greatest common divisor of two numbers does not change if the smaller number is subtracted from the larger number".) Can you prove this, or follow Euclid's original proof in Fitzpatrick's edition (at p. 196)?
So a simple way of stating the subtraction-based form of Euclid's algorithm is: to find the greatest common divisor of two numbers, replace the larger number with the smaller number subtracted from the larger. Repeat this until one of the numbers reaches 1 (in which case the gcd is 1, and the numbers are relatively prime) or until one of the numbers reaches 0 (in which case the gcd is the remaining value of the other number).
This is pretty fast, but there is an even faster way: instead of repeated subtraction, we can use the modulo operator to see what the result of the repeated subtractions will be. That is, if a ≥ b, then gcd(a, b) = gcd(a mod b, b).
So the modulo-based form is: to find the greatest common divisor of two numbers, replace the larger number with the larger number mod the smaller. Repeat this until one of the numbers reaches 1 (in which case the gcd is 1, and the numbers are relatively prime) or until one of the numbers reaches 0 (in which case the gcd is the remaining value of the other number).
I note in passing that even though Euclid's gcd is incredibly fast compared to trying to factor numbers, it would still take quite a long time to apply to to every pair of RSA public keys on the whole Internet—since there are millions of them and the number of possible pairs of things grows with the square of the number of things. There are still further tricks to make this process even faster in this case; Lenstra's team thought it best not to elaborate, saying "sapienti sat", while Heninger's team credits techniques published in 2004 by Daniel J. Bernstein. For purposes of the present puzzle, the Euclidean gcd technique is totally sufficient because the number of public keys is reasonably small (just one hundred).
Now, you can download the challenge file, which contains 100 RSA public keys and 100 corresponding messages, each message encrypted with one of the public keys. The keys are 1024-bit, so we would normally not expect to be able to crack any of them. [Note added 2016: here "we" refers to an ordinary computer user. Expert opinion holds that some teams and organizations can do it with their resources and that keys of this size should no longer be used for security purposes.]
However, some fraction of these keys are weak in the same way as the weak keys analyzed by Lenstra's and Heninger's groups: they share a prime with another key in our sample! (Not all of the keys are weak, and I won't tell you ahead of time exactly how many are weak. But the weakness is always a result of sharing factors with other keys within this set of 100.) Your task, then, is to write a program to figure out which keys are weak and crack them using gcd, thereby allowing you to decrypt the corresponding messages.
In order to do this, you need to be able to get the n values from the public keys, which are stored as PEM files (a common format for representing a public or private key). The challenge contains a README file that gives an OpenSSL command line to do this.
schoen@sescenties:~/challenge$ openssl rsa -in 1.pem -pubin -text -noout
Public-Key: (1024 bit)
Exponent: 65537 (0x10001)
Notice that the modulus (the value of n) is in hexadecimal, so you may also need to convert it back to base 10 — and remove the colons — in order to do arithmetic on it with some tools and programming languages. You could remove the extraneous stuff from the Unix command line with something like
$ openssl rsa -in 1.pem -pubin -text -noout | grep '^ ' | tr -dc '[0-9a-f]'; echo
[Note added 2016: I did not notice the simpler possibility of
openssl rsa -in 1.pem -pubin -modulus -noout, which avoids some of the subsequent manipulations.]
On the other hand, here is one way to do the key import using Python (assumes you have PyCrypto, Debian package name python-crypto).
>>> from Crypto.PublicKey import RSA
>>> pem1 = open("1.pem").read()
>>> k1 = RSA.importKey(pem1)
>>> print k1.n
And here's an example using Ruby.
irb(main):001:0> require 'openssl'
irb(main):002:0> key = OpenSSL::PKey::RSA.new File.read '1.pem'
=> -----BEGIN RSA PUBLIC KEY-----
-----END RSA PUBLIC KEY-----
If you can factor this n, you can break the public key stored in 1.pem and then read the encrypted message in 1.bin. (But I'm not telling whether key #1 is actually one of the weak keys or not...) This is the base-10 equivalent of the hexadecimal value given in the previous example.
What do you do once you have the p and q values for the keys you've proven are vulnerable? Well, it would be most practical to construct a new PEM file from them containing a complete usable private key, which involves a few details beyond the scope of this challenge. So, I've put up a web application to create a complete private key given p and q which you're welcome to use. If you prefer, you can also download the source code in C for this application and compile and use it.
Then, use this private-key PEM file to decrypt the associated encrypted .bin file and read its contents. Repeat the process for all of the keys that you can crack.
In the style of the Mystery Hunt and other contemporary puzzle hunts, the challenge eventually yields a specific answer that is a single word — but you'll have to figure out how to interpret the decrypted data to yield this word, since the RSA-decrypted data is itself in a non-RSA-related code. I'm not providing any specific instructions about what to do with the decrypted data beyond the fact that it can be interpreted as a reference to a single English word. (The answer word is a common noun, not a proper noun; if your guess at the answer is a proper noun, you haven't solved the puzzle all the way to the last step yet!)
Feel free to let me know if you find the word!
Thanks to Daniel Kahn Gillmor for helpful comments on this text, to Micah Lee for the layout, and to Vera Yin for testing the puzzle.