An easy way to visualize git activity

Today, I wrote gitply — a fairly simple Python script for visualizing the weekly activity of each contributor to a git repository.

It started out as a run-once script to get some statistics for one of my projects, but I ended up improving it incrementally until it turned into something friendly enough for other people to use.


What it does:

In short, gitply will show you how much code (and how many commits) a repository’s contributors have written each week. Its output looks something like this:

example

You’ll get a graph like this for each contributor. What does the graph show?

  • Lines inserted (green bars, left axis, log scale)
  • Lines deleted (red bars, left axis, log scale)
  • Commits made (blue line, right axis, linear scale)

You also get a textual representation of the same data:

History for anon2
  2016, week  6:  2 commits, +366  -26  
  2016, week  7:  4 commits, +325  -5   
  2016, week  8:  2 commits, +224  -3   
  2016, week  9: 21 commits, +2617 -219 
  -- Gap of 2 weeks
  2016, week 12: 10 commits, +4066 -614 
  2016, week 13:  6 commits, +2480 -432 
  2016, week 14:  2 commits, +654  -490 
  -- Gap of 1 week
  2016, week 16:  2 commits, +661  -229 
  -- Gap of 2 weeks
  2016, week 19:  3 commits, +1508 -47  
  -- Gap of 2 weeks
  2016, week 22:  1 commits, +1490 -29  

History for anon1
  2016, week 12:  1 commits, +143  -89  
  2016, week 13:  2 commits, +58   -16  
  2016, week 14:  3 commits, +137  -17  
  2016, week 15:  2 commits, +403  -106 
  -- Gap of 2 weeks
  2016, week 18:  5 commits, +134  -38  
  -- Gap of 3 weeks
  2016, week 22:  1 commits, +13   -1   
  2016, week 23:  1 commits, +485  -1   

If you just want to download it and start using it, check out the project on GitHub. For those who are interested, some implementation details follow.


How does it work?

The first thing gitply does is invoke the standard git client to retrieve a log:

git log --numstat --date=short --no-notes

Then, it analyzes the log line-by-line in real time (thanks, Python iterators) in order to get the data we need. Essentially, we create an iterator that yields tuples of (email, date, stats) for each commit, where stats is the pair (insertions, deletions).

Once we create this iterator over the commit history, we can attribute each commit to a user and ultimately fill in a dictionary that will help us print the analysis. We sum the insertions, deletions, and commits for each ISO week, then iterate through the weeks and print out the final results.

Now, if for some reason you need an iterator over the commits in a repository, you can import my iterate_commits function from this script and use it yourself.


What’s next?

Right now, gitply supports aggregating statistics from several repositories into one report. One planned feature is to use stacked bars and lines to show the amount of activity per-project, per-user, per-week. Until then, the data from multiple repositories will all be displayed together.

Adventures in image glitching

Databending is a type of glitch art wherein image files are intentionally corrupted in order to produce an aesthetic effect. Traditionally, these effects are produced by manually manipulating the compressed data in an image file. As a result, this is a trial-and-error process; often, edits will result in the file being completely corrupted and unopenable.

Someone recently asked me whether I knew why databending different types of image files produces different effects, and in particular why PNG glitches are the most interesting. I didn’t know the answer, but the question inspired me to do a little research (mostly reading the Wikipedia articles about the compression techniques used in different image formats). I discovered that most compression techniques are not all that different: most employ some kind of run-length or dictionary encoding, followed by a prefix-free coding step. The subtle differences between the compression algorithms could not explain the wildly different effects we observe (except perhaps for JPEG, whose compression is done in the frequency domain). PNG, however, stood out because of its pre-filtering step.

PNG’s pre-filtering is basically a heuristic to try to make the image data more compressible. It tries to translate each pixel of the image into some delta based on pixels in the previous row or column. Since the core compression algorithms have no idea about the two-dimensional nature of an image, this two-dimensional filtering step actually makes a pretty big impact.

There are five main line filters defined by the PNG format. Most PNG encoders will try to optimize their choices of filters for each line in the image, in order to reduce the compressed size of the image. However, each filter also yields its own unique glitches.

Thus, an idea was spawned, and I wrote a Python script that glitches images not by manipulating their raw compressed representations, but by applying one of the PNG filters and modifying the pixels in the filtered representation.


PNG Filter Glitches

Since this article is about glitching, not image compression, we’re going to focus on the different visual effects we get from each filter. Usually, PNG codecs alternate between filters from line to line; for the sake of demonstration, however, I configured my script to use one filter throughout the whole image.

All of the following glitches are made from a scaled-down version of this wallpaper from Wallhaven.

Two of the simplest PNG filters are the Up filter and the Sub filter. While Up creates a delta from the adjacent pixel in the previous row, Sub creates a delta from the previous pixel in the same row. The results of glitching on the filters are similar — Up gives us vertical lines, and Sub gives us horizontal lines.
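To make this concrete, here is a simplified model of the Sub filter acting on a single row of 8-bit values (real PNG filtering operates byte-wise across each channel; the function names are mine):

```python
def filter_sub(row):
    """Sub filter: replace each byte with its delta from the byte to its left."""
    return [(b - (row[i - 1] if i else 0)) & 0xFF for i, b in enumerate(row)]

def unfilter_sub(deltas):
    """Invert the Sub filter by cumulatively summing the deltas."""
    out = []
    for i, d in enumerate(deltas):
        out.append((d + (out[i - 1] if i else 0)) & 0xFF)
    return out

row = [100, 110, 120, 130]
deltas = filter_sub(row)            # [100, 10, 10, 10]
assert unfilter_sub(deltas) == row  # round-trips cleanly

# "Glitch" one delta: every pixel to its right is dragged along with it,
# which is exactly why Sub glitches smear out horizontally.
deltas[1] = (deltas[1] + 50) & 0xFF
assert unfilter_sub(deltas) == [100, 160, 170, 180]
```

The Up filter is the same idea rotated 90 degrees: the delta is taken against the previous row, so a corrupted delta drags every pixel below it.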

Glitch result from the Up filter
Glitch result from the Sub filter

A more interesting one is the Average filter. This filter looks at both the pixel above and the pixel to the left and takes their average, then computes a delta using that value. Glitching on the Average filter results in smears of altered color that slowly diffuse away across the image.

Average

My favorite is actually the Paeth filter, which heuristically chooses either the pixel above, the pixel to the left, or the pixel above and to the left as the basis for the delta, depending on a function of all three. Glitches applied to this filter give us rectangular disturbances that slowly fade out, but stack together to create harsher effects.
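For reference, the Paeth predictor at the heart of this filter is only a few lines; this sketch follows the definition in the PNG specification:

```python
def paeth_predictor(a, b, c):
    """Paeth predictor from the PNG spec.
    a = left pixel, b = above pixel, c = upper-left pixel.
    Returns whichever of a, b, c is closest to the estimate a + b - c.
    Note that p must be computed with signed arithmetic; doing this with
    unsigned bytes is exactly the kind of bug described later in this post."""
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    if pb <= pc:
        return b
    return c
```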

Paeth

I mentioned that there are five filters, but so far I’ve only mentioned four. The reason for this is simple — the fifth filter is the Null filter, which has no effect whatsoever.


Mixing Filters

To simulate the mixed effects one can observe in a typical PNG glitch, I implemented a glitching procedure which uses a random filter on each line. The results are pretty intriguing:

Random line filters

In this example, I’ve only allowed a 5% chance for the filter to change from one line to the next; without this provision, the results are quite boring.

To make the glitches even cooler, I started stacking filters on top of each other. Here’s an example with an extra pass through the Average filter applied after the randomized filtering stage:

Randomized + Average

Glitch Glitches

Overall, my favorite filters were the ones that had bugs when I wrote them. My original implementation of the Paeth filter had an error due to the use of unsigned integers in the predictor (when really I needed signed ones). However, this version resulted in some cool effects that acted like a combination of the normal Paeth and Average behaviors:

Buggy Paeth filter

I also wrote a buggy Average filter that resulted in some bold vertical lines and some subtler horizontal artifacts:

Buggy Average filter

Try it Yourself

I’ve posted my glitch scripts on GitHub, so you can try them out too. The original version was pure Python and incredibly slow, but I’ve now converted the core functionality into a C module and it is several orders of magnitude faster. This is also my first foray into the intricacies of the Python C API, but that’s a topic for another post…

To conclude, have two more glitches achieved by mixing various filters:

Two glitches from mixed filter stacks

Blue Coat’s position as a Certificate Authority, and what it means for you.

Recently, Blue Coat Systems has been approved as an intermediate certificate authority. If you aren’t versed in network security, this means nothing to you. However, be assured that it is a big deal.

Blue Coat is primarily known as a vendor of application-layer (deep packet inspection) firewalls. In other words, they help people sniff your data, primarily in order to censor the Internet. Maybe your company’s firewall blocks access to YouTube and Facebook while you’re at work. That’s no big deal; Blue Coat delivers something a bit more sinister.

Countries such as Burma, Syria, Bahrain, China, Egypt, Kuwait, Russia, and Saudi Arabia have utilized Blue Coat’s products (or similar offerings) to enable Internet censorship and tracking. This is not a good reputation to carry. However, Blue Coat’s impact has been limited to unencrypted traffic thus far. Now that they are a trusted CA, they’re capable of much more.

In this article, I will give an overview of how our Internet communications are normally secured, then explain why it is a problem that Blue Coat now has this authority.


What is a Certificate Authority?

To encrypt data on the Internet, we use a protocol called Transport Layer Security, or TLS (you might be more familiar with an older but similar protocol called SSL, or Secure Sockets Layer). Basically, if a URL starts with https://, then you’re accessing it through TLS; if your browser shows you a green lock icon (or something similar), you can typically be assured that your connection is secure.

A browser will trust a website and display this lock icon if it presents a valid certificate signed by a trusted certificate authority (CA). A certificate is a pretty simple thing — it is essentially just a public key for the website, accompanied by the signature of a CA. Your operating system (or web browser, depending on which one you use) trusts a handful of CAs by default. These are called root CAs. However, root CAs can often delegate the authority to sign certificates to other entities called intermediate CAs. Intermediate CAs don’t have to be explicitly trusted by your computer; the fact that a root CA has vouched for them makes them trusted by extension. Because another CA has vouched for them, Blue Coat is now capable of issuing certificates.

How can a CA be harmful?

To understand why Blue Coat’s new role might be harmful, it is important to know a few more details about certificates and how they’re used.

A certificate is always restricted to some domain name (effectively representing a single website). Normally, a CA only gives a website a certificate if it can prove that it is legitimately operating under the name it has requested; consequently, the certificate is tied to that name and won’t be valid for any other website.

Normally, a rogue website can’t impersonate a legitimate website because it will be unable to obtain a certificate for the website’s domain name. If the rogue site does somehow obtain a certificate, there is essentially no way for a normal user to know they aren’t visiting the legitimate site. We rely on CAs to check that any entity requesting a certificate can prove that it is authorized to represent the domain that it wants a certificate for; this is the entire basis of our security and privacy on the Internet.

Okay, so what?

Since Blue Coat produces hardware and software that enables Internet censorship, we can only imagine how badly they’re itching to sniff our TLS-encrypted Internet traffic so that they can filter it. In order to sniff traffic arbitrarily, they would need to impersonate whatever website you want to reach. Impersonating a website in this way is called a man-in-the-middle (MiTM) attack.

In general, a man-in-the-middle attack works like this:

  • You try to connect to example.com.
  • I intercept your connection so that you connect to my server instead.
  • I act as a middleman between you and example.com, and I am able to filter and manipulate your conversation however I want.

TLS normally prevents this. With TLS in use, a normal connection works like this:

  • You try to connect to example.com.
  • Example.com sends you a certificate, signed by some CA.
  • You check to see if the certificate is valid, and decide that it is okay; you then proceed to communicate with example.com.

A MiTM against a TLS-secured website would typically go like this:

  • You try to connect to example.com.
  • I intercept the connection so that you connect to my server instead.
  • I send you a bogus certificate.
  • You look at my certificate and decide it is fake, and disconnect.

This is great; this is what should happen if someone manages to intercept your connection to a website. However, if a company like Blue Coat is a CA, they can actually create a fake certificate in order to pull off the MiTM successfully:

  • You try to connect to example.com.
  • Blue Coat intercepts the connection so that you connect to their server instead.
  • They send you a certificate that looks legitimate, since they have signed it and you trust them as an intermediate CA.
  • You accept the certificate, then proceed to communicate with them instead of the website you intended.
  • Blue Coat can monitor and filter your conversation with example.com however they see fit.

It should be clear why this is a huge problem for security and privacy on the Internet. Essentially, a de-facto lord-of-the-web (a CA) has given Blue Coat free rein to impersonate whatever website they want. As a result, you can easily be fooled into thinking your connection is secure when really it can be monitored or filtered by a government, corporation, or some other potentially-malicious entity.

What can I do about it?

If you run a website where security and privacy are a concern, you should be employing a mechanism called Key Pinning. This will essentially allow a user’s browser to detect if your site’s purported certificate isn’t signed by the entity you’d expect. It is not foolproof, but will work most of the time unless the user has been subjected to the MiTM ever since their first visit to your site.
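For example, with nginx a pinning policy can be delivered via the Public-Key-Pins response header; a hypothetical snippet follows (the pin values are placeholders you would derive from your own keys, and note that HPKP requires at least one backup pin):

add_header Public-Key-Pins 'pin-sha256="<primary-pin>"; pin-sha256="<backup-pin>"; max-age=5184000; includeSubDomains';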

If you’re just an end-user, you might want to explicitly distrust Blue Coat. This will essentially tell your computer/browser that despite its certificate being signed by an otherwise-trustworthy CA, you should not trust the certificates that it signs. Apparently, it is pretty easy on OS X but is a little more difficult on Linux because it has a more varied trust ecosystem. I have not yet been able to obtain any information about untrusting the certificate on Windows.

Unnecessary mathematics: queuing delay for two types of network traffic

This morning I asked myself what I thought was an interesting question:

  • If there are two types of traffic in a network, where one is much less common than the other, will the two types experience different queuing delays?

If you have more common sense than me, you’ve already figured out that the answer is “No.” But it took me about an hour to figure this one out.

I designed a simple model of the scenario: there is a queue of N packets, and one of them is of a different type than the rest. Let’s call the two types A and B.

We can assume that the position in the queue reflects the queuing delay of the packet, so we are only interested in the average position in the queue for each type of traffic: E[P(A)] and E[P(B)].

We assume that there are N-1 packets of type A, and one packet of type B. We can place the type-B packet at any of the N positions in the queue with equal probabilities:

  • E[P(A)] = \frac{1}{N} \sum_{i=0}^{N-1} \left( \frac{1}{N-1} \left( \left( \sum_{j=0}^{N-1} j\right) - i \right) \right)
  • E[P(B)] = \frac{1}{N} \sum_{i=0}^{N-1} \left( i \right)

The term summed in the formula for E[P(A)] represents the average position of all N-1 type-A packets. In E[P(B)], there is only one packet so the term doesn’t need another sum inside of it.

Well, it turns out that E[P(A)] and E[P(B)] both simplify to \frac{N-1}{2}.

This should be immediately obvious in the formula for E[P(B)], and it takes just a little more work to derive for E[P(A)]. Yep, it couldn’t have been simpler.
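If you’d like to double-check the simplification, the two formulas can be evaluated exactly with rational arithmetic; a quick sketch:

```python
# Exact evaluation of the two expectations above, using Fractions so there
# is no floating-point fuzz.
from fractions import Fraction

def expected_positions(N):
    total = sum(range(N))  # sum of all queue positions 0..N-1
    # E[P(A)]: average (over the N equally likely positions i of the type-B
    # packet) of the mean position of the N-1 type-A packets
    epa = sum(Fraction(total - i, N - 1) for i in range(N)) / N
    # E[P(B)]: the type-B packet is equally likely to be at any position
    epb = Fraction(total, N)
    return epa, epb

for N in (2, 3, 10, 100):
    epa, epb = expected_positions(N)
    assert epa == epb == Fraction(N - 1, 2)
```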

If you think about it, this result makes a lot of sense and I shouldn’t have needed to compute it — if I’m at a random position in line, then so is everyone else. So it doesn’t matter whether I’m averaging over a group of N-1 people, or only worrying about myself. We’ll all wait the same amount of time (on average, of course).

What’s inside a PEM file?

As you probably know, PEM is a base64-encoded format with human-readable headers, so you can kind of figure out what you’re looking at if you open it in a text editor.

For example, let’s look at an RSA public key:

-----BEGIN PUBLIC KEY-----
MIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEA4YFgwNrEkMdynjtDsM0q
b+Hedk8p4pySfxakYTfSQPEyGxxnQGcVMV2ZEjPR4nZeqJrtNTlixhK2YWqunE6I
KopVDq3WvtPKweNEeZ8B2lA2I8FFrpZSjI/Tosq8/MbTd/Y/C4Q8Qcz78MF/NH17
/E82K3ca9/LM2b4KGTEIhsLUff7OGrJM7lPcQZN3EOdUeQnzT9uTh8Z9oFqChfJP
pLwwSebfrRB7VMXjeKHZmubSO5pULHLdZLbkgLSmnhbgBjO6apG0tkYyOeWd6L8F
MzA21WkXJdANrr1s/yv5zS9hx1q9jSM8Me9QA2/iaAbgem7VwQ2YlPiXEvUq48oB
VsKXMpHQ6A2cUygs+PiSFuUzNjTIebWFTWmKKuoRx0O2m63fAZJaT2aJA4G0HqdJ
ZQ2Aqr4Acs1+28IhLxUbMAlHJ4N2XPnE2WpQYbtUR4zZMXU+bVIToXuqHCLo4pf/
qEIK/xzr/S8WdvMvRVSOtVIIQwyaMDUxsnnKozYSVHvzYsxQo3b3VD5OOqmg1mx1
+Z/PLFViLkBjo+ZMkl5dFbsgYyHmkn/uvCV19IpjkdDNfFgdrOlSdNTnlGU7su5L
L31k/IwSvD0PR0egxiv8HhegaYwqgujVylB0gntyBsrVVHfE3Wr2+aJlR3YmrdCZ
lsAiSbnFxgGtfB6INHepFdkCAwEAAQ==
-----END PUBLIC KEY-----

We can see that we have a public key just from the header. But what’s all the base64 stuff? If it’s a public key, we know that there should be a modulus and an exponent hidden in there. And there’s also probably some kind of hint that this is an RSA key, as opposed to some other type of key.

That base64 payload actually has the potential to contain a lot of information. It’s an ASN.1 (Abstract Syntax Notation One) data structure, encoded in a format called DER (Distinguished Encoding Rules), and then finally base64-encoded before the header and footer are attached.

ASN.1 is used for a lot of stuff besides keys and certificates; it is a generic file format that can be used to serialize any kind of hierarchical data. DER is just one of many encoding formats for an ASN.1 structure — e.g. there is an older format called BER (Basic Encoding Rules) and an XML-based format called XER (you can probably guess what that stands for).

But anyway, what is inside that public key? How can we find out?

Parsing PEM files

OpenSSL comes with a utility called asn1parse. It can understand plain DER or PEM-encoded ASN.1. Let’s feed our public key into it and take a look:

$ openssl asn1parse -in public.pem -i
    0:d=0  hl=4 l= 546 cons: SEQUENCE          
    4:d=1  hl=2 l=  13 cons:  SEQUENCE          
    6:d=2  hl=2 l=   9 prim:   OBJECT            :rsaEncryption
   17:d=2  hl=2 l=   0 prim:   NULL              
   19:d=1  hl=4 l= 527 prim:  BIT STRING

The column that contains “cons” and “prim” gives us information about the hierarchical structure of the data. “Cons” stands for a “constructed” field, i.e., a field that encapsulates other fields; on the other hand, “prim” means “primitive.”

The column after “cons” or “prim” tells us what type of data is in that field. The -i flag I’ve supplied causes that column to be indented according to how deep we are in the hierarchical structure. So overall, what are we looking at?

There is one root SEQUENCE object. That SEQUENCE contains another SEQUENCE and a BIT STRING. That internal SEQUENCE has an OBJECT and a NULL terminator. The OBJECT field is actually an Object Identifier: it contains a constant that tells us what kind of object we’re decoding. As we may have suspected, it tells us that we’re decoding an RSA key.

Here’s where stuff gets really interesting: That BIT STRING field actually contains more ASN.1 data. Let’s jump in and parse it:

$ openssl asn1parse -in public.pem -strparse 19 -i 
    0:d=0  hl=4 l= 522 cons: SEQUENCE          
    4:d=1  hl=4 l= 513 prim:  INTEGER           :E18160C0DAC490C7729E...
  521:d=1  hl=2 l=   3 prim:  INTEGER           :010001

(The -strparse 19 flag means “parse the data in the field located at offset 19 in the original structure.” If there was another BIT STRING inside this one, we could add another -strparse argument to recurse into it.)

What do we have here? A SEQUENCE containing two INTEGERs. It turns out that the first INTEGER is our modulus, and the second INTEGER is the exponent.

You can actually infer the effective length of the key from this information. The length column shows that the modulus is 513 bytes; one of those bytes is a leading zero (DER integers are signed), so our modulus is 512 bytes, making this a 4096-bit key.

Reverting to DER

As I explained earlier, PEM is just a wrapper around DER. So can we dump the raw DER and parse it that way? You sure can. You just cut out the PEM headers and base64-decode the payload, and what comes out is DER.

$ grep -v "PUBLIC KEY" public.pem | openssl base64 -d > public.der
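The same stripping can be done in a few lines of Python (a quick sketch; it works for any PEM header and footer, since armor lines all start with five dashes):

```python
import base64

def pem_to_der(pem_text):
    """Strip the -----BEGIN/END----- armor lines and base64-decode the rest."""
    body = "".join(line for line in pem_text.splitlines()
                   if line and not line.startswith("-----"))
    return base64.b64decode(body)
```

Feeding the resulting bytes to openssl asn1parse -inform DER yields the same structure we saw above.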

Now we can inspect the raw DER:

$ hexdump public.der 
0000000 8230 2202 0d30 0906 862a 8648 0df7 0101
0000010 0501 0300 0282 000f 8230 0a02 8202 0102
0000020 e100 6081 dac0 90c4 72c7 3b9e b043 2acd
0000030 e16f 76de 294f 9ce2 7f92 a416 3761 40d2
0000040 32f1 1c1b 4067 1567 5d31 1299 d133 76e2
0000050 a85e ed9a 3935 c662 b612 6a61 9cae 884e
0000060 8a2a 0e55 d6ad d3be c1ca 44e3 9f79 da01
0000070 3650 c123 ae45 5296 8f8c a2d3 bcca c6fc
...

Or, we can use asn1parse if we actually want to understand it:

$ openssl asn1parse -in public.der -inform DER -i
    0:d=0  hl=4 l= 546 cons: SEQUENCE          
    4:d=1  hl=2 l=  13 cons:  SEQUENCE          
    6:d=2  hl=2 l=   9 prim:   OBJECT            :rsaEncryption
   17:d=2  hl=2 l=   0 prim:   NULL              
   19:d=1  hl=4 l= 527 prim:  BIT STRING   
$ openssl asn1parse -in public.der -inform DER -strparse 19 -i
    0:d=0  hl=4 l= 522 cons: SEQUENCE          
    4:d=1  hl=4 l= 513 prim:  INTEGER           :E18160C0DAC490C7729E...
  521:d=1  hl=2 l=   3 prim:  INTEGER           :010001

This is the exact same data we saw earlier, but there’s your proof that PEM really is just DER underneath a layer of base64.

My first adventure with Let’s Encrypt on nginx, dovecot, and postfix

Let’s Encrypt is old news by now. It launched back in December, so it has been giving away free DV certificates for nearly four months. As a TA for a Computer Security course, I figured it was about time that I actually tried it out.

Prologue

Let’s Encrypt is a free certificate authority. They grant TLS certificates that you can use to secure your webserver. They are Domain Validated (DV) certificates, which means they will verify that you control the domain name you are trying to certify.

To test out Let’s Encrypt, I decided to deploy new certificates on some internal Zeall.us services (Gitlab and Roundcube). Previously, we were using self-signed certs, so we would have to approve a security exception on every visit.

Unfortunately, I chose the wrong day to test out Let’s Encrypt. For some reason, their DNS client decided to die at the very moment I tried to get a certificate. I wasn’t the only one having this problem, but Google hadn’t yet crawled the support site so I didn’t see these other reports until the problem had already subsided. Once it cleared up, everything else was a breeze.

Obtaining certificates

Obtaining certificates for our Gitlab and webmail instances was incredibly easy. I just cloned the Let’s Encrypt repository from GitHub and used their super-simple command-line utility. It doesn’t have built-in support for nginx (our webserver of choice), but you can still easily use the webroot plugin to validate your domain. You just tell it the directory from which your webserver serves, and it puts a file there which it uses for the validation. I ran it once for each of the services:

./letsencrypt-auto certonly --email example@example.com --text --agree-tos --renew-by-default --webroot -w /opt/gitlab/embedded/service/gitlab-rails/public -d <our_gitlab_domain>
./letsencrypt-auto certonly --email example@example.com --text --agree-tos --renew-by-default --webroot -w /srv/roundcube -d <our_webmail_domain>

It handles everything automatically, and sticks your new certificates and keys in their respective /etc/letsencrypt/live/<domain_name>/ directories.

Configuring nginx

Prior to setting up the new certificates in nginx, I generated strong Diffie-Hellman key exchange parameters, as suggested by this guide to nginx hardening:

sudo openssl dhparam -out /etc/ssl/certs/dhparam.pem 4096

Then I just put the following lines in each of my nginx site configs:

ssl on;
ssl_certificate /etc/letsencrypt/live/<domain>/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/<domain>/privkey.pem;
ssl_session_timeout 5m;
ssl_session_cache shared:SSL:10m;
ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
ssl_ciphers 'EECDH+AESGCM:EDH+AESGCM:AES256+EECDH:AES256+EDH';
ssl_prefer_server_ciphers on;
ssl_dhparam /etc/ssl/certs/dhparam.pem;

Follow up with a sudo service nginx reload, and you’re good to go. With this configuration, my domain got an A+ from Qualys’ SSL Labs Server Test.

Configuring Dovecot

To supply the new certificate to my Dovecot IMAP server, I set the following configs in /etc/dovecot/conf.d/10-ssl.conf:

ssl_cert = </etc/letsencrypt/live/<domain>/fullchain.pem
ssl_key = </etc/letsencrypt/live/<domain>/privkey.pem

This is the same certificate I used for webmail, since the domain is the same. I have not yet investigated hardening techniques for Dovecot, but at a glance the settings seem to be quite similar to nginx.

This configuration was followed up with a sudo service dovecot restart, and then everything worked.

Configuring Postfix

Finally, I supplied the same certificate to Postfix (SMTP) by putting the following in /etc/postfix/main.cf:

smtpd_tls_cert_file = /etc/letsencrypt/live/<domain>/fullchain.pem
smtpd_tls_key_file = /etc/letsencrypt/live/<domain>/privkey.pem

After a sudo service postfix restart, the certificates are live.

Auto-renew

Let’s Encrypt provides certificates with a term of only 90 days, so it is necessary to write a script to renew them automatically. I wrote the following script:

#!/bin/sh
/opt/letsencrypt/letsencrypt-auto renew | tee -a ./lerenew.log
service nginx reload
service postfix reload
service dovecot reload

And a corresponding crontab:

0 0 * * 0 ~/lerenew.sh

Now once a week, any eligible certificates will be renewed and the services which use them will be reloaded.

Conclusion

It has never been this easy to deploy TLS certificates. I highly recommend that anyone who is currently using a self-signed certificate switch to Let’s Encrypt. If you’re currently paying for a DV certificate, consider switching to Let’s Encrypt to save some money — but keep compatibility issues in mind.

Demonstrating the double-DES meet-in-the-middle attack

Introduction

DES (Data Encryption Standard) is an old-school block cipher which has been around since the 1970s. It uses only a 56-bit key, which is undeniably too short for use in the modern day. Between the realization in the late 90s that DES is weak and the standardization of AES in the early 2000s, Triple-DES had a brief time to shine.

Triple-DES is just what it sounds like: you run the DES algorithm three times. You use two 56-bit keys, K1 and K2, and apply them in the order K1K2K1. The result is a cipher with 112 bits of key strength.

Students often ask me: why not just encrypt twice, K1, K2, done? The reason is that this construction is vulnerable to a particular known-plaintext attack, which we call the meet-in-the-middle attack. That is, if the attacker knows your plaintext in addition to your ciphertext, he doesn’t have to try all 2^112 keys; the work he has to perform is on the order of 2^57 operations, not much more than breaking single DES.

Notation

Before I proceed to an explanation of the meet-in-the-middle attack, let’s get some basic cryptography notation out of the way.

Let’s call our plaintext P, and our ciphertext C. Let’s call our DES encryption function E. It takes in a plaintext and a key; in return, it produces a ciphertext. Similarly, there is a decryption function D which takes a ciphertext and a key and spits out a plaintext.

We can then say:

  • E(P, K) = C
  • D(C, K) = P
  • D(E(P, K), K) = P
  • E(D(C, K), K) = C

You get the idea. Let’s move on.

Double-DES

Double DES uses two keys, K1 and K2. We produce a ciphertext using the following formula:

  • C = E(E(P, K1), K2)

We just encrypt twice, using our two keys. Easy.

Since each key is 56 bits, an attacker that doesn’t know the plaintext would need to guess a total of 112 bits of key to break the cipher. Therefore, the total keyspace is 2^112, and an average of 2^111 keys would have to be guessed before the true K1 and K2 are obtained.

Meet-in-the-middle

Let’s redefine Double-DES as a two-step process:

  • M = E(P, K1)
  • C = E(M, K2)

We will refer to the result of the first encryption as M, as it is our “middle” value. Now, for this attack to work, we assume our adversary has access to both P and C, but wants to determine K1 and K2.

Consider the following statement — we can easily derive it from the second formula we wrote above:

  • M = D(C, K2)

An attacker can easily exploit this fact, and essentially make K1 independent of K2. He can guess them each separately, instead of trying to guess them at the same time. But how?

First, the attacker guesses every possible K1. Let’s call each of his guesses K1′. For each guess, he computes M’ = E(P, K1′) to obtain a potential “middle” value, and stores the result in a table along with the corresponding K1′.

After filling out the table with each of the 2^56 possible K1′ keys, the attacker moves on to guessing K2. Again, let’s refer to each guess as K2′. He computes M’ = D(C, K2′), then checks whether this M’ matches any value he stored in the table previously.

If a match is found, then both keys have been broken. That is, if E(P, K1′) = D(C, K2′), then K1 = K1′ and K2 = K2′.

How much work has the attacker done? Well, generating the table required 2^56 encryptions, and finding a match would require 2^55 decryptions on average. Overall, only about 2^56.6 cryptographic operations have been performed. This is in stark contrast to the 2^111 operations we would expect to perform in a ciphertext-only attack with two 56-bit keys.

Pseudocode

In short, this is the attack:

function meet_in_the_middle(C, P):
      table = {}
      for i = 0 to 2^56 - 1:
            table[E(P, i)] = i
      for i = 0 to 2^56 - 1:
            if D(C, i) in table:
                  return table[D(C, i)], i

That’s it.
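Here is a self-contained, runnable sketch of the attack. To keep the search space small, a deliberately weak 16-bit toy cipher stands in for DES (its round structure is purely illustrative and mine, not DES); the shape of the attack is exactly the pseudocode above:

```python
def toy_encrypt(p, k):
    """An invertible (and utterly insecure) 16-bit toy block cipher."""
    x = p
    for _ in range(4):
        x = (x + k) & 0xFFFF
        x = ((x << 3) | (x >> 13)) & 0xFFFF  # rotate left by 3
        x ^= k
    return x

def toy_decrypt(c, k):
    """Inverse of toy_encrypt: undo each round in reverse."""
    x = c
    for _ in range(4):
        x ^= k
        x = ((x >> 3) | (x << 13)) & 0xFFFF  # rotate right by 3
        x = (x - k) & 0xFFFF
    return x

def meet_in_the_middle(p, c, keybits=16):
    # Phase 1: tabulate every possible "middle" value E(P, K1')
    table = {toy_encrypt(p, k1): k1 for k1 in range(1 << keybits)}
    # Phase 2: for each K2', check whether D(C, K2') lands in the table
    for k2 in range(1 << keybits):
        m = toy_decrypt(c, k2)
        if m in table:
            return table[m], k2

k1, k2 = 0x1234, 0x0BEE
c = toy_encrypt(toy_encrypt(0xBEEF, k1), k2)
r1, r2 = meet_in_the_middle(0xBEEF, c)
# The recovered pair double-encrypts P to C (it may be an equivalent pair)
assert toy_encrypt(toy_encrypt(0xBEEF, r1), r2) == c
```

Note that the recovered pair is only guaranteed to agree with the true keys on this one plaintext/ciphertext pair; in practice a second known pair is used to weed out false positives.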

Demonstration

I’ve written a quick demonstration of this attack in Python using the pycrypto library. The source code is on my GitHub. For the demo, I have only used 18 bits of the key (just so you don’t have to wait forever to see the results). There are also some optimizations which can be made (e.g., DES takes the key as a 64-bit input even though only 56 of those bits are used… we can skip some of the inputs that we have tested to make it faster). However, it works.

Triple-DES

So why isn’t Triple-DES vulnerable to this attack? Well, it’s easy. We can only “skip” one encryption using this strategy. Since Triple-DES is defined as E(D(E(P,K1),K2),K1), we can’t split it up in such a way that we won’t, at some point, have to guess both K1 and K2 at the same time. A formal proof of this claim is, unfortunately, out of scope of this article.

Last word

This attack isn’t specific to DES. It works on any composition of two ciphers which is constructed as E(E(P, K1), K2). So don’t ever, ever, ever try to do that.

Digital privacy: Where do you draw the line?

I recently had a friendly debate on IRC about how much privacy you really need for it to be considered “enough.” Arguably, the answer is “there is never enough.” And still, there are plenty of people who would be perfectly content having no privacy at all.

Many argue that the key to privacy is actually transparency. If we use open source software, we can all audit it to be sure it is free of backdoors and bugs which may leak information. Perhaps this is why a growing number of users (though unfortunately still fewer than 2% overall) now prefer Linux to proprietary operating systems. But even among Linux users, there is not much knowledge about how deep the privacy rabbit hole goes.

Before I return to the debate about how much privacy is really sufficient, I’d like to give a quick overview of some of the tools you can use to preserve your privacy. I will discuss four levels of privacy: network, OS, firmware, and hardware.

Privacy-centric networking

Tor is now a very well-known protocol. It allows you to browse the Internet privately by encrypting your data several times and forwarding it through several proxies, each of which can only remove one layer of encryption.
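The layering idea can be illustrated with a toy sketch (this is NOT the real Tor protocol — real onion routing uses per-hop negotiated keys and a proper cipher, not the throwaway XOR construction below):

```python
import hashlib

def layer(data, key):
    # Toy "cipher": XOR against a hash-derived keystream. XOR is its own
    # inverse, so applying the same key twice strips the layer.
    ks = hashlib.sha256(key).digest()
    return bytes(b ^ ks[i % len(ks)] for i, b in enumerate(data))

msg = b"hello"
relay_keys = [b"guard", b"middle", b"exit"]  # hypothetical relay keys

# The client wraps the message once per relay, innermost layer last:
onion = msg
for k in reversed(relay_keys):
    onion = layer(onion, k)

# Each relay, in order, can remove exactly one layer -- and sees only
# ciphertext until the final layer comes off:
for k in relay_keys:
    onion = layer(onion, k)
assert onion == msg
```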

One alternative to Tor is JonDonym, which is similar in that it enables privacy by packaging data under multiple layers of encryption. However, it is inherently less secure because its anonymization network is smaller and it uses less randomization in path selection. Even so, it is demonstrably faster than Tor, so it may be suitable if you aren’t quite as concerned about privacy.

Even though Tor has become the de facto standard for Internet privacy, some privacy advocates and security researchers have recently asserted that Tor isn’t strong enough. For example, so-called “Internet drug kingpin” Dread Pirate Roberts was nabbed by the FBI despite using Tor back in 2013. More recently, there was a big controversy over the FBI allegedly paying academics large sums of money to help them catch criminals who were using Tor.

In response to concerns about Tor-like protocols, new privacy-centric protocols have been proposed. One particularly notable design is Vuvuzela, which goes out of its way to hide any possible metadata by having idle clients continuously transmit bogus data. While many complain that browsing the Internet over Tor is too slow, browsing over Vuvuzela is simply impractical: it was deliberately designed with message transmission latencies of up to one minute. Even so, it is perfectly suitable for sending sensitive emails or other kinds of not-quite-instant messages.

Privacy-centric operating systems

The Linux-based operating system Tails gained notoriety in 2013 as the operating system Edward Snowden used to preserve his privacy while in exile. Tails is 100% open source and comes with a bundle of privacy-oriented software and state-of-the-art cryptographic tools. Perhaps most notably, it runs like a Live CD and does not store any state unless you specifically ask it to. Therefore, you can be sure that no identifying or compromising files will remain on the disk.

Though Tails is certainly the most popular, there are other privacy-oriented OSes out there. Some other notable distributions include:

  • BlackArch: Based on Arch Linux, this distro is targeted primarily at pentesters and security researchers. Nonetheless, it ships with many of the same cryptographic and privacy-preserving utilities that are included with Tails.
  • Parrot Security OS: This rolling-release Debian-based distro is similar to BlackArch in that it is aimed primarily at security researchers but still gives you all the utilities you need to ensure your privacy and security.
  • JonDo: JonDo is more like Tails in that its main goal is to protect the user’s privacy. In addition to Tor, it includes clients for the JonDo anonymity network.
  • Qubes OS: Qubes is unique in that it provides security by enforcing compartmentalization. Each program you run is contained in its own sandboxed virtual machine, so there is little chance of data leakage between applications. It is also notable in that its team is quite involved in the security research community.

Open source firmware

Recently, the NSA has been accused of infecting computers with persistent spyware that survives even if the operating system is reinstalled. The software can install itself permanently by infecting the computer’s firmware, e.g. in the BIOS/UEFI or System Management Controller. Detecting these threats is very difficult, and sometimes they can be impossible to mitigate. Updating firmware is difficult, and often vendors will never release updates to their code. In response, there have been recent efforts to make computer boot firmware open source.

The most popular free/open-source firmware projects are coreboot and libreboot. The two are intimately related, in that libreboot is actually a distribution of coreboot (e.g. in the way that Debian is a distribution of Linux). These projects aim to make full software-stack transparency possible for the average user — but this is still an ongoing endeavor.

As of now, there are not many hardware platforms supported by libreboot, and the installation process can be quite arduous, depending on the platform. Fortunately, support for open-source firmware is growing, so hopefully it will be more accessible in the near future.

Open source hardware

Though it may now be possible to run 100% open source code on your computer, from the firmware up to application software, it is still difficult to gain full control over your hardware. As a result, the Open Source Hardware Association (OSHWA) and the Free Software Foundation (FSF) have recently initiated efforts to develop open hardware and certify compliant devices. Only with open-source hardware can your trust extend all the way down: with access to every detail of the design, you can audit a certified device yourself for privacy holes and backdoors.

While several embedded devices and hobby platforms (e.g., Arduino) embrace the open-source hardware philosophy, it is still difficult to find a general-purpose computer that is completely open-source. The FSF’s open hardware certification program, “Respects Your Freedom,” has only certified a handful of devices since its inauguration. This is somewhat understandable, since only a small niche of users care enough about their privacy to go this far, so there is not much motivation for hardware vendors to make their products open and get them certified. However, one can rejoice in the fact that it is indeed feasible to obtain a completely open-source machine.

How much is too much?

Personally, I believe that everyone should be entitled to privacy online. However, I’m not going to go out of my way to make all of my communications 100% airtight from the prying eyes of determined state actors.

There is certainly a tradeoff between privacy and convenience. Not many people would take the time to find an open-source hardware platform with open-source firmware, then install a locked-down, privacy-aware OS and route all of their Internet traffic through Vuvuzela. It is certainly possible to go to such an extent, and it will enhance your privacy. But I wouldn’t do it.

I am not a fan of the “nothing to hide, nothing to fear” argument. It is completely fallacious and does not justify surveillance. Nobody should feel guilty for valuing their privacy. Even so, you would have to be really determined to go to such extreme lengths for privacy.

I’m perfectly content using TLS to transmit private data, and using a VPN if I’m connected to a network that isn’t trustworthy. But my hard disk isn’t encrypted, I don’t use Tor, and I don’t use open-source firmware, so I’m certainly not what anyone would call paranoid.

I think that being paranoid usually isn’t worth it. For example, the average hacker isn’t capable of breaking your SSL session. A government can, but (with high probability) they would have to target you specifically. And if you’ve become the specific target of an actor with nation-state scales of resources, then you can never be 100% safe. There will always be some vulnerability that can be exploited.

A fun experiment with Twilio

Aside: I apologize for the long gap in posts. I’ve been in Mexico City for the past two weeks. I’ll post some photos once I get the film developed. In the meantime, here’s a small project I’ve been working on.

I first heard about Twilio a long, long time ago. As Google Voice faded out of relevance, Twilio took the lead in the mobile-communication-as-a-service market. However, I had never had the chance (or inclination) to play around with its API until today.

About 12 hours after we landed back in the US from our holiday in Mexico, Lynsey departed once again — this time to the Plant and Animal Genome conference (PAG) in San Diego. She asked me to supply her with pictures of our cats for the duration of her trip. I told her I would send her a cat pic every hour, on the hour.

I didn’t realize what I had gotten myself into until I had already deposited $20 into a new Twilio account and spent 2 hours coding away… Though my goal was just to send some photos of cats, I had developed a pretty general application that lets you build a queue of MMSes to be disseminated at a constant rate.
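The core of such an app boils down to a rate-limited queue. Here is a rough sketch of the idea (this is my own illustration, not the actual twiliq implementation; the class and parameter names are invented):

```python
import time
from collections import deque

class MMSQueue:
    """Dispatch queued media messages at a fixed rate."""

    def __init__(self, send_fn, interval_s=3600):
        self.send_fn = send_fn        # called as send_fn(to, media_url)
        self.interval_s = interval_s  # delay between sends (1 hour here)
        self.pending = deque()

    def enqueue(self, to, media_url):
        self.pending.append((to, media_url))

    def run(self, sleep=time.sleep):
        # Drain the queue, pausing between consecutive sends.
        while self.pending:
            to, media_url = self.pending.popleft()
            self.send_fn(to, media_url)
            if self.pending:
                sleep(self.interval_s)

# A real sender would wrap the Twilio REST client, roughly like this
# (check the Twilio Python helper library docs for the exact API):
#
#   from twilio.rest import Client
#   client = Client(ACCOUNT_SID, AUTH_TOKEN)
#   send = lambda to, url: client.messages.create(
#       to=to, from_=MY_NUMBER, media_url=[url])
#   queue = MMSQueue(send, interval_s=3600)
```

Injecting the send function also makes the queue trivial to test without touching the network.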

This was not only my first experiment with Twilio — it was also my first adventure with Flask, a framework for building simple HTTP services in Python. I must say that both Flask and Twilio were extremely easy to use, and I would not hesitate to use them again.

I will certainly be finding some more uses for Twilio in the future — it will be invaluable for the SMS-based 2FA we plan to integrate into Zeall. But for the time being, it will be serving as a medium for the delivery of photos of cats.

If anyone is interested in my Twilio-based app, here is the link: https://github.com/le1ca/twiliq

What goes around comes around: Is the Juniper backdoor the feds’ fault?

By now, everyone should know that Dual EC DRBG is unsafe. Way back in 2013, it was revealed that it has many weaknesses, some of which were traced back to the NSA (with the help of Edward Snowden’s leaks). Whether or not it really was inserted by the NSA, the backdoor has been proven to exist and is easily exploitable.

For those of you who don’t know, a DRBG, or “deterministic random bit generator,” is essentially just a way for computers to generate random numbers. We need random numbers for cryptography — they form the basis of our secret keys. If a random number generator is compromised, then the keys it produces are unsafe to use.
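A toy example makes the danger obvious (this is a deliberately weak, invented generator — nothing like Dual EC — but the failure mode is the same): if an attacker knows or can recover the generator’s internal state, every “secret” key it produces is theirs too.

```python
class ToyDRBG:
    """A hypothetical, trivially predictable generator (illustration only)."""

    def __init__(self, seed):
        self.state = seed

    def next_key(self):
        # Linear congruential step: fully deterministic given the state.
        self.state = (6364136223846793005 * self.state
                      + 1442695040888963407) % 2**64
        return self.state

victim = ToyDRBG(seed=0xDEADBEEF)
secret_key = victim.next_key()

# An attacker who learns (or backdoored) the seed derives the same key:
attacker = ToyDRBG(seed=0xDEADBEEF)
assert attacker.next_key() == secret_key
```

Dual EC’s flaw is subtler — knowledge of a secret relationship between its constants lets an observer recover the state from output — but the consequence is the same: predictable output, compromised keys.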

Earlier this week it was announced that Juniper Networks, a major vendor of network hardware, had found a massive backdoor in its own VPN appliances. The US government is pretty worried about this, since apparently they use a lot of Juniper hardware. They suspect foreign governments; specifically, they think the backdoor was inserted by state-sponsored hackers from China or Russia.

However, some researchers are looking back toward Dual EC. They think foreign governments are really just a scapegoat, and that the backdoor is the federal government’s own fault. There is documentation proving that Juniper still uses Dual EC in some of its products. If the Dual EC backdoor turns out to be the source of the vulnerability in Juniper’s VPNs, then this is certainly an ironic turn of events for the NSA.

Before I conclude, let me take a moment to answer the question posed in the title of this post: Is the backdoor the feds’ fault? Eh, probably not this time. But it is certainly possible. The NSA is known to have ways to backdoor hardware. And unfortunately, no backdoor is safe. Eventually, someone other than the intended user is going to find out how to get in. And what happens then?