One of the topics that’s received some attention recently is “Browser Fingerprinting.” At a high level, this is a way of attempting to identify a particular browser instance (and, by extension, the person who’s probably using it) without relying on values implanted by the website. There are obviously privacy issues surrounding this, but I’m not going to delve into that particular area. Instead, I’m going to discuss some of the technical means via which this is done.

Motivation Behind Browser Fingerprinting

In the vast majority of web applications, each individual request your browser sends to a web site is independent of any previous requests. This was an interesting problem in the early days of the web, since there are valid reasons why requests should be linked together. The classic example is a shopping cart – when you click “buy this” on Amazon and then want to check out and pay, Amazon needs to know that the request for the shopping cart page is coming from the same person who clicked “buy” so that you can be shown your cart and not mine. “Cookies” are the main method via which this is implemented – when you log in, Amazon issues you a “session cookie” which your browser returns on each successive request, allowing Amazon to sort out which request is associated with which user session.

Of course, then came advertising. Advertisers have long known that they stand a better chance of selling you something if they present an ad for something you’re interested in, as opposed to something you’re not. One of the promises of web advertising was that it could be “targeted” – if your likes and dislikes could be identified, the ads you were shown could be tailored to you. One of the early ways this was done was to tailor the ads shown on a page to the content of that page. Google AdWords makes use of Google’s search capability to do this – Google uses what it knows about the content of each page to choose which ads to display, on the theory that if you’re reading that page, you’re probably interested in whatever topic is written about there.

Over time, however, the companies who place the ads got more ambitious. What they wanted to do was note that you looked at topic A on site B, and then be able to tailor the ads you saw on site X based on that. Initially, this was possible using what are known as “third-party cookies” – an ad displayed on site B could set a cookie that would also be returned when you then browsed site X.

Over time, browsers adapted to the wishes of their users, and most browsers now have the ability to prevent third-party cookies from being set. (Whether that switch is turned on or not by default is another question.) So, the ad companies turned to other means. One of these was so-called “Flash cookies”, in which Adobe Flash objects could store information on your machine and later retrieve it when desired. “Local storage” (the ability of JavaScript to request that your browser store bits of information in a browser-based database) has also been used for this.

All of these techniques rely on the basic idea that a web site installs something on your computer that identifies you, and later on can get that bit of content back. The problem with them is that users can clear this information, or can (sometimes) prevent it from being stored at all by configuring their browsers correctly. This disappoints Those Who Want To Identify You.

(Sidebar: Identifying you via these means is not always nefarious. As one example, secure sites – such as your online banking site – have a perfectly valid need to want to confirm that you are you.)

Then along came the concept of browser fingerprinting.

The Core Idea

Have you ever played the game “Twenty Questions?” If so, you will recall that the basic idea is that you ask a series of questions, trying to figure out whatever it is that your opponent is thinking about. Each question is designed to eliminate some subset of the candidates that remain, until you get down to the point where (hopefully) one and only one possibility remains.

Phrased differently, the greater number of facts I have in my possession about you (or your browser), the greater the likelihood that I will be able to distinguish you (or your browser) from someone else. Information theorists use the term “entropy” to describe this, and measure the amount of information available in bits. As a crude example, the population of the United States was approximately 320 million people at the end of 2014. That means that, to uniquely identify one person out of the population of the United States, you’d need between 28 and 29 bits of information. (2^28 = 268,435,456, 2^29 = 536,870,912).
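The arithmetic behind that estimate can be checked in a few lines. This is only a minimal sketch: the function name is made up for illustration, and it assumes every person is equally likely, which is the simplest possible model.

```python
import math

def bits_to_identify(population: int) -> float:
    """Bits needed to single out one member of a population,
    assuming every member is equally likely (an idealization)."""
    return math.log2(population)

us_population = 320_000_000  # approximate figure quoted in the text
bits = bits_to_identify(us_population)

# 2**28 is too small to cover everyone, 2**29 is more than enough.
print(f"{bits:.2f} bits needed")
print(2**28 < us_population < 2**29)
```

In other words, 28 yes/no answers are not quite enough to cover the whole population, and 29 are slightly more than needed.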

In theory, each yes/no fact about you provides one additional bit of information. In practice, yes/no questions probably provide fewer than a full bit, because many facts aren’t completely independent of previous facts. (Example: “Are you at least 6 feet tall” and “Are you male” aren’t independent, since males, on average, are taller than females.) On the other hand, questions with more than two answers frequently provide more than one bit of information about you. There are around 42,000 unique zip codes, for example, so if I know your zip code, on average I’ve narrowed you down to around 8,000 possible people. (Maybe more – there are zip codes with as many as 115,000 people in them, and others with many fewer.) Similarly, if I know the month and day of your birth, I’ve eliminated, on average, over 99.5% of the potential candidates. Combine the two, and, on average, I’m down to about 20 people in the US. Add in the year of your birth, and I’ve probably uniquely identified you.
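Here is a rough sketch of the zip code and birthday arithmetic. It assumes people are spread evenly across zip codes and birthdays and that the two facts are independent; neither is strictly true, so treat the results as averages. The helper name is hypothetical.

```python
import math

US_POPULATION = 320_000_000  # approximate, end of 2014
ZIP_CODES = 42_000           # approximate count of zip codes
BIRTHDAYS = 365              # month/day combinations, ignoring Feb 29

def remaining_candidates(population: float, *answer_counts: int) -> float:
    """Average number of people left after learning each fact,
    assuming answers are uniformly distributed and independent."""
    for count in answer_counts:
        population /= count
    return population

after_zip = remaining_candidates(US_POPULATION, ZIP_CODES)
after_both = remaining_candidates(US_POPULATION, ZIP_CODES, BIRTHDAYS)

# Under the same assumptions, the information from each fact simply adds up.
bits_zip = math.log2(ZIP_CODES)
bits_birthday = math.log2(BIRTHDAYS)

print(f"after zip code: about {after_zip:,.0f} people")
print(f"after zip code and birthday: about {after_both:.0f} people")
print(f"bits gained: {bits_zip:.1f} + {bits_birthday:.1f}")
```

The zip code alone is worth roughly 15 bits and the birthday roughly 8.5, which together cover most of the 28 or so bits needed to single out one person in the US.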

I say “probably” in all of this – there are always anomalies. (Example: twins living together.) But if you were to go to an advertiser and tell them that, with greater than 99% probability, you could tell that the person who browsed page A on site B was the same person who later browsed page C on site D, that advertiser would probably be very, very happy with you. Advertisers, in particular, don’t need perfection – they just want to significantly up the odds. And they’re willing to pay for it – a “targeted” ad placement commands a significantly higher fee than an untargeted one, because the conversion rate (i.e. the likelihood that you will make a purchase through that ad) is very, very much higher.

So the goal of browser fingerprinting is to identify you. Specifically, to identify you:

  1. in ways that do not depend on the web sites having installed cookies or other content that can be erased,
  2. in ways that you have difficulty avoiding or spoofing, and
  3. in ways that are largely invisible to you.

The means via which this is done largely resemble the “20 Questions” game – web sites collect as much information as possible about your browser (and the underlying computer), relying on the fact that there are very few browsers in the overall population that will ever “answer” the same as yours.
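To make the idea concrete, here is a minimal sketch of how a set of “answers” might be boiled down to a single identifier. The attribute names and values are hypothetical placeholders, and hashing a canonical JSON form is just one plausible approach, not a description of what any particular site does.

```python
import hashlib
import json

def fingerprint(attributes: dict) -> str:
    """Reduce a set of browser "answers" to one stable identifier.

    Keys are sorted so that the order in which answers were
    collected does not change the result.
    """
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical answers from two different browsers.
browser_a = {
    "user_agent": "ExampleBrowser/1.0",
    "screen": "1920x1080",
    "timezone": "UTC-5",
    "language": "en-US",
}
browser_b = dict(browser_a, screen="1366x768")

print(fingerprint(browser_a))
print(fingerprint(browser_a) == fingerprint(browser_b))
```

Nothing here is stored on the visitor’s machine: the same browser gives the same answers on its next visit, so the same identifier falls out again.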

I should point out that browser fingerprinting, in and of itself, cannot actually identify you as a unique person. In and of itself, the most it can say is that “the browser that generated this request is probably the same one that generated that request.” However, there are many sites “out there” to whom you have identified yourself – essentially, any site through which you’ve made a purchase knows you as a unique individual. Facebook (and other social media sites) also have an awful lot of information about you. If those sites collect browser fingerprints from you as you are interacting with them, and then provide those fingerprints to other sites, those other sites can probably identify you as a unique individual even if you haven’t provided them with that information. Think that won’t or can’t happen? Then you probably haven’t read the Terms of Service and Privacy Policies on many of the sites you’ve visited. Those sites may promise not to reveal “personal details” about you, but I bet they never define what is a “personal detail” and what is not. I would wager large money that, if push came to shove, they would claim that “interested in fishing equipment, and has this browser fingerprint” is not a “personal detail” at all, and so is not covered by that promise.

In subsequent posts, we’ll look at some of the “Twenty Questions” that web sites are probably asking your browser.