[Noisebridge-discuss] CSS Fingerprint: preliminary data

Sai Emrys noisebridge at saizai.com
Tue Mar 2 12:24:09 UTC 2010


up-to-date results page: http://cssfingerprint.com/results
technical details page (w/ pretty graphs): http://cssfingerprint.com/about
participation page: http://cssfingerprint.com


After some helpful suggestions, I've considerably improved the
JavaScript scraping engine on http://cssfingerprint.com.

It is now compatible with all browsers that render CSS/JS at all
(sorry, this doesn't include stuff like lynx).
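
For anyone who hasn't seen the underlying trick, here's a minimal
sketch of the core method (not my actual engine, which batches
thousands of links at once and varies the technique per browser):
style :visited links distinctively, then ask the DOM what color a
test link actually got.

    // Minimal sketch: a stylesheet rule makes visited links red,
    // and getComputedStyle() tells us whether the rule applied.
    var style = document.createElement('style');
    style.appendChild(document.createTextNode(
      'a:visited { color: rgb(255, 0, 0); }'));
    document.getElementsByTagName('head')[0].appendChild(style);

    function isVisited(url) {
      var link = document.createElement('a');
      link.href = url;
      document.body.appendChild(link);
      // (older IE needs link.currentStyle.color instead)
      var color = window.getComputedStyle(link, null)
                        .getPropertyValue('color');
      document.body.removeChild(link);
      return color === 'rgb(255, 0, 0)';
    }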

It is also quite fast, at least locally. Based on current data,
which lets me target specific browsers with different scraping
methods, I can process approximately:

Explorer: 200k URLs/min
Firefox: 210k URLs/min
Opera: 210k URLs/min
Mozilla: 400k URLs/min
Chrome: 2M URLs/min
Safari: 3.4M URLs/min (!!!)

(See the about page for pretty graphs of the live data. Divide these
numbers by 4 if you want to test the http/https and bare/www.
variants of each URL.)
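
To spell out that factor of 4: each entry in my list is really a
domain that expands to four candidate URLs. A trivial, hypothetical
helper, just to show where the factor comes from:

    // Each domain expands into the four variants tested.
    function variants(domain) {
      return [
        'http://' + domain + '/',
        'https://' + domain + '/',
        'http://www.' + domain + '/',
        'https://www.' + domain + '/'
      ];
    }
    // variants('example.com') -> 4 URLs to test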

I'd like to reiterate that last one. *I can test whether you've
visited over three million URLs in one minute* on Safari. This will be
increasingly true on other browsers as they improve their DOM/JS
engines.

Right now, other components of the app can't actually keep up with
that speed:
* the network I/O needs a lot of optimization
* the background processing has a sync bug that I'm working on (and
is overwhelmed by the front-end speed)
* I'm still processing *all* the data rather than just the hits,
because the way it's set up doesn't let me easily compress that info
(... trying to insert ~3-50k rows per second into MySQL is kinda
overtaxing my dev box); see the sketch after this list
* I need to totally redo the way I'm choosing which URLs to test so
it's intelligent (right now I'm just using the Alexa db, rather than
scraping my own and using a bootstrapping method)
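
On the third point, the fix I have in mind (sketched here, not what
the live code does yet) is to compress on the client and only ship
back the hits, so the server inserts roughly one row per hit instead
of one row per URL tested:

    // Sketch: send back only the indices of visited URLs rather
    // than the full boolean vector. isVisited() is the per-URL
    // check sketched earlier.
    function hitsOnly(urls) {
      var hits = [];
      for (var i = 0; i < urls.length; i++) {
        if (isVisited(urls[i])) hits.push(i);
      }
      return hits;  // e.g. POST hits.join(',') to the server
    }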

So that 3.4M is still a little bit theoretical, and the cost of
network I/O alone will bring it down some.

But closing that gap is only a matter of time, not of possibility.


I've been talking with a few people about this (Dan Kaminsky; Peter
Eckersley & Seth Schoen of EFF; Gilbert Wondracek & Thorsten Holz of
ISecLab), and it seems we're in agreement that this hack has been
underestimated. Being able to easily query 3M URLs is in a whole
different ballpark from other things like cache timing attacks.

Gilbert & Thorsten have demonstrated
(http://www.iseclab.org/people/gilbert/experiment/) that by scraping
social network membership data and checking whether you have visited
various groups, they can identify pretty well who you are. Their
experiment was quite limited in scope, targeting only the Xing
network and a small number of volunteers. I'm fairly sure (but have
not yet verified) that testing across multiple networks would
deanonymize almost everyone who uses social networks at all.
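
Their attack is, at its core, just set intersection: if you know
which groups each user belongs to, the set of group URLs present in
a visitor's history narrows the candidates very quickly. A toy
sketch (invented data structures, not their code):

    // Toy sketch of group-intersection deanonymization: score
    // each known user by how many of the visitor's visited group
    // URLs that user is a member of.
    function bestMatch(visitedGroups, membership) {
      // membership: { userId: [groupUrl, ...], ... }
      var best = null, bestScore = -1;
      for (var user in membership) {
        var groups = membership[user], score = 0;
        for (var i = 0; i < groups.length; i++) {
          if (visitedGroups.indexOf(groups[i]) !== -1) score++;
        }
        if (score > bestScore) { bestScore = score; best = user; }
      }
      return best;
    }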

My research is not aimed at deanonymization (yet?) but at getting a
more abstract user fingerprint and a user similarity metric that maps
to real world social networks and social behavior.


I believe that there can be positive, ethical uses of this hack; for
instance, a site can use it to offer you a much more intelligent
selection of plausible social matches (both people you already know
and people you would actually like to interact with), without your
having to search for individuals directly. It can show you links
(more prominently / only) to social bookmarking sites you actually
use. Etc.

However, there are also serious privacy implications, and like any
powerful tool, this could very definitely be put to malicious ends.

There's been some discussion among us of how to mitigate the issue. So
far, the ideas are:
* abandon :visited altogether
* make visitation be remembered per source domain (not global), so
that links visited from site A do not show as visited on site B
* constrain :visited to only permit color (no background-image, font
changes, etc) and lie to JS when it's inspected (sketched after this
list)
* admit we're all screwed
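
To make #3 concrete: the browser would keep *rendering* :visited
links in their visited color, but lie about it when scripts ask.
(Hypothetical browser behavior, sketched against the detection code
above; nothing ships this today, as far as I know.)

    // Given:  a:link    { color: rgb(0, 0, 255); }
    //         a:visited { color: rgb(255, 0, 0); }
    // and link being an <a> the user has actually visited
    // (as in the earlier sketch), the user still sees red, but:
    var color = window.getComputedStyle(link, null)
                      .getPropertyValue('color');
    // ...always reports 'rgb(0, 0, 255)' (the :link color), and
    // :visited rules setting background-image, fonts, sizes, etc.
    // are simply ignored, which also closes layout-based side
    // channels.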

Dan thinks that he can get around #3, though I'm a bit skeptical.
AFAICT, it would block the attack - but break many sites in the
process. (How to balance the tradeoff between functionality and
privacy protection? Not my call.) But it's Dan, so I'm not writing it
off just yet. ;-)

Anyway, if you want to be in on the discussion, please email me directly.


If you want to help out, please keep visiting
http://cssfingerprint.com from all your browsers, remembering to use
the same code each time for the scraping. (If you don't, my AI will be
confused.)

It will also perform a self-test automatically that just gives me
timing and bogosity data (no real history data). You can do this even
if you don't want to give me your history info; it will do a lot to
help me develop better scraping methods, and you get to see the
results of it live on the about page. (Note that it needs to be hit at
least 10x by any given browser to show on the graphs, so if I don't
have enough hits from yours, you may need to reload the main page a
few times before I have enough data. I especially need timing data
from Explorer.)

I've added *optional* fields to input your name and email. This will
help me contact you if I have questions about your info, and, if you
check the agreement box, will let me use it in the future (i.e. once
the AI works) to give other users a list of similar users.

Speaking of the AI: it still sucks really really badly. But I haven't
really tried to work on it yet; I've been concentrating more on the
scraper and preliminary data analysis for now.
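
For a sense of what "the AI" has to do: the dumbest plausible
baseline is a set-overlap measure like Jaccard similarity over the
visited-sites vectors. A sketch of that starting point (my
assumption, not the real scoring code, which also has to cope with
partial scans):

    // Jaccard similarity between two visited-site boolean vectors:
    // |intersection| / |union|.
    function jaccard(a, b) {  // a, b: arrays of 0/1, equal length
      var inter = 0, union = 0;
      for (var i = 0; i < a.length; i++) {
        if (a[i] && b[i]) inter++;
        if (a[i] || b[i]) union++;
      }
      return union ? inter / union : 0;
    }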


If you want to help out technically, I need help with:
* finding a way to speed up document.body.appendChild(list of ~2000
links); one standard candidate is sketched after this list
* hosting this on a slashdot-proof server rather than my dev box at
home (requirements: mysql, memcached, starling, rails, and apache w/
mod_ruby [or privs to install them myself]; ssh; spikable RAM)
* scraping the URLs of the most popular pages on, and links from,
all major sites
* developing a method to test whether text styled with a :visited
color is actually colored that way *without* directly inspecting the
element with the :visited pseudo-class (e.g. by somehow directly
inspecting the text, or a child node)
* writing an AI capable of accurately telling how similar a new
visited-sites n-boolean vector is to previously seen ones
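
On the first item, the standard candidate is to build the links in a
detached DocumentFragment and attach them with a single insertion
(one reflow instead of ~2000); whether that actually beats innerHTML
or what I'm doing now, per browser, is exactly what I'd like help
measuring:

    // Build ~2000 links off-DOM, then attach in one operation.
    function appendLinks(urls) {
      var frag = document.createDocumentFragment();
      for (var i = 0; i < urls.length; i++) {
        var a = document.createElement('a');
        a.href = urls[i];
        frag.appendChild(a);
      }
      document.body.appendChild(frag);  // single DOM insertion
    }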

The code is at http://github.com/saizai/cssfingerprint .


Happy hacking,
Sai


