[Noisebridge-discuss] CSS fingerprinting

Sat Feb 6 02:20:48 UTC 2010

I recently had a chat with the people running EFF Panopticlick, and as
a result decided to try experimenting with one area that they
thought[0] was too unstable to use effectively: CSS-hack based history
extraction.

Right now they treat each datum as a unique string. I think more
information could be extracted by treating the data as a
high-dimensional vector space and looking for whatever clustering
exists.

In principle, because people visit many of the same sites repeatedly
even on different computers/browsers[1], and because the sites they
visit should be culturally similar[2] even when not identical, one
should be able to detect a single user despite minor changes in their
history, etc. With Panopticlick's method, any perturbation whatsoever
(e.g. a new site visited or a link expiring from the :visited cache)
makes the comparison simply return 'false', rather than taking
advantage of a partial match or similarity.
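
To make the difference concrete, here is a minimal sketch (not the
code in the repo) that encodes each visit as a 0/1 vector over the
probed sites and compares two visits both ways. The site names, the
variable names and the choice of cosine similarity are all assumptions
for illustration; the real comparison might use something quite
different.

```python
import math

# Hypothetical probe list; the real app tests far more sites.
SITES = ["a.example", "b.example", "c.example", "d.example", "e.example"]


def to_vector(visited):
    """Encode a set of visited sites as a 0/1 vector over SITES."""
    return [1 if site in visited else 0 for site in SITES]


def exact_match(u, v):
    """Unique-string style comparison: any difference gives False."""
    return u == v


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, or 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Same (hypothetical) person on two days; one extra site on the second day.
monday = to_vector({"a.example", "b.example", "c.example"})
tuesday = to_vector({"a.example", "b.example", "c.example", "d.example"})
# Someone with no overlap at all.
stranger = to_vector({"e.example"})

print(exact_match(monday, tuesday))                   # False
print(round(cosine_similarity(monday, tuesday), 3))   # 0.866
print(cosine_similarity(monday, stranger))            # 0.0
```

The exact comparison throws away the fact that three of the four sites
agree, while the similarity score still separates the near-duplicate
from the stranger.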

Anyway, I've got a working test app deployed on a home box (so please
don't slashdot/pentest this until I have a real host):

http://www.cssfingerprint.com

More data - particularly multiple visits from the same human in
different contexts - would be very helpful, since at this point it's
mostly a matter of having enough to be able to usefully train the AI.
So please visit multiple times from different browsers/computers/days,
with the same codephrase.

It's fully open source of course:
http://github.com/saizai/cssfingerprint / CC by-sa 2.0. Please note
that some of the [very preliminary] AI files have libraries with weird
requirements, like SVD > linalg > FORTRAN.

It stores only your user-agent, the string you enter, and, for each of
the ~8k top Alexa sites (plus a few custom ones), whether you appear to
have visited it. Your IP is stored in the server log (not the DB), but
that gets wiped regularly. I don't care who you are personally, I just
want to figure out if this can be used effectively as a fingerprinting
technique.

The software as it stands has some issues that I'm not sure how to
handle; see the bottom of the site's only page. Others are inherent to
the current method and reduce its effectiveness, such as the hack
being limited to fully specified URLs. If you have ideas for how to
improve any of that, speak up.
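
As a rough illustration of the fully-specified-URL limitation, the
sketch below models the :visited check as an exact string lookup. The
URLs and the helper name are hypothetical, and it ignores whatever URL
normalization a real browser applies.

```python
# Hypothetical browser history: only these exact URLs were visited.
history = {
    "http://news.example/story?id=42",
    "http://www.shop.example/cart",
}

# Hypothetical probe list of fully specified URLs.
probes = [
    "http://news.example/",
    "http://www.shop.example/cart",
    "http://shop.example/cart",
]


def appears_visited(url, visited_urls):
    """Model of the :visited check: only an exact URL match counts."""
    return url in visited_urls


results = {url: appears_visited(url, history) for url in probes}
for url, seen in results.items():
    print(url, seen)
```

Only the second probe comes back as visited. The news site is missed
because only a deep link was visited, and the shop is missed under its
www-less spelling, even though a human would say both sites were
visited.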

If you want to help analyze the data and can promise not to do naughty
things with it, let me know offlist. If you know a good free host[3],
let me know that also. I'm trying to figure out what my hosting
options are for this, as it's a bit different from the kinds of
projects I usually do professionally.

Thanks & enjoy,

- Sai

[0] https://panopticlick.eff.org/faq.php
[1] unless they really have a strict home/work or other functional
separation, in which case maybe not
[2] translation: near each other in an SVM-type projective space
[3] requirements: ruby, mysql, memcached, apache/mod_ruby or mongrel
or the like, starling; preferably ability to install a few things as
root, like linalg & libsvm. It shouldn't require a full box, just a
share with ability to spike RAM a few times a day and something I can
point the DNS to.