[Noisebridge-discuss] Noisebridger who deals with feed aggregation?

Sai Emrys noisebridge at saizai.com
Thu Mar 11 17:12:06 UTC 2010


On Thu, Mar 11, 2010 at 8:47 AM, Brian Johnson <noisebridge at dogtoe.com> wrote:
> I don't think it was me that you were talking to, but I may be able to help.

Yay!

> I've a lot of experience scraping websites and a lot of experience with
> feed/content aggregation; more specifically: RSS, XML, JSON, etc... Some
> things I've used in the past are FeedParser (python), MagpieRSS (PHP),
> rsswriter (PHP). All are great tools and all have their uses.

I use mostly Ruby, but the language really isn't a big deal, since this
part of the system can be completely separated out and given direct
MySQL access if need be.

> If you let me know more specifics about your project, I may be able to help.

Since others might be curious, I'll elaborate here...

First off, see http://cssfingerprint.com/about .

What I'm doing now is very crude. I have a list of the Alexa top 1M
sites, plus a handful of custom entries, and I test them ordered by
(probability the site was hit by users so far, Alexa rank).
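Concretely, that ordering is just a two-key sort. A minimal Ruby
sketch (the field names here are made up, not what's actually in my
schema):

  # Highest observed hit probability first, then best Alexa rank as a tiebreak.
  # sites is assumed to be an array of hashes like
  #   { :url => ..., :hit_prob => ..., :alexa_rank => ... }
  ordered = sites.sort_by { |s| [-s[:hit_prob], s[:alexa_rank]] }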

I want to supplement this with a more intelligent version of what
WTIKAY[0] does: have the database keep a constantly updated list of
URLs, each with an a priori ranking that approximates the likelihood
that a random netizen has that URL in their current browser history,
and dynamically search all the URLs that a given user is likely to
have hit. A naïve first-pass method (WTIKAY's) is to test URLs from
site X iff the user has visited the root of X. A more sophisticated
method would somehow use a more generalized and time-dependent
correlation set.
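The naïve pass is easy enough to state as code. A rough Ruby sketch,
where visited_roots and deep_urls_for are hypothetical stand-ins for
the relevant DB lookups:

  # WTIKAY-style expansion: only queue deep links from domains whose
  # root this user has already been seen to visit.
  candidates = visited_roots.map { |root| deep_urls_for(root) }.flatten.uniq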

In any case, to do this I need a list of all the popular URLs that
people go to - not just domains, but fully specified deep links. URLs
that pick up random parameters when surfed naturally are garbage for
my purposes and need to be discarded so as not to waste scraping time.
(I can currently query ~100k-1M sites in 1 minute, but that's still
quite finite. :-P)
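One crude way to drop those is to reject anything whose query string
looks session- or tracking-ish. A sketch only - the parameter list
below is just a guess at common offenders, not exhaustive:

  require 'uri'

  # Reject URLs whose query strings look randomized per visit.
  VOLATILE_PARAMS = /(^|&)(sessionid|sid|phpsessid|jsessionid|utm_[a-z]+|token)=/i

  def keep_url?(url)
    query = URI.parse(url).query
    query.nil? || query !~ VOLATILE_PARAMS
  rescue URI::InvalidURIError
    false
  end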

A couple related questions:
http://stackoverflow.com/questions/2424701/determining-an-a-priori-ranking-of-what-sites-a-user-has-most-likely-visited
http://stackoverflow.com/questions/2424570/ai-determining-what-tests-to-run-to-get-most-useful-data

A related scraping issue: I want to build a database that will allow
me to do a more hardcore version of iSecLab's de-anonymization
experiment[1]. That is, for all the major social networking sites, I
want a mirror of their entire group membership lists. There are of
course some significant scaling issues here - my mirror will always be
somewhat out of date - but I'm fine with settling for "good enough" on
this.
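The mirroring loop itself is simple: per site, per group, page through
the member list and store (site, group, member) rows. A hedged Ruby
sketch, where fetch_members_page and store_membership are hypothetical
helpers for whatever each site's listing and my schema end up being:

  # Mirror one group's membership list, page by page.
  def mirror_group(site, group_id)
    page = 1
    loop do
      members = fetch_members_page(site, group_id, page)
      break if members.empty?
      members.each { |m| store_membership(site, group_id, m) }
      page += 1
    end
  end

The hard part is keeping that fresh across every major site, not the
loop.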

I've written a couple of scrapers so far[2], but scraping really isn't
what I'm good at, so I'd appreciate help.

> You can find me on IRC (alienvenom).

Will ping.

Thanks,
Sai


[0] http://wtikay.com or http://whattheinternetknowsaboutyou.com
[1] http://www.iseclab.org/people/gilbert/experiment/
[2] http://github.com/saizai/cssfingerprint/tree/master/lib/tasks


