[ml] SocialPageRank

Sat Nov 27 04:08:17 UTC 2010

Cool, thanks a lot for the writeup Jared. I'll do it in R and we can
check our work.

It's not clear to me that we'll get the convergence we want if we
remove the dampening factor. Our transition matrix won't be
irreducible (Joe found 27 disconnected subgraphs, yeah?) so I don't
think we're guaranteed a unique stationary distribution [1]. It might
be hard to verify the implementation without this. I haven't gone
through the proofs, so if this isn't usually a problem for convergence
then I hope someone will correct me.

In any case, I think we do want our random walk to have a small
non-zero chance of crossing disconnected subgraphs because their
number falls from 27 to 22 when the test edges are added in. Hopefully
a constant dampening factor won't slow it down much. There's probably
no extra I/O involved.

Comparing PR(t=i) and PR(t=i-1) to check for convergence makes sense
to me. I might also check to see how much the relative rank order
(rather than the absolute probabilities) changes between iterations.
[1] says they used 25 iterations, although their initial state was
weird.

Shahin

[1] "Topic Sensitive PageRank", T. Haveliwala,
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.110.1183&rep=rep1&type=pdf

On Friday, November 26, 2010, Jared Dunne <jareddunne at gmail.com> wrote:
> Shahin-
>
> Quasi-PageRank meaning:
> - Semantically, It's not the actual PageRank algorithmn a la Google, since that's patented and so its doubtful that I've implemented exactly their PageRank alg.  Also, PageRank is a strange name for this feature in our domain, so I think SocialRank makes more sense.
> - Semantics aside, I also removed the dampening factor, since I suspect that's only needed when your random walk will not visit all nodes.  Since I can update all Nodes' values with each iteration, I left it out since it would make each iteration have a slower effect in updating values.  Or in other words, I've set d=1.0.  For nodes with no inbound edges, this means that their PR is 0.0 after one iteration, instead of approaching zero over time.
>
> It's not in R.  It's in Ruby and dumping the output to csv after each iteration.  I'd be happy to reformat my output to something yummy for R to gobble up if csv is not ideal.
>
> I'm not taking the approach Mike presented (power method), but rather an iterative method that should result in same calculations.  More or less described in the "Iterative" section here: http://en.wikipedia.org/wiki/PageRank#Computation (but like I said with dampening removed).  "The basic mathematical operations performed in the iterative method and the power method are identical."
>
> Another caveat is that I'm equally initializing PR(t=0) for each node at equal values (1/N where N=number of nodes) instead of something related to number of inbound edges versus outbound as Mike's talk suggested.  This is me keeping things simple to start, so I can verify correctness more easier.  I can always rerun with different initial values if we determine that's better.
>
> I've calculated 50 iterations deep so far, which isn't computationally difficult with my code (under an hour without any dumping, under two hours if dumping all iterations).  However, I think I have a correctness issue that I need to fix and then re-run.  Namely, I was keeping my PR values in one hash and updating it as I iterate but i think I need to keep separate hashes, one for the current timestep and another for the previous timestep.  This would unfairly diminish the PR values of the nodes that are iterated on last.  In theory since, my order changes with each iteration, this won't have a major effect, but it's easy enough for me to correct, so I will.
>
> More to follow later this week once I fix my bug and rerun... At that time I'll post my code and sample data.  That said, in the meantime if you want to try your hand at it in R using power method, go for it!  We can always compare results to confirm we are finding the same nodes to be most "important".  (So far node #146725 has the highest SocialRank value in our network, while #887961 is the lowest (non-zero) PR)
>
> One thing I could use help with is knowing when I can stop iterating.... Any ideas on that?  I'll be able to compare PR(t=i) and PR(t=i-1) for a given node in constant time, if that helps...
>
> Jared-
>
> On Fri, Nov 26, 2010 at 1:42 PM, Shahin Saneinejad <ssaneine at gmail.com> wrote:
>
> Hey Jared,
> Is this in R? Are you using power iteration? I'm considering both these approaches for memory footprint and code simplicity reasons, respectively (my laptop is a little long in the tooth).
>
>
> Also, what's "quasi" about your quasi-PageRank feature?
> Shahin
>
> On Thu, Nov 25, 2010 at 4:11 PM, Jared Dunne <jareddunne at gmail.com> wrote:
> I've implemented an iterative approach to quasi-PageRank feature
> generation for all nodes.  Each iteration across all nodes is under a
> minute each time.  I still need to fix some things with my approach and
> possibly get some help sanity-checking the correctness of my algorithm.
>  I'll post more later, but I just wanted to let everyone know that I'm
> making good progress on the PageRank feature generation, so that we
> aren't duplicating our efforts.  Gonna pick this back up later this
> weekend.  Enjoy your feasts.
>
> Jared-
>
> On Wed, Nov 24, 2010 at 6:23 PM,  <mike at mindmech.com> wrote:
>
>
>
>
>
>
> I've shared SocialPageRank <https://docs.google.com/present/edit?id=0AbXOdbbcuXxAZGNidmNnNjJfMGRkNmg4OGR2&hl=en&invite=CO_7wcEI>
>
>
>
> Message from mike at mindmech.com:
> Here's the presentation I'll be giving tonight on Random Walks, PageRank, and Social Net Analysis.
>
>
> Click to open:
> SocialPageRank <https://docs.google.com/present/edit?id=0AbXOdbbcuXxAZGNidmNnNjJfMGRkNmg4OGR2&hl=en&invite=CO_7wcEI>
>
>
>
>
> Google Docs makes it easy to create, store and share online documents, spreadsheets and presentations.
>
> _______________________________________________
> ml mailing list
> ml at lists.noisebridge.net
> https://www.noisebridge.net/mailman/listinfo/ml
>
>
>
> _______________________________________________
> ml mailing list
> ml at lists.noisebridge.net
> https://www.noisebridge.net/mailman/listinfo/ml
>
>
>
>
>