[ml] SocialPageRank

Sat Nov 27 00:36:58 UTC 2010

Shahin-

Quasi-PageRank meaning:
- Semantically, It's not the actual PageRank algorithmn a la Google, since
that's patented and so its doubtful that I've implemented exactly their
PageRank alg.  Also, PageRank is a strange name for this feature in our
domain, so I think SocialRank makes more sense.
- Semantics aside, I also removed the dampening factor, since I suspect
that's only needed when your random walk will not visit all nodes.  Since I
can update all Nodes' values with each iteration, I left it out since it
would make each iteration have a slower effect in updating values.  Or in
other words, I've set d=1.0.  For nodes with no inbound edges, this means
that their PR is 0.0 after one iteration, instead of approaching zero over
time.

It's not in R.  It's in Ruby and dumping the output to csv after each
iteration.  I'd be happy to reformat my output to something yummy for R to
gobble up if csv is not ideal.

I'm not taking the approach Mike presented (power method), but rather an
iterative method that should result in same calculations.  More or less
described in the "Iterative" section here:
http://en.wikipedia.org/wiki/PageRank#Computation (but like I said with
dampening removed).  "The basic mathematical operations performed in the
iterative method and the power method are identical."

Another caveat is that I'm equally initializing PR(t=0) for each node at
equal values (1/N where N=number of nodes) instead of something related to
number of inbound edges versus outbound as Mike's talk suggested.  This is
me keeping things simple to start, so I can verify correctness more easier.
I can always rerun with different initial values if we determine that's
better.

I've calculated 50 iterations deep so far, which isn't computationally
difficult with my code (under an hour without any dumping, under two hours
if dumping all iterations).  However, I think I have a correctness issue
that I need to fix and then re-run.  Namely, I was keeping my PR values in
one hash and updating it as I iterate but i think I need to keep separate
hashes, one for the current timestep and another for the previous timestep.
This would unfairly diminish the PR values of the nodes that are iterated on
last.  In theory since, my order changes with each iteration, this won't
have a major effect, but it's easy enough for me to correct, so I will.

More to follow later this week once I fix my bug and rerun... At that time
I'll post my code and sample data.  That said, in the meantime if you want
to try your hand at it in R using power method, go for it!  We can always
compare results to confirm we are finding the same nodes to be most
"important".  (So far node #146725 has the highest SocialRank value in our
network, while #887961 is the lowest (non-zero) PR)

One thing I could use help with is knowing when I can stop iterating.... Any
ideas on that?  I'll be able to compare PR(t=i) and PR(t=i-1) for a given
node in constant time, if that helps...

Jared-

On Fri, Nov 26, 2010 at 1:42 PM, Shahin Saneinejad <ssaneine at gmail.com>wrote:

> Hey Jared,
>
> Is this in R? Are you using power iteration? I'm considering both these
> approaches for memory footprint and code simplicity reasons, respectively
> (my laptop is a little long in the tooth).
>
> Also, what's "quasi" about your quasi-PageRank feature?
>
> Shahin
>
>
> On Thu, Nov 25, 2010 at 4:11 PM, Jared Dunne <jareddunne at gmail.com> wrote:
>
>> I've implemented an iterative approach to quasi-PageRank feature
>> generation for all nodes.  Each iteration across all nodes is under a minute
>> each time.  I still need to fix some things with my approach and possibly
>> get some help sanity-checking the correctness of my algorithm.  I'll post
>> more later, but I just wanted to let everyone know that I'm making good
>> progress on the PageRank feature generation, so that we aren't duplicating
>> our efforts.  Gonna pick this back up later this weekend.  Enjoy your
>> feasts.
>>
>> Jared-
>>
>> On Wed, Nov 24, 2010 at 6:23 PM, <mike at mindmech.com> wrote:
>>
>>>  I've shared SocialPageRank<https://docs.google.com/present/edit?id=0AbXOdbbcuXxAZGNidmNnNjJfMGRkNmg4OGR2&hl=en&invite=CO_7wcEI>
>>> Message from mike at mindmech.com:
>>>
>>> Here's the presentation I'll be giving tonight on Random Walks, PageRank, and Social Net Analysis.
>>>
>>>
>>> Click to open:
>>>
>>>    - SocialPageRank<https://docs.google.com/present/edit?id=0AbXOdbbcuXxAZGNidmNnNjJfMGRkNmg4OGR2&hl=en&invite=CO_7wcEI>
>>>
>>>
>>> Google Docs makes it easy to create, store and share online documents,
>>> spreadsheets and presentations.
>>> [image: Logo for Google Docs] <http://docs.google.com>
>>>
>>> _______________________________________________
>>> ml mailing list
>>> ml at lists.noisebridge.net
>>> https://www.noisebridge.net/mailman/listinfo/ml
>>>
>>>
>>
>> _______________________________________________
>> ml mailing list
>> ml at lists.noisebridge.net
>> https://www.noisebridge.net/mailman/listinfo/ml
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.noisebridge.net/pipermail/ml/attachments/20101126/4a40307a/attachment-0003.html>