[ml] new kaggle competition

Sat Nov 13 07:32:42 UTC 2010

Hi All-

I created a wiki page to assist in collaborating on this competition in case
that works better for people than email.

https://www.noisebridge.net/wiki/Machine_Learning/Kaggle_Social_Network_Contest

That said, it's a rather public spot for our intel, but at this point I don't
mind until we have something worth hiding. ;)

Jared-

On Fri, Nov 12, 2010 at 12:14 AM, Jared Dunne <jareddunne at gmail.com> wrote:

> Hi everyone-
>
> I've posted an adjacency list built from the training data:
> http://dl.dropbox.com/u/14895843/social-network-kaggle/adj_list.out.csv
> First column: outbound vertex
> Remaining columns: list of vertices to which it points
>
> I also created a reversed adjacency list (for tracing backwards along the
> edges):
>
> http://dl.dropbox.com/u/14895843/social-network-kaggle/reverse_adj_list.out.csv
> First column: inbound vertex
> Remaining columns: list of vertices which point to it
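For anyone picking up these files, here is a minimal Python sketch of how rows in the layout described above (first column a vertex, remaining columns its neighbours) could be parsed, and how the reversed list could be derived from the forward one. The function names and the three sample rows are made up for illustration; they are not taken from the posted files.

```python
import csv
import io
from collections import defaultdict


def read_adj_list(lines):
    """Parse rows of the form source,target1,target2,... into a dict."""
    adj = {}
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        source, targets = row[0], row[1:]
        adj[source] = [t for t in targets if t]  # drop empty trailing cells
    return adj


def reverse_adj_list(adj):
    """Map each vertex to the list of vertices that point to it."""
    rev = defaultdict(list)
    for source, targets in adj.items():
        for target in targets:
            rev[target].append(source)
    return dict(rev)


# Hypothetical sample rows standing in for adj_list.out.csv.
sample = io.StringIO("1,2,3\n2,3\n3,1\n")
adj = read_adj_list(sample)
rev = reverse_adj_list(adj)
```

Vertex ids are kept as strings here; converting them to integers would be a reasonable choice if memory becomes a concern.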
>
> Jared-
>
>
> On Thu, Nov 11, 2010 at 11:51 AM, Jared Dunne <jareddunne at gmail.com> wrote:
>
>> There seemed to be a lot of interest in the competition last night.  We
>> sorta splintered off into group discussions about the competition, but never
>> really reconvened before Erin's talk started.  Maybe we should report back
>> over the mailing list on what thoughts everyone had about the competition?
>>
>> Theo and I discussed two main areas...
>>
>> Process:
>> - We shouldn't have a single approach to solving the problem.  If people
>> have ideas they should run with them and report back their success/failure
>> to the group.  The collaboration between our diverse
>> ideas/approaches/experiences will be our strength in working together.
>> - Since this is throwaway code for this competition only, we need not get
>> hung up on efficiency or elegant implementations.  That said, if we hit a
>> point where our code is not fast enough, we can address it then instead of
>> overengineering from the get-go.
>> - Theo suggested that we start by using things like python/ruby scripts to
>> massage the starting data set into something more useful (with more
>> features), then analyse and visualize that using things like R.
>> - I'm wondering if people think it's legit to use the mailing list for
>> discussion or if we should create a discussion list for the competition to
>> avoid spamming the main list with competition collaboration?
>> - Also, as we transform the dataset into different views, we are going to
>> end up with some large files that we will be passing around to each other.
>> Any suggestions on how to best do that? ML git repo?
>>
>> Strategy (these are just brainstorming-level ideas):
>> - The dataset forms a graph of directed edges between vertices.  At the
>> core of this problem will be performing analysis on that graph.  The first
>> intuitive approach that came to mind was that the shorter the distance
>> between two vertices along existing edges, the more likely it is that
>> an edge could/should exist between those vertices.
>> - After the talk, Erin, Theo, and I stumbled on the idea that some
>> vertices might be uber-followers (meaning more outbound edges than the
>> average vertex) and that some vertices might be uber-followees (meaning more
>> inbound edges than average).  This reminded me of PageRank for link graphs,
>> so perhaps we can draw from techniques in that vein.  In our problem this
>> might be applied as a weighting: people who follow lots of people might be
>> more likely to follow someone further out in their "network", whereas
>> someone who doesn't follow many people might be less likely to follow
>> someone outside their "network".
>> - Since the edges are directional, we know that it's possible for people
>> to "follow" someone without that person "following back".  At first glance
>> it might make sense that the reverse edges would be likely in cases like
>> this.  However, consider a "hub" user with lots of followers who doesn't
>> reciprocate with edges back to his followers: the information about who
>> follows him is then less important in determining who he would follow.
>> Conversely, for a user who commonly reciprocates with followbacks, the
>> information on who follows her might be useful in suggesting who she would
>> follow.
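The three strategy ideas above (distance along existing edges, uber-followers/followees, and followback behaviour) could each be turned into a per-vertex or per-pair feature. The following is only a rough Python sketch under stated assumptions: `adj` and `rev` are forward and reversed adjacency dicts, the helper names and the four-vertex toy graph are invented for illustration, and none of this has been tried against the competition data.

```python
from collections import deque


def shortest_distance(adj, start, goal, max_depth=None):
    """Breadth-first search over directed edges; returns hop count or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        vertex, dist = queue.popleft()
        if max_depth is not None and dist >= max_depth:
            continue  # do not expand beyond the depth limit
        for nxt in adj.get(vertex, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # goal not reachable (within max_depth)


def out_degree(adj, v):
    """Number of vertices v follows."""
    return len(adj.get(v, ()))


def in_degree(rev, v):
    """Number of vertices that follow v."""
    return len(rev.get(v, ()))


def followback_rate(adj, rev, v):
    """Fraction of v's followers that v follows back (0.0 if no followers)."""
    followers = set(rev.get(v, ()))
    if not followers:
        return 0.0
    return len(followers & set(adj.get(v, ()))) / len(followers)


# Toy graph (made up): a->b, a->c, b->c, c->a, d->a
adj = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
rev = {"b": ["a"], "c": ["a", "b"], "a": ["c", "d"]}

dist_b_to_a = shortest_distance(adj, "b", "a")   # b -> c -> a
dist_a_to_d = shortest_distance(adj, "a", "d")   # nothing points to d
rate_a = followback_rate(adj, rev, "a")          # a follows back c but not d
```

The `max_depth` cutoff is there because an unbounded search on a large graph may be slow; whether a cutoff of two or three hops is enough is an open question.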
>>
>> Update:
>> - Last night I started thinking about this as a graph theory problem and
>> started researching techniques.  This section seemed useful for getting
>> started:
>> http://en.wikipedia.org/wiki/Graph_theory#Graph-theoretic_data_structures
>> - The data provided by kaggle is basically an "incidence list".  Theo and I
>> discussed converting the provided data into a form that maps outbound vertices
>> to their list of inbound/target vertices, which it turns out is called an
>> "adjacency list".
>> - I wrote some ruby code last night to generate an adjacency list from the
>> original training data.  I dumped it to CSV format where the first column in
>> a row is the outbound vertex, and all following columns for a given row are
>> the list of inbound vertices pointed to by the outbound vertex's edges.  I
>> can upload that somewhere once we figure out the best spot to hand off
>> things like this...
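The conversion described above can be sketched in a few lines. This is a Python illustration, not the Ruby script mentioned in the message, and it assumes the training data can be read as (outbound, inbound) pairs; the function names and the sample edges are hypothetical.

```python
import csv
import io
from collections import defaultdict


def build_adj_list(edges):
    """Group (source, target) pairs into source -> [targets]."""
    adj = defaultdict(list)
    for source, target in edges:
        adj[source].append(target)
    return adj


def write_adj_csv(adj, out):
    """Write one row per source: source,target1,target2,..."""
    writer = csv.writer(out, lineterminator="\n")
    for source in sorted(adj):
        writer.writerow([source] + adj[source])


# Made-up edge pairs standing in for the training data.
edges = [(1, 2), (1, 3), (2, 3), (3, 1)]
adj = build_adj_list(edges)

buf = io.StringIO()  # a real run would open a file instead
write_adj_csv(adj, buf)
csv_text = buf.getvalue()
```

Rows end up with differing numbers of columns, so a reader of the file should not assume a fixed width.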
>>
>> So what wonderful ideas were happening on the other side of the room prior
>> to Erin's talk?
>>
>> Jared-
>>
>>
>> On Wed, Nov 10, 2010 at 2:32 PM, Joe Hale <joe at jjhale.com> wrote:
>>
>>> Hey,
>>>
>>> I'll be going along to Noisebridge at 7.30 and will start having a look
>>> at the social network data in the 45 min before Erin's talk.
>>>
>>> Laters,
>>>
>>> Joe
>>>
>>>
>>> On 10 November 2010 13:19, Mike Schachter <mike at mindmech.com> wrote:
>>>
>>>> Awesome everyone! Just so you know, I won't be in tonight or
>>>> next week, please keep me informed via email list and wiki about
>>>> what's going on if you can,
>>>>
>>>>   mike
>>>>
>>>>
>>>>
>>>> On Wed, Nov 10, 2010 at 11:44 AM, Shahin Saneinejad <ssaneine at gmail.com> wrote:
>>>>
>>>>> Hey, I'd really like to help but there's no way I can make it to the
>>>>> meeting tonight. My schedule's otherwise flexible in case everyone's open to
>>>>> meeting at a different time this week for the competition. If not, maybe I
>>>>> can catch up via project wiki notes or something.
>>>>>
>>>>> Shahin
>>>>>
>>>>>
>>>>> On Wed, Nov 10, 2010 at 11:11 AM, mnsqerr <mnsqerr at webmail.co.za> wrote:
>>>>>
>>>>>> Mike,
>>>>>> This sounds really fun.  Let's do it!
>>>>>>
>>>>>>
>>>>>> The link you posted is not working for me, here is a working link:
>>>>>> http://kaggle.com/component/taskmaster/?view=competition&task_id=2464
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -Erin
>>>>>>
>>>>>> _______________________________________________
>>>>>> ml mailing list
>>>>>> ml at lists.noisebridge.net
>>>>>> https://www.noisebridge.net/mailman/listinfo/ml
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>