I've also thrown some code up on <a href="http://github.com/voberoi/hadoop-mrutils">http://github.com/voberoi/hadoop-mrutils</a> for the workshop tonight. There are a couple of example Python streaming/Pig scripts and Pig UDFs in addition to instructions on how to get up and running with Amazon's Elastic MapReduce.<div>
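For anyone who hasn't used Hadoop streaming before, here is a minimal sketch of the map/sort/reduce pattern that a Python streaming job follows. It is not taken from the repo above: the word-count task and the names mapper, reducer, and run_local are made up for illustration, and it simulates the shuffle step in a single process rather than running on a cluster.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit a (word, 1) pair per whitespace-separated token.

    A real streaming mapper would read lines from stdin and print
    tab-separated key/value pairs instead of yielding tuples.
    """
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(pairs):
    """Sum the counts for each key.

    Assumes the pairs arrive sorted by key, which is what the sort
    step between map and reduce provides.
    """
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> sort -> reduce in one process (illustrative helper)."""
    mapped = sorted(mapper(lines))
    return dict(reducer(mapped))


if __name__ == "__main__":
    sample = ["hadoop pig hadoop", "pig streaming"]  # made-up input
    print(run_local(sample))
```

In an actual streaming job the mapper and reducer would typically be two separate scripts that read stdin and write tab-separated key/value lines to stdout; the local helper here only stands in for the framework's sort between the two phases.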


<br></div><div>If you have a moment to poke around the code, that'd be great!<div><div><div><br></div><div><div>Cheers,</div><div>Vikram<br><div><br><div class="gmail_quote">On Wed, May 19, 2010 at 4:25 PM, Andreas von Hessling <span dir="ltr">&lt;<a href="mailto:vonhessling@gmail.com">vonhessling@gmail.com</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">Hi all,<br>

<br>

For the discussion tonight it will be helpful if everybody could read<br>

through the KDD data format. It's fairly technical and is not<br>

trivial, so instead of spending time rehashing it during the meeting<br>

it would be great if we could all be on the same page.<br>

<br>

<a href="https://pslcdatashop.web.cmu.edu/KDDCup/rules_data_format.jsp" target="_blank">https://pslcdatashop.web.cmu.edu/KDDCup/rules_data_format.jsp</a><br>

<br>

Deadline for the challenge is June 8th, so we need to move fast if we<br>

are to submit an entry.<br>

<br>

Looking forward to tonight.<br>

<div><div></div><div class="h5"><br>

<br>

On Tue, May 18, 2010 at 8:52 AM, Andreas von Hessling<br>

&lt;<a href="mailto:vonhessling@gmail.com">vonhessling@gmail.com</a>&gt; wrote:<br>

> Mike,<br>

> we haven't actually gotten far in running algorithms yet.  To this<br>

> point you're the only one working on dimensionality reduction.  I say<br>

> go for it; knock yourself out.  It will be good just to get a sense<br>

> of where we should focus our energy.<br>

><br>

> BTW I'll put up a description of how to set up Weka with this dataset<br>

> soon.  There are some NN algorithms right in there...<br>

><br>

> Andy<br>

><br>

><br>

><br>

><br>

> On Mon, May 17, 2010 at 9:31 PM, Mike Schachter &lt;<a href="mailto:mike@mindmech.com">mike@mindmech.com</a>&gt; wrote:<br>

>> Hey everyone!<br>

>><br>

>> Just got back the other day and am looking forward to meeting up Wednesday<br>

>> and hearing about Hadoop. I just read through a bit of the KDD challenge, and<br>

>> was wondering if I could help out by doing something involving neural nets?<br>

>><br>

>> Neural nets can be made good at generalization and prediction, and also at<br>

>> reducing problem dimensionality by clustering. For example, we could<br>

>> cluster the input records into groups, and pass that group data into an SVM<br>

>> or something. Or we could use some sort of dimensionality-reducing network<br>

>> and pass the dimensionally reduced dataset to a Bayesian learner (which<br>

>> wouldn't work well if the data were high-dimensional).<br>

>><br>

>> If someone was already thinking of doing this I'd be happy to help out; I<br>

>> can't glean much of what happened from the meeting notes.<br>

>><br>

>> See you Wednesday!<br>

>><br>

>>   mike<br>

>><br>

>><br>

>><br>

>> On Wed, May 12, 2010 at 10:05 PM, Thomas Lotze &lt;<a href="mailto:thomas.lotze@gmail.com">thomas.lotze@gmail.com</a>&gt;<br>

>> wrote:<br>

>>><br>

>>> Hello, all!  There was a good meeting today where we talked about the KDD<br>

>>> dataset and plans for the next steps.  I think it'll be a really good<br>

>>> opportunity for learning new tools and methods in machine learning, trading<br>

>>> knowledge and upping our collective ability!  We've got plans to look at R,<br>

>>> libsvm, Weka, and Hadoop to tackle the problem.  I'm excited about working<br>

>>> with it, and anyone else who wants to get involved should email me, download<br>

>>> the data, and take a look at the wiki page where I've put our initial plans:<br>

>>><br>

>>> <a href="https://www.noisebridge.net/wiki/KDD_Competition_2010" target="_blank">https://www.noisebridge.net/wiki/KDD_Competition_2010</a><br>

>>><br>

>>><br>

>>> Next week, Vikram will be presenting Hadoop, with some scripts and tools<br>

>>> to actually use it -- I think we're all aware of how important Hadoop<br>

>>> already is and will continue to be in the future for analyzing large data<br>

>>> sets, so I'm really glad that we've now got someone who knows about it and<br>

>>> is willing to tell us more!  I think this is a really great opportunity, and<br>

>>> many thanks to Vikram for presenting!<br>

>>><br>

>>><br>

>>> Best wishes,<br>

>>> Thomas<br>

>>><br>

>>> _______________________________________________<br>

>>> ml mailing list<br>

>>> <a href="mailto:ml@lists.noisebridge.net">ml@lists.noisebridge.net</a><br>

>>> <a href="https://www.noisebridge.net/mailman/listinfo/ml" target="_blank">https://www.noisebridge.net/mailman/listinfo/ml</a><br>

>>><br>

>><br>

>><br>


>><br>

>><br>

><br>


</div></div></blockquote></div><br></div></div></div></div></div></div>