[ml] Hadoop going forward

Fri May 21 07:24:37 UTC 2010

Yes, I see what you're saying.  Especially since ML members are
already running your setup instructions successfully -- let's go with
individual installations for now.  I'm really looking forward to
running ML algorithms on Hadoop.

On Thu, May 20, 2010 at 5:37 PM, Vikram Oberoi <voberoi at gmail.com> wrote:
> Hey Andreas,
> That's a great idea -- I'll work on that for next week.
> As for having a common AWS account for all things ML at Noisebridge, I think
> it would be better for everyone to have individual accounts for a couple of
> reasons:
> - You can only run one job at a time on EMR. If you have a few people
> testing out hypotheses/running a few different algorithms at the same time,
> it'll become a point of contention and kill productivity.
> - AWS allows a user to provision 20 machines at most. If we have, say, 10
> machines per cluster, that's only 2 clusters we can do things on at any
> given time.
> - There's the payment issue. I'm not concerned that people/NB won't be
> willing to contribute to our we-need-machines fund, but I am concerned about
> who foots the bill when things go awry. What if we provision a high-end
> cluster that we're all working with one day and all of us forget to kill it
> for a week? Or, what if a bug in one of our scripts causes us to use a ton
> of incoming/outgoing S3 bandwidth? There are a bunch of things that can go
> wrong and cause us to accrue some major AWS costs, and that's when things
> get ugly.
> Finally, it's actually rather easy to set up your own environment where you
> can easily spin up clusters, launch jobs, and fetch results. All it takes is
> 20 minutes of (annoying) upfront work and you're good to go. With some
> better documentation, I can probably have you guys up and running in 5-10
> minutes, and I'll work on doing that.
> Thoughts?
> Vikram
> On Thu, May 20, 2010 at 11:56 AM, Andreas von Hessling
> <vonhessling at gmail.com> wrote:
>>
>> Vikram,
>>
>> From my perspective you could contribute the most in setting up a
>> Hadoop + Mahout infrastructure and documenting the setup process and
>> the hello-world MapReduce program etc.  While we went through this
>> yesterday (thanks), I feel like people will only get to DO the
>> things they learned later, so a written reference (new wiki page)
>> would be great, because these questions will be asked over and over.
>> Even better, and this is just an idea:  can we set up a shared AWS
>> account so each of us doesn't have to install everything individually?
>> I know there's the question of who pays for it, but that aside, are
>> there technical restrictions that would keep us from sharing an
>> account?  One approach would be each of us throws in $10, or perhaps
>> there's a way to split the bills between us according to usage, or,
>> even better, we could push Noisebridge Inc to give us some allowance.
>> Getting a
>> turnkey cloud Mahout infrastructure for Noisebridge would be H-U-G-E,
>> even if it would not be ready in time for KDD submission.  Feel free
>> to take the lead on that initiative.  You would go down in the history
>> books of NB as a hero :-)
>>
>> Erin and Mike are already working on transforming the data, so I think
>> we already have lots of manpower on that end.
>>
>> Let's tentatively plan this Sunday night to get together again.  Erin
>> also mentioned she'd like to meet again before the next Wednesday.  I
>> can give an impromptu talk about classifiers/machine learning problem
>> setups.
>> Will confirm.
>>
>> Andy
>>
>>
>>
>> On Wed, May 19, 2010 at 11:38 PM, Vikram Oberoi <voberoi at gmail.com> wrote:
>> > Hey folks,
>> > For those of you that came out tonight, I hope the code I walked
>> > through and initial (albeit rough) overview of MapReduce helped. If
>> > you guys have any questions or requests, the best way to ask would be
>> > to:
>> > a) direct an email to me over ml at lists.noisebridge.net or...
>> > b) open an issue at the GitHub
>> > project: http://github.com/voberoi/hadoop-mrutils
>> > Either way, someone else might be able to answer first and everyone
>> > will benefit from the answer, as there's a high probability that
>> > everyone will have the same questions.
>> > For next week, I'm going to write a script that transforms the KDD
>> > dataset in... some useful way. Your guys' input on what exactly I
>> > should do here is most welcome. The transformation should be involved
>> > enough that the code can serve as an example for scripts you all
>> > might implement later.
>> > I'll also be taking a look at Apache Mahout (a library containing
>> > Hadoop MapReduce implementations of numerous machine learning
>> > algorithms) and writing up an example of how to use it. If you have a
>> > particular algorithm that you want to apply to the dataset, check if
>> > it's in the Mahout library and let me know.
>> > Finally, is any brainstorming/discussion about what we're doing
>> > happening anywhere other than the meetups? I'd be happy to meet again
>> > some time before next Wednesday to hash out some ideas and run with
>> > them, as in-person conversation bandwidth is *so* much higher.
>> > Alternately, we could throw out ideas on the list and brainstorm over
>> > email threads. It doesn't seem like there's a whole lot of action on
>> > the wiki other than links to resources and TODOs. Or is there?
>> > Vikram