[ml] Hadoop going forward

Vikram Oberoi voberoi at gmail.com
Fri May 21 00:37:23 UTC 2010

Hey Andreas,

That's a great idea -- I'll work on that for next week.

As for having a common AWS account for all things ML at Noisebridge, I think
it would be better for everyone to have individual accounts for a couple of
reasons:

- You can only run one job at a time on EMR. If you have a few people
testing out hypotheses/running a few different algorithms at the same time,
it'll become a point of contention and kill productivity.

- AWS allows a user to provision 20 machines at most. If we have, say, 10
machines per cluster, that's only 2 clusters we can do things on at any
given time.

- There's the payment issue. I'm not concerned that people/NB won't be
willing to contribute to our we-need-machines fund, but I am concerned about
who foots the bill when things go awry. What if we provision a high-end
cluster that we're all working with one day and all of us forget to kill it
for a week? Or, what if a bug in one of our scripts causes us to use a ton
of incoming/outgoing S3 bandwidth? There are a bunch of things that can go
wrong and cause us to accrue some major AWS costs, and that's when things
get ugly.
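
To put rough numbers on the last two points, here's a back-of-the-envelope
sketch in Python. The 20-machine cap and the 10-machine cluster size are the
figures from above; the hourly rate is a made-up placeholder, not an actual
AWS price, so plug in whatever the instance type you'd use really costs.

```python
# Back-of-the-envelope numbers for a shared AWS account.
# INSTANCE_LIMIT and CLUSTER_SIZE are the figures quoted in this email.
# HOURLY_RATE_USD is a hypothetical placeholder, NOT a real AWS price.

INSTANCE_LIMIT = 20     # machines one AWS account may provision
CLUSTER_SIZE = 10       # machines per cluster in the example above
HOURLY_RATE_USD = 0.50  # assumed per-machine hourly cost (illustrative only)


def max_concurrent_clusters(instance_limit, cluster_size):
    """How many full clusters fit under the per-account machine cap."""
    return instance_limit // cluster_size


def forgotten_cluster_cost(cluster_size, hourly_rate, days):
    """Cost of leaving a cluster running around the clock for `days` days."""
    return cluster_size * hourly_rate * 24 * days


if __name__ == "__main__":
    clusters = max_concurrent_clusters(INSTANCE_LIMIT, CLUSTER_SIZE)
    cost = forgotten_cluster_cost(CLUSTER_SIZE, HOURLY_RATE_USD, days=7)
    print("clusters available at once:", clusters)
    print("one forgotten week at the assumed rate: $%.2f" % cost)
```

This ignores S3 bandwidth and any EMR surcharge, so treat it as a lower bound
on what a forgotten cluster would cost.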

Finally, it's actually rather easy to set up your own environment where you
can spin up clusters, launch jobs, and fetch results. All it takes is
20 minutes of (annoying) upfront work and you're good to go. With some
better documentation, I can probably have you guys up and running in 5 - 10
minutes, and I'll work on doing that.



On Thu, May 20, 2010 at 11:56 AM, Andreas von Hessling <
vonhessling at gmail.com> wrote:

> Vikram,
> From my perspective you could contribute the most in setting up a
> Hadoop + Mahout infrastructure and documenting the setup process and
> the hello-world mapreduce program etc.  While we went through this
> yesterday (thanks) I feel like people will actually get to DO the
> things they learned later; so a written reference (new wiki page)
> would be great, because these questions will be asked over and over.
> Even better, and this is just an idea:  can we set up a shared AWS
> account so each of us doesn't have to install everything by himself?  I
> know there's the question of who pays for it, but that aside, are
> there technical restrictions why we could not share an account?  One
> approach would be each of us throws in $10, or perhaps there's a way to
> split the bills between us according to usage, or, even better we
> could push Noisebridge Inc to give us some allowance.  Getting a
> turnkey cloud Mahout infrastructure for Noisebridge would be H-U-G-E,
> even if it would not be ready in time for KDD submission.  Feel free
> to take the lead on that initiative.  You would go down in the history
> books of NB as a hero :-)
> Erin and Mike are already working on transforming the data, so I think
> we have already lots of manpower on that end.
> Let's tentatively plan this Sunday night to get together again.  Erin
> also mentioned she'd like to meet again before the next Wednesday.  I
> can give an impromptu talk about classifiers/machine learning problem
> setups.
> Will confirm.
> Andy
> On Wed, May 19, 2010 at 11:38 PM, Vikram Oberoi <voberoi at gmail.com> wrote:
> > Hey folks,
> > For those of you that came out tonight, I hope the code I walked through and
> > initial (albeit rough) overview of MapReduce helped. If you guys have any
> > questions or requests, the best way to ask would be to:
> > a) direct an email to me over ml at lists.noisebridge.net or...
> > b) open an issue at the Github
> > project: http://github.com/voberoi/hadoop-mrutils
> > Both of these ways someone else might be able to answer first and everyone
> > will benefit from the answer, as there's a high probability that everyone
> > will have the same questions.
> > For next week, I'm going to write a script that transforms the KDD dataset
> > in... some useful way. Your guys' input on what exactly I should do here is
> > most welcome. The transformation should be involved enough that the code can
> > serve as an example for scripts you all might implement later.
> > I'll also be taking a look at Apache Mahout (a library containing Hadoop
> > MapReduce implementations of numerous machine learning algorithms) and
> > writing up an example of how to use it. If you have a particular algorithm
> > that you want to apply to the dataset, check if it's in the Mahout library
> > and let me know.
> > Finally, is any brainstorming/discussion about what we're doing happening
> > anywhere other than the meetups? I'd be happy to meet again some time before
> > next Wednesday to hash out some ideas and run with them, as in-person
> > conversation bandwidth is *so* much higher. Alternately, we could throw out
> > ideas on the list and brainstorm over email threads. It doesn't seem like
> > there's a whole lot of action on the wiki other than links to resources and
> > TODOs. Or is there?
> > Vikram
> > _______________________________________________
> > ml mailing list
> > ml at lists.noisebridge.net
> > https://www.noisebridge.net/mailman/listinfo/ml
> >
> >