[ml] Hadoop going forward

Andreas von Hessling vonhessling at gmail.com
Sun May 23 01:10:45 UTC 2010

An idea:
Could we upload our data to a central S3 location that each member of
our ML group could access from EC2?  Or would that incur costs?  In
other words, could we re-use the raw and pre-processed data among us
without incurring cost?  It seems funny that each of us pre-processes
the data individually and we share this via USB stick :-)
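
A minimal sketch of how the shared-location idea could look, assuming one
member owns the bucket and grants the others read-only access through a
bucket policy. The bucket name `nb-ml-kdd-data`, the account IDs, and the
helper `shared_read_policy` are all hypothetical; nothing in this thread
specifies them. The sketch only builds the policy document and does not
address the cost question.

```python
import json


def shared_read_policy(bucket, account_ids):
    """Build an S3 bucket policy granting read-only access to other AWS accounts.

    `bucket` and `account_ids` are placeholders chosen for illustration.
    """
    principals = [f"arn:aws:iam::{acct}:root" for acct in account_ids]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Lets the listed accounts list the keys in the bucket.
                "Sid": "ListSharedBucket",
                "Effect": "Allow",
                "Principal": {"AWS": principals},
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Lets the listed accounts download objects, but not write.
                "Sid": "ReadSharedObjects",
                "Effect": "Allow",
                "Principal": {"AWS": principals},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket name and account IDs.
    policy = shared_read_policy("nb-ml-kdd-data", ["111111111111", "222222222222"])
    print(json.dumps(policy, indent=2))
```

The bucket owner would attach the printed JSON to the bucket; each member
would then read the raw and pre-processed data with their own credentials,
which fits the individual-accounts setup discussed below.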


On Fri, May 21, 2010 at 12:24 AM, Andreas von Hessling
<vonhessling at gmail.com> wrote:
> Yes, I see what you're saying.  Especially since we are having ML
> members successfully run your setup instructions -- let's go with
> individual installations for now.  I'm really looking forward to
> running ML algorithms on Hadoop.
> On Thu, May 20, 2010 at 5:37 PM, Vikram Oberoi <voberoi at gmail.com> wrote:
>> Hey Andreas,
>> That's a great idea -- I'll work on that for next week.
>> As for having a common AWS account for all things ML at Noisebridge, I think
>> it would be better for everyone to have individual accounts for a couple of
>> reasons:
>> - You can only run one job at a time on EMR. If you have a few people
>> testing out hypotheses/running a few different algorithms at the same time,
>> it'll become a point of contention and kill productivity.
>> - AWS allows a user to provision 20 machines at most. If we have, say, 10
>> machines per cluster, that's only 2 clusters we can do things on at any
>> given time.
>> - There's the payment issue. I'm not concerned that people/NB won't be
>> willing to contribute to our we-need-machines fund, but I am concerned about
>> who foots the bill when things go awry. What if we provision a high-end
>> cluster that we're all working with one day and all of us forget to kill it
>> for a week? Or, what if a bug in one of our scripts causes us to use a ton
>> of incoming/outgoing S3 bandwidth? There are a bunch of things that can go
>> wrong and cause us to accrue some major AWS costs, and that's when things
>> get ugly.
>> Finally, it's actually rather easy to set up your own environment where you
>> can easily spin up clusters, launch jobs, and fetch results. All it takes is
>> 20 minutes of (annoying) upfront work and you're good to go. With some
>> better documentation, I can probably have you guys up and running in 5 - 10
>> minutes, and I'll work on doing that.
>> Thoughts?
>> Vikram
>> On Thu, May 20, 2010 at 11:56 AM, Andreas von Hessling
>> <vonhessling at gmail.com> wrote:
>>> Vikram,
>>> From my perspective you could contribute the most in setting up a
>>> Hadoop + Mahout infrastructure and documenting the setup process and
>>> the hello-world mapreduce program etc.  While we went through this
>>> yesterday (thanks) I feel like people will actually get to DO the
>>> things they learned later; so a written reference (new wiki page)
>>> would be great, because these questions will be asked over and over.
>>> Even better, and this is just an idea:  can we set up a shared AWS
>>> account so each of us doesn't have to install everything by himself?  I
>>> know there's the question of who pays for it, but that aside, are
>>> there technical restrictions why we could not share an account?  One
>>> approach would be each of us throws in $10, or perhaps there's a way to
>>> split the bills between us according to usage, or, even better we
>>> could push Noisebridge Inc to give us some allowance.  Getting a
>>> turnkey cloud Mahout infrastructure for Noisebridge would be H-U-G-E,
>>> even if it would not be ready in time for KDD submission.  Feel free
>>> to take the lead on that initiative.  You would go down in the history
>>> books of NB as a hero :-)
>>> Erin and Mike are already working on transforming the data, so I think
>>> we have already lots of manpower on that end.
>>> Let's tentatively plan this Sunday night to get together again.  Erin
>>> also mentioned she'd like to meet again before the next Wednesday.  I
>>> can give an impromptu talk about classifiers/machine learning problem
>>> setups.
>>> Will confirm.
>>> Andy
>>> On Wed, May 19, 2010 at 11:38 PM, Vikram Oberoi <voberoi at gmail.com> wrote:
>>> > Hey folks,
>>> > For those of you that came out tonight, I hope the code I walked through
>>> > and initial (albeit rough) overview of MapReduce helped. If you guys have
>>> > any questions or requests, the best way to ask would be to:
>>> > a) direct an email to me over ml at lists.noisebridge.net or...
>>> > b) open an issue at the Github project: http://github.com/voberoi/hadoop-mrutils
>>> > Both of these ways someone else might be able to answer first and everyone
>>> > will benefit from the answer, as there's a high probability that everyone
>>> > will have the same questions.
>>> > For next week, I'm going to write a script that transforms the KDD dataset
>>> > in... some useful way. Your guys' input on what exactly I should do here is
>>> > most welcome. The transformation should be involved enough that the code
>>> > can serve as an example for scripts you all might implement later.
>>> > I'll also be taking a look at Apache Mahout (a library containing Hadoop
>>> > MapReduce implementations of numerous machine learning algorithms) and
>>> > writing up an example of how to use it. If you have a particular algorithm
>>> > that you want to apply to the dataset, check if it's in the Mahout library
>>> > and let me know.
>>> > Finally, is any brainstorming/discussion about what we're doing happening
>>> > anywhere other than the meetups? I'd be happy to meet again some time
>>> > before next Wednesday to hash out some ideas and run with them, as
>>> > in-person conversation bandwidth is *so* much higher. Alternately, we
>>> > could throw out ideas on the list and brainstorm over email threads. It
>>> > doesn't seem like there's a whole lot of action on the wiki other than
>>> > links to resources and TODOs. Or is there?
>>> > Vikram
>>> > _______________________________________________
>>> > ml mailing list
>>> > ml at lists.noisebridge.net
>>> > https://www.noisebridge.net/mailman/listinfo/ml
>>> >
>>> >
