Ah yeah, sharing all this data via USB stick is not ideal. :-/<div><br></div><div>It appears that S3 has access controls. Take a look at this post over here: <a href="http://stackoverflow.com/questions/1529869/making-files-uploaded-to-s3-public">http://stackoverflow.com/questions/1529869/making-files-uploaded-to-s3-public</a>.</div>
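<div><br></div><div>For sharing with just our group rather than making files fully public, here is a rough sketch of one possible approach. It assumes an S3 bucket policy is used instead of the per-object ACLs discussed in the linked post. The bucket name, the account IDs, and the helper function name are all made-up placeholders, and this only builds the policy document; it does not talk to AWS.</div>
<pre>
```python
import json


def shared_read_policy(bucket, account_ids):
    """Build a bucket policy letting the given AWS accounts list and read a bucket.

    Hypothetical helper for illustration only; nothing here is applied to AWS.
    """
    # Each account is named by its root ARN, so any user the account owner
    # authorizes inside that account can be granted access.
    principals = [f"arn:aws:iam::{account}:root" for account in account_ids]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Sid": "GroupList",
                "Effect": "Allow",
                "Principal": {"AWS": principals},
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Reading applies to the objects inside the bucket.
                "Sid": "GroupRead",
                "Effect": "Allow",
                "Principal": {"AWS": principals},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


# Placeholder bucket name and 12-digit account IDs, not real ones.
policy = shared_read_policy("example-ml-kdd-data", ["111111111111", "222222222222"])
print(json.dumps(policy, indent=2))
```
</pre>
<div>The printed JSON would then be attached to the shared bucket by whoever owns it, through the AWS console or an SDK. Each member keeps an individual account and only gets read access to the data.</div>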


<div><br></div><div>Vikram</div><div><div><br><div class="gmail_quote">On Sat, May 22, 2010 at 6:10 PM, Andreas von Hessling <span dir="ltr"><<a href="mailto:vonhessling@gmail.com">vonhessling@gmail.com</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">An idea:<br>

Could we upload our data to a central S3 location that each member of<br>

our ML group could access from EC2?  Or would that incur costs?  In<br>

other words, could we re-use the raw and pre-processed data among us<br>

without incurring cost?  It seems funny that each of us pre-processes<br>

the data individually and we share this via USB stick :-)<br>

<br>

Andy<br>

<br>

<br>

On Fri, May 21, 2010 at 12:24 AM, Andreas von Hessling<br>

<div><div></div><div class="h5"><<a href="mailto:vonhessling@gmail.com">vonhessling@gmail.com</a>> wrote:<br>

> Yes, I see what you're saying.  Especially since we are having ML<br>

> members successfully run your setup instructions -- let's go with<br>

> individual installations for now.  I'm really looking forward to<br>

> running ML algorithms on Hadoop.<br>

><br>

> On Thu, May 20, 2010 at 5:37 PM, Vikram Oberoi <<a href="mailto:voberoi@gmail.com">voberoi@gmail.com</a>> wrote:<br>

>> Hey Andreas,<br>

>> That's a great idea -- I'll work on that for next week.<br>

>> As for having a common AWS account for all things ML at Noisebridge, I think<br>

>> it would be better for everyone to have individual accounts for a couple of<br>

>> reasons:<br>

>> - You can only run one job at a time on EMR. If you have a few people<br>

>> testing out hypotheses/running a few different algorithms at the same time,<br>

>> it'll become a point of contention and kill productivity.<br>

>> - AWS allows a user to provision 20 machines at most. If we have, say, 10<br>

>> machines per cluster, that's only 2 clusters we can do things on at any<br>

>> given time.<br>

>> - There's the payment issue. I'm not concerned that people/NB won't be<br>

>> willing to contribute to our we-need-machines fund, but I am concerned about<br>

>> who foots the bill when things go awry. What if we provision a high-end<br>

>> cluster that we're all working with one day and all of us forget to kill it<br>

>> for a week? Or, what if a bug in one of our scripts causes us to use a ton<br>

>> of incoming/outgoing S3 bandwidth? There are a bunch of things that can go<br>

>> wrong and cause us to accrue some major AWS costs, and that's when things<br>

>> get ugly.<br>

>> Finally, it's actually rather easy to set up your own environment where you<br>

>> can easily spin up clusters, launch jobs, and fetch results. All it takes is<br>

>> 20 minutes of (annoying) upfront work and you're good to go. With some<br>

>> better documentation, I can probably have you guys up and running in 5 - 10<br>

>> minutes, and I'll work on doing that.<br>

>> Thoughts?<br>

>> Vikram<br>

>> On Thu, May 20, 2010 at 11:56 AM, Andreas von Hessling<br>

>> <<a href="mailto:vonhessling@gmail.com">vonhessling@gmail.com</a>> wrote:<br>

>>><br>

>>> Vikram,<br>

>>><br>

>>> From my perspective you could contribute the most in setting up a<br>

>>> Hadoop + Mahout infrastructure and documenting the setup process and<br>

>>> the hello-world mapreduce program etc.  While we went through this<br>

>>> yesterday (thanks) I feel like people will actually get to DO the<br>

>>> things they learned later; so a written reference (new wiki page)<br>

>>> would be great, because these questions will be asked over and over.<br>

>>> Even better, and this is just an idea:  can we set up a shared AWS<br>

>>> account so each of us doesn't have to install everything by himself?  I<br>

>>> know there's the question of who pays for it, but that aside, are<br>

>>> there technical restrictions why we could not share an account?  One<br>

>>> approach would be each of us throws in $10, or perhaps there's a way to<br>

>>> split the bills between us according to usage, or, even better we<br>

>>> could push Noisebridge Inc to give us some allowance.  Getting a<br>

>>> turnkey cloud Mahout infrastructure for Noisebridge would be H-U-G-E,<br>

>>> even if it would not be ready in time for KDD submission.  Feel free<br>

>>> to take the lead on that initiative.  You would go down in the history<br>

>>> books of NB as a hero :-)<br>

>>><br>

>>> Erin and Mike are already working on transforming the data, so I think<br>

>>> we have already lots of manpower on that end.<br>

>>><br>

>>> Let's tentatively plan this Sunday night to get together again.  Erin<br>

>>> also mentioned she'd like to meet again before the next Wednesday.  I<br>

>>> can give an impromptu talk about classifiers/machine learning problem<br>

>>> setups.<br>

>>> Will confirm.<br>

>>><br>

>>> Andy<br>

>>><br>

>>><br>

>>><br>

>>> On Wed, May 19, 2010 at 11:38 PM, Vikram Oberoi <<a href="mailto:voberoi@gmail.com">voberoi@gmail.com</a>> wrote:<br>

>>> > Hey folks,<br>

>>> > For those of you that came out tonight, I hope the code I walked through<br>

>>> > and<br>

>>> > initial (albeit rough) overview of MapReduce helped. If you guys have<br>

>>> > any<br>

>>> > questions or requests, the best way to ask would be to:<br>

>>> > a) direct an email to me over <a href="mailto:ml@lists.noisebridge.net">ml@lists.noisebridge.net</a> or...<br>

>>> > b) open an issue at the Github<br>

>>> > project: <a href="http://github.com/voberoi/hadoop-mrutils" target="_blank">http://github.com/voberoi/hadoop-mrutils</a><br>

>>> > Both of these ways someone else might be able to answer first and<br>

>>> > everyone<br>

>>> > will benefit from the answer, as there's a high probability that<br>

>>> > everyone<br>

>>> > will have the same questions.<br>

>>> > For next week, I'm going to write a script that transforms the KDD<br>

>>> > dataset<br>

>>> > in... some useful way. Your guys' input on what exactly I should do here<br>

>>> > is<br>

>>> > most welcome. The transformation should be involved enough that the code<br>

>>> > can<br>

>>> > serve as an example for scripts you all might implement later.<br>

>>> > I'll also be taking a look at Apache Mahout (a library containing Hadoop<br>

>>> > MapReduce implementations of numerous machine learning algorithms) and<br>

>>> > writing up an example of how to use it. If you have a particular<br>

>>> > algorithm<br>

>>> > that you want to apply to the dataset, check if it's in the Mahout<br>

>>> > library<br>

>>> > and let me know.<br>

>>> > Finally, is any brainstorming/discussion about what we're doing<br>

>>> > happening<br>

>>> > anywhere other than the meetups? I'd be happy to meet again some time<br>

>>> > before<br>

>>> > next Wednesday to hash out some ideas and run with them, as in-person<br>

>>> > conversation bandwidth is *so* much higher. Alternately, we could throw<br>

>>> > out<br>

>>> > ideas on the list and brainstorm over email threads. It doesn't seem<br>

>>> > like<br>

>>> > there's a whole lot of action on the wiki other than links to resources<br>

>>> > and<br>

>>> > TODOs. Or is there?<br>

>>> > Vikram<br>

>>> > _______________________________________________<br>

>>> > ml mailing list<br>

>>> > <a href="mailto:ml@lists.noisebridge.net">ml@lists.noisebridge.net</a><br>

>>> > <a href="https://www.noisebridge.net/mailman/listinfo/ml" target="_blank">https://www.noisebridge.net/mailman/listinfo/ml</a><br>

>>> ><br>

>>> ><br>

>><br>

>><br>

><br>

</div></div></blockquote></div><br></div></div>