[ml] Volunteer needed: computing Chance feature values using EC2

Thomas Lotze thomas.lotze at gmail.com
Sat Jun 5 21:10:56 UTC 2010


Andreas,

With proper indexing, I think we can do this in approximately 508,912 +
1,259,273 lookups (rather than their product).  Which is to say, I think I
can figure out how to put this together; do you have SQL dumps available or
an SQL server I can access?
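Roughly what I have in mind, sketched with Python's sqlite3 on toy rows (the table names come from your message; the column names are my guesses):

```python
import sqlite3

# Toy in-memory versions of the algebra tables; the real schemas may differ.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE atest   (step TEXT, data TEXT);
CREATE TABLE achance (step TEXT, chance REAL, strength INTEGER);
""")
con.executemany("INSERT INTO atest VALUES (?, ?)",
                [("step1", "some,data,blah"), ("step99", "more,data,blubb")])
con.executemany("INSERT INTO achance VALUES (?, ?, ?)",
                [("step1", 0.92, 260), ("step99", 0.25, 44)])

# The index is what turns the join from ~|atest| * |achance| probes into
# one B-tree lookup per atest row.
con.execute("CREATE INDEX idx_achance_step ON achance(step)")

rows = con.execute("""
    SELECT t.step, t.data, c.chance, c.strength
    FROM atest t JOIN achance c ON t.step = c.step
""").fetchall()
print(rows)
```

The same kind of CREATE INDEX on the real achance/bchance tables should turn each assignment into a single indexed join per dataset.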

Cheers,
Thomas

P.S. This should not preclude anyone else from working on it, *especially*
if they want to put together a Hadoop/EC2 solution.
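For anyone going the Hadoop route, the core of the job is just a hash join; here is a plain-Python stand-in using the example rows from Andy's message below (simplified to comma-separated fields with no spaces):

```python
# Build a lookup table from the chance file: step name -> (chance, strength).
chance = {}
for line in ["step1,0.92,260", "step2,0.22,21", "step99,0.25,44"]:
    step, c, s = line.split(",")
    chance[step] = (float(c), int(s))

# Stream the step rows and append the matching value pair to each one.
steps = ["step1,some,data,blah",
         "step99,more,data,blubb",
         "step99,evenmore,data,blubb"]
output = []
for row in steps:
    step = row.split(",", 1)[0]
    c, s = chance[step]          # one O(1) lookup per row
    output.append(f"{row},{c},{s}")

print("\n".join(output))
```

In Hadoop terms this is a reduce-side join: map both files to (step, record) pairs and combine them per key in the reducer.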

On Sat, Jun 5, 2010 at 2:04 PM, Andreas von Hessling
<vonhessling at gmail.com>wrote:

> I have a task to hand out.  This would be a great opportunity for
> someone to apply what they've learned in Vikram's Hadoop/EC2 sessions.
> This is time-critical, and I don't think I can do it before
> Sunday/Monday, by which time it is needed.
>
> Definitions:
>
> Chance Feature: the percentage of times a unique problemstep is solved
> correctly.
> Chance Strength Feature: the number of times a particular unique
> problemstepname occurs.
>
> Together, they are meant to represent how easy or hard it is to get
> the step right.
>
> Problem:
> I have computed all values for Chance and Chance Strength for algebra
> and bridge.  The task is now to assign the (chance, strength) value
> pair back to each step (row) in our test and train datasets.  The
> issue is the speed at which this happens when I try to use SQL on my
> machines.
> Here's the order of magnitude of the data we're dealing with.
>
> The number of steps/rows:
>
> Algebra:
> sqlite> select count(*) from atest;
> 508,912
> sqlite> select count(*) from atrain;
> 8,918,054
>
> Bridge:
> sqlite> select count(*) from btest;
> 756,386
> sqlite> select count(*) from btrain;
> 20,012,498
>
>
> The number of chance/strength values:
>
> Algebra:
> sqlite> select count(*) from achance;
> count(*)
> 1,259,273
>
> Bridge:
> sqlite> select count(*) from bchance;
> count(*)
> 566,965
>
>
> So for the simplest case, putting the chance values into the algebra
> test set would require up to 508,912 * 1,259,273 lookups with a naive
> join.  I've tried splitting the problem into subproblems (smaller
> tables), but it still takes about 24 hours.  So plain SQL does not
> seem appropriate.
>
> It seems that this can be done with EC2 -- this looks like a problem
> analogous to our wordcount (hello-world) Hadoop example.  I can
> provide the data via FTP.
>
> Example:
> Input:
>
> steps:
> step1, some,data,blah
> ...
> step99, more,data,blubb
> step99, evenmore,data,blubb
> ...
> chance values:
> step1,0.92, 260
> step2,0.22, 21
> ...
> step99,0.25, 44
> ...
>
> Output:
> step1, some,data,blah,0.92, 260
> ...
> step99, more,data,blubb,0.25, 44
> step99, evenmore,data,blubb,0.25, 44
>
>
> Who wants to give it a try?
>
> Andy
>

