[ml] Volunteer needed: computing Chance feature values using EC2
Andreas von Hessling
vonhessling at gmail.com
Sat Jun 5 21:04:06 UTC 2010
I have a task to hand out. This would be a great opportunity for someone
to apply what they've learned in Vikram's Hadoop/EC2 sessions. The task
is time-critical, and I don't think I can do it before Sunday/Monday, by
which time it is needed.
Chance Feature: the percentage of unique problem steps that are solved
Chance Strength Feature: the number of times a particular unique
Together they are supposed to represent how easy/hard it is to get the
I have computed all values for Chance and Chance Strength for algebra
and bridge. The task now is to assign each (Chance, Chance Strength)
pair back to its step (row) in our test and train datasets. The problem
is how slowly this runs when I use SQL on my machines.
Here's the order of magnitude of the data we're dealing with.
The number of steps/rows:
sqlite> select count(*) from atest;
sqlite> select count(*) from atrain;
sqlite> select count(*) from btest;
sqlite> select count(*) from btrain;
The number of chance/strength values:
sqlite> select count(*) from achance;
sqlite> select count(*) from bchance;
So for the simplest case, putting chance values into algebra test
would require up to 508,912 * 1,259,273 lookups. I've tried splitting
the problem into subproblems (smaller tables), but it still takes
about 24 hours. So SQL is not appropriate.
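
To make the lookup count concrete: 508,912 * 1,259,273 is about 6.4 * 10^11 comparisons if every row scans every value. As a point of comparison only, here is a minimal Python sketch of a keyed join, where each row does a single dictionary lookup instead of a scan. The function names, the tuple layout, and the use of the step name as the join key are assumptions made for illustration; the sample values are taken from the example rows in this post. It also assumes the chance/strength table fits in memory, which has not been checked against the real data.

```python
# Hypothetical illustration: names and record layout are assumptions,
# not the actual schema of the atest/achance tables.

def worst_case_comparisons(n_a, n_b):
    """Comparisons needed if every row of one table scans the other."""
    return n_a * n_b


def attach_features(rows, features):
    """Attach (chance, strength) to each row with one dict lookup per row.

    rows     -- iterable of (step_key, other_fields) tuples
    features -- dict mapping step_key -> (chance, strength)
    Rows whose key has no entry get (None, None).
    """
    out = []
    for step_key, fields in rows:
        chance, strength = features.get(step_key, (None, None))
        out.append((step_key, fields, chance, strength))
    return out


# Sample values borrowed from the example rows in this post.
features = {"step1": (0.92, 260), "step99": (0.25, 44)}
rows = [
    ("step1", "some,data,blah"),
    ("step99", "more,data,blubb"),
    ("step99", "evenmore,data,blubb"),
]
joined = attach_features(rows, features)
```

With a dictionary, the work grows with the number of rows plus the number of values rather than with their product.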
It seems that this can be done with EC2 -- this seems like an
analogous problem to our wordcount (hello-world) Hadoop example. I
can provide the data via FTP.
step1, some,data,blah,0.92, 260
step99, more,data,blubb,0.25, 44
step99, evenmore,data,blubb,0.25, 44
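
For anyone picking this up, the rows above could be produced by a reduce-side join in the style of our wordcount example. The following is only a sketch in Python that simulates the map, sort, and reduce phases locally. The input layouts ("step_key,chance,strength" for the chance tables and "step_key, remaining columns" for the step tables), the function names, and the "0"/"1" tags are assumptions for illustration. The local sort on (key, tag) stands in for Hadoop's shuffle; on a real cluster the ordering within a key would need to be arranged separately, which is not shown here.

```python
from itertools import groupby
from operator import itemgetter


def map_feature(line):
    """Tag a chance/strength line. Assumed layout: step_key,chance,strength"""
    key, chance, strength = [part.strip() for part in line.split(",")]
    # Tag "0" sorts before "1", so the feature record leads its key group.
    return (key, "0", chance + ", " + strength)


def map_step(line):
    """Tag a step (row) line. Assumed layout: step_key,<remaining columns>"""
    key, rest = line.split(",", 1)
    return (key.strip(), "1", rest.strip())


def reduce_join(records):
    """Join records that are already sorted by (key, tag).

    Yields one output line per step row. Step rows whose key has no
    feature record are dropped in this sketch (an assumption).
    """
    for key, group in groupby(records, key=itemgetter(0)):
        feature = None
        for _, tag, payload in group:
            if tag == "0":
                feature = payload
            elif feature is not None:
                yield key + ", " + payload + "," + feature


feature_lines = ["step1,0.92,260", "step99,0.25,44"]
step_lines = [
    "step1, some,data,blah",
    "step99, more,data,blubb",
    "step99, evenmore,data,blubb",
]

# Local stand-in for the shuffle: a stable sort by key, then tag.
records = sorted(
    [map_feature(l) for l in feature_lines] + [map_step(l) for l in step_lines],
    key=itemgetter(0, 1),
)
result = list(reduce_join(records))
```

Because the join is keyed on the step name, each reducer only ever sees the records for the keys assigned to it, so the work can be split across machines.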
Who wants to give it a try?