[ml] Volunteer needed: computing Chance feature values using EC2

Andreas von Hessling vonhessling at gmail.com
Sat Jun 5 21:04:06 UTC 2010


I have a task to hand out.  This would be a great opportunity for
someone to apply what they've learned in Vikram's Hadoop/EC2 sessions.
It is time-critical, and I don't think I can do it myself before
Sunday/Monday, by which time it is needed.

Definitions:

Chance Feature: the percentage of occurrences of a particular unique
problemstep that are solved correctly.
Chance Strength Feature: the number of times a particular unique
problemstepname occurs.

Together they are supposed to represent how easy/hard it is to get the
step right.
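
As a minimal sketch of these two definitions, assuming each row can be
reduced to a (step name, correct) pair with correct being 0 or 1: the
function name chance_features and the sample rows are hypothetical,
and this is one reading of the definitions, not necessarily how the
existing values were computed.

```python
from collections import defaultdict

def chance_features(rows):
    """rows: iterable of (step_name, correct) pairs, correct is 0 or 1.

    Returns {step_name: (chance, strength)} where strength is the number
    of occurrences and chance is the fraction of them solved correctly.
    """
    attempts = defaultdict(int)
    solved = defaultdict(int)
    for step_name, correct in rows:
        attempts[step_name] += 1
        solved[step_name] += correct
    return {
        name: (solved[name] / attempts[name], attempts[name])
        for name in attempts
    }

# Made-up sample rows, for illustration only.
rows = [("step1", 1), ("step1", 1), ("step1", 0), ("step2", 0)]
features = chance_features(rows)
```

Here step1 occurs three times and is solved twice, so its strength is
3 and its chance is 2/3.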

Problem:
I have computed all values for Chance and Chance Strength for algebra
and bridge.  The task is now to attach both values to each step (row)
in our (test & train) datasets. The issue is how slowly this runs when
I use SQL on my machines.  Here's the order of magnitude of the data
we're dealing with.

The number of steps/rows:

Algebra:
sqlite> select count(*) from atest;
508,912
sqlite> select count(*) from atrain;
8,918,054

Bridge:
sqlite> select count(*) from btest;
756,386
sqlite> select count(*) from btrain;
20,012,498


The number of chance/strength values:

Algebra:
sqlite> select count(*) from achance;
count(*)
1,259,273

Bridge:
sqlite> select count(*) from bchance;
count(*)
566,965


So for the simplest case, putting chance values into algebra test
would require up to 508,912 * 1,259,273 (roughly 6.4 * 10^11) lookups.
I've tried splitting the problem into subproblems (smaller tables),
but it still takes about 24 hours.  So SQL does not seem appropriate
here.
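
For comparison, the same assignment can be written as an in-memory
hash join: load the chance table into a dictionary once, then stream
the steps and do one dictionary lookup per row, which is about 508,912
lookups for algebra test instead of the product above.  This is a
sketch, not the approach used so far.  It assumes comma-separated
lines shaped like the example in this mail, that step names contain no
commas, and that the chance table fits in memory.  The names
load_chance and join_steps are hypothetical.

```python
import csv

def load_chance(lines):
    """Build {step_name: (chance, strength)} from 'step,chance,strength' lines."""
    table = {}
    for row in csv.reader(lines, skipinitialspace=True):
        if row:
            table[row[0]] = (row[1], row[2])
    return table

def join_steps(step_lines, chance_table, missing=("", "")):
    """Yield each step row with its chance and strength appended.

    Steps without an entry in chance_table get the `missing` placeholders.
    """
    for row in csv.reader(step_lines, skipinitialspace=True):
        if row:
            yield row + list(chance_table.get(row[0], missing))

# Sample lines in the assumed format; real input would come from files.
steps = [
    "step1, some,data,blah",
    "step99, more,data,blubb",
    "step99, evenmore,data,blubb",
]
chance = ["step1,0.92, 260", "step2,0.22, 21", "step99,0.25, 44"]

joined = list(join_steps(steps, load_chance(chance)))
```

Both functions accept any iterable of lines, so open file objects can
be passed in place of the lists.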

It seems that this can be done with EC2: the problem looks analogous
to our wordcount (hello-world) Hadoop example.  I can provide the data
via FTP.

Example:
Input:

steps:
step1, some,data,blah
...
step99, more,data,blubb
step99, evenmore,data,blubb
...
chance values:
step1,0.92, 260
step2,0.22, 21
...
step99,0.25, 44
...

Output:
step1, some,data,blah,0.92, 260
...
step99, more,data,blubb,0.25, 44
step99, evenmore,data,blubb,0.25, 44
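
The example above can be sketched as a reduce-side join in the
map/reduce style.  The code below only simulates the shuffle locally
with a sort and makes no Hadoop API calls; on a real cluster the
framework would do the sorting and grouping, and ordering records by
tag within a key would have to be arranged separately.  The names
mapper, reducer and run_local are hypothetical, and the line format is
assumed to match the example.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line, source):
    """Emit (step_name, tag, rest_of_line).

    Tag 0 marks a chance record and tag 1 a step row, so that after
    sorting the chance record comes before the step rows of the same key.
    """
    key, _, rest = line.partition(",")
    tag = 0 if source == "chance" else 1
    return (key.strip(), tag, rest.strip())

def reducer(key, records):
    """Append the chance values to every step row that shares the key.

    Step rows with no chance record are dropped (inner join).
    """
    chance = None
    for _, tag, rest in records:
        if tag == 0:
            chance = rest
        elif chance is not None:
            yield f"{key}, {rest},{chance}"

def run_local(step_lines, chance_lines):
    """Local stand-in for the shuffle: sort by (key, tag), group by key."""
    mapped = [mapper(line, "chance") for line in chance_lines]
    mapped += [mapper(line, "steps") for line in step_lines]
    mapped.sort(key=itemgetter(0, 1))  # stable, so step rows keep their order
    out = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        out.extend(reducer(key, group))
    return out

steps = [
    "step1, some,data,blah",
    "step99, more,data,blubb",
    "step99, evenmore,data,blubb",
]
chance = ["step1,0.92, 260", "step2,0.22, 21", "step99,0.25, 44"]
result = run_local(steps, chance)
```

With these sample lines, result holds the three output lines shown in
the example; step2 has a chance record but no step rows, so it
produces nothing.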


Who wants to give it a try?

Andy


