[ml] Volunteer needed: computing Chance feature values using EC2
Andreas von Hessling
vonhessling at gmail.com
Sat Jun 5 21:39:26 UTC 2010
I've put a link to the dump download on the KDD wiki page -- the
file's md5 hash is 1e42ff64831d60cced16f5330b84f297. The upload is
currently running; it started at 2:36pm and may take half an hour or so. It
contains sqlite dumps (start with sqlite <dbfilename>, then at the
sqlite> prompt type .read <dumpfilename>), which may/should also load into
other SQL engines. It contains the Algebra ("a") and Bridge ("b") train/test tables.
Please make sure the rows are kept in order in your output.
IQ strength values are also already in these files; they represent
the number of steps a student has attempted.
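
For anyone who would rather script the download check and the dump load than type them into the sqlite shell, here is a minimal sketch in Python. It assumes the dumps are plain SQL text; the file names in the usage comment are placeholders, not the real names on the wiki. Note that executescript reads the whole dump into memory, so for very large dumps the sqlite shell's .read is probably the safer route.

```python
import hashlib
import sqlite3

# md5 quoted in this message for the dump download.
EXPECTED_MD5 = "1e42ff64831d60cced16f5330b84f297"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_dump(dump_path, db_path=":memory:"):
    """Replay a SQL dump into a SQLite database, like .read in the shell."""
    conn = sqlite3.connect(db_path)
    with open(dump_path, "r", encoding="utf-8") as handle:
        conn.executescript(handle.read())
    conn.commit()
    return conn


# Hypothetical usage; both file names below are placeholders:
#   assert md5_of_file("kdd_dumps.tar.gz") == EXPECTED_MD5
#   conn = load_dump("atest.dump", "kdd.db")
```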
sqlite> .schema atest1
CREATE TABLE 'atest1' (
sqlite> .schema achance
CREATE TABLE "achance"(
I'll be unavailable till 4pm, then back. Thanks!
On Sat, Jun 5, 2010 at 2:10 PM, Thomas Lotze <thomas.lotze at gmail.com> wrote:
> With proper indexing, I think we can do this in approximately 508,912 +
> 1,259,273 lookups (rather than *). Which is to say, I think I can figure
> out how to put this together; do you have SQL dumps available or an SQL
> server I can access?
> P.S. This should not preclude anyone else from working on it, *especially*
> if they want to put together a Hadoop/EC2 solution.
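
A rough sketch of what the indexed approach could look like, using Python's sqlite3 module. The column names (stepname, chance, strength) are assumptions, since the real schemas are not shown in full in this thread; the default table names follow the atest/achance naming used here. It also assumes one chance row per step name. The index lets each step row find its chance values with a B-tree search rather than a scan of the whole chance table, and ORDER BY rowid keeps the step rows in their original order.

```python
import sqlite3


def attach_chance(conn, steps_table="atest", chance_table="achance"):
    """Join chance/strength onto every step row, keeping step rows in order.

    Assumed (hypothetical) columns: steps(stepname, ...) and
    chance(stepname, chance, strength), one chance row per stepname.
    """
    # Index the lookup key so each probe is a search, not a full scan.
    conn.execute(
        f"CREATE INDEX IF NOT EXISTS idx_{chance_table}_stepname "
        f"ON {chance_table}(stepname)"
    )
    query = (
        f"SELECT s.*, c.chance, c.strength "
        f"FROM {steps_table} AS s "
        f"LEFT JOIN {chance_table} AS c ON c.stepname = s.stepname "
        f"ORDER BY s.rowid"  # rowid follows insertion order of the step rows
    )
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    # Toy data only; the real tables come from the dumps.
    demo = sqlite3.connect(":memory:")
    demo.executescript(
        """
        CREATE TABLE atest(stepname TEXT, data TEXT);
        CREATE TABLE achance(stepname TEXT, chance REAL, strength INTEGER);
        INSERT INTO atest VALUES ('step1', 'some'), ('step99', 'more');
        INSERT INTO achance VALUES ('step1', 0.92, 260), ('step99', 0.25, 44);
        """
    )
    for row in attach_chance(demo):
        print(row)
```

The LEFT JOIN keeps step rows that have no matching chance value (their chance and strength come back as NULL), so the output has exactly as many rows as the step table.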
> On Sat, Jun 5, 2010 at 2:04 PM, Andreas von Hessling <vonhessling at gmail.com> wrote:
>> I have a task to hand out. This would be a great opportunity for
>> someone to apply what they've learned in Vikram's Hadoop/EC2 sessions.
>> This is time-critical and I don't think I can do it before
>> Sunday/Monday, by which time it is needed.
>> Chance Feature: the percentage of times a unique problemstep is solved.
>> Chance Strength Feature: the number of times a particular unique
>> problemstepname occurs.
>> Together they are supposed to represent how easy/hard it is to get the
>> step right.
>> I have computed all values for Chance and Chance Strength for algebra
>> and bridge. The problem/task is now to assign both value pairs back
>> to each step (row) in our (test & train) datasets. The issue here is
>> the speed at which this happens when I try to use SQL on my machines.
>> Here's the order of magnitude of the data we're dealing with.
>> The number of steps/rows:
>> sqlite> select count(*) from atest;
>> sqlite> select count(*) from atrain;
>> sqlite> select count(*) from btest;
>> sqlite> select count(*) from btrain;
>> The number of chance/strength values:
>> sqlite> select count(*) from achance;
>> sqlite> select count(*) from bchance;
>> So for the simplest case, putting chance values into algebra test
>> would require up to 508,912 * 1,259,273 lookups. I've tried splitting
>> the problem into subproblems (smaller tables), but it still takes
>> about 24 hours, so plain SQL on my machines does not seem appropriate.
>> It seems that this can be done with EC2 -- this seems like an
>> analogous problem to our wordcount (hello-world) Hadoop example. I
>> can provide the data via FTP.
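
To make the wordcount analogy concrete, here is a local, single-process sketch of a reduce-side join in Python. It only simulates the map, shuffle/sort and reduce phases; it is not actual Hadoop code. It assumes the step name is the first comma-separated field and contains no commas, and it carries a row number with each step row so the original order can be restored afterwards (in a real job that row number would have to be added to the data beforehand). The toy rows have the same shape as the example in this message.

```python
from itertools import groupby
from operator import itemgetter


def map_steps(lines):
    """Emit (key, tag, row_number, line) per step row; tag 1 sorts after chance."""
    for row_number, line in enumerate(lines):
        key = line.split(",", 1)[0].strip()
        yield (key, 1, row_number, line)


def map_chance(lines):
    """Emit (key, tag, row_number, values) per chance row; tag 0 sorts first."""
    for line in lines:
        key, values = line.split(",", 1)
        yield (key.strip(), 0, -1, values)


def reduce_join(records):
    """Group by key, remember the chance values, append them to each step row."""
    joined = []
    shuffled = sorted(records, key=itemgetter(0, 1, 2))  # stands in for the shuffle
    for _key, group in groupby(shuffled, key=itemgetter(0)):
        values = None
        for _k, tag, row_number, payload in group:
            if tag == 0:
                values = payload
            elif values is not None:
                joined.append((row_number, payload + "," + values))
            else:
                joined.append((row_number, payload))  # no chance value found
    # Restore the original row order, which the shuffle does not preserve.
    return [line for _row, line in sorted(joined)]


if __name__ == "__main__":
    steps = ["step1, some,data,blah", "step99, more,data,blubb"]
    chance = ["step1,0.92, 260", "step99,0.25, 44"]
    for line in reduce_join(list(map_steps(steps)) + list(map_chance(chance))):
        print(line)
```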
>> steps (rows):
>> step1, some,data,blah
>> step99, more,data,blubb
>> step99, evenmore,data,blubb
>> chance values:
>> step1,0.92, 260
>> step2,0.22, 21
>> step99,0.25, 44
>> desired result:
>> step1, some,data,blah,0.92, 260
>> step99, more,data,blubb,0.25, 44
>> step99, evenmore,data,blubb,0.25, 44
>> Who wants to give it a try?
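
For comparison, the example above can also be handled on a single machine with an in-memory hash join: one pass over the chance values to build a dictionary, then one dictionary lookup per step row. This is only a sketch under assumptions: the step name is the first comma-separated field and contains no commas, and the chance table fits in memory. The function name is made up for illustration.

```python
def attach_chance_values(step_lines, chance_lines):
    """Append chance and strength to each step line, preserving input order."""
    # Build the lookup table once: a single pass over the chance values.
    lookup = {}
    for line in chance_lines:
        key, values = line.split(",", 1)
        lookup[key.strip()] = values
    # One dictionary probe per step row: a single pass over the steps.
    joined = []
    for line in step_lines:
        key = line.split(",", 1)[0].strip()
        values = lookup.get(key)
        # Rows without a chance value are passed through unchanged.
        joined.append(line if values is None else line + "," + values)
    return joined


if __name__ == "__main__":
    steps = [
        "step1, some,data,blah",
        "step99, more,data,blubb",
        "step99, evenmore,data,blubb",
    ]
    chance = ["step1,0.92, 260", "step2,0.22, 21", "step99,0.25, 44"]
    for line in attach_chance_values(steps, chance):
        print(line)
```

Because the output list is built while iterating over the step rows, the rows stay in their input order without any extra sorting.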