Andreas,<br><br>With proper indexing, I think we can do this in approximately 508,912 + 1,259,273 lookups (the sum rather than the product), so I think I can figure out how to put this together. Do you have SQL dumps available, or a SQL server I can access?<br>
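<br>A minimal sketch of the indexed approach, using Python's built-in sqlite3 module on a toy dataset. The table names atest and achance come from the quoted message; the column names (step, data, chance, strength) and the index name are assumptions, since the real schema isn't shown.<br>

```python
import sqlite3

# Toy in-memory database; column names are assumptions, not the real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE atest (step TEXT, data TEXT);
    CREATE TABLE achance (step TEXT, chance REAL, strength INTEGER);
""")
conn.executemany(
    "INSERT INTO atest VALUES (?, ?)",
    [("step1", "some,data,blah"),
     ("step99", "more,data,blubb"),
     ("step99", "evenmore,data,blubb")],
)
conn.executemany(
    "INSERT INTO achance VALUES (?, ?, ?)",
    [("step1", 0.92, 260), ("step2", 0.22, 21), ("step99", 0.25, 44)],
)

# Index the join key so each atest row can find its achance row by an
# indexed search instead of a scan over the whole chance table.
conn.execute("CREATE INDEX achance_step_idx ON achance (step)")

# LEFT JOIN keeps steps that have no chance value (their columns are NULL).
rows = conn.execute("""
    SELECT t.step, t.data, c.chance, c.strength
    FROM atest AS t
    LEFT JOIN achance AS c ON c.step = t.step
    ORDER BY t.rowid
""").fetchall()

for row in rows:
    print(row)
```

With the index in place, each row of atest costs one indexed search in achance, so the total work grows roughly with the sum of the two row counts (times a logarithmic factor for the B-tree search) rather than with their product.<br>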

<br>Cheers,<br>Thomas<br><br>P.S. This should not preclude anyone else from working on it, *especially* if they want to put together a Hadoop/EC2 solution.<br><br><div class="gmail_quote">On Sat, Jun 5, 2010 at 2:04 PM, Andreas von Hessling <span dir="ltr">&lt;<a href="mailto:vonhessling@gmail.com">vonhessling@gmail.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">I have a task to hand out.  This would be a great opportunity for someone to apply<br>

what they've learned in Vikram's Hadoop/EC2 sessions.  This is<br>

time-critical, and I don't think I can do it before Sunday/Monday,<br>

which is when it is needed.<br>

<br>

Definitions:<br>

<br>

Chance Feature: the fraction of occurrences of a unique problemstep that are solved<br>

correctly.<br>

Chance Strength Feature: the number of times a particular unique<br>

problemstepname occurs.<br>

<br>

Together they are supposed to represent how easy/hard it is to get the<br>

step right.<br>

<br>

Problem:<br>

I have computed all values for Chance and Chance Strength for algebra<br>

and bridge.  The task now is to attach the matching (Chance, Chance Strength) pair<br>

to each step (row) in our test and train datasets.  The issue is<br>

how slowly this runs when I use SQL on my machines.<br>

Here's the order of magnitude of the data we're dealing with.<br>

<br>

The number of steps/rows:<br>

<br>

Algebra:<br>

sqlite> select count(*) from atest;<br>

508,912<br>

sqlite> select count(*) from atrain;<br>

8,918,054<br>

<br>

Bridge:<br>

sqlite> select count(*) from btest;<br>

756,386<br>

sqlite> select count(*) from btrain;<br>

20,012,498<br>

<br>

<br>

The number of chance/strength values:<br>

<br>

Algebra:<br>

sqlite> select count(*) from achance;<br>

count(*)<br>

1,259,273<br>

<br>

Bridge:<br>

sqlite> select count(*) from bchance;<br>

count(*)<br>

566,965<br>

<br>

<br>

So for the simplest case, putting chance values into algebra test<br>

would require up to 508,912 * 1,259,273 lookups.  I've tried splitting<br>

the problem into subproblems (smaller tables), but it still takes<br>

about 24 hours.  So SQL is not appropriate.<br>
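<br>To put the two costs side by side, here is the arithmetic as a short Python sketch. The row counts are the ones listed above. Treating an unindexed join as one comparison per row pair, and an indexed or hashed join as one pass over each table, is a simplifying assumption.<br>

```python
# Row counts quoted above for the algebra test set and its chance table.
test_rows = 508_912
chance_rows = 1_259_273

# Worst case without an index: compare every test row with every chance row.
nested_loop_lookups = test_rows * chance_rows

# With a hash table or index: one pass to build it, one lookup per test row.
single_pass_lookups = test_rows + chance_rows

print(f"nested loop: {nested_loop_lookups:,}")
print(f"single pass: {single_pass_lookups:,}")
print(f"ratio:       {nested_loop_lookups / single_pass_lookups:,.0f}x")
```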

<br>

It seems that this can be done with EC2; the problem is<br>

analogous to our wordcount (hello-world) Hadoop example.  I<br>

can provide the data via FTP.<br>
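<br>The wordcount analogy can be sketched as a reduce-side join: the map phase tags each record with its source and emits it under the step name, and the reduce phase appends the chance values to every step row that shares that key. The following is a plain-Python simulation of that flow, not actual Hadoop code. The function names and the "S"/"C" tags are made up for illustration, and the sample rows are the ones from the example in this message.<br>

```python
from collections import defaultdict

def map_record(line, tag):
    """Emit (step name, (source tag, rest of the row)) for one input line."""
    key, _, rest = line.partition(",")
    return key.strip(), (tag, rest.strip())

def shuffle(pairs):
    """Group mapped values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_join(key, values):
    """Append the chance values (tag "C") to each step row (tag "S")."""
    chance = [rest for tag, rest in values if tag == "C"]
    steps = [rest for tag, rest in values if tag == "S"]
    suffix = chance[0] if chance else ""  # empty if no chance value exists
    return [f"{key},{row},{suffix}" for row in steps]

steps = [
    "step1, some,data,blah",
    "step99, more,data,blubb",
    "step99, evenmore,data,blubb",
]
chances = ["step1,0.92, 260", "step2,0.22, 21", "step99,0.25, 44"]

pairs = [map_record(line, "S") for line in steps]
pairs += [map_record(line, "C") for line in chances]

joined = []
for key, values in sorted(shuffle(pairs).items()):
    joined.extend(reduce_join(key, values))

for line in joined:
    print(line)
```

Each input record is read once in the map phase and once in the reduce phase, so the work again scales with the total number of rows rather than with the product of the two table sizes.<br>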

<br>

Example:<br>

Input:<br>

<br>

steps:<br>

step1, some,data,blah<br>

...<br>

step99, more,data,blubb<br>

step99, evenmore,data,blubb<br>

...<br>

chance values:<br>

step1,0.92, 260<br>

step2,0.22, 21<br>

...<br>

step99,0.25, 44<br>

...<br>

<br>

Output:<br>

step1, some,data,blah,0.92, 260<br>

...<br>

step99, more,data,blubb,0.25, 44<br>

step99, evenmore,data,blubb,0.25, 44<br>

<br>

<br>

Who wants to give it a try?<br>

<br>

Andy<br>

</blockquote></div><br>