Hey Andy, the input to the classifier I'm trying to produce is<br>the orthogonalized dataset - i.e. the list of 1000+ columns where<br>each column has the value of the opportunity for that skill. The<br>dataset was produced by Erin and is broken into several parts,<br>

for the algebra dataset this looks like:<br><br>algebra-output_partaa<br>algebra-output_partab<br>..<br>algebra-output_partah<br><br><br>You're going to have to orthogonalize the test datasets, which<br>I don't have a copy of. Erin - are you around? Maybe she can help<br>

you convert the test datasets?<br><br>  mike<br><br><br><br><br><div class="gmail_quote">On Sat, Jun 5, 2010 at 11:40 AM, Andreas von Hessling <span dir="ltr"><<a href="mailto:vonhessling@gmail.com">vonhessling@gmail.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Sweet, Mike.  Please note that we need the row -> clusterid mapping<br>

for both training AND testing sets.  Otherwise it will not help the ML<br>

algorithms.<br>

If I understand correctly, your input is the orthogonalized skills.<br>

So far, the girls only provided these orthogonalizations for the<br>

training files.  I'm computing them for the test sets so you can use<br>

them.  If I don't understand this assumption correctly, please let me<br>

know so I can use my CPU's cycles for other tasks.<br>

<br>

Ideally you can provide these cluster mappings by about Sunday, which<br>

is when I want to start running classifiers.  I will need some time to<br>

actually run the ML algorithms.<br>

<br>

I now have IQ and IQ strength feature values for all datasets and am<br>

hoping time permits me to compute chance and chance strength values for<br>

rows.<br>

Computing # of skills required should not be difficult and I will add<br>

this feature as well.  I plan on sharing my datasets as new versions<br>

become available.<br>

<br>

Andy<br>

<div><div></div><div class="h5"><br>

<br>

<br>

<br>

On Fri, Jun 4, 2010 at 1:42 PM, Mike Schachter <<a href="mailto:mike@mindmech.com">mike@mindmech.com</a>> wrote:<br>

> So it's taking about 9 hours to create a graph from a 4.4GB file, I'm<br>

> going to work on improving the code to make it a bit faster, and also<br>

> am investigating a MapReduce solution.<br>

><br>

> Basically the clustering process can be broken down into two stages:<br>

><br>

> 1) Construct the graph, apply the clustering algorithm to break graph into<br>

> clusters<br>

> 2) Apply the clustered graph to the data again to classify each skill set<br>

><br>

> I'll keep working on it and let everyone know how things are going with it,<br>

> as I mentioned in another email, the source code is in our new sourceforge<br>

> project's git repository.<br>

><br>

>  mike<br>

><br>

><br>

><br>

><br>

> On Thu, Jun 3, 2010 at 7:48 PM, Mike Schachter <<a href="mailto:mike@mindmech.com">mike@mindmech.com</a>> wrote:<br>

>><br>

>> Sounds like you're making great progress! I'll be working on the<br>

>> graph clustering algorithm for the skill set tonight and will keep<br>

>> you posted on how things are going.<br>

>><br>

>>   mike<br>

>><br>

>><br>

>><br>

>><br>

>> On Thu, Jun 3, 2010 at 6:17 PM, Andreas von Hessling<br>

>> <<a href="mailto:vonhessling@gmail.com">vonhessling@gmail.com</a>> wrote:<br>

>>><br>

>>> Doing a few basic tricks, I catapulted the submission into the 50th<br>

>>> percentile.  That is not even running any ML algorithm.<br>

>>><br>

>>> I'm planning on running the NaiveBayesUpdateable classifier<br>

>>> (<a href="http://weka.wikispaces.com/Classifying+large+datasets" target="_blank">http://weka.wikispaces.com/Classifying+large+datasets</a>) over<br>

>>> discretized IQ/IQ strength/Chance/Chance strength from the command<br>

>>> line to evaluate performance.  Another attempt would be to load all<br>

>>> data into memory (<3GB, even for full Bridge Train) and run SVMlib<br>

>>> over it.<br>

>>><br>

>>> If someone wants to try MOA<br>

>>> (<a href="http://www.cs.waikato.ac.nz/%7Eabifet/MOA/index.html" target="_blank">http://www.cs.waikato.ac.nz/~abifet/MOA/index.html</a>), this would be<br>

>>> helpful also in the long run (at least a tutorial on how to set it up and<br>

>>> run).<br>

>>><br>

>>> The reduced datasets plus the IQ values are linked on the wiki. The features<br>

>>> are:<br>

>>>   row INT,<br>

>>>   studentid VARCHAR(30),<br>

>>>   problemhierarchy TEXT,<br>

>>>   problemname TEXT,<br>

>>>   problemview INT,<br>

>>>   problemstepname TEXT,<br>

>>>   cfa INT,<br>

>>>   iq REAL<br>

>>><br>

>>> IQ strength (number of attempts per student) should be available soon.<br>

>>>  (perhaps add'l features will become available as well)<br>

>>><br>

>>> I'm still hoping somebody could cluster Erin's normalized skills data<br>

>>> and provide a row -> cluster id mapping for algebra and bridge train<br>

>>> and test sets (I don't have the data any more).<br>

>>><br>

>>> Andy<br>

>>> _______________________________________________<br>

>>> ml mailing list<br>

>>> <a href="mailto:ml@lists.noisebridge.net">ml@lists.noisebridge.net</a><br>

>>> <a href="https://www.noisebridge.net/mailman/listinfo/ml" target="_blank">https://www.noisebridge.net/mailman/listinfo/ml</a><br>

>><br>

><br>

><br>


><br>

><br>

</div></div></blockquote></div><br>
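For reference, the two-stage clustering process described in the thread (build a graph and cluster it, then apply the clusters back to the data to get a row -> cluster id mapping for both training and test sets) could look roughly like the sketch below. This is only a minimal illustration under assumptions: the thread does not say which clustering algorithm is used, so connected components of a skill co-occurrence graph stand in for it here, and the function names, the majority-vote rule, and the skill names in the usage note are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_graph(rows):
    """Stage 1a: co-occurrence graph. Nodes are skills; an edge joins
    two skills that appear together in at least one row."""
    graph = defaultdict(set)
    for skills in rows.values():
        for skill in skills:
            graph[skill]  # touch the key so isolated skills become nodes
        for a, b in combinations(sorted(set(skills)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def cluster_graph(graph):
    """Stage 1b: stand-in clustering (an assumption, not the thread's
    algorithm). Each connected component becomes one cluster id."""
    cluster_of = {}
    next_id = 0
    for start in sorted(graph):
        if start in cluster_of:
            continue
        cluster_of[start] = next_id
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in graph[node]:
                if neighbor not in cluster_of:
                    cluster_of[neighbor] = next_id
                    stack.append(neighbor)
        next_id += 1
    return cluster_of


def map_rows_to_clusters(rows, cluster_of):
    """Stage 2: give each row the cluster most of its skills belong to.
    Rows with no known skills map to None."""
    mapping = {}
    for row_id, skills in rows.items():
        votes = Counter(cluster_of[s] for s in skills if s in cluster_of)
        mapping[row_id] = votes.most_common(1)[0][0] if votes else None
    return mapping
```

A possible usage, with made-up rows: build the clusters once from the training rows, e.g. `clusters = cluster_graph(build_graph({1: ["add", "subtract"], 2: ["subtract", "multiply"], 3: ["graph-slope"]}))`, then call `map_rows_to_clusters` with the same `clusters` on both the training rows and the test rows, so that cluster ids mean the same thing in both sets.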