I love open source software!<br>

<br>

The final predicted output for algebra and bridge (a Naive Bayes model with iq and score as predictors), which I believe is suitable for submission, is available at 

<a href="http://thomaslotze.com/kdd/output.tgz" target="_blank">http://thomaslotze.com/kdd/output.tgz</a><br>


<br>The streams.tgz and jarfiles.tgz have been updated with streams for bridge and my newly-compiled "moa_personal.jar" jarfile.<br><br>run_moa.sh should have all the steps needed to duplicate this in MOA yourself (after creating or importing the SQL tables) -- I've also put up MOA instructions on the wiki at <a href="https://www.noisebridge.net/wiki/Machine_Learning/moa" target="_blank">https://www.noisebridge.net/wiki/Machine_Learning/moa</a><br>


<br>

Summary: since the MOA code is available on SourceForge, I was able to create a new ClassificationPerformanceEvaluator (called BasicLoggingClassificationPerformanceEvaluator) and re-compile the MOA distribution into moa_personal.jar.  This evaluator prints out the row number and predicted probability of cfa.  It is currently pretty hard-coded for the KDD dataset, but I think I can turn it into a more general task/evaluator for future use (and potentially for inclusion back into the MOA trunk).  In any case, it should work for now.<br>
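For anyone curious what the logging amounts to, here is a rough sketch in Python rather than MOA's Java: for each test row, normalize the classifier's class votes and emit the row number with the predicted probability of cfa = 1. The function names and the vote layout (index 0 for cfa = 0, index 1 for cfa = 1) are assumptions for illustration, not the actual BasicLoggingClassificationPerformanceEvaluator code.<br>

```python
def cfa_probability(class_votes):
    """Turn raw class votes into P(cfa = 1); assumes index 1 holds the cfa = 1 vote."""
    total = sum(class_votes)
    if total <= 0 or len(class_votes) < 2:
        return 0.5  # no usable votes: fall back to an uninformative guess
    return class_votes[1] / total


def log_predictions(rows_with_votes):
    """Return one 'row<TAB>probability' line per (row number, class votes) pair."""
    return [f"{row}\t{cfa_probability(votes):.4f}" for row, votes in rows_with_votes]


if __name__ == "__main__":
    # Hypothetical votes for two test rows.
    for line in log_predictions([(1, [3.0, 1.0]), (2, [0.2, 0.8])]):
        print(line)
```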


<br>

Hooray for open source machine learning!<br><br>-Thomas<br><br><div class="gmail_quote">On Sun, Jun 6, 2010 at 4:42 PM, Andreas von Hessling <span dir="ltr"><<a href="mailto:vonhessling@gmail.com" target="_blank">vonhessling@gmail.com</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

Thomas,<br>

<br>

Have you finished joining the chance values into the steps?  If so,<br>

where can I download this joined_tables.sql.gz file?<br>

(the streams you provide are algebra only -- do you have bridge as<br>

well?)  I would like to concatenate your merged results with the<br>

number of skills feature I computed; will then provide this dataset.<br>

<br>

<br>

FYI, I'm trying to run one of the incremental classifiers within weka:<br>

I've started discretizing numeric values for Naive Bayes Updateable<br>

classifier (<a href="http://weka.sourceforge.net/doc/weka/classifiers/bayes/NaiveBayesUpdateable.html" target="_blank">http://weka.sourceforge.net/doc/weka/classifiers/bayes/NaiveBayesUpdateable.html</a>,<br>

also see <a href="http://weka.wikispaces.com/Classifying+large+datasets" target="_blank">http://weka.wikispaces.com/Classifying+large+datasets</a>) using<br>

something like this:  (need a lot of memory!)<br>

<br>

java -Xms2048m -Xmx4096m -cp weka.jar<br>

weka.filters.unsupervised.attribute.Discretize<br>

-unset-class-temporarily -F -B 10 -i inputfile -o outputfile<br>

<br>
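As a rough illustration of what the -F (equal-frequency) and -B 10 (ten bins) options ask for, here is a minimal Python sketch. It mirrors the idea only; the function names are made up and Weka's exact cut-point rules may differ.<br>

```python
from bisect import bisect_right


def equal_frequency_cuts(values, bins=10):
    """Pick cut points so each bin receives roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for k in range(1, bins):
        idx = (k * n) // bins
        if 0 < idx < n:
            # Place the cut halfway between the two neighbouring values.
            cut = (ordered[idx - 1] + ordered[idx]) / 2.0
            if not cuts or cut > cuts[-1]:
                cuts.append(cut)  # skip duplicate cut points from tied values
    return cuts


def discretize(value, cuts):
    """Map a numeric value to the index of its bin."""
    return bisect_right(cuts, value)
```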

Similarly, one can then run the NB algorithm incrementally.  I haven't<br>

done this yet, but Thomas, this may be an alternative if MOA doesn't<br>

work out.<br>
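To make "incrementally" concrete, here is a small Python sketch of a Naive Bayes model over discretized features that is updated one row at a time, so the full dataset never has to sit in memory. The class and its Laplace smoothing are assumptions for illustration; this is not Weka's NaiveBayesUpdateable implementation.<br>

```python
from collections import defaultdict
import math


class IncrementalNaiveBayes:
    """Categorical Naive Bayes that learns one example at a time."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # Laplace smoothing strength
        self.class_counts = defaultdict(int)    # label -> examples seen
        self.feature_counts = defaultdict(int)  # (label, position, value) -> count
        self.values_seen = defaultdict(set)     # position -> distinct values

    def update(self, features, label):
        """Fold a single (features, label) example into the counts."""
        self.class_counts[label] += 1
        for i, v in enumerate(features):
            self.feature_counts[(label, i, v)] += 1
            self.values_seen[i].add(v)

    def predict_proba(self, features):
        """Return {label: probability} for one example (needs at least one update)."""
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, n in self.class_counts.items():
            score = math.log(n / total)
            for i, v in enumerate(features):
                count = self.feature_counts.get((label, i, v), 0)
                k = max(len(self.values_seen[i]), 1)
                score += math.log((count + self.alpha) / (n + self.alpha * k))
            log_scores[label] = score
        # Normalize in log space to avoid underflow with many features.
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        z = sum(exp.values())
        return {c: e / z for c, e in exp.items()}
```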

<br>

Andy<br>

<div><div></div><div><br>

<br>

<br>

On Sun, Jun 6, 2010 at 1:07 AM, Thomas Lotze <<a href="mailto:thomas.lotze@gmail.com" target="_blank">thomas.lotze@gmail.com</a>> wrote:<br>

> All,<br>

><br>

> I've been trying to use MOA to generate a classifier...and while I seem to<br>

> be able to do that, I'm having trouble getting it to actually output<br>

> classifications for new examples, so thought I'd share my current status and<br>

> see if anyone can help.<br>

><br>

> You can download the stream test and train files from<br>

> <a href="http://thomaslotze.com/kdd/streams.tgz" target="_blank">http://thomaslotze.com/kdd/streams.tgz</a><br>

> You can also download the jarfiles needed for MOA at<br>

> <a href="http://thomaslotze.com/kdd/jarfiles.tgz" target="_blank">http://thomaslotze.com/kdd/jarfiles.tgz</a><br>

><br>

> Unpack these all into the same directory.  Then, in that directory, using<br>

> the following command, you can create a MOA classifier:<br>

> java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask "LearnModel<br>

> -l MajorityClass -s (ArffFileStream -f atrain.arff -c -1) -O amodel.moa"<br>

><br>

> You can also summarize the test arff file using the following command:<br>

> java -cp .:moa.jar:weka.jar weka.core.Instances atest.arff<br>

><br>

> But I cannot find a command for MOA which will input the amodel.moa model<br>

> and generate predicted classes for atest.arff.  The closest I've come is the<br>

> following, which runs amodel.moa on the atest.arff, and must be predicting<br>

> classes and comparing, because it declares how many it got correct:<br>

> java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask<br>

> "EvaluateModel -m file:amodel.moa -s (ArffFileStream -f atest.arff -c -1)"<br>

><br>

> So if anyone can figure it out (I've been using<br>

> <a href="http://www.cs.waikato.ac.nz/%7Eabifet/MOA/Manual.pdf" target="_blank">http://www.cs.waikato.ac.nz/~abifet/MOA/Manual.pdf</a> as a guide), I could<br>

> certainly use some help with this step.<br>

><br>

> Cheers,<br>

> Thomas<br>

><br>

> P.S. If you'd like to get the SQL loaded yourself, you can download<br>

> joined_tables.sql.gz (which was created using get_output.sh).  I then used<br>

> run_moa.sh to create the .arff files and try to run MOA.<br>

><br>

> On Sat, Jun 5, 2010 at 2:05 PM, Andreas von Hessling <<a href="mailto:vonhessling@gmail.com" target="_blank">vonhessling@gmail.com</a>><br>

> wrote:<br>

>><br>

>> Mike,<br>

>> We're working on getting the test dataset orthogonalized.  Stay tuned.<br>

>> Andy<br>

>><br>

>><br>

>> On Sat, Jun 5, 2010 at 1:55 PM, Mike Schachter <<a href="mailto:mike@mindmech.com" target="_blank">mike@mindmech.com</a>> wrote:<br>

>> > Hey Andy, the input to the classifier I'm trying to produce is<br>

>> > the orthogonalized dataset - i.e. the list of 1000+ columns where<br>

>> > each column has the value of the opportunity for that skill. The<br>

>> > dataset was produced by Erin and is broken into several parts,<br>

>> > for the algebra dataset this looks like:<br>

>> ><br>

>> > algebra-output_partaa<br>

>> > algebra-output_partab<br>

>> > ..<br>

>> > algebra-output_partah<br>

>> ><br>

>> ><br>

>> > You're going to have to orthogonalize the test datasets, which<br>

>> > I don't have a copy of. Erin - are you around? Maybe she can help<br>

>> > you convert the test datasets?<br>

>> ><br>

>> >   mike<br>

>> ><br>

>> ><br>

>> ><br>

>> ><br>

>> > On Sat, Jun 5, 2010 at 11:40 AM, Andreas von Hessling<br>

>> > <<a href="mailto:vonhessling@gmail.com" target="_blank">vonhessling@gmail.com</a>> wrote:<br>

>> >><br>

>> >> Sweet, Mike.  Please note that we need the row -> clusterid mapping<br>

>> >> for both training AND testing sets.  Otherwise it will not help the ML<br>

>> >> algorithms.<br>

>> >> If I understand correctly, your inputs are the orthogonalized skills.<br>

>> >> So far, the girls only provided these orthogonalizations for the<br>

>> >> training files.  I'm computing them for the test sets so you can use<br>

>> >> them.  If I don't understand this assumption correctly, please let me<br>

>> >> know so I can use my CPU's cycles for other tasks.<br>

>> >><br>

>> >> Ideally you can provide these cluster mappings by about Sunday, which<br>

>> >> is when I want to start running classifiers.  I will need some time to<br>

>> >> actually run the ML algorithms.<br>

>> >><br>

>> >> I now have IQ and IQ strength feature values for all datasets and am<br>

>> >> hoping time permits to compute chance and chance strength values for<br>

>> >> rows.<br>

>> >> Computing # of skills required should not be difficult and I will add<br>

>> >> this feature as well.  I plan on sharing my datasets as new versions<br>

>> >> become available.<br>

>> >><br>

>> >> Andy<br>

>> >><br>

>> >><br>

>> >><br>

>> >><br>

>> >> On Fri, Jun 4, 2010 at 1:42 PM, Mike Schachter <<a href="mailto:mike@mindmech.com" target="_blank">mike@mindmech.com</a>><br>

>> >> wrote:<br>

>> >> > So it's taking about 9 hours to create a graph from a 4.4GB file, I'm<br>

>> >> > going to work on improving the code to make it a bit faster, and also<br>

>> >> > am investigating a MapReduce solution.<br>

>> >> ><br>

>> >> > Basically the clustering process can be broken down into two stages:<br>

>> >> ><br>

>> >> > 1) Construct the graph, apply the clustering algorithm to break graph<br>

>> >> > into<br>

>> >> > clusters<br>

>> >> > 2) Apply the clustered graph to the data again to classify each skill<br>

>> >> > set<br>

>> >> ><br>

>> >> > I'll keep working on it and let everyone know how things are going<br>

>> >> > with<br>

>> >> > it,<br>

>> >> > as I mentioned in another email, the source code is in our new<br>

>> >> > sourceforge<br>

>> >> > project's git repository.<br>

>> >> ><br>

>> >> >  mike<br>

>> >> ><br>

>> >> ><br>

>> >> ><br>

>> >> ><br>

>> >> > On Thu, Jun 3, 2010 at 7:48 PM, Mike Schachter <<a href="mailto:mike@mindmech.com" target="_blank">mike@mindmech.com</a>><br>

>> >> > wrote:<br>

>> >> >><br>

>> >> >> Sounds like you're making great progress! I'll be working on the<br>

>> >> >> graph clustering algorithm for the skill set tonight and will keep<br>

>> >> >> you posted on how things are going.<br>

>> >> >><br>

>> >> >>   mike<br>

>> >> >><br>

>> >> >><br>

>> >> >><br>

>> >> >><br>

>> >> >> On Thu, Jun 3, 2010 at 6:17 PM, Andreas von Hessling<br>

>> >> >> <<a href="mailto:vonhessling@gmail.com" target="_blank">vonhessling@gmail.com</a>> wrote:<br>

>> >> >>><br>

>> >> >>> Doing a few basic tricks, I catapulted the submission into the 50th<br>

>> >> >>> percentile.  That is not even running any ML algorithm.<br>

>> >> >>><br>

>> >> >>> I'm planning on running the NaiveBayesUpdateable classifier<br>

>> >> >>> (<a href="http://weka.wikispaces.com/Classifying+large+datasets" target="_blank">http://weka.wikispaces.com/Classifying+large+datasets</a>) over<br>

>> >> >>> discretized IQ/IQ strength/Chance/Chance strength from the command<br>

>> >> >>> line to evaluate performance.  Another attempt would be to load all<br>

>> >> >>> data into memory (<3GB, even for full Bridge Train) and run SVMlib<br>

>> >> >>> over it.<br>

>> >> >>><br>

>> >> >>> If someone wants to try MOA<br>

>> >> >>> (<a href="http://www.cs.waikato.ac.nz/%7Eabifet/MOA/index.html" target="_blank">http://www.cs.waikato.ac.nz/~abifet/MOA/index.html</a>), this would be<br>

>> >> >>> helpful also in the long run (at least a tutorial how to set it up<br>

>> >> >>> and<br>

>> >> >>> run).<br>

>> >> >>><br>

>> >> >>> The reduced datasets plus the IQ values are linked on the wiki:<br>

>> >> >>> Features<br>

>> >> >>> are:<br>

>> >> >>>   ...> row INT,<br>

>> >> >>>   ...> studentid VARCHAR(30),<br>

>> >> >>>   ...> problemhierarchy TEXT,<br>

>> >> >>>   ...> problemname TEXT,<br>

>> >> >>>   ...> problemview INT,<br>

>> >> >>>   ...> problemstepname TEXT,<br>

>> >> >>>   ...> cfa INT,<br>

>> >> >>>   ...> iq REAL<br>

>> >> >>><br>

>> >> >>> IQ strength (number of attempts per student) should be available<br>

>> >> >>> soon.<br>

>> >> >>>  (perhaps add'l features will become available as well)<br>

>> >> >>><br>

>> >> >>> I'm still hoping somebody could cluster Erin's normalized skills<br>

>> >> >>> data<br>

>> >> >>> and provide a row -> cluster id mapping for algebra and bridge<br>

>> >> >>> train<br>

>> >> >>> and test sets (I don't have the data any more).<br>

>> >> >>><br>

>> >> >>> Andy<br>

>> >> >>> _______________________________________________<br>

>> >> >>> ml mailing list<br>

>> >> >>> <a href="mailto:ml@lists.noisebridge.net" target="_blank">ml@lists.noisebridge.net</a><br>

>> >> >>> <a href="https://www.noisebridge.net/mailman/listinfo/ml" target="_blank">https://www.noisebridge.net/mailman/listinfo/ml</a><br>

>> >> >><br>

>> >> ><br>

>> >> ><br>


>> >> ><br>

>> >> ><br>

>> ><br>

>> ><br>


><br>

><br>


><br>

><br>

</div></div></blockquote></div><br>