Hey Andy,

I'm making good progress with the clustered skills. I generated a graph
from all the algebra data, and am currently creating a clustered graph to
use as a classifier. The next step is to classify the entire algebra
dataset, then use the Python script to normalize the test dataset and
classify that. It's looking like I'll have this to you later in the day;
I'll keep you posted!
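
For context, here is roughly the shape of that pipeline in Python -- a
simplified, untested sketch, not the real code (which lives in our
sourceforge project's git repository). The file name, the KC column name
and the '~~' separator are assumptions about the data files, and the
connected-components step is only a stand-in for the actual clustering
algorithm:

    import csv
    from collections import defaultdict

    KC_COLUMN = 'KC(Default)'   # placeholder; use the KC column name from the actual file
    SEPARATOR = '~~'            # assumed separator between multiple skills in one step

    def skills_of(row):
        return [s for s in row.get(KC_COLUMN, '').split(SEPARATOR) if s]

    # Stage 1: build a skill co-occurrence graph from the training rows.
    edge_counts = defaultdict(int)
    with open('algebra_train.txt') as f:     # placeholder file name
        for row in csv.DictReader(f, delimiter='\t'):
            sk = skills_of(row)
            for i in range(len(sk)):
                for j in range(i + 1, len(sk)):
                    if sk[i] != sk[j]:
                        edge_counts[frozenset((sk[i], sk[j]))] += 1

    # Stage 1 (cont.): break the graph into clusters. Connected components over
    # frequently co-occurring skills stand in here for the real clustering step.
    MIN_COOCCUR = 5
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for pair, count in edge_counts.items():
        if count >= MIN_COOCCUR:
            a, b = tuple(pair)
            parent[find(a)] = find(b)

    # Stage 2: classify each row's skill set by the majority cluster of its skills.
    def cluster_of(skill_set):
        votes = defaultdict(int)
        for s in skill_set:
            votes[find(s)] += 1
        return max(votes, key=votes.get) if votes else None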

 mike

On Mon, Jun 7, 2010 at 9:21 AM, Andreas von Hessling <vonhessling@gmail.com> wrote:
Current status: We're currently trying to improve our scores by
adding more features; Thomas is supplying me with MOA predictions,
which I then get scored. I'm also adding new features, and eventually
I will be running Weka's incremental classifiers.

As it is right now, we cannot make use of the orthogonalized datasets,
since we are unable to orthogonalize the test datasets as well -- this
has not been done before, and we are currently running into technical
issues. If somebody with a Python background wants to look at this
today, it would be very helpful. At the moment I feel we are limited
by the features we have computed; the clustered skills would be
very valuable.
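
To make the Python ask concrete, here is the rough shape of the script I
have in mind -- an untested sketch, with the file names, column names
(including the 'Row' column) and the '~~' separator all being assumptions
about the file format. The important part is that the test file must be
pivoted against the skill vocabulary of the training file, so that train
and test end up with exactly the same columns:

    import csv

    KC_COL = 'KC(Default)'            # placeholder KC column name
    OPP_COL = 'Opportunity(Default)'  # placeholder opportunity column name
    SEP = '~~'                        # assumed separator between multiple skills

    def load_vocab(train_file):
        # The skill vocabulary must come from the TRAINING file so that the
        # train and test matrices get exactly the same columns.
        vocab = {}
        with open(train_file) as f:
            for row in csv.DictReader(f, delimiter='\t'):
                for skill in row[KC_COL].split(SEP):
                    if skill and skill not in vocab:
                        vocab[skill] = len(vocab)
        return vocab

    def orthogonalize(in_file, out_file, vocab):
        # One output column per skill, holding that skill's opportunity count
        # for the row (0 when the skill is not involved in the step).
        skills_in_order = sorted(vocab, key=vocab.get)
        with open(in_file) as f, open(out_file, 'w') as out:
            out.write('\t'.join(['Row'] + skills_in_order) + '\n')
            for row in csv.DictReader(f, delimiter='\t'):
                cols = [0] * len(vocab)
                opps = row.get(OPP_COL, '').split(SEP)
                for skill, opp in zip(row[KC_COL].split(SEP), opps):
                    if skill in vocab:
                        cols[vocab[skill]] = int(opp) if opp.isdigit() else 0
                out.write('\t'.join([row['Row']] + [str(c) for c in cols]) + '\n')

    vocab = load_vocab('algebra_train.txt')
    orthogonalize('algebra_test.txt', 'algebra_test_ortho.txt', vocab)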

Andy
On Sun, Jun 6, 2010 at 5:41 PM, Thomas Lotze <thomas.lotze@gmail.com> wrote:
> I love open source software!
>
> The final predicted output (using iq and score as predictors, under a Naive
> Bayes model) for algebra and bridge (suitable, I believe, for submission) is
> available at http://thomaslotze.com/kdd/output.tgz
>
> The streams.tgz and jarfiles.tgz have been updated with streams for bridge
> and my newly compiled "moa_personal.jar" jarfile.
>
> run_moa.sh should have all the steps needed to duplicate this in MOA
> yourself (after creating or importing the SQL tables) -- I've also put up
> MOA instructions on the wiki at
> https://www.noisebridge.net/wiki/Machine_Learning/moa
>
> Summary: since the MOA code was available on SourceForge, I was able to
> create a new ClassificationPerformanceEvaluator (called
> BasicLoggingClassificationPerformanceEvaluator) and recompile the MOA
> distribution into moa_personal.jar. This evaluator lets us print out the
> row number and the predicted probability of cfa. It is still pretty
> hard-coded to the KDD dataset, but I think I can turn it into a more
> general task/evaluator for future use (and potentially for inclusion back
> into the MOA trunk). In any case, it should work for now.
>
> Hooray for open source machine learning!
>
> -Thomas
>
> On Sun, Jun 6, 2010 at 4:42 PM, Andreas von Hessling <vonhessling@gmail.com> wrote:
>>
>> Thomas,
>>
>> Have you finished joining the chance values into the steps? If so,
>> where can I download this joined_tables.sql.gz file?
>> (The streams you provide are algebra only -- do you have bridge as
>> well?) I would like to concatenate your merged results with the
>> number-of-skills feature I computed; I will then provide this dataset.
>>
>>
>> FYI, I'm trying to run one of the incremental classifiers within Weka:
>> I've started discretizing numeric values for the NaiveBayesUpdateable
>> classifier
>> (http://weka.sourceforge.net/doc/weka/classifiers/bayes/NaiveBayesUpdateable.html,
>> also see http://weka.wikispaces.com/Classifying+large+datasets) using
>> something like this (it needs a lot of memory!):
>>
>> java -Xms2048m -Xmx4096m -cp weka.jar
>> weka.filters.unsupervised.attribute.Discretize
>> -unset-class-temporarily -F -B 10 -i inputfile -o outputfile
>>
>> Similarly, one can then run the NB algorithm incrementally. I haven't
>> done this yet, but Thomas, this may be an alternative if MOA doesn't
>> work out.
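>>
>> In case it helps to see the idea outside of Weka, here is a toy Python
>> sketch of the same scheme -- equal-frequency discretization plus Naive
>> Bayes counts updated one row at a time, so the full dataset never has to
>> sit in memory. This is illustrative only, not the Weka internals:
>>
>>     from collections import defaultdict
>>
>>     def equal_frequency_cuts(values, n_bins=10):
>>         # Cut points so each bin holds roughly the same number of values
>>         # (roughly what Weka's Discretize does with -F -B 10).
>>         s = sorted(values)
>>         return [s[len(s) * k // n_bins] for k in range(1, n_bins)]
>>
>>     def to_bin(x, cuts):
>>         return sum(x > c for c in cuts)
>>
>>     class IncrementalNB:
>>         def __init__(self, n_bins=10):
>>             self.n_bins = n_bins
>>             self.class_counts = defaultdict(int)
>>             self.feature_counts = defaultdict(int)  # (label, feature, bin)
>>
>>         def update(self, bins, label):
>>             # One training row at a time; nothing else is kept in memory.
>>             self.class_counts[label] += 1
>>             for i, b in enumerate(bins):
>>                 self.feature_counts[(label, i, b)] += 1
>>
>>         def score(self, bins, label):
>>             # Unnormalized P(label) * prod P(bin | label), Laplace-smoothed.
>>             total = sum(self.class_counts.values())
>>             p = self.class_counts[label] / total
>>             for i, b in enumerate(bins):
>>                 p *= ((self.feature_counts[(label, i, b)] + 1) /
>>                       (self.class_counts[label] + self.n_bins))
>>             return p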
>>
>> Andy
>>
>> On Sun, Jun 6, 2010 at 1:07 AM, Thomas Lotze <thomas.lotze@gmail.com> wrote:
>> > All,
>> >
>> > I've been trying to use MOA to generate a classifier... and while I seem
>> > to be able to do that, I'm having trouble getting it to actually output
>> > classifications for new examples, so I thought I'd share my current
>> > status and see if anyone can help.
>> >
>> > You can download the stream test and train files from
>> > http://thomaslotze.com/kdd/streams.tgz
>> > You can also download the jarfiles needed for MOA at
>> > http://thomaslotze.com/kdd/jarfiles.tgz
>> >
>> > Unpack these all into the same directory. Then, in that directory, you
>> > can create a MOA classifier with the following command:
>> > java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask "LearnModel
>> > -l MajorityClass -s (ArffFileStream -f atrain.arff -c -1) -O amodel.moa"
>> >
>> > You can also summarize the test arff file using the following command:
>> > java -cp .:moa.jar:weka.jar weka.core.Instances atest.arff
>> >
>> > But I cannot find a command for MOA that will take the amodel.moa model
>> > as input and generate predicted classes for atest.arff. The closest I've
>> > come is the following, which runs amodel.moa on atest.arff and must be
>> > predicting classes and comparing them, because it reports how many it
>> > got correct:
>> > java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask
>> > "EvaluateModel -m file:amodel.moa -s (ArffFileStream -f atest.arff -c -1)"
>> >
>> > So if anyone can figure it out (I've been using
>> > http://www.cs.waikato.ac.nz/~abifet/MOA/Manual.pdf as a guide), I could
>> > certainly use some help with this step.
>> >
>> > Cheers,
>> > Thomas
>> >
>> > P.S. If you'd like to get the SQL loaded yourself, you can download
>> > joined_tables.sql.gz (which was created using get_output.sh). I then
>> > used run_moa.sh to create the .arff files and try to run MOA.
>> >
>> > On Sat, Jun 5, 2010 at 2:05 PM, Andreas von Hessling <vonhessling@gmail.com> wrote:
>> >>
>> >> Mike,
>> >> We're working on getting the test dataset orthogonalized. Stay tuned.
>> >> Andy
>> >>
>> >> On Sat, Jun 5, 2010 at 1:55 PM, Mike Schachter <mike@mindmech.com> wrote:
>> >> > Hey Andy, the input to the classifier I'm trying to produce is
>> >> > the orthogonalized dataset -- i.e. the list of 1000+ columns where
>> >> > each column has the value of the opportunity for that skill. The
>> >> > dataset was produced by Erin and is broken into several parts;
>> >> > for the algebra dataset this looks like:
>> >> >
>> >> > algebra-output_partaa
>> >> > algebra-output_partab
>> >> > ..
>> >> > algebra-output_partah
>> >> >
>> >> >
>> >> > You're going to have to orthogonalize the test datasets, which
>> >> > I don't have a copy of. Erin -- are you around? Maybe she can help
>> >> > you convert the test datasets?
>> >> >
>> >> > mike
>> >> >
>> >> > On Sat, Jun 5, 2010 at 11:40 AM, Andreas von Hessling <vonhessling@gmail.com> wrote:
>> >> >>
>> >> >> Sweet, Mike. Please note that we need the row -> clusterid mapping
>> >> >> for both the training AND test sets; otherwise it will not help the
>> >> >> ML algorithms.
>> >> >> If I understand correctly, your input is the orthogonalized skills.
>> >> >> So far, the girls have only provided these orthogonalizations for the
>> >> >> training files; I'm computing them for the test sets so you can use
>> >> >> them. If this assumption is wrong, please let me know so I can use
>> >> >> my CPU cycles for other tasks.
>> >> >>
>> >> >> Ideally you can provide these cluster mappings by about Sunday,
>> >> >> which is when I want to start running classifiers. I will need some
>> >> >> time to actually run the ML algorithms.
>> >> >>
>> >> >> I now have IQ and IQ-strength feature values for all datasets, and am
>> >> >> hoping time permits computing chance and chance-strength values for
>> >> >> the rows.
>> >> >> Computing the number of skills required should not be difficult, and I
>> >> >> will add this feature as well. I plan on sharing my datasets as new
>> >> >> versions become available.
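>> >> >>
>> >> >> For the skill-count feature, something along these lines should do;
>> >> >> this is just a sketch and assumes the KC column separates multiple
>> >> >> skills with '~~', which is worth double-checking against the raw files:
>> >> >>
>> >> >>     def num_skills(kc_value, sep='~~'):
>> >> >>         # Count the non-empty skill names listed for a step.
>> >> >>         return len([s for s in kc_value.split(sep) if s])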
>> >> >>
>> >> >> Andy
>> >> >>
>> >> >> On Fri, Jun 4, 2010 at 1:42 PM, Mike Schachter <mike@mindmech.com> wrote:
>> >> >> > So it's taking about 9 hours to create a graph from a 4.4GB file.
>> >> >> > I'm going to work on improving the code to make it a bit faster,
>> >> >> > and I'm also investigating a MapReduce solution.
>> >> >> >
>> >> >> > Basically, the clustering process can be broken down into two stages:
>> >> >> >
>> >> >> > 1) Construct the graph, then apply the clustering algorithm to break
>> >> >> > the graph into clusters
>> >> >> > 2) Apply the clustered graph to the data again to classify each
>> >> >> > skill set
>> >> >> >
>> >> >> > I'll keep working on it and let everyone know how things are going;
>> >> >> > as I mentioned in another email, the source code is in our new
>> >> >> > SourceForge project's git repository.
>> >> >> >
>> >> >> > mike
>> >> >> >
>> >> >> > On Thu, Jun 3, 2010 at 7:48 PM, Mike Schachter <mike@mindmech.com> wrote:
>> >> >> >>
>> >> >> >> Sounds like you're making great progress! I'll be working on the
>> >> >> >> graph clustering algorithm for the skill set tonight and will keep
>> >> >> >> you posted on how things are going.
>> >> >> >>
>> >> >> >> mike
>> >> >> >>
>> >> >> >> On Thu, Jun 3, 2010 at 6:17 PM, Andreas von Hessling <vonhessling@gmail.com> wrote:
>> >> >> >>>
>> >> >> >>> With a few basic tricks, I catapulted the submission into the
>> >> >> >>> 50th percentile. That's without even running any ML algorithm.
>> >> >> >>>
>> >> >> >>> I'm planning on running the NaiveBayesUpdateable classifier
>> >> >> >>> (http://weka.wikispaces.com/Classifying+large+datasets) over the
>> >> >> >>> discretized IQ / IQ strength / chance / chance strength features
>> >> >> >>> from the command line to evaluate performance. Another option
>> >> >> >>> would be to load all the data into memory (<3GB, even for the full
>> >> >> >>> Bridge Train set) and run SVMlib over it.
>> >> >> >>>
>> >> >> >>> If someone wants to try MOA
>> >> >> >>> (http://www.cs.waikato.ac.nz/~abifet/MOA/index.html), that would
>> >> >> >>> also be helpful in the long run (at least a tutorial on how to set
>> >> >> >>> it up and run it).
>> >> >> >>>
>> >> >> >>> The reduced datasets plus the IQ values are linked on the wiki.
>> >> >> >>> The features are:
>> >> >> >>>   row INT,
>> >> >> >>>   studentid VARCHAR(30),
>> >> >> >>>   problemhierarchy TEXT,
>> >> >> >>>   problemname TEXT,
>> >> >> >>>   problemview INT,
>> >> >> >>>   problemstepname TEXT,
>> >> >> >>>   cfa INT,
>> >> >> >>>   iq REAL
>> >> >> >>>
>> >> >> >>> IQ strength (the number of attempts per student) should be
>> >> >> >>> available soon. (Perhaps additional features will become available
>> >> >> >>> as well.)
>> >> >> >>>
>> >> >> >>> I'm still hoping somebody could cluster Erin's normalized skills
>> >> >> >>> data and provide a row -> cluster id mapping for the algebra and
>> >> >> >>> bridge train and test sets (I don't have the data any more).
>> >> >> >>>
>> >> >> >>> Andy
_______________________________________________
ml mailing list
ml@lists.noisebridge.net
https://www.noisebridge.net/mailman/listinfo/ml