I love open source software!<br>
<br>
The final predicted output (using iq and score as predictors, under a Naive Bayes model) for algebra and bridge (suitable, I believe, for submission) is available in
<a href="http://thomaslotze.com/kdd/output.tgz" target="_blank">http://thomaslotze.com/kdd/output.tgz</a><br>
<br>The streams.tgz and jarfiles.tgz have been updated with streams for bridge and my newly-compiled "moa_personal.jar" jarfile.<br><br>run_moa.sh should have all the steps needed to duplicate this in MOA yourself (after creating or importing the SQL tables) -- I've also put up MOA instructions on the wiki at <a href="https://www.noisebridge.net/wiki/Machine_Learning/moa" target="_blank">https://www.noisebridge.net/wiki/Machine_Learning/moa</a><br>
<br>
Summary: since the MOA code was available on sourceforge, I was able to create a
new ClassificationPerformanceEvaluator (called BasicLoggingClassificationPerformanceEvaluator) and re-compile the MOA distribution into moa_personal.jar. This allows us to use the new evaluator to print out the row number and predicted probability of cfa. The evaluator is currently pretty hard-coded for the KDD dataset, but I think I can modify it into a more general task/evaluator for use in the future (and potentially for inclusion back into the MOA trunk). In any case, it should work for now.<br>
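<br>
To illustrate the idea (this is only a rough Python sketch, not the actual Java code in moa_personal.jar): assume the classifier hands back a list of per-class votes, with index 1 standing for cfa = 1, and the evaluator turns that into a row number plus a predicted probability. The function names, the "row,probability" output format, and the 0.5 fallback are all hypothetical choices made for this example.<br>

```python
def cfa_probability(votes, positive_index=1):
    """Normalize raw class votes into an estimate of P(cfa = 1).

    Assumption: votes[i] is the classifier's unnormalized support for class i.
    """
    total = sum(votes)
    if total <= 0:
        # No usable votes: fall back to an uninformative 0.5 (arbitrary choice).
        return 0.5
    if positive_index >= len(votes):
        # The vote list can be shorter than the number of classes when the
        # positive class received no support at all.
        return 0.0
    return votes[positive_index] / total


def format_predictions(rows_and_votes):
    """Return one 'row,probability' line per (row number, votes) pair."""
    return ["%d,%.4f" % (row, cfa_probability(votes))
            for row, votes in rows_and_votes]


if __name__ == "__main__":
    # Made-up votes for two rows, purely for demonstration.
    for line in format_predictions([(1, [3.0, 1.0]), (2, [0.0, 0.0])]):
        print(line)
```

The real evaluator does its work inside MOA's evaluation loop; the sketch only shows the vote-to-probability step and the per-row logging.<br>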
<br>
Hooray for open source machine learning!<br><br>-Thomas<br><br><div class="gmail_quote">On Sun, Jun 6, 2010 at 4:42 PM, Andreas von Hessling <span dir="ltr"><<a href="mailto:vonhessling@gmail.com" target="_blank">vonhessling@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
Thomas,<br>
<br>
Have you finished joining the chance values into the steps? If so,<br>
where can I download this joined_tables.sql.gz file?<br>
(the streams you provide are algebra only -- do you have bridge as<br>
well?) I would like to concatenate your merged results with the<br>
number of skills feature I computed; will then provide this dataset.<br>
<br>
<br>
FYI, I'm trying to run one of the incremental classifiers within weka:<br>
I've started discretizing numeric values for Naive Bayes Updateable<br>
classifier (<a href="http://weka.sourceforge.net/doc/weka/classifiers/bayes/NaiveBayesUpdateable.html" target="_blank">http://weka.sourceforge.net/doc/weka/classifiers/bayes/NaiveBayesUpdateable.html</a>,<br>
also see <a href="http://weka.wikispaces.com/Classifying+large+datasets" target="_blank">http://weka.wikispaces.com/Classifying+large+datasets</a>) using<br>
something like this: (need a lot of memory!)<br>
<br>
java -Xms2048m -Xmx4096m -cp weka.jar<br>
weka.filters.unsupervised.attribute.Discretize<br>
-unset-class-temporarily -F -B 10 -i inputfile -o outputfile<br>
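<br>
For reference, -F selects equal-frequency binning and -B 10 asks for ten bins. A rough Python sketch of that idea follows; it is not Weka's actual cut-point algorithm, and the function names and sample values are made up for illustration:<br>

```python
from bisect import bisect_right


def equal_frequency_cut_points(values, bins=10):
    """Pick up to bins - 1 cut points so each bin holds roughly equal counts."""
    ordered = sorted(values)
    if not ordered:
        return []
    cuts = []
    for k in range(1, bins):
        cut = ordered[k * len(ordered) // bins]
        # Skip duplicate cut points caused by heavily tied values.
        if not cuts or cut > cuts[-1]:
            cuts.append(cut)
    return cuts


def discretize(value, cuts):
    """Map a numeric value to a bin index in the range 0 .. len(cuts)."""
    return bisect_right(cuts, value)


if __name__ == "__main__":
    iq_values = [i / 100 for i in range(1, 101)]  # made-up iq values
    cuts = equal_frequency_cut_points(iq_values, bins=10)
    print([discretize(v, cuts) for v in (0.05, 0.5, 0.95)])
```

Note that this sketch holds all values in memory to sort them, which is part of why the real filter needs so much heap on the full datasets.<br>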
<br>
Similarly, one can then run the NB algorithm incrementally. I haven't<br>
done this yet, but Thomas, this may be an alternative if MOA doesn't<br>
work out.<br>
<br>
Andy<br>
<div><div></div><div><br>
<br>
<br>
On Sun, Jun 6, 2010 at 1:07 AM, Thomas Lotze <<a href="mailto:thomas.lotze@gmail.com" target="_blank">thomas.lotze@gmail.com</a>> wrote:<br>
> All,<br>
><br>
> I've been trying to use MOA to generate a classifier...and while I seem to<br>
> be able to do that, I'm having trouble getting it to actually output<br>
> classifications for new examples, so thought I'd share my current status and<br>
> see if anyone can help.<br>
><br>
> You can download the stream test and train files from<br>
> <a href="http://thomaslotze.com/kdd/streams.tgz" target="_blank">http://thomaslotze.com/kdd/streams.tgz</a><br>
> You can also download the jarfiles needed for MOA at<br>
> <a href="http://thomaslotze.com/kdd/jarfiles.tgz" target="_blank">http://thomaslotze.com/kdd/jarfiles.tgz</a><br>
><br>
> Unpack these all into the same directory. Then, in that directory, using<br>
> the following command, you can create a MOA classifier:<br>
> java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask "LearnModel<br>
> -l MajorityClass -s (ArffFileStream -f atrain.arff -c -1) -O amodel.moa"<br>
><br>
> You can also summarize the test arff file using the following command:<br>
> java -cp .:moa.jar:weka.jar weka.core.Instances atest.arff<br>
><br>
> But I cannot find a command for MOA which will input the amodel.moa model<br>
> and generate predicted classes for atest.arff. The closest I've come is the<br>
> following, which runs amodel.moa on the atest.arff, and must be predicting<br>
> classes and comparing, because it declares how many it got correct:<br>
> java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask<br>
> "EvaluateModel -m file:amodel.moa -s (ArffFileStream -f atest.arff -c -1)"<br>
><br>
> So if anyone can figure it out (I've been using<br>
> <a href="http://www.cs.waikato.ac.nz/%7Eabifet/MOA/Manual.pdf" target="_blank">http://www.cs.waikato.ac.nz/~abifet/MOA/Manual.pdf</a> as a guide), I could<br>
> certainly use some help with this step.<br>
><br>
> Cheers,<br>
> Thomas<br>
><br>
> P.S. If you'd like to get the SQL loaded yourself, you can download<br>
> joined_tables.sql.gz (which was created using get_output.sh). I then used<br>
> run_moa.sh to create the .arff files and try to run MOA.<br>
><br>
> On Sat, Jun 5, 2010 at 2:05 PM, Andreas von Hessling <<a href="mailto:vonhessling@gmail.com" target="_blank">vonhessling@gmail.com</a>><br>
> wrote:<br>
>><br>
>> Mike,<br>
>> We're working on getting the test dataset orthogonalized. Stay tuned.<br>
>> Andy<br>
>><br>
>><br>
>> On Sat, Jun 5, 2010 at 1:55 PM, Mike Schachter <<a href="mailto:mike@mindmech.com" target="_blank">mike@mindmech.com</a>> wrote:<br>
>> > Hey Andy, the input to the classifier I'm trying to produce is<br>
>> > the orthogonalized dataset - i.e. the list of 1000+ columns where<br>
>> > each column has the value of the opportunity for that skill. The<br>
>> > dataset was produced by Erin and is broken into several parts,<br>
>> > for the algebra dataset this looks like:<br>
>> ><br>
>> > algebra-output_partaa<br>
>> > algebra-output_partab<br>
>> > ..<br>
>> > algebra-output_partah<br>
>> ><br>
>> ><br>
>> > You're going to have to orthogonalize the test datasets, which<br>
>> > I don't have a copy of. Erin - are you around? Maybe she can help<br>
>> > you convert the test datasets?<br>
>> ><br>
>> > mike<br>
>> ><br>
>> ><br>
>> ><br>
>> ><br>
>> > On Sat, Jun 5, 2010 at 11:40 AM, Andreas von Hessling<br>
>> > <<a href="mailto:vonhessling@gmail.com" target="_blank">vonhessling@gmail.com</a>> wrote:<br>
>> >><br>
>> >> Sweet, Mike. Please note that we need the row -> clusterid mapping<br>
>> >> for both training AND testing sets. Otherwise it will not help the ML<br>
>> >> algorithms.<br>
>> >> If I understand correctly, your input is the orthogonalized skills.<br>
>> >> So far, the girls only provided these orthogonalizations for the<br>
>> >> training files. I'm computing them for the test sets so you can use<br>
>> >> them. If I don't understand this assumption correctly, please let me<br>
>> >> know so I can use my CPU's cycles for other tasks.<br>
>> >><br>
>> >> Ideally you can provide these cluster mappings by about Sunday, which<br>
>> >> is when I want to start running classifiers. I will need some time to<br>
>> >> actually run the ML algorithms.<br>
>> >><br>
>> >> I now have IQ and IQ strength feature values for all datasets and am<br>
>> >> hoping time permits to compute chance and chance strength values for<br>
>> >> rows.<br>
>> >> Computing # of skills required should not be difficult and I will add<br>
>> >> this feature as well. I plan on sharing my datasets as new versions<br>
>> >> become available.<br>
>> >><br>
>> >> Andy<br>
>> >><br>
>> >><br>
>> >><br>
>> >><br>
>> >> On Fri, Jun 4, 2010 at 1:42 PM, Mike Schachter <<a href="mailto:mike@mindmech.com" target="_blank">mike@mindmech.com</a>><br>
>> >> wrote:<br>
>> >> > So it's taking about 9 hours to create a graph from a 4.4GB file, I'm<br>
>> >> > going to work on improving the code to make it a bit faster, and also<br>
>> >> > am investigating a MapReduce solution.<br>
>> >> ><br>
>> >> > Basically the clustering process can be broken down into two stages:<br>
>> >> ><br>
>> >> > 1) Construct the graph, apply the clustering algorithm to break graph<br>
>> >> > into<br>
>> >> > clusters<br>
>> >> > 2) Apply the clustered graph to the data again to classify each skill<br>
>> >> > set<br>
>> >> ><br>
>> >> > I'll keep working on it and let everyone know how things are going<br>
>> >> > with<br>
>> >> > it,<br>
>> >> > as I mentioned in another email, the source code is in our new<br>
>> >> > sourceforge<br>
>> >> > project's git repository.<br>
>> >> ><br>
>> >> > mike<br>
>> >> ><br>
>> >> ><br>
>> >> ><br>
>> >> ><br>
>> >> > On Thu, Jun 3, 2010 at 7:48 PM, Mike Schachter <<a href="mailto:mike@mindmech.com" target="_blank">mike@mindmech.com</a>><br>
>> >> > wrote:<br>
>> >> >><br>
>> >> >> Sounds like you're making great progress! I'll be working on the<br>
>> >> >> graph clustering algorithm for the skill set tonight and will keep<br>
>> >> >> you posted on how things are going.<br>
>> >> >><br>
>> >> >> mike<br>
>> >> >><br>
>> >> >><br>
>> >> >><br>
>> >> >><br>
>> >> >> On Thu, Jun 3, 2010 at 6:17 PM, Andreas von Hessling<br>
>> >> >> <<a href="mailto:vonhessling@gmail.com" target="_blank">vonhessling@gmail.com</a>> wrote:<br>
>> >> >>><br>
>> >> >>> Doing a few basic tricks, I catapulted the submission into the 50th<br>
>> >> >>> percentile. That is not even running any ML algorithm.<br>
>> >> >>><br>
>> >> >>> I'm planning on running the NaiveBayesUpdateable classifier<br>
>> >> >>> (<a href="http://weka.wikispaces.com/Classifying+large+datasets" target="_blank">http://weka.wikispaces.com/Classifying+large+datasets</a>) over<br>
>> >> >>> discretized IQ/IQ strength/Chance/Chance strength from the command<br>
>> >> >>> line to evaluate performance. Another attempt would be to load all<br>
>> >> >>> data into memory (<3GB, even for full Bridge Train) and run SVMlib<br>
>> >> >>> over it.<br>
>> >> >>><br>
>> >> >>> If someone wants to try MOA<br>
>> >> >>> (<a href="http://www.cs.waikato.ac.nz/%7Eabifet/MOA/index.html" target="_blank">http://www.cs.waikato.ac.nz/~abifet/MOA/index.html</a>), this would be<br>
>> >> >>> helpful also in the long run (at least a tutorial how to set it up<br>
>> >> >>> and<br>
>> >> >>> run).<br>
>> >> >>><br>
>> >> >>> The reduced datasets plus the IQ values are linked on the wiki:<br>
>> >> >>> Features<br>
>> >> >>> are:<br>
>> >> >>> ...> row INT,<br>
>> >> >>> ...> studentid VARCHAR(30),<br>
>> >> >>> ...> problemhierarchy TEXT,<br>
>> >> >>> ...> problemname TEXT,<br>
>> >> >>> ...> problemview INT,<br>
>> >> >>> ...> problemstepname TEXT,<br>
>> >> >>> ...> cfa INT,<br>
>> >> >>> ...> iq REAL<br>
>> >> >>><br>
>> >> >>> IQ strength (number of attempts per student) should be available<br>
>> >> >>> soon.<br>
>> >> >>> (perhaps add'l features will become available as well)<br>
>> >> >>><br>
>> >> >>> I'm still hoping somebody could cluster Erin's normalized skills<br>
>> >> >>> data<br>
>> >> >>> and provide a row -> cluster id mapping for algebra and bridge<br>
>> >> >>> train<br>
>> >> >>> and test sets (I don't have the data any more).<br>
>> >> >>><br>
>> >> >>> Andy<br>
>> >> >>> _______________________________________________<br>
>> >> >>> ml mailing list<br>
>> >> >>> <a href="mailto:ml@lists.noisebridge.net" target="_blank">ml@lists.noisebridge.net</a><br>
>> >> >>> <a href="https://www.noisebridge.net/mailman/listinfo/ml" target="_blank">https://www.noisebridge.net/mailman/listinfo/ml</a><br>
>> >> >><br>
>> >> ><br>
>> >> ><br>
>> >> ><br>
>> >> ><br>
>> ><br>
>> ><br>
><br>
><br>
><br>
><br>
</div></div></blockquote></div><br>