Hey Andy,

I'm making good progress with the clustered skills. I generated a graph
from all the algebra data, and am currently creating a clustered graph to
use as a classifier. The next step is to classify the entire algebra
dataset, then use the Python script to normalize the test dataset and
classify that. It's looking like I'll have this to you later in the day;
I'll keep you posted!
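
For context, here is roughly the shape of that pipeline in Python -- a
simplified, untested sketch, not the real code (which lives in our
sourceforge project's git repository). The file name, the KC column name
and the '~~' separator are assumptions about the data files, and the
connected-components step is only a stand-in for the actual clustering
algorithm:

    import csv
    from collections import defaultdict

    KC_COLUMN = 'KC(Default)'   # placeholder; use the KC column name from the actual file
    SEPARATOR = '~~'            # assumed separator between multiple skills in one step

    def skills_of(row):
        return [s for s in row.get(KC_COLUMN, '').split(SEPARATOR) if s]

    # Stage 1: build a skill co-occurrence graph from the training rows.
    edge_counts = defaultdict(int)
    with open('algebra_train.txt') as f:     # placeholder file name
        for row in csv.DictReader(f, delimiter='\t'):
            sk = skills_of(row)
            for i in range(len(sk)):
                for j in range(i + 1, len(sk)):
                    if sk[i] != sk[j]:
                        edge_counts[frozenset((sk[i], sk[j]))] += 1

    # Stage 1 (cont.): break the graph into clusters. Connected components over
    # frequently co-occurring skills stand in here for the real clustering step.
    MIN_COOCCUR = 5
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for pair, count in edge_counts.items():
        if count >= MIN_COOCCUR:
            a, b = tuple(pair)
            parent[find(a)] = find(b)

    # Stage 2: classify each row's skill set by the majority cluster of its skills.
    def cluster_of(skill_set):
        votes = defaultdict(int)
        for s in skill_set:
            votes[find(s)] += 1
        return max(votes, key=votes.get) if votes else None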

 mike

On Mon, Jun 7, 2010 at 9:21 AM, Andreas von Hessling <vonhessling@gmail.com> wrote:
Current status: We're currently trying to improve our scores by
adding more features; Thomas is supplying me with MOA predictions,
which I then get scored. I'm also adding new features, and eventually
I will be running Weka's incremental classifiers.

As it is right now, we cannot make use of the orthogonalized datasets,
since we are unable to orthogonalize the test datasets as well -- this
has not been done before, and we are currently running into technical
issues. If somebody with a Python background wants to look at this
today, it would be very helpful. At the moment I feel we are limited
by the features we have computed; the clustered skills would be
very valuable.
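
To make the Python ask concrete, here is the rough shape of the script I
have in mind -- an untested sketch, with the file names, column names
(including the 'Row' column) and the '~~' separator all being assumptions
about the file format. The important part is that the test file must be
pivoted against the skill vocabulary of the training file, so that train
and test end up with exactly the same columns:

    import csv

    KC_COL = 'KC(Default)'            # placeholder KC column name
    OPP_COL = 'Opportunity(Default)'  # placeholder opportunity column name
    SEP = '~~'                        # assumed separator between multiple skills

    def load_vocab(train_file):
        # The skill vocabulary must come from the TRAINING file so that the
        # train and test matrices get exactly the same columns.
        vocab = {}
        with open(train_file) as f:
            for row in csv.DictReader(f, delimiter='\t'):
                for skill in row[KC_COL].split(SEP):
                    if skill and skill not in vocab:
                        vocab[skill] = len(vocab)
        return vocab

    def orthogonalize(in_file, out_file, vocab):
        # One output column per skill, holding that skill's opportunity count
        # for the row (0 when the skill is not involved in the step).
        skills_in_order = sorted(vocab, key=vocab.get)
        with open(in_file) as f, open(out_file, 'w') as out:
            out.write('\t'.join(['Row'] + skills_in_order) + '\n')
            for row in csv.DictReader(f, delimiter='\t'):
                cols = [0] * len(vocab)
                opps = row.get(OPP_COL, '').split(SEP)
                for skill, opp in zip(row[KC_COL].split(SEP), opps):
                    if skill in vocab:
                        cols[vocab[skill]] = int(opp) if opp.isdigit() else 0
                out.write('\t'.join([row['Row']] + [str(c) for c in cols]) + '\n')

    vocab = load_vocab('algebra_train.txt')
    orthogonalize('algebra_test.txt', 'algebra_test_ortho.txt', vocab)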

Andy
On Sun, Jun 6, 2010 at 5:41 PM, Thomas Lotze <thomas.lotze@gmail.com> wrote:
> I love open source software!
>
> The final predicted output (using iq and score as predictors, under a Naive
> Bayes model) for algebra and bridge (suitable, I believe, for submission) is
> available at http://thomaslotze.com/kdd/output.tgz
>
> The streams.tgz and jarfiles.tgz have been updated with streams for bridge
> and my newly compiled "moa_personal.jar" jarfile.
>
> run_moa.sh should have all the steps needed to duplicate this in MOA
> yourself (after creating or importing the SQL tables) -- I've also put up
> MOA instructions on the wiki at
> https://www.noisebridge.net/wiki/Machine_Learning/moa
>
> Summary: since the MOA code was available on SourceForge, I was able to
> create a new ClassificationPerformanceEvaluator (called
> BasicLoggingClassificationPerformanceEvaluator) and recompile the MOA
> distribution into moa_personal.jar. This evaluator lets us print out the
> row number and the predicted probability of cfa. It is still pretty
> hard-coded to the KDD dataset, but I think I can turn it into a more
> general task/evaluator for future use (and potentially for inclusion back
> into the MOA trunk). In any case, it should work for now.
>
> Hooray for open source machine learning!
>
> -Thomas
>
> On Sun, Jun 6, 2010 at 4:42 PM, Andreas von Hessling <vonhessling@gmail.com> wrote:
>>
>> Thomas,
>>
>> Have you finished joining the chance values into the steps? If so,
>> where can I download this joined_tables.sql.gz file?
>> (The streams you provide are algebra only -- do you have bridge as
>> well?) I would like to concatenate your merged results with the
>> number-of-skills feature I computed; I will then provide this dataset.
>>
>>
>> FYI, I'm trying to run one of the incremental classifiers within Weka:
>> I've started discretizing numeric values for the NaiveBayesUpdateable
>> classifier
>> (http://weka.sourceforge.net/doc/weka/classifiers/bayes/NaiveBayesUpdateable.html,
>> also see http://weka.wikispaces.com/Classifying+large+datasets) using
>> something like this (it needs a lot of memory!):
>>
>> java -Xms2048m -Xmx4096m -cp weka.jar
>> weka.filters.unsupervised.attribute.Discretize
>> -unset-class-temporarily -F -B 10 -i inputfile -o outputfile
>>
>> Similarly, one can then run the NB algorithm incrementally. I haven't
>> done this yet, but Thomas, this may be an alternative if MOA doesn't
>> work out.
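>>
>> In case it helps to see the idea outside of Weka, here is a toy Python
>> sketch of the same scheme -- equal-frequency discretization plus Naive
>> Bayes counts updated one row at a time, so the full dataset never has to
>> sit in memory. This is illustrative only, not the Weka internals:
>>
>>     from collections import defaultdict
>>
>>     def equal_frequency_cuts(values, n_bins=10):
>>         # Cut points so each bin holds roughly the same number of values
>>         # (roughly what Weka's Discretize does with -F -B 10).
>>         s = sorted(values)
>>         return [s[len(s) * k // n_bins] for k in range(1, n_bins)]
>>
>>     def to_bin(x, cuts):
>>         return sum(x > c for c in cuts)
>>
>>     class IncrementalNB:
>>         def __init__(self, n_bins=10):
>>             self.n_bins = n_bins
>>             self.class_counts = defaultdict(int)
>>             self.feature_counts = defaultdict(int)  # (label, feature, bin)
>>
>>         def update(self, bins, label):
>>             # One training row at a time; nothing else is kept in memory.
>>             self.class_counts[label] += 1
>>             for i, b in enumerate(bins):
>>                 self.feature_counts[(label, i, b)] += 1
>>
>>         def score(self, bins, label):
>>             # Unnormalized P(label) * prod P(bin | label), Laplace-smoothed.
>>             total = sum(self.class_counts.values())
>>             p = self.class_counts[label] / total
>>             for i, b in enumerate(bins):
>>                 p *= ((self.feature_counts[(label, i, b)] + 1) /
>>                       (self.class_counts[label] + self.n_bins))
>>             return p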
>>
>> Andy
>>
>> On Sun, Jun 6, 2010 at 1:07 AM, Thomas Lotze <thomas.lotze@gmail.com> wrote:
>> > All,
>> >
>> > I've been trying to use MOA to generate a classifier... and while I seem
>> > to be able to do that, I'm having trouble getting it to actually output
>> > classifications for new examples, so I thought I'd share my current
>> > status and see if anyone can help.
>> >
>> > You can download the stream test and train files from
>> > http://thomaslotze.com/kdd/streams.tgz
>> > You can also download the jarfiles needed for MOA at
>> > http://thomaslotze.com/kdd/jarfiles.tgz
>> >
>> > Unpack these all into the same directory. Then, in that directory, you
>> > can create a MOA classifier with the following command:
>> > java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask "LearnModel
>> > -l MajorityClass -s (ArffFileStream -f atrain.arff -c -1) -O amodel.moa"
>> >
>> > You can also summarize the test arff file using the following command:
>> > java -cp .:moa.jar:weka.jar weka.core.Instances atest.arff
>> >
>> > But I cannot find a command for MOA that will take the amodel.moa model
>> > as input and generate predicted classes for atest.arff. The closest I've
>> > come is the following, which runs amodel.moa on atest.arff and must be
>> > predicting classes and comparing them, because it reports how many it
>> > got correct:
>> > java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask
>> > "EvaluateModel -m file:amodel.moa -s (ArffFileStream -f atest.arff -c -1)"
>> >
>> > So if anyone can figure it out (I've been using
>> > http://www.cs.waikato.ac.nz/~abifet/MOA/Manual.pdf as a guide), I could
>> > certainly use some help with this step.
>> >
>> > Cheers,
>> > Thomas
>> >
>> > P.S. If you'd like to get the SQL loaded yourself, you can download
>> > joined_tables.sql.gz (which was created using get_output.sh). I then
>> > used run_moa.sh to create the .arff files and try to run MOA.
>> >
>> > On Sat, Jun 5, 2010 at 2:05 PM, Andreas von Hessling <vonhessling@gmail.com> wrote:
>> >>
>> >> Mike,
>> >> We're working on getting the test dataset orthogonalized. Stay tuned.
>> >> Andy
>> >>
>> >> On Sat, Jun 5, 2010 at 1:55 PM, Mike Schachter <mike@mindmech.com> wrote:
>> >> > Hey Andy, the input to the classifier I'm trying to produce is
>> >> > the orthogonalized dataset -- i.e. the list of 1000+ columns where
>> >> > each column has the value of the opportunity for that skill. The
>> >> > dataset was produced by Erin and is broken into several parts;
>> >> > for the algebra dataset this looks like:
>> >> >
>> >> > algebra-output_partaa
>> >> > algebra-output_partab
>> >> > ..
>> >> > algebra-output_partah
>> >> >
>> >> >
>> >> > You're going to have to orthogonalize the test datasets, which
>> >> > I don't have a copy of. Erin -- are you around? Maybe she can help
>> >> > you convert the test datasets?
>> >> >
>> >> > mike
>> >> >
>> >> > On Sat, Jun 5, 2010 at 11:40 AM, Andreas von Hessling <vonhessling@gmail.com> wrote:
>> >> >>
>> >> >> Sweet, Mike. Please note that we need the row -> clusterid mapping
>> >> >> for both the training AND test sets; otherwise it will not help the
>> >> >> ML algorithms.
>> >> >> If I understand correctly, your input is the orthogonalized skills.
>> >> >> So far, the girls have only provided these orthogonalizations for the
>> >> >> training files; I'm computing them for the test sets so you can use
>> >> >> them. If this assumption is wrong, please let me know so I can use
>> >> >> my CPU cycles for other tasks.
>> >> >>
>> >> >> Ideally you can provide these cluster mappings by about Sunday,
>> >> >> which is when I want to start running classifiers. I will need some
>> >> >> time to actually run the ML algorithms.
>> >> >>
>> >> >> I now have IQ and IQ-strength feature values for all datasets, and am
>> >> >> hoping time permits computing chance and chance-strength values for
>> >> >> the rows.
>> >> >> Computing the number of skills required should not be difficult, and I
>> >> >> will add this feature as well. I plan on sharing my datasets as new
>> >> >> versions become available.
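>> >> >>
>> >> >> For the skill-count feature, something along these lines should do;
>> >> >> this is just a sketch and assumes the KC column separates multiple
>> >> >> skills with '~~', which is worth double-checking against the raw files:
>> >> >>
>> >> >>     def num_skills(kc_value, sep='~~'):
>> >> >>         # Count the non-empty skill names listed for a step.
>> >> >>         return len([s for s in kc_value.split(sep) if s])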
>> >> >>
>> >> >> Andy
>> >> >>
>> >> >> On Fri, Jun 4, 2010 at 1:42 PM, Mike Schachter <mike@mindmech.com> wrote:
>> >> >> > So it's taking about 9 hours to create a graph from a 4.4GB file.
>> >> >> > I'm going to work on improving the code to make it a bit faster,
>> >> >> > and I'm also investigating a MapReduce solution.
>> >> >> >
>> >> >> > Basically, the clustering process can be broken down into two stages:
>> >> >> >
>> >> >> > 1) Construct the graph, then apply the clustering algorithm to break
>> >> >> > the graph into clusters
>> >> >> > 2) Apply the clustered graph to the data again to classify each
>> >> >> > skill set
>> >> >> >
>> >> >> > I'll keep working on it and let everyone know how things are going;
>> >> >> > as I mentioned in another email, the source code is in our new
>> >> >> > SourceForge project's git repository.
>> >> >> >
>> >> >> > mike
>> >> >> >
>> >> >> > On Thu, Jun 3, 2010 at 7:48 PM, Mike Schachter <mike@mindmech.com> wrote:
>> >> >> >>
>> >> >> >> Sounds like you're making great progress! I'll be working on the
>> >> >> >> graph clustering algorithm for the skill set tonight and will keep
>> >> >> >> you posted on how things are going.
>> >> >> >>
>> >> >> >> mike
>> >> >> >>
>> >> >> >> On Thu, Jun 3, 2010 at 6:17 PM, Andreas von Hessling <vonhessling@gmail.com> wrote:
>> >> >> >>>
>> >> >> >>> With a few basic tricks, I catapulted the submission into the
>> >> >> >>> 50th percentile. That's without even running any ML algorithm.
>> >> >> >>>
>> >> >> >>> I'm planning on running the NaiveBayesUpdateable classifier
>> >> >> >>> (http://weka.wikispaces.com/Classifying+large+datasets) over the
>> >> >> >>> discretized IQ / IQ strength / chance / chance strength features
>> >> >> >>> from the command line to evaluate performance. Another option
>> >> >> >>> would be to load all the data into memory (<3GB, even for the full
>> >> >> >>> Bridge Train set) and run SVMlib over it.
>> >> >> >>>
>> >> >> >>> If someone wants to try MOA
>> >> >> >>> (http://www.cs.waikato.ac.nz/~abifet/MOA/index.html), that would
>> >> >> >>> also be helpful in the long run (at least a tutorial on how to set
>> >> >> >>> it up and run it).
>> >> >> >>>
>> >> >> >>> The reduced datasets plus the IQ values are linked on the wiki.
>> >> >> >>> The features are:
>> >> >> >>>   row INT,
>> >> >> >>>   studentid VARCHAR(30),
>> >> >> >>>   problemhierarchy TEXT,
>> >> >> >>>   problemname TEXT,
>> >> >> >>>   problemview INT,
>> >> >> >>>   problemstepname TEXT,
>> >> >> >>>   cfa INT,
>> >> >> >>>   iq REAL
>> >> >> >>>
>> >> >> >>> IQ strength (the number of attempts per student) should be
>> >> >> >>> available soon. (Perhaps additional features will become available
>> >> >> >>> as well.)
>> >> >> >>>
>> >> >> >>> I'm still hoping somebody could cluster Erin's normalized skills
>> >> >> >>> data and provide a row -> cluster id mapping for the algebra and
>> >> >> >>> bridge train and test sets (I don't have the data any more).
>> >> >> >>>
>> >> >> >>> Andy
_______________________________________________
ml mailing list
ml@lists.noisebridge.net
https://www.noisebridge.net/mailman/listinfo/ml