[ml] KDD cup submission status

Mon Jun 7 22:59:02 UTC 2010

Oops, the previous dataset I announced was in .csv format and the
commas messed up the data.  I've linked a new zip file in
tab-separated format from the wiki for download.  Uploading now.
MD5 (4skillsAddedNoDiscretization.zip) = dd6da9163dff5a570a80ec9bc8eaaedd
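
Since the zip was re-uploaded under the same filename, it may be worth checking which version a download actually is.  A minimal Python sketch, assuming the zip sits on local disk; the helper names md5_of_file and matches_announced are made up for illustration:

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to keep memory flat."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_announced(path, announced_hex):
    """Compare a local file against a checksum pasted from an email."""
    return md5_of_file(path) == announced_hex.strip().lower()
```

Calling matches_announced("4skillsAddedNoDiscretization.zip", "dd6da9163dff5a570a80ec9bc8eaaedd") should return True only for the tab-separated upload.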

On Mon, Jun 7, 2010 at 1:23 PM, Andreas von Hessling
<vonhessling at gmail.com> wrote:
> I've added the latest datasets to the wiki (uploading for about
> another half an hour).  They contain step success chance and # of
> skills values.  Numeric values are not discretized.  (I have started
> discretizing them for the Naive Bayes algorithm, though.)
>
> MD5 (4skillsAddedNoDiscretization.zip) = bb70e584f729b0b0c1edba14eff45b73
>
> If we can do so in time, we will add the clustered skills feature as
> well, but that's it.  Let the algorithms run free!
>
> BTW, the evaluation website seems to be slowing down under the
> increased load just before the deadline.  Something to consider.
>
> Andy
>
>
>
> On Sun, Jun 6, 2010 at 5:41 PM, Thomas Lotze <thomas.lotze at gmail.com> wrote:
>> I love open source software!
>>
>> The final predicted output (using iq and score as predictors, under a Naive
>> Bayes model) for algebra and bridge (suitable, I believe, for submission) is
>> available in http://thomaslotze.com/kdd/output.tgz
>>
>> The streams.tgz and jarfiles.tgz have been updated with streams for bridge
>> and my newly-compiled "moa_personal.jar" jarfile.
>>
>> run_moa.sh should have all the steps needed to duplicate this in MOA
>> yourself (after creating or importing the SQL tables) -- I've also put up
>> MOA instructions on the wiki at
>> https://www.noisebridge.net/wiki/Machine_Learning/moa
>>
>> Summary: since the MOA code was available on sourceforge, I was able to
>> create a new ClassificationPerformanceEvaluator (called
>> BasicLoggingClassificationPerformanceEvaluator) and re-compile the MOA
>> distribution into moa_personal.jar.  This allows us to use the new
>> evaluator to print out the row number and predicted probability of cfa.  The
>> evaluator is currently pretty hard-coded for the KDD dataset, but
>> I think I can modify it into a more general task/evaluator for use in the
>> future (and potentially for inclusion back into the MOA trunk).  In any
>> case, it should work for now.
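
For anyone who wants to see what "predicted probability of cfa under a Naive Bayes model" amounts to, here is a rough Python sketch.  It is not MOA's implementation: it assumes the predictors have already been discretized into categories, uses add-one smoothing, and the function names are invented for illustration.

```python
from collections import Counter, defaultdict


def train_nb(rows, features, label="cfa"):
    """Count class and per-class feature-value frequencies from dict rows."""
    class_counts = Counter()
    value_counts = defaultdict(Counter)  # (feature, class) -> value counts
    distinct = defaultdict(set)          # feature -> values seen in training
    for row in rows:
        cls = row[label]
        class_counts[cls] += 1
        for feat in features:
            value_counts[(feat, cls)][row[feat]] += 1
            distinct[feat].add(row[feat])
    return class_counts, value_counts, distinct, list(features)


def predict_proba(model, row, positive=1):
    """Return P(label == positive | row) with add-one (Laplace) smoothing."""
    class_counts, value_counts, distinct, features = model
    total = sum(class_counts.values())
    scores = {}
    for cls, count in class_counts.items():
        score = count / total  # class prior
        for feat in features:
            seen = value_counts[(feat, cls)][row[feat]]
            score *= (seen + 1) / (count + len(distinct[feat]))
        scores[cls] = score
    # Normalize so the per-class scores sum to one.
    return scores.get(positive, 0.0) / sum(scores.values())
```

Looping over the test rows and printing each row number next to predict_proba(model, row) would give the same kind of output the evaluator logs.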
>>
>> Hooray for open source machine learning!
>>
>> -Thomas
>>
>> On Sun, Jun 6, 2010 at 4:42 PM, Andreas von Hessling <vonhessling at gmail.com>
>> wrote:
>>>
>>> Thomas,
>>>
>>> Have you finished joining the chance values into the steps?  If so,
>>> where can I download this joined_tables.sql.gz file?
>>> (the streams you provide are algebra only -- do you have bridge as
>>> well?)  I would like to concatenate your merged results with the
>>> number of skills feature I computed; will then provide this dataset.
>>>
>>>
>>> FYI, I'm trying to run one of the incremental classifiers within Weka:
>>> I've started discretizing numeric values for the NaiveBayesUpdateable
>>> classifier
>>> (http://weka.sourceforge.net/doc/weka/classifiers/bayes/NaiveBayesUpdateable.html,
>>> also see http://weka.wikispaces.com/Classifying+large+datasets) using
>>> something like this (it needs a lot of memory!):
>>>
>>> java -Xms2048m -Xmx4096m -cp weka.jar weka.filters.unsupervised.attribute.Discretize -unset-class-temporarily -F -B 10 -i inputfile -o outputfile
>>>
>>> Similarly, one can then run the NB algorithm incrementally.  I haven't
>>> done this yet, but Thomas, this may be an alternative if MOA doesn't
>>> work out.
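
In the Discretize command above, -F selects equal-frequency binning and -B 10 asks for ten bins.  The idea can be sketched in Python as follows; this is only an illustration, it does not reproduce Weka's exact cut-point rules, and the function names are hypothetical:

```python
from bisect import bisect_right


def equal_frequency_cuts(values, bins=10):
    """Pick cut points so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    if not ordered:
        return []
    cuts = []
    for i in range(1, bins):
        candidate = ordered[i * len(ordered) // bins]
        # Skip repeated cut points, which heavy ties in the data can produce.
        if not cuts or candidate > cuts[-1]:
            cuts.append(candidate)
    return cuts


def discretize(value, cuts):
    """Map a numeric value to the index of the bin it falls into."""
    return bisect_right(cuts, value)
```

The cut points would be computed once on the training column and then reused for the test column, so both sets share the same bins.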
>>>
>>> Andy
>>>
>>>
>>>
>>> On Sun, Jun 6, 2010 at 1:07 AM, Thomas Lotze <thomas.lotze at gmail.com>
>>> wrote:
>>> > All,
>>> >
>>> > I've been trying to use MOA to generate a classifier...and while I
>>> > seem to be able to do that, I'm having trouble getting it to actually
>>> > output classifications for new examples, so thought I'd share my
>>> > current status and see if anyone can help.
>>> >
>>> > You can download the stream test and train files from
>>> > http://thomaslotze.com/kdd/streams.tgz
>>> > You can also download the jarfiles needed for MOA at
>>> > http://thomaslotze.com/kdd/jarfiles.tgz
>>> >
>>> > Unpack these all into the same directory.  Then, in that directory,
>>> > using the following command, you can create a MOA classifier:
>>> > java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask "LearnModel -l MajorityClass -s (ArffFileStream -f atrain.arff -c -1) -O amodel.moa"
>>> >
>>> > You can also summarize the test arff file using the following command:
>>> > java -cp .:moa.jar:weka.jar weka.core.Instances atest.arff
>>> >
>>> > But I cannot find a command for MOA which will input the amodel.moa
>>> > model and generate predicted classes for atest.arff.  The closest I've
>>> > come is the following, which runs amodel.moa on atest.arff, and must
>>> > be predicting classes and comparing, because it declares how many it
>>> > got correct:
>>> > java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask "EvaluateModel -m file:amodel.moa -s (ArffFileStream -f atest.arff -c -1)"
>>> >
>>> > So if anyone can figure it out (I've been using
>>> > http://www.cs.waikato.ac.nz/~abifet/MOA/Manual.pdf as a guide), I could
>>> > certainly use some help with this step.
>>> >
>>> > Cheers,
>>> > Thomas
>>> >
>>> > P.S. If you'd like to get the SQL loaded yourself, you can download
>>> > joined_tables.sql.gz (which was created using get_output.sh).  I then
>>> > used run_moa.sh to create the .arff files and try to run MOA.
>>> >
>>> > On Sat, Jun 5, 2010 at 2:05 PM, Andreas von Hessling
>>> > <vonhessling at gmail.com>
>>> > wrote:
>>> >>
>>> >> Mike,
>>> >> We're working on getting the test dataset orthogonalized.  Stay tuned.
>>> >> Andy
>>> >>
>>> >>
>>> >> On Sat, Jun 5, 2010 at 1:55 PM, Mike Schachter <mike at mindmech.com>
>>> >> wrote:
>>> >> > Hey Andy, the input to the classifier I'm trying to produce is
>>> >> > the orthogonalized dataset - i.e. the list of 1000+ columns where
>>> >> > each column has the value of the opportunity for that skill. The
>>> >> > dataset was produced by Erin and is broken into several parts.
>>> >> > For the algebra dataset this looks like:
>>> >> >
>>> >> > algebra-output_partaa
>>> >> > algebra-output_partab
>>> >> > ..
>>> >> > algebra-output_partah
>>> >> >
>>> >> >
>>> >> > You're going to have to orthogonalize the test datasets, which
>>> >> > I don't have a copy of. Erin - are you around? Maybe she can help
>>> >> > you convert the test datasets?
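
For reference, the orthogonalization described above (one column per skill, holding that skill's opportunity value) can be sketched in Python as below.  This is a guess at the shape of the transformation, not Erin's actual script; it assumes each row carries a skills field and a parallel opportunity field joined by "~~", and the function name is hypothetical.

```python
def orthogonalize(rows, sep="~~"):
    """Turn (skills, opportunities) string pairs into one column per skill.

    Returns the sorted skill names and a matrix where each cell is the
    opportunity count for that skill, or 0 if the row does not use it.
    """
    parsed = []
    for skills_field, opps_field in rows:
        skills = skills_field.split(sep) if skills_field else []
        opps = opps_field.split(sep) if opps_field else []
        parsed.append({skill: int(opp) for skill, opp in zip(skills, opps)})
    columns = sorted({skill for row in parsed for skill in row})
    matrix = [[row.get(skill, 0) for skill in columns] for row in parsed]
    return columns, matrix
```

To keep train and test aligned, the column list from the training set would need to be reused when converting the test set.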
>>> >> >
>>> >> >   mike
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> > On Sat, Jun 5, 2010 at 11:40 AM, Andreas von Hessling
>>> >> > <vonhessling at gmail.com> wrote:
>>> >> >>
>>> >> >> Sweet, Mike.  Please note that we need the row -> clusterid mapping
>>> >> >> for both training AND testing sets.  Otherwise it will not help the
>>> >> >> ML algorithms.
>>> >> >> If I understand correctly, your input is the orthogonalized skills.
>>> >> >> So far, the girls only provided these orthogonalizations for the
>>> >> >> training files.  I'm computing them for the test sets so you can use
>>> >> >> them.  If this assumption is wrong, please let me know so I can use
>>> >> >> my CPU's cycles for other tasks.
>>> >> >>
>>> >> >> Ideally you can provide these cluster mappings by about Sunday,
>>> >> >> which is when I want to start running classifiers.  I will need some
>>> >> >> time to actually run the ML algorithms.
>>> >> >>
>>> >> >> I now have IQ and IQ strength feature values for all datasets and am
>>> >> >> hoping time permits computing chance and chance strength values for
>>> >> >> rows.
>>> >> >> Computing # of skills required should not be difficult and I will
>>> >> >> add this feature as well.  I plan on sharing my datasets as new
>>> >> >> versions become available.
>>> >> >>
>>> >> >> Andy
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> On Fri, Jun 4, 2010 at 1:42 PM, Mike Schachter <mike at mindmech.com>
>>> >> >> wrote:
>>> >> >> > So it's taking about 9 hours to create a graph from a 4.4GB file.
>>> >> >> > I'm going to work on improving the code to make it a bit faster,
>>> >> >> > and am also investigating a MapReduce solution.
>>> >> >> >
>>> >> >> > Basically the clustering process can be broken down into two
>>> >> >> > stages:
>>> >> >> >
>>> >> >> > 1) Construct the graph, apply the clustering algorithm to break the
>>> >> >> >    graph into clusters
>>> >> >> > 2) Apply the clustered graph to the data again to classify each
>>> >> >> >    skill set
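
The two stages above can be sketched in Python roughly as follows.  The thread does not say which clustering algorithm is used, so this sketch swaps in the simplest possible one (connected components of the skill co-occurrence graph, via union-find); the function names are invented for illustration.

```python
def cluster_skills(rows):
    """Stage 1: link skills that co-occur in a row, then label components."""
    parent = {}

    def find(skill):
        parent.setdefault(skill, skill)
        while parent[skill] != skill:
            parent[skill] = parent[parent[skill]]  # path halving
            skill = parent[skill]
        return skill

    for skills in rows:
        skills = list(skills)
        for skill in skills:
            find(skill)  # register every skill, even ones that appear alone
        for other in skills[1:]:
            root_a, root_b = find(skills[0]), find(other)
            if root_a != root_b:
                parent[root_b] = root_a

    cluster_of_root = {}
    return {
        skill: cluster_of_root.setdefault(find(skill), len(cluster_of_root))
        for skill in list(parent)
    }


def row_cluster_ids(rows, skill_to_cluster):
    """Stage 2: map each row's skill set to a cluster id (None if no skills)."""
    return [
        skill_to_cluster[next(iter(skills))] if skills else None
        for skills in rows
    ]
```

With connected components, all skills in a row end up in the same cluster, so looking up any one skill is enough; a real graph clustering algorithm would likely need a majority vote over the row's skills instead.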
>>> >> >> >
>>> >> >> > I'll keep working on it and let everyone know how things are going.
>>> >> >> > As I mentioned in another email, the source code is in our new
>>> >> >> > sourceforge project's git repository.
>>> >> >> >
>>> >> >> >  mike
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > On Thu, Jun 3, 2010 at 7:48 PM, Mike Schachter <mike at mindmech.com>
>>> >> >> > wrote:
>>> >> >> >>
>>> >> >> >> Sounds like you're making great progress! I'll be working on the
>>> >> >> >> graph clustering algorithm for the skill set tonight and will
>>> >> >> >> keep you posted on how things are going.
>>> >> >> >>
>>> >> >> >>   mike
>>> >> >> >>
>>> >> >> >>
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> On Thu, Jun 3, 2010 at 6:17 PM, Andreas von Hessling
>>> >> >> >> <vonhessling at gmail.com> wrote:
>>> >> >> >>>
>>> >> >> >>> Doing a few basic tricks, I catapulted the submission into the
>>> >> >> >>> 50th percentile.  That is without even running any ML algorithm.
>>> >> >> >>>
>>> >> >> >>> I'm planning on running the NaiveBayesUpdateable classifier
>>> >> >> >>> (http://weka.wikispaces.com/Classifying+large+datasets) over
>>> >> >> >>> discretized IQ/IQ strength/Chance/Chance strength from the
>>> >> >> >>> command line to evaluate performance.  Another attempt would be
>>> >> >> >>> to load all data into memory (<3GB, even for full Bridge Train)
>>> >> >> >>> and run LIBSVM over it.
>>> >> >> >>>
>>> >> >> >>> If someone wants to try MOA
>>> >> >> >>> (http://www.cs.waikato.ac.nz/~abifet/MOA/index.html), this would
>>> >> >> >>> also be helpful in the long run (at least a tutorial on how to
>>> >> >> >>> set it up and run it).
>>> >> >> >>>
>>> >> >> >>> The reduced datasets plus the IQ values are linked on the wiki.
>>> >> >> >>> Features are:
>>> >> >> >>>   row INT,
>>> >> >> >>>   studentid VARCHAR(30),
>>> >> >> >>>   problemhierarchy TEXT,
>>> >> >> >>>   problemname TEXT,
>>> >> >> >>>   problemview INT,
>>> >> >> >>>   problemstepname TEXT,
>>> >> >> >>>   cfa INT,
>>> >> >> >>>   iq REAL
>>> >> >> >>>
>>> >> >> >>> IQ strength (number of attempts per student) should be available
>>> >> >> >>> soon (perhaps add'l features will become available as well).
>>> >> >> >>>
>>> >> >> >>> I'm still hoping somebody could cluster Erin's normalized skills
>>> >> >> >>> data and provide a row -> cluster id mapping for algebra and
>>> >> >> >>> bridge train and test sets (I don't have the data any more).
>>> >> >> >>>
>>> >> >> >>> Andy
>>> >> >> >>> _______________________________________________
>>> >> >> >>> ml mailing list
>>> >> >> >>> ml at lists.noisebridge.net
>>> >> >> >>> https://www.noisebridge.net/mailman/listinfo/ml