[ml] KDD cup submission status

Sun Jun 6 23:42:51 UTC 2010

Thomas,

Have you finished joining the chance values into the steps?  If so,
where can I download the joined_tables.sql.gz file?
(The streams you provide are algebra only -- do you have bridge as
well?)  I would like to concatenate your merged results with the
number-of-skills feature I computed; I will then share the combined
dataset.
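
To be concrete about the concatenation, here is a rough Python sketch of
appending a per-row skill count to merged records, matching on the row id.
Only "row" and "iq" come from the schema posted earlier; the "chance" and
"num_skills" field names, the helper name, and the sample values are made up
for illustration.

```python
def join_on_row(merged_rows, skill_counts, default=0):
    """Append the number-of-skills value to each merged record, matched on row id."""
    joined = []
    for record in merged_rows:
        combined = dict(record)  # copy so the input records are left untouched
        # "num_skills" is a hypothetical column name for the skill-count feature
        combined["num_skills"] = skill_counts.get(record["row"], default)
        joined.append(combined)
    return joined


# Toy records (invented values); "chance" is a hypothetical column name.
merged = [
    {"row": 1, "iq": 0.62, "chance": 0.48},
    {"row": 2, "iq": 0.31, "chance": 0.75},
]
skills = {1: 2, 2: 1}  # row id -> number of skills required
dataset = join_on_row(merged, skills)
```

Rows missing from the skill-count table fall back to the default here; whether
0 is the right fallback depends on how the feature was computed.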

FYI, I'm trying to run one of the incremental classifiers within Weka.
I've started discretizing numeric values for the NaiveBayesUpdateable
classifier (http://weka.sourceforge.net/doc/weka/classifiers/bayes/NaiveBayesUpdateable.html;
also see http://weka.wikispaces.com/Classifying+large+datasets) using
something like this (it needs a lot of memory!):

java -Xms2048m -Xmx4096m -cp weka.jar \
  weka.filters.unsupervised.attribute.Discretize \
  -unset-class-temporarily -F -B 10 -i inputfile -o outputfile
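
For anyone unfamiliar with the flags: -F selects equal-frequency binning
and -B 10 asks for ten bins. Below is a minimal Python sketch of that idea,
not Weka's implementation (which differs in details such as tie handling).
The function names and the sample IQ values are made up for illustration.

```python
from bisect import bisect_right


def equal_frequency_cut_points(values, bins=10):
    """Return up to bins-1 cut points so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for k in range(1, bins):
        idx = k * n // bins
        if 0 < idx < n:
            # Place the cut halfway between the two neighbouring sorted values.
            cut = (ordered[idx - 1] + ordered[idx]) / 2.0
            # Skip duplicates so the cut points stay strictly increasing.
            if not cuts or cut > cuts[-1]:
                cuts.append(cut)
    return cuts


def discretize(value, cuts):
    """Map a numeric value to a bin index in 0..len(cuts)."""
    return bisect_right(cuts, value)


# Invented IQ feature values, binned into 5 groups of 2.
iq = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.80, 0.95]
cuts = equal_frequency_cut_points(iq, bins=5)
binned = [discretize(v, cuts) for v in iq]
```

The cut points should be learned on the training data and reused unchanged on
the test data, so that a given bin index means the same thing in both.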

Similarly, one can then run the NB algorithm incrementally.  I haven't
done this yet, but Thomas, this may be an alternative if MOA doesn't
work out.
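
To make "incrementally" concrete, here is a rough Python sketch of a Naive
Bayes model that is updated one row at a time over already-discretized
features, so the full dataset never has to sit in memory. This is not Weka's
NaiveBayesUpdateable code; the class name, the Laplace smoothing constant
alpha, and the toy rows are assumptions for illustration.

```python
import math
from collections import defaultdict


class IncrementalNaiveBayes:
    """Categorical Naive Bayes that learns from one example at a time."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # Laplace smoothing constant (assumed)
        self.class_counts = defaultdict(int)    # label -> number of rows seen
        self.feature_counts = defaultdict(int)  # (feature index, value, label) -> count
        self.feature_values = defaultdict(set)  # feature index -> distinct values seen

    def update(self, features, label):
        """Fold a single training row into the counts."""
        self.class_counts[label] += 1
        for i, value in enumerate(features):
            self.feature_counts[(i, value, label)] += 1
            self.feature_values[i].add(value)

    def predict_proba(self, features):
        """Return {label: probability} for one row of discretized features."""
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, n_label in self.class_counts.items():
            score = math.log(n_label / total)
            for i, value in enumerate(features):
                n_values = len(self.feature_values[i] | {value})
                count = self.feature_counts.get((i, value, label), 0)
                score += math.log((count + self.alpha) / (n_label + self.alpha * n_values))
            log_scores[label] = score
        # Normalize in log space to avoid underflow with many features.
        top = max(log_scores.values())
        exp_scores = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {label: e / norm for label, e in exp_scores.items()}


# Toy rows: (binned feature values, cfa label); all values are invented.
rows = [((0, 1), 1), ((0, 0), 1), ((4, 3), 0), ((4, 4), 0)]
model = IncrementalNaiveBayes()
for features, cfa in rows:
    model.update(features, cfa)
probs = model.predict_proba((0, 1))
```

Because only counts are stored, training can stream over the rows of a file
and the memory needed depends on the number of distinct feature values rather
than on the number of rows.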

Andy

On Sun, Jun 6, 2010 at 1:07 AM, Thomas Lotze <thomas.lotze at gmail.com> wrote:
> All,
>
> I've been trying to use MOA to generate a classifier...and while I seem to
> be able to do that, I'm having trouble getting it to actually output
> classifications for new examples, so thought I'd share my current status and
> see if anyone can help.
>
> You can download the stream test and train files from
> http://thomaslotze.com/kdd/streams.tgz
> You can also download the jarfiles needed for MOA at
> http://thomaslotze.com/kdd/jarfiles.tgz
>
> Unpack these all into the same directory.  Then, in that directory, using
> the following command, you can create a MOA classifier:
> java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask "LearnModel
> -l MajorityClass -s (ArffFileStream -f atrain.arff -c -1) -O amodel.moa"
>
> You can also summarize the test arff file using the following command:
> java -cp .:moa.jar:weka.jar weka.core.Instances atest.arff
>
> But I cannot find a command for MOA which will input the amodel.moa model
> and generate predicted classes for atest.arff.  The closest I've come is the
> following, which runs amodel.moa on the atest.arff, and must be predicting
> classes and comparing, because it declares how many it got correct:
> java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask
> "EvaluateModel -m file:amodel.moa -s (ArffFileStream -f atest.arff -c -1)"
>
> So if anyone can figure it out (I've been using
> http://www.cs.waikato.ac.nz/~abifet/MOA/Manual.pdf as a guide), I could
> certainly use some help with this step.
>
> Cheers,
> Thomas
>
> P.S. If you'd like to get the SQL loaded yourself, you can download
> joined_tables.sql.gz (which was created using get_output.sh).  I then used
> run_moa.sh to create the .arff files and try to run MOA.
>
> On Sat, Jun 5, 2010 at 2:05 PM, Andreas von Hessling <vonhessling at gmail.com>
> wrote:
>>
>> Mike,
>> We're working on getting the test dataset orthogonalized.  Stay tuned.
>> Andy
>>
>>
>> On Sat, Jun 5, 2010 at 1:55 PM, Mike Schachter <mike at mindmech.com> wrote:
>> > Hey Andy, the input to the classifier I'm trying to produce is
>> > the orthogonalized dataset - i.e. the list of 1000+ columns where
>> > each column has the value of the opportunity for that skill. The
>> > dataset was produced by Erin and is broken into several parts;
>> > for the algebra dataset this looks like:
>> >
>> > algebra-output_partaa
>> > algebra-output_partab
>> > ..
>> > algebra-output_partah
>> >
>> >
>> > You're going to have to orthogonalize the test datasets, which
>> > I don't have a copy of. Erin - are you around? Maybe she can help
>> > you convert the test datasets?
>> >
>> >   mike
>> >
>> >
>> >
>> >
>> > On Sat, Jun 5, 2010 at 11:40 AM, Andreas von Hessling
>> > <vonhessling at gmail.com> wrote:
>> >>
>> >> Sweet, Mike.  Please note that we need the row -> clusterid mapping
>> >> for both training AND testing sets.  Otherwise it will not help the ML
>> >> algorithms.
>> >> If I understand correctly, your input is the orthogonalized skills.
>> >> So far, the girls only provided these orthogonalizations for the
>> >> training files.  I'm computing them for the test sets so you can use
>> >> them.  If I don't understand this assumption correctly, please let me
>> >> know so I can use my CPU's cycles for other tasks.
>> >>
>> >> Ideally you can provide these cluster mappings by about Sunday, which
>> >> is when I want to start running classifiers.  I will need some time to
>> >> actually run the ML algorithms.
>> >>
>> >> I now have IQ and IQ strength feature values for all datasets and am
>> >> hoping time permits me to compute chance and chance strength values
>> >> for rows.
>> >> Computing # of skills required should not be difficult and I will add
>> >> this feature as well.  I plan on sharing my datasets as new versions
>> >> become available.
>> >>
>> >> Andy
>> >>
>> >>
>> >>
>> >>
>> >> On Fri, Jun 4, 2010 at 1:42 PM, Mike Schachter <mike at mindmech.com>
>> >> wrote:
>> >> > It's taking about 9 hours to create a graph from a 4.4GB file, so I'm
>> >> > going to work on improving the code to make it a bit faster.  I'm also
>> >> > investigating a MapReduce solution.
>> >> >
>> >> > Basically the clustering process can be broken down into two stages:
>> >> >
>> >> > 1) Construct the graph, apply the clustering algorithm to break the
>> >> > graph into clusters
>> >> > 2) Apply the clustered graph to the data again to classify each skill
>> >> > set
>> >> >
>> >> > I'll keep working on it and let everyone know how things are going.
>> >> > As I mentioned in another email, the source code is in our new
>> >> > SourceForge project's git repository.
>> >> >
>> >> >  mike
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Thu, Jun 3, 2010 at 7:48 PM, Mike Schachter <mike at mindmech.com>
>> >> > wrote:
>> >> >>
>> >> >> Sounds like you're making great progress! I'll be working on the
>> >> >> graph clustering algorithm for the skill set tonight and will keep
>> >> >> you posted on how things are going.
>> >> >>
>> >> >>   mike
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Thu, Jun 3, 2010 at 6:17 PM, Andreas von Hessling
>> >> >> <vonhessling at gmail.com> wrote:
>> >> >>>
>> >> >>> Doing a few basic tricks, I catapulted the submission into the 50th
>> >> >>> percentile.  That is not even running any ML algorithm.
>> >> >>>
>> >> >>> I'm planning on running the NaiveBayesUpdateable classifier
>> >> >>> (http://weka.wikispaces.com/Classifying+large+datasets) over
>> >> >>> discretized IQ/IQ strength/Chance/Chance strength from the command
>> >> >>> line to evaluate performance.  Another attempt would be to load all
>> >> >>> data into memory (<3GB, even for full Bridge Train) and run SVMlib
>> >> >>> over it.
>> >> >>>
>> >> >>> If someone wants to try MOA
>> >> >>> (http://www.cs.waikato.ac.nz/~abifet/MOA/index.html), this would also
>> >> >>> be helpful in the long run (at least a tutorial on how to set it up
>> >> >>> and run it).
>> >> >>>
>> >> >>> The reduced datasets plus the IQ values are linked on the wiki.
>> >> >>> Features are:
>> >> >>>   ...> row INT,
>> >> >>>   ...> studentid VARCHAR(30),
>> >> >>>   ...> problemhierarchy TEXT,
>> >> >>>   ...> problemname TEXT,
>> >> >>>   ...> problemview INT,
>> >> >>>   ...> problemstepname TEXT,
>> >> >>>   ...> cfa INT,
>> >> >>>   ...> iq REAL
>> >> >>>
>> >> >>> IQ strength (number of attempts per student) should be available
>> >> >>> soon (perhaps additional features will become available as well).
>> >> >>>
>> >> >>> I'm still hoping somebody could cluster Erin's normalized skills data
>> >> >>> and provide a row -> cluster id mapping for the algebra and bridge
>> >> >>> train and test sets (I don't have the data any more).
>> >> >>>
>> >> >>> Andy
>> >> >>> _______________________________________________
>> >> >>> ml mailing list
>> >> >>> ml at lists.noisebridge.net
>> >> >>> https://www.noisebridge.net/mailman/listinfo/ml
>> >> >>
>> >> >
>> >> >
>> >
>> >
>
>