[ml] KDD cup submission status

Andreas von Hessling vonhessling at gmail.com
Wed Jun 9 00:48:02 UTC 2010


All,

The submission is currently ranked at 902 out of ~3200.  There are
about 30-40 teams ahead of us.  I wrote a Java program that uses
Weka's incremental classifier over the entire data sets.
Please note that the deadline is 11:59pm EST, which translates to about 9pm PST.
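
As for the incremental Weka run mentioned above, here is a minimal sketch of
what such a training loop looks like (file names and the class-index choice
are placeholders; this is a sketch, not the actual program):

  import java.io.File;
  import weka.classifiers.bayes.NaiveBayesUpdateable;
  import weka.core.Instance;
  import weka.core.Instances;
  import weka.core.SerializationHelper;
  import weka.core.converters.ArffLoader;

  public class IncrementalNB {
      public static void main(String[] args) throws Exception {
          ArffLoader loader = new ArffLoader();
          loader.setFile(new File("train.arff"));            // placeholder path
          Instances header = loader.getStructure();          // header only; data stays on disk
          header.setClassIndex(header.numAttributes() - 1);  // assumes cfa is the last attribute

          NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
          nb.buildClassifier(header);                        // initialize from the header

          Instance row;
          while ((row = loader.getNextInstance(header)) != null) {
              nb.updateClassifier(row);                      // train one instance at a time
          }
          SerializationHelper.write("nb.model", nb);         // save for scoring later
      }
  }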

Andy



On Tue, Jun 8, 2010 at 10:21 AM, Andreas von Hessling
<vonhessling at gmail.com> wrote:
> Hi Thomas,
>
> how's it going, what's your status?  Are you still working on this?
> Have you attempted to submit your results on your own? What's your
> score/experience?
>
> On my side, I'm finally finishing discretization of all numeric
> features and will be pushing the data through the incremental NB
> classifier.  Initial attempts have resulted only in mediocre
> performance.  The skills may be the key to good scores. This is also
> suggested by the "fact sheet" questionnaire that they have put up
> (pasted below) that asks revealing questions.
>
> Here are the features; those marked with * have been discretized:
>   row INT,
>   studentid VARCHAR(30),
>   problemhierarchy TEXT,
>   problemname TEXT, (this has many thousands of nominal values; we may
> ignore it for the ML algorithms)
> *   problemview INT,
>   problemstepname TEXT,
>   cfa INT,
> *   iq REAL,
> *   iqstrength REAL,
> *   chance REAL,
> *   chancestrength REAL,
> *   numsub REAL, (number of subskills required for this step)
> *   numtraced REAL (number of traced skills)
>
> I can provide this dataset.
>
>
> Depending on the dataset size, I may try to push it through libsvm.
> Also, I'll try MOA; so I may have a few questions on running both of
> them later.
>
>
> Andy
>
> ========
>
> Title of the contribution*
>
> Provide a title for your team's contribution that will appear in the results.
> Supplementary online material
>
> Provide a URL to a web page, technical memorandum, or a paper.
> Background*
>
> Provide a general summary with relevant background information: Where
> does the method come from? Is it novel? Name the prior art.
> Used Weka to extensively preprocess the data, plus bash scripts; attempted
> Weka's incremental classifiers (e.g. Naive Bayes Updateable) to
> provide predictions on the large amounts of data. No new ML
> algorithms.
> Method
>
> Summarize the algorithms you used in a way that those skilled in the
> art should understand what to do. Profile of your methods as follows:
> Data exploration and understanding
>
> Did you use data exploration techniques to
>
> Identify selection biases
> Identify temporal effects (e.g. students getting better over time)
> Understand the variables
> Explore the usefulness of the KC models
> Understand the relationships between the different KC types
>
> Please describe your data understanding efforts, and interesting observations:
> Student IQ = % correct for each student: a very valuable variable that
> lifted us into the 50th percentile of submissions. Many features are
> not available in the test set, so they have been removed. It seems
> analysis of the KC models (which we did not perform) is necessary to get
> among the top scorers.
> Preprocessing
>
> Feature generation
>
> Features designed to capture the step type (e.g. enter given, or ... )
> Features based on the textual step name
> Features designed to capture the KC type
> Features based on the textual KC name
> Features derived from opportunity counts
> Features derived from the problem name
> Features based on student ID
> Other features
>
> Details on feature generation:
> Student IQ = % correct by student.
> Step chance = % correct attempts for the step.
> IQ/chance strength = total counts of attempts.
> Number of skills required in each step.
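>
> A rough sketch of how the Student IQ and IQ-strength values could be
> computed (a hypothetical illustration only; the column positions assume
> the reduced feature layout listed earlier in this mail, and the file name
> is a placeholder):
>
>   import java.io.BufferedReader;
>   import java.io.FileReader;
>   import java.util.HashMap;
>   import java.util.Map;
>
>   public class IqFeature {
>       public static void main(String[] args) throws Exception {
>           // per student: [correct first attempts, total attempts]
>           Map<String, int[]> counts = new HashMap<String, int[]>();
>           BufferedReader in = new BufferedReader(new FileReader("train.tsv"));
>           String line = in.readLine();                // skip header
>           while ((line = in.readLine()) != null) {
>               String[] f = line.split("\t", -1);
>               String student = f[1];                  // studentid (assumed position)
>               int cfa = Integer.parseInt(f[6]);       // cfa, 0 or 1 (assumed position)
>               int[] c = counts.get(student);
>               if (c == null) { c = new int[2]; counts.put(student, c); }
>               c[0] += cfa;
>               c[1] += 1;
>           }
>           in.close();
>           for (Map.Entry<String, int[]> e : counts.entrySet()) {
>               double iq = (double) e.getValue()[0] / e.getValue()[1];  // % correct
>               int strength = e.getValue()[1];                          // total attempts
>               System.out.println(e.getKey() + "\t" + iq + "\t" + strength);
>           }
>       }
>   }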
> Feature selection
>
> Feature ranking with correlation or other criterion (specify below)
> Filter method (other than feature ranking)
> Wrapper with forward or backward selection (nested subset method)
> Wrapper with intensive search (subsets not nested)
> Embedded method
> Other method not listed above (specify below)
>
> Details on feature selection:
> Did you attempt to identify latent factors?
>
> Cluster students
> Cluster knowledge components
> Cluster steps
> Latent feature discovery was performed jointly with learning
>
> Details on latent factor discovery (techniques used, useful
> student/step features, how were the factors used, etc.):
> Other preprocessing
>
> Filling missing values (for KC)
> Principal component analysis
>
> More details on preprocessing:
> Classification
>
> Base classifier
>
> Decision tree, stub, or Random Forest
> Linear classifier (Fisher's discriminant, SVM, linear regression)
> Non-linear kernel method (SVM, kernel ridge regression, kernel
> logistic regression)
> Naïve Bayes
> Bayesian Network (other than Naïve Bayes)
> Neural Network
> Bayesian Neural Network
> Nearest neighbors
> Latent variable models (e.g. matrix factorization)
> Neighborhood/correlation based collaborative filtering
> Bayesian Knowledge Tracing
> Additive Factor Model
> Item Response Theory
> Other classifier not listed above (specify below)
> Loss Function
>
> Hinge loss (like in SVM)
> Square loss (like in ridge regression)
> Logistic loss or cross-entropy (like in logistic regression)
> Exponential loss (like in boosting)
> None
> Don't know
> Other loss (specify below)
> Regularizer
>
> One-norm (sum of weight magnitudes, like in Lasso)
> Two-norm (||w||^2, like in ridge regression and regular SVM)
> Structured regularizer (like in group lasso)
> None
> Don't know
> Other (specify below)
> Ensemble Method
>
> Boosting
> Bagging (check this if you use Random Forest)
> Other ensemble method
> None
> Were you able to use information present only in the training set?
>
> Corrects, incorrects, hints
> Step start/end times
> Did you use post-training calibration to obtain accurate probabilities?
>
> Yes
> No
> Did you make use of the development data sets for training?
>
> Yes
> No
>
> Details on classification:
> Model selection/hyperparameter selection
>
> We used the online feedback of the leaderboard.
> K-fold or leave-one-out cross-validation (using training data)
> Virtual leave-one-out (closed-form estimation of LOO with a single
> classifier training)
> Out-of-bag estimation (for bagging methods)
> Bootstrap estimation (other than out-of-bag)
> Other cross-validation method
> Bayesian model selection
> Penalty-based method (non-Bayesian)
> Bi-level optimization
> Other method not listed above (specify below)
>
> Details on model selection:
> Results
>
> A reader should also know from reading the fact sheet what the
> strength of the method is.
>
> Please comment about the following:
> Quantitative advantages (e.g., compact feature subset, simplicity,
> computational advantages).
>
> Qualitative advantages (e.g. compute posterior probabilities,
> theoretically motivated, has some elements of novelty).
>
> Other methods. List other methods you tried.
>
> How helpful did you find the included KC models?
>
> Crucial in getting good predictions
> Somewhat helpful in getting good predictions
> Neutral
> Not particularly helpful
> Irrelevant
> If you learned latent factors, how helpful were they?
>
> Crucial in getting good predictions
> Somewhat helpful in getting good predictions
> Neutral
> Not particularly helpful
> Irrelevant
>
> Details on the relevance of the KC models and latent factors:
> Software Implementation
>
> Availability
>
> Proprietary in-house software
> Commercially available in-house software
> Freeware or shareware in-house software
> Off-the-shelf third party commercial software
> Off-the-shelf third party freeware or shareware
> Language
>
> C/C++
> Java
> Matlab
> Python/NumPy/SciPy
> Other (specify below)
>
> Details on software implementation:
> Hardware implementation
>
> Platform
>
> Windows
> Linux or other Unix
> Mac OS
> Other (specify below)
> Memory
>
> <= 2 GB
> <= 8 GB
> >= 8 GB
> >= 32 GB
> Parallelism
>
> Multi-processor machine
> Run in parallel different algorithms on different machines
> Other (specify below)
>
> Details on hardware implementation. Specify whether you provide a
> self-contained application or libraries.
> Code URL
>
> Provide a URL for the code (if available):
> Competition Setup
>
> From a performance point of view, the training set was
>
> Too big (could have achieved the same performance with significantly less data)
> Too small (more data would have led to better performance)
> From a computational point of view, the training set was
>
> Too big (imposed serious computational challenges, limited the types
> of methods that can be applied)
> Adequate (the computational load was easy to handle)
> Was the time constraint imposed by the challenge a difficulty, or did
> you feel you had enough time to understand the data, prepare it, and
> train models?
>
> Not enough time
> Enough time
> There was enough time to do something decent, but a lot was left to
> explore. With more time, performance could have been significantly
> improved.
> How likely are you to keep working on this problem?
>
> It is my main research area.
> It was a very interesting problem. I'll keep working on it.
> This data is a good fit for the data mining methods I am
> using/developing. I will use it in the future for empirical
> evaluation.
> Maybe I'll try some ideas, but it is not high priority.
> Not likely to keep working on it.
> Comments on the problem (What aspects of the problem did you find most
> interesting? Did it inspire you to develop new techniques?)
>
> References
>
> List references below.
>
>
>
>
>
> On Mon, Jun 7, 2010 at 3:59 PM, Andreas von Hessling
> <vonhessling at gmail.com> wrote:
>> Oops, the previous dataset I announced was in .csv format and the
>> commas messed up the data.  I've relinked a new zip file in
>> tab-separated format from the wiki for download.  Uploading now.
>> MD5 (4skillsAddedNoDiscretization.zip) = dd6da9163dff5a570a80ec9bc8eaaedd
>>
>>
>>
>> On Mon, Jun 7, 2010 at 1:23 PM, Andreas von Hessling
>> <vonhessling at gmail.com> wrote:
>>> I've added the latest datasets to the wiki (uploading for about
>>> another half hour).  They contain the step success chance and # of
>>> skills values.  Numeric values are not discretized.  (I have started
>>> discretizing them for the Naive Bayes algorithm, though.)
>>>
>>> MD5 (4skillsAddedNoDiscretization.zip) = bb70e584f729b0b0c1edba14eff45b73
>>>
>>> If we can do so in time, we will add the clustered skills feature as
>>> well, but that's it.  Let the algorithms run free!
>>>
>>> BTW, the evaluation website seems to be slowing down under the
>>> increased load just before the deadline.  Something to consider.
>>>
>>> Andy
>>>
>>>
>>>
>>> On Sun, Jun 6, 2010 at 5:41 PM, Thomas Lotze <thomas.lotze at gmail.com> wrote:
>>>> I love open source software!
>>>>
>>>> The final predicted output (using iq and score as predictors, under a Naive
>>>> Bayes model) for algebra and bridge (suitable, I believe, for submission) is
>>>> available in http://thomaslotze.com/kdd/output.tgz
>>>>
>>>> The streams.tgz and jarfiles.tgz have been updated with streams for bridge
>>>> and my newly-compiled "moa_personal.jar" jarfile.
>>>>
>>>> run_moa.sh should have all the steps needed to duplicate this in MOA
>>>> yourself (after creating or importing the SQL tables) -- I've also put up
>>>> MOA instructions on the wiki at
>>>> https://www.noisebridge.net/wiki/Machine_Learning/moa
>>>>
>>>> Summary: since the MOA code was available on sourceforge, I was able to
>>>> create a new ClassificationPerformanceEvaluator (called
>>>> BasicLoggingClassificationPerformanceEvaluator) and re-compile the MOA
>>>> distribution into moa_personal.jar.  This lets us use the evaluator to
>>>> print out the row number and the predicted probability of cfa.  The
>>>> evaluator is currently pretty hard-coded for the KDD dataset, but I
>>>> think I can turn it into a more general task/evaluator for future use
>>>> (and potentially for inclusion back into the MOA trunk).  In any case,
>>>> it should work for now.
>>>>
>>>> Hooray for open source machine learning!
>>>>
>>>> -Thomas
>>>>
>>>> On Sun, Jun 6, 2010 at 4:42 PM, Andreas von Hessling <vonhessling at gmail.com>
>>>> wrote:
>>>>>
>>>>> Thomas,
>>>>>
>>>>> Have you finished joining the chance values into the steps?  If so,
>>>>> where can I download this joined_tables.sql.gz file?
>>>>> (the streams you provide are algebra only -- do you have bridge as
>>>>> well?)  I would like to concatenate your merged results with the
>>>>> number of skills feature I computed; I will then provide this dataset.
>>>>>
>>>>>
>>>>> FYI, I'm trying to run one of the incremental classifiers within Weka:
>>>>> I've started discretizing numeric values for Naive Bayes Updateable
>>>>> classifier
>>>>> (http://weka.sourceforge.net/doc/weka/classifiers/bayes/NaiveBayesUpdateable.html,
>>>>> also see http://weka.wikispaces.com/Classifying+large+datasets) using
>>>>> something like this (it needs a lot of memory!):
>>>>>
>>>>> java -Xms2048m -Xmx4096m -cp weka.jar
>>>>> weka.filters.unsupervised.attribute.Discretize
>>>>> -unset-class-temporarily -F -B 10 -i inputfile -o outputfile
>>>>>
>>>>> Similarly, one can then run the NB algorithm incrementally.  I haven't
>>>>> done this yet, but Thomas, this may be an alternative if MOA doesn't
>>>>> work out.
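>>>>>
>>>>> Once trained that way, the same streaming approach can score the test
>>>>> file and write out the predicted probability of cfa.  A rough sketch
>>>>> (paths, the saved model, and the class-value order are assumptions):
>>>>>
>>>>>   import java.io.File;
>>>>>   import java.io.PrintWriter;
>>>>>   import weka.classifiers.bayes.NaiveBayesUpdateable;
>>>>>   import weka.core.Instance;
>>>>>   import weka.core.Instances;
>>>>>   import weka.core.SerializationHelper;
>>>>>   import weka.core.converters.ArffLoader;
>>>>>
>>>>>   public class ScoreTestSet {
>>>>>       public static void main(String[] args) throws Exception {
>>>>>           NaiveBayesUpdateable nb = (NaiveBayesUpdateable)
>>>>>               SerializationHelper.read("nb.model");    // hypothetical saved model
>>>>>           ArffLoader loader = new ArffLoader();
>>>>>           loader.setFile(new File("test.arff"));       // placeholder path
>>>>>           Instances header = loader.getStructure();
>>>>>           header.setClassIndex(header.numAttributes() - 1);
>>>>>
>>>>>           PrintWriter out = new PrintWriter("predictions.txt");
>>>>>           int row = 1;
>>>>>           Instance inst;
>>>>>           while ((inst = loader.getNextInstance(header)) != null) {
>>>>>               double[] dist = nb.distributionForInstance(inst);
>>>>>               out.println(row++ + "\t" + dist[1]);     // assumes "1" is the second class value
>>>>>           }
>>>>>           out.close();
>>>>>       }
>>>>>   }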
>>>>>
>>>>> Andy
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Jun 6, 2010 at 1:07 AM, Thomas Lotze <thomas.lotze at gmail.com>
>>>>> wrote:
>>>>> > All,
>>>>> >
>>>>> > I've been trying to use MOA to generate a classifier...and while I seem
>>>>> > to
>>>>> > be able to do that, I'm having trouble getting it to actually output
>>>>> > classifications for new examples, so I thought I'd share my current status
>>>>> > and
>>>>> > see if anyone can help.
>>>>> >
>>>>> > You can download the stream test and train files from
>>>>> > http://thomaslotze.com/kdd/streams.tgz
>>>>> > You can also download the jarfiles needed for MOA at
>>>>> > http://thomaslotze.com/kdd/jarfiles.tgz
>>>>> >
>>>>> > Unpack these all into the same directory.  Then, in that directory,
>>>>> > you can create a MOA classifier using the following command:
>>>>> > java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask
>>>>> > "LearnModel
>>>>> > -l MajorityClass -s (ArffFileStream -f atrain.arff -c -1) -O amodel.moa"
>>>>> >
>>>>> > You can also summarize the test arff file using the following command:
>>>>> > java -cp .:moa.jar:weka.jar weka.core.Instances atest.arff
>>>>> >
>>>>> > But I cannot find a command for MOA which will input the amodel.moa
>>>>> > model
>>>>> > and generate predicted classes for atest.arff.  The closest I've come is
>>>>> > the
>>>>> > following, which runs amodel.moa on atest.arff, and must be predicting
>>>>> > classes and comparing them, because it reports how many it got correct:
>>>>> > java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask
>>>>> > "EvaluateModel -m file:amodel.moa -s (ArffFileStream -f atest.arff -c
>>>>> > -1)"
>>>>> >
>>>>> > So if anyone can figure it out (I've been using
>>>>> > http://www.cs.waikato.ac.nz/~abifet/MOA/Manual.pdf as a guide), I could
>>>>> > certainly use some help with this step.
>>>>> >
>>>>> > Cheers,
>>>>> > Thomas
>>>>> >
>>>>> > P.S. If you'd like to get the SQL loaded yourself, you can download
>>>>> > joined_tables.sql.gz (which was created using get_output.sh).  I then
>>>>> > used
>>>>> > run_moa.sh to create the .arff files and try to run MOA.
>>>>> >
>>>>> > On Sat, Jun 5, 2010 at 2:05 PM, Andreas von Hessling
>>>>> > <vonhessling at gmail.com>
>>>>> > wrote:
>>>>> >>
>>>>> >> Mike,
>>>>> >> We're working on getting the test dataset orthogonalized.  Stay tuned.
>>>>> >> Andy
>>>>> >>
>>>>> >>
>>>>> >> On Sat, Jun 5, 2010 at 1:55 PM, Mike Schachter <mike at mindmech.com>
>>>>> >> wrote:
>>>>> >> > Hey Andy, the input to the classifier I'm trying to produce is
>>>>> >> > the orthogonalized dataset - i.e. the list of 1000+ columns where
>>>>> >> > each column has the value of the opportunity for that skill. The
>>>>> >> > dataset was produced by Erin and is broken into several parts,
>>>>> >> > for the algebra dataset this looks like:
>>>>> >> >
>>>>> >> > algebra-output_partaa
>>>>> >> > algebra-output_partab
>>>>> >> > ..
>>>>> >> > algebra-output_partah
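>>>>> >> >
>>>>> >> > As a rough illustration of that layout (the "~~" separator and the
>>>>> >> > field handling below are assumptions about the raw KDD format, not
>>>>> >> > the actual conversion code):
>>>>> >> >
>>>>> >> >   import java.util.HashMap;
>>>>> >> >   import java.util.Map;
>>>>> >> >
>>>>> >> >   public class OrthogonalizeRow {
>>>>> >> >       // Spread one row's "~~"-separated skills and opportunity counts
>>>>> >> >       // into a skill -> opportunity map; all other skill columns stay 0.
>>>>> >> >       public static Map<String, Integer> expand(String kcField, String oppField) {
>>>>> >> >           Map<String, Integer> cols = new HashMap<String, Integer>();
>>>>> >> >           if (kcField == null || kcField.length() == 0) {
>>>>> >> >               return cols;                       // step has no KC
>>>>> >> >           }
>>>>> >> >           String[] kcs = kcField.split("~~");
>>>>> >> >           String[] opps = oppField.split("~~");
>>>>> >> >           for (int i = 0; i < kcs.length; i++) {
>>>>> >> >               cols.put(kcs[i], Integer.parseInt(opps[i]));
>>>>> >> >           }
>>>>> >> >           return cols;
>>>>> >> >       }
>>>>> >> >   }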
>>>>> >> >
>>>>> >> >
>>>>> >> > You're going to have to orthogonalize the test datasets, which
>>>>> >> > I don't have a copy of. Erin - are you around? Maybe she can help
>>>>> >> > you convert the test datasets?
>>>>> >> >
>>>>> >> >   mike
>>>>> >> >
>>>>> >> >
>>>>> >> >
>>>>> >> >
>>>>> >> > On Sat, Jun 5, 2010 at 11:40 AM, Andreas von Hessling
>>>>> >> > <vonhessling at gmail.com> wrote:
>>>>> >> >>
>>>>> >> >> Sweet, Mike.  Please note that we need the row -> clusterid mapping
>>>>> >> >> for both training AND testing sets.  Otherwise it will not help the
>>>>> >> >> ML
>>>>> >> >> algorithms.
>>>>> >> >> If I understand correctly, your input is the orthogonalized skills.
>>>>> >> >> So far, the girls only provided these orthogonalizations for the
>>>>> >> >> training files.  I'm computing them for the test sets so you can use
>>>>> >> >> them.  If this assumption is wrong, please let me know so I can use
>>>>> >> >> my CPU cycles for other tasks.
>>>>> >> >>
>>>>> >> >> Ideally you can provide these cluster mappings by about Sunday,
>>>>> >> >> which
>>>>> >> >> is when I want to start running classifiers.  I will need some time
>>>>> >> >> to
>>>>> >> >> actually run the ML algorithms.
>>>>> >> >>
>>>>> >> >> I now have IQ and IQ strength feature values for all datasets and am
>>>>> >> >> hoping time permits computing chance and chance strength values for
>>>>> >> >> the rows.
>>>>> >> >> Computing # of skills required should not be difficult and I will
>>>>> >> >> add
>>>>> >> >> this feature as well.  I plan on sharing my datasets as new versions
>>>>> >> >> become available.
>>>>> >> >>
>>>>> >> >> Andy
>>>>> >> >>
>>>>> >> >>
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> On Fri, Jun 4, 2010 at 1:42 PM, Mike Schachter <mike at mindmech.com>
>>>>> >> >> wrote:
>>>>> >> >> > So it's taking about 9 hours to create a graph from a 4.4GB file.
>>>>> >> >> > I'm going to work on improving the code to make it a bit faster, and
>>>>> >> >> > I'm also investigating a MapReduce solution.
>>>>> >> >> >
>>>>> >> >> > Basically, the clustering process can be broken down into two stages
>>>>> >> >> > (a hypothetical sketch of stage 1 follows after the list):
>>>>> >> >> >
>>>>> >> >> > 1) Construct the graph and apply the clustering algorithm to break
>>>>> >> >> > the graph into clusters
>>>>> >> >> > 2) Apply the clustered graph to the data again to classify each
>>>>> >> >> > skill set
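>>>>> >> >> >
>>>>> >> >> > A hypothetical sketch of stage 1, assuming the graph links skills
>>>>> >> >> > that co-occur in a step's skill set (a guess at the construction,
>>>>> >> >> > not the code in the repository):
>>>>> >> >> >
>>>>> >> >> >   import java.util.HashMap;
>>>>> >> >> >   import java.util.Map;
>>>>> >> >> >
>>>>> >> >> >   public class SkillGraph {
>>>>> >> >> >       // undirected edge "skillA|skillB" -> co-occurrence count
>>>>> >> >> >       private final Map<String, Integer> edges = new HashMap<String, Integer>();
>>>>> >> >> >
>>>>> >> >> >       public void addSkillSet(String[] skills) {
>>>>> >> >> >           for (int i = 0; i < skills.length; i++) {
>>>>> >> >> >               for (int j = i + 1; j < skills.length; j++) {
>>>>> >> >> >                   String a = skills[i];
>>>>> >> >> >                   String b = skills[j];
>>>>> >> >> >                   String key = a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
>>>>> >> >> >                   Integer n = edges.get(key);
>>>>> >> >> >                   edges.put(key, n == null ? 1 : n + 1);
>>>>> >> >> >               }
>>>>> >> >> >           }
>>>>> >> >> >       }
>>>>> >> >> >
>>>>> >> >> >       public Map<String, Integer> getEdges() { return edges; }
>>>>> >> >> >   }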
>>>>> >> >> >
>>>>> >> >> > I'll keep working on it and let everyone know how things are going
>>>>> >> >> > with it.  As I mentioned in another email, the source code is in our
>>>>> >> >> > new sourceforge project's git repository.
>>>>> >> >> >
>>>>> >> >> >  mike
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> > On Thu, Jun 3, 2010 at 7:48 PM, Mike Schachter <mike at mindmech.com>
>>>>> >> >> > wrote:
>>>>> >> >> >>
>>>>> >> >> >> Sounds like you're making great progress! I'll be working on the
>>>>> >> >> >> graph clustering algorithm for the skill set tonight and will
>>>>> >> >> >> keep
>>>>> >> >> >> you posted on how things are going.
>>>>> >> >> >>
>>>>> >> >> >>   mike
>>>>> >> >> >>
>>>>> >> >> >>
>>>>> >> >> >>
>>>>> >> >> >>
>>>>> >> >> >> On Thu, Jun 3, 2010 at 6:17 PM, Andreas von Hessling
>>>>> >> >> >> <vonhessling at gmail.com> wrote:
>>>>> >> >> >>>
>>>>> >> >> >>> Doing a few basic tricks, I catapulted the submission into the 50th
>>>>> >> >> >>> percentile.  That is without even running any ML algorithm.
>>>>> >> >> >>>
>>>>> >> >> >>> I'm planning on running the NaiveBayesUpdateable classifier
>>>>> >> >> >>> (http://weka.wikispaces.com/Classifying+large+datasets) over
>>>>> >> >> >>> discretized IQ/IQ strength/Chance/Chance strength from the
>>>>> >> >> >>> command
>>>>> >> >> >>> line to evaluate performance.  Another attempt would be to load
>>>>> >> >> >>> all
>>>>> >> >> >>> data into memory (<3GB, even for full Bridge Train) and run
>>>>> >> >> >>> libsvm
>>>>> >> >> >>> over it.
>>>>> >> >> >>>
>>>>> >> >> >>> If someone wants to try MOA
>>>>> >> >> >>> (http://www.cs.waikato.ac.nz/~abifet/MOA/index.html), this would
>>>>> >> >> >>> also be helpful in the long run (at least a tutorial on how to
>>>>> >> >> >>> set it up and run it).
>>>>> >> >> >>>
>>>>> >> >> >>> The reduced datasets plus the IQ values are linked on the wiki:
>>>>> >> >> >>> Features
>>>>> >> >> >>> are:
>>>>> >> >> >>>   ...> row INT,
>>>>> >> >> >>>   ...> studentid VARCHAR(30),
>>>>> >> >> >>>   ...> problemhierarchy TEXT,
>>>>> >> >> >>>   ...> problemname TEXT,
>>>>> >> >> >>>   ...> problemview INT,
>>>>> >> >> >>>   ...> problemstepname TEXT,
>>>>> >> >> >>>   ...> cfa INT,
>>>>> >> >> >>>   ...> iq REAL
>>>>> >> >> >>>
>>>>> >> >> >>> IQ strength (number of attempts per student) should be available
>>>>> >> >> >>> soon.
>>>>> >> >> >>>  (perhaps add'l features will become available as well)
>>>>> >> >> >>>
>>>>> >> >> >>> I'm still hoping somebody could cluster Erin's normalized skills
>>>>> >> >> >>> data
>>>>> >> >> >>> and provide a row -> cluster id mapping for algebra and bridge
>>>>> >> >> >>> train
>>>>> >> >> >>> and test sets (I don't have the data any more).
>>>>> >> >> >>>
>>>>> >> >> >>> Andy
>>>>> >> >> >>> _______________________________________________
>>>>> >> >> >>> ml mailing list
>>>>> >> >> >>> ml at lists.noisebridge.net
>>>>> >> >> >>> https://www.noisebridge.net/mailman/listinfo/ml
>>>>> >> >> >>
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> > _______________________________________________
>>>>> >> >> > ml mailing list
>>>>> >> >> > ml at lists.noisebridge.net
>>>>> >> >> > https://www.noisebridge.net/mailman/listinfo/ml
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >
>>>>> >> >
>>>>> >> _______________________________________________
>>>>> >> ml mailing list
>>>>> >> ml at lists.noisebridge.net
>>>>> >> https://www.noisebridge.net/mailman/listinfo/ml
>>>>> >
>>>>> >
>>>>> > _______________________________________________
>>>>> > ml mailing list
>>>>> > ml at lists.noisebridge.net
>>>>> > https://www.noisebridge.net/mailman/listinfo/ml
>>>>> >
>>>>> >
>>>>
>>>>
>>>> _______________________________________________
>>>> ml mailing list
>>>> ml at lists.noisebridge.net
>>>> https://www.noisebridge.net/mailman/listinfo/ml
>>>>
>>>>
>>>
>>
>


