[ml] KDD cup submission status

Sat Jun 5 21:05:44 UTC 2010

Mike,
We're working on getting the test dataset orthogonalized.  Stay tuned.
Andy

On Sat, Jun 5, 2010 at 1:55 PM, Mike Schachter <mike at mindmech.com> wrote:
> Hey Andy, the input to the classifier I'm trying to produce is
> the orthogonalized dataset - i.e. the list of 1000+ columns where
> each column has the value of the opportunity for that skill. The
> dataset was produced by Erin and is broken into several parts.
> For the algebra dataset this looks like:
>
> algebra-output_partaa
> algebra-output_partab
> ..
> algebra-output_partah
>
>
> You're going to have to orthogonalize the test datasets, which
> I don't have a copy of. Erin - are you around? Maybe she can help
> you convert the test datasets?
>
>   mike
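
For readers following the thread, the orthogonalized layout Mike describes (one column per skill, each holding the opportunity value for that skill) could be produced roughly as follows. This is only a sketch and not Erin's actual script: the function name `orthogonalize`, the `~~` separator and the sample skill names are assumptions made for illustration.

```python
def orthogonalize(rows, columns=None, sep="~~"):
    """Expand (skills, opportunities) string pairs into one column per skill.

    rows:    list of (skills_str, opportunities_str) tuples, for example
             ("add~~carry", "3~~1"). The "~~" separator is an assumption.
    columns: optional list of skill names to reuse (e.g. the training
             columns when converting a test set). Skills not in this list
             are dropped; skills missing from a row get 0.
    Returns (columns, matrix), where matrix[i][j] is the opportunity value
    of skill columns[j] in row i.
    """
    parsed = []
    vocab = set()
    for skills_str, opps_str in rows:
        skills = skills_str.split(sep) if skills_str else []
        opps = [int(o) for o in opps_str.split(sep)] if opps_str else []
        if len(skills) != len(opps):
            raise ValueError("skill/opportunity count mismatch")
        parsed.append(dict(zip(skills, opps)))
        vocab.update(skills)
    if columns is None:
        columns = sorted(vocab)
    matrix = [[row.get(name, 0) for name in columns] for row in parsed]
    return columns, matrix


# Hypothetical training rows: two skills, one row without any skill.
train_rows = [("add~~carry", "3~~1"), ("carry", "2"), ("", "")]
train_columns, train_matrix = orthogonalize(train_rows)

# Convert a test row using the training columns so both sets line up.
test_rows = [("carry~~unseen", "5~~1")]
_, test_matrix = orthogonalize(test_rows, columns=train_columns)
```

Reusing the training columns for the test set keeps both files in the same column order, which is presumably what a classifier trained on the training files would expect.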
>
>
>
>
> On Sat, Jun 5, 2010 at 11:40 AM, Andreas von Hessling
> <vonhessling at gmail.com> wrote:
>>
>> Sweet, Mike.  Please note that we need the row -> clusterid mapping
>> for both training AND testing sets.  Otherwise it will not help the ML
>> algorithms.
>> If I understand correctly, your input is the orthogonalized skills.
>> So far, the girls only provided these orthogonalizations for the
>> training files.  I'm computing them for the test sets so you can use
>> them.  If I don't understand this assumption correctly, please let me
>> know so I can use my CPU's cycles for other tasks.
>>
>> Ideally you can provide these cluster mappings by about Sunday, which
>> is when I want to start running classifiers.  I will need some time to
>> actually run the ML algorithms.
>>
>> I now have IQ and IQ strength feature values for all datasets and am
>> hoping time permits me to compute chance and chance strength values
>> for rows.
>> Computing # of skills required should not be difficult and I will add
>> this feature as well.  I plan on sharing my datasets as new versions
>> become available.
>>
>> Andy
>>
>>
>>
>>
>> On Fri, Jun 4, 2010 at 1:42 PM, Mike Schachter <mike at mindmech.com> wrote:
>> > So it's taking about 9 hours to create a graph from a 4.4GB file. I'm
>> > going to work on improving the code to make it a bit faster, and I'm
>> > also investigating a MapReduce solution.
>> >
>> > Basically the clustering process can be broken down into two stages:
>> >
>> > 1) Construct the graph, apply the clustering algorithm to break graph
>> >    into clusters
>> > 2) Apply the clustered graph to the data again to classify each skill
>> >    set
>> >
>> > I'll keep working on it and let everyone know how things are going.
>> > As I mentioned in another email, the source code is in our new
>> > SourceForge project's git repository.
>> >
>> >  mike
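
The two stages Mike lists can be illustrated with a small sketch. The thread does not say which clustering algorithm is used, so this example swaps in the simplest possible one: skills that appear together in a row are linked in a graph, and each connected component (found with union-find) becomes a cluster. The function name `cluster_skill_sets` and the sample data are hypothetical.

```python
def cluster_skill_sets(skill_sets):
    """Return a row -> cluster id list for a list of per-row skill sets.

    Stage 1 builds a co-occurrence graph implicitly via union-find and
    takes connected components as clusters (a stand-in for the real
    clustering algorithm). Stage 2 walks the rows again and looks up the
    cluster of each row's skill set. Rows without skills map to None.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Stage 1: link all skills that occur in the same row.
    for skills in skill_sets:
        skills = list(skills)
        for skill in skills:
            find(skill)  # register the skill as a node
        for other in skills[1:]:
            union(skills[0], other)

    # Stage 2: assign every row the id of its component.
    # Ids are numbered in order of first appearance.
    cluster_ids = {}
    mapping = []
    for skills in skill_sets:
        skills = list(skills)
        if not skills:
            mapping.append(None)
            continue
        root = find(skills[0])
        mapping.append(cluster_ids.setdefault(root, len(cluster_ids)))
    return mapping


rows = [["a", "b"], ["b", "c"], ["d"], [], ["c"]]
row_to_cluster = cluster_skill_sets(rows)
```

Because all skills in one row end up in the same component, every row maps to exactly one cluster id. Running the function over training and test rows together would give ids that are consistent across both sets, which is what the row -> clusterid request elsewhere in the thread seems to need.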
>> >
>> >
>> >
>> >
>> > On Thu, Jun 3, 2010 at 7:48 PM, Mike Schachter <mike at mindmech.com> wrote:
>> >>
>> >> Sounds like you're making great progress! I'll be working on the
>> >> graph clustering algorithm for the skill set tonight and will keep
>> >> you posted on how things are going.
>> >>
>> >>   mike
>> >>
>> >>
>> >>
>> >>
>> >> On Thu, Jun 3, 2010 at 6:17 PM, Andreas von Hessling
>> >> <vonhessling at gmail.com> wrote:
>> >>>
>> >>> Doing a few basic tricks, I catapulted the submission into the 50th
>> >>> percentile.  That is without even running any ML algorithm.
>> >>>
>> >>> I'm planning on running the NaiveBayesUpdateable classifier
>> >>> (http://weka.wikispaces.com/Classifying+large+datasets) over
>> >>> discretized IQ/IQ strength/Chance/Chance strength from the command
>> >>> line to evaluate performance.  Another attempt would be to load all
>> >>> data into memory (<3GB, even for full Bridge Train) and run SVMlib
>> >>> over it.
>> >>>
>> >>> If someone wants to try MOA
>> >>> (http://www.cs.waikato.ac.nz/~abifet/MOA/index.html), this would
>> >>> also be helpful in the long run (at least a tutorial on how to set it
>> >>> up and run it).
>> >>>
>> >>> The reduced datasets plus the IQ values are linked on the wiki.
>> >>> Features are:
>> >>>   row INT,
>> >>>   studentid VARCHAR(30),
>> >>>   problemhierarchy TEXT,
>> >>>   problemname TEXT,
>> >>>   problemview INT,
>> >>>   problemstepname TEXT,
>> >>>   cfa INT,
>> >>>   iq REAL
>> >>>
>> >>> IQ strength (number of attempts per student) should be available soon.
>> >>> (Perhaps additional features will become available as well.)
>> >>>
>> >>> I'm still hoping somebody could cluster Erin's normalized skills data
>> >>> and provide a row -> cluster id mapping for algebra and bridge train
>> >>> and test sets (I don't have the data any more).
>> >>>
>> >>> Andy
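
The feature list in Andy's message reads like a SQLite table definition, so loading the reduced dataset and deriving IQ strength might look roughly like this. It is a sketch under assumptions: the table name `steps` and all sample values are made up, and "number of attempts per student" is read here as the number of rows per `studentid`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Columns follow the feature list from the email; the table name "steps"
# is hypothetical. "row" is quoted because ROW is also an SQL keyword.
conn.execute("""
    CREATE TABLE steps (
        "row" INT,
        studentid VARCHAR(30),
        problemhierarchy TEXT,
        problemname TEXT,
        problemview INT,
        problemstepname TEXT,
        cfa INT,
        iq REAL
    )
""")

# Made-up sample rows, only to exercise the query below.
sample = [
    (1, "s1", "Unit A", "P1", 1, "step1", 1, 0.5),
    (2, "s1", "Unit A", "P1", 1, "step2", 0, 0.5),
    (3, "s2", "Unit B", "P7", 2, "step1", 1, 1.0),
]
conn.executemany("INSERT INTO steps VALUES (?, ?, ?, ?, ?, ?, ?, ?)", sample)

# IQ strength, assumed here to be the number of rows (attempts) per student.
iq_strength = dict(
    conn.execute("SELECT studentid, COUNT(*) FROM steps GROUP BY studentid")
)
```

With the sample rows above, `iq_strength` maps `s1` to 2 and `s2` to 1. The same GROUP BY could be joined back onto the table to attach the value to every row.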
>> >>> _______________________________________________
>> >>> ml mailing list
>> >>> ml at lists.noisebridge.net
>> >>> https://www.noisebridge.net/mailman/listinfo/ml
>> >>
>> >
>> >
>
>