[ml] Commands for running weka for the KDD set.

Thu Jun 24 21:51:28 UTC 2010

Here are the Weka commands I used for discretization, obfuscation (to
reduce the size of the files) and classification of the KDD set.
Note: I'm running on a latest-generation MacBook that I've upgraded to
8GB of RAM, which was needed (-Xms4096m -Xmx8192m) during processing,
even for the obfuscated files.

Discretize: the class needs to be unset temporarily so that the class
attribute is treated the same as all other attributes. Not all filters
support this, and those that don't are painful to apply. This is a
small detail in Weka that makes it much less usable in many cases.

java -Xms4096m -Xmx8192m -cp weka.jar \
    weka.filters.unsupervised.attribute.Discretize \
    -unset-class-temporarily -F -B 10 \
    -i aUnified.csv -o aUnifiedDiscretized.csv
(To see the command-line options, you can invoke:
java -cp weka.jar weka.filters.unsupervised.attribute.Discretize -h)
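
To make the -F and -B 10 options concrete: -B sets the number of bins and -F switches from equal-width to equal-frequency binning, so each bin receives roughly the same number of values. The following is only a rough sketch of that idea in plain Java, not Weka's actual implementation (which treats ties and boundaries in its own way). The class and method names are made up for illustration, and the sketch assumes there are at least as many values as bins.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of Weka.
class EqualFrequencyBins {

    // Returns bins-1 cut points so that each bin holds roughly the same
    // number of values. Assumes values.length >= bins >= 1.
    static double[] cutPoints(double[] values, int bins) {
        if (bins < 1 || values.length < bins) {
            throw new IllegalArgumentException("need at least as many values as bins");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double[] cuts = new double[bins - 1];
        for (int i = 1; i < bins; i++) {
            // index of the first sorted value that belongs to bin i
            int idx = (int) ((long) i * sorted.length / bins);
            // place the cut halfway between the two neighbouring values
            cuts[i - 1] = (sorted[idx - 1] + sorted[idx]) / 2.0;
        }
        return cuts;
    }

    // Maps a value to its bin index, from 0 to cuts.length.
    static int binOf(double value, double[] cuts) {
        int bin = 0;
        while (bin < cuts.length && value > cuts[bin]) {
            bin++;
        }
        return bin;
    }

    public static void main(String[] args) {
        double[] values = {1, 2, 3, 4, 100, 200, 300, 400};
        double[] cuts = cutPoints(values, 4);
        System.out.println("cut points: " + Arrays.toString(cuts));
        for (double v : values) {
            System.out.println(v + " -> bin " + binOf(v, cuts));
        }
    }
}
```

With skewed data like the sample above, equal-width bins would put almost everything into the first bin, while equal-frequency bins keep two values in each.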

Obfuscating:

java -Xms4096m -Xmx8192m -cp weka.jar \
    weka.filters.unsupervised.attribute.Obfuscate \
    -i atest.arff -o atestObf.arff
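
As a rough picture of why obfuscation shrinks the files: the filter renames attributes and nominal values, and short replacement names take less space than long descriptive labels. The sketch below shows the general idea in plain Java. It is not Weka's code; the class name is made up, and the "V1", "V2", ... naming scheme is an assumption chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class; not part of Weka.
class ShortLabels {
    // remembers which short code was assigned to which original label
    private final Map<String, String> codes = new LinkedHashMap<>();

    // Returns the short code for a label, assigning the next free code
    // the first time the label is seen.
    String encode(String label) {
        String code = codes.get(label);
        if (code == null) {
            code = "V" + (codes.size() + 1);
            codes.put(label, code);
        }
        return code;
    }

    public static void main(String[] args) {
        ShortLabels labels = new ShortLabels();
        String[] column = {
            "very_long_category_name", "another_long_name", "very_long_category_name"
        };
        for (String value : column) {
            System.out.println(value + " -> " + labels.encode(value));
        }
    }
}
```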

Classification: this is the command I tried for producing predictions,
but I wasn't able to get the labels for the test data with it:

java -Xms4096m -Xmx8192m -cp weka.jar \
    weka.classifiers.bayes.NaiveBayesUpdateable \
    -t atrainObf.arff -T atestObf.arff -p last > aOut1.txt

So instead I wrote a small Java class to do this. It uses an updateable
classifier, which reads the file one instance at a time, so the whole
data set never has to fit into memory at once.

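    import java.io.File;
    import java.io.FileWriter;
    import java.util.logging.Logger;

    import weka.classifiers.bayes.NaiveBayesUpdateable;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ArffLoader;

    // The class name and the java.util.logging logger are placeholders;
    // only the body of the method comes from my original code.
    public class KddNaiveBayes {
        private static final Logger log =
                Logger.getLogger(KddNaiveBayes.class.getName());

        public static void main(String[] args) throws Exception {
            log.info("Loading data...");
            NaiveBayesUpdateable nb;
            {
                ArffLoader loader = new ArffLoader();
                loader.setFile(new File("atrain.arff"));
                Instances structure = loader.getStructure();
                // the class is the last attribute
                structure.setClassIndex(structure.numAttributes() - 1);

                // train NaiveBayes incrementally, one instance at a time
                nb = new NaiveBayesUpdateable();
                nb.buildClassifier(structure);
                Instance current;
                while ((current = loader.getNextInstance(structure)) != null) {
                    nb.updateClassifier(current);
                }
            }

            log.info("Now classifying...");
            {
                // append mode: predictions are added to any existing file
                FileWriter fw = new FileWriter("aPredictions.txt", true);
                try {
                    ArffLoader loader = new ArffLoader();
                    loader.setFile(new File("atest.arff"));
                    Instances structure = loader.getStructure();
                    structure.setClassIndex(structure.numAttributes() - 1);

                    // classify using NaiveBayes
                    Instance current;
                    while ((current = loader.getNextInstance(structure)) != null) {
                        double clsLabel = nb.classifyInstance(current);
                        double[] distribution = nb.distributionForInstance(current);
                        // Here I tried to cap the probability predictions at the
                        // mean +- one standard deviation of the iq (iq mean: 0.86,
                        // standard dev 0.06); could instead also just predict the
                        // distribution[1] value.
                        double estimate =
                                Math.max(Math.min(distribution[1], 0.92d), 0.80d);
                        log.info("ClassLabel: " + clsLabel + ", estimate: " + estimate);
                        fw.write("" + estimate + "\r\n");
                    }
                } finally {
                    fw.close();
                }
            }
        }
    }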
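
The capping step in the loop above amounts to clamping each predicted probability to a fixed interval. Here is a minimal standalone sketch of just that step, using the bounds 0.80 and 0.92 from the code (that is, 0.86 plus or minus 0.06). The class and method names are illustrative only, and the sample inputs are invented.

```java
// Hypothetical illustration class; not part of the original code.
class EstimateClamp {

    // Restricts value to the closed interval [low, high].
    static double clamp(double value, double low, double high) {
        return Math.max(low, Math.min(high, value));
    }

    public static void main(String[] args) {
        double low = 0.80;   // lower bound used in the code above
        double high = 0.92;  // upper bound used in the code above
        double[] raw = {0.10, 0.85, 0.99};  // invented sample probabilities
        for (double p : raw) {
            System.out.println(p + " -> " + clamp(p, low, high));
        }
    }
}
```

Values inside the interval pass through unchanged, and anything outside it is pulled to the nearest bound, so extreme predictions from NaiveBayes cannot dominate the error.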