[ml] Commands for running weka for the KDD set.
Andreas von Hessling
vonhessling at gmail.com
Thu Jun 24 21:51:28 UTC 2010
Here are the Weka commands I used for discretization, obfuscation (to
reduce the size of the files) and classification of the KDD set.
Note: I'm running on a latest-generation MacBook fitted with 8GB of
RAM, which was needed (-Xms4096m -Xmx8192m) during processing, even
for the obfuscated files.
Discretize: the class needs to be unset temporarily so that the class
attribute is treated the same as all other attributes. Not all filters
support this, and the ones that don't are painful to apply. This is a
small detail in Weka that makes it much less usable in many cases.
  java -Xms4096m -Xmx8192m -cp weka.jar \
    weka.filters.unsupervised.attribute.Discretize \
    -unset-class-temporarily -F -B 10 \
    -i aUnified.csv -o aUnifiedDiscretized.csv
(To see the command-line options, you can invoke
  java -cp weka.jar weka.filters.unsupervised.attribute.Discretize -h
)
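
For readers unfamiliar with the -F and -B 10 options: -F asks for
equal-frequency binning and -B sets the number of bins. The following
is only a minimal sketch of that idea in plain Java, not Weka's actual
implementation; the class and method names are made up for
illustration, and Weka may place its cut points differently.

```java
import java.util.Arrays;

public class EqualFrequencyBinning {

    /** Returns bins-1 cut points so that each bin holds roughly the same number of values. */
    static double[] cutPoints(double[] values, int bins) {
        if (bins < 2 || values.length < bins) {
            throw new IllegalArgumentException("need at least 2 bins and one value per bin");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double[] cuts = new double[bins - 1];
        for (int i = 1; i < bins; i++) {
            // index of the first sorted value that belongs to bin i
            int idx = (int) ((long) i * sorted.length / bins);
            // place the cut halfway between the two neighbouring values
            cuts[i - 1] = (sorted[idx - 1] + sorted[idx]) / 2.0;
        }
        return cuts;
    }

    /** Maps a value to its 0-based bin index, given ascending cut points. */
    static int binOf(double value, double[] cuts) {
        int bin = 0;
        while (bin < cuts.length && value > cuts[bin]) {
            bin++;
        }
        return bin;
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 100};
        double[] cuts = cutPoints(data, 5);
        System.out.println("cut points: " + Arrays.toString(cuts));
        for (double v : data) {
            System.out.println(v + " -> bin " + binOf(v, cuts));
        }
    }
}
```

Note that the outlier 100 simply lands in the last bin instead of
stretching the bin widths, which is the usual reason to prefer
equal-frequency over equal-width binning. Heavily repeated values can
produce duplicate cut points in this sketch.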
Obfuscating:
  java -Xms4096m -Xmx8192m -cp weka.jar \
    weka.filters.unsupervised.attribute.Obfuscate \
    -i atest.arff -o atestObf.arff
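
The reason obfuscation shrinks the files is that long attribute names
and nominal values get replaced by short generated ones. Below is a
rough sketch of that renaming idea in plain Java. It is not Weka's
code; the class name, the "V" prefix and the sample labels are
assumptions for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LabelObfuscator {
    // remembers which short code was assigned to which original label
    private final Map<String, String> codes = new LinkedHashMap<>();
    private final String prefix;

    LabelObfuscator(String prefix) {
        this.prefix = prefix;
    }

    /** Returns the short code for a label, assigning a new one on first sight. */
    String encode(String label) {
        String code = codes.get(label);
        if (code == null) {
            code = prefix + (codes.size() + 1);
            codes.put(label, code);
        }
        return code;
    }

    public static void main(String[] args) {
        LabelObfuscator values = new LabelObfuscator("V");
        // hypothetical long nominal values, not taken from the KDD files
        String[] labels = {
            "some-very-long-nominal-value-one",
            "some-very-long-nominal-value-two",
            "some-very-long-nominal-value-one"
        };
        for (String label : labels) {
            System.out.println(label + " -> " + values.encode(label));
        }
    }
}
```

The same label always maps to the same code, so the structure of the
data is preserved while each occurrence takes only a few characters.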
Classification: this is the command I tried for producing predictions,
but I wasn't able to get the labels for the test data with it:
  java -Xms4096m -Xmx8192m -cp weka.jar \
    weka.classifiers.bayes.NaiveBayesUpdateable \
    -t atrainObf.arff -T atestObf.arff -p last > aOut1.txt
So instead I wrote a small Java class to do this. It uses an
updateable classifier, which reads the file one instance at a time, so
the whole data set never has to fit into memory at once.
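    import java.io.File;
    import java.io.FileWriter;
    import java.util.logging.Logger;
    import weka.classifiers.bayes.NaiveBayesUpdateable;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ArffLoader;

    // The class name is arbitrary. How "log" was declared is not shown
    // above; java.util.logging is assumed here.
    public class KddNaiveBayes {
        private static final Logger log = Logger.getLogger("KddNaiveBayes");

        public static void main(String[] args) throws Exception {
            log.info("Loading data...");
            NaiveBayesUpdateable nb;
            {
                ArffLoader loader = new ArffLoader();
                loader.setFile(new File("atrain.arff"));
                // only the header is read here, not the instances
                Instances structure = loader.getStructure();
                structure.setClassIndex(structure.numAttributes() - 1);
                // train NaiveBayes incrementally, one instance at a time
                nb = new NaiveBayesUpdateable();
                nb.buildClassifier(structure);
                Instance current;
                while ((current = loader.getNextInstance(structure)) != null) {
                    nb.updateClassifier(current);
                }
            }
            log.info("Now classifying...");
            // the writer appends, so delete aPredictions.txt before a re-run
            try (FileWriter fw = new FileWriter("aPredictions.txt", true)) {
                ArffLoader loader = new ArffLoader();
                loader.setFile(new File("atest.arff"));
                Instances structure = loader.getStructure();
                structure.setClassIndex(structure.numAttributes() - 1);
                // classify using NaiveBayes
                Instance current;
                while ((current = loader.getNextInstance(structure)) != null) {
                    double clsLabel = nb.classifyInstance(current);
                    double[] distribution = nb.distributionForInstance(current);
                    // here I tried to cap the probability predictions at
                    // mean +- one standard deviation of the iq (mean 0.86,
                    // standard dev 0.06); could instead also just predict
                    // the distribution[1] value
                    double estimate =
                        Math.max(Math.min(distribution[1], 0.92d), 0.80d);
                    log.info("ClassLabel: " + clsLabel
                        + ", estimate: " + estimate);
                    fw.write("" + estimate + "\r\n");
                }
            }
        }
    }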
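
To make the capping step concrete, here is a small standalone sketch
of the same max/min arithmetic. The bounds 0.80 and 0.92 are the ones
used above (the stated mean of 0.86, give or take 0.06); the class
name, the helper name and the sample probabilities are only
illustrative.

```java
public class EstimateClamp {

    /** Limits p to the closed interval [lo, hi]. */
    static double clamp(double p, double lo, double hi) {
        return Math.max(lo, Math.min(hi, p));
    }

    public static void main(String[] args) {
        // hypothetical raw probabilities for the positive class
        double[] raw = {0.10, 0.85, 0.999};
        for (double p : raw) {
            System.out.println(p + " -> " + clamp(p, 0.80, 0.92));
        }
    }
}
```

Values already inside the interval pass through unchanged, while
over-confident predictions in either direction are pulled back to the
nearest bound.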