[Noisebridge-discuss] ML Meetup, tonight, 8PM!

Michael C. Toren mct at toren.net
Thu Apr 16 01:22:00 UTC 2009


On Wed, Apr 08, 2009 at 04:24:00PM -0700, Josh Myer wrote:
> Tonight, we'll be using a neural network library to do classification,
> and hopefully working up to some light-duty OCR.

I've been meaning to send this followup to last week's ML meeting for a
while now, but I kept getting distracted; sorry.

I put the code I wrote last week on github.  There's a perl script,
mnist-converter.pl, which can convert the MNIST handwriting data to the
file format that libfann expects:

    http://github.com/mct/noisebridge/blob/d6ba0c6a90156951f8db24ef7ce2dadb8f69e8d9/machine-learning/mnist/mnist-converter.pl

(Does anyone know offhand if there's a way to tell github to display
the version of a given file from "head", rather than always specifying
a particular commit in the URL?)

Here's the output of that script on the MNIST 10k sample set.  I'd be
very curious whether this matches up with the data other people ended up with:

    http://github.com/mct/noisebridge/raw/d6ba0c6a90156951f8db24ef7ce2dadb8f69e8d9/machine-learning/fann/t10k-fann-input.dat

That script can also convert the MNIST images to PPM files, a very
simple graphics file format which can easily be converted to GIF images
using ppmtogif(1) or convert(1).  I converted each image in the 10k set to
a GIF, grouped them by label, then used montage(1) to create a collage of
images for each digit.  The resulting set of images is *really* fun to
scroll through:

    http://mct.github.com/noisebridge/machine-learning/mnist/

As far as actually using the 10k sample set to train a neural network
goes, that hasn't gone so well yet.  At 83c last Wednesday, I wrote
learn.c, heavily based on examples/xor_train.c from the fann-2.1.0
distribution, modified based on some pointers from Josh to set the number
of inputs, outputs, hidden nodes, etc. correctly:

    http://github.com/mct/noisebridge/blob/d6ba0c6a90156951f8db24ef7ce2dadb8f69e8d9/machine-learning/fann/learn.c
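For anyone who hasn't read xor_train.c, the overall shape of learn.c is
roughly the sketch below.  It needs libfann to build, the hidden-layer
size is just a placeholder (not necessarily the value we used), and the
fann calls are the stock 2.1.0 API:

```c
#include "fann.h"

int main(void)
{
    /* 784 inputs (28x28 pixels), 10 outputs (one per digit); the
     * hidden-layer size here is a placeholder. */
    const unsigned int num_input = 784, num_hidden = 100, num_output = 10;

    struct fann *ann = fann_create_standard(3, num_input, num_hidden,
                                            num_output);

    /* args: training file, max_epochs, epochs between reports,
     * desired error.  Note everything happens inside this one call... */
    fann_train_on_file(ann, "t10k-fann-input.dat", 500000, 1, 0.001f);

    fann_save(ann, "t10k.net");  /* ...and this only runs afterward */
    fann_destroy(ann);
    return 0;
}
```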

For the initial run, we set the maximum number of iterations (max_epochs)
to 500,000, but it looks like that value is just way too high.  I started
running this program last Wednesday evening around 2am on a machine with a
2GHz Xeon processor, and it's still going as of this writing.  You can view
its timestamped stdout at:

    http://github.com/mct/noisebridge/raw/d6ba0c6a90156951f8db24ef7ce2dadb8f69e8d9/machine-learning/fann/t10k-learn-stdout-20090409.gz

The error rate fell pretty quickly to between 0.009 and 0.008, but it looks
like it's just been oscillating between those points for most of the time
since.  More interestingly, if you look at the bottom of the file, the error
rate has started to rise a bit.  Hmm.

I was really hoping I would be able to let this program run until it
completed all 500,000 iterations, but unfortunately the data center where
the machine is located will be undergoing a scheduled power maintenance
tomorrow night.  Judging by the rate it's been running at so far, it
wouldn't finish until sometime this weekend, well after the scheduled
maintenance.  Worse, the way the code is structured, it won't be able
to save the network it's been training to disk until all 500,000
iterations are completed.  Oh, well.  If only I had compiled it
with debugging symbols, so that I could attach with gdb and alter the
value of the iterator... :-)

I'll need to examine the output some more, pick an iteration count that
looks like a good compromise on error rate, then run it again with
max_epochs set appropriately.

-mct


