[ml] Clustering woes

Wed Jun 2 15:30:26 UTC 2010

So, after failing to get Hadoop/Mahout working (does anyone remember how to
run an example locally? Amazon's logging was too opaque to debug with), I
decided to try SciPy's implementation instead. I ran into errors and ended up
hacking on their code. For example, I added the following at line 493 of
vq.py so that it actually sets an initial value before comparing:

    best_book = take(obs, randint(0, No, k), 0)

I also wrote my own function to find the correct cluster for each point.
After all that, it turns out it still tries to load everything into memory,
and there's no way the data fits. I'll come back to this and look at a
version which doesn't load everything into memory at once (and can maybe be
distributed via Hadoop), but I'm sorry to say that for this week, I am a
failure. Source code is attached for anyone who wants to see it (I do a
little bit of set/skill parsing before attempting clustering, which might be
useful).
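
For what it's worth, here is a minimal sketch of what a version that doesn't
load everything at once might look like. It assumes the clustering in
question is k-means (the best_book variable suggests SciPy's kmeans) and that
the data can be re-read chunk by chunk on every pass. This is plain Lloyd's
iteration written with NumPy, not SciPy's code and not what cluster.py does.
The names streaming_kmeans, assign_chunk and chunks_factory are made up for
illustration.

```python
import numpy as np


def assign_chunk(chunk, centroids):
    """Return the index of the nearest centroid for each row of chunk."""
    # Squared Euclidean distances via broadcasting, shape (n_points, k).
    d2 = ((chunk[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def streaming_kmeans(chunks_factory, k, n_iter=10, seed=0):
    """Lloyd's k-means that only ever holds one chunk of data in memory.

    chunks_factory is a hypothetical callable that returns a fresh iterator
    over 2-D float arrays, for example a generator that reads a file in
    pieces. It is called once per pass over the data.
    """
    rng = np.random.default_rng(seed)

    # Seed the centroids with k distinct rows of the first chunk.
    # Assumption: the first chunk has at least k rows.
    first = next(iter(chunks_factory()))
    picks = rng.choice(len(first), size=k, replace=False)
    centroids = first[picks].astype(float)

    for _ in range(n_iter):
        # Running per-cluster sums and counts replace the full data matrix.
        sums = np.zeros_like(centroids)
        counts = np.zeros(k, dtype=np.int64)
        for chunk in chunks_factory():
            labels = assign_chunk(chunk, centroids)
            np.add.at(sums, labels, chunk)  # unbuffered add, handles repeats
            counts += np.bincount(labels, minlength=k)
        # Move only the centroids that received points; leave empty ones.
        nonempty = counts > 0
        centroids[nonempty] = sums[nonempty] / counts[nonempty, None]
    return centroids
```

Memory use is roughly one chunk plus a chunk-by-k distance array, so the
chunk size is the knob to turn. The per-chunk sums and counts are also the
sort of thing a map/reduce job could compute, which is why this shape might
carry over to Hadoop later, though I haven't tried that.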

Glumly,
Thomas
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster.py
Type: application/octet-stream
Size: 2771 bytes
Desc: not available
URL: <http://www.noisebridge.net/pipermail/ml/attachments/20100602/2f982372/attachment.obj>