[ml] Kaggle HIV update

Wed Jun 23 16:01:24 UTC 2010

It looks like codons containing ambiguous nucleotide codes (such as R and Y)
cannot be unambiguously mapped to amino acids. I wonder if it would be
sensible in this case to invent a new symbol expressing all the
possibilities, e.g., just glom together the names of the possible amino
acids in sorted order. Can we count on the data using the same ambiguity
codes at the same sites across sequences? Probably not, right? Do the
sequence matchers already take care of this?
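
Here is a rough sketch of the "glommed symbol" idea, assuming the standard
genetic code and the IUPAC nucleotide ambiguity codes. The names below
(possible_aminos, glommed_symbol, translate) are made up for illustration
and are not taken from dna.py; gaps and partial trailing codons are not
handled.

```python
from itertools import product

# IUPAC nucleotide codes mapped to the concrete bases they can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code, written in the usual compact TCAG ordering.
BASES = "TCAG"
AMINO_1 = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_1)
}

# One-letter to three-letter amino acid names ("*" marks a stop codon).
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "Stop",
}


def possible_aminos(codon):
    """Return the sorted amino acid names a possibly ambiguous codon can encode."""
    options = [IUPAC[base] for base in codon.upper()]
    names = {THREE_LETTER[CODON_TABLE["".join(c)]] for c in product(*options)}
    return sorted(names)


def glommed_symbol(codon):
    """Join all possibilities into one symbol, e.g. 'Arg/Lys'."""
    return "/".join(possible_aminos(codon))


def translate(seq):
    """Translate a sequence codon by codon, ignoring a trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [glommed_symbol(seq[i:i + 3]) for i in range(0, usable, 3)]
```

With this, glommed_symbol("ARA") gives "Arg/Lys", and a codon whose
ambiguity does not change the amino acid (say GCY) still maps to a single
name. Whether a downstream matcher should treat "Arg/Lys" as a distinct
symbol or as a partial match to both Arg and Lys is a separate question.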

>>> import dna
>>> # First PR.Seq from the training data:
>>> t = 'CCTCAAATCACTCTTTGGCAACGACCCCTCGTCCCAATAAGGATAGGGGGGCAACTAAAGGAAGCYCTATTAGATACAGGAGCAGATGATACAGTATTAGAAGACATGGAGTTGCCAGGAAGATGGAAACCAAAAATGATAGGGGGAATTGGAGGTTTTATCAAAGTAARACAGTATGATCAGRTACCCATAGAAATCTATGGACATAAAGCTGTAGGTACAGTATTAATAGGACCTACACCTGTCAACATAATTGGAAGAAATCTGTTGACTCAGCTTGGTTGCACTTTAAATTTY'
>>> dna.DisambiguateAmino(t)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "dna.py", line 40, in DisambiguateAmino
    raise Exception('Wrong number: <<%s>> for %s' % (', '.join(possibilities), triple))
Exception: Wrong number: <<Arg, Lys>> for ARA

I hope someone can find a use for the dictionaries in the attached code
anyway.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dna.py
Type: application/octet-stream
Size: 1995 bytes
Desc: not available
URL: <http://www.noisebridge.net/pipermail/ml/attachments/20100623/1741f711/attachment.obj>