[ml] [kaggle-hiv] amino acid sequences available

Mike Schachter mike at mindmech.com
Wed Jun 30 18:09:26 UTC 2010

(For newcomers: we're working on a data mining competition to help predict
HIV treatment outcomes:

Thanks to Dave's script I made some progress converting
gene sequences to amino acid sequences. Code and data
are available in the git repo:

git clone git://

The sequences can be found in ml-noisebridge/kaggle/data/{pr,rt}seq_aas.csv.
The files are comma-separated lists of amino acid letters. Not every codon
(3 DNA bases) converts to a unique amino acid; this is represented by amino
acid letters separated by a |
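
To make the | notation concrete, here is a minimal sketch of how a codon
could map to more than one amino acid. It is not Dave's script. It assumes
the ambiguity comes from IUPAC ambiguity codes in the DNA (e.g. R = A or G)
and uses the standard genetic code; the function names translate_codon and
translate are made up for illustration.

```python
from itertools import product

# Standard genetic code, codons ordered by base T, C, A, G (* = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# IUPAC nucleotide codes: each letter stands for one or more plain bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_codon(codon):
    """Return the amino acid(s) for one codon, joined by | if ambiguous."""
    choices = [IUPAC[base] for base in codon.upper()]
    # Expand every possible plain codon, then collect the distinct amino acids.
    amino_acids = sorted({CODON_TABLE["".join(c)] for c in product(*choices)})
    return "|".join(amino_acids)


def translate(dna):
    """Translate a DNA string into a comma-separated amino acid list."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return ",".join(translate_codon(c) for c in codons)
```

For example, translate("CCTMAAATC") gives "P,K|Q,I": the middle codon MAA
can be AAA (K) or CAA (Q), so both letters are kept.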

I'll go over this more tonight, but there are two proteins we
have sequences for: HIV Protease and HIV Reverse
Transcriptase. A great video that describes how these work
is available here:


HIV Protease helps cut up HIV proteins into their right shapes
once the cell starts producing them:

PR Wiki: http://en.wikipedia.org/wiki/HIV-1_protease
PR Sequence Info: http://www.bioafrica.net/proteomics/POL-PRprot.html
PR Drug Resistance Info:

Reverse Transcriptase takes the viral RNA and converts it into
DNA to be integrated into the cell:

RT Wiki: http://en.wikipedia.org/wiki/Reverse_transcriptase (see HIV
RT Sequence Info: http://bioafrica.mrc.ac.za/proteomics/POL-RTprot.html
RT Drug Resistance Info:

The number of permutations of amino acid sequences is pretty large. My thoughts
are that we should use the drug resistance info to target specific amino
acids for classification and reduce the data, then go from there.
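
One way this reduction could look, as a rough sketch only: keep just a
handful of positions from each sequence and turn them into 0/1 features.
It assumes each sequence is one comma-separated string of amino acid
letters, with | between alternatives. The helper names and the positions
in the usage note are placeholders; the real list would come from the
drug resistance info.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def select_positions(row, positions):
    """Keep only the given 1-based positions of a comma-separated sequence.

    Positions past the end of the sequence yield an empty string.
    """
    residues = row.strip().split(",")
    return [residues[p - 1] if p <= len(residues) else "" for p in positions]


def one_hot(selected, alphabet=AMINO_ACIDS):
    """Encode each selected position as 20 indicator values.

    An ambiguous entry such as "L|M" sets the indicator for every listed
    amino acid; an empty entry gives all zeros.
    """
    features = []
    for cell in selected:
        present = set(cell.split("|"))
        features.extend(1 if aa in present else 0 for aa in alphabet)
    return features
```

With the placeholder positions [2, 5], select_positions("P,Q,I,T,L|M", [2, 5])
returns ["Q", "L|M"], and one_hot of that gives 40 features instead of one
column per position in the whole protein.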
