[ml] [kaggle-hiv] amino acid sequences available

Mike Schachter mike at mindmech.com
Wed Jun 30 18:09:26 UTC 2010


(For newcomers: we're working on a data mining competition to help predict
HIV treatment outcomes:
https://www.noisebridge.net/wiki/Machine_Learning/Kaggle_HIV)

Thanks to Dave's script I made some progress converting
gene sequences to amino acid sequences. Code and data
are available in the git repo:

git clone git://ml-noisebridge.git.sourceforge.net/gitroot/ml-noisebridge/ml-noisebridge

The sequences can be found in ml-noisebridge/kaggle/data/{pr,rt}seq_aas.csv.
The files are comma-separated lists of amino acid letters. Not every codon
(3 DNA bases) converts to a unique amino acid; where that happens, the
possible amino acid letters are listed separated by a |
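
To make the | notation concrete, here is a minimal sketch of a codon
translation. It is not Dave's script. It assumes the ambiguity comes from
IUPAC ambiguity codes in the DNA (e.g. R meaning A or G), and the sorted
order of the alternatives is my own choice. The names translate_codon and
translate are made up for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# IUPAC nucleotide codes mapped to the bases they can stand for.
# Assumption: this is how ambiguity shows up in the input sequences.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_codon(codon):
    """Return the amino acid(s) for one codon, joined by '|' if ambiguous.

    Raises KeyError on characters outside the IUPAC table (e.g. gaps).
    """
    options = [IUPAC[base] for base in codon.upper()]
    amino_acids = {CODON_TABLE["".join(c)] for c in product(*options)}
    return "|".join(sorted(amino_acids))


def translate(seq):
    """Translate a DNA string codon by codon, dropping a trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [translate_codon(seq[i:i + 3]) for i in range(0, usable, 3)]


print(translate("CCTARAATC"))  # ['P', 'K|R', 'I']
```

Here ARA expands to AAA (K) and AGA (R), so the middle position comes out
as K|R, while an ambiguous base that does not change the amino acid (AAR
is K either way) still gives a single letter.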

I'll go over this more tonight, but there are two proteins we
have sequences for: HIV Protease and HIV Reverse
Transcriptase. A great video that describes how these work
is available here:

http://www.youtube.com/watch?v=RO8MP3wMvqg&feature=related

HIV Protease helps cut up HIV proteins into their right shapes
once the cell starts producing them:

PR Wiki: http://en.wikipedia.org/wiki/HIV-1_protease
PR Sequence Info: http://www.bioafrica.net/proteomics/POL-PRprot.html
PR Drug Resistance Info:
http://hivdb.stanford.edu/cgi-bin/PositionPhenoSummary.cgi

Reverse Transcriptase takes the viral RNA and converts it into
DNA to be integrated into the cell:

RT Wiki: http://en.wikipedia.org/wiki/Reverse_transcriptase (see HIV subsection)
RT Sequence Info: http://bioafrica.mrc.ac.za/proteomics/POL-RTprot.html
RT Drug Resistance Info:
http://hivdb.stanford.edu/cgi-bin/PositionPhenoSummary.cgi

The number of possible amino acid sequences is pretty large. My thought
is that we should use the drug resistance info to pick out specific amino
acid positions for classification, which reduces the data, and then go
from there.
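
As a rough sketch of what that reduction could look like: keep only a
handful of positions and one-hot encode the amino acids seen there. The
assumptions here are mine: each line of the csv is one sequence with one
position per column, positions are 1-based, and the position list below is
only illustrative and should be replaced with positions taken from the drug
resistance pages above. parse_row and encode_positions are made-up names.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative placeholder positions only, not a vetted list.
PR_POSITIONS = [30, 46, 82, 84, 90]


def parse_row(line):
    """Split one comma-separated line into per-position amino acid cells."""
    return line.strip().split(",")


def encode_positions(aa_row, positions):
    """One-hot encode the chosen 1-based positions of one sequence.

    An ambiguous cell such as 'K|R' sets both the K and the R indicator.
    Positions past the end of the sequence encode as all zeros.
    """
    features = []
    for pos in positions:
        cell = aa_row[pos - 1] if pos <= len(aa_row) else ""
        present = set(cell.split("|"))
        features.extend(1 if aa in present else 0 for aa in AMINO_ACIDS)
    return features


row = parse_row("P,Q,K|R\n")
vec = encode_positions(row, [1, 3])
print(len(vec), sum(vec))  # 40 3
```

With 5 positions this gives 100 binary features per sequence instead of
one feature per possible letter at every position in the protein.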

  mike