[ml] Kaggle HIV update

Wed Jun 23 03:48:00 UTC 2010

Ah, that makes sense. Thank you. It looks like neither set of sequences
contains any Us, so they're both DNA rather than RNA?

> table(unlist(strsplit(d0$PR.Seq, "")))

    A     B     C     D     G     H     K     M     N     R     S     T     V     W     Y
97090     1 43476     4 61977     3   104   297     9  1334    59 67495     5   155   634

> table(unlist(strsplit(d0$RT.Seq, "")))

     A      B      C      D      G      H      K      M      N      R      S      T      V      W      Y
376584      5 156864      5 196008     16    435    694    235   4207    210 213708      5    465   2523
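
For anyone who would rather do the same check in Python, here is a rough
equivalent of the table() calls above. It is only a sketch: the function
names and the two example sequences are made up for illustration and are not
taken from the data set. The ambiguity codes are the standard IUPAC
nucleotide codes.

```python
from collections import Counter

# Standard IUPAC nucleotide ambiguity codes and the bases each one can stand for.
IUPAC_AMBIGUITY = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def letter_counts(sequences):
    """Count every character across a list of sequence strings."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    return counts


def looks_like_dna(counts):
    """True if there is no U and every letter is A/C/G/T or an IUPAC code."""
    if "U" in counts:
        return False
    return all(ch in "ACGT" or ch in IUPAC_AMBIGUITY for ch in counts)


# Hypothetical example sequences, not real rows from the data.
example = ["CCTCAGATCACT", "CCTCARATYACT"]
counts = letter_counts(example)
print(sorted(counts.items()))
print(looks_like_dna(counts))
```

Under this reading, letters such as R or Y in the counts above are not noise
as such: each one means the base at that position could be one of several.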

I'm trying to take a quick look now at mapping these back to amino acids
using the table you linked, just giving up for the moment if something is
ambiguous or unknown.
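
A minimal sketch of that kind of mapping, in Python, and not necessarily the
code that was actually run: translate codon by codon with the standard
genetic code and emit a placeholder whenever a codon contains anything other
than A, C, G, or T. The function name translate and the placeholder X are
assumptions made for this example, and the sketch assumes the sequence starts
in the correct reading frame.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, unknown="X"):
    """Translate a DNA string codon by codon.

    Codons containing ambiguity codes (or any non-ACGT letter) become
    `unknown`. Trailing bases that do not fill a whole codon are dropped.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], unknown))
    return "".join(protein)


print(translate("CCTCAGATC"))  # three unambiguous codons
print(translate("CCTCARATC"))  # middle codon contains the ambiguity code R
```

A 297-letter PR sequence would come out as 99 letters this way. A gentler
variant could expand each ambiguity code and keep the amino acid whenever
every possible codon agrees, but that is not done here.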

On Tue, Jun 22, 2010 at 7:31 PM, Mike Schachter <mike at mindmech.com> wrote:

> I found an explanation on the forum of the Kaggle page that
> explains what the non-standard letters mean, it linked to this:
>
> http://www.dna.affrc.go.jp/misc/MPsrch/InfoIUPAC.html
>
>    mike
>
>
>
> On Tue, Jun 22, 2010 at 5:27 PM, Mike Schachter <mike at mindmech.com> wrote:
>
>> Hey David,
>>
>> Unfortunately I don't think the sequences are amino acid sequences.
>>
>> For the PR sequences, most of them have a length of 297. If it's a
>> DNA sequence, then this means it codes for 99 amino acids (297 / 3, at
>> three nucleotides per codon). A quick look shows that HIV-1 Protease (the
>> protein whose sequence we're dealing with in the first sequence column)
>> is 99 amino acids long:
>>
>> http://www.bioafrica.net/proteomics/POL-PRprot.html
>>
>> Does that make sense? If it does, then the sequences from the data are
>> just noisy and of poor quality, and we're going to have to throw out some
>> of the noisy data before running it through a sequence aligner. I'm in the
>> process of doing this now, and will let everyone know how things are coming
>> along at the meeting.
>>
>> See everyone tonight!
>>
>>    mike
>>
>>
>>
>>
>> On Tue, Jun 22, 2010 at 8:37 AM, David Faden <dfaden at gmail.com> wrote:
>>
>>> It looks like the sequences are already coded in terms of amino acids
>>> rather than nucleotide triples?
>>> <http://www.biogem.org/Accelrys/Sequencing/symbols_amino_acids.html>
>>>
>>> On Mon, Jun 21, 2010 at 10:29 PM, Thomas Lotze <thomas.lotze at gmail.com> wrote:
>>>
>>>> I committed some Python for generating base pair triplet count features,
>>>> and R code for determining frequency and doing a basic GLM including the
>>>> most frequent triplets.
>>>> (The Noisebridge machine learning SourceForge git repository is here:
>>>> https://sourceforge.net/scm/?type=git&group_id=326816
>>>> To download the files, run
>>>> "git clone git://ml-noisebridge.git.sourceforge.net/gitroot/ml-noisebridge/ml-noisebridge"
>>>> or, better yet, ask Mike to give you read/write access to this project so
>>>> you can upload code as well.)
>>>>
>>>> This got me to 53.8462 MCE, 36th out of 49 teams.
>>>>
>>>> See you tomorrow night at 9 for fun with Hadoop!
>>>> -Thomas
>>>>
>>>> _______________________________________________
>>>> ml mailing list
>>>> ml at lists.noisebridge.net
>>>> https://www.noisebridge.net/mailman/listinfo/ml
>>>>
>>>>