[ml] Kaggle HIV update

Wed Jun 23 04:29:22 UTC 2010

those are protein sequences, not nucleic acid sequences.  they're all  
specific amino acids, except for B, which stands for Asp OR Asn.

-tim

(sent from my iPad Nano.)

On Jun 22, 2010, at 8:48 PM, David Faden <dfaden at gmail.com> wrote:

> Ah, that makes sense. Thank you. It looks like neither set of  
> sequences has any Us so they're both DNA?
>
> > table(unlist(strsplit(d0$PR.Seq, "")))
>
>     A     B     C     D     G     H     K     M     N     R     S     T     V     W     Y
> 97090     1 43476     4 61977     3   104   297     9  1334    59 67495     5   155   634
>
> > table(unlist(strsplit(d0$RT.Seq, "")))
>
>      A      B      C      D      G      H      K      M      N      R      S      T      V      W      Y
> 376584      5 156864      5 196008     16    435    694    235   4207    210 213708      5    465   2523
>
> I'm trying to take a quick look now at mapping these back to amino  
> acids using the table you linked, just giving up for the moment if  
> something is ambiguous or unknown.
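A minimal sketch of that mapping, assuming the sequences are DNA read in frame from the first base and translated with the standard genetic code. The function name `translate` and the `X` placeholder are illustrative choices, not taken from the thread; any codon containing a letter other than A, C, G or T is simply given up on.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, unknown="X"):
    """Translate DNA codon by codon.

    Any codon not made purely of A/C/G/T (for example one containing an
    ambiguity code such as R or N) is emitted as `unknown`.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], unknown) for i in range(0, usable, 3)
    )
```

Calling `translate` on a 297-base sequence would give a 99-letter string, with `X` marking every position that could not be resolved.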
>
> On Tue, Jun 22, 2010 at 7:31 PM, Mike Schachter <mike at mindmech.com> wrote:
> I found an explanation on the forum of the Kaggle page that
> explains what the non-standard letters mean, it linked to this:
>
> http://www.dna.affrc.go.jp/misc/MPsrch/InfoIUPAC.html
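For reference, here is a small sketch of the standard IUPAC nucleotide ambiguity codes as a lookup table, assuming the sequences are nucleotide sequences and that the non-standard letters follow that convention. The names `IUPAC_NUCLEOTIDES`, `expand` and `is_ambiguous` are made up for illustration.

```python
# Standard IUPAC nucleotide codes: each letter maps to the bases it may stand for.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def expand(seq):
    """Return, for each position, the set of bases that position could be.

    Raises KeyError on any letter that is not an IUPAC nucleotide code.
    """
    return [set(IUPAC_NUCLEOTIDES[ch]) for ch in seq.upper()]


def is_ambiguous(seq):
    """True if any position stands for more than one possible base."""
    return any(len(IUPAC_NUCLEOTIDES[ch]) > 1 for ch in seq.upper())
```

Every letter in the `table()` output elsewhere in this thread (A, B, C, D, G, H, K, M, N, R, S, T, V, W, Y) is a key in this table.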
>
>    mike
>
>
>
> On Tue, Jun 22, 2010 at 5:27 PM, Mike Schachter <mike at mindmech.com> wrote:
> Hey David,
>
> Unfortunately I don't think the sequences are amino acid sequences.
>
> For the PR sequences, most of them have a length of 297. If it's a
> DNA sequence, then this means it codes for 99 amino acids. A quick
> look shows that HIV-1 Protease (the protein whose sequence we're
> dealing with in the first sequence column) has 99 amino acids:
>
> http://www.bioafrica.net/proteomics/POL-PRprot.html
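The length argument (297 bases / 3 bases per codon = 99 amino acids) can be checked across a whole column with a few lines. This is only a sketch: `length_summary` is a hypothetical helper, and `seqs` stands for whatever list of sequence strings has been loaded.

```python
from collections import Counter


def length_summary(seqs):
    """Tally sequence lengths, most common first.

    Returns (length, count, is_whole_number_of_codons) tuples.
    """
    counts = Counter(len(s) for s in seqs)
    return [(n, k, n % 3 == 0) for n, k in counts.most_common()]


# Toy input, not real data: two sequences of length 297 and one short one.
toy = ["A" * 297, "C" * 297, "G" * 10]
for length, count, whole in length_summary(toy):
    print(length, count, "codons:", length // 3 if whole else "not a multiple of 3")
```

If the dominant length is 297 and it divides evenly by 3, that is consistent with a DNA sequence coding for a 99-residue protein.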
>
> Does that make sense? If it does, then the sequences from the data are
> just noisy and of poor quality, and we're going to have to throw out some
> of the noisy data before running it through a sequence aligner. I'm in the
> process of doing this now, and will let everyone know how things are coming
> along at the meeting.
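The thread does not say which criteria were used to discard noisy sequences. One plausible sketch, under the assumption that "clean" means the expected length and only unambiguous bases, is below; `is_clean` and `split_clean` are hypothetical names.

```python
def is_clean(seq, expected_len=297):
    """True if seq has the expected length and contains only A, C, G, T."""
    seq = seq.upper()
    return len(seq) == expected_len and set(seq) <= set("ACGT")


def split_clean(seqs, expected_len=297):
    """Partition sequences into (clean, noisy) lists, preserving order."""
    clean, noisy = [], []
    for s in seqs:
        (clean if is_clean(s, expected_len) else noisy).append(s)
    return clean, noisy
```

A stricter or looser rule (for example, tolerating a small number of ambiguity codes) may suit the data better; the default of 297 applies to the PR column only.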
>
> See everyone tonight!
>
>    mike
>
>
>
>
> On Tue, Jun 22, 2010 at 8:37 AM, David Faden <dfaden at gmail.com> wrote:
> It looks like the sequences are already coded in terms of amino
> acids rather than nucleotide triples?
> <http://www.biogem.org/Accelrys/Sequencing/symbols_amino_acids.html>
>
> On Mon, Jun 21, 2010 at 10:29 PM, Thomas Lotze <thomas.lotze at gmail.com> wrote:
> I committed some python for generating base pair triplet count  
> features, and R code for determining frequency and doing a basic GLM  
> including the most frequent triplets.
> (The Noisebridge machine learning sourceforge git repository is
> here: https://sourceforge.net/scm/?type=git&group_id=326816  To
> download the files, run
> "git clone git://ml-noisebridge.git.sourceforge.net/gitroot/ml-noisebridge/ml-noisebridge"
> or, better yet, ask Mike to give you read/write access to this
> project so you can upload code as well)
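The committed code is not reproduced in this thread, so the following is not that code. It is only a rough sketch of what triplet-count features could look like, assuming overlapping three-letter windows and a vocabulary of the most frequent triplets; `triplet_counts` and `feature_matrix` are hypothetical names.

```python
from collections import Counter


def triplet_counts(seq, overlapping=True):
    """Count three-letter substrings of seq.

    overlapping=True slides one base at a time; False steps a codon at a time.
    """
    step = 1 if overlapping else 3
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, step))


def feature_matrix(seqs, top_k=10):
    """Return (vocab, rows): the top_k most frequent triplets overall and,
    for each sequence, its counts of those triplets in vocab order."""
    per_seq = [triplet_counts(s) for s in seqs]
    total = Counter()
    for counts in per_seq:
        total.update(counts)
    vocab = [t for t, _ in total.most_common(top_k)]
    return vocab, [[counts[t] for t in vocab] for counts in per_seq]
```

The rows returned by `feature_matrix` could then be written out as columns for a model such as the GLM mentioned above.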
>
> This got me to 53.8462 MCE, 36th out of 49 teams.
>
> See you tomorrow night at 9 for fun with Hadoop!
> -Thomas
>
> _______________________________________________
> ml mailing list
> ml at lists.noisebridge.net
> https://www.noisebridge.net/mailman/listinfo/ml