[ml] Kaggle HIV update
Tim
timaro at gmail.com
Wed Jun 23 04:29:22 UTC 2010
those are protein sequences, not nucleic acid sequences. they're all
specific amino acids, except for B, which stands for Asp OR Asn.
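
For anyone who wants to check the alphabet question directly, here is a minimal sketch. It assumes the 15 letters that appear in the two table() outputs quoted below and compares them with the IUPAC nucleotide codes and the one-letter amino acid codes. The names OBSERVED and alphabet_report are made up for illustration.

```python
# Letters that appear in the PR.Seq and RT.Seq tables quoted in this thread.
OBSERVED = set("ABCDGHKMNRSTVWY")

# IUPAC nucleotide codes: the bases, U, and the ambiguity codes.
IUPAC_NUCLEOTIDE = set("ACGTU" "RYSWKM" "BDHV" "N")

# The 20 standard one-letter amino acid codes, plus B, Z and X.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
IUPAC_AMINO_ACID = STANDARD_AMINO_ACIDS | set("BZX")


def alphabet_report(observed):
    """Summarise how a set of observed letters fits each alphabet."""
    return {
        "outside_nucleotide": sorted(observed - IUPAC_NUCLEOTIDE),
        "outside_amino_acid": sorted(observed - IUPAC_AMINO_ACID),
        "amino_acids_never_seen": sorted(STANDARD_AMINO_ACIDS - observed),
    }


report = alphabet_report(OBSERVED)
print(report)
```

The letter set alone fits both alphabets, so the more telling part of the report is probably the list of standard amino acids that never occur, together with the per-letter counts.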
-tim
(sent from my iPad Nano.)
On Jun 22, 2010, at 8:48 PM, David Faden <dfaden at gmail.com> wrote:
> Ah, that makes sense. Thank you. It looks like neither set of
> sequences has any Us so they're both DNA?
>
> > table(unlist(strsplit(d0$PR.Seq, "")))
>
>     A     B     C     D     G     H     K     M
> 97090     1 43476     4 61977     3   104   297
>     N     R     S     T     V     W     Y
>     9  1334    59 67495     5   155   634
>
> > table(unlist(strsplit(d0$RT.Seq, "")))
>
>      A      B      C      D      G      H      K      M
> 376584      5 156864      5 196008     16    435    694
>      N      R      S      T      V      W      Y
>    235   4207    210 213708      5    465   2523
>
> I'm trying to take a quick look now at mapping these back to amino
> acids using the table you linked, just giving up for the moment if
> something is ambiguous or unknown.
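
A rough sketch of that mapping, assuming the sequences are DNA read in frame from the first base. It uses the standard genetic code and gives up on any codon containing a non-ACGT letter by emitting "X". This is not David's code, and translate and CODON_TABLE are hypothetical names.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(seq, unknown="X"):
    """Translate DNA in frame 0; ambiguous or unknown codons become `unknown`."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], unknown) for i in range(0, usable, 3)
    )


print(translate("ATGGCCTGA"))  # a stop codon is shown as "*"
print(translate("ATGRCC"))     # R is an ambiguity code, so the codon is skipped
```

Marking ambiguous codons rather than dropping them keeps every residue at its original position, which should help if the output is aligned later.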
>
> On Tue, Jun 22, 2010 at 7:31 PM, Mike Schachter <mike at mindmech.com>
> wrote:
> I found an explanation on the Kaggle forum of what the non-standard
> letters mean. It linked to this page:
>
> http://www.dna.affrc.go.jp/misc/MPsrch/InfoIUPAC.html
>
> mike
>
>
>
> On Tue, Jun 22, 2010 at 5:27 PM, Mike Schachter <mike at mindmech.com>
> wrote:
> Hey David,
>
> Unfortunately I don't think the sequences are amino acid sequences.
>
> For the PR sequences, most of them have a length of 297. If it's a
> DNA sequence, then this means it codes for 99 amino acids (297 / 3).
> A quick look shows that HIV-1 Protease (the protein whose sequence
> we're dealing with in the first sequence column) is 99 amino acids long:
>
> http://www.bioafrica.net/proteomics/POL-PRprot.html
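
The length argument can be checked mechanically. This is only a sketch, assuming each sequence is a plain string of nucleotide letters. The 99 comes from the message above, and the function names are made up.

```python
PROTEASE_RESIDUES = 99  # protease length cited in this message


def codon_count(seq):
    """Number of whole codons, or None if the length is not a multiple of 3."""
    if len(seq) % 3 != 0:
        return None
    return len(seq) // 3


def looks_like_full_protease(seq):
    """True when the sequence length matches a 99-residue protein."""
    return codon_count(seq) == PROTEASE_RESIDUES


print(codon_count("A" * 297))
```

Sequences that fail this check might be the first candidates to set aside before alignment.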
>
> Does that make sense? If it does, then the sequences from the data are
> just noisy and of poor quality, and we're going to have to throw out
> some of the noisy data before running it through a sequence aligner.
> I'm in the process of doing this now, and will let everyone know how
> things are coming along at the meeting.
>
> See everyone tonight!
>
> mike
>
>
>
>
> On Tue, Jun 22, 2010 at 8:37 AM, David Faden <dfaden at gmail.com> wrote:
> It looks like the sequences are already coded in terms of amino
> acids rather than nucleotide triples?
> <http://www.biogem.org/Accelrys/Sequencing/symbols_amino_acids.html>
>
> On Mon, Jun 21, 2010 at 10:29 PM, Thomas Lotze
> <thomas.lotze at gmail.com> wrote:
> I committed some Python for generating base pair triplet count
> features, and R code for determining frequency and doing a basic GLM
> including the most frequent triplets.
> (The Noisebridge machine learning SourceForge git repository is
> here: https://sourceforge.net/scm/?type=git&group_id=326816 To
> download the files, run
> "git clone git://ml-noisebridge.git.sourceforge.net/gitroot/ml-noisebridge/ml-noisebridge"
> or, better yet, ask Mike to give you read/write access to this
> project so you can upload code as well.)
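
The committed code is not reproduced in this thread, so the following is only a guess at what triplet count features could look like, not Thomas's implementation. triplet_counts is a hypothetical name, and whether the original used overlapping or in-frame triplets is unknown, so both are offered.

```python
from collections import Counter


def triplet_counts(seq, overlapping=True):
    """Count 3-letter substrings, either sliding by 1 or stepping in frame by 3."""
    seq = seq.upper()
    step = 1 if overlapping else 3
    return Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, step))


counts = triplet_counts("ATGATGA")
in_frame = triplet_counts("ATGATGA", overlapping=False)
print(counts.most_common(2))
```

The counts for the most frequent triplets across the data set could then serve as columns in a GLM, as the message describes.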
>
> This got me to 53.8462 MCE, 36th out of 49 teams.
>
> See you tomorrow night at 9 for fun with Hadoop!
> -Thomas
>
> _______________________________________________
> ml mailing list
> ml at lists.noisebridge.net
> https://www.noisebridge.net/mailman/listinfo/ml