[ml] Kaggle HIV update
Tim
timaro at gmail.com
Wed Jun 23 05:26:52 UTC 2010
hah...actually, strike that. I didn't look closely enough at the
counts from my phone. with all the A's it's gotta be just a very
degenerate nucleic acid sequence.
serves me right for butting in ;-)
-tim
(sent from my iPad Nano.)
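
A quick way to check the "degenerate nucleic acid" reading: the fifteen letters in the tables quoted below (A B C D G H K M N R S T V W Y) are exactly the IUPAC nucleotide codes minus U, and common amino acid letters such as E, F, I, L, P and Q never appear. A minimal Python sketch of that check follows. The function names are made up for illustration and are not from the Noisebridge repository.

```python
from collections import Counter

# Standard IUPAC nucleotide codes: each letter maps to the bases it can stand for.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def looks_like_nucleotides(seq):
    """Return True if every character in seq is an IUPAC nucleotide code."""
    return set(seq.upper()) <= set(IUPAC_NUCLEOTIDES)


def unambiguous_fraction(seq):
    """Fraction of characters that are plain A, C, G or T (hypothetical helper)."""
    counts = Counter(seq.upper())
    total = sum(counts.values())
    plain = sum(counts[base] for base in "ACGT")
    return plain / total if total else 0.0
```

A sequence that passes looks_like_nucleotides but has an unambiguous_fraction below 1.0 contains degenerate positions, which is what the B, D, H, K, M, N, R, S, V, W and Y counts suggest.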
On Jun 22, 2010, at 9:29 PM, Tim <timaro at gmail.com> wrote:
> those are protein sequences, not nucleic acid sequences. they're
> all specific amino acids, except for B, which stands for Asp OR Asn.
>
> -tim
>
> (sent from my iPad Nano.)
>
> On Jun 22, 2010, at 8:48 PM, David Faden <dfaden at gmail.com> wrote:
>
>> Ah, that makes sense. Thank you. It looks like neither set of
>> sequences has any Us so they're both DNA?
>>
>> > table(unlist(strsplit(d0$PR.Seq, "")))
>>
>>      A      B      C      D      G      H      K      M
>>  97090      1  43476      4  61977      3    104    297
>>      N      R      S      T      V      W      Y
>>      9   1334     59  67495      5    155    634
>>
>> > table(unlist(strsplit(d0$RT.Seq, "")))
>>
>>      A      B      C      D      G      H      K      M
>> 376584      5 156864      5 196008     16    435    694
>>      N      R      S      T      V      W      Y
>>    235   4207    210 213708      5    465   2523
>>
>> I'm trying to take a quick look now at mapping these back to amino
>> acids using the table you linked, just giving up for the moment if
>> something is ambiguous or unknown.
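
A minimal sketch of the mapping described above, assuming the sequences are DNA read in frame from the first base. It uses the standard genetic code and gives up on any codon containing an ambiguity code by emitting "X". The names translate and CODON_TABLE are illustrative, not taken from anyone's actual code in this thread.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(seq, unknown="X"):
    """Translate DNA in frame 0; codons with non-ACGT letters become `unknown`."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], unknown) for i in range(0, usable, 3)
    )
```

A fuller version could resolve ambiguity codes whose expansions all encode the same amino acid, but emitting a placeholder matches the "just giving up for the moment" approach.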
>>
>> On Tue, Jun 22, 2010 at 7:31 PM, Mike Schachter <mike at mindmech.com>
>> wrote:
>> I found an explanation on the forum of the Kaggle page that
>> explains what the non-standard letters mean, it linked to this:
>>
>> http://www.dna.affrc.go.jp/misc/MPsrch/InfoIUPAC.html
>>
>> mike
>>
>>
>>
>> On Tue, Jun 22, 2010 at 5:27 PM, Mike Schachter <mike at mindmech.com>
>> wrote:
>> Hey David,
>>
>> Unfortunately I don't think the sequences are amino acid sequences.
>>
>> For the PR sequences, most of them have a length of 297. If it's a
>> DNA sequence, then this means it codes for 99 amino acids
>> (297 / 3 = 99). A quick look shows that HIV-1 Protease (the protein
>> whose sequence we're dealing with in the first sequence column) is
>> 99 amino acids long:
>>
>> http://www.bioafrica.net/proteomics/POL-PRprot.html
>>
>> Does that make sense? If it does, then the sequences from the data
>> are just noisy and of poor quality, and we're going to have to throw
>> out some of the noisy data before running it through a sequence
>> aligner. I'm in the process of doing this now, and will let everyone
>> know how things are coming along at the meeting.
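
One possible version of that cleanup step, sketched in Python. The filtering rule here is an assumption, not Mike's actual criterion: a PR sequence counts as clean only if it has the expected 297 bases (99 codons) and contains nothing but A, C, G and T. The function names are hypothetical.

```python
def is_clean(seq, expected_len=297):
    """Assumed rule: exact expected length and only unambiguous bases."""
    return len(seq) == expected_len and set(seq.upper()) <= set("ACGT")


def split_clean(seqs, expected_len=297):
    """Partition sequences into (clean, noisy) lists, preserving order."""
    clean, noisy = [], []
    for seq in seqs:
        (clean if is_clean(seq, expected_len) else noisy).append(seq)
    return clean, noisy
```

Keeping the noisy list around rather than discarding it makes it easy to see how much data a stricter or looser rule would cost.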
>>
>> See everyone tonight!
>>
>> mike
>>
>>
>>
>>
>> On Tue, Jun 22, 2010 at 8:37 AM, David Faden <dfaden at gmail.com>
>> wrote:
>> It looks like the sequences are already coded in terms of amino
>> acids rather than nucleotide triples?
>> <http://www.biogem.org/Accelrys/Sequencing/symbols_amino_acids.html>
>>
>> On Mon, Jun 21, 2010 at 10:29 PM, Thomas Lotze
>> <thomas.lotze at gmail.com> wrote:
>> I committed some python for generating base pair triplet count
>> features, and R code for determining frequency and doing a basic
>> GLM including the most frequent triplets.
>> (The Noisebridge machine learning sourceforge git repository is
>> here: https://sourceforge.net/scm/?type=git&group_id=326816
>> To download the files, run
>> "git clone git://ml-noisebridge.git.sourceforge.net/gitroot/ml-noisebridge/ml-noisebridge"
>> or, better yet, ask Mike to give you read/write access to this
>> project so you can upload code as well.)
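
The committed code is not reproduced in this thread, so the following is only a guess at what triplet count features might look like, not the repository's implementation. It counts non-overlapping triplets in frame by default (step=1 gives overlapping 3-mers) and keeps the most frequent ones as feature columns. All names are hypothetical.

```python
from collections import Counter


def triplet_counts(seq, step=3):
    """Count base triplets; step=3 reads in frame, step=1 slides one base at a time."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, step))


def feature_matrix(seqs, top_k=10, step=3):
    """Return (vocab, rows): the top_k most frequent triplets and per-sequence counts."""
    per_seq = [triplet_counts(s, step) for s in seqs]
    totals = Counter()
    for counts in per_seq:
        totals.update(counts)
    vocab = [triplet for triplet, _ in totals.most_common(top_k)]
    rows = [[counts[triplet] for triplet in vocab] for counts in per_seq]
    return vocab, rows
```

The rows could then be written out as a CSV and fed to a GLM in R, along the lines Thomas describes.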
>>
>> This got me to 53.8462 MCE, 36th out of 49 teams.
>>
>> See you tomorrow night at 9 for fun with Hadoop!
>> -Thomas
>>
>> _______________________________________________
>> ml mailing list
>> ml at lists.noisebridge.net
>> https://www.noisebridge.net/mailman/listinfo/ml