<html><body bgcolor="#FFFFFF"><div>those are protein sequences, not nucleic acid sequences.  they're all specific amino acids, except for B, which stands for Asp OR Asn.</div><div><br></div><div>-tim<div><br><div>(sent from my iPad Nano.)</div></div></div><div><br>On Jun 22, 2010, at 8:48 PM, David Faden <<a href="mailto:dfaden@gmail.com">dfaden@gmail.com</a>> wrote:<br><br></div><div></div><blockquote type="cite"><div>Ah, that makes sense. Thank you. It looks like neither set of sequences has any Us so they're both DNA?<div><br></div><div><div>> table(unlist(strsplit(d0$PR.Seq, "")))</div><div><br></div><div>    A     B     C     D     G     H     K     M     N     R     S     T     V     W     Y </div>

<div>97090     1 43476     4 61977     3   104   297     9  1334    59 67495     5   155   634 </div><div><br></div><div><div>> table(unlist(strsplit(d0$RT.Seq, "")))</div><div><br></div><div>     A      B      C      D      G      H      K      M      N      R      S      T      V      W      Y </div>

<div>376584      5 156864      5 196008     16    435    694    235   4207    210 213708      5    465   2523 </div></div><div><br></div><div>I'm trying to take a quick look now at mapping these back to amino acids using the table you linked, just giving up for the moment if something is ambiguous or unknown.</div>
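<div><br></div><div>A minimal sketch of that mapping in Python, assuming the sequences are DNA read in frame from the first base and that the standard genetic code applies. The names <code>CODON_TABLE</code> and <code>translate</code> are hypothetical and not taken from the repository. Any codon containing a letter other than A, C, G or T (R, Y, N and so on) is reported as X rather than resolved.</div>
<pre>
```python
from itertools import product

# Assumption: standard genetic code (NCBI translation table 1).
# Codons are enumerated in T, C, A, G order: TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, unknown="X"):
    """Translate a DNA string codon by codon, starting at the first base.

    A codon that is not made only of A/C/G/T maps to `unknown`;
    '*' marks a stop codon. A trailing partial codon is dropped.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], unknown) for i in range(0, usable, 3)
    )
```
</pre>
<div>With this, a 297-base PR sequence yields 99 residues, and <code>translate("ATGRCC")</code> gives <code>MX</code> because the codon RCC is ambiguous.</div>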

<br><div class="gmail_quote">On Tue, Jun 22, 2010 at 7:31 PM, Mike Schachter <span dir="ltr">&lt;<a href="mailto:mike@mindmech.com">mike@mindmech.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

I found an explanation on the forum of the Kaggle page that<br>explains what the non-standard letters mean. It linked to this:<br><br><a href="http://www.dna.affrc.go.jp/misc/MPsrch/InfoIUPAC.html" target="_blank">http://www.dna.affrc.go.jp/misc/MPsrch/InfoIUPAC.html</a><br>

<font color="#888888">

<br>   mike</font><div><div></div><div class="h5"><br><br><br><div class="gmail_quote">On Tue, Jun 22, 2010 at 5:27 PM, Mike Schachter <span dir="ltr">&lt;<a href="mailto:mike@mindmech.com" target="_blank">mike@mindmech.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="border-left:1px solid rgb(204, 204, 204);margin:0pt 0pt 0pt 0.8ex;padding-left:1ex">

Hey David,<br><br>Unfortunately I don't think the sequences are amino acid sequences.<br><br>For the PR sequences, most of them have a length of 297. If it's a<br>DNA sequence, then this means it codes for 99 amino acids. A quick<br>


look shows that HIV-1 Protease (the protein whose sequence we're<br>dealing with in the first sequence column) has 99 amino acids:<br><br><a href="http://www.bioafrica.net/proteomics/POL-PRprot.html" target="_blank">http://www.bioafrica.net/proteomics/POL-PRprot.html</a><br>


<br>Does that make sense? If it does, then the sequences from the data are<br>just noisy and of poor quality, and we're going to have to throw out some<br>of the noisy data before running it through a sequence aligner. I'm in the<br>


process of doing this now, and will let everyone know how things are coming<br>along at the meeting.<br><br>See everyone tonight!<br><font color="#888888"> <br>   mike<br><br><br><br><br></font><div class="gmail_quote"><div>


On Tue, Jun 22, 2010 at 8:37 AM, David Faden <span dir="ltr">&lt;<a href="mailto:dfaden@gmail.com" target="_blank">dfaden@gmail.com</a>&gt;</span> wrote:<br>

</div><div><div></div><div><blockquote class="gmail_quote" style="border-left:1px solid rgb(204, 204, 204);margin:0pt 0pt 0pt 0.8ex;padding-left:1ex">It looks like the sequences are already coded in terms of amino acids rather than nucleotide triples? &lt;<a href="http://www.biogem.org/Accelrys/Sequencing/symbols_amino_acids.html" target="_blank">http://www.biogem.org/Accelrys/Sequencing/symbols_amino_acids.html</a>&gt;<br>


<br><div class="gmail_quote"><div><div></div><div>On Mon, Jun 21, 2010 at 10:29 PM, Thomas Lotze <span dir="ltr">&lt;<a href="mailto:thomas.lotze@gmail.com" target="_blank">thomas.lotze@gmail.com</a>&gt;</span> wrote:<br>


</div></div><blockquote class="gmail_quote" style="border-left:1px solid rgb(204, 204, 204);margin:0pt 0pt 0pt 0.8ex;padding-left:1ex"><div><div></div><div>

I committed some Python for generating base pair triplet count features, and R code for determining frequency and doing a basic GLM including the most frequent triplets.<br>(The Noisebridge machine learning SourceForge git repository is here: <a href="https://sourceforge.net/scm/?type=git&group_id=326816" target="_blank">https://sourceforge.net/scm/?type=git&group_id=326816</a>  To download the files, run "git clone git://<a href="http://ml-noisebridge.git.sourceforge.net/gitroot/ml-noisebridge/ml-noisebridge" target="_blank">ml-noisebridge.git.sourceforge.net/gitroot/ml-noisebridge/ml-noisebridge</a>" or, better yet, ask Mike to give you read/write access to this project so you can upload code as well.)<br>
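<div><br></div><div>A rough sketch of what triplet count features could look like in Python. This is not the committed code: the function names <code>triplet_counts</code> and <code>feature_row</code> are hypothetical, and whether the triplets should be overlapping (sliding window) or in-frame codons is an assumption exposed through the <code>step</code> parameter.</div>
<pre>
```python
from collections import Counter


def triplet_counts(seq, step=1):
    """Count 3-letter substrings of seq.

    step=1 slides the window one base at a time (overlapping triplets);
    step=3 counts non-overlapping, in-frame triplets instead.
    Assumption: which of the two the original features used is unknown.
    """
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, step))


def feature_row(seq, vocabulary, step=1):
    """Turn one sequence into a fixed-length list of counts,
    one entry per triplet in vocabulary (missing triplets count as 0)."""
    counts = triplet_counts(seq, step)
    return [counts[triplet] for triplet in vocabulary]
```
</pre>
<div>For example, <code>feature_row("ATGATG", ["ATG", "GAT", "CCC"])</code> returns <code>[2, 1, 0]</code>. Choosing the vocabulary from the most frequent triplets across all sequences would give one fixed-width row per sequence, which is the shape a GLM expects.</div>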


<br>This got me to 53.8462 MCE, 36th out of 49 teams.<br><br>See you tomorrow night at 9 for fun with Hadoop!<br><font color="#888888">-Thomas<br>

</font><br></div></div>_______________________________________________<br>

ml mailing list<br>

<a href="mailto:ml@lists.noisebridge.net" target="_blank">ml@lists.noisebridge.net</a><br>

<a href="https://www.noisebridge.net/mailman/listinfo/ml" target="_blank">https://www.noisebridge.net/mailman/listinfo/ml</a><br>

<br></blockquote></div><br>


<br></blockquote></div></div></div><br>

</blockquote></div><br>

</div></div></blockquote></div><br></div>

</div></blockquote></body></html>