[Noisebridge-discuss] Bioinformatics Tutoring/Hack Day

Glen Jarvis glen at glenjarvis.com
Thu May 13 00:27:59 UTC 2010


I've been wanting to get the open source community more involved with some
of the problems that we're tackling. Open Source code is *so* much better
than code reviewed by only a few eyes. And, this would also give everyone a
chance to see what a simple bioinformatics problem would be like.

There may be some *real* bioinformaticians on this list (I don't
consider myself on that level yet -- although that's what I get paid for and
I'm getting there). So, if you're a real bioinformatician, this may be a
trivial problem for you. But, if you want to come and help explain
things/help others work this out, that'd be cool!

I'd like to get together (on a weekend, possibly) and hack on this problem.
I will describe the things that I think you need to know:

* What is FASTA format (http://www.ncbi.nlm.nih.gov/blast/fasta.shtml)
* A brief introduction to BioPython (http://biopython.org/) -- you can use
your own language and library, but we'll be using Python to explain
* What is a genome
* What is a gene
* What are amino acids (contrasting against DNA data)
* What is a 'percent identity' between genes
* What is a species
* What is a strain (loosely defined because it seems to be very loose in
this problem)
* The term taxa (plural) and taxon (singular)
* How can genes vary and still be the same gene
* How errors can exist in different databases
* An introduction to the JGI (http://www.jgi.doe.gov/) database
* An introduction to the UniProt (http://www.uniprot.org/) database
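
To make the first item concrete, here is a minimal sketch of reading FASTA
records using only the Python standard library. The function name
read_fasta and the example identifiers are made up for illustration (they
are not real JGI or UniProt IDs). In practice, BioPython's
Bio.SeqIO.parse(handle, "fasta") does this job more robustly.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up records for illustration only; the identifiers are hypothetical.
EXAMPLE = """>hypothetical_id_1 example protein A
MKTAYIAKQR
QISFVKSHFS
>hypothetical_id_2 example protein B
MSDNELLK
"""

records = list(read_fasta(StringIO(EXAMPLE)))
for header, sequence in records:
    # The first whitespace-separated token of the header is treated as the ID.
    print(header.split()[0], len(sequence))
```

Each record is a ">" header line followed by one or more lines of sequence,
so the parser only has to join the wrapped lines back together.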


With this introduction, you should have a theoretical understanding of all
that you need to solve this problem -- the rest is coding. (That is, if I do
my job and explain things well -- and don't fall into potholes of
information that I don't know.) Also, I've oversimplified by skipping things
that you don't need to know for this problem (e.g., we won't talk about open
reading frames at all or what that means; since we're already given amino
acids, we don't care).

The problem is:

I will give you a file in FASTA format of the genes for a particular species
(let's say: Chlamydophila pneumoniae). That file will contain a list of
genes, one after the other, again in FASTA format. The file will have the
JGI unique identifiers. However, we also want the UniProt identifier for
each of these same genes.

Now, this should be as simple as: "Take the gene from the JGI database,
look up the same gene in UniProt, record the number, dust off your hands -
you're done" -- There are lots of little tedious problems, however, that
keep it from being this easy.
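
The "simple" version can be sketched as an exact-sequence lookup. This is
only an illustration under stated assumptions: both data sets are already
loaded as (identifier, amino acid sequence) pairs, and the function name
map_exact and every identifier below are hypothetical. It is not how either
database is actually queried.

```python
def map_exact(jgi_records, uniprot_records):
    """Map each JGI ID to the UniProt IDs with an identical sequence.

    Both arguments are iterables of (identifier, sequence) pairs.
    Matching is exact, apart from ignoring letter case.
    """
    by_sequence = {}
    for uniprot_id, seq in uniprot_records:
        by_sequence.setdefault(seq.upper(), []).append(uniprot_id)
    return {
        jgi_id: by_sequence.get(seq.upper(), [])
        for jgi_id, seq in jgi_records
    }


# Hypothetical identifiers and made-up sequences, for illustration only.
jgi = [("jgi_0001", "MKTAYIAKQR"), ("jgi_0002", "MSDNELLK")]
uniprot = [("UP_AAA", "MKTAYIAKQR"), ("UP_BBB", "MSDNELLR")]

mapping = map_exact(jgi, uniprot)
print(mapping)
```

Here jgi_0002 gets an empty list because its sequence differs from UP_BBB at
one position, which is exactly the kind of case that makes the real problem
tedious.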

For example, if two genes have the same amino acid sequence except at a
single position, are they actually the same gene? What if the sequence found
was in a strain instead of from the original exact species? What if it's an
identical ortholog?
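
A rough way to put a number on "identical except at a single position" is a
percent identity. The sketch below is deliberately simplified: it assumes
the two sequences are the same length and already line up position by
position, so it ignores insertions and deletions, which a real comparison
would handle with a sequence alignment. The function name and the sequences
are made up for illustration.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions at which two equal-length sequences match.

    Simplifying assumption: no gaps, so position i in seq_a corresponds
    to position i in seq_b.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("this sketch only handles equal-length sequences")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Two made-up 10-residue sequences that differ only at the last position.
identity = percent_identity("MKTAYIAKQR", "MKTAYIAKQK")
print(identity)  # 9 of 10 positions match -> 90.0
```

Whether 90%, 99%, or only 100% counts as "the same gene" is a judgment call
that depends on the problem, not something the formula decides.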

Let me ask another question: If you were to somehow magically sequence your
personal entire genome (everything - not just genes) from a cell in your toe
and also sequence your entire genome from a cell from your nose, would they
be identical?  I bet not... I'll explain why. Now, we expect fewer
differences in actual genes than in other parts of your genome, but even
then, there can be some variation...

These are the types of questions/problems that we'll be getting into if
you're so interested...

Who's up for this?  We'll set a date and time once we have a set of
interested people...

This particular problem is basic compared to many that we deal with. But, it
shows you the kind of tediousness - and the kind of data that is out there.


Cheers,



Glen
-- 
Whatever you can do or imagine, begin it;
boldness has beauty, magic, and power in it.

-- Goethe