Thursday, May 19, 2011

Hello from the Pre-Coding Period!

Thanks for visiting my GSoC blog! Starting next Tuesday, I'm going to be coding an open-source program that will display, manipulate and output (in various formats) large multi-locus datasets. "What for?" you say? I came up with this project (tentatively called "lociNGS") because I see a need for a simple way to keep track of the kinds of datasets that genome-enrichment techniques and next-generation sequencing produce. It's the wave of the future (according to, you know, me). Empiricists can now generate millions of sequence fragments, and when the genome is reduced enough, they can recover the same fragments across the individuals in a sample. This results in hundreds to thousands of anonymous loci - a virtually unprecedented amount of data. And (I think) we need a way to view summaries of these datasets and to reformat the data for use in programs that can take multiple loci.
My project will do just that. See the figure below for a visualization of what I'm talking about. I want to be able to display the major aspects of a dataset, link demographic variables to a set of sequence reads, and output (1) the sequences used to call a locus and (2) useful formats for population genetic analyses - NEXUS, IMa2 and migrate to start with.
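
To make that concrete, here is a minimal sketch of the idea in Python. Every name in it (the record fields, `sequences_for_population`, `to_nexus`, the sample IDs and the sequences themselves) is made up for illustration. It is not the lociNGS design, just one way that linking demographic variables to sequences and writing a NEXUS block could fit together, using only the standard library.

```python
# Hypothetical record for one locus; the field names are illustrative only.
locus = {
    "locus_id": "locus_0001",
    "sequences": {  # individual ID -> aligned sequence (made-up data)
        "ind_A": "ACGTACGTAC",
        "ind_B": "ACGTACGTTC",
        "ind_C": "ACGAACGTAC",
    },
}

# Hypothetical demographic table: individual ID -> variables.
individuals = {
    "ind_A": {"population": "north"},
    "ind_B": {"population": "north"},
    "ind_C": {"population": "south"},
}


def sequences_for_population(locus, individuals, population):
    """Pick out the sequences at a locus that belong to one population."""
    return {
        ind: seq
        for ind, seq in locus["sequences"].items()
        if individuals[ind]["population"] == population
    }


def to_nexus(sequences):
    """Return a minimal NEXUS DATA block for aligned DNA sequences."""
    lengths = {len(seq) for seq in sequences.values()}
    if len(lengths) != 1:
        raise ValueError("need at least one sequence, all of the same length")
    nchar = lengths.pop()
    width = max(len(name) for name in sequences)
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(sequences)} NCHAR={nchar};",
        "  FORMAT DATATYPE=DNA MISSING=? GAP=-;",
        "  MATRIX",
    ]
    for name, seq in sequences.items():
        # Pad names so the sequence columns line up.
        lines.append(f"    {name.ljust(width)}  {seq}")
    lines += ["  ;", "END;"]
    return "\n".join(lines)


north = sequences_for_population(locus, individuals, "north")
print(to_nexus(north))
```

The dictionary shape is deliberately close to what a document database such as MongoDB stores, but nothing here talks to a database or to Biopython; those pieces would replace the hand-rolled parts in a real version.
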
Since I was accepted to GSoC, I've been trying to get a semi-functional grasp on MongoDB (for database functions) and git / GitHub (for sharing code). Additionally, I've been giving myself a crash course on Python, a language I've been wanting to learn for a while but lacked the impetus to actually start. Now, with the promise of incorporating all the tools of Biopython, it has begun.
Next Tuesday, the coding officially begins. I'm being mentored by Jeremy Brown and Brad Chapman and am really excited to get started. I hope I'm up to snuff and this project turns out as awesome and useful as it already is in my brain.

The VISION for my GSoC Project

