
Monday, May 30, 2011

bam read counts

i've been thinking a lot about organization and the database, and i think i've come up with a working plan not only for getting the data i want from the sam/bam files, but for putting it in the right place for future access. i've also been reading up on how to query and update mongoDB and how to use pysam and pymongo, so that implementing this plan goes as smoothly as possible.


PLAN OF ATTACK:


  • open BAMFILE (one per individual)
    • query mongoDB for all documents that contain BAMFILE in the "individuals" value array
    • for each mongoDB document returned
      • query BAMFILE for the locus (samfile.fetch)
      • count reads that align to the locus
      • push the count for that individual to the locus document in mongoDB


i.e., a document that looks like this:
{
"locus" : "loc010",
"individuals" : [
"R01",
"R02",
"R03"
   ],
"length" : 332
}

will become:
{
"locus" : "loc010",
"individuals" : {
"R01" : 3,
"R02" : 15,
"R03" : 6
   },
"length" : 332
}

thus, the "individuals" field becomes an embedded document instead of an array, where each key/value pair maps an individual to its read count (an array can't hold key/value pairs directly). now i just need to figure out how to do each step. it's going to be a fun week!
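a rough sketch of the plan above, in python. the collection and field names come from the example document; the helper names and the one-BAM-per-individual mapping are my own assumptions, not working code from the project:

```python
def count_aligned_reads(bam_path, locus):
    """Count reads aligned to `locus` in an indexed BAM (needs pysam)."""
    import pysam  # local import so the pure helper below runs without pysam
    bam = pysam.Samfile(bam_path, "rb")  # 2011-era name; newer pysam calls this AlignmentFile
    n = sum(1 for _ in bam.fetch(locus))
    bam.close()
    return n

def as_count_subdocument(individuals, counts):
    """['R01', 'R02', ...] plus per-individual counts -> {'R01': 3, 'R02': 15, ...}."""
    return dict(zip(individuals, counts))

def push_counts(collection, doc, bam_paths):
    """Replace a locus document's `individuals` array with per-individual
    read counts, given an {individual: bam path} mapping (needs pymongo)."""
    counts = [count_aligned_reads(bam_paths[ind], doc["locus"])
              for ind in doc["individuals"]]
    collection.update(  # pymongo's 2011 API; newer versions use update_one
        {"_id": doc["_id"]},
        {"$set": {"individuals": as_count_subdocument(doc["individuals"], counts)}})
```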


Friday, May 27, 2011

first friday

i feel like i spent a lot of thursday spinning my wheels due to various barriers to coding. in the morning, i was working on the locusIndexMDB.py script and it stopped working on my office desktop. i ran it once and it worked; i changed the input file folder and it started throwing an error from within the seqlite_mod.py script ("map() requires two arguments", or something like that). i spent a fruitless... half hour? hour? ... trying to figure out how i had broken the script that one minute prior had been working, but then gave up and went home to my laptop (which ran the exact same script beautifully).

after that, i was beginning to figure out how to "re-access" fasta files once a path is saved in mongoDB, which led me to trying to parse json-formatted data, which led me to trying to figure out how to get the simplejson module, which led to setuptools, which led to frustration and a fundamental lack of understanding.

that took care of thursday, and when i went into my office this morning, the locusIndexMDB.py script was throwing the same error and i'm not really sure what to do about that. i don't understand what could cause an error originating in the seqlite_mod.py script to show up on one machine and not the other. the office machine had some problems with mongoDB because i accidentally installed it more than once and the poor computer got its wires crossed, so it may have something to do with that, but i don't know how.

again, i came home to my laptop and i have made the most satisfying progress in 48 hours this afternoon. i finally successfully installed setuptools and was thus able to get simplejson. i also installed pysam, for sam/bam files. so i've written two very small scripts that [1] can create sequence objects from a fasta file whose path is in mongoDB and [2] print data from a subset of reads in an indexed bam file that align to a given locus. although they are clumsy and there are probably better ways to achieve both these goals, i feel good about having these rather important initial steps taken. 
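the two baby scripts might look something like this. this is a reconstruction of the Biopython/pysam calls involved, not the actual code, and the function names are made up:

```python
def sequences_from_mongo(doc):
    """Script [1]: build sequence objects from the fasta file whose
    path is stored in a locus document (needs Biopython)."""
    from Bio import SeqIO  # local import so the pure helper below runs without it
    return list(SeqIO.parse(doc["path"], "fasta"))

def overlaps(read_start, read_end, locus_start, locus_end):
    """True when a read's half-open interval touches the locus interval --
    the test samfile.fetch effectively applies when picking reads."""
    return read_start < locus_end and locus_start < read_end

def print_locus_reads(bam_path, locus):
    """Script [2]: print a little data for each read aligned to a locus
    in an indexed BAM (needs pysam)."""
    import pysam
    bam = pysam.Samfile(bam_path, "rb")
    for read in bam.fetch(locus):
        print(read.qname, read.pos, read.seq)
    bam.close()
```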

OH. and technically the 48 hours of frustration began on wednesday afternoon, when i tried to come up with a protocol for committing code to github. what a disaster! i managed to brute force my way through it and gain some understanding of git/github, but it wasn't pretty. my account looks very silly, with files that are tagged incorrectly and lead to nothing, and i haven't figured out how to correct them yet. i think i'll spend some time this afternoon trying to commit my two new baby scripts, and hopefully i can do it faster and more cleanly than on wednesday!

Wednesday, May 25, 2011

day 2: SNP#/locus now in database!

today has been another pretty good day! this morning, brad had left three comments on github and i was able to use and incorporate all three suggestions relatively easily. these included getting the path to the folder of loci from the command-line, stripping newlines from a list and getting the length of a sequence object by accessing the first element of the list (something i banged my head against for about an hour yesterday). 
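for reference, the three suggestions boil down to something like this (my own reconstruction, not Brad's actual comments):

```python
import sys

def strip_newlines(lines):
    """Strip trailing newlines from a list of lines (e.g. the names file)."""
    return [line.strip() for line in lines]

def first_record_length(records):
    """Get a locus's length from the first sequence object in its list."""
    return len(records[0])

# and the folder of loci comes from the command line:
# folder = sys.argv[1]
```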

after that, i switched my efforts to including a "SNP" category in the database. i found a python script online that tallies up variable sites from an aligned fasta file (seqlite.py, written by steve haddock); i made some modifications to the script and imported it as a function into my inputLocusMDB.py script. shockingly enough, it worked! so now each locus has its own document in a mongoDB collection and each document contains the following fields: _id, locus, length, SNPs, path, individuals[]. script output (to the screen) looks like this:

-----------------------------------------------------------------

Got this folder: fasta
locus =  RAILmatic_1_aln ; number alleles =  36 ; length =  301 ; path =  fasta/RAILmatic_1_aln.fasta ; SNPs =  7
locus =  RAILmatic_2_aln ; number alleles =  34 ; length =  287 ; path =  fasta/RAILmatic_2_aln.fasta ; SNPs =  10
locus =  RAILmatic_3_aln ; number alleles =  34 ; length =  322 ; path =  fasta/RAILmatic_3_aln.fasta ; SNPs =  12
locus =  RAILmatic_4_aln ; number alleles =  24 ; length =  330 ; path =  fasta/RAILmatic_4_aln.fasta ; SNPs =  3
locus =  RAILmatic_5_aln ; number alleles =  40 ; length =  276 ; path =  fasta/RAILmatic_5_aln.fasta ; SNPs =  5
locus =  RAILmatic_6_aln ; number alleles =  40 ; length =  323 ; path =  fasta/RAILmatic_6_aln.fasta ; SNPs =  11
locus =  RAILmatic_7_aln ; number alleles =  30 ; length =  365 ; path =  fasta/RAILmatic_7_aln.fasta ; SNPs =  12
locus =  RAILmatic_8_aln ; number alleles =  14 ; length =  317 ; path =  fasta/RAILmatic_8_aln.fasta ; SNPs =  2

total of  8 loci

...reading the names file into a list...

name:  R01 found in:  8 loci
name:  R02 found in:  5 loci
name:  R03 found in:  5 loci
name:  R04 found in:  6 loci
name:  R05 found in:  7 loci
name:  R06 found in:  7 loci
name:  R07 found in:  8 loci
name:  R08 found in:  8 loci
name:  R09 found in:  4 loci
name:  R10 found in:  4 loci
name:  R11 found in:  7 loci
name:  R12 found in:  7 loci
name:  R13 found in:  7 loci
name:  R14 found in:  8 loci
name:  R15 found in:  6 loci
name:  R16 found in:  4 loci
name:  R17 found in:  6 loci
name:  R18 found in:  7 loci
name:  R19 found in:  7 loci
name:  R20 found in:  5 loci

-----------------------------------------------------------------
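seqlite.py itself isn't reproduced here, but the core idea — tally columns that have more than one base across an aligned fasta — can be sketched in a few lines. this is a simplified stand-in (gaps and ambiguity codes are just skipped; the real script is more careful about those):

```python
def count_snps(aligned_seqs):
    """Count variable columns in a set of aligned, equal-length sequences.
    Gaps and ambiguity codes are ignored here for simplicity."""
    snps = 0
    for column in zip(*aligned_seqs):  # walk the alignment column by column
        bases = set(b.upper() for b in column if b.upper() in "ACGT")
        if len(bases) > 1:  # more than one base observed -> variable site
            snps += 1
    return snps
```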

i had some trouble yesterday figuring out how to commit my code to github, so i think making sure i have a functioning code-committing protocol is next on my to do list. i'm also going to think hard about whether the database is set-up in the most functional way and how the metadata from the SAM/BAM files will be incorporated once i start on that.

Tuesday, May 24, 2011

knock on wood

coding has been going relatively well this morning/afternoon (knock on wood). starting with a folder of multi-fasta locus files, i've been able to read each of the files into a SeqIO.index and store the name of the locus and an array of the individuals represented in the locus to a MongoDB collection. i've also been able to output how many of the loci each individual has and the length of each locus. for my first python script ever, i feel not too bad about it. here's the screen output when i run the script on a folder with 5 loci and ≤ 20 individuals (R01..R20, up to 40 alleles since these guys are diploid):

-----------------------------------------------------------------
locus =  RAILmatic_1 ; number alleles =  34 length =  301
locus =  RAILmatic_2 ; number alleles =  32 length =  287
locus =  RAILmatic_3 ; number alleles =  30 length =  322
locus =  RAILmatic_4 ; number alleles =  12 length =  330
locus =  RAILmatic_5 ; number alleles =  40 length =  274
total of  5 loci

...reading the names file into a list...

name:  R01 found in:  5 loci
name:  R02 found in:  3 loci
name:  R03 found in:  3 loci
name:  R04 found in:  4 loci
name:  R05 found in:  4 loci
name:  R06 found in:  5 loci
name:  R07 found in:  5 loci
name:  R08 found in:  4 loci
name:  R09 found in:  2 loci
name:  R10 found in:  2 loci
name:  R11 found in:  3 loci
name:  R12 found in:  5 loci
name:  R13 found in:  4 loci
name:  R14 found in:  5 loci
name:  R15 found in:  4 loci
name:  R16 found in:  2 loci
name:  R17 found in:  4 loci
name:  R18 found in:  4 loci
name:  R19 found in:  4 loci
name:  R20 found in:  2 loci

-----------------------------------------------------------------
the MongoDB collection has a document like this for each locus:

{
"_id" : ObjectId("4ddc0f9a10d1b1a91c000003"),
"length" : 330,
"individuals" : [
"R07.02",
"R07.01",
"R14.02",
"R14.01",
"R12.01",
"R06.01",
"R06.02",
"R12.02",
"R01.01",
"R01.02",
"R19.01",
"R19.02"
],
"locus" : "RAILmatic_4.fasta"
}

-----------------------------------------------------------------
(the .01 and .02 refer to the 2 alleles each individual has)
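the core of the script might be sketched like this. the allele-naming convention (R07.01/R07.02) comes from the document above; the function names and details are assumptions, not the actual code:

```python
import glob
import os

def locus_document(locus_name, allele_ids, length):
    """Build the mongo document for one locus: the individuals array holds
    allele ids like 'R07.01'/'R07.02' (two alleles per diploid individual)."""
    return {"locus": locus_name, "length": length, "individuals": allele_ids}

def loci_per_individual(documents, names):
    """Count how many loci each individual appears in
    (an allele id 'R07.01' belongs to individual 'R07')."""
    counts = dict((name, 0) for name in names)
    for doc in documents:
        present = set(a.split(".")[0] for a in doc["individuals"])
        for name in present:
            if name in counts:
                counts[name] += 1
    return counts

def index_folder(folder):
    """Index each multi-fasta locus file and build its document (needs Biopython)."""
    from Bio import SeqIO  # local import so the pure helpers above run without it
    docs = []
    for path in glob.glob(os.path.join(folder, "*.fasta")):
        index = SeqIO.index(path, "fasta")
        ids = list(index)
        length = len(index[ids[0]])  # length of the first sequence object
        docs.append(locus_document(os.path.basename(path), ids, length))
    return docs
```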

next tasks = commit code to GitHub (not that it's particularly special, but want to get into the habit); work on updating documents in MongoDB; import SAM files and save metadata in appropriate MongoDB document

Monday, May 23, 2011

one...more...day...

the coding period officially begins tomorrow so i'm spending today getting my ducks in a row. i'm realizing more and more that my "simple" GUI is quite complex - mongoDB already seems like a great choice for keeping all the pieces organized and flexible. i've updated my schematic to include which input file each column of data is coming from - it's incredible to me how soothing adding colored text to a figure can be (or is that just me?).

tomorrow: first order of business - importing the locus/fasta files. brad suggested i only store the metadata for the locus files in the database then access the primary data via a linked file - he suggested Biopython's SeqIO.index_db in particular (i think i said all that right...) anyway, that's what i'll be starting on first thing tomorrow morning - i'm really just posting to get in the habit and upload my new, improved figure:


Thursday, May 19, 2011

Hello from the Pre-Coding Period!

Thanks for visiting my GSoC blog! Starting next Tuesday, I'm going to be coding an open-source program that will display, manipulate and output (in various formats) large multi-locus datasets. "What for?" you say? I came up with this project (tentatively called "lociNGS") because I see a need for a simple way to keep track of the type of datasets genome-enrichment techniques and next-generation sequencing produce. It's the wave of the future (according to, you know, me). Empiricists can now generate millions of sequence fragments and when the genome is reduced enough, you can get the same fragments across individuals in a sample. This results in hundreds to thousands of anonymous loci - a virtually unprecedented amount of data. And (I think) we need a way to view summaries of these datasets and to reformat these data for use in programs that can take multiple loci.
My project will do just that. See the figure below for a visualization of what I'm talking about. I want to be able to display the major aspects of a dataset, link demographic variables to a set of sequence reads, and output (1) the sequences used to call a locus and (2) useful formats for population genetic analyses - NEXUS, IMa2 and migrate to start with.
Since I was accepted to GSoC, I've been trying to get a semi-functional grasp on MongoDB (for databasing functions) and git / GitHub (for sharing code). Additionally, I've been giving myself a crash course on Python, a language I've been wanting to learn for a while but lacked the impetus to actually start. Now, with the promise of incorporating all the tools of Biopython, it has begun.
Next Tuesday, the coding officially begins. I'm being mentored by Jeremy Brown and Brad Chapman and am really excited to get started. I hope I'm up to snuff and this project turns out as awesome and useful as it already is in my brain.



The VISION for my GSoC Project