Sarah's (G)SoC 2011: as chris farley would say...

holy shnikes! i've made some very satisfying progress since yesterday morning. i've reworked locusIndexMDB.py and samIndexMDB.py and now the database (for two entries) looks like this:

{
    "_id" : ObjectId("4de7b8f210d1b1f7de000002"),
    "locusNumber" : "102",
    "indInFasta" : [
        "R01",
        "R03",
        "R02"
    ],
    "SNPs" : 3,
    "locusFasta" : "r1_102_aln.fasta",
    "length" : 309,
    "individuals" : {
        "R01" : 8,
        "R03" : 5,
        "R02" : 5
    },
    "path" : "/Users/shird/Documents/alignedLoci/r1_102_aln.fasta"
}

{
    "_id" : ObjectId("4de7b8f210d1b1f7de000003"),
    "locusNumber" : "103",
    "indInFasta" : [
        "R01",
        "R03"
    ],
    "SNPs" : 4,
    "locusFasta" : "r1_103_aln.fasta",
    "length" : 316,
    "individuals" : {
        "R01" : 7,
        "R03" : 3,
        "R02" : 1
    },
    "path" : "/Users/shird/Documents/alignedLoci/r1_103_aln.fasta"
}

breakdown of categories:

"_id" is a unique tag mongoDB gives every entry.
"locusNumber" is the number that identifies a locus (the PRGmatic pipeline sequentially numbers loci it calls; this number can then be associated with SAM files).
"indInFasta" is an array of individuals that were called for each particular locus.
"SNPs" refers to how many variable sites are contained within a locus.
"locusFasta" is the official file name.
"length" is the length of the locus.
"individuals" is an embedded document that gives, for each individual, the number of reads from that individual that aligned to the locus. it looks like a repeat of the "indInFasta" category, but it's not. a minimum coverage needs to be met before a locus can be called for an individual, but an individual that falls below that minimum can still have a few reads aligned to the locus. the second locus in the database above has two individuals in the "indInFasta" category and three in the "individuals" category. this is because "R02" has only one read aligned to the reference locus, which is below the threshold for calling a locus for an individual.
"path" refers to the path where the locus file came from.
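
to make the difference between "indInFasta" and "individuals" concrete, here is a minimal python sketch that works on plain dictionaries copied from the two entries above (with "_id" and "path" left out for brevity). the function name uncalled_individuals is hypothetical and is not part of locusIndexMDB.py or samIndexMDB.py; it only illustrates how the two categories can be compared.

```python
# the two database entries from the post, as plain python dicts
# ("_id" and "path" are omitted for brevity)
loci = [
    {
        "locusNumber": "102",
        "indInFasta": ["R01", "R03", "R02"],
        "SNPs": 3,
        "locusFasta": "r1_102_aln.fasta",
        "length": 309,
        "individuals": {"R01": 8, "R03": 5, "R02": 5},
    },
    {
        "locusNumber": "103",
        "indInFasta": ["R01", "R03"],
        "SNPs": 4,
        "locusFasta": "r1_103_aln.fasta",
        "length": 316,
        "individuals": {"R01": 7, "R03": 3, "R02": 1},
    },
]


def uncalled_individuals(doc):
    """Return {individual: read count} for individuals that have reads
    aligned to the locus but were not called in the fasta file."""
    called = set(doc["indInFasta"])
    return {ind: n for ind, n in doc["individuals"].items() if ind not in called}


for doc in loci:
    print(doc["locusNumber"], uncalled_individuals(doc))
# prints:
# 102 {}
# 103 {'R02': 1}
```

locus 103 is the only one with a leftover individual: "R02" has one aligned read but no called sequence.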

locusIndexMDB.py creates the database and inputs all information except the values within the "individuals" category. each of these categories is important for the final product and i'm particularly excited about the "individuals" document, because how much coverage an individual has for a locus is super important information and i haven't been able to find a way to get that information easily.
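
as a rough illustration of how an entry like the ones above could be assembled from an aligned locus, here is a sketch in plain python. it is not the code in locusIndexMDB.py. the helper names (count_variable_sites, build_locus_doc), the toy alignment, and the /tmp path are all hypothetical, and the sketch assumes one aligned sequence per individual, file names shaped like r1_102_aln.fasta, and that gaps and N are ignored when counting variable sites. the "individuals" category is left as an empty placeholder here, on the assumption that read counts get filled in by a separate step.

```python
import os


def count_variable_sites(seqs):
    """Count alignment columns that contain more than one distinct base.
    Gaps ('-') and 'N' are ignored (an assumption of this sketch)."""
    snps = 0
    for column in zip(*seqs):
        bases = {b.upper() for b in column} - {"-", "N"}
        if len(bases) > 1:
            snps += 1
    return snps


def build_locus_doc(path, alignment):
    """Build a database-style entry from an alignment.

    alignment: dict mapping individual name -> aligned sequence,
    all sequences the same length.
    """
    fasta = os.path.basename(path)
    seqs = list(alignment.values())
    return {
        # assumes file names like r1_102_aln.fasta
        "locusNumber": fasta.split("_")[1],
        "indInFasta": list(alignment),
        "SNPs": count_variable_sites(seqs),
        "locusFasta": fasta,
        "length": len(seqs[0]),
        # placeholder: read counts per individual are added separately
        "individuals": {},
        "path": path,
    }


# toy alignment, made up for illustration only
toy = {"R01": "ACGTAC", "R03": "ACGAAC", "R02": "AC-TAT"}
doc = build_locus_doc("/tmp/alignedLoci/r1_102_aln.fasta", toy)
print(doc["locusNumber"], doc["SNPs"], doc["length"], doc["indInFasta"])
# prints: 102 2 6 ['R01', 'R03', 'R02']
```

a dictionary shaped like this is what a MongoDB driver would take as a single entry to insert.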
i'll be pushing the scripts to github this afternoon and then working on retrieving any subset of the reads referenced above.

Thursday, June 2, 2011
