Sarah's (G)SoC 2011: welcome june

yesterday was a pretty good day. i worked on the locusIndexMDB.py script in the morning and was able to:
1. determine and fix the reason the script was throwing an error on my desktop (reading in a directory will include hidden files!)
2. turn what was an array of individuals in each locus document into an embedded document, so that each individual's name is a key and a number is its value, all under the "individuals" key of the locus document. so what was:
{
"_id" : ObjectId("4ddc0f9a10d1b1a91c000003"),
"length" : 330,
"individuals" : [
"R07.02",
"R07.01",
"R14.02",
"R14.01",
"R12.01",
"R06.01",
"R06.02",
"R12.02",
"R01.01",
"R01.02",
"R19.01",
"R19.02"
],
"locus" : "RAILmatic_4.fasta"
}
is now:
{
"_id" : ObjectId("4de56d9010d1b1ee13000007"),
"SNPs" : 3,
"locus" : "RAIL_8.fasta",
"length" : 317,
"individuals" : {
"R01" : 0,
"R17" : 0,
"R14" : 0,
"R05" : 0,
"R07" : 0,
"R11" : 0,
"R08" : 0
},
"path" : "fasta/RAIL_8.fasta"
}
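
a minimal sketch of the two fixes above, not the actual locusIndexMDB.py code. the function names are hypothetical, and it assumes that an allele name is just the individual name plus a ".NN" suffix (as in "R07.02"), which is only inferred from the example documents:

```python
import os


def visible_files(directory):
    """Return the sorted entries of a directory, skipping hidden ones.

    os.listdir() also returns dotfiles (e.g. .DS_Store), which is the kind
    of thing that can break a script expecting only fasta files.
    """
    return sorted(name for name in os.listdir(directory)
                  if not name.startswith("."))


def alleles_to_individuals(alleles):
    """Collapse allele names into an embedded document of individuals.

    Assumption: the individual name is everything before the first dot,
    so "R07.01" and "R07.02" both map to "R07". Each count starts at 0.
    """
    individuals = {}
    for allele in alleles:
        individuals[allele.split(".")[0]] = 0
    return individuals


# a shortened version of the old-style document shown above
old_doc = {
    "length": 330,
    "individuals": ["R07.02", "R07.01", "R14.02", "R14.01"],
    "locus": "RAILmatic_4.fasta",
}

# same document, with the array replaced by an embedded document
new_doc = dict(old_doc,
               individuals=alleles_to_individuals(old_doc["individuals"]))
```

a possible side benefit of keying on "R07" rather than "R07.02": mongodb uses dots to address nested fields, so dotted names would be awkward as keys.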

this is a bigger accomplishment than it might look like at first. not only is the "individuals" value now an embedded (searchable and modifiable) document, but its keys are now individual names rather than allele names. this might be tricky to scale up to a broader array of input formats, but for now, it's good. i also updated the screen output: since the individuals are now stored in a document instead of an array, i had to rework how the script counts how many locus documents each individual appears in. i also began to understand more about json formatting and database organization, which is good.
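
one way the reworked count could look, as a rough sketch. the function name is hypothetical, and the locus documents are assumed to be plain dicts shaped like the new-style example above:

```python
from collections import Counter


def count_loci_per_individual(locus_docs):
    """Count how many locus documents each individual appears in.

    With "individuals" stored as an embedded document, the individual
    names are its keys, so each document adds 1 per key.
    """
    counts = Counter()
    for doc in locus_docs:
        counts.update(doc["individuals"].keys())
    return counts


# two made-up locus documents for illustration
docs = [
    {"locus": "RAIL_8.fasta", "individuals": {"R01": 0, "R17": 0}},
    {"locus": "RAIL_9.fasta", "individuals": {"R01": 0, "R05": 0}},
]

for name, n_loci in sorted(count_loci_per_individual(docs).items()):
    print(name, n_loci)
```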

yesterday afternoon, i worked on the samInputMDB.py script, which counts the reads from a samfile that align to each locus for each individual (to assign values to the individuals above). so far i've been able to get a list of the loci out of a samfile and to format an input file so that the individual names match the individuals above. last week i figured out how to count the reads that align to each reference locus. the next step is iterating over all individuals within all documents to update each one with the correct read count. but first, coffee...
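
a hedged sketch of that counting and updating step, not the real samInputMDB.py. it assumes one samfile per individual, and that the reference names in the samfile match the "locus" field of the documents exactly; both function names are made up. the only part taken from the sam format itself is that header lines start with "@", the third tab-separated column is the reference name, and "*" there means the read is unaligned:

```python
from collections import Counter


def count_reads_per_locus(sam_lines):
    """Count aligned reads per reference name in one individual's samfile."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a complete alignment record
        reference = fields[2]
        if reference != "*":  # "*" marks an unaligned read
            counts[reference] += 1
    return counts


def apply_counts(locus_docs, individual, counts):
    """Write one individual's read counts into every locus document that lists it."""
    for doc in locus_docs:
        if individual in doc["individuals"]:
            doc["individuals"][individual] = counts.get(doc["locus"], 0)


# made-up sam records for one individual (assumed here to be "R01")
sam_lines = [
    "@SQ\tSN:RAIL_8.fasta\tLN:317",
    "read1\t0\tRAIL_8.fasta\t1\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "read3\t0\tRAIL_8.fasta\t5\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
docs = [{"locus": "RAIL_8.fasta", "individuals": {"R01": 0, "R17": 0}}]

apply_counts(docs, "R01", count_reads_per_locus(sam_lines))
```

against a live database, the in-memory assignment would presumably become a "$set" update on a dotted path such as "individuals.R01", but that part is left out here.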

Wednesday, June 1, 2011