
Wednesday, June 29, 2011

slow and steady

i have begun the data format conversion from fasta to IMa2 (a widely-used, coalescent-based program that uses multi-locus data to estimate demographic parameters for a set of populations). this is going to be as big a beast as i expected; since the program does so much, its input format is kind of heinous. this might not make sense out of context, but the IMa2 format looks like this (the things in square brackets are just descriptors i've inserted so you know what everything means):

Description of data
2 [npops]
POP1 POP2 [pop names in order]
(0,1):2  [population tree]
2 [nloci]
locus1Name 2 [nIndPop1] 2 [nIndPop2] 10 [length] H [mutModel] 1.0 [inhScalar] 0.0008 [mutRate] 0.0005 - 0.0009 [mutRateRange]
indPOP1_1     ACGTGACGTA
indPOP1_2     ACGTGACGTA
indPOP2_1     AGGTGACGTA
indPOP2_2     AGGTGACGTA
locus2Name 2 [nIndPop1] 2 [nIndPop2] 12 [length] H [mutModel] 1.0 [inhScalar] 0.0008 [mutRate] 0.0005 - 0.0009 [mutRateRange]
indPOP1_1     CCGACGTGCAGG
indPOP1_2     CCGACGTGCAGG
indPOP2_1     CCGATGTGCAGG
indPOP2_2     CCGACGTGCAGG

my plan of attack is:
(1) the user is going to have to provide an input file to set some of the parameters that IMa2 requires (mutation model, inheritance scalar, population tree, etc.). my first task is to read that file and place the correct lines in the correct spots in the output file. yesterday, i created what i think the input file will look like and successfully imported it into a dictionary (so that i can output the lines based on keys later; see the sketch after this list). i can also track the number of loci, another input parameter.
(2) data from other scripts. getPopsFromMDB.py creates a list of (locus) files that have at least one individual from a given set of populations. from this script, i record the number of populations and their names, and get the actual files to format.
(3) correctly integrate the pieces. sequences need to be renamed and ordered correctly, and each locus gets its own header line with locus-specific values.
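
here is a minimal sketch of the per-locus part of that import for item (1) - the tab-delimited layout (locus name, mutation model, inheritance scalar, mutation rate, rate range) and the file name are my guesses, not the final format:

import csv

def read_ima2_params(param_path):
    # read the user-supplied parameter file into a dict keyed by locus name
    # assumed layout, tab-delimited, one line per locus:
    # locusName  mutModel  inhScalar  mutRate  mutRateRange
    params = {}
    with open(param_path) as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue  # skip blank lines and comments
            locus, mut_model, inh_scalar, mut_rate, rate_range = row[:5]
            params[locus] = {"mutModel": mut_model,
                             "inhScalar": inh_scalar,
                             "mutRate": mut_rate,
                             "mutRateRange": rate_range}
    return params

params = read_ima2_params("ima2_params.txt")   # hypothetical file name
nloci = len(params)                            # nloci is another line IMa2 needs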

slow and steady wins the race on this one.

yesterday, husband/fellow grad student noah reid suggested that instead of the number of "reads" associated with each individual for each locus in the db.loci database, it might be more helpful to display "average coverage". i modified samIndexMDB.py to record this change and i agree that it will be more helpful. (with the test data i have, "reads" basically is "average coverage", but that won't be true for users whose reads don't each span a full locus, so this is more robust. yay.) i also put the finishing touches on toNexus.py, which takes the list of files generated by getPopsFromMDB.py as input and reformats them into nexus format. this was really easy since biopython already has a module to do it; now it's a function in my scripts.
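
the biopython piece boils down to something like this - a sketch with a recent biopython (older releases used an alphabet argument instead of molecule_type); file names are placeholders:

import os
from Bio import AlignIO

def fasta_to_nexus(fasta_path):
    # nexus output needs to know the molecule type, so pass it explicitly
    nexus_path = os.path.splitext(fasta_path)[0] + ".nex"
    AlignIO.convert(fasta_path, "fasta", nexus_path, "nexus", molecule_type="DNA")
    return nexus_path

fasta_to_nexus("JUNCOmatic_121_aln.fasta")   # writes JUNCOmatic_121_aln.nex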

Monday, June 27, 2011

hooray - populationsInFasta


after a very painful afternoon of not understanding why seemingly flawless code was misbehaving, i have been able to add a "populationsInFasta" field to the db.loci documents. i added code to textDataIndexMDB.py that scans every document for a match against an array of the individuals belonging to a population and, if one is found, adds that population to the populationsInFasta array. i'm going to push the code to GitHub and hopefully do some robustness testing today/tomorrow with other datasets. then, with the rest of the week, i begin format conversions. gosh - i just LOVE when troublesome code gets resolved!
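
the heart of it is a query-plus-update per population; here is a minimal pymongo sketch under my assumptions (current pymongo, a hypothetical database name, and the population-to-individuals map pulled from db.demographic):

from pymongo import MongoClient

db = MongoClient().lociNGS   # hypothetical database name

# population -> list of individuals, built from the demographic collection
popMap = {}
for doc in db.demographic.find({}, {"Individual": 1, "Population": 1}):
    popMap.setdefault(doc["Population"], []).append(doc["Individual"])

# tag every locus document containing at least one individual from each population
for pop, inds in popMap.items():
    db.loci.update_many({"indInFasta": {"$in": inds}},
                        {"$addToSet": {"populationsInFasta": pop}})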

the db.loci documents now look like this:

{ "SNPs" : 7,
"_id" : ObjectId("4e08edef10d1b18af7000017"),
"indInFasta" : [
"J11",
"J02",
"J05",
"J09",
"J08",
"J07",
"J18",
"J19",
"J17",
"J14",
"J15",
"J12",
"J13",
"J10",
"J06"],
"individuals" : { 
"J01" : 2,
"J02" : 8,
"J03" : 3,
"J04" : 2,
"J05" : 6,
"J06" : 5,
"J07" : 3,
"J08" : 9,
"J09" : 7,
"J10" : 10,
"J11" : 9,
"J12" : 5,
"J13" : 7,
"J14" : 5,
"J15" : 11,
"J16" : 1,
"J17" : 10,
"J18" : 8,
"J19" : 14,
"J20" : 2 },
"length" : 296,
"locusFasta" : "JUNCOmatic_121_aln.fasta",
"locusNumber" : "121",
"path" : "/Users/shird/Documents/Dropbox/lociNGS1/juncoLoci/JUNCOmatic_121_aln.fasta",
"populationsInFasta" : [ 
"POP3",
"POP2",
"POP1",
"POP4" ] }

so much useful information! i can now query for loci that contain at least one individual from a given set of populations.
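
for instance, with the same placeholder database name as the sketch above, the query itself is short:

from pymongo import MongoClient

db = MongoClient().lociNGS   # hypothetical database name
wanted = ["POP1", "POP2"]    # the populations of interest

# loci with at least one individual from ANY of the requested populations
any_pop = db.loci.find({"populationsInFasta": {"$in": wanted}})
# or, stricter: loci where EVERY requested population is represented
all_pops = db.loci.find({"populationsInFasta": {"$all": wanted}})
for doc in all_pops:
    print(doc["locusFasta"])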

Friday, June 24, 2011

finally back to it!

so excited to be back at work on my SoC project. this morning i've decided that i'm going to add a "populations" field to the db.loci documents - this way, i can simply query for the populations i want (instead of querying the individuals for membership in the given populations). that's all for now - just letting everyone know i'm happily back on the grid!

(oh! and the evolution conference was great. exhausting, as per usual, but great. i got to meet fellow SoC student Laurel Yohe and NESCent guru Karen Cranston. and my talk went ok - which was the minimum i was hoping for. there was some interest in PRGmatic, which is pretty nice/intimidating. hope to see you all next year in ottawa!)

Wednesday, June 15, 2011

hold up

in order to do IMa2 or migrate formatting, i have to be able to query populations, not just a list of individuals. duh! i'm switching gears and focusing on writing to the terminal all the loci that contain any individuals from a given subset of populations, regardless of which particular individuals those are. from there i can create the input list of files for the converters and work on those scripts next (with proper input!). i'm a little surprised i'm only putting this together now but excited to have a plan of action. i knew i couldn't really be ahead of schedule!

Tuesday, June 14, 2011

mostly thinking out loud

this week has been kind of weird because i'm a little preoccupied with the talk i'm giving at the evolution meeting AND the fact that i'm a little ahead of schedule and need to begin the next big task: the file conversions. (i'm not actually sure i'm ahead of schedule - outputting the data to the terminal screen has been straightforward, so i'm moving on to file format conversions and reserving the right to go back and re-examine outputting individual/demographic data.)

since one of the main motivations for the program is to be able to generate files from a subset of the data (i.e., a specific list of individuals), i've been thinking about how to accomplish this. i've decided it'll be a set of scripts: the first generates a list of files that contain the input list of individuals, then calls the relevant "converter scripts" to do the leg work of reformatting the files on that list. phew.

i've written a "new" script, getFastaFromMDB.py, that generates the file list from a list of individuals (given at the command line). it's basically the combined pieces of the scripts i've been messing with lately. since Biopython has a fasta-to-nexus converter, that conversion is really easy (and already done). the next step for me is to work on the IMa2 converter script, which is probably not trivial.

on a side note, john mccormack, a post-doc in the brumfield lab, asked me if the program would be able to find loci that contain at least some number of individuals from each population, where the identity of the individuals is unknown/unimportant (e.g., *BEAST requires at least one individual per population for each locus). that is a really good idea and seems like something i can do and the program should do. the back of my brain is working on an algorithm for making this happen. in the mean time, i'm going to forge ahead with the IMa2 converter.
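
a rough sketch of how that check could work on the existing documents (the threshold, database name, and population map are my assumptions; the real version would live in its own script):

from pymongo import MongoClient

db = MongoClient().lociNGS   # hypothetical database name

# population -> list of individuals, built from db.demographic
popMap = {}
for doc in db.demographic.find({}, {"Individual": 1, "Population": 1}):
    popMap.setdefault(doc["Population"], []).append(doc["Individual"])

def loci_with_minimum(min_per_pop=1):
    # return locus files where every population has at least min_per_pop
    # individuals present, regardless of which individuals they are
    keepers = []
    for doc in db.loci.find({"indInFasta": {"$exists": True}}):
        present = set(doc["indInFasta"])
        if all(len(present.intersection(inds)) >= min_per_pop
               for inds in popMap.values()):
            keepers.append(doc["locusFasta"])
    return keepers

print(loci_with_minimum(1))   # e.g. loci usable for *BEAST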

Thursday, June 9, 2011

outputting data to terminal

regarding brad's advice that i should try to get desired output onto the terminal screen: i have made some progress and indeed feel better about the GUI-filled future. (cue homer simpson: "mmmm. gooey-filled future".)
outputting the demographic data in the db.demographic collection was easy. this data is strictly {key:value} pairs, so nothing fancy. with a new script, displayMDBdata.py, i can get out all the data that i put in, in the original format (which sounds slightly circular but is good). the beautiful script:
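
(a minimal sketch of what it does, using the field names from the output below; the database name is a placeholder:)

from pymongo import MongoClient

db = MongoClient().lociNGS   # hypothetical database name

cursor = db.demographic.find().sort("Individual")
print("Setup cursor:", cursor)
for doc in cursor:
    print("Individual:", doc["Individual"],
          ", Location:", doc["Location"],
          ", Total Number of Reads:", doc["totalReads"],
          ", Number of Loci:", doc["numLoci"])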

the beautiful output:
Setup cursor: <pymongo.cursor.Cursor object at 0x10155c250>
Individual: J01 , Location: NoPlace, TX , Total Number of Reads: 6281 , Number of Loci: 148
Individual: J02 , Location: NoPlace, TX , Total Number of Reads: 5570 , Number of Loci: 126
...etc...

outputting the locus information has revealed an intellectual hurdle that i failed to jump over (a la america's funniest home videos), regarding cursors. i tried the formula that worked for outputting the demographic data: set up a cursor and print the value for each key. but since the list of individuals within each fasta locus is stored as an array, and the counts associated with each individual within a locus are stored as an embedded document, the not-as-beautiful output looks like this:

Locus: JUNCOmatic_100_aln.fasta , Individuals Present: [u'J09', u'J08', u'J10', u'J18', u'J19', u'J01', u'J17', u'J03', u'J06', u'J07', u'J11'] , Individuals with Read Counts: {u'J15': 1, u'J12': 2, u'J20': 2, u'J14': 5, u'J13': 2, u'J09': 10, u'J08': 4, u'J16': 1, u'J10': 8, u'J18': 9, u'J19': 4, u'J01': 4, u'J17': 3, u'J03': 5, u'J11': 3, u'J05': 1, u'J04': 2, u'J07': 4, u'J06': 8}

i don't know what i was expecting but that isn't it. the u'data' format confuses me (unicode?) and i haven't been able to wrap my brain around json-formatted data and translating it with python (even with a module called simplejson!).
as near as i can tell, the "Individuals Present" array and "Individuals with Read Counts" document are not "cursored" - whatever dictionary-creating magic happens when saving the results of find() as a cursor doesn't apply to embedded data. something tells me this should be an easy fix, it's just a matter of finding it. so that's up next...
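
(for what it's worth, the u'' prefix is just python 2's marker for unicode strings, and pymongo hands the embedded array and document back as plain python lists and dicts, so they can be iterated directly rather than run through a json module. a possible fix, sketched with the field names from the dump above and a placeholder database name:)

from pymongo import MongoClient

db = MongoClient().lociNGS   # hypothetical database name

for doc in db.loci.find({"locusFasta": {"$exists": True}}):
    present = ", ".join(sorted(doc["indInFasta"]))
    counts = ", ".join("%s: %d" % (ind, n)
                       for ind, n in sorted(doc["individuals"].items()))
    print("Locus:", doc["locusFasta"],
          "| Individuals Present:", present,
          "| Individuals with Read Counts:", counts)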

Tuesday, June 7, 2011

brainstorm: counts

this morning i had a couple of brainstorms: storing the total number of (high-quality) reads an individual has (i.e., present in the sam/bam file), storing the number of loci an individual has (i.e., how many locus files each individual is present in), and storing the bam directory path so that the user won't have to enter it more than once. so i've updated locusIndexMDB.py, samIndexMDB.py and textDataIndexMDB.py, and i'll commit them later today. (i also incorporated suggestions jeremy made and am in the process of adding more comments throughout the scripts.)

db.demographic has as many documents as there are individuals in the dataset and they now all look something like this:

{  "_id" : ObjectId("4dee6475318a120c74000000"),
 "Individual" : "J01",
 "Latitude" : "45.678",
 "Location" : "NoPlace, TX",
 "Longitude" : "-109.876",
 "Population" : "POP1",
 "Species" : "Junco hyemalis",
 "numLoci" : 148,
 "totalReads" : "6281" }

db.loci has as many documents as there are loci (plus one for the bamPath document, which contains just the path to the bam directory). they now all look something like this:

{ "_id" : ObjectId("4dee6482318a120c78000295"),
  "SNPs" : 10,
  "indInFasta" : [ "J01",
   "J17",
   "J19" ],
 "individuals" : { "J01" : 3,
  "J02" : 2,
  "J03" : 1,
  "J04" : 2,
  "J05" : 2,
  "J06" : 1,
  "J11" : 1,
  "J12" : 2,
  "J13" : 2,
  "J14" : 2,
  "J15" : 1,
  "J17" : 3,
  "J19" : 6 },
 "length" : 266,
 "locusFasta" : "JUNCOmatic_719_aln.fasta",
 "locusNumber" : "719",
 "path" : "/Users/shird/Documents/Dropbox/lociNGS1/juncoLoci/JUNCOmatic_719_aln.fasta" }

i'm quite satisfied with the state of the database right now, but that might be because i haven't done much with it. i've gotten reasonably good at adding/updating things, but haven't really retrieved/displayed much. i am a little scared about how well i'll be able to incorporate database info into the GUI and what's going to happen when i apply these scripts to datasets i haven't personally created. that's next, i suppose. ONWARD!

Monday, June 6, 2011

demographic data done! (for now, anyway)

this morning i was able to very quickly import tab-delimited data into mongoDB (i'm learning things!). i decided to put the demographic data into its own collection, so my database now has two collections: db.loci and db.demographic. an input of (tab-delimited) demographic data like this:

Individual Species Location Latitude Longitude Population
J01 Junco hyemalis NoPlace, TX 45.678 -109.876 POP1
J02 Junco hyemalis NoPlace, TX 45.678 -109.876 POP1
J11 Junco phaeonotus NoPlace, MS 49.800 -110.424 POP2
J12 Junco phaeonotus NoPlace, MS 49.800 -110.424 POP2

looks like this in the db.demographic collection: 

{ "_id" : ObjectId("4ded0248318a12036f000000"), "Longitude" : "-109.876", "Individual" : "J01", "Location" : "NoPlace, TX", "Latitude" : "45.678", "Species" : "Junco hyemalis", "Population" : "POP1" }
{ "_id" : ObjectId("4ded0248318a12036f000001"), "Longitude" : "-109.876", "Individual" : "J02", "Location" : "NoPlace, TX", "Latitude" : "45.678", "Species" : "Junco hyemalis", "Population" : "POP1" }
{ "_id" : ObjectId("4ded0248318a12036f00000a"), "Longitude" : "-110.424", "Individual" : "J11", "Location" : "NoPlace, MS", "Latitude" : "49.800", "Species" : "Junco phaeonotus", "Population" : "POP2" }
{ "_id" : ObjectId("4ded0248318a12036f00000b"), "Longitude" : "-110.424", "Individual" : "J12", "Location" : "NoPlace, MS", "Latitude" : "49.800", "Species" : "Junco phaeonotus", "Population" : "POP2" }

hooray! (and the code is only 30 lines!)
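
the import is basically a csv.DictReader feeding an insert; a sketch of the core of it (file and database names are placeholders, and the real textDataIndexMDB.py surely does more):

import csv
from pymongo import MongoClient

db = MongoClient().lociNGS   # hypothetical database name

with open("demographics.txt") as handle:   # the tab-delimited file shown above
    rows = list(csv.DictReader(handle, delimiter="\t"))

# each row becomes one document in db.demographic, keyed by the header names
db.demographic.insert_many(rows)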

now i'm trying to figure out the most intelligent way to proceed. my project plan has this week allotted for the demographic input task (textDataIndexMDB.py), quality data input, and retrieving SAM/BAM sequence files. i'm rethinking whether quality data is worth putting into the database (since all reads used to call a locus should already be high quality). the retrieval of SAM/BAM data should just require saving the path to their directory in the loci collection (much like the retrieval of sequence data from locus fasta files). i'm worried i haven't allotted enough time for the data formatting and GUI construction - the latter especially, since i've never done it before! i suppose i'll try to get the official tasks done by tomorrow, then update the project plan to reflect an unanticipated half-week bonus. that is, of course, assuming all goes according to plan..

Thursday, June 2, 2011

as chris farley would say...

holy schnikes! i've made some very satisfying progress since yesterday morning. i've reworked locusIndexMDB.py and samIndexMDB.py and now the database (for two entries) looks like this:



{ "_id" : ObjectId("4de7b8f210d1b1f7de000002"),
 "locusNumber" : "102",
 "indInFasta" : [
"R01",
"R03",
"R02" 
],
 "SNPs" : 3,
 "locusFasta" : "r1_102_aln.fasta",
 "length" : 309,
 "individuals" : {
  "R01" : 8,
  "R03" : 5,
  "R02" : 5 
  },
 "path" : "/Users/shird/Documents/alignedLoci/r1_102_aln.fasta" }


{ "_id" : ObjectId("4de7b8f210d1b1f7de000003"),
 "locusNumber" : "103",
 "indInFasta" : [
  "R01",
  "R03" 
  ],
 "SNPs" : 4,
 "locusFasta" : "r1_103_aln.fasta",
 "length" : 316,
 "individuals" : {
  "R01" : 7,
  "R03" : 3,
  "R02" : 1 
  },
 "path" : "/Users/shird/Documents/alignedLoci/r1_103_aln.fasta" }

breakdown of categories:

  • "_id" is a unique tag mongoDB gives every entry. 
  • "locusNumber" is the number that identifies a locus (the PRGmatic pipeline sequentially numbers loci it calls; this number can then be associated with SAM files). 
  • "indInFasta" is an array of individuals that were called for each particular locus. 
  • "SNPs" refers to how many variable sites are contained within a locus. 
  • "locusFasta" is the official file name. 
  • "length" is the length of the locus. 
  • "individuals" is an embedded document where each individual is given a value for how many reads from that individual aligned to that particular locus. it looks like a repeat of the "indInFasta" category, but it's not: a minimum coverage has to be met before a locus is called for an individual, but individuals below that threshold can still have a few reads aligned to the locus. the second locus in the database above has two individuals in the "indInFasta" category and three in the "individuals" category because "R02" has only one read aligned to the reference locus, which is below the threshold for calling the locus for that individual. 
  • "path" refers to the path where the locus file came from. 



locusIndexMDB.py creates the database and inputs all the information except the values within the "individuals" category. each of these categories is important for the final product, and i'm particularly excited about the "individuals" document, because how much coverage an individual has at a locus is super important information that i haven't found an easy way to get before.
i'll be pushing the scripts to github this afternoon and then working on retrieving any subset of the reads referenced above.


Wednesday, June 1, 2011

welcome june

yesterday was a pretty good day. i worked on the locusIndexMDB.py script in the morning and was able to:
1. determine and fix the reason the script was throwing an error on my desktop (reading in a directory will include hidden files!)
2. turn what was an array of individuals in each locus document into an embedded document, such that the individual name is now the key and a number can be the value - again, within the "individuals" key of each locus document. so what was:
{
"_id" : ObjectId("4ddc0f9a10d1b1a91c000003"),
"length" : 330,
"individuals" : [
"R07.02",
"R07.01",
"R14.02",
"R14.01",
"R12.01",
"R06.01",
"R06.02",
"R12.02",
"R01.01",
"R01.02",
"R19.01",
"R19.02"
],
"locus" : "RAILmatic_4.fasta"
}

is now:

{ "_id" : ObjectId("4de56d9010d1b1ee13000007"), 
"SNPs" : 3, 
"locus" : "RAIL_8.fasta", 
"length" : 317, 
"individuals" : {
"R01" : 0, 
"R17" : 0, 
"R14" : 0, 
"R05" : 0, 
"R07" : 0, 
"R11" : 0, 
"R08" : 0 
},
"path" : "fasta/RAIL_8.fasta" 
}
this is a bigger accomplishment than it might look like at first. not only is the "individuals" value now an embedded (searchable and modifiable) document, the keys are no longer alleles but the individual names. this might be tricky to scale up for a broader array of input formats, but for now, it's good. i also updated the screen output (since the individual names are now inside a document, i had to reformat how i count up how many documents each individual appears in...). i also learned/began to understand more about json formatting and database organization, which is good.

yesterday afternoon, i worked on the samInputMDB.py script to count up the reads from a samfile that stick to each locus for each individual (to assign values to the individuals above). i've been able to get a list of the loci out of a samfile, as well as format an input file to get individual names that match the individuals above, and last week i figured out how to count reads that align to each reference locus. the next step is iterating over all individuals within all documents to update their read counts with the correct count. but first, coffee...
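
a minimal sketch of that counting-and-updating loop for one individual, using pysam and pymongo (the bam path, database name, and the assumption that reference names match the "locus" field are all mine, not the real script's):

import pysam
from pymongo import MongoClient

db = MongoClient().lociNGS                             # hypothetical database name
individual = "R01"                                     # one individual at a time
bam = pysam.AlignmentFile("bam/R01.sorted.bam", "rb")  # placeholder path; sorted + indexed

for ref in bam.references:
    n = bam.count(contig=ref)                          # reads aligned to this reference locus
    # assumed mapping: reference names match the "locus" field in db.loci
    db.loci.update_one({"locus": ref},
                       {"$set": {"individuals." + individual: n}})
bam.close()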