Sarah's (G)SoC 2011: 2011

Wednesday, August 24, 2011

The End (sort of)

the summer of code is officially over and i just wanted to post this last entry to say a big

THANK YOU

to Google (especially Carol Smith), NESCent, Dr. Karen Cranston, Dr. Hilmar Lapp, Dr. Jeremy Brown and Dr. Brad Chapman. this summer was awesome and i feel very fortunate to have participated. i hope this program continues ad infinitum - it really was a pleasure.

the final resting place for the code to lociNGS is on GitHub (here). if you have any questions about lociNGS or you're considering participating in summer of code in the future and want someone to tell you how great it is, please feel free to contact me!

Friday, August 19, 2011

final friday

unfortunately, it took most of this last week for me to identify and fix the installation weirdness. i think it's all straightened out now, but it wasn't a very satisfying finish for the summer (other than the fact that the installation appears to work now!). the README and github account have been updated to reflect the changes that were made and it isn't that much more difficult to install.

tonight/this weekend i'm going to finish working on the website for lociNGS and finalize the code (delete commented out lines, make sure each file has a heading, etc.). and THEN - well, i don't know what. i hope people use the program and give me feedback on how to improve it.

the tentative lociNGS logo

Thursday, August 18, 2011

blindspot!

sooooo... turns out i've been fooling myself into thinking my package very easily gets and installs all the required pieces to run lociNGS. it just appeared that way because i already had everything that's needed installed on my machines. for the past couple of days i thought it was just a mac 0SX 10.5 problem, but it's actually a problem for everyone.
the problem, as i understand it, is this: numpy needs to be installed before biopython and numpy does not install with setuptools. so i've changed the setup.py script to only look for the pysam, pymongo and simplejson packages. then i'm going to have the user install Distribute, then easy_install numpy, then biopython and that should be everything. it's a little more complicated than just running the setup.py script, but doesn't require going to the internet and downloading things, only typing a couple extra lines into a terminal window. this has been driving me bonkers, so i'm moderately satisfied that i have some way to relatively easily instruct users on how to get it working at all.
these changes will appear on github and in the readme this evening (due to an unforeseen afternoon obligation). *sigh*

Tuesday, August 16, 2011

it's the final countdown...

i'm actually pretty sad this is the last week of summer of code - it has really been fun and productive. yesterday felt rather unproductive, though, as i spent a lot of time creating a github webpage for the program (shird.github.com/lociNGS). i don't know much about HTML (or anything), so it required a lot of tinkering. i also updated the README to include a short "common errors" section, that will be expanded as issues arise.
this morning i updated the code to be a little more efficient and to ensure that both SAM and BAM formatted data can be accepted (although SAM is much slower). i did a complete import/export run through with SAM data (for the first time) and fixed a couple spots of code where there were issues. i also added three display columns to the summary screen: number of total reads gathered per individual, number of reads that align to the reference (per individual), and the percentage of the total reads the used reads comprise. i think this is pretty useful information to display and i was basically already recording it, so why not?
tomorrow morning i'm going to download, install and run the program on a new computer to make sure it goes as planned (using my README as the guide). then i'll hopefully get around to coding in one/several error message(s) for if the user has an issue importing/exporting/displaying.

Friday, August 12, 2011

I LOVE PYTHON!

i've been cleaning code and fixing little bugs. i can't believe i'm still finding them. i have learned so much about python this summer and i love it. the cleaning of the code is very satisfying - i can almost feel my brain transitioning from perl-like to python-like.

i realized today that i've been recording the number of SNPs that each locus has but wasn't displaying that info, so i added that to the locus screen. i also added some color to the output on the main screen. and i finally fixed the line endings like brad suggested.

how the gui looks with the test data:

Wednesday, August 10, 2011

wednesday help files

today's major task was working on a couple of help files for the gui. one displays the readme file that comes with the package (which took me perhaps longer than it should have to display properly). the other describes what the buttons on the summary and locus screens do. i played around with using a floating tooltip attached to the mouse, but my understanding of how it worked was not sophisticated enough to make it anything other than annoying. i also added a folder of test data to the package on GitHub. bring on thursday - it's been a good first half of the week!

Tuesday, August 9, 2011

major update - hooray

today i resolved to make the label-update-as-functions-completed thing work instead of the multiple popup windows that i had as a workaround. and i succeeded! i uploaded the majorly revised run_lociNGS.py to GitHub and am very pleased with the new output (assuming there's not some currently undiscovered bug). i'm spending the afternoon on the README file.

Monday, August 8, 2011

minor updates - working well

i downloaded the package from github this morning and ran it with some new data and discovered the raw data wasn't displaying correctly. so i spent most of today troubleshooting and fixing that. the good news is that it's all working well now - and i've incorporated brad's suggestion that users are informed when things import correctly and after outputting the raw reads, a popup window in the GUI lets them know. i couldn't figure out how to update a label after a function is completed (which was actually brad's suggestion and would be actually much better) - but the popup is a functioning workaround for now.

everything is up to date on github - i need to edit the readme to reflect the much easier installation with setup.py. i'm also going to keep working on a few other suggestions brad made for improving the code. my husband is out of town for the week at the ecology meeting so i'm hoping this last week before the soft "pencils down" date is super productive!

Thursday, August 4, 2011

setup.py, github and segmentation faults

this morning i *think* i got a setup.py file working for downloading necessary components and installing on users' machines. what an amazing invention - the setup.py scripts are brilliant! i then radically reorganized my GitHub account to work with the setup.py script.
for the second half of today, i TRIED to get a popup window to display "Successful Import" when lociNGS imports data (using a tkMessageBox). this led to about a million segmentation faults (i think because the popup isn't connected to the appropriate frame or something. but i could be totally wrong about that because my brain ain't workin' so good right now).
plan for tomorrow is to take a fresh look at the popup issue (or make a work around) and continue to clean up code. and i should upload some test data for interested parties. tgi(almost)f.

Wednesday, August 3, 2011

readme excitement

kind of a ho-hum week so far - working on putting together a comprehensive readme for the program and trying to get the whole package together to force some friends to troubleshoot. i'm going to edit/revise the readme in the morning then post it (and the cleaned up code) to GitHub...

Friday, July 29, 2011

the final stretch

the good news is the gui is more or less fully functional now. the bad news is that i think it's really specific to input format and i'm not sure what to do about that. my intention was to make a robust but simple way of viewing a large multi-locus dataset...the back parts of my brain are mulling over how to remove some of the most restrictive input requirements (like numerical loci) so that more people can use this program.
but back to the good news. import menu, export menu, summary screen, locus screen, output formats, raw data output - all up and running. it's not the prettiest thing you've ever seen but it is fast and pretty straightforward.

the welcome screen with directions

the import menu

the summary screen displayed when 'Display the data' is pressed in the File menu

when one of the 'numLoci' buttons is pressed on the summary screen, a second screen with info about all the loci that a particular individual has is displayed.

when the 'Coverage_This_Ind' or 'Coverage_Total' buttons are pressed, a fasta file is created that contains all the original reads pertaining to a locus, from either a single individual or all individuals that have that locus (respectively). to export loci in the 3 different formats, first a set of the data must be selected. users may select individuals:

or users may select populations:

then users select the output formats:

all loci that are output either contain at least all individuals selected or at least one indivdiual from each population selected.

so that's where i'm at now. this weekend = continue thinking about how to make the program more robust and beginning the process of (fingers crossed) getting this into a format readable/displayable by the Galaxy Project. man this week went quick!

Thursday, July 28, 2011

export menu - done!

it has been a pretty productive start of the week. the export menu is complete - users select a subset of populations or individuals, then the output format(s) they would like, and the program searches for all the loci that contain the given subset, formats those loci and outputs them accordingly. wahoo! the new code is on GitHub.

i realized i'm not very good at reusing code (because i haven't fully grasped the OOP mentality and the finer points of python/tkinter are still rather difficult for me) but i think i'll be able to clean up the code as i continue to learn more, without losing functionality. i'm pleased that everything is more or less working well right now. the two remaining goals for this week/weekend are the get the raw data output associated with a button and to make the categories shown on the summary screen read from the input file instead of hard set by my code. i think it's doable, but we'll see...

Monday, July 25, 2011

finally - scrollbars!

the title says all i really want to say - the two main display screens for the GUI are now equipped with scrollbars! next up, make the "Coverage_This_Ind" button output the raw data that align to a particular locus and work on the export menu's callbacks.

Friday, July 22, 2011

GUI progress

this week i've been making progress with the Tkinter GUI for lociNGS (still not sure about the name...). i'm attaching some screen shots to better illustrate where i'm at, but in words, i have a fully functional input menu. i have a summary screen and a locus specific screen working, but only if they don't contain more rows of data than a screen can fit. i've been working for the last 24 hours or so on scrollbars (the natural solution to this problem), but adding them to the code i already have has been, er, difficult. i'm going to keep mulling over that in the back of my brain and go back to working on the export menu/function.
i'm uploading the GUI script to GitHub even though it's a total mess and it won't work on anyone else's computer because it's dependent on inputting the right data in the right formats. useful, i know.

menu shot and the summary screen. 20 individuals, 4 populations, 2 species, made up locations, and the number of loci that each individual has called for it. (does that last one make sense? with next-gen data, the number of loci obtained for each individual in a dataset is variable. this column shows how many high quality loci were obtained for that individual.)

this is what the loci data screen should look like. i've restricted the output to only 10 loci so that you can read them on the screen.

this is what the loci screen actually looks like right now because it's cramming way too many lines of data onto a single screen and i can't figure out the best way to add scrollbars (which will have to go on the summary screen too).

Monday, July 18, 2011

kind of a rough week

this past week has been kind of rough. i sprained my ankle pretty bad AND i have been lost in python land trying to convert my scripts to intercommunicating modules. i spent many hours trying to figure this particular problem out:

i created a class to make a row of checkboxes (class Checkbar) and a method for that class that reports a vector of 0s and 1s for whether a box is unchecked or checked (respectively, method called state). i have a separate function (called POPMENU) that creates an object of this class and this method has an internal method that scans all objects of the Checkbar class and reports the vectors for them all (called allstates). when i run POPMENU, this works fine. but when i embed it in a GUI widget (a Button), the vector always returns zeroes. i figure it has something/everything to do with passing variables or when the instances are created, but i can't quite make it work. i'm uploading the isolated problem in a script called confused.py (to GitHub) if anyone wants to look at it.

this week i've spent a lot of time reading about lambda functions, classes (and their methods) and functions. on the plus side, i've written the "input" portion of the GUI - in other words, i can create a drop down menu that will load the locus files, SAM/BAM files and demographic data into MongoDB. that was rather satisfying and is located in the menu.py script. two other new scripts, called constructionMDB.py and convertingLociNGS.py are being uploaded too. (constructionMDB.py is used by menu.py for the input of files. convertingLociNGS.py isn't being used just yet, and may have to be rewritten, but was a first pass at getting all the reformatting scripts together.)

Friday, July 8, 2011

that's a little better

i've been trying to organize my brain and formulate the next step for this project. brad suggested i think about packaging some scripts into importable modules, which requires an intelligent plan for how the scripts will be used. i've been reading about GUIs and slowly piecing together how the various scripts i have will interact with each other and the user in the end product. i can't say i've settled on an optimal module design (should i put all the conversion scripts into one module so they can be accessed together? how about the scripts that upload the various data types to mongoDB? those seem more like stand alone pieces, but a user will need most or all of them at once, so...) still processing.
i edited my original GUI plan/figure to reflect the current state of my brain. feels good to transfer such things to paper. i anticipate the rest of today will be devoted to putting the converter scripts together in a module, since i'm pretty sure that makes sense. thoughts will continue to percolate as well.

Wednesday, July 6, 2011

it's also alive!

after a couple of hours cursing Migrate this morning, i found an error in my code and the output from toMigrate.py now runs perfectly. (i was shocked when the root of my frustration was entirely my fault. shocked, i say! just kidding.)

that means that the scripts to database/retrieve data and convert .fasta files to the three desired formats are done. (tentatively done.) i've spent this afternoon reading Programming Python's chapters on Tkinter (their GUI-er) and copying the short scripts. this i'll probably do tomorrow as well, on top of making several new test datasets to error check my scripts. i also need to investigate Galaxy.

Tuesday, July 5, 2011

it's alive!

the first IMa2 input file my toIMa2.py script produced successfully loaded and ran(in IMa2!) this morning. wahoo! i followed that up by working on toMigrate.py, which I've written to take an IMa2 file as input. i know this means that users will have to generate an IMa2 file, even if they're only interested in a Migrate file, but these files are small and fast. also, the two formats are very similar and the IMa2 files have all the components necessary for Migrate format, so it was easy to code. i'm currently working on getting the toMigrate.py output files successfully running with Migrate, which is proving tricky. off to GitHub then more fiddling with Migrate. good start to the week!

Friday, July 1, 2011

IMa2 format - done?

i can't believe how relatively easy writing this formatting script turned out to be. I LOVE PYTHON! i still need to run the final files through IMa2 to be certain they're good to go, but even if there is something that needs to be fixed, the bulk of the script (toIMa2.py) is done! and already loaded on GitHub.

the output looks something like this (with sequences truncated and only two loci of 149 shown):

Wednesday, June 29, 2011

slow and steady

i have begun the data format conversion from fasta to IMa2 (a widely-used, coalescent-based program that uses multi-locus data to estimate demographic parameters for a set of populations). this is going to be as big of a beast as i expected; since the program does so much, it's input format is kind of heinous. this might not make sense, but the IMa2 format looks like this (where the things in purple/square brackets are just descriptors i've inserted so you know what everything means);

Description of data
2 [npops]
POP1 POP2 [pop names in order]
((0,1):5,(2,3):4):6 [population tree]
2 [nloci]
locus1Name 2 [nIndPop1] 2 [nIndPop2] 10 [length] H [mutModel] 1.0 [inhScalar] 0.0008 [mutRate] 0.0005 - 0.0009 [mutRateRange]
indPOP1_1 ACGTGACGTA
indPOP1_2 ACGTGACGTA
indPOP2_1 AGGTGACGTA
indPOP2_2 AGGTGACGTA
locus2Name 2 [nIndPop1] 2 [nIndPop2] 12 [length] H [mutModel] 1.0 [inhScalar] 0.0008 [mutRate] 0.0005 - 0.0009 [mutRateRange]
indPOP1_1 CCGACGTGCAGG
indPOP1_2 CCGACGTGCAGG
indPOP2_1 CCGATGTGCAGG
indPOP2_2 CCGACGTGCAGG

my plan of attack is:
(1) user is going to have to provide an input file to set some of the parameters that IMa2 requires (mutation model, inheritance scalar, population tree, etc.). my first task is to input that file and place the correct lines in the correct spot in the output file. yesterday, i created what i think the input file will be and have successfully imported it and put the data into a dictionary (so that i can output the lines based on keys later). i can also track number of loci, another input parameter.
(2) data from other scripts. getPopFromMDB.py creates a list of (locus) files that have at least one individual from a set of populations. from this script, i record number of populations and their names, as well as have the actual files to format.
(3) correctly integrate pieces. sequences need to be renamed and ordered correctly. each locus gets a header line that changes.

slow and steady wins the race on this one.

yesterday, husband/fellow grad student noah reid suggested that instead of "reads" associated with each individual for each locus in the db.loci database, it might be more helpful to display "average coverage". i modified samIndexMDB.py to record this change and i agree that it will be more helpful. (with the test data i have "reads" basically is "average coverage" but that won't be true for users where a full read = one locus. so this is more robust. yay.) i also put the finishing touches on a toNexus.py that takes the list of files generated by getPopsFromMDB.py as input and reformats them into nexus format. this was really easy since biopython already has a module to do it. now it's a function in my scripts.

Monday, June 27, 2011

hooray - populationsInFasta

after a very painful afternoon of not understanding why seemingly flawless code was misbehaving, i have been able to add a "populationsInFasta" collection to db.loci. i added code to textDataIndexMDB.py that basically scans all documents for a match to an array of the individuals belonging to a population and if found, adds the population to the populationsInFasta array. i'm going to push the code to GitHub and hopefully do some robustness testing today/tomorrow with other datasets. then with the rest of the week i begin format conversions. gosh - i just LOVE when troublesome code is resolved!

the db.loci collections look like this:

{ "SNPs" : 7,
"_id" : ObjectId("4e08edef10d1b18af7000017"),
"indInFasta" : [
"J11",
"J02",
"J05",
"J09",
"J08",
"J07",
"J18",
"J19",
"J17",
"J14",
"J15",
"J12",
"J13",
"J10",
"J06"],
"individuals" : {
"J01" : 2,
"J02" : 8,
"J03" : 3,
"J04" : 2,
"J05" : 6,
"J06" : 5,
"J07" : 3,
"J08" : 9,
"J09" : 7,
"J10" : 10,
"J11" : 9,
"J12" : 5,
"J13" : 7,
"J14" : 5,
"J15" : 11,
"J16" : 1,
"J17" : 10,
"J18" : 8,
"J19" : 14,
"J20" : 2 },
"length" : 296,
"locusFasta" : "JUNCOmatic_121_aln.fasta",
"locusNumber" : "121",
"path" : "/Users/shird/Documents/Dropbox/lociNGS1/juncoLoci/JUNCOmatic_121_aln.fasta",
"populationsInFasta" : [
"POP3",
"POP2",
"POP1",
"POP4" ] }

so much useful information! i can now query for loci that contain at least one individual from a given set of populations.

Friday, June 24, 2011

finally back to it!

so excited to be back at work on my SoC project. this morning i've decided that i'm going to add a "populations" collection to db.loci - this way, i can simply query the populations i want (instead of query the individuals for membership to the given populations). that's all for now - just letting everyone know i'm happily back on the grid!

(oh! and the evolution conference was great. exhausting, as per usual, but great. i got to meet fellow SoC student Laurel Yohe and NESCent guru Karen Cranston. and my talk went ok - which was the minimum i was hoping for. there was some interest in PRGmatic, which is pretty nice/intimidating. hope to see you all next year in ottawa!)

Wednesday, June 15, 2011

hold up

in order to do IMa2 or migrate formatting, i have to be able to query populations, not just a list of individuals. duh! i'm switching gears and focusing on writing to the terminal all the loci that contain any individuals from a given subset of populations, regardless of individual membership. from there i can create the input list of files for the converters and work on those scripts next (with proper input!). i'm a little surprised i'm just putting this together now but excited to have a plan of action. i knew i couldn't really be ahead of schedule!

Tuesday, June 14, 2011

mostly thinking out loud

this week has been kind of weird because i'm a little preoccupied with the talk i'm giving at the evolution meeting AND the fact that i'm a little ahead of schedule and needing to begin on the next big task: the file conversions. (i'm not actually sure i'm ahead of schedule - outputting the data to the terminal screen has been straightforward, so i'm moving on to file format conversions and reserving the right to go back and re-examine outputting individual/demographic data.)

since one of the main motivations for the program is to be able to generate files given a subset of data (i.e., a specific list of individuals), i've been thinking about how to accomplish this. i've decided that it'll be a set of scripts, the first of which generates a list of files that contain the input list of individuals and that script will call the relevant "converter scripts" to do the leg work in converting file formats on the list generated by the first script. phew.

i've written a "new" script, getFastaFromMDB.py that generates the file list from a list of individuals (given at the command line). it's basically the combined pieces of scripts i've been messing with lately. since Biopython has a fasta to nexus converter, it's really easy to do that conversion (and is already done). the next step for me is to work on the IMa2 converter script, which is probably not trivial.

on a side note, john mccormack, a post-doc in the brumfield lab, asked me if the program would be able to find loci that contain at least some number of individuals from each population but where the identity of the individuals is unknown/unimportant (i.e., *BEAST requires at least one individual per population for each locus). that is a really good idea and seems like something i can do and the program should do. the back of my brain is working on an algorithm for making this happen. in the mean time, i'm going to forge ahead with the IMa2 converter.

Thursday, June 9, 2011

outputting data to terminal

regarding brad's advice that i should try to get desired output onto the terminal screen: i have made some progress and indeed feel better about the GUI-filled future. (cue homer simpson: "mmmm. gooey-filled future".)
outputting the demographic data in the db.demographic collection was easy. this data is strictly {key:value} pairs, so nothing fancy. with a new script, displayMDBdata.py, i can get out all the data that i put in and in the original format (which sounds slightly circular but is good). the beautiful script:

the beautiful output:

Setup cursor: <pymongo.cursor.Cursor object at 0x10155c250>

Individual: J01 , Location: NoPlace, TX , Total Number of Reads: 6281 , Number of Loci: 148

Individual: J02 , Location: NoPlace, TX , Total Number of Reads: 5570 , Number of Loci: 126

...etc...

outputting the locus information has revealed an intellectual hurdle that i failed to jump over (a la america's funniest home videos), regarding cursors. i tried the formula that worked for outputting the demographic data: set up a cursor and print the value for each key. but since the list of individuals within each fasta locus is stored as an array and the counts associated with each individual within a locus is stored as an embedded document, the not as beautiful output looks like this:

Locus: JUNCOmatic_100_aln.fasta , Individuals Present: [u'J09', u'J08', u'J10', u'J18', u'J19', u'J01', u'J17', u'J03', u'J06', u'J07', u'J11'] , Individuals with Read Counts: {u'J15': 1, u'J12': 2, u'J20': 2, u'J14': 5, u'J13': 2, u'J09': 10, u'J08': 4, u'J16': 1, u'J10': 8, u'J18': 9, u'J19': 4, u'J01': 4, u'J17': 3, u'J03': 5, u'J11': 3, u'J05': 1, u'J04': 2, u'J07': 4, u'J06': 8}

i don't know what i was expecting but that isn't it. the u'data' format confuses me (unicode?) and i haven't been able to wrap my brain around json formatted data and translating with python (even with a module called simplejson!)

so near as i can tell, the "Individuals Present" array and "Individuals with Read Counts" documents are not "cursored" - whatever dictionary-creating magic happens with saving the results of the find() as a cursor doesn't apply to embedded data. something tells me this should be an easy fix, it's just a matter of finding it. so that's up next...

Tuesday, June 7, 2011

brainstorm: counts

this morning i had a couple of brainstorms involving storing the total number of (high quality) reads an individual has (i.e., present in the sam/bam file), storing the number of loci an individual has (i.e. how many locus files each individual is present in) and storing the bam directory path so that the user won't have to enter it more than once. so i've updated locusIndexMDB.py, samIndexMDB.py and textDataIndexMDB.py, and i'll commit them later today. (i also incorporated suggestions jeremy made and am in the process of adding more comments throughout the scripts.)

db.demographic has as many documents as there are individuals in the dataset and they now all look something like this:

{ "_id" : ObjectId("4dee6475318a120c74000000"),
"Individual" : "J01",
"Latitude" : "45.678",
"Location" : "NoPlace, TX",
"Longitude" : "-109.876",
"Population" : "POP1",
"Species" : "Junco hyemalis",
"numLoci" : 148,
"totalReads" : "6281" }

db.loci has as many documents as there are loci (plus one for the bamPath document, which contains just the path to the bam directory). they now all look something like this:

{ "_id" : ObjectId("4dee6482318a120c78000295"),

"SNPs" : 10,

"indInFasta" : [ "J01",

"J17",

"J19" ],

"individuals" : { "J01" : 3,

"J02" : 2,

"J03" : 1,

"J04" : 2,

"J05" : 2,

"J06" : 1,

"J11" : 1,

"J12" : 2,

"J13" : 2,

"J14" : 2,

"J15" : 1,

"J17" : 3,

"J19" : 6 },

"length" : 266,

"locusFasta" : "JUNCOmatic_719_aln.fasta",

"locusNumber" : "719",

"path" : "/Users/shird/Documents/Dropbox/lociNGS1/juncoLoci/JUNCOmatic_719_aln.fasta" }

i'm quite satisfied with the state of the database right now, but that might be because i haven't done much with it. i've gotten reasonably good at adding/updating things, but haven't really retrieved/displayed much. i am a little scared about how well i'll be able to incorporate database info into the GUI and what's going to happen when i apply these scripts to datasets i haven't personally created. that's next, i suppose. ONWARD!

Monday, June 6, 2011

demographic data done! (for now, anyway)

this morning i was able to very quickly import tab-delimited data into mongoDB (i'm learning things!). i decided to put the demographic data into it's own collection, so my database now has two collections: db.loci and db.demographic. an input of (tab-delimited) demographic data like this:

Individual Species Location Latitude Longitude Population

J01 Junco hyemalis NoPlace, TX 45.678 -109.876 POP1

J02 Junco hyemalis NoPlace, TX 45.678 -109.876 POP1

J11 Junco phaeonotus NoPlace, MS 49.800 -110.424 POP2

J12 Junco phaeonotus NoPlace, MS 49.800 -110.424 POP2

looks like this in the db.demographic collection:

{ "_id" : ObjectId("4ded0248318a12036f000000"), "Longitude" : "-109.876", "Individual" : "J01", "Location" : "NoPlace, TX", "Latitude" : "45.678", "Species" : "Junco hyemalis", "Population" : "POP1" }

{ "_id" : ObjectId("4ded0248318a12036f000001"), "Longitude" : "-109.876", "Individual" : "J02", "Location" : "NoPlace, TX", "Latitude" : "45.678", "Species" : "Junco hyemalis", "Population" : "POP1" }

{ "_id" : ObjectId("4ded0248318a12036f00000a"), "Longitude" : "-110.424", "Individual" : "J11", "Location" : "NoPlace, MS", "Latitude" : "49.800", "Species" : "Junco phaeonotus", "Population" : "POP2" }

{ "_id" : ObjectId("4ded0248318a12036f00000b"), "Longitude" : "-110.424", "Individual" : "J12", "Location" : "NoPlace, MS", "Latitude" : "49.800", "Species" : "Junco phaeonotus", "Population" : "POP2" }

hooray! (and the code is only 30 lines!)

now i'm trying to figure out the most intelligent way to proceed. my project plan has this week allotted for the demographic input task (textDataIndexMDB.py), quality data input and retrieving SAM/BAM sequence files. i'm rethinking whether quality data is worth putting into the database (since all reads used to call a locus should already be high quality). the retrieval of SAM/BAM data should just require saving the path to their directory in the loci collection (much like the retrieval of sequence data from locus fasta files). i'm worried i haven't allotted enough time for the data formatting and GUI construction - the latter especially so since i've never done it before! i suppose i'll try to get the official tasks done by tomorrow, then update the project plan to reflect an unanticipated half week bonus. that is, of course, assuming all goes according to plan..

Thursday, June 2, 2011

as chris farley would say...

holy shnikes! i've made some very satisfying progress since yesterday morning. i've reworked locusIndexMDB.py and samIndexMDB.py and now the database (for two entries) looks like this:

{ "_id" : ObjectId("4de7b8f210d1b1f7de000002"),
"locusNumber" : "102",
"indInFasta" : [
"R01",
"R03",
"R02"
],
"SNPs" : 3,
"locusFasta" : "r1_102_aln.fasta",
"length" : 309,
"individuals" : {
"R01" : 8,
"R03" : 5,
"R02" : 5
},
"path" : "/Users/shird/Documents/alignedLoci/r1_102_aln.fasta" }

{ "_id" : ObjectId("4de7b8f210d1b1f7de000003"),
"locusNumber" : "103",
"indInFasta" : [
"R01",
"R03"
],
"SNPs" : 4,
"locusFasta" : "r1_103_aln.fasta",
"length" : 316,
"individuals" : {
"R01" : 7,
"R03" : 3,
"R02" : 1
},
"path" : "/Users/shird/Documents/alignedLoci/r1_103_aln.fasta" }

breakdown of categories:

"_id" is a unique tag mongoDB gives every entry.
"locusNumber" is the number that identifies a locus (the PRGmatic pipeline sequentially numbers loci it calls; this number can then be associated with SAM files).
"indInFasta" is an array of individuals that were called for each particular locus.
"SNPs" refers to how many variable sites are contained within a locus.
"locusFasta" is the official file name.
"length" is the length of the locus.
"individuals" is an embedded document where each individual is given a value for how many reads from that individual aligned to each particular locus. it looks like a repeat of the "indInFasta" category, but it's not. a minimum coverage needs to be met before a locus can be called, but that doesn't mean 0 reads from an individual aligned to a locus. the second locus in the database above has two individuals in the "indInFasta" category and three in the "individuals" category. this is because "R02" only has one read that aligned to the reference locus and that is below the threshold for calling a locus for a individual.
"path" refers to the path where the locus file came from.

locusIndexMDB.py creates the database and inputs all information except the values within the "individuals" category. each of these categories is important for the final product and i'm particularly excited about the "individuals" document, because how much coverage an individual has for a locus is super important information and i haven't been able to find a way to get that information easily.
i'll be pushing the scripts to github this afternoon and then working on retrieving any subset of the reads referenced above.

Wednesday, June 1, 2011

welcome june

yesterday was a pretty good day. i worked on the locusIndexMDB.py script in the morning and was able to:
1. determine and fix the reason the script was throwing an error on my desktop (reading in a directory will include hidden files!)
2. turn what was an array of individuals in each locus document into an embedded document such that a number can now be the value and the individual name is the key - again, within the individual key within a locus document. so what was:
{
"_id" : ObjectId("4ddc0f9a10d1b1a91c000003"),
"length" : 330,
"individuals" : [
"R07.02",
"R07.01",
"R14.02",
"R14.01",
"R12.01",
"R06.01",
"R06.02",
"R12.02",
"R01.01",
"R01.02",
"R19.01",
"R19.02"
],
"locus" : "RAILmatic_4.fasta"
}
is now:
{
"_id" : ObjectId("4de56d9010d1b1ee13000007"),
"SNPs" : 3,
"locus" : "RAIL_8.fasta",
"length" : 317,
"individuals" : {
"R01" : 0,
"R17" : 0,
"R14" : 0,
"R05" : 0,
"R07" : 0,
"R11" : 0,
"R08" : 0
}
"path" : "fasta/RAIL_8.fasta"
}

this is a bigger accomplishment than it might look like, at first. not only is the "individuals" value now an embedded (searchable and modifiable) document, the names are no longer alleles, but the individual names. this might be tricky to scale up for a broader array of input formats, but for now, it's good. i also updated the screen output (since the individual names are now a document, i had to reformat how to count up how many documents each individual is in...) i also learned/began to understand more about json formatting and database organization, which is good.

yesterday afternoon, i worked on the samInputMDB.py script to count up the reads from a samfile that stick to each locus for each individual (to assign values to the individuals above). i've been able to get a list of the loci out of a samfile as well as format an input file to get individual names that match the individuals above and last week i was able to figure out how to count reads that align to each reference locus. next step is iterating over all individuals within all documents to update their read counts with the correct count. but first, coffee...

Monday, May 30, 2011

bam read counts

i've been thinking about organization and the database a lot and i think i've come up with a working plan for not only getting the data i want from the sam/bam files, but putting it in the right place for future access. i've also been reading/researching how to query and update mongoDB and utilize pysam and pymongo, so that implementation of this plan is as smooth as possible.

PLAN OF ATTACK:

open BAMFILE (one per individual)

query mongoDB for all documents that contain BAMFILE in "individuals" value array
foreach mongoDB document returned

query BAMFILE for locus (samfile.fetch)
count reads that align to locus
push count to Individual in locus in mongoDB

i.e., a document that looks like this:
{
"locus" : "loc010",
"individuals" : [
"R01"
"R02"
"R03"
],
"length" : 332
}

will be:
{
"locus" : "loc010",
"individuals" : [
"R01" : 3 ,
"R02" : 15 ,
"R03" : 6
],
"length" : 332
}

thus, each element of the individuals array is now a key/value pair. (right?) now i just need to figure out how to do each step. it's going to be a fun week!

Friday, May 27, 2011

first friday

i feel like i spent a lot of thursday spinning my wheels due to various barriers to coding. in the morning, i was working on the locusIndexMDB.py script and it stopped working on my office desktop. i ran it once, it worked, i changed the input file folder and it started throwing an error from within the seqlite_mod.py script ("map() requires two arguments", or something like that.) i spent a fruitless... half hour? hour? ... trying to figure out how i had broken the script that one minute prior had been working but then gave up and went home to my laptop (which ran the exact same script beautifully).

after that, i was beginning to figure out how to "re-access" fasta files once a path is saved in mongoDB, which led me to trying to parse json formatted data, which led me to trying to figure out how to get the module simplejson, which lea to setuptools, which led to frustration and a fundamental lack of understanding.

that took care of thursday and when i went in to my office this morning, the locusIndexMDB.py script was throwing the same error and i'm not real sure what to do about that. i don't understand what's happening that could cause an error originating in the seqlite_mod.py script to show up on one machine and not the other. the office machine had some problems with mongoDB because i accidentally installed it more than once and the poor computer was getting its wires crossed, so it may have something to do with that, but i don't know how.

again, i came home to my laptop and i have made the most satisfying progress in 48 hours this afternoon. i finally successfully installed setuptools and was thus able to get simplejson. i also installed pysam, for sam/bam files. so i've written two very small scripts that [1] can create sequence objects from a fasta file whose path is in mongoDB and [2] print data from a subset of reads in an indexed bam file that align to a given locus. although they are clumsy and there are probably better ways to achieve both these goals, i feel good about having these rather important initial steps taken.

OH. and technically the 48 hours of frustration began on wednesday afternoon when i tried to come up with a protocol for committing code to github. what a disaster! i managed to brute force my way through it and gain some understanding of git/github but it wasn't pretty. my account looks very silly, with files that are tagged incorrectly and lead to nothing and i haven't figured out how to correct them yet. i think i'll spend some time this afternoon trying to commit my two new baby scripts and hopefully i can do it quicker and more correctly than wednesday!

Wednesday, May 25, 2011

day 2: SNP#/locus now in database!

today has been another pretty good day! this morning, brad had left three comments on github and i was able to use and incorporate all three suggestions relatively easily. these included getting the path to the folder of loci from the command-line, stripping newlines from a list and getting the length of a sequence object by accessing the first element of the list (something i banged my head against for about an hour yesterday).

after that, i switched my efforts to including a "SNP" category in the database. i found a python script online that tallies up variable sites from an aligned fasta file (seqlite.py, written by steve haddock); i did some modifications to the script and import it as a function into my inputLocusMDB.py script. shockingly enough, it worked! so now each locus has its own document in a mongoDB collection and each document contains the following fields: _id, locus, length, SNPs, path, individuals[]. script output (to the screen) looks like this:

-----------------------------------------------------------------

Got this folder: fasta

locus = RAILmatic_1_aln ; number alleles = 36 ; length = 301 ; path = fasta/RAILmatic_1_aln.fasta ; SNPs = 7

locus = RAILmatic_2_aln ; number alleles = 34 ; length = 287 ; path = fasta/RAILmatic_2_aln.fasta ; SNPs = 10

locus = RAILmatic_3_aln ; number alleles = 34 ; length = 322 ; path = fasta/RAILmatic_3_aln.fasta ; SNPs = 12

locus = RAILmatic_4_aln ; number alleles = 24 ; length = 330 ; path = fasta/RAILmatic_4_aln.fasta ; SNPs = 3

locus = RAILmatic_5_aln ; number alleles = 40 ; length = 276 ; path = fasta/RAILmatic_5_aln.fasta ; SNPs = 5

locus = RAILmatic_6_aln ; number alleles = 40 ; length = 323 ; path = fasta/RAILmatic_6_aln.fasta ; SNPs = 11

locus = RAILmatic_7_aln ; number alleles = 30 ; length = 365 ; path = fasta/RAILmatic_7_aln.fasta ; SNPs = 12

locus = RAILmatic_8_aln ; number alleles = 14 ; length = 317 ; path = fasta/RAILmatic_8_aln.fasta ; SNPs = 2

total of 8 loci

...reading the names file into a list...