Sarah's (G)SoC 2011: slow and steady

i have begun the data format conversion from fasta to IMa2 (a widely-used, coalescent-based program that uses multi-locus data to estimate demographic parameters for a set of populations). this is going to be as big of a beast as i expected; since the program does so much, it's input format is kind of heinous. this might not make sense, but the IMa2 format looks like this (where the things in purple/square brackets are just descriptors i've inserted so you know what everything means);

Description of data
2 [npops]
POP1 POP2 [pop names in order]
((0,1):5,(2,3):4):6 [population tree]
2 [nloci]
locus1Name 2 [nIndPop1] 2 [nIndPop2] 10 [length] H [mutModel] 1.0 [inhScalar] 0.0008 [mutRate] 0.0005 - 0.0009 [mutRateRange]
indPOP1_1 ACGTGACGTA
indPOP1_2 ACGTGACGTA
indPOP2_1 AGGTGACGTA
indPOP2_2 AGGTGACGTA
locus2Name 2 [nIndPop1] 2 [nIndPop2] 12 [length] H [mutModel] 1.0 [inhScalar] 0.0008 [mutRate] 0.0005 - 0.0009 [mutRateRange]
indPOP1_1 CCGACGTGCAGG
indPOP1_2 CCGACGTGCAGG
indPOP2_1 CCGATGTGCAGG
indPOP2_2 CCGACGTGCAGG

my plan of attack is:
(1) user is going to have to provide an input file to set some of the parameters that IMa2 requires (mutation model, inheritance scalar, population tree, etc.). my first task is to input that file and place the correct lines in the correct spot in the output file. yesterday, i created what i think the input file will be and have successfully imported it and put the data into a dictionary (so that i can output the lines based on keys later). i can also track number of loci, another input parameter.
(2) data from other scripts. getPopFromMDB.py creates a list of (locus) files that have at least one individual from a set of populations. from this script, i record number of populations and their names, as well as have the actual files to format.
(3) correctly integrate pieces. sequences need to be renamed and ordered correctly. each locus gets a header line that changes.

slow and steady wins the race on this one.

yesterday, husband/fellow grad student noah reid suggested that instead of "reads" associated with each individual for each locus in the db.loci database, it might be more helpful to display "average coverage". i modified samIndexMDB.py to record this change and i agree that it will be more helpful. (with the test data i have "reads" basically is "average coverage" but that won't be true for users where a full read = one locus. so this is more robust. yay.) i also put the finishing touches on a toNexus.py that takes the list of files generated by getPopsFromMDB.py as input and reformats them into nexus format. this was really easy since biopython already has a module to do it. now it's a function in my scripts.

1 comment:

AnonymousJune 30, 2011 at 4:27 AM
Sarah;
What format were you using for the input parameter file? YAML is a nice markup language that is essentially JSON in a file and is easy to edit:

http://www.yaml.org/

The python parser converts it directly into dictionaries:

http://pypi.python.org/pypi/PyYAML/

Sounds like you're making great progress; let us know if you run into any issues. Thanks,
Brad

Sarah's (G)SoC 2011

Pageviews

Wednesday, June 29, 2011

slow and steady

1 comment: