coding has been going relatively well this morning/afternoon (knock on wood). starting with a folder of multi-fasta locus files, i've been able to read each file into a SeqIO.index and store the name of the locus and an array of the individuals represented in it in a MongoDB collection. i've also been able to output the length of each locus and how many of the loci each individual has. for my first python script ever, i don't feel too bad about it. here's the screen output when i run the script on a folder with 5 loci and ≤ 20 individuals (R01..R20, up to 40 alleles since these guys are diploid):
-----------------------------------------------------------------
locus = RAILmatic_1 ; number alleles = 34 length = 301
locus = RAILmatic_2 ; number alleles = 32 length = 287
locus = RAILmatic_3 ; number alleles = 30 length = 322
locus = RAILmatic_4 ; number alleles = 12 length = 330
locus = RAILmatic_5 ; number alleles = 40 length = 274
total of 5 loci
...reading the names file into a list...
name: R01 found in: 5 loci
name: R02 found in: 3 loci
name: R03 found in: 3 loci
name: R04 found in: 4 loci
name: R05 found in: 4 loci
name: R06 found in: 5 loci
name: R07 found in: 5 loci
name: R08 found in: 4 loci
name: R09 found in: 2 loci
name: R10 found in: 2 loci
name: R11 found in: 3 loci
name: R12 found in: 5 loci
name: R13 found in: 4 loci
name: R14 found in: 5 loci
name: R15 found in: 4 loci
name: R16 found in: 2 loci
name: R17 found in: 4 loci
name: R18 found in: 4 loci
name: R19 found in: 4 loci
name: R20 found in: 2 loci
-----------------------------------------------------------------
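a minimal sketch of how the per-individual tallies above could be computed. it assumes each locus is already available as a dict with an "individuals" array of allele IDs like "R07.01", and that the part before the dot is the individual's name. the function name and the two example loci are made up for illustration; this is not the actual script.

```python
from collections import Counter


def loci_per_individual(locus_docs):
    """Count, for each individual, the number of loci it appears in."""
    counts = Counter()
    for doc in locus_docs:
        # collapse allele IDs ("R07.01", "R07.02") to one individual ("R07")
        # so a diploid individual is counted once per locus, not twice
        individuals = {allele.split(".")[0] for allele in doc["individuals"]}
        counts.update(individuals)
    return counts


# hypothetical example data, not the real loci
docs = [
    {"locus": "locus_a.fasta", "individuals": ["R01.01", "R01.02", "R02.01"]},
    {"locus": "locus_b.fasta", "individuals": ["R01.01", "R01.02"]},
]

counts = loci_per_individual(docs)
for name in sorted(counts):
    print("name: {} found in: {} loci".format(name, counts[name]))
```

using a set per locus is what keeps the count at "loci per individual" rather than "alleles per individual".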
the MongoDB collection has a document like this for each locus:
{
"_id" : ObjectId("4ddc0f9a10d1b1a91c000003"),
"length" : 330,
"individuals" : [
"R07.02",
"R07.01",
"R14.02",
"R14.01",
"R12.01",
"R06.01",
"R06.02",
"R12.02",
"R01.01",
"R01.02",
"R19.01",
"R19.02"
],
"locus" : "RAILmatic_4.fasta"
}
-----------------------------------------------------------------
(the .01 and .02 refer to the 2 alleles each individual has)
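a rough sketch of how a document with that shape could be built from one multi-fasta file. to keep it dependency-free it swaps in a tiny hand-rolled FASTA parser instead of Biopython's SeqIO.index, and it skips the MongoDB insert. the function name, the example file name and the choice of "length" as the longest sequence in the file are all assumptions, not necessarily what the script does.

```python
def locus_document(filename, fasta_text):
    """Build a dict shaped like the locus document above from multi-FASTA text."""
    records = {}       # allele ID -> sequence, in file order
    current_id = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # header line: the ID is the first word after ">"
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            # sequence line: append to the record currently being read
            records[current_id] += line
    # assumption: "length" is the longest sequence in the file
    length = max((len(seq) for seq in records.values()), default=0)
    return {"locus": filename, "length": length, "individuals": list(records)}


# hypothetical two-allele locus
fasta = ">R01.01\nACGT\nAC\n>R01.02\nACGTAC\n"
doc = locus_document("example_locus.fasta", fasta)
print(doc)
# with pymongo, collection.insert_one(doc) would store the document (not run here)
```

MongoDB adds the "_id" field itself on insert, so the sketch leaves it out.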
next tasks = commit code to GitHub (not that it's particularly special, but want to get into the habit); work on updating documents in MongoDB; import SAM files and save metadata in appropriate MongoDB document
Sarah -- great stuff. Really happy you are getting into this so quickly. I wrote a few notes with practical things on your GitHub commit:
Commit notes