Dealing with huge sequence files
Say you have the human genome sitting on your computer and you need to access a particular region, like chromosome 1, base pairs 8,100,201 through 8,100,438. What I usually do is make a file for each chromosome, stripping it of all non-sequence characters (spaces, carriage returns, etc.), so that base N sits at byte offset N - 1 in the file. Then to grab the desired bit I do the following in Python:
chrom1_f.seek(8100200)  # 0-based offset of base 8,100,201
seq = chrom1_f.read(8100438 - 8100201 + 1)  # 238 bases, both ends inclusive
"chrom1_f" is a file handle to the chromosome file; I would typically have a bunch of them open if I am working with a whole genome. And of course, random access can be used in any language, not just Python.
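For anyone who wants the whole workflow in one place, here is a rough sketch of both steps: stripping a chromosome file down to bare sequence, then pulling out a region by seeking. It assumes the input is a single-record FASTA file with plain ASCII bases, which the post does not actually specify, and the function names flatten_fasta and fetch_region are made up for illustration.

```python
def flatten_fasta(fasta_path, flat_path):
    """Copy only the sequence characters of a single-record FASTA file.

    Assumption: header lines start with '>' and everything else is sequence.
    """
    with open(fasta_path) as src, open(flat_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                continue  # skip the header line
            # drop spaces, tabs, carriage returns and newlines
            dst.write("".join(line.split()))


def fetch_region(flat_file, start, end):
    """Return bases start..end (1-based, inclusive) from an open flat file.

    flat_file should be opened in binary mode ("rb") so that seek()
    offsets are plain byte positions.
    """
    flat_file.seek(start - 1)                  # base N lives at offset N - 1
    return flat_file.read(end - start + 1).decode("ascii")
```

With a flat file opened as chrom1_f = open("chr1.txt", "rb") (the file name is hypothetical), the region above would be fetch_region(chrom1_f, 8100201, 8100438). Opening in binary mode is a deliberate choice here: it keeps the offset arithmetic exact, since one base is always one byte.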