IO module for sequence operations
$Id: SeqIO.html,v 1.1.1.1 2003/03/03 15:46:33 wresch Exp $
|
Imported modules
|
|
from Dictionaries import *
import Patterns
import Seq
from Utility import inFile
import cPickle
from copy import deepcopy
import random
import re
import string
import sys
import traceback
import types
|
|
Functions
|
|
|
|
|
|
BegEnd
|
BegEnd ( sequenceObj, v=debugLevel )
Determines the beginning and end positions of
a sequence in an aligned fasta file, relative to the
alignment, assuming the beginning is padded with gap
characters.
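A minimal sketch of the idea on a plain string. The name beg_end, the gap character set, and the 0-based, end-exclusive indexing are assumptions of this example; the module's own conventions are not documented here.

```python
# Assumption: gap characters are '-', '.' and '~', the set that
# Patterns.singleGapsReo is documented to match elsewhere on this page.
GAP_CHARS = "-.~"

def beg_end(aligned):
    """Return (beg, end) of the non-gap region of an aligned string.

    beg is the 0-based index of the first non-gap character and end is
    one past the last non-gap character. Returns None if the string
    contains only gap characters.
    """
    stripped_left = aligned.lstrip(GAP_CHARS)
    if not stripped_left:
        return None
    beg = len(aligned) - len(stripped_left)
    end = len(aligned.rstrip(GAP_CHARS))
    return beg, end
```

Internal gaps do not affect the result; only leading and trailing padding is skipped.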
|
|
|
block
|
block ( str, blocksize )
Returns a string containing space-delimited substrings
of str of length blocksize
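The behaviour described above can be sketched as follows. block_sketch is a hypothetical stand-in, and keeping a shorter final substring is an assumption.

```python
def block_sketch(s, blocksize):
    """Split s into substrings of length blocksize joined by spaces."""
    pieces = [s[i:i + blocksize] for i in range(0, len(s), blocksize)]
    return " ".join(pieces)
```

For example, block_sketch("ACGTACGTAC", 4) gives "ACGT ACGT AC".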
|
|
|
breakLines
|
breakLines (
s,
n,
v=debugLevel,
)
Introduces line breaks every |n| characters
in the string |s| and returns the result.
[v] verbosity
|
|
|
codonShuffle
|
codonShuffle (
seq,
name='',
dict=backTransDictNoAmb,
)
Randomizes the codon usage of a nucleotide sequence while
maintaining the amino acid sequence.
Input is either a sequence object or a string; return type
is determined by the input type.
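A sketch of codon shuffling on a plain string, assuming a small illustrative codon table. CODON_TO_AA is a hypothetical subset of the standard genetic code, not the module's backTransDictNoAmb, and codon_shuffle_sketch is not the module's implementation.

```python
import random

# Hypothetical subset of the standard genetic code, for illustration only.
CODON_TO_AA = {
    "ATG": "M",
    "AAA": "K", "AAG": "K",
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
}

# Invert the table: amino acid -> list of synonymous codons.
AA_TO_CODONS = {}
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS.setdefault(aa, []).append(codon)

def translate(dna):
    """Translate complete codons; trailing bases are ignored."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def codon_shuffle_sketch(dna, rng=random):
    """Replace each codon with a randomly chosen synonymous codon."""
    codons = [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]
    return "".join(rng.choice(AA_TO_CODONS[CODON_TO_AA[c]]) for c in codons)
```

Passing a seeded random.Random instance as rng makes the shuffle reproducible; the translation of the output equals the translation of the input.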
|
|
|
consFromList
|
consFromList (
seqList,
countGaps=0,
plu=2,
)
Returns a sequence object representing the consensus of seqList
|
|
|
consensus
|
consensus (
dict,
countGaps=0,
plu=2,
v=debugLevel,
)
Given a dictionary from tabulate representing character frequencies
at a single position, returns the most common char at
that position, subject to the rules below.
countGaps { 0 | 1 }
plu: plurality for calling a consensus character
Special cases:
1) The most abundant character at a position is a gap
if countGaps=0, uses the next most common character
or x if all chars are gaps
if countGaps=1, puts a gap character (-) at this position
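The gap rule above can be sketched as follows. consensus_char is a hypothetical name, alphabetical tie-breaking is an assumption, and the plu threshold is left out because its exact semantics are not described here.

```python
def consensus_char(counts, count_gaps=0, gap="-", fallback="x"):
    """Pick a consensus character from a {char: count} dictionary."""
    # Rank by descending count; ties are broken alphabetically so the
    # result is deterministic (an assumption of this sketch).
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    if not ranked:
        return fallback
    top = ranked[0]
    if top != gap:
        return top
    if count_gaps:
        # countGaps=1: a gap that wins the vote is reported as a gap.
        return gap
    # countGaps=0: fall through to the next most common character,
    # or to the fallback if the column holds nothing but gaps.
    non_gap = [ch for ch in ranked if ch != gap]
    return non_gap[0] if non_gap else fallback
```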
|
|
|
describe
|
describe ()
SeqIO.py
Module containing functions for reading and writing sequence information
in various formats, translation of nucleotide sequences.
Lots of useful functions in here!
|
|
|
diff
|
diff (
seq,
templateseq,
simchar=' ',
)
Compares seq and templateseq (can be Seq objects or strings) and returns
an object of the same type as seq in which characters in seq that are identical
at that position to templateseq are replaced with simchar. The returned object has the
length of the shorter of seq and templateseq.
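For plain strings the comparison can be sketched as below. diff_sketch is a hypothetical name and handles strings only, not Seq objects.

```python
def diff_sketch(seq, templateseq, simchar=" "):
    """Mask positions where seq matches templateseq with simchar."""
    # zip() stops at the shorter input, which gives the documented length.
    return "".join(
        simchar if a == b else a
        for a, b in zip(seq, templateseq)
    )
```

For example, diff_sketch("ACGTAC", "ACCTA") gives "  G  ": only the mismatching G survives, and the result has the length of the shorter input.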
|
|
|
encode
|
encode (
seq,
encodingDict,
unknown,
v=debugLevel,
)
Returns a list in which each element represents
an encoding of a position of the given sequence string
or sequence object.
The encoding is directed by the given dictionary, which should have the form
dict = { aa1: encoding1, aa2: encoding2,... }.
If dict = aa, the neural net aa encoding is chosen.
If dict = nucl, the standard nucleotide encoding
without ambiguities is used.
If dict = anucl, the standard nucleotide encoding
with ambiguities is used.
Any symbols not found in the dictionary are
encoded as specified by the argument unknown.
Updated for SeqIO by NH 020103.
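A sketch of dictionary-driven encoding with a user-supplied dictionary. NUCL_CODES is a hypothetical mapping chosen for this example; the module's built-in aa, nucl and anucl encodings are not reproduced here.

```python
# Hypothetical encoding, for illustration only.
NUCL_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_sketch(seq, encoding_dict, unknown=-1):
    """Encode each position of seq; unmapped symbols become unknown."""
    return [encoding_dict.get(ch, unknown) for ch in seq]
```

Here encode_sketch("ACGN", NUCL_CODES) gives [0, 1, 2, -1], since N is not in the dictionary.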
|
|
|
encodeSeqList
|
encodeSeqList (
seqList,
dict,
v=debugLevel,
pad=0,
unknown=-1,
)
Encodes all sequence objects in seqList using the given
dictionary or one of the built-in dictionaries.
The encoded sequences are returned as a nested list of integers.
Updated for SeqIO by NH 020103.
|
|
|
gapstrip
|
gapstrip (
seq,
gapchars=Patterns.singleGapsReo,
v=debugLevel,
)
seq: string or sequence object
[gapchars]: [Patterns.singleGapsReo] or any compiled regular
expression; Patterns.singleGapsReo matches r'[-.~]'
[v] verbosity
Removes characters matching the |gapchars| expression from |seq| and
returns a new degapped sequence object or string (depending on the object passed).
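For strings, degapping with a compiled expression can be sketched as below. SINGLE_GAPS is a local stand-in built from the documented pattern r'[-.~]', and gapstrip_sketch is a hypothetical name.

```python
import re

# Local stand-in for Patterns.singleGapsReo, built from the pattern
# quoted in the description above.
SINGLE_GAPS = re.compile(r"[-.~]")

def gapstrip_sketch(s, gapchars=SINGLE_GAPS):
    """Return s with every match of the gapchars expression removed."""
    return gapchars.sub("", s)
```

Any other compiled expression can be passed as gapchars, for example re.compile(r"[-]") to strip dashes only.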
|
|
|
gcg_checksum
|
gcg_checksum ( seq )
Calculates the GCG checksum for a sequence. seq is a string
or sequence object. Returns an int.
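The sketch below follows the commonly published form of the GCG checksum: each upper-cased character code is weighted by a position counter that cycles from 1 to 57, and the sum is taken modulo 10000. Whether this module matches it in every detail is an assumption; gcg_checksum_sketch is a hypothetical name.

```python
def gcg_checksum_sketch(seq):
    """Compute a GCG-style checksum for a sequence string."""
    total = 0
    for index, char in enumerate(seq):
        # The weight cycles 1, 2, ..., 57, 1, 2, ...
        weight = (index % 57) + 1
        total += weight * ord(char.upper())
    return total % 10000
```

The result is case-insensitive because characters are upper-cased before weighting.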
|
|
|
getSeqNames
|
getSeqNames ( seqlist, lower=1 )
Get a list of names from a list of Seq objects. lower=1 forces
names to lower case
|
|
|
guessFrame
|
guessFrame (
sequenceString,
ignoreEndStop=1,
stopPercent=3.0,
debug=0,
)
Translates a DNA sequence string in all three frames and returns
an int, correctionFactor. Adding correctionFactor to the starting index will
result in an in-frame translation.
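One plausible way to pick a frame is to count stop codons in each of the three frames and choose the frame with the fewest. The sketch below does that; it is an assumption about the approach, does not model stopPercent, and guess_frame_sketch is a hypothetical name.

```python
# Stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def count_stops(dna, frame, ignore_end_stop=True):
    """Count stop codons when reading dna from offset frame."""
    codons = [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]
    if ignore_end_stop and codons and codons[-1] in STOP_CODONS:
        # A terminal stop is expected at the end of a coding sequence.
        codons = codons[:-1]
    return sum(1 for codon in codons if codon in STOP_CODONS)

def guess_frame_sketch(dna, ignore_end_stop=True):
    """Return the offset (0, 1 or 2) of the frame with the fewest stops."""
    dna = dna.upper()
    counts = [count_stops(dna, frame, ignore_end_stop) for frame in range(3)]
    # Ties resolve to the lowest offset.
    return counts.index(min(counts))
```

The returned offset plays the role of correctionFactor: slicing the sequence from that index gives the frame with the fewest internal stops.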
|
|
|
guessType
|
guessType ( seq, minfreq=.10 )
Guesses whether a sequence is nucleotide or protein based on character composition.
Returns nucl or prot, and a list of base compositions.
|
|
|
interleave
|
interleave (
file,
seqList,
width=50,
blocksize=10,
number=1,
hea=0,
gapchar='',
v=1,
)
Creates interleaved alignment from a list of sequence objects
file an open file object
blocksize 0 for no blocks
number places numbers at top of alignment if 1
gapchar all gap chars [.-~] will be replaced with this char.
set gapchar='' for no replacement
|
|
|
invertDict
|
invertDict ( dict, v=debugLevel )
Returns a dictionary with values and keys switched; be aware that
only one of duplicate values will be retained in the inverted
dictionary.
dict: dictionary
[v] verbosity
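The duplicate-value caveat is easy to see in a short sketch. invert_dict_sketch is a hypothetical name; which of several duplicate keys survives depends on iteration order.

```python
def invert_dict_sketch(d):
    """Return a new dictionary mapping each value of d to its key."""
    # Later items overwrite earlier ones that share the same value.
    return {value: key for key, value in d.items()}
```

Inverting {"a": 1, "b": 1} leaves a single entry under key 1, so one of the two original keys is lost.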
|
|
|
readDnapars
|
readDnapars ( filename='', v=0 )
Reads in the output file from dnapars (phylip v3.6) and returns
[dictOut, tofromlist]. Note that dnapars treats all ambiguity
characters as (and replaces them with) N. This function only works with
dnapars files containing a single tree!
|
|
|
readFasta
|
readFasta (
filename='',
degap=0,
v=debugLevel,
output='list',
)
Read a fasta file and return a list of sequence objects
filename: filename (if none is given assume standard input)
[degap] {0|1} remove all gap characters
[v] verbosity
[output] { list | dict }; defaults to list
|
|
|
readFastaList
|
readFastaList (
fastaList,
degap=0,
v=debugLevel,
output='list',
)
Invokes readFasta on a list of filenames or file objects. Returns
a concatenated list or dictionary of sequence objects.
|
|
|
readSelex
|
readSelex ( filename )
SYNTAX: readSelex( filename )
filename: file in selex format (filename or open file object )
returns a list of Seq objects.
|
|
|
removeTrailingComma
|
removeTrailingComma ( s, v=debugLevel )
|
|
|
removeWhitespace
|
removeWhitespace ( s, v=debugLevel )
s: string
[v] verbosity
returns the string |s| with all whitespace characters
removed
|
|
|
seqToFasta
|
seqToFasta (
seq,
linelength=60,
degap=0,
acc=0,
hea=0,
begEnd=None,
triplets=0,
v=debugLevel,
replace={},
)
Returns a string in fasta format with hard line breaks every |linelength| characters.
seq: sequence object
[linelength]: defines the line length
[degap]: {0|1} if 1, removes gap characters from string
[acc]: {0|1} printing of accession number
[hea]: {0|1} printing of header
[begEnd] 2-tuple containing 0-based beginning and end indices of sequence
[triplets] {0|1} truncate sequences, if necessary, to contain a multiple
of three characters (codons); this has to be a two tuple if given
[v] verbosity
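The basic formatting step can be sketched as below, using a plain name and sequence string in place of a sequence object. to_fasta_sketch is a hypothetical name, the trailing newline is an assumption, and the degap, acc, hea, begEnd, triplets and replace options are not modelled.

```python
def to_fasta_sketch(name, seq, linelength=60):
    """Format one record as fasta text with fixed-width sequence lines."""
    lines = [">" + name]
    lines.extend(
        seq[i:i + linelength] for i in range(0, len(seq), linelength)
    )
    return "\n".join(lines) + "\n"
```

With linelength=4, the sequence "ACGTACGTAC" is written as three lines: ACGT, ACGT and AC.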
|
|
|
seqToMSF
|
seqToMSF (
file,
seqList,
type='n',
)
Creates an MSF-format alignment from a list of sequence objects.
file is an open file object. type = { 'n' | 'p' }
|
|
|
seqToNexus
|
seqToNexus (
file,
seqList,
fixnames=1,
)
Writes a list of Seq objects in nexus format to the open file object file.
Assumes that all sequences are the same length. fixnames, if 1, replaces all -
characters with _ (underscore), and all multiple underscore characters
with a single underscore.
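The name clean-up described for fixnames can be sketched with one replacement and one regular expression. fix_name is a hypothetical helper, not a function of this module.

```python
import re

def fix_name(name):
    """Replace dashes with underscores, then collapse underscore runs."""
    return re.sub(r"_+", "_", name.replace("-", "_"))
```

For example, fix_name("HIV-1__env--gp120") gives "HIV_1_env_gp120".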
|
|
|
seqToPhylip
|
seqToPhylip (
seq,
name='',
superclean=0,
v=debugLevel,
)
Writes a string in phylip format from a sequence object, seq.
name is used instead of the name associated with seq if supplied.
Mercilessly truncates names to 10 characters.
|
|
|
tabulate
|
tabulate ( seqList, v=debugLevel )
Calculates the abundance of each character in the columns of
aligned sequences; tallies are returned as a list of dictionaries
indexed by position (position 1 = index 0). Keys of dictionaries are
characters appearing in each position (so dictionaries are
of variable length).
seqList: a list of sequence objects
Returns a list of dictionaries corresponding to each position.
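A sketch of the column tally for aligned plain strings. tabulate_sketch is a hypothetical name, and it works on strings rather than sequence objects.

```python
def tabulate_sketch(seqs):
    """Return one {char: count} dictionary per alignment column."""
    columns = []
    # zip(*seqs) walks the alignment column by column.
    for column in zip(*seqs):
        counts = {}
        for ch in column:
            counts[ch] = counts.get(ch, 0) + 1
        columns.append(counts)
    return columns
```

For the alignment ["AC-", "AG-", "TC-"] this gives [{"A": 2, "T": 1}, {"C": 2, "G": 1}, {"-": 3}], with index 0 holding position 1.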
|
|
|
testEncodeList
|
testEncodeList ( filename, dict='aa' )
an alignment of sequences supplied as the first argument is
encoded according to the dict specified in arg 2
|
|
|
testReadDnapars
|
testReadDnapars ()
|
|
|
testTabulate
|
testTabulate ( filename )
|
|
|
toSeq
|
toSeq (
str,
reo=Patterns.fastaPatternReo,
degap=0,
v=debugLevel,
)
Converts a string containing n sequences in fasta format to a list of sequence objects.
str: lines of a fasta file containing one or more complete sequences
reo: [Patterns.fastaPatternReo] or any other valid compiled
regular expression with the groups <name>,
<hea>, and <seq> (in that order) and used to extract sequences
from |str|. alternatively, expressions with fields <name> and <seq>
only can be used
[degap] {0|1} remove all gap characters
[v]: verbosity
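A sketch of regex-driven fasta parsing with the three groups named above. FASTA_RE is a hypothetical pattern written for this example (Patterns.fastaPatternReo is not shown here), and the sketch returns tuples rather than sequence objects.

```python
import re

# Hypothetical pattern with the groups <name>, <hea> and <seq>, in that order.
FASTA_RE = re.compile(
    r">(?P<name>\S+)[ \t]*(?P<hea>[^\n]*)\n(?P<seq>[^>]*)"
)

def parse_fasta(text):
    """Return a list of (name, header, sequence) tuples from fasta text."""
    records = []
    for match in FASTA_RE.finditer(text):
        # Remove line breaks and other whitespace inside the sequence.
        seq = "".join(match.group("seq").split())
        records.append((match.group("name"), match.group("hea"), seq))
    return records
```

Gap characters are kept as they appear; degapping would be a separate step.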
|
|
|
writePhylip
|
writePhylip (
seqList,
rename=0,
superclean=0,
v=debugLevel,
)
Creates a string representing a phylip2 format sequence
alignment from a list of sequences. If rename=1, sequentially
renames sequences s1, s2 ... sN (useful for long names).
|
|
|
writeTable
|
writeTable (
NestedList,
filename='writeTableOut.txt',
presicion=2,
digits=10,
delim='\t',
pad=0,
padchar=0,
v=debugLevel,
)
Writes a nested list containing numbers as an ASCII matrix to a file;
the default filename is writeTableOut.txt.
Added 020104 NH: if both presicion=0 and digits=0, uses %s instead of
%digits.presicionf (%s should work with both ints and chars).
Also writes to an existing file object if one is supplied in filename.
This function can write tables whose rows contain different numbers of columns.
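The format selection described above can be sketched as a function that returns the table text instead of writing a file. format_table is a hypothetical name, the parameter spelling presicion follows the signature above, and the pad and padchar options are not modelled.

```python
def format_table(rows, presicion=2, digits=10, delim="\t"):
    """Format a nested list as delimited text, one row per line."""
    if presicion == 0 and digits == 0:
        # Plain %s works for ints and characters alike.
        fmt = "%s"
    else:
        # Builds a fixed-width float format such as "%10.2f".
        fmt = "%%%d.%df" % (digits, presicion)
    lines = [delim.join(fmt % value for value in row) for row in rows]
    return "\n".join(lines) + "\n"
```

Rows are formatted independently, so ragged input such as [[1, 2.5], [3]] is handled without padding.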
|
|
Classes
|
|
|
|