
Table of Contents

Module: SeqIO noahprogs/SeqIO.py

IO module for sequence operations $Id: SeqIO.html,v 1.1.1.1 2003/03/03 15:46:33 wresch Exp $

Imported modules   
from Dictionaries import *
import Patterns
import Seq
from Utility import inFile
import cPickle
from copy import deepcopy
import random
import re
import string
import sys
import traceback
import types
Functions   
BegEnd
block
breakLines
codonShuffle
consFromList
consensus
describe
diff
encode
encodeSeqList
gapstrip
gcg_checksum
getSeqNames
guessFrame
guessType
interleave
invertDict
readDnapars
readFasta
readFastaList
readSelex
removeTrailingComma
removeWhitespace
seqToFasta
seqToMSF
seqToNexus
seqToPhylip
tabulate
testEncodeList
testReadDnapars
testTabulate
toSeq
writePhylip
writeTable
  BegEnd 
BegEnd ( sequenceObj,  v=debugLevel )

Determines the beginning and end positions of a sequence in an aligned fasta file, relative to the alignment, assuming the beginning is padded with gap characters.

  block 
block ( str,  blocksize )

Returns a string containing space-delimited substrings of str of length blocksize
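
The behaviour described above can be sketched in present-day Python as follows. This is only an illustration, not the module's source: the name block_sketch is hypothetical, and returning the input unchanged for a non-positive blocksize is an assumption.

```python
def block_sketch(s, blocksize):
    """Split s into chunks of blocksize characters joined by single spaces."""
    if blocksize <= 0:
        # Assumption: a non-positive blocksize means "no blocking".
        return s
    chunks = [s[i:i + blocksize] for i in range(0, len(s), blocksize)]
    return " ".join(chunks)


print(block_sketch("ACGTACGTACGT", 5))  # ACGTA CGTAC GT
```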

  breakLines 
breakLines (
        s,
        n,
        v=debugLevel,
        )

Introduces line breaks every |n| characters in the string |s| and returns the resulting string.

[v] verbosity

  codonShuffle 
codonShuffle (
        seq,
        name='',
        dict=backTransDictNoAmb,
        )

Randomizes the codon usage of a nucleotide sequence while maintaining the amino acid sequence. Input is either a sequence object or a string; return type is determined by the input type.
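
One way to picture the idea is the sketch below, written in present-day Python for plain strings only. It is not the module's implementation: the table built here from the standard genetic code stands in for backTransDictNoAmb, whose actual contents are not shown on this page, and the names translate_sketch and codon_shuffle_sketch are hypothetical. Dropping a trailing partial codon is also an assumption.

```python
import random
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Back-translation table: amino acid -> list of synonymous codons.
BACK_TRANS = {}
for codon, aa in CODON_TO_AA.items():
    BACK_TRANS.setdefault(aa, []).append(codon)


def translate_sketch(dna):
    """Translate an unambiguous DNA string; a trailing partial codon is dropped."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, usable, 3))


def codon_shuffle_sketch(dna, rng=random):
    """Replace every codon with a randomly chosen synonymous codon."""
    protein = translate_sketch(dna)
    return "".join(rng.choice(BACK_TRANS[aa]) for aa in protein)
```

The shuffled string generally differs from the input at the nucleotide level, but translate_sketch gives the same protein for both. Ambiguous codons raise a KeyError in this sketch.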

  consFromList 
consFromList (
        seqList,
        countGaps=0,
        plu=2,
        )

Returns a sequence object representing the consensus of seqList

  consensus 
consensus (
        dict,
        countGaps=0,
        plu=2,
        v=debugLevel,
        )

Given a dictionary from tabulate representing character frequencies at a single position, returns the most common character at that position, subject to the rules below.

countGaps: { 0 | 1 }
plu: plurality for calling a consensus character

Special cases:
1) The most abundant character at a position is a gap:
   if countGaps=0, uses the next most common character, or x if all characters are gaps;
   if countGaps=1, puts a gap character (-) at this position.
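
The gap-handling rule can be sketched in present-day Python as below. This is a hedged illustration, not the module's code: the plu argument is left out because its exact semantics are not spelled out here, the gap set is assumed to match Patterns.singleGapsReo, and the alphabetical tie-break is an assumption.

```python
GAP_CHARS = "-.~"  # assumption: same characters as Patterns.singleGapsReo


def consensus_sketch(counts, count_gaps=0):
    """Pick a consensus character from a {char: count} tally of one column."""
    if not counts:
        raise ValueError("empty column tally")
    # Most frequent first; ties broken alphabetically (an assumption).
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    top = ranked[0]
    if top not in GAP_CHARS:
        return top
    if count_gaps:
        return "-"
    non_gaps = [ch for ch in ranked if ch not in GAP_CHARS]
    return non_gaps[0] if non_gaps else "x"


print(consensus_sketch({"A": 3, "-": 5}))                # A
print(consensus_sketch({"A": 3, "-": 5}, count_gaps=1))  # -
```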

  describe 
describe ()

SeqIO.py

Module containing functions for reading and writing sequence information in various formats, translation of nucleotide sequences. Lots of useful functions in here!

  diff 
diff (
        seq,
        templateseq,
        simchar=' ',
        )

Compares seq and templateseq (which can be Seq objects or strings) and returns an object of the same type as seq in which characters of seq that are identical to templateseq at the same position are replaced with simchar. The returned object has the length of the shorter of seq and templateseq.
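
For plain strings, the comparison can be sketched in present-day Python as follows. The name diff_sketch is hypothetical and Seq objects are not handled, so this only illustrates the string case.

```python
def diff_sketch(seq, templateseq, simchar=" "):
    """Mask positions of seq that match templateseq with simchar."""
    # zip() stops at the shorter input, which gives the documented length.
    return "".join(simchar if a == b else a for a, b in zip(seq, templateseq))


print(diff_sketch("ACGTAC", "ACCTA", simchar="."))  # ..G..
```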

  encode 
encode (
        seq,
        encodingDict,
        unknown,
        v=debugLevel,
        )

Returns a list in which each element represents an encoding of one position of the given sequence string or sequence object; the encoding is directed by the given dictionary.

The dictionary should have the form dict = { aa1: encoding1, aa2: encoding2, ... }. Alternatively:
if dict = aa, the neural net aa encoding is chosen;
if dict = nucl, the standard nucleotide encoding without ambiguities is used;
if dict = anucl, the standard nucleotide encoding with ambiguities is used.

Any symbols not found in the dictionary are encoded as specified by the argument unknown.

updated for SeqIO by NH 020103
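
The dictionary lookup with a fallback for unknown symbols might look like the sketch below, in present-day Python. The encoding table EXAMPLE_NUCL is made up for the example and is not the module's built-in nucl dictionary; the name encode_sketch is also hypothetical.

```python
# Hypothetical encoding table, for illustration only.
EXAMPLE_NUCL = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_sketch(seq, encoding, unknown):
    """Encode each position of seq; symbols missing from encoding map to unknown."""
    return [encoding.get(symbol, unknown) for symbol in seq]


print(encode_sketch("ACGN", EXAMPLE_NUCL, unknown=-1))  # [0, 1, 2, -1]
```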

  encodeSeqList 
encodeSeqList (
        seqList,
        dict,
        v=debugLevel,
        pad=0,
        unknown=-1,
        )

Encodes all sequence objects in seqList using the given dictionary or one of the built-in dictionaries:

  • dict: {'aa'|'nucl'|'anucl'}
      'aa': the neural net aa encoding
      'nucl': standard nucleotide encoding
      'anucl': standard nucleotide encoding with ambiguities
  • verbose: {0|1}: prints out running commentary
  • pad: {0|1}: add gap characters to end of sequences to equalize length (for alignments)

the encoded sequences are returned as a nested list of integers

updated for SeqIO by NH 020103

  gapstrip 
gapstrip (
        seq,
        gapchars=Patterns.singleGapsReo,
        v=debugLevel,
        )

seq: string or sequence object

[gapchars]: [Patterns.singleGapsReo] or any compiled regular expression; Patterns.singleGapsReo matches r'[-.~]'
[v] verbosity

Removes characters matching the |gapchars| expression from |seq| and returns a new degapped sequence object or string (depending on the object passed).
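
For the string case, the operation amounts to a regular-expression substitution. The sketch below, in present-day Python, rebuilds the pattern quoted above under the hypothetical name SINGLE_GAPS rather than importing it from Patterns, and does not handle sequence objects.

```python
import re

# Same pattern the documentation quotes for Patterns.singleGapsReo.
SINGLE_GAPS = re.compile(r"[-.~]")


def gapstrip_sketch(seq, gapchars=SINGLE_GAPS):
    """Return seq with every match of the compiled pattern gapchars removed."""
    return gapchars.sub("", seq)


print(gapstrip_sketch("AC-G..T~A"))  # ACGTA
```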

  gcg_checksum 
gcg_checksum ( seq )

Calculates the GCG checksum for a sequence. seq is a string or sequence object. Returns an int.
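
The sketch below, in present-day Python, follows the commonly used definition of the GCG checksum: each character's code is weighted by its position, the weights cycle from 1 to 57, and the sum is taken modulo 10000. Whether this module's implementation matches it in every detail (for example in how case is handled) is an assumption; the name gcg_checksum_sketch is hypothetical.

```python
def gcg_checksum_sketch(seq):
    """GCG-style checksum of a sequence string (case-insensitive)."""
    checksum = 0
    for index, char in enumerate(seq):
        weight = index % 57 + 1  # weights run 1..57, then start over
        checksum += weight * ord(char.upper())
    return checksum % 10000


print(gcg_checksum_sketch("ACGT"))  # 748
```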

  getSeqNames 
getSeqNames ( seqlist,  lower=1 )

Get a list of names from a list of Seq objects. lower=1 forces names to lower case

  guessFrame 
guessFrame (
        sequenceString,
        ignoreEndStop=1,
        stopPercent=3.0,
        debug=0,
        )

Translates a DNA sequence string in all three frames and returns an int, correctionFactor. Adding correctionFactor to the starting index results in an in-frame translation.
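
A simplified version of the idea, in present-day Python, is to count stop codons in each of the three frames and return the offset of the frame with the fewest. This is a sketch under stated assumptions, not the module's algorithm: the stopPercent threshold is ignored, ties go to the lowest offset, and the name guess_frame_sketch is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def guess_frame_sketch(dna, ignore_end_stop=True):
    """Return the offset (0, 1 or 2) of the frame containing the fewest stops."""
    dna = dna.upper()
    best_frame, best_stops = 0, None
    for frame in range(3):
        codons = [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]
        if ignore_end_stop and codons and codons[-1] in STOP_CODONS:
            codons = codons[:-1]  # a terminal stop is expected in a real ORF
        stops = sum(codon in STOP_CODONS for codon in codons)
        if best_stops is None or stops < best_stops:
            best_frame, best_stops = frame, stops
    return best_frame


print(guess_frame_sketch("ATGGCCAAATAA"))                      # 0
print(guess_frame_sketch("TAGCTGACC", ignore_end_stop=False))  # 2
```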

  guessType 
guessType ( seq,  minfreq=.10 )

Guesses whether a sequence is nucleotide or protein based on character composition. Returns nucl or prot, and a list of base compositions.

  interleave 
interleave (
        file,
        seqList,
        width=50,
        blocksize=10,
        number=1,
        hea=0,
        gapchar='',
        v=1,
        )

Creates an interleaved alignment from a list of sequence objects.

file: an open file object
blocksize: 0 for no blocks
number: places numbers at top of alignment if 1
gapchar: all gap chars [.-~] will be replaced with this char; set gapchar='' for no replacement

  invertDict 
invertDict ( dict,  v=debugLevel )

Returns a dictionary with values and keys switched; be aware that only one of any set of duplicate values will be retained in the inverted dictionary.

dict: dictionary
[v] verbosity
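
The caveat about duplicate values is easiest to see in a small sketch, here in present-day Python. The name invert_dict_sketch is hypothetical, and which of the duplicate keys survives in the real function is not documented; in this sketch it is the last one encountered.

```python
def invert_dict_sketch(d):
    """Swap keys and values; later entries overwrite earlier duplicates."""
    return {value: key for key, value in d.items()}


print(invert_dict_sketch({"a": 1, "b": 2}))       # {1: 'a', 2: 'b'}
print(len(invert_dict_sketch({"a": 1, "b": 1})))  # 1
```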

  readDnapars 
readDnapars ( filename='',  v=0 )

Reads in the output file from dnapars (phylip v3.6) and returns [dictOut, tofromlist]. Note that dnapars treats all ambiguity characters as (and replaces them with) N. This function only works with dnapars files containing a single tree!

  readFasta 
readFasta (
        filename='',
        degap=0,
        v=debugLevel,
        output='list',
        )

Read a fasta file and return a list of sequence objects

filename: filename (if none is given assume standard input)

[degap] {0|1} remove all gap characters
[v] verbosity
[output] { list | dict }; defaults to list

  readFastaList 
readFastaList (
        fastaList,
        degap=0,
        v=debugLevel,
        output='list',
        )

Invokes readFasta on a list of filenames or file objects. Returns a concatenated list or dictionary of sequence objects.

  readSelex 
readSelex ( filename )

SYNTAX: readSelex( filename )
filename: file in selex format (filename or open file object)
Returns a list of Seq objects.

  removeTrailingComma 
removeTrailingComma ( s,  v=debugLevel )

  removeWhitespace 
removeWhitespace ( s,  v=debugLevel )

s: string
[v] verbosity

Returns the string |s| with all whitespace characters removed.

  seqToFasta 
seqToFasta (
        seq,
        linelength=60,
        degap=0,
        acc=0,
        hea=0,
        begEnd=None,
        triplets=0,
        v=debugLevel,
        replace={},
        )

Returns a string in fasta format with hard line breaks every |linelength| characters.

seq: sequence object

[linelength]: defines line length
[degap]: {0|1} if 1, removes gap characters from string
[acc]: {0|1} printing of accession number
[hea]: {0|1} printing of header
[begEnd] 2-tuple containing 0-based beginning and end indices of sequence
[triplets] {0|1} truncate sequences, if necessary, to contain a multiple of three characters (codons); this has to be a two tuple if given
[v] verbosity
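
The core of the formatting, a header line followed by the sequence broken at a fixed width, can be sketched in present-day Python as below. This covers only the name and linelength handling for plain strings; the degap, acc, hea, begEnd, triplets and replace options are left out, and the name seq_to_fasta_sketch is hypothetical.

```python
def seq_to_fasta_sketch(name, seq, linelength=60):
    """Format one record as fasta text with hard line breaks."""
    lines = [seq[i:i + linelength] for i in range(0, len(seq), linelength)]
    return ">%s\n%s\n" % (name, "\n".join(lines))


print(seq_to_fasta_sketch("s1", "ACGTACGT", linelength=3), end="")
# >s1
# ACG
# TAC
# GT
```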

  seqToMSF 
seqToMSF (
        file,
        seqList,
        type='n',
        )

Creates an MSF-format alignment from a list of sequence objects. file is an open file object. type = {'n' | 'p'}

  seqToNexus 
seqToNexus (
        file,
        seqList,
        fixnames=1,
        )

Writes a list of Seq objects in nexus format to the open file object file. Assumes that all sequences are the same length. fixnames, if 1, replaces all - characters with _ (underscore), and all multiple underscore characters with a single underscore.
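
The name clean-up applied when fixnames=1 can be sketched in present-day Python as two substitutions. The helper name fix_name_sketch is hypothetical, and the order of the two steps is an assumption.

```python
import re


def fix_name_sketch(name):
    """Replace '-' with '_' and collapse runs of underscores to one."""
    return re.sub(r"_+", "_", name.replace("-", "_"))


print(fix_name_sketch("HIV-1__env--A"))  # HIV_1_env_A
```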

  seqToPhylip 
seqToPhylip (
        seq,
        name='',
        superclean=0,
        v=debugLevel,
        )

Writes a string in phylip format from a sequence object, seq. name is used instead of the name associated with seq if supplied. Mercilessly truncates names to 10 characters.
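
The truncation follows from the strict phylip layout, in which the name occupies the first 10 columns of a line. A minimal sketch in present-day Python, assuming the sequence follows the padded name directly on the same line (the exact layout produced by seqToPhylip, and the effect of superclean, are not shown here):

```python
def phylip_line_sketch(name, seq):
    """Name cut or padded to exactly 10 characters, followed by the sequence."""
    return "%-10s%s" % (name[:10], seq)


print(phylip_line_sketch("a_very_long_name", "ACGT"))  # a_very_lonACGT
```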

  tabulate 
tabulate ( seqList,  v=debugLevel )

calculate the abundance of each character in the columns of aligned sequences; tallies are returned as a list of dictionaries indexed by position (position 1 = index 0). Keys of dictionaries are characters appearing in each position (so dictionaries are of variable length).

seqList: a list of sequence objects

returns a list of dictionaries corresponding to each position
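
For aligned sequences given as plain strings, the tally can be sketched in present-day Python as below. The name tabulate_sketch is hypothetical, sequence objects are not handled, and stopping at the shortest sequence is an assumption.

```python
def tabulate_sketch(seqs):
    """Return one {char: count} dictionary per alignment column."""
    columns = []
    for column in zip(*seqs):  # zip stops at the shortest sequence
        counts = {}
        for char in column:
            counts[char] = counts.get(char, 0) + 1
        columns.append(counts)
    return columns


print(tabulate_sketch(["AC-", "AG-", "TC-"]))
# [{'A': 2, 'T': 1}, {'C': 2, 'G': 1}, {'-': 3}]
```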

  testEncodeList 
testEncodeList ( filename,  dict='aa' )

An alignment of sequences supplied as the first argument is encoded according to the dict specified in the second argument.

  testReadDnapars 
testReadDnapars ()

  testTabulate 
testTabulate ( filename )

  toSeq 
toSeq (
        str,
        reo=Patterns.fastaPatternReo,
        degap=0,
        v=debugLevel,
        )

convert a string containing n sequences in fasta format to a list of sequence objects

str: lines of a fasta file containing one or more complete sequences
reo: [Patterns.fastaPatternReo] or any other valid compiled regular expression with the groups <name>, <hea>, and <seq> (in that order), used to extract sequences from |str|. Alternatively, expressions with the fields <name> and <seq> only can be used.

[degap] {0|1} remove all gap characters
[v]: verbosity

  • return: list of sequence objects
      everything on the header line except the name will be in the hea field
      only sequences with non-empty sequence fields are included
      all whitespace is removed from sequences and names
      sequences are made all upper case

Exceptions   
NoSeqsFoundError
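
The regular-expression approach described for toSeq can be sketched in present-day Python as follows. The pattern FASTA_RE is a guess at what an expression with the groups name, hea and seq might look like, not the actual Patterns.fastaPatternReo; records are returned as plain dictionaries instead of Seq objects, and NoSeqsFoundSketch is a local stand-in for NoSeqsFoundError.

```python
import re

# Hypothetical pattern with the three named groups, in the documented order.
FASTA_RE = re.compile(r">(?P<name>\S+)[ \t]*(?P<hea>[^\n]*)\n(?P<seq>[^>]*)")


class NoSeqsFoundSketch(Exception):
    """Stand-in for the module's NoSeqsFoundError."""


def to_seq_sketch(text, reo=FASTA_RE, degap=False):
    """Parse fasta text into a list of {'name', 'hea', 'seq'} dictionaries."""
    records = []
    for match in reo.finditer(text):
        seq = re.sub(r"\s+", "", match.group("seq")).upper()
        if degap:
            seq = re.sub(r"[-.~]", "", seq)
        if seq:  # records with an empty sequence are skipped
            records.append({"name": match.group("name"),
                            "hea": match.group("hea").strip(),
                            "seq": seq})
    if not records:
        raise NoSeqsFoundSketch("no sequences found")
    return records


example = ">s1 first seq\nacg-\nTT\n>empty\n>s2\nGG\n"
print([r["seq"] for r in to_seq_sketch(example)])  # ['ACG-TT', 'GG']
```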
  writePhylip 
writePhylip (
        seqList,
        rename=0,
        superclean=0,
        v=debugLevel,
        )

Creates a string representing a phylip2 format sequence alignment from a list of sequences. If rename=1, sequentially renames sequences s1, s2 ... sN (useful for long names).

  writeTable 
writeTable (
        NestedList,
        filename='writeTableOut.txt',
        presicion=2,
        digits=10,
        delim='\t',
        pad=0,
        padchar=0,
        v=debugLevel,
        )

A nested list containing numbers is written as an ascii matrix to a file (default filename writeTableOut.txt).

  • presicion: number of digits after period [default 2]
  • digits: number of total digits [default 10]
  • delim: field delimiter [default tab]
  • pad: pad all rows to length of row 1 with padchar
  • verbose: verbose mode [default 1]

Added 020104 NH: if both presicion=0 and digits=0, uses %s instead of %digits.presicionf (%s should work with both ints and chars). Also, will write to an existing file object if supplied in filename.

this function will write tables with rows containing different numbers of columns
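
How the format string is chosen can be sketched in present-day Python as below. The sketch returns the table as a string instead of writing a file, leaves out pad and padchar, and uses the hypothetical name format_table_sketch; the parameter spelling presicion is kept to match the signature above.

```python
def format_table_sketch(rows, presicion=2, digits=10, delim="\t"):
    """Render a nested list as delimited text, one row per line."""
    if presicion == 0 and digits == 0:
        fmt = "%s"  # works for ints and strings alike
    else:
        fmt = "%%%d.%df" % (digits, presicion)  # e.g. "%10.2f"
    lines = [delim.join(fmt % value for value in row) for row in rows]
    return "\n".join(lines) + "\n"


# Rows may have different numbers of columns.
print(format_table_sketch([[1, 2.5], [3]], presicion=1, digits=4, delim=","), end="")
#  1.0, 2.5
#  3.0
```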

Classes   

NoSeqsFoundError



This document was automatically generated on Thu Feb 27 16:52:00 2003 by HappyDoc version 2.1