Python scripts

 

Sequence analysis classes/scripts/programs

A set of Python programs/classes developed by Noah Hoffman and myself can be found on Noah Hoffman's page. Automatically generated documentation for the code is also available. The package consists mostly of utility programs for analyzing, sorting, classifying, and examining sequences in different formats, and can be used to simplify batch handling of moderately large numbers of sequences.

Below are some additional programs, some of which rely on the Seq and SeqIO classes/scripts that can be obtained from Noah's page:

siRNA.py Identifies sequence motifs in fasta format sequence files that are suitable as potential targets for siRNA-mediated expression knockdown.
USAGE:  siRNA.py  fastafile

Reads in a fasta format sequence,  identifies all AAN(19)TT and 
AAN(21) motifs, and outputs a table with the motif positions
and GC-content.

$Id: software.html,v 1.2 2003/08/06 18:12:23 wresch Exp $
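The motif search described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of siRNA.py itself: the function names `gc_content` and `find_motifs` are hypothetical, positions are assumed to be 1-based, and GC-content is assumed to be computed over the full 23-nucleotide site. A regular-expression lookahead is used so that overlapping sites are all reported.

```python
import re


def gc_content(seq):
    """Return the percentage of G and C bases in seq."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def find_motifs(seq):
    """Return sorted (position, motif, site, gc_percent) tuples.

    Reports every AAN(19)TT and AAN(21) site in seq. Positions are
    1-based (an assumption; the original script may differ).
    """
    seq = seq.upper()
    # The lookahead (?=...) matches at every start position,
    # so overlapping sites are not skipped.
    patterns = {
        "AAN(19)TT": re.compile(r"(?=(AA[ACGT]{19}TT))"),
        "AAN(21)": re.compile(r"(?=(AA[ACGT]{21}))"),
    }
    hits = []
    for name, pattern in patterns.items():
        for match in pattern.finditer(seq):
            site = match.group(1)
            hits.append((match.start() + 1, name, site, gc_content(site)))
    return sorted(hits)


if __name__ == "__main__":
    example = "GG" + "AA" + "GC" * 9 + "G" + "TT" + "CC"
    for pos, motif, site, gc in find_motifs(example):
        print(f"{pos}\t{motif}\t{site}\t{gc:.1f}")
```

Note that every AAN(19)TT site is also an AAN(21) site, so such a position appears once under each motif name.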
					
breaklines.awk Reads a fasta format file and prints it to stdout with lines no longer than 60 characters. Requires awk. Modify the first line of the script to reflect the path to awk.
USAGE: breaklines fastafile
countAll.awk Reads in a fasta format sequence file and generates a table of amino acid composition information, with one row for each sequence in the fasta file.
USAGE: countAll.awk fastafile
quickChange.py Script to find optimal primers for Stratagene QuikChange mutagenesis. Accepts a fasta file with one or more sequences and a list of desired mutations. For each mutation an output file is generated with the top 5 oligonucleotides. Click here for sample output.
USAGE:  quickChange.py fastaFile mutationFile
given a fastafile with one or more sequences with unique names
and a list containing a desired mutation per line in the following format:

sequenceName pos  wt mut [pos wt mut]
For example:
a20r  120 A C
a20r   240 C G  320  A C

The mutations in each line must be introducible with a single primer;
otherwise the mutation is omitted from the analysis.

The following guidelines for primer design are applied:
1)  >=1 G or C on both ends of the target
2)  25-50 bases
3)  Tm = 81.5 + 0.41*(%GC) - 675/N - %mismatch  >= 78°C
4)  mutation should be in the middle of the primer
5)   %GC >= 40%

These rules are applied in a strict sense (SCORE) and in
a more fuzzy sense (fSCORE; places more emphasis on melting temp;
allows deviations; linear scoring models)
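As an illustration of guideline 3, the melting temperature formula can be evaluated as follows. This is only a sketch under stated assumptions and not the scoring code of quickChange.py: the function names are hypothetical, %GC and %mismatch are taken as whole-number percentages of the primer length N, and guideline 1 is interpreted as the primer starting and ending with G or C. Guideline 4 (mutation in the middle) and the SCORE/fSCORE models are not covered.

```python
def percent_gc(primer):
    """Return the GC percentage (0-100) of the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def primer_tm(primer, n_mismatches):
    """Tm = 81.5 + 0.41*(%GC) - 675/N - %mismatch, in degrees C."""
    n = len(primer)
    pct_mismatch = 100.0 * n_mismatches / n
    return 81.5 + 0.41 * percent_gc(primer) - 675.0 / n - pct_mismatch


def passes_strict_rules(primer, n_mismatches):
    """Check guidelines 1, 2, 3 and 5 in the strict sense."""
    primer = primer.upper()
    n = len(primer)
    if not 25 <= n <= 50:                                # guideline 2
        return False
    return (
        primer[0] in "GC" and primer[-1] in "GC"         # guideline 1 (assumed reading)
        and primer_tm(primer, n_mismatches) >= 78.0      # guideline 3
        and percent_gc(primer) >= 40.0                   # guideline 5
    )


if __name__ == "__main__":
    # Hypothetical 40-mer with 60% GC and a single mismatched base.
    primer = "G" + "GC" * 11 + "AT" * 8 + "C"
    print(round(primer_tm(primer, 1), 3), passes_strict_rules(primer, 1))
```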
					

Bibliography scripts

pubmed.py Search ncbi pubmed and fetch xml-format ArticleSet file containing all or a fraction of the matching articles. Uses the ncbi eutils described here.
USAGE:  pubmed.py outfile

Fetches articles in xml format from ncbi pubmed and saves
them. Respect ncbi resources!
Uses utf-8 encoding

					
medxml2bib.py Converts an xml articleset fetched from ncbi pubmed into a bibtex format file. During the process it converts some common unicode characters (mostly vowels with accents) into their latex equivalents. I am not very familiar with parsing xml, so this uses regular expressions, not the python modules intended for xml parsing (i.e. it's quick and dirty).
USAGE:  medxml2bib.py xmlfile outfile

Read xmlfile generated by ncbi interface of pubmed (summary xml file) and write
the data as a bibtex format file.  This program will fix the truncated notation
of page ranges, but will not convert special characters to their latex equivalents.
Output format is UTF-8.
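The page-range fix mentioned above can be sketched like this. PubMed often abbreviates the last page of a range (for example "1234-8" for pages 1234 to 1238). The function name `expand_page_range` is hypothetical, and writing the range with the bibtex-style double hyphen is an assumption about the desired output, not a description of what medxml2bib.py emits.

```python
def expand_page_range(pages):
    """Expand a truncated page range such as '1234-8' to '1234--1238'.

    Strings without a range, or with an empty last page, are
    returned unchanged.
    """
    if "-" not in pages:
        return pages
    first, last = (part.strip() for part in pages.split("-", 1))
    if not first or not last:
        return pages
    # A shorter numeric last page borrows its leading digits
    # from the first page.
    if first.isdigit() and last.isdigit() and len(last) < len(first):
        last = first[: len(first) - len(last)] + last
    return f"{first}--{last}"


if __name__ == "__main__":
    for pages in ("1234-8", "51-62", "e100"):
        print(pages, "->", expand_page_range(pages))
```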

					

General Utilities

Matlab code

Supporting material, including trained neural networks and matlab support code, for the manuscript "Improved success of phenotype prediction of the human immunodeficiency virus type 1 from Env V3 sequences using neural networks", Resch W, Hoffman N, and Swanstrom R., Virology 288:51-62 (2001), can be found on a Swanstrom lab page.