Sequence analysis classes/scripts/programs
A set of Python programs and classes developed by Noah Hoffman and myself can be found on Noah Hoffman's page, along with automatically generated documentation for the code. The package consists mostly of utility programs for analyzing, sorting, classifying, and examining sequences in different formats, and can be used to simplify batch handling of moderately large numbers of sequences.
Below are some additional programs, some of which rely on the Seq and SeqIO classes/scripts that can be obtained from Noah's page:
| siRNA.py |
Identifies sequence motifs in fasta format sequence files that
are suited as potential targets for siRNA mediated expression knockdown.
USAGE: siRNA.py fastafile
Reads in a fasta format sequence, identifies all AAN(19)TT and AAN(21) motifs, and outputs a table with the motif positions and GC-content.
$Id: software.html,v 1.2 2003/08/06 18:12:23 wresch Exp $ |
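The motif search described above can be sketched in a few lines of Python. This is an illustration only, not the code in siRNA.py: the function names and the tuple layout are made up for the example, and it assumes an upper- or lower-case DNA sequence containing only A, C, G, and T.

```python
import re

# Lookahead patterns, so that overlapping motifs are all reported.
# Both motifs are 23 nt long: AA + 19 bases + TT, and AA + 21 bases.
MOTIFS = {
    "AAN(19)TT": re.compile(r"(?=(AA[ACGT]{19}TT))"),
    "AAN(21)": re.compile(r"(?=(AA[ACGT]{21}))"),
}

def gc_content(seq):
    """Return the percentage of G and C bases in seq."""
    return 100.0 * sum(seq.count(base) for base in "GC") / len(seq)

def find_sirna_targets(seq):
    """Return (motif name, 1-based start, site, %GC) tuples for a DNA sequence."""
    seq = seq.upper()
    hits = []
    for name, pattern in MOTIFS.items():
        for match in pattern.finditer(seq):
            site = match.group(1)
            hits.append((name, match.start() + 1, site, gc_content(site)))
    # Sort by position first, then by motif name.
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))
```

Every AAN(19)TT site is also an AAN(21) site, so such positions are reported once under each motif name.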
| breaklines.awk |
Read a fasta format file and print file to stdout with
lines no longer than 60 characters. Requires awk. Modify
first line of script to reflect path to awk.
USAGE: breaklines fastafile |
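For readers without awk, the same kind of re-wrapping can be sketched in Python. This is not a translation of breaklines.awk: the function name is hypothetical, and the sketch joins all sequence lines of a record before re-wrapping them, which the awk script may or may not do.

```python
def break_lines(lines, width=60):
    """Yield FASTA lines with sequence data re-wrapped to at most `width` characters."""
    chunk = []

    def flush():
        # Join the sequence lines collected so far and emit them in slices.
        seq = "".join(chunk)
        chunk.clear()
        for i in range(0, len(seq), width):
            yield seq[i:i + width]

    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield from flush()   # finish the previous record
            yield line           # header lines are passed through unchanged
        else:
            chunk.append(line.strip())
    yield from flush()
```

Passing an open file object as `lines` and printing each yielded line writes the re-wrapped file to stdout.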
| countAll.awk |
Reads in a fasta format sequence file and generates a
table of amino acid composition information, with one
row for each sequence in the fasta file.
USAGE: countAll.awk fastafile |
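A composition table of this kind could be produced in Python roughly as follows. The column order, the restriction to the 20 standard amino acids, and the function names are assumptions made for this sketch; the output of countAll.awk may be laid out differently.

```python
from collections import Counter

# Assumed column order: the 20 standard amino acids, alphabetical by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, parts = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            name, parts = line[1:], []
        elif line:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)

def composition_table(lines):
    """Return a header row plus one row of residue counts per sequence."""
    rows = [["name"] + list(AMINO_ACIDS)]
    for name, seq in read_fasta(lines):
        counts = Counter(seq.upper())
        rows.append([name] + [counts[aa] for aa in AMINO_ACIDS])
    return rows
```

Each row can then be printed tab-separated, for example with `"\t".join(map(str, row))`.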
| quickChange.py |
Script to find optimal primers for Stratagene QuikChange
mutagenesis. Accepts a fasta file with one or more sequences
and a list of desired mutations. For each mutation, an
output file is generated with the top 5 oligonucleotides. Click
here for sample output.
USAGE: quickChange.py fastaFile mutationFile
Takes a fasta file with one or more uniquely named sequences and a list containing one desired mutation per line in the following format:
sequenceName pos wt mut [pos wt mut]
For example:
a20r 120 A C
a20r 240 C G 320 A C
The mutations on each line must be achievable with a single primer; otherwise the mutation is omitted from the analysis.
The following guidelines for primer design are applied:
1) >=1 G or C on both ends of the target
2) 25-50 bases
3) Tm = 81.5 + 0.41*(%GC) - 675/N - %mismatch >= 78°C
4) mutation should be in the middle of the primer
5) %GC >= 40%
These rules are applied in a strict sense (SCORE) and in a fuzzier sense (fSCORE: more emphasis on melting temperature, deviations allowed, linear scoring models). |
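The strict form of the five guidelines can be written down as a small check. This is a sketch under stated assumptions, not the scoring code in quickChange.py: %GC and %mismatch are taken as whole-number percentages of the primer length N, "in the middle" is approximated by a hypothetical `min_flank` parameter (at least that many bases on either side of the mutation), and the function names are invented for the example.

```python
def quikchange_tm(primer, n_mismatch):
    """Tm = 81.5 + 0.41*(%GC) - 675/N - %mismatch, with percentages as whole numbers."""
    n = len(primer)
    gc = 100.0 * sum(primer.upper().count(base) for base in "GC") / n
    mismatch = 100.0 * n_mismatch / n
    return 81.5 + 0.41 * gc - 675.0 / n - mismatch

def check_primer(primer, mut_start, mut_len=1, min_flank=10):
    """Return a dict mapping each guideline to True/False for one candidate primer.

    mut_start is the 0-based index of the first mutated base in the primer.
    min_flank is an assumption of this sketch, not a value taken from the script.
    """
    primer = primer.upper()
    n = len(primer)
    gc = 100.0 * sum(primer.count(base) for base in "GC") / n
    return {
        "gc_ends": primer[0] in "GC" and primer[-1] in "GC",          # rule 1
        "length": 25 <= n <= 50,                                      # rule 2
        "tm": quikchange_tm(primer, mut_len) >= 78.0,                 # rule 3
        "centered": (mut_start >= min_flank
                     and n - (mut_start + mut_len) >= min_flank),     # rule 4
        "gc_content": gc >= 40.0,                                     # rule 5
    }
```

A candidate passes the strict test when `all(check_primer(...).values())` is true. The fuzzy fSCORE variant would replace each boolean with a graded score, which is not attempted here.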
Bibliography scripts
| pubmed.py |
Search NCBI PubMed and fetch an XML-format ArticleSet file containing
all or a fraction of the matching articles. Uses the NCBI eutils
described
here.
USAGE: pubmed.py outfile
Fetches articles in XML format from NCBI PubMed and saves them. Respect NCBI resources! Uses UTF-8 encoding. |
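As a rough illustration of the two-step search-then-fetch pattern, the following sketch only builds the request URLs for the ESearch and EFetch utilities. It is not taken from pubmed.py: the base URL is the current E-utilities address, which may differ from the one the script used, and the function names and the `retmax` default are choices made for this example.

```python
from urllib.parse import urlencode

# Current E-utilities base address (an assumption; the script may use an older one).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(term, retmax=20):
    """Build an ESearch URL that returns up to `retmax` PubMed IDs matching `term`."""
    query = urlencode({"db": "pubmed", "term": term, "retmax": retmax})
    return f"{EUTILS}/esearch.fcgi?{query}"

def efetch_url(pmids):
    """Build an EFetch URL that returns the given PubMed records as XML."""
    ids = ",".join(str(pmid) for pmid in pmids)
    query = urlencode({"db": "pubmed", "id": ids, "retmode": "xml"})
    return f"{EUTILS}/efetch.fcgi?{query}"
```

The URLs can be retrieved with `urllib.request.urlopen`; keeping `retmax` small and pausing between requests is one way to respect NCBI resources.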
| medxml2bib.py |
Convert an XML ArticleSet fetched from NCBI PubMed into a BibTeX format
file. During the process it converts some common Unicode characters
(mostly vowels with accents) into their LaTeX equivalents. I am not
very familiar with parsing XML, so this uses regular expressions, not
the Python modules intended for XML parsing (i.e. it's quick and dirty).
USAGE: medxml2bib.py xmlfile outfile
Reads the xmlfile generated by the NCBI PubMed interface (summary XML file) and writes the data as a BibTeX format file. This program will fix the truncated notation of page ranges, but will not convert special characters to their LaTeX equivalents. Output format is UTF-8. |
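The page-range fix mentioned above can be illustrated with a short regular-expression sketch. It assumes that a truncated range such as 1234-8 means pages 1234 to 1238 and that the BibTeX output should use a double hyphen; both the function name and these conventions are assumptions, not details taken from medxml2bib.py.

```python
import re

def expand_page_range(pages):
    """Expand a truncated page range such as '1234-8' to '1234--1238'.

    Anything that is not a plain 'digits-digits' range is returned unchanged.
    """
    match = re.fullmatch(r"(\d+)-(\d+)", pages.strip())
    if not match:
        return pages
    first, last = match.groups()
    if len(last) < len(first):
        # Borrow the missing leading digits from the first page number.
        last = first[:len(first) - len(last)] + last
    return f"{first}--{last}"
```

Ranges that are already complete, such as 51-62, only have their hyphen doubled.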
General Utilities
Matlab code
Supporting material, including trained neural networks and Matlab support code, for the manuscript "Improved success of phenotype prediction of the human immunodeficiency virus type 1 from Env V3 sequences using neural networks", Resch W, Hoffman N, and Swanstrom R, Virology 288:51-62 (2001), can be found on a Swanstrom lab page.