Updated April 8, 2002
All of these tools are written in Perl. Many of the scripts have module dependencies, which are listed at the end of each description. They have all been tested under Win2K Pro running ActiveState Perl (build 631). I have also packaged the scripts into Windows executables using perlapp from the Perl Developer's Kit 4.0. The executables have the required modules prepackaged and therefore work as standalone programs. Most of the scripts should run on any system that supports Perl and the required modules.
All of the software is released under the terms of the GNU General Public License. You can view this license at http://www.gnu.org/licenses/gpl.txt. If you have not read my disclaimer yet, please do so here.
Please e-mail cckim47@gmail.com with any bug reports, suggestions, and success stories (they encourage me to maintain and write more programs!).
A brief summary of file formats used in the following programs.
Updated 3/25/03
Adds annotation from a file to a PCL (pre-clustering) data file. The
annotation file requires an ID and Name column. The IDs are read from
the first column of the PCL file, and the annotations are added to the
second column. Does not affect the data. Intended for situations in
which a different annotation is desired, such as when an annotation is
published for a microarray constructed pre-annotation.
Requires: Tk
Current Version: 1.0
Last Update: 3/25/02
Source
Windows Executable
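The annotation step described above can be sketched in a few lines. This is a minimal Python illustration, not the tool's Perl source; the function name annotate_pcl and the choice to leave unmatched IDs unchanged are assumptions made for the example.

```python
def annotate_pcl(pcl_rows, annotations):
    """Return copies of the PCL rows with the second column re-annotated.

    pcl_rows: list of rows, each a list of strings (ID first, Name second).
    annotations: dict mapping ID -> new Name.
    IDs with no entry in `annotations` keep their existing Name
    (an assumption; the original tool may behave differently).
    """
    out = []
    for row in pcl_rows:
        new_row = list(row)  # copy, so the data columns are never touched
        if len(new_row) >= 2 and new_row[0] in annotations:
            new_row[1] = annotations[new_row[0]]
        out.append(new_row)
    return out


# Made-up example rows: ID, Name, weight, one data column.
rows = [
    ["spot1", "", "1", "0.52"],
    ["spot2", "unknown", "1", "-1.20"],
]
for row in annotate_pcl(rows, {"spot1": "invA"}):
    print("\t".join(row))
```

Only the second column changes; every data value is passed through as the original string.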
This tool takes an input PCL or CDT (pre- or post-clustering) file and
converts all values to a binary scale. The cutoff is user-defined.
Primarily intended for genomotyping analysis. Also see GACK below.
Requires: Tk
Current Version: 1.0
Last Update: 3/25/02
Source
Windows Executable
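As a rough illustration of the binary conversion above, the following Python sketch maps each value to 1 or 0 against a user-defined cutoff. It is not the tool's Perl source. The function name to_binary, the rule that values at or above the cutoff become 1, and the handling of blank cells are all assumptions.

```python
def to_binary(values, cutoff):
    """Convert a list of data cells (strings) to "1"/"0" using a cutoff.

    Values >= cutoff become "1", values below it become "0".
    Blank cells (missing data) are left blank. Both rules are
    assumptions for this sketch.
    """
    out = []
    for cell in values:
        if cell.strip() == "":
            out.append("")
        else:
            out.append("1" if float(cell) >= cutoff else "0")
    return out


# Made-up log-ratio values for one spot across four arrays.
print(to_binary(["0.4", "-1.7", "", "-0.5"], cutoff=-0.5))
```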
This tool takes multiple PCL files and creates a single aggregate
PCL file. Intended for joining datasets which have been created on
separate occasions, and which may have different ID values (normally
requiring the creation of a database to join the data fields).
Requires: Tk
Current Version: 1.1
Last Update: 3/25/02
Source
Windows Executable
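The join described above amounts to an outer join on the ID column. The Python sketch below shows one way to do that without a database; it is not the tool's Perl source. The function name merge_tables, the dict-based input, and the use of blank cells for IDs missing from a dataset are assumptions.

```python
def merge_tables(tables):
    """Outer-join several datasets on ID.

    tables: list of dicts, each mapping ID -> list of data cells.
            Every row within one dict is assumed to have the same length.
    Returns a dict mapping ID -> concatenated cells, with blanks filled in
    where a dataset has no row for that ID.
    """
    # Collect IDs in first-seen order.
    ids, seen = [], set()
    for table in tables:
        for gene_id in table:
            if gene_id not in seen:
                seen.add(gene_id)
                ids.append(gene_id)

    # Number of data columns contributed by each dataset.
    widths = [len(next(iter(t.values()))) if t else 0 for t in tables]

    merged = {}
    for gene_id in ids:
        row = []
        for table, width in zip(tables, widths):
            row.extend(table.get(gene_id, [""] * width))
        merged[gene_id] = row
    return merged


# Made-up datasets with partially overlapping IDs.
first = {"g1": ["1.0", "2.0"], "g2": ["3.0", "4.0"]}
second = {"g2": ["5.0"], "g3": ["6.0"]}
for gene_id, cells in merge_tables([first, second]).items():
    print(gene_id, cells)
```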
Load a list of IDs, and filter or retrieve the associated data from a
.pcl or .cdt file. In version 1.0 (5/14/02), retrieved IDs were counted improperly; this is fixed in version 1.1.
Requires: Tk
Current Version: 1.1
Last Update: 6/18/02
Source
Windows Executable
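The two modes described above can be sketched as a single selection function. This Python example is an illustration only, not the tool's Perl source. The name select_rows and the interpretation that "retrieve" keeps the listed IDs while "filter" removes them are assumptions.

```python
def select_rows(rows, ids, mode="retrieve"):
    """Select data rows by ID.

    rows: list of rows (lists of strings) with the ID in the first column.
    ids:  iterable of IDs loaded from the user's list.
    mode: "retrieve" keeps rows whose ID is listed;
          "filter" drops them (assumed meaning of the two modes).
    """
    wanted = set(ids)
    if mode == "retrieve":
        return [row for row in rows if row[0] in wanted]
    if mode == "filter":
        return [row for row in rows if row[0] not in wanted]
    raise ValueError("mode must be 'retrieve' or 'filter'")


# Made-up rows.
rows = [["g1", "invA", "0.3"], ["g2", "sipB", "1.1"], ["g3", "hilA", "-0.8"]]
print(select_rows(rows, ["g1", "g3"], mode="retrieve"))
print(select_rows(rows, ["g1", "g3"], mode="filter"))
```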
Dynamically chooses cutoffs for classifying genes as present or divergent, based on the shape of the distribution. More resistant to variation in hybridization data than CCACK.
If you use GACK, please cite:
Kim CC, Joyce EA, Chan K, Falkow S.
Improved analytical methods for microarray-based genome-composition
analysis.
Genome Biol. 2002 Oct 29;3(11):RESEARCH0065.
A simple filtering tool which demands
that a certain percentage of datapoints be present in a PCL file. Spots
which fail to contain the user-specified percentage of good data points
are removed from the dataset.
Requires: Tk
Current Version: 1.0
Last Update: 3/25/02
Source
Windows Executable
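The percentage filter above can be illustrated with a short Python sketch (not the tool's Perl source). The function name filter_by_presence is hypothetical, and two things are assumed: a blank cell means a missing data point, and the data columns start at the fourth column, after ID, Name, and a weight column.

```python
def filter_by_presence(rows, min_percent, first_data_col=3):
    """Keep rows whose share of non-blank data cells is >= min_percent.

    rows: list of rows (lists of strings).
    first_data_col: index of the first data column (assumed to be 3).
    """
    kept = []
    for row in rows:
        data = row[first_data_col:]
        if not data:
            continue  # nothing to evaluate, so the row is dropped
        present = sum(1 for cell in data if cell.strip() != "")
        if 100.0 * present / len(data) >= min_percent:
            kept.append(row)
    return kept


# Made-up rows: the first has 3 of 4 values present, the second 1 of 4.
rows = [
    ["g1", "invA", "1", "0.1", "", "0.3", "0.2"],
    ["g2", "sipB", "1", "", "", "0.3", ""],
]
print(filter_by_presence(rows, min_percent=75))
```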
This tool takes an input PCL or CDT file and generates a graphical
histogram file in JPEG format. Allows rapid simultaneous viewing of
histograms for multiple datasets for quality assessment. Primarily
intended for DNA and RNA hybridizations.
Requires: Tk, GD, Math-Round, POSIX
Current Version: 1.4
Last Update: 3/25/02
Source
Windows Executable
I often do a SAM or Cluster analysis and see a common theme in some
of the significant genes. However, I'm usually not sure if it's
really overrepresented, or just a product of observer bias. This
program addresses whether or not a theme is actually overrepresented
in your significant genes list. The program takes a list of significant
genes and a list of user-specified search terms, and counts the number of
genes which contain one of the search terms. Then, the program takes a
random set of genes of the same size as the significant set from a genome
annotation file and counts the hits. This process is repeated a
user-specified number of times so that statistics regarding the randomness
of the frequencies can be calculated. Statistics and histogram data
are output to a text file. NOTE: PLEASE DO NOT USE VERSIONS OF THIS
SOFTWARE OLDER THAN VERSION 2.02. THE STATISTICAL ANALYSIS HAS BEEN
CHANGED.
If you use LACK, please cite:
Kim CC, Falkow S.
Significance analysis of lexical bias in microarray data.
BMC Bioinformatics. 2003 Apr 3;4(1):12.
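The resampling procedure described above can be sketched as follows. This is a simplified Python illustration, not the LACK source and not the statistical analysis from the paper: it reports a plain empirical fraction (random draws with at least as many hits as the significant list). The names count_hits and lexical_bias, the case-insensitive substring matching, and the fixed random seed are all assumptions.

```python
import random


def count_hits(names, terms):
    """Count gene names containing at least one search term (case-insensitive)."""
    lowered = [t.lower() for t in terms]
    return sum(1 for name in names if any(t in name.lower() for t in lowered))


def lexical_bias(significant, genome, terms, n_iter=1000, seed=0):
    """Compare hits in the significant list with hits in random gene sets.

    significant: annotations of the significant genes.
    genome:      annotations of all genes (must be at least as long).
    Returns (observed hits, fraction of random sets with >= observed hits,
    list of hit counts from the random sets, usable as histogram data).
    """
    rng = random.Random(seed)
    observed = count_hits(significant, terms)
    random_counts = [
        count_hits(rng.sample(genome, len(significant)), terms)
        for _ in range(n_iter)
    ]
    at_least = sum(1 for c in random_counts if c >= observed)
    return observed, at_least / n_iter, random_counts


# Made-up annotations.
genome = ["flagellin", "ribosomal protein L2", "DNA ligase", "flagellar motor",
          "ABC transporter", "histidine kinase", "permease", "chaperone"]
significant = ["flagellin", "flagellar motor", "permease"]
observed, fraction, _ = lexical_bias(significant, genome, ["flagell"], n_iter=200)
print(observed, fraction)
```

A small fraction suggests the theme appears more often in the significant list than in random gene sets of the same size.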
ALACK: Automated LACK
This is an automated version of LACK which does not require advance
generation of a word list. However, this version is limited to
single-word analysis; LACK must be used for multiple search-term
analyses. NOTE THAT THIS SOFTWARE HAS BEEN INTEGRATED INTO LACK 4.2 ABOVE.
Previous Version: 0.1
Last Update: 03/23/03
Perl Source
Windows Executable
Averages data values for a PCL file, but only if the Name (second)
column is identical. Intended for averaging data values when multiple
spots are present on a single array for a given gene.
Requires: Tk, File::Basename
Current Version: 1.2
Last Update: 3/25/02
Source
Windows Executable
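The averaging rule above (collapse rows only when the Name column is identical) can be sketched in Python as shown below. This is not the tool's Perl source. The name average_by_name is hypothetical, and several choices are assumptions: data starts at the fourth column, blank cells are skipped when averaging, and the merged row keeps the ID and other leading columns of the first spot.

```python
def average_by_name(rows, first_data_col=3):
    """Average data columns across rows that share the same Name (column 2).

    rows: list of equal-length rows (lists of strings).
    Blank cells are ignored; a column with no values stays blank.
    """
    groups = {}
    for row in rows:
        groups.setdefault(row[1], []).append(row)  # dicts keep insertion order

    averaged = []
    for members in groups.values():
        merged = list(members[0][:first_data_col])
        for col in range(first_data_col, len(members[0])):
            values = [float(m[col]) for m in members if m[col].strip() != ""]
            merged.append(str(sum(values) / len(values)) if values else "")
        averaged.append(merged)
    return averaged


# Made-up rows: two spots for invA, one for sipB.
rows = [
    ["s1", "invA", "1", "1.0", "2.0"],
    ["s2", "invA", "1", "3.0", ""],
    ["s3", "sipB", "1", "0.5", "0.5"],
]
for row in average_by_name(rows):
    print("\t".join(row))
```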
Removes duplicate lines from a PCL file. Actually, it removes
duplicate lines from any text file, but was intended for use with PCL
files. Does exactly what it says; if the lines are not 100% identical,
they are not removed.
Requires: Tk
Current Version: 1.0
Last Update: 3/25/02
Source
Windows Executable
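Exact-match duplicate removal, as described above, is simple to illustrate. The Python sketch below is not the tool's Perl source; the function name remove_duplicate_lines and the choice to keep the first occurrence in its original position are assumptions.

```python
def remove_duplicate_lines(lines):
    """Drop lines that exactly repeat an earlier line, preserving order."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:  # comparison is exact, including whitespace
            seen.add(line)
            unique.append(line)
    return unique


# The last line differs by a trailing space, so it is kept.
lines = ["g1\t0.5", "g2\t1.2", "g1\t0.5", "g1\t0.5 "]
print(remove_duplicate_lines(lines))
```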
You've done a SAM analysis and have your lists of significantly
upregulated and downregulated genes. You now want a more visual
representation, or you want to see if there is even more detailed
substructure within these genes by using Cluster. Samster will take an
Excel spreadsheet or text files and extract the raw data into a text
output file, which can be fed directly into Cluster or opened in Treeview.
This program circumvents the need to create databases each time you wish
to accomplish this task.
Update: Version 1.4 did not work with Cluster 3. It has been
updated and some additional minor bugs have been fixed; the newer
version 1.5 works with both Eisen's Cluster and Cluster 3.
If you use SAMster, please cite:
Mueller A, O'Rourke J, Chu P, Kim CC, Sutton P, Lee A, Falkow S.
Protective immunity against Helicobacter is characterized by a unique
transcriptional signature.
Proc Natl Acad Sci U S A. 2003 Oct 14;100(21):12289-94.
One of the measures of virulence of different strains of an organism is
the competitive index. In this model, a mixed infection is performed in a
single host with the assumption that the more fit strain will outperform
the other. The current standard is to report competitive index (CI), the
ratio of recovered colony forming units of one strain to the other.
While this ratio is informative, it discards overall organ-load
information due to the one-dimensional nature of the CI. I've developed a
special plot type and a tool, COP, for generation of plots which preserve
this information. Paired t-test statistics are calculated.
Requires: Tk, Tk::NumEntry, Statistics::Descriptive,
Statistics::Distributions
Current version: 0.3
Last update: 4/1/03
Source
Windows Executable
Sample data file
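For readers unfamiliar with the quantities mentioned above, here is a minimal Python sketch of a competitive index and a paired t statistic. It is not the COP source and does not reproduce its plot. The function names are hypothetical, and running the t-test on log10-transformed CFU is an assumption. Only the t statistic is computed; a p-value would require a t-distribution table or a statistics library.

```python
import math


def competitive_indices(cfu_a, cfu_b):
    """Per-animal ratio of recovered CFU of strain A to strain B."""
    return [a / b for a, b in zip(cfu_a, cfu_b)]


def paired_t(x, y):
    """Paired t statistic for two equal-length samples (n >= 2)."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(variance / n)


# Made-up CFU counts from three co-infected animals.
strain_a = [2.0e5, 8.0e4, 5.0e5]
strain_b = [1.0e4, 2.0e4, 2.5e4]
print(competitive_indices(strain_a, strain_b))
print(paired_t([math.log10(v) for v in strain_a],
               [math.log10(v) for v in strain_b]))
```

Keeping both CFU lists, rather than only their ratio, is what preserves the organ-load information that the CI alone discards.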
Takes a tab-delimited text file of surviving mice as input. LD50 values
are calculated by Reed-Muench (command-line interface only) or Moving Average
Interpolation (graphical interface).
If you use the LD50 calculators, please cite:
Kim CC, Monack D, Falkow S.
Modulation of virulence by two acidified nitrite-responsive loci of
Salmonella enterica serovar Typhimurium.
Infect Immun. 2003 Jun;71(6):3196-205.
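The Reed-Muench calculation mentioned above can be sketched as follows. This Python example follows the textbook form of the method and is not the calculator's Perl source: deaths are accumulated from the lowest dose upward, survivors from the highest dose downward, and the 50% point is interpolated on a log10 dose scale. The function name and the argument layout (doses, survivors, group sizes) are assumptions; the actual input file format is not reproduced here.

```python
import math


def reed_muench_ld50(doses, survivors, group_sizes):
    """Estimate LD50 by the Reed-Muench method.

    doses:       dose given to each group (any order, all > 0).
    survivors:   number of surviving animals in each group.
    group_sizes: number of animals challenged in each group (all > 0).
    """
    order = sorted(range(len(doses)), key=lambda i: doses[i])
    doses = [doses[i] for i in order]
    alive = [survivors[i] for i in order]
    dead = [group_sizes[i] - survivors[i] for i in order]

    # Deaths accumulate upward: an animal killed by a low dose is
    # assumed to die at any higher dose as well.
    cum_dead, total = [], 0
    for d in dead:
        total += d
        cum_dead.append(total)

    # Survivors accumulate downward from the highest dose.
    cum_alive, total = [], 0
    for a in reversed(alive):
        total += a
        cum_alive.append(total)
    cum_alive.reverse()

    mortality = [100.0 * d / (d + a) for d, a in zip(cum_dead, cum_alive)]

    for i in range(len(doses) - 1):
        below, above = mortality[i], mortality[i + 1]
        if below < 50.0 <= above:
            distance = (50.0 - below) / (above - below)
            step = math.log10(doses[i + 1]) - math.log10(doses[i])
            return 10 ** (math.log10(doses[i]) + distance * step)
    raise ValueError("50% mortality is not bracketed by the tested doses")


# Made-up experiment: four tenfold doses, five mice per group.
print(reed_muench_ld50([1e2, 1e3, 1e4, 1e5], [5, 4, 1, 0], [5, 5, 5, 5]))
```

In this made-up example the cumulative mortality is about 16.7% at 10^3 and 83.3% at 10^4, so the estimate falls halfway between them on the log scale.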
This program takes up to 6 FASTA DNA sequence files as input and
outputs a tab-delimited text file containing sizes of restriction
fragments (which can be opened in a spreadsheet program). Several options
are available, including 4-6 base cutters and running in differential or
list-all cutters mode. Differential analysis will analyze the sequences
and only output restriction enzymes which distinguish between the
sequences, while listing all cutters will list even those cutters which do
not distinguish between the sequences. This program's primary purpose was
to automatically choose sites to distinguish plasmid clones with an insert
in two possible orientations.
Requires: Tk, Bioperl
Current Version: 3.1
Last Update: 3/25/02
Source
Windows Executable
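The difference between differential and list-all modes can be shown with a short Python sketch. This is an illustration under stated assumptions, not the tool's Perl/Bioperl source: sequences are treated as linear (plasmids are circular, which changes the fragment count), only three well-known 6-base cutters are included, and the names ENZYMES, fragment_sizes, and differential_enzymes are hypothetical.

```python
# Recognition site and cut position within the site (top strand).
ENZYMES = {
    "EcoRI": ("GAATTC", 1),    # G^AATTC
    "BamHI": ("GGATCC", 1),    # G^GATCC
    "HindIII": ("AAGCTT", 1),  # A^AGCTT
}


def fragment_sizes(seq, site, cut_offset):
    """Fragment sizes (largest first) from digesting a LINEAR sequence."""
    seq = seq.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return sorted((b - a for a, b in zip(bounds, bounds[1:])), reverse=True)


def differential_enzymes(seqs, enzymes=ENZYMES):
    """Return only enzymes whose fragment patterns differ between sequences."""
    result = {}
    for name, (site, offset) in enzymes.items():
        patterns = [fragment_sizes(s, site, offset) for s in seqs]
        if any(p != patterns[0] for p in patterns[1:]):
            result[name] = patterns
    return result


# Made-up 20-base sequences with an EcoRI site in different positions.
clone_a = "AAAAGAATTCAAAAAAAAAA"
clone_b = "AAAAAAAAAAGAATTCAAAA"
print(differential_enzymes([clone_a, clone_b]))
```

List-all mode would simply report the fragment sizes for every enzyme, including those that give identical patterns.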
Analyzes a FASTA sequence for n-mer frequency (specified by user).
Useful for identifying common or rare restriction sites, etc.
Command-line interface.
Requires: Bioperl, File::IO
Current version: 1.0
Last update: 11/14/04
Source
Windows Executable (Recompiled 11/14/04)
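Counting n-mers, as described above, comes down to sliding a window of length n along the sequence. The Python sketch below is an illustration, not the tool's Perl source; the function name nmer_counts is hypothetical, and it counts overlapping windows on the given strand only.

```python
from collections import Counter


def nmer_counts(seq, n):
    """Count every overlapping n-mer in a sequence (forward strand only)."""
    seq = seq.upper()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


# Made-up sequence; list the n-mers from most to least common.
counts = nmer_counts("GATCGATCAAGCTT", 4)
for nmer, count in counts.most_common():
    print(nmer, count)
```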
Locates user-specified oligonucleotide patterns within a larger
sequence. For example, restriction digest sites and fragment sizes can
be determined for a plasmid or genome.
Requires: Tk, Win32::FileOp, Statistics::Descriptive, GD
Current version: 1.2
Last update: 11/14/04
Source
Windows Executable
A simple motif searcher. No matrices, nothing fancy, just
simple searching for base strings. Allows degenerate bases to be
used.
Requires: Tk, Bioperl
Current Version: 0.2
Last Update: 10/17/03
Source
Windows Executable
NOTE: This was not working, but has now been updated and tested (1/22/04).
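Searching for base strings with degenerate bases, as described above, can be done by expanding each IUPAC code into a character class. The Python sketch below is an illustration and not the tool's Perl source; the names IUPAC and find_motif are hypothetical, and only the forward strand is searched.

```python
import re

# Standard IUPAC nucleotide codes expanded to regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def find_motif(seq, motif):
    """Return 0-based start positions of a (possibly degenerate) motif.

    A lookahead is used so that overlapping matches are all reported.
    """
    pattern = "".join(IUPAC[base] for base in motif.upper())
    return [m.start() for m in re.finditer("(?=" + pattern + ")", seq.upper())]


# Made-up sequence; R matches A or G.
print(find_motif("ATGACGTCATGGCGTCA", "TGRCG"))
```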