This manual is a reference for using the BTSVQ software. BTSVQ is a computational tool for analyzing and visualizing microarray gene expression data. The BTSVQ clustering and visualization methodology is described in Sultan et al. The software requires Matlab version 6.0 release 12.1. BTSVQ uses the SOM toolbox for computing Self-Organizing Maps and for visualizing the data. The SOM toolbox can be found at http://www.cis.hut.fi/projects/somtoolbox/.
First-time users, please follow these analysis steps:
Load data → Plot data → Apply normalization → Plot data; if required, normalize again or select another normalization, and plot the data again → Select the “Specimens” or “Gene” tab to decide which space to cluster → Decide on the SOM topology, or press the topology tab to have the software suggest one → Compute the SOM; if the component planes are homogeneous and not visually distinguishable, select a different topology or apply another normalization, and repeat all the steps above → Press the “Partitive k-means” tab for partitive clustering → Finally, press the “BTSVQ” tab to generate the clustering results.
After completing the first round, all the steps except BTSVQ can be repeated in any order. To generate another set of results, the “Partitive k-means” tab should be pressed before “BTSVQ”.

Figure (1)
BTSVQ offers an easy interface to load data in several different formats. There is no limit to the number of rows and columns in the data.

Figure (2)
The following file formats are supported.
A typical microarray text file is shown in Figure (3). The first row and first column are taken as the specimen labels and gene labels, respectively. Files with more than one row or column of descriptors for the specimen or gene labels can also be loaded. You will be asked for the total number of columns in the text file, and prompted for the number of gene label columns and specimen label rows. Only the first row and first column are taken as the specimen label and gene ID; the remaining header lines and label columns are discarded, and the data are loaded from beyond that range. In Figure (3), the lines highlighted in yellow are not included in the analysis.

Figure (3)
To load ASCII text files, the total number of columns in the file, the number of text columns and the number of header lines should be specified.
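The loading rule described above can be illustrated with a minimal Python sketch. It is not BTSVQ's own Matlab loader: the function name `load_expression_text`, its parameters and the tab delimiter are assumptions made for illustration. It keeps only the first header row as specimen labels and the first text column as gene IDs, and discards the other header rows and text columns.

```python
def load_expression_text(text, n_label_cols=1, n_header_rows=1, sep="\t"):
    """Parse delimited microarray text into (specimen_labels, gene_labels, data).

    Hypothetical helper: n_label_cols is the number of leading text columns,
    n_header_rows is the number of leading header lines.
    """
    rows = [line.split(sep) for line in text.splitlines() if line.strip()]
    # Only the first header row supplies specimen labels; extra header rows are dropped.
    specimen_labels = rows[0][n_label_cols:]
    gene_labels, data = [], []
    for row in rows[n_header_rows:]:
        # Only the first text column supplies the gene ID; extra text columns are dropped.
        gene_labels.append(row[0])
        data.append([float(v) for v in row[n_label_cols:]])
    return specimen_labels, gene_labels, data


# Toy file with two text columns and two header lines (values are made up).
sample = (
    "GeneID\tDescription\tS1\tS2\n"
    "Type\t\ttumour\tnormal\n"
    "g1\tkinase\t0.5\t-1.2\n"
    "g2\tunknown\t2.0\t0.1\n"
)
specimens, genes, values = load_expression_text(sample, n_label_cols=2, n_header_rows=2)
```

In this toy file the second header line and the "Description" column play the role of the discarded descriptor lines.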

Figure (4)
The preferred format for Excel files is shown in Figure (5). If there is more than one text column or row for gene IDs and specimen labels, only the first row and column are used in the analysis and the rest are discarded. Note that specimen and gene labels should be text or alphanumeric; purely numeric entries should be changed to alphanumeric. Column A and Row 1 must contain text, otherwise an XLS read error will appear.

Figure (5)
Comma-separated files are loaded in the same way as text files, by specifying the total number of columns in the file, the number of text columns and the number of header lines.
Mat files containing the following variables can also be loaded:
· cnames: cell strings of specimen names; size = (no_of_specimens x 1)
· labels: cell strings of the gene ID column; size = (no_of_genes x 1)
· data: matrix; size = (no_of_genes x no_of_samples)
or
· sD (SOM data struct): information about SOM data structs can be found at http://www.cis.hut.fi/projects/somtoolbox/.
If the mat file is saved using the GUI, the variables listed above are saved automatically.
Note: If the specified number of columns exceeds the number of columns in the data file, empty columns will be added to the data.
cDNA microarray data are log ratios of intensities. It is very important to inspect the data visually before applying any clustering or normalization technique. Surface plots offer a good three-dimensional visualization of gene expression data. Sometimes outliers, or a few very high ratios, make the data very skewed. As shown in Figure (6), the raw data have some very high values while the rest of the data are more or less uniform. Any clustering method applied to such data will be biased towards the high values.

Figure (6)

Figure (7)
Normalizations are used to transform data to remove various types of noise, biases and outliers. This often results in a new range of the data that is easier to work with in further analysis. The transformation may introduce several distortions and biases, some of which improve the information content, while others may eliminate existing valid patterns. Microarray data is generally log-normalized to provide an equal spread between up and down-regulated genes.
BTSVQ provides three important normalizations: log, variance and range.
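The manual does not give formulas for these three normalizations, so the following Python sketch uses common textbook definitions as an assumption: a shifted natural logarithm, scaling to zero mean and unit variance, and scaling to the range [0, 1]. The function names are hypothetical and the exact formulas used by BTSVQ may differ.

```python
import math


def log_normalize(values):
    # Assumed form: shift so the minimum maps to 0, then take the natural log.
    lo = min(values)
    return [math.log(v - lo + 1.0) for v in values]


def variance_normalize(values):
    # Assumed form: subtract the mean and divide by the sample standard deviation.
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if std == 0.0:
        return [0.0 for _ in values]  # constant input carries no variance to scale
    return [(v - mean) / std for v in values]


def range_normalize(values):
    # Assumed form: map the minimum to 0 and the maximum to 1.
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# A single very high ratio dominates the raw scale but not the log scale.
raw = [1.0, 2.0, 3.0, 1000.0]
compressed = log_normalize(raw)
```

Under these assumed definitions, the log transform compresses a few very high ratios, which is the situation the surface plots are meant to reveal.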

Figure (8)
Partitive k-means is a splitting (divisive) hierarchical clustering method. It starts with the whole data set as a single cluster, which is partitioned into two disjoint subsets such that the inter-cluster distance is maximized. Each of the two subsets is then further subdivided into two subsets in the same way, and so on, thus building a binary tree.
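The recursive splitting can be sketched in Python as repeated 2-means clustering. This is a minimal illustration under stated assumptions, not BTSVQ's implementation: the function names, the deterministic seeding and the stopping rule (`min_size`) are all hypothetical. Note that 2-means minimizes within-cluster scatter, which for a fixed data set is equivalent to maximizing the between-cluster scatter.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def two_means(points, n_iter=20):
    """Split points into two groups; returns two lists of point indices."""
    # Deterministic seeding (an assumption): the point farthest from the overall
    # centroid, and the point farthest from that first seed.
    c = centroid(points)
    a = max(points, key=lambda p: sq_dist(p, c))
    b = max(points, key=lambda p: sq_dist(p, a))
    centres = [a, b]
    groups = (list(range(len(points))), [])
    for _ in range(n_iter):
        groups = ([], [])
        for i, p in enumerate(points):
            k = 0 if sq_dist(p, centres[0]) <= sq_dist(p, centres[1]) else 1
            groups[k].append(i)
        if not groups[0] or not groups[1]:
            break  # degenerate split, e.g. all points identical
        new_centres = [centroid([points[i] for i in g]) for g in groups]
        if new_centres == centres:
            break  # assignments have converged
        centres = new_centres
    return groups


def build_tree(points, labels, min_size=2):
    """Recursively bisect the data, returning a nested dict (binary tree)."""
    node = {"members": list(labels), "children": []}
    if len(points) >= min_size:
        left, right = two_means(points)
        if left and right:
            for idx in (left, right):
                child = build_tree([points[i] for i in idx],
                                   [labels[i] for i in idx], min_size)
                node["children"].append(child)
    return node


# Toy specimens in a two-gene space (values are made up).
specimen_labels = ["A", "B", "C", "D"]
profiles = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
tree = build_tree(profiles, specimen_labels)
```

Each node of the resulting dictionary lists the specimens it contains, and its two children hold the two subsets produced by the split.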
Several techniques have been used to visualize this high-dimensional data. The self-organizing map (SOM) algorithm (Kohonen 2001) is an efficient tool for the visualization of multidimensional data. SOMs have previously been shown to be effective for the exploratory analysis of gene expression data. SOMs are neural network algorithms widely used in data analysis and vector quantization. The algorithm is similar to k-means clustering, with the additional constraint that cluster centres are restricted to lie on a two-dimensional manifold. SOMs show two main characteristics: they realize a quantization of a high-dimensional space, as do other vector quantization techniques such as LBG (Gersho and Gray 1992) and k-means, and they exhibit a topological property that allows one to analyze the ordering of centroids. Component planes of SOMs are the planes of Voronoi tessellations, each representing a specimen in a microarray experiment. Figure (8) presents quantized gene expression visualizations by component planes of the SOM.
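To make the relationship between the map and its component planes concrete, here is a minimal online SOM sketch in Python. It is a toy illustration only and does not use the SOM toolbox: the function names, the rectangular grid, the Gaussian neighbourhood and the linear decay schedules are all assumptions. Rows of the input are genes and columns are specimens, so component plane j shows the map units' values for specimen j.

```python
import math
import random


def train_som(data, rows, cols, n_epochs=30, lr0=0.5, seed=0):
    """Train a small rectangular SOM; returns codebook[r][c] = weight vector."""
    rng = random.Random(seed)
    # Initialise every map unit with a randomly chosen data vector.
    codebook = [[list(rng.choice(data)) for _ in range(cols)] for _ in range(rows)]
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(n_epochs):
        frac = epoch / n_epochs
        lr = lr0 * (1.0 - frac)                   # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.5)   # neighbourhood radius shrinks
        for x in rng.sample(data, len(data)):
            # Best-matching unit: the map unit whose weight vector is closest to x.
            br, bc = min(
                ((r, c) for r in range(rows) for c in range(cols)),
                key=lambda rc: sum((w - v) ** 2
                                   for w, v in zip(codebook[rc[0]][rc[1]], x)),
            )
            for r in range(rows):
                for c in range(cols):
                    # Gaussian neighbourhood on the 2-D grid: units near the
                    # best-matching unit are pulled towards x more strongly.
                    h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2.0 * sigma ** 2))
                    w = codebook[r][c]
                    for j, v in enumerate(x):
                        w[j] += lr * h * (v - w[j])
    return codebook


def component_plane(codebook, j):
    """Extract the rows x cols grid of values for input dimension j."""
    return [[unit[j] for unit in row] for row in codebook]


# Rows are genes, columns are specimens (toy values, not real data).
genes = [[0.1, 2.0], [0.2, 1.8], [1.9, 0.1], [2.1, 0.3], [1.0, 1.0], [0.0, 2.2]]
codebook = train_som(genes, rows=3, cols=2)
plane_for_first_specimen = component_plane(codebook, 0)
```

The neighbourhood term is what distinguishes this update from plain k-means: without it, each unit would move independently and the grid ordering would be lost.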

Figure (8)
Figure (9) shows the component planes of the whole data set.
Figure (9)
BTSVQ merges the results of SOM (gene space) and partitive k-means (specimen space). The algorithm uses the vector quantization and self-organizing capabilities of SOMs to find significant gene centers in gene space (high dimensionality and a large number of clusters), and the effectiveness of k-means in experiment space (medium dimensionality and a low number of clusters). The resulting binary tree of specimens, with SOM component planes of specimens at the nodes, is shown in Figure (10).

Figure (10)
Output is generated in a subdirectory named after the current time (DD-MM-YYYY_hh.mm) in the parent directory from which the data file was loaded.
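As an illustration of this naming scheme, the following Python sketch builds such a path. It assumes that the pattern means day-month-year_hour.minute with a four-digit year, and that the subdirectory is created next to the data file; the function name `output_dir_for` and the example path are hypothetical.

```python
from datetime import datetime
from pathlib import PurePosixPath


def output_dir_for(data_file, now=None):
    """Return the timestamped output directory for a given data file path."""
    now = now or datetime.now()
    # Assumed reading of the pattern: day-month-year_hour.minute
    stamp = now.strftime("%d-%m-%Y_%H.%M")
    return PurePosixPath(data_file).parent / stamp


# Hypothetical data file and a fixed timestamp, for a reproducible example.
out_dir = output_dir_for("/data/arrays/lung.txt", datetime(2003, 5, 17, 14, 30))
```

Because the name is resolved only to the minute, two runs started within the same minute would map to the same directory under this reading.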
The output directory contains:
· LOG file listing the data file name and the last normalization applied to the data.
· Partitive clustering results file (ptree.txt).
· BTSVQ clustering results (children of the BTSVQ tree, labeled with the level and child).