SOTA 2001. Parameter description

File of Profiles

This parameter specifies the name of the SOTA input file containing the gene expression profile data. The file format is plain ASCII text. Each line of the file represents a gene: the gene name followed by the expression values for each condition, separated by tabs. Each line may be at most 10,240 characters long, including the gene name and expression values. A line beginning with the '#' symbol is ignored (i.e. treated as a comment).

The gene expression profiles in the SOTA input file can be grouped by being assigned to gene classes. To do this, put a '@' symbol followed by a number after the gene name. In doing this you are assigning this gene to the gene class specified by the number. You can do this for all the profiles, thereby assigning every gene to a gene class.

To assign the first three genes to gene class 0 and the rest to gene class 45 (consecutive numbers are not required), the SOTA input file should appear as follows:
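A minimal sketch in Python of such a file and of one way to read it. The gene names and expression values are invented for illustration, and the exact placement of the '@' marker (assumed here to be appended directly to the gene name, before the first tab) is an assumption rather than something this description confirms:

```python
import io

# Hypothetical sample data: names and values are invented for illustration.
SAMPLE = (
    "# lines starting with '#' are comments\n"
    "geneA@0\t0.5\t1.2\t-0.3\n"
    "geneB@0\t0.1\t0.9\t-0.8\n"
    "geneC@0\t0.4\t1.0\t-0.1\n"
    "geneD@45\t-1.1\t0.2\t0.7\n"
    "geneE@45\t-0.9\t0.0\t0.6\n"
)

MAX_LINE_LENGTH = 10240  # limit stated for the File of Profiles parameter


def parse_profiles(stream):
    """Return a list of (gene_name, gene_class_or_None, values) tuples."""
    profiles = []
    for raw in stream:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if len(line) > MAX_LINE_LENGTH:
            raise ValueError("line exceeds 10,240 characters")
        name, *fields = line.split("\t")
        gene_class = None
        if "@" in name:
            # Assumed layout: <gene name>@<class number>
            name, _, class_text = name.rpartition("@")
            gene_class = int(class_text)
        profiles.append((name, gene_class, [float(f) for f in fields]))
    return profiles


for name, gene_class, values in parse_profiles(io.StringIO(SAMPLE)):
    print(name, gene_class, values)
```

Genes without an '@' marker are returned with a class of None, which matches the statement that class assignment is optional.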

This is useful when you have many genes and need to see quickly where certain genes are located within the clustering. You can then read the results from the pattern of colours in the image, without having to read all the text labels.

This parameter specifies the key to be displayed at the bottom of the image, which gives the description of each gene class. The format of the input for this parameter is a string of repetitions of the following sequence: "@<n>;<s>" where <n> is a number specifying a gene class and <s> is a string describing this gene class.

These are some files that you can run with SOTA:

Distance Function

This parameter specifies the function to be used to represent the "distance" between two gene expression profiles. The distance functions currently supported are:

Variability Threshold

The variability of a node is the maximum distance among all the profiles associated with this node. If the variability of a particular node falls below the threshold specified here, then this node will not be split any further. The variability threshold can therefore be used to determine convergence of the network (see "Variability Criteria" parameter). This parameter can be expressed as a percentage over all profiles or as an absolute value. The default value is 90%.
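As an illustration of this definition, the following Python sketch computes the variability of a node as the largest pairwise distance among its profiles and compares it with an absolute threshold. The use of Euclidean distance and the function names are assumptions made for the example; the percentage form of the threshold is not covered.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def variability(profiles, distance=euclidean):
    """Maximum distance among all pairs of profiles assigned to a node."""
    return max((distance(a, b) for a, b in combinations(profiles, 2)),
               default=0.0)


def may_split(profiles, threshold):
    """A node is not split further once its variability falls below the threshold."""
    return variability(profiles) >= threshold


node_profiles = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]
print(variability(node_profiles))
print(may_split(node_profiles, threshold=6.0))
```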

Resource Threshold

The resource of a node is the mean distance between this node and its associated profiles. If the resource of a particular node falls below the threshold specified here, then this node will not be split any further. The resource threshold is therefore used to determine convergence of the network. The default value is zero.
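The resource can be sketched in Python as follows. Euclidean distance, the function names, and the treatment of a node with no profiles (resource taken as zero) are assumptions made for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def resource(node_vector, profiles, distance=euclidean):
    """Mean distance between a node and the profiles associated with it."""
    if not profiles:
        return 0.0  # assumption: an empty node has zero resource
    return sum(distance(node_vector, p) for p in profiles) / len(profiles)


def may_split(node_vector, profiles, threshold=0.0):
    """A node is not split further once its resource falls below the threshold."""
    return resource(node_vector, profiles) >= threshold


print(resource([0.0, 0.0], [[3.0, 4.0], [0.0, 0.0]]))
```

Because a mean of distances is never negative, a threshold of zero is never undercut in this sketch, so with that value growth would have to be stopped by another criterion such as the maximum number of cycles.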

Relative Error Threshold

This parameter determines cycle convergence. A cycle will end when either the relative error falls below the threshold specified here, or the maximum epoch number is reached. The default value is 0.0001.

Max. Cycles

As each cycle corresponds to the splitting of a node, by specifying a value for the maximum number of cycles, you can force training of the network to stop at a certain number of clusters. By specifying a maximum of N cycles, you will obtain at most (N + 1) clusters at the end of the process.

Max. Epochs

This parameter indicates the maximum number of epochs allowed per cycle and hence can determine cycle convergence. Each cycle consists of a series of epochs, in which a node is trained until the relative error of the network falls below its threshold or the maximum number of epochs is reached (the default value is 1000). If some nodes have no associated profiles when the maximum epoch number is reached, then the training ends.
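The two stopping conditions of a cycle can be sketched in Python as below. This is only an illustration: the relative error is assumed to be the absolute change in network error between consecutive epochs divided by the previous error, and train_epoch is a hypothetical callable that runs one epoch and returns the network error. The check for nodes without associated profiles is not modelled.

```python
def run_cycle(train_epoch, threshold=0.0001, max_epochs=1000):
    """Run epochs until the relative error drops below threshold
    or max_epochs is reached. Returns (epochs_run, reason)."""
    previous = None
    for epoch in range(1, max_epochs + 1):
        error = train_epoch()
        if previous is not None:
            if previous == 0:
                relative = 0.0 if error == 0 else float("inf")
            else:
                # Assumed definition of the relative error.
                relative = abs(previous - error) / abs(previous)
            if relative < threshold:
                return epoch, "converged"
        previous = error
    return max_epochs, "max_epochs"


# Hypothetical sequence of network errors, one per epoch.
errors = iter([10.0, 5.0, 4.9999])
print(run_cycle(lambda: next(errors)))
```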


This indicates the level of assignment of profiles when a node is not to be split any further, ie. the number of levels in the hierarchy whose leaf nodes are targets for the presentation of the profiles. With a value of zero, profiles will be assigned to one of the nodes above the current node. For higher values, profiles will be assigned to every other node at levels as far back as indicated by this parameter. The default value is zero.

Update Factors

When a profile is assigned to a node (the "winning" node), this node together with its mother and sister nodes are updated. The effect of this update is to reduce the distance between each of these nodes and the assigned profile by a certain proportion. The update factors specify this proportion separately for each of the winning, mother and sister nodes.
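The effect of an update factor can be sketched in Python as follows: moving a node towards the profile by a proportion f leaves a distance of (1 - f) times the original. The factor values shown are hypothetical and chosen only for illustration; they are not taken from this description.

```python
import math


def update_node(node, profile, factor):
    """Move a node vector towards the profile by the given proportion."""
    return [w + factor * (x - w) for w, x in zip(node, profile)]


# Hypothetical update factors for the winning, mother and sister nodes.
factors = {"winner": 0.5, "mother": 0.25, "sister": 0.1}

profile = [1.0, 2.0]
nodes = {
    "winner": [0.0, 0.0],
    "mother": [0.5, 0.5],
    "sister": [2.0, 2.0],
}

for role, vector in nodes.items():
    updated = update_node(vector, profile, factors[role])
    before = math.dist(vector, profile)
    after = math.dist(updated, profile)
    print(role, updated, before, after)
```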

Variability Criteria

With this parameter you can change the way in which the network grows by specifying that variability should be used as the convergence criterion (rather than resource). Setting this option means that both the resource and variability values for each node will be written to the codebook vectors file.

Standard Newick Output

This parameter specifies that the SOTA output file containing the clustered tree structure should be in standard Newick format (rather than rich, full Newick format). This may be useful if an external viewer is to be used, but may imply some loss of information.

SOTATree and TreeView. Parameter description


This parameter specifies the file containing the SOTA output, ie. the generated cluster tree, which should be in rich, full Newick format.

Profile Step

This specifies the field width for display of the tree.


This specifies the size of the margins between the various elements of the image, i.e. both the separation between the tree and the left-hand side of the image and the separation between the tree and the circles representing the clusters (SOTATree).

Tree Width

This specifies the width of the final tree in the image.


This specifies the size of the margins between the various elements of the image, e.g. both the separation between the tree and the left-hand side of the image and the separation between the tree and the cluster circles.

Vertical Separation

This parameter is the vertical separation between the genes represented in the image.

Draw Scale

This parameter indicates whether the scale should appear below the image.


In the SOTATree view, you can display the clusters as circles in absolute or relative mode, or with no circles.

Text Labels

(SOTATree) This parameter indicates whether to display the gene class information for each cluster.


(SOTATree) This sets the number of nodes to be displayed. If you select just one node for display, the only node you will see is the initial node with all the profiles associated with it, i.e. the state prior to clustering. Other values will force display of the nodes chosen in the list.

Codebook Vectors

(SOTATree) This parameter specifies the name of a file which can be built to record the tree structure and hold the expression profiles associated with each node in a human-readable form. The file will contain a line for each node of the tree which specifies the child nodes for this node and which is then followed by a line giving the expression profile values for this node.

You can specify a file to hold the expression profiles associated with each node of the tree, together with the tree structure. Each node is written as the node name followed by the names of its child nodes, and the next line gives the expression profile associated with that node. The nodes must be written in pre-order: first the node itself, then its left subtree, and finally its right subtree, until the whole tree has been written. Note that the tree is binary.
An example of this file is:
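A minimal sketch in Python that generates lines in this layout for a small hypothetical tree. The node names, the profile values, and the separators (spaces between node names, tabs between expression values) are assumptions made for the example and may not match the exact format used by SOTATree.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    profile: List[float]
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def codebook_lines(node):
    """Return the file lines for a binary tree in pre-order:
    the node itself, then its left subtree, then its right subtree."""
    if node is None:
        return []
    children = [c.name for c in (node.left, node.right) if c is not None]
    header = " ".join([node.name] + children)           # node, then its children
    values = "\t".join(str(v) for v in node.profile)    # profile on the next line
    return ([header, values]
            + codebook_lines(node.left)
            + codebook_lines(node.right))


# Hypothetical tree: n0 has children n1 and n2; n1 has children n3 and n4.
tree = Node("n0", [0.0, 0.1],
            left=Node("n1", [0.5, 0.4],
                      left=Node("n3", [0.6, 0.5]),
                      right=Node("n4", [0.4, 0.3])),
            right=Node("n2", [-0.5, -0.2]))

print("\n".join(codebook_lines(tree)))
```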

Draw profiles as

(SOTATree) This parameter specifies the way that the expression profiles are to be displayed, ie. as a line graph or as a histogram.

Adjust profile scale

(SOTATree) This parameter indicates that the scale for displaying the expression profiles is to be adjusted to the maximum and minimum expression values of each profile, with the origin always present as a reference.
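One way to read this is sketched below in Python: the displayed range of each profile runs from its minimum to its maximum value, widened where necessary so that zero (the origin) is always inside the range. The function name and this interpretation of "origin always present" are assumptions.

```python
def profile_scale(profile):
    """Return (low, high) display bounds for one profile,
    always including the origin."""
    low = min(0.0, min(profile))
    high = max(0.0, max(profile))
    return low, high


print(profile_scale([2.0, 5.0, 3.0]))
print(profile_scale([-1.0, 4.0]))
print(profile_scale([-3.0, -1.0]))
```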

Set the same scale for all profiles

(SOTATree) This specifies that the profiles should all be displayed using the same scale.

Draw Cuts

(SOTATree) This is specified as a sequence of pairs: <n1> <s1> <n2> <s2> <n3> <s3> ..., where <n1> is a number indicating that the first <n1> expression values of a profile are to be grouped together, and <s1> is a string giving the name of this group of values. The next <n2> values are then also grouped, and so on. Vertical lines are displayed separating these groups within each profile, with the appropriate name appearing below each group. An example input string for this parameter is: 10 GroupA 30 GroupB 4 GroupC 17 GroupD
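The grouping described by such a string can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes that group names contain no whitespace. Each group is returned with the half-open range of value positions it covers, counting from zero.

```python
def parse_cuts(spec):
    """Parse '<n1> <s1> <n2> <s2> ...' into (name, start, end) tuples,
    where the group covers value positions start..end-1."""
    tokens = spec.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected pairs of <count> <name>")
    groups = []
    start = 0
    for count_text, name in zip(tokens[0::2], tokens[1::2]):
        count = int(count_text)
        groups.append((name, start, start + count))
        start += count
    return groups


for name, start, end in parse_cuts("10 GroupA 30 GroupB 4 GroupC 17 GroupD"):
    print(name, start, end)
```

In this sketch a vertical line would fall at each group boundary, and the counts in the example string add up to 61 expression values per profile.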

Distance Used

(SOTATree) With this parameter you can specify the definition of distance to be used in drawing the tree. This option is useful when the SOTA output file contains both the resource and variability distances (included by setting the Variability Criteria parameter on the SOTA form).

Name Width

(TreeView) This specifies the field width used for displaying the gene names.

Label Width

(TreeView) This specifies the width of the coloured bars used to represent gene classes.

Max. Scale

(TreeView) This parameter specifies the maximum value of the scale to be used in displaying the gene expression values of the profiles. By setting this, all profiles will be represented using a scale of [-MaxScale ... MaxScale].
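A possible reading of this scale is sketched below in Python: each expression value is mapped onto the range [-1, 1] relative to MaxScale. Clipping values that lie outside [-MaxScale ... MaxScale] is an assumption, as this description does not say how such values are treated, and the function name is hypothetical.

```python
def to_display_scale(value, max_scale):
    """Map an expression value to [-1, 1] on a scale of [-max_scale, max_scale].
    Values outside the scale are clipped (assumed behaviour)."""
    if max_scale <= 0:
        raise ValueError("max_scale must be positive")
    clipped = max(-max_scale, min(max_scale, value))
    return clipped / max_scale


print(to_display_scale(1.0, 2.0))
print(to_display_scale(5.0, 2.0))
print(to_display_scale(-3.0, 2.0))
```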

Gene Names

(TreeView) This option specifies whether the gene names are to be displayed or not.


(TreeView) This option specifies whether the coloured bars representing gene classes are to be displayed or not.