Presentation


What does it do?

A distance between two vectors quantifies the dissimilarity between them; identical vectors have a distance of 0. In biological applications, for instance, gene expression profiles triggered by various treatments are frequently compared using such metrics.

However, comparing multiple data vectors via their pairwise distances requires these vectors to be normalized, which is not always desirable. Another limitation is that there is no easy way to associate a p. value with a distance value in order to assess the significance of that particular distance. A common method to evaluate this significance is to perform a permutation (also known as random sampling) analysis. Assuming that two linked vectors yield a distance D, then if the pairwise relationship is broken (by shuffling the dataset), the newly computed distance D’ should be greater than the original one.

By repeating the shuffling / computing D’ process many times, we can evaluate the range of distances that can be obtained by chance, and then compare the original “true” distance D to the distribution of D’. The probability of getting, by chance, a distance at least as small as the original one can be approximated as the p. value.
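
The procedure described above can be sketched in a few lines of Python. This is only an illustration, not the code run by the server: it assumes the Euclidean distance, assumes that only one of the two vectors is shuffled, and counts a shuffled distance as a hit when it is lower than or equal to the original one. The function names are made up for this example.

```python
import math
import random


def euclidean(x, y):
    """Euclidean distance between two equal-length sequences of numbers."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def permutation_p_value(x, y, iterations=1000, seed=0):
    """Return the original distance D and the fraction of shuffles with D' <= D.

    Assumption: the pairwise relationship is broken by shuffling y only.
    """
    rng = random.Random(seed)
    original = euclidean(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(iterations):
        rng.shuffle(shuffled)  # break the pairing between x and y
        if euclidean(x, shuffled) <= original:
            hits += 1
    return original, hits / iterations


# Two closely related vectors: shuffling should almost always increase D.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.1, 2.2, 2.9, 4.1, 5.3, 5.8, 7.2, 8.1]
d, p = permutation_p_value(x, y)
print(f"D = {d:.3f}, p. value = {p:.3f}")
```

With more iterations, the estimated p. value becomes more precise, at the cost of a longer computation.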



Walkthrough


New analysis

In this tutorial, we will submit a simple job using a dummy dataset and basic options for faster generation of the results page.

  1. First, click on Analysis > Start a new analysis in the menu bar.
  2. Load one of the example datasets by clicking on Load a dummy dataset (faster).
  3. Submit this dataset by clicking the green button.
    A green flash message should confirm that your dataset is valid.
  4. Click on the Review your dataset tab.
  5. You can apply a heatmap on this dataset by checking the corresponding checkbox (Display Heatmap, right panel). Feel free to change the color scheme if you would like.
  6. Click on the Analysis options tab.
  7. [OPTIONAL] We suggest you enter your email address for easier retrieval of this analysis, as well as a name (test) and a description (This is the tutorial job submission). Also, click the Quick Retrieval button.
  8. You are now ready to go: you can directly launch this analysis by clicking the green submission button.
  9. The next page confirms the job submission, and will redirect you to the results page after 10 seconds. If you have provided your email address, you should receive a confirmation email.
  10. The results page will automatically refresh until the run is finished. After job completion, a second email will also be sent to you.
  11. The R graphs tab presents the distribution of the distances computed after random permutation (black), and the original distances (red). They are displayed either as boxplots (upper right coverflow panel) or histograms (lower right coverflow panel). In the Filter field, type in “b vs g”. This will update the lower coverflow (histograms) to focus on the graph showing the distance between entity B and entity G (red line) as well as the distribution of the distances obtained after random sampling. You can display the full resolution picture by clicking on the thumbnail (or just hit enter from the Filter field).
    This particular graph shows that the original distance between condition B and condition G (0.804) is much smaller than any of the distances computed after randomization (ranging from ca 1.1 to 2.2), giving a “raw” p. value of 0.
    In the boxplots panel (upper right panel), scroll until you reach graph “B” (or just type “bpb” in the Filter field). Click on it to display the graph at full resolution. This representation allows you to immediately identify the different entities that are significantly close to B: D and G. The black boxplot represents the distribution of the distances obtained after random permutation, while the red star indicates the original distance.
  12. Click on the Adjacency matrix tab. Pairwise p. values are displayed as a color-coded matrix, which you can interact with. First, lower the significance cut-off to 0.01 and hit enter, to only display p. values lower than or equal to 0.01 (it will also display p. values greater than or equal to 0.99 if there are some).
    Next, re-order the matrix using the lower left drop-down menu. Select Complete linkage – threshold p.val = 0.01 as the layout. This highlights a cluster of 3 entities (B, D and G) that are significantly close to one another. Clicking on any of the colored squares will display the p. value of this particular pair.
  13. In the Network tab, the same cluster is even more visible, with the color scheme matching the clustering.
  14. Click on the Dendrogram tab, and then click on one of the tree icons to display the graph. The drop-down menu on the left allows you to switch between linkage methods.

Retrieve a previous job

To work with a more complex analysis (e.g. hormone treatments on Arabidopsis seedlings), first retrieve it from the demo account:

  1. Click on Analysis > Retrieve all my previous analyses.
  2. Type in Demo, and click on Retrieve.
    Note that if you were to use your email address, a link would be sent to you.
  3. Click on the analysis named hormones.
    This will display a summary of this analysis, a quick preview of the dataset that was used, and links to edit / delete this job. As you can see in the summary, for this analysis the distribution density of the distances after permutation was calculated, the dataset was normalized, and the p. value ~ distance function was computed.
  4. Click on the green link to get to the results page.
    First, notice that all histograms (R Graphs tab) now show the density curves (blue line) that were used for computing the p. values (hereafter called density-based p. values).
    Also, in the Matrix, Network and Dendrogram tabs, you can use either the density-based or the raw p. values. You can also decide to use the normalized distances to generate dendrograms.
  5. A new tab (Sigmoid) is also now available, which features the p. value ~ distance function. You can see that the computed equation fits the dataset with a correlation coefficient (R2) of 0.983, and allows for direct calculation of a p. value corresponding to a particular distance, or of a distance given a specific p. value (rightmost panel).


Job submission


Although you can customize the way your dataset is being analyzed, it can also be very fast and easy (just a couple of clicks) to submit a basic job.

To start a new analysis, go to the menu bar and click on Analysis > Start a new analysis.

Dataset

You can upload your dataset by either copying/pasting it into the text box, or by selecting a text file located on your computer. Alternatively, you can load one of our example datasets by clicking on the corresponding links on the right.

Your dataset must meet the following criteria:

  1. It must be in tab-delimited format.
  2. It must only contain numbers. However, the first column and the first row may contain non-numerical data.
  3. Blank cells are allowed.
  4. The first row should correspond to the column names, and should therefore consist of unique names.
  5. Datasets of more than 60 columns are not allowed.

Also, keep in mind that:

  1. Entities to be compared (biological treatments for instance) are the columns.
  2. If the first column contains non-numerical data, or if the first row is one cell off, then the first column will be considered as the row names.
  3. If your dataset contains more than 45 columns (45 columns already amount to 990 pairwise comparisons), density graphs (histograms) will not be rendered. Only boxplots panels will be generated, as well as the in-browser (JavaScript) graphs.
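
If you want to check a file before uploading it, the rules above can be approximated with the following Python sketch. It is not the validation performed by the server: the function name is hypothetical, and it assumes that every data row starts with a row name and that the 60 and 45 column limits apply to the data columns only. The number of comparisons for n columns is n × (n − 1) / 2, which gives 990 for 45 columns.

```python
import math


def check_dataset(text, max_columns=60, histogram_limit=45):
    """Check a tab-delimited dataset whose first row holds the column names.

    Assumptions: the first column holds row names, and the column limits
    count data columns only.
    """
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    if len(rows) < 2:
        raise ValueError("a header row and at least one data row are needed")
    header, body = rows[0], rows[1:]
    # The header may be "one cell off", i.e. lack a cell above the row names.
    names = header if len(header) == len(body[0]) - 1 else header[1:]
    if len(set(names)) != len(names):
        raise ValueError("column names must be unique")
    if len(names) > max_columns:
        raise ValueError(f"more than {max_columns} columns")
    for row in body:
        for cell in row[1:]:
            if cell.strip():  # blank cells are allowed
                float(cell)   # raises ValueError on non-numerical data
    return {
        "columns": len(names),
        "comparisons": math.comb(len(names), 2),
        "histograms_rendered": len(names) <= histogram_limit,
    }


example = "gene\tA\tB\tC\ng1\t1.0\t2.5\t\ng2\t0.3\t\t4\n"
print(check_dataset(example))
```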

Analysis options

For a simple, faster evaluation of the significance of the distance values, you can just click Launch this analysis. This will use the default Euclidean distance. However, you are still advised to enter, at least, your email address for easier retrieval.

Distance

This is the distance to be used to compare the data vectors:

  1. Euclidean

    This is the default distance, to be used in most cases.

  2. Manhattan
  3. Minkowski
  4. Canberra
  5. Maximum (Chebyshev)
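
For reference, the five distances listed above can be written as follows in Python. These are the usual textbook definitions, given here as a sketch; the server-side implementation may differ in details, for instance in the Minkowski exponent it uses or in how blank cells and zero denominators are handled.

```python
import math


def euclidean(x, y):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, p=3.0):
    """Generalization of the two above (p = 1: Manhattan, p = 2: Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def canberra(x, y):
    """Sum of |a - b| / (|a| + |b|); terms with a zero denominator are skipped."""
    total = 0.0
    for a, b in zip(x, y):
        denominator = abs(a) + abs(b)
        if denominator:
            total += abs(a - b) / denominator
    return total


def chebyshev(x, y):
    """Largest absolute difference (the "Maximum" distance)."""
    return max(abs(a - b) for a, b in zip(x, y))


x, y = [0.0, 3.0, 4.0], [0.0, 0.0, 0.0]
for distance in (euclidean, manhattan, minkowski, canberra, chebyshev):
    print(distance.__name__, distance(x, y))
```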

Number of iterations

Number of shuffling iterations used to compute the distribution of the distances obtained after randomization of the vectors. You can enter any number between 1,000 and 100,000.

The first column is row names

Checkbox that indicates whether the first column contains row names and should therefore be ignored in the computations. If the first column contains non-numerical values, then this parameter is automatically set to ON.

Compute density

This is where you tell RS WebTool that you want the density of the distances distribution to be computed.

Density Kernel

More advanced users can select the density kernel to be used. Gaussian is the default. You can also choose to use rectangular, triangular, epanechnikov, biweight, cosine or optcosine.

Density Bandwidth

By default, the R Nrd density bandwidth calculation method is used (see the R documentation for more details on how the density is calculated). However, this method reaches its limitations when distance values after randomization have a low standard deviation. You can therefore:

  • Set a minimum value for the bandwidth. Nrd will still be used unless it is lower than your threshold.
  • Set the bandwidth value to be used.
  • If you normalize the dataset, you can use the optimized bandwidth. This process empirically determines the bandwidth to be used, based on the standard deviation of the density. It raises the bandwidth and re-computes the density as long as the standard deviation of the density is below 1.

Example graphs (distribution of the distances calculated after random permutation):

  • No density calculation. The calculated "raw" p. value is 0.008.
  • Density calculation using the Nrd method for bandwidth determination (bandwidth = 2.85e-14). The calculated p. value is 1.
  • Density calculation using the optimized Nrd method for bandwidth determination (bandwidth = 1.09e-01). The calculated p. value is 8.899e-03.
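
To make the minimum bandwidth option more concrete, here is a Python sketch. It assumes that "Nrd" refers to the rule of thumb that R's density() function uses by default (bw.nrd0), and that a density-based p. value is the area of a Gaussian kernel density estimate lying at or below the original distance. Both points are assumptions made for illustration, and the function names are hypothetical; the optimized bandwidth procedure is not reproduced here.

```python
import math
import statistics


def nrd0_bandwidth(values):
    """Rule-of-thumb bandwidth: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    sd = statistics.stdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    # Fall back to other spreads when the preferred one is zero, as R does.
    spread = min(sd, (q3 - q1) / 1.34) or sd or abs(values[0]) or 1.0
    return 0.9 * spread * len(values) ** -0.2


def floored_bandwidth(values, minimum):
    """Use the rule-of-thumb bandwidth unless it is lower than `minimum`."""
    return max(nrd0_bandwidth(values), minimum)


def density_p_value(distance, permuted, bandwidth):
    """Area of a Gaussian kernel density estimate at or below `distance`."""
    scale = bandwidth * math.sqrt(2.0)
    areas = [0.5 * (1.0 + math.erf((distance - d) / scale)) for d in permuted]
    return sum(areas) / len(permuted)


permuted = [1.0, 2.0, 3.0, 4.0, 5.0]  # toy values, not real results
bandwidth = floored_bandwidth(permuted, minimum=0.1)
print(bandwidth, density_p_value(0.5, permuted, bandwidth))
```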

Normalize the dataset

While dataset normalization will not affect the computed p. values, it will change the distance values between two data vectors.

Fit a p.val ~ distance function

If checked, RS WebTool will try to fit a p.val ~ distance sigmoid. Only available for normalized datasets.

Optional fields

For easier retrieval and job management, you can provide your email address, an analysis name and a description. If you check Quick Retrieval, you will be able to easily retrieve your analysis by just providing its name and your email address. See How do I retrieve the results of a previous job? for more information.

Review your dataset

This tab allows you to review your dataset and use a HeatMapping plugin.



How do I retrieve the results of a previous job?


All submitted jobs will remain on the server for an unlimited time. You can retrieve any of your previous analyses in one of the following ways:

Bookmarked page

The results pages are permalinks. Once the analysis has been launched, you will be redirected to the results page, which you can bookmark for later access.

Results listing

If you have provided your email address when launching your analysis, you can request a link to a summary of all the analyses matching that same email address. To ensure all your analyses remain private, this link will be emailed. The link that will be sent to you will remain valid forever, and can be re-used for future access to your jobs summary.

In the menu bar, click on Analyses > Retrieve all my previous analyses.

Enter your email address and click Retrieve.

Note that only the analyses for which you have provided your address will be displayed in the summary.

If quick retrieval has been checked

If you have provided your email address AND a name for a particular analysis, you will be able to directly access the results page of this particular job by clicking on:

Analyses > Retrieve a previous analysis

Enter your email address and analysis name, and click Retrieve.

If this analysis exists, you will be redirected to the results page, and if several jobs match the same criteria (email address and name) you will be asked to choose from a list. However, this quick retrieval of an analysis is only enabled if you have checked "quick retrieval" at the time you launched the job.



The results page


R Graphs

Lets you visualize all the R-generated graphs. Thumbnails in the coverflow can be clicked to display full resolution, publication-quality graphs.

Histograms of distance distributions

Each histogram shows the actual distance between the two data vectors to be compared (red line), and the distribution of the distances obtained after randomization of the two data vectors to be compared (black histogram). If the density has been computed, it is displayed in blue. The actual distance and the p. value are indicated at the bottom of the graph.

Boxplots panels

For each entity (= column in the dataset) a boxplots panel summarizes the distance between this particular data vector and every other (red star), with the distribution of the distances after vector randomization being displayed as a boxplot.

Adjacency matrix

To clearly visualize distance significance between all entities, p. values can be displayed as an adjacency matrix.

The p. values to be used can be selected from the upper drop-down menu (Dataset). The "raw" p. values are always available, while the density-based ones are only available if they have been computed. Each square can be clicked to display its attributes (entities compared and p. value). The significance threshold to be applied for visualization can be set (Display) to hide / show p. values matching particular criteria. Values lower than the threshold are displayed in red, and values greater than 1-threshold are displayed in blue. If the applied threshold corresponds to a significance cut-off you wish to apply (0.01 for instance), then red squares indicate significantly close vectors, and blue squares indicate vectors that are significantly different. Setting a cut-off at 0.5 will display all p. values.
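
The display rule can be summarized by the following Python sketch. The function name is hypothetical, and the comparisons are assumed to be inclusive, as stated in the walkthrough (lower than or equal to the cut-off, greater than or equal to 1 minus the cut-off).

```python
def square_color(p_value, threshold=0.01):
    """Return the color of a matrix square, or None if the square is hidden."""
    if p_value <= threshold:
        return "red"   # significantly close vectors
    if p_value >= 1.0 - threshold:
        return "blue"  # significantly different vectors
    return None


for p in (0.0, 0.3, 0.995):
    print(p, square_color(p))
```

With a threshold of 0.5, every p. value satisfies one of the two conditions, which is why all squares are then displayed.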

You can adjust how the matrix is rendered by modifying its size (Matrix), as well as the font size.

The position of the rows/columns in the matrix can be changed to match pre-computed clusters. These clusters were generated by hierarchical clustering in R (Average linkage, Single linkage or Complete linkage) followed by tree cutting at various heights (0.05, 0.01, 1e-3, 1e-4, 1e-5, 1e-6).

Network view

A force-layout network can be rendered with the entities as nodes and their pairwise distance p. values as edges. You can choose to display only the edges corresponding to p. values lower than your chosen significance threshold, and color-code the nodes according to the clusters pre-computed in R. These clusters were generated by hierarchical clustering in R (Average linkage, Single linkage or Complete linkage) followed by tree cutting at various heights (0.05, 0.01, 1e-3, 1e-4, 1e-5, 1e-6).

A SIF file is also generated for direct copy/paste use of the data in any network-rendering software (Cytoscape, Gephi, …).
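
As an illustration of what such a file contains, the Python sketch below writes one SIF line (source node, relationship, target node) for every pair of entities whose p. value is at or below a threshold. The relationship label "close_to", the function name and the toy p. values are made up for this example; the file generated by RS WebTool may use a different label and different filtering.

```python
def to_sif(names, p_values, threshold=0.01, relation="close_to"):
    """Build SIF lines for every pair with a p. value <= threshold."""
    lines = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # each pair only once
            if p_values[i][j] <= threshold:
                lines.append(f"{names[i]}\t{relation}\t{names[j]}")
    return "\n".join(lines)


# Toy symmetric matrix of p. values (illustrative numbers only).
names = ["B", "D", "E"]
p_values = [
    [0.0, 0.001, 0.4],
    [0.001, 0.0, 0.6],
    [0.4, 0.6, 0.0],
]
print(to_sif(names, p_values))
```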

Dendrogram

Displays trees from R-generated Newick files. You can choose between any of the available datasets ("raw" p. values, density-based p. values, and normalized distances), and any of the following linkage methods: Complete, Average, Single, Ward, McQuitty, Centroid, Median.

The corresponding Newick file is displayed for easy copy/paste into any tree-rendering software (TreeDyn, …).

Sigmoid

If requested at the time you launched the job, the p. value ~ distance function has been computed (only available if you have normalized the dataset). This panel displays two p. value ~ distance scatterplots: the leftmost one contains all the points from your dataset, and the rightmost one displays the sigmoid between the distances corresponding to p. value = 1e-5 and p. value = 0.99999.

The Fit panel allows you to toggle on and off the computed sigmoid (in red), and displays the R2 regression coefficient.

The Equation panel displays the parameters of the sigmoid.

The converter panel allows for p. value calculation from a distance value and vice-versa, based on the computed equation.
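
The idea behind the converter can be sketched in Python with a generic logistic function and its inverse. The exact equation fitted by RS WebTool is the one shown in the Equation panel and may have a different form; the parameter names (midpoint, scale) and the values used below are hypothetical.

```python
import math


def p_from_distance(distance, midpoint, scale):
    """Logistic curve: p. value is 0.5 at `midpoint` and rises with distance."""
    return 1.0 / (1.0 + math.exp(-(distance - midpoint) / scale))


def distance_from_p(p_value, midpoint, scale):
    """Inverse of p_from_distance, valid for 0 < p_value < 1."""
    return midpoint + scale * math.log(p_value / (1.0 - p_value))


# Hypothetical parameters, for illustration only.
midpoint, scale = 1.2, 0.1
cutoff_distance = distance_from_p(0.01, midpoint, scale)
print(cutoff_distance, p_from_distance(cutoff_distance, midpoint, scale))
```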

Download

This is where you can retrieve all of the generated files:

Data

  • Always includes the following files:
    • distances.original.dat: the original distances, as a matrix.
    • distances.all.dat: all the distances (original and after permutation).
    • pval.raw.dat: all the "raw" p. values, as a matrix.
  • May also include:
    • dataset.normalized.dat: the normalized dataset.
    • pval.den.dat: all the density-based p. values, as a matrix.

R data

  • Always includes the following files:
    • variables.R: the parameters file, sourced from the script.
    • Rimage.RData: the R workspace image, saved from the script. It contains all the variables needed for the user to re-process and re-analyze the dataset.

R graphs

  • Always includes the following files:
    • The boxplots panels
  • May also include:
    • The histograms
    • The sigmoid

Newick files

  • Always includes the following files:
    • newick.raw.{linkage_method}.tre: Newick files for "raw" p. values.
  • May also include:
    • newick.den.{linkage_method}.tre: Newick files for density-based p. values.
    • newick.dist.{linkage_method}.tre: Newick files for the normalized distances.

SIF files

  • Always includes the following files:
    • network.raw.sif: network file for "raw" p. values.
  • May also include:
    • network.den.sif: network file for density-based p. values.

JSON files

  • Always includes the following files:
    • clusters.raw.json: JSON (JavaScript Object Notation) file for "raw" p. values.
  • May also include:
    • clusters.den.json: JSON (JavaScript Object Notation) file for density-based p. values.