A distance between two vectors quantifies the dissimilarity between them; identical vectors have a distance of 0. In biological applications, for instance, it is common to compare gene expression profiles triggered by various treatments using such metrics.
However, comparing multiple data vectors via their pairwise distances requires these vectors to be normalized, which is not always desirable. Another limitation is that there is no easy way to associate a p. value with a distance value in order to assess the significance of that particular distance. A common method to evaluate this significance is to perform a permutation (also known as random sampling) analysis. Assume that two linked vectors yield a distance D. If the pairwise relationship is broken (by shuffling the dataset), the newly computed distance D’ should be greater than the original one.
By repeating the shuffling and the computation of D’ many times, we can estimate the range of distances that can be obtained by chance, and then compare the original “true” distance D to the distribution of D’. The p. value can be approximated as the probability of getting, by chance, a distance at least as small as the original one.
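The permutation procedure described above can be sketched in a few lines of Python. This is only an illustration, not the code used by RS WebTool: the function names, the choice to shuffle the values of one vector, the fixed seed and the use of "less than or equal" to count chance distances are all assumptions made for the example.

```python
import math
import random

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def permutation_p_value(u, v, iterations=1000, seed=0):
    """Return (D, p): the observed distance and its empirical p. value.

    Assumption: the pairwise relationship is broken by shuffling the
    values of v; the tool itself may shuffle the dataset differently.
    """
    rng = random.Random(seed)
    observed = euclidean(u, v)
    shuffled = list(v)
    hits = 0
    for _ in range(iterations):
        rng.shuffle(shuffled)
        # Count the shuffles that give a distance D' at least as small as D.
        if euclidean(u, shuffled) <= observed:
            hits += 1
    return observed, hits / iterations

# Two closely related dummy profiles (hypothetical values).
u = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
v = [1.1, 2.1, 2.9, 4.2, 5.0, 6.1, 6.9, 8.2]
d, p = permutation_p_value(u, v)
print(f"D = {d:.4f}, p. value = {p:.4f}")
```

With only 1,000 iterations the smallest non-zero p. value that can be reported is 0.001, which is one reason to raise the number of iterations when small p. values matter.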
In this tutorial, we will submit a simple job using a dummy dataset and basic options for faster generation of the results page.
To work with a more complex analysis (e.g. hormone treatments on Arabidopsis seedlings), first retrieve it from the demo account:
Although you can customize the way your dataset is being analyzed, it can also be very fast and easy (just a couple of clicks) to submit a basic job.
To start a new analysis, go to the menu bar and click on Analysis > Start a new analysis.
You can upload your dataset either by copying/pasting it into the text box or by selecting a text file located on your computer. Alternatively, you can load one of our example datasets by clicking on the corresponding link on the right.
Your dataset must meet the following criteria:
Also, keep in mind that:
For a simple, faster evaluation of the significance of the distance values, you can just click Launch this analysis. This will use the Euclidean distance as the default one. However, you are still advised to enter at least your email address for easier retrieval.
This is the distance to be used to compare the data vectors:
This is the default distance, to be used in most cases.
Number of shuffling iterations used to compute the distribution of the distances obtained after randomization of the vectors. You can enter any number between 1,000 and 100,000.
Checkbox that indicates whether or not the first column should be ignored. If the first column contains non-numerical values, then this parameter is automatically set to ON.
This is where you tell RS WebTool that you want the density of the distances distribution to be computed.
For more advanced users who wish to select the density kernel to be used. Gaussian is the default. You can also choose to use rectangular, triangular, epanechnikov, biweight, cosine or optcosine.
By default, the R Nrd density bandwidth calculation method is used (see the R documentation for more details on how the density is calculated). However, this method reaches its limits when the distance values obtained after randomization have a low standard deviation. You can therefore:
Distribution of the distances calculated after random permutation, no density calculation. The calculated "raw" p. value is 0.008.
Distribution of the distances calculated after random permutation, density calculation using the Nrd method for bandwidth determination (bandwidth = 2.85e-14). The calculated p. value is 1.
Distribution of the distances calculated after random permutation, density calculation using the optimized Nrd method for bandwidth determination (bandwidth = 1.09e-01). The calculated p. value is 8.899e-03.
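To make the role of the bandwidth more concrete, here is a minimal sketch of a density-based p. value. It assumes that "Nrd" refers to the rule of thumb implemented by R's bw.nrd function (1.06 times the smaller of the standard deviation and IQR/1.34, times n^(-1/5)) and that the p. value is the area of a Gaussian kernel density estimate lying below the observed distance. The function names are hypothetical and the tool's own calculation, including its optimized bandwidth, may differ.

```python
import math
import statistics

def nrd_bandwidth(values):
    """Rule-of-thumb bandwidth in the style of R's bw.nrd (assumption)."""
    # method="inclusive" matches R's default (type 7) quantiles.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = min(statistics.stdev(values), (q3 - q1) / 1.34)
    return 1.06 * spread * len(values) ** (-1 / 5)

def kde_p_value(observed, permuted, bandwidth):
    """Area of a Gaussian kernel density estimate lying below `observed`."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")

    def normal_cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    # The CDF of a Gaussian KDE is the average of one normal CDF per point.
    total = sum(normal_cdf((observed - d) / bandwidth) for d in permuted)
    return total / len(permuted)

# Hypothetical distances obtained after permutation.
permuted = [4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0]
bw = nrd_bandwidth(permuted)
p_smooth = kde_p_value(3.0, permuted, bw)
print(f"bandwidth = {bw:.3f}, density-based p. value = {p_smooth:.4f}")
```

Because the bandwidth is proportional to the spread of the permuted distances, a very low standard deviation drives it towards zero, which is the situation illustrated by the second distribution above.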
While dataset normalization will not affect the computed p. values, it will change the distance values between two data vectors.
If selected, RS WebTool will try to fit a sigmoid to the p. value ~ distance relationship. This option is only available for normalized datasets.
For easier retrieval and job management, you can provide your email address, an analysis name and a description. If you check Quick Retrieval, you will be able to retrieve an analysis easily by just providing its name and your email address. See How do I retrieve a previous job? for more information.
This tab allows you to review your dataset and use a HeatMapping plugin.
All submitted jobs will remain on the server for an unlimited time. You can retrieve any of your previous analyses in one of the following ways:
The results pages are permalinks. Once the analysis has been launched, you will be redirected to the results page, which you can bookmark for later access.
If you provided your email address when launching your analysis, you can request a link to a summary of all the analyses matching that email address. To ensure that all your analyses remain private, this link will be emailed to you. The link will remain valid forever and can be re-used for future access to your jobs summary.
In the menu bar, click on Analyses > Retrieve all my previous analyses.
Enter your email address and click Retrieve.
Note that only the analyses for which you have provided your address will be displayed in the summary.
If you have provided your email address AND a name for a particular analysis, you will be able to directly access the results page of this particular job by clicking on:
Analyses > Retrieve a previous analysis
Enter your email address and analysis name, and click Retrieve.
If this analysis exists, you will be redirected to the results page. If several jobs match the same criteria (email address and name), you will be asked to choose from a list. However, this quick retrieval of an analysis is only enabled if you checked "quick retrieval" at the time you launched the job.
Lets you visualize all the R-generated graphs. Thumbnails in the coverflow can be clicked to display full resolution, publication-quality graphs.
To clearly visualize distance significance between all entities, p. values can be displayed as an adjacency matrix.
The p. values to be used can be selected from the upper drop-down menu (Dataset). The "raw" p. values are always available, while the density-based ones are only available if they have been computed. Each square can be clicked to display its attributes (entities compared and p. value). The significance threshold to be applied for visualization can be set (Display) to hide / show p. values matching particular criteria. Values lower than the threshold are displayed in red, and values greater than 1-threshold are displayed in blue. If the applied threshold corresponds to a significance cut-off you wish to apply (0.01 for instance), then red squares indicate significantly close vectors, and blue squares indicate vectors that are significantly different. Setting a cut-off at 0.5 will display all p. values.
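The colouring rule can be summarised by the following sketch. The function name and the "hidden" label are hypothetical, and the handling of values exactly equal to the threshold is an assumption, since only the "lower than" and "greater than" cases are described here.

```python
def square_colour(p_value, threshold=0.01):
    """Colour of a matrix square for a given p. value and display threshold."""
    if p_value < threshold:
        return "red"      # significantly close vectors
    if p_value > 1 - threshold:
        return "blue"     # significantly different vectors
    return "hidden"       # not displayed at this threshold

for p in (0.005, 0.3, 0.995):
    print(p, square_colour(p))
```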
You can adjust how the matrix is rendered by modifying its size (Matrix), as well as the font size.
The position of the rows/columns in the matrix can be changed to match pre-computed clusters. These clusters were generated by hierarchical clustering in R (Average linkage, Single linkage or Complete linkage) followed by tree cutting at various heights (0.05, 0.01, 1e-3, 1e-4, 1e-5, 1e-6).
A force-layout network can be rendered with the entities as nodes and their pairwise distance p. values as edges. You can choose to display only the edges corresponding to p. values lower than your chosen significance threshold, and to color-code the nodes according to the pre-computed clusters. These clusters were generated by hierarchical clustering in R (Average linkage, Single linkage or Complete linkage) followed by tree cutting at various heights (0.05, 0.01, 1e-3, 1e-4, 1e-5, 1e-6).
A SIF file is also generated for direct copy/paste use of the data in any network-rendering software (Cytoscape, Gephi, …).
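For readers unfamiliar with the SIF (simple interaction format) layout, each line lists a source node, an interaction type and a target node. The sketch below builds such lines from a dictionary of pairwise p. values. The function name, the "close" interaction label and the filtering by threshold are assumptions made for illustration; the file generated by RS WebTool may use a different label and may contain different edges.

```python
def to_sif(p_values, threshold=0.01, relation="close"):
    """Build SIF text with one 'source<TAB>relation<TAB>target' line per edge."""
    lines = []
    for (source, target), p in sorted(p_values.items()):
        # Assumption: keep only the edges below the significance threshold.
        if p < threshold:
            lines.append(f"{source}\t{relation}\t{target}")
    return "\n".join(lines)

# Hypothetical pairwise p. values between three entities.
pairs = {("A", "B"): 0.002, ("A", "C"): 0.4, ("B", "C"): 0.009}
sif_text = to_sif(pairs)
print(sif_text)
```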
R-generated Newick files. You can choose between any of the available datasets ("raw" p. values, density-based p. values, and normalized distances), and any of the following linkage methods: Complete, Average, Single, Ward, McQuitty, Centroid, Median.
The corresponding Newick file is displayed for easy copy/paste use in any tree-rendering software (TreeDyn, …).
If requested at the time you launched the job, the p. value ~ distance function has been computed (only available if you have normalized the dataset). This panel displays two p. value ~ distance scatterplots: the leftmost one contains all the points from your dataset, and the rightmost one displays the sigmoid between the distances corresponding to p. value = 1e-5 and p. value = 0.99999.
The Fit panel allows you to toggle the computed sigmoid (in red) on and off, and displays the R² coefficient of the regression.
The Equation panel displays the parameters of the sigmoid.
The converter panel allows for p. value calculation from a distance value and vice-versa, based on the computed equation.
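The two-way conversion can be illustrated with a standard logistic curve. The exact form of the fitted sigmoid is not given here, so the parametrization below (slope k and midpoint d0, with p = 1 / (1 + exp(-k (d - d0)))) and the numeric values are assumptions; in practice, the parameters shown in the Equation panel would have to be mapped onto whatever form the tool uses.

```python
import math

def p_from_distance(d, k, d0):
    """p. value predicted by a logistic sigmoid with slope k and midpoint d0."""
    return 1.0 / (1.0 + math.exp(-k * (d - d0)))

def distance_from_p(p, k, d0):
    """Inverse of p_from_distance: the distance that gives p. value p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return d0 + math.log(p / (1.0 - p)) / k

# Hypothetical sigmoid parameters.
k, d0 = 12.0, 0.8
p_example = p_from_distance(0.5, k, d0)
d_example = distance_from_p(0.01, k, d0)
print(f"p(0.5) = {p_example:.4f}, d(p = 0.01) = {d_example:.4f}")
```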
This is where you can retrieve all of the generated files: