Similarity Search Results Guide
This guide provides an overview of the similarity search algorithm developed in our recent research, currently available as a pre-print. For a detailed description, please refer to [1].
How It Works
- Gridded 3×3 μm² ROIs are divided into 30 nm² bins to generate cumulative frequency histograms of localisations.
- A Kolmogorov-Smirnov (K-S) test [2] is used to compute a dissimilarity value (λ) between ROIs.
- The K-S test measures the maximum distance between two cumulative frequency histograms.
- λ = 0 indicates identical ROIs, while higher values indicate greater dissimilarity.
- The mean dissimilarity and its standard deviation are calculated across datasets for comparison.
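The steps above can be sketched in Python as follows. This is a minimal illustration, not the implementation from [1]. It assumes that bins are 30 nm on a side and that the cumulative frequency histogram is built from the number of localisations per bin. It returns only the raw maximum distance between the two cumulative histograms; how that distance is scaled into λ is not described in this guide, so no scaling is applied. All function names are hypothetical.

```python
import numpy as np

ROI_SIZE_NM = 3000.0  # 3 x 3 um ROI, as stated in the guide
BIN_SIZE_NM = 30.0    # ASSUMPTION: bins are 30 nm on a side


def bin_counts(xy_nm):
    """Bin localisations (N x 2 array, in nm) on a 2D grid; return per-bin counts."""
    n_bins = int(round(ROI_SIZE_NM / BIN_SIZE_NM))
    edges = np.linspace(0.0, ROI_SIZE_NM, n_bins + 1)
    counts, _, _ = np.histogram2d(xy_nm[:, 0], xy_nm[:, 1], bins=(edges, edges))
    return counts.ravel().astype(int)


def cumulative_frequency(counts, max_count):
    """Cumulative fraction of bins holding 0, 1, ..., max_count localisations."""
    hist = np.bincount(counts, minlength=max_count + 1)
    return np.cumsum(hist) / hist.sum()


def ks_distance(xy_a, xy_b):
    """Maximum distance between the cumulative frequency histograms of two ROIs.

    ASSUMPTION: this is the unscaled K-S distance, not the guide's lambda.
    """
    counts_a, counts_b = bin_counts(xy_a), bin_counts(xy_b)
    max_count = int(max(counts_a.max(), counts_b.max()))
    cdf_a = cumulative_frequency(counts_a, max_count)
    cdf_b = cumulative_frequency(counts_b, max_count)
    return float(np.max(np.abs(cdf_a - cdf_b)))
```

An ROI compared with itself gives a distance of 0, and the distance can never exceed 1, because both curves are cumulative fractions.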
When interpreting similarity scores, a score greater than 1 (λ > 1) indicates that two datasets are significantly different. However, a score less than 1 (λ < 1) does not necessarily imply that the datasets are highly similar; it simply means that their differences are not statistically significant. When scores are compared across multiple datasets, the relative ordering remains informative even if some scores fall below 1, and can be used to rank datasets by their degree of similarity or dissimilarity.
Thinned vs. Unthinned Data
Dissimilarity is computed before and after thinning datasets to a uniform density of 100 localisations/μm², allowing users to choose between:
- Unthinned dissimilarity scores (dependent on localisation density)
- Thinned dissimilarity scores (independent of localisation density)
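As an illustration of thinning, the sketch below randomly subsamples an ROI down to the target density of 100 localisations/μm². It assumes uniform random subsampling without replacement, and it leaves ROIs that are already at or below the target unchanged. Neither choice is confirmed by this guide, and the function name and parameters are hypothetical.

```python
import numpy as np


def thin_to_density(xy_nm, area_um2=9.0, target_density=100.0, rng=None):
    """Randomly subsample localisations to a target density (per square um).

    ASSUMPTION: uniform random subsampling without replacement; ROIs already
    at or below the target density are returned unchanged.
    """
    rng = np.random.default_rng() if rng is None else rng
    target_n = int(round(target_density * area_um2))
    if len(xy_nm) <= target_n:
        return xy_nm
    keep = rng.choice(len(xy_nm), size=target_n, replace=False)
    return xy_nm[keep]
```

For a 3×3 μm² ROI, the target corresponds to 900 localisations.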
Important Considerations
- In unthinned data, dissimilarity scores are influenced by localisation density: datasets with lower densities tend to have lower dissimilarity scores.
- It is recommended to use thinned dissimilarity scores when comparing datasets with significantly different localisation densities.
Comparing Different Datasets
- Users can filter results based on metadata tags to ensure meaningful comparisons. For example, filtering by protein type ensures that only datasets containing the same protein are compared.
- Filtering is essential—comparing your dataset to all others on the site may not be meaningful. Users should refine their comparisons based on their specific research question.
- Dissimilarity scores are relative, not absolute measures of similarity. They provide guidance when testing hypotheses, but should be supplemented with further post-analysis, such as cluster analysis, spatial correlation studies, or functional validation.
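The filter-then-rank workflow described above might look like the following sketch. The `Result` record, its fields, and the tag names are hypothetical and do not reflect the site's actual data model. The λ > 1 flag follows the interpretation given earlier in this guide.

```python
from dataclasses import dataclass


@dataclass
class Result:
    dataset_id: str       # hypothetical identifier
    tags: dict            # hypothetical metadata tags, e.g. {"protein": "..."}
    dissimilarity: float  # lambda against the query dataset


def filter_and_rank(results, **required_tags):
    """Keep results whose tags match all required values; sort most similar first."""
    matching = [
        r for r in results
        if all(r.tags.get(key) == value for key, value in required_tags.items())
    ]
    return sorted(matching, key=lambda r: r.dissimilarity)


def is_significantly_different(result):
    """Per the guide, lambda > 1 marks a statistically significant difference."""
    return result.dissimilarity > 1.0
```

Ranking only within the filtered subset keeps the relative ordering tied to the research question, rather than to every dataset on the site.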
References
- [1] Shirgill, Sandeep, et al. "Nano-org, a functional resource for single-molecule localisation microscopy data." bioRxiv (2024): 2024-08.
- [2] Massey, Frank J., Jr. "The Kolmogorov-Smirnov test for goodness of fit." Journal of the American Statistical Association 46.253 (1951): 68-78.