.. _User-similarity-search: Similarity search ******************** There are many tools for similarity search in nucleic acid sequences. In this section, we describe how and why our similarity search tools were selected. A detailed :ref:`User-similarity-search-measurement` is also given. .. _User-similarity-search-blast: Blast ===== .. note:: See :ref:`User-similarity-search-measurement` for BLAST speed measurement results. `Blast `_ was selected as the first approach to similarity search because it is a de facto standard in bioinformatics. We used 2.2.28+ Blast version which was current version when we integrated Blast into our poject. On 6th January 2014 was released Blast 2.2.29+ with some minor changes (in terms of Blast usage; of course there are improvements in this version). Currently, we use still 2.2.28+ because it is sufficient for us. Also, there is proposed XML format change in summer 2014, which will be reflected later. Blast deployment ---------------- We selected one program from the Blast suite: ``blastn``, which is meant for searching in **n**\ucleotide databases using a nucleotide query. However, there were still choices to be made, as Blast offers multiple deployment options. A special database (not a full-scale DB, just several files in the proper formats) must be generated which Blast will query, and the deployment options are differentiated by what database is used and how. The performance problems we encountered was not only in the response time but especially when building and indexing the Blast local database. Generally, there are three options for running Blast: First, it is possible to query a remote server which holds entire database. The advantage of this approach is that we don't need to maintain our own database. However, there are significant disadvantages. First of all, we don't have control over the database, so it might contains data that are not desirable for our purposes. This is both a performance problem (we could search faster if only rRNA sequences were in the central database) and an implemnetation challenge: how to recognize results from the central Blast database that are not in rData? A second performance problem is that querying over a network may increase response time. Depending on how exactly ``blastn`` running locally on our server implements querying the remote database, this may be a considerable issue, especially combined with the fact that rRNA sequences are only a small subset of the central Blast database. Reliability must also be factored in as a potential issue, although we believe that the NCBI maintain their servers better than we do, and maintenance once the database undergoes changes incompatible with our version of Blast. Second, Blast offers the option to download pre-formatted databases using the ``fastacmd`` utility. The advantage of this approach is that once the databases are downloaded, Blast can query locally instead of over the network. Also, these pre-formatted databases are smaller than if we donwloaded fasta files and added them into our own database on our local server. There is also a maintenance Perl script ``update_blastdb.pl`` for downloading the up-to-date versions of these pre-formatted databases, simplifying maintenance. However, there is still the significant disadvantage that we do not control the *content* of these files. Third, we could download the ``.fasta`` files from the `Silva database `_ used to populate rPredictorDB and create our own local Blast database using the utility ``makeblastdb``. This deployment option has the advantage that only data rPredictorDB holds will be used for making the database and the querying will still only happen locally on our server. Although setting up Blast and integrating it with rPredictorDB will be slightly more complicated and additional care must be taken that the Blast database is kept synchronized with rData, we chose this option, as it offers best performance. We therefore chose the third option. However, even with the local database, we encountered some performance problems: the search ran a bit too slowly. Therefore, we started to look for some faster similarity search tools that can be used as a complement or alternative for Blast tool. .. _User-similarity-search-blat: Blat ===== .. note:: See :ref:`User-similarity-search-measurement` for speed measurement results. While BLAST is a *de facto* standard for similarity search in the bioinformatics field, there are faster tools available today which are based on a similar approach. One of these is **Blat**, introduced in 2002 by `W. J. Kent `_. Blat is a heuristic algorithm for finding areas (in a *target database*) which are similar to a sequence under examination (a *query*). It computes similarity by deriving homology (i.e. existence of common ancestors between a pair of structures) for determining biological function of sequences. Blat tries to quickly detect short sequences which are most likely homologous, align them and further expand homologous areas in both directions. The algorithm is similar to Blast in the sense that it first searches for short segments (seeds) in the database and query sequences that have a certain number of identical elements. Blat maintains an indexed list (hash table) for the target database. The target database is indexed by coordinates of non-overlapping k-mers (a k-mer is simply a word with ``k`` letters), apart from very frequent k-mers. This non-overlapping property is the approximation that sets Blat apart from Blast. Then, Blat builds a list of overlapping k-mers from the query sequence and searches for these k-mers in the target database. For example: * We will set Blat to use k-mers of length 3 * Our data is ``SIMILARITY SEARCH TOOL`` Then the indexes in the database will be:: SIM|ILA|RIT|Y S|EAR|CH T|OOL We search for:: SIMILAR Blat will build the following k-mers from the query:: SIM|IMI|MIL|ILA|LAR And the following k-mers will be found:: SIM, ILA There are several strategies for searching candidates for areas of homology: exact match (for example 4 k-mers are exactly the same), non-exact match allowing one mismatch between two k-mers or more exact matches near each other. Blat uses the third method when aligning nucleotides and requires two exact matches with size 11 (11-mers). Blat maintains the index of k-mers in memory, which seems to be the biggest speed advantage against Blast. Blast and Blat use different strategies for extending homologous areas - Blat can extend on more exact and almost exact matches while Blast extends a potential homologous area only if he has one or two exact matches close to each other. Blat is slightly more sensitive than Blast: it may include as similar some sequences that should get a lower similarity score (or would get one, under Blast). It was measured that Blat is approximately four times faster than Blast. .. _User-similarity-search-measurement: Similarity search tools speed measurement ================================================ This document describes how similarity search tool speed was measured. We needed to look for some complements/alternatives to Blast because its speed was not completely satisfactory. Some other heuristics tools with desirable properties (control over data sources, local installation possibility) were chosen for performance testing. What tools were measured? -------------------------- Blast was measured as the baseline. We measured speed of three Blat versions: * standalone Blat without the ``-ooc`` parameter set, * standalone Blat with the ``-ooc`` parameter set and * client/server Blat. .. note:: Client/server Blat does not support the ``-ooc`` flag to be set. We measured also SSAHA, comparing it to Blat results. Additional information such as time taken by preparation of data for tools was also measured. If source FASTA files change, we need to download new files and prepare them for certain tools. For now, we have only two fasta files as sources which seems to be quite static, but the number of files may grow in the future and chages to them may be also more frequent than today. So we have to take this into account during decision process. All of the tested similarity search tools are heuristics, because the amount of data to be processed is typically huge and the user will not wait for exact results which may be only just a little bit better than results from heuristics methods. There are also tools that take less time to set up: the Blast utility ``makeblastdb``, which converts FASTA format files into the Blast format database, takes quite a long time, as can be seen in measurement result page. Preparation process ------------------- Every tool needs different preparations before it can be used for searching. * Blast - ``makeblastdb`` to make locally acessed database * Blat standalone without ``-ooc`` set - nothing * Blat standalone with -ooc set -.occ file needs to be generated * Blat client/server - the starting server and indexing * ssaha - preparation of the hash table Measurement parameters ---------------------- Measurements were made for 50 000 records in the database file and 5 and then 20 queries performed. We measured performance on the SSU and LSU Silva datasets, which are currently in use by rPredictorDB. For increased robustness we also measured results on a mouse dataset which contains much more sequences. 5 tests ^^^^^^^^^^ The assumption that on big data (the mouse dataset) will be results still the same in sense which tool is the best was not confirmed. 20 tests ^^^^^^^^^^^^ The assumption that on big data will be results still the same in sense which tool is the best was now confirmed. How measurement is implemented --------------------------------- We take the data and in the beginning of the measurement script set two variables - ``NUM_OF_RECORDS_IN_DB_FILE`` and ``NUM_OF_FILES`` (indicates also number of tests to be taken). These values are not provided as script parameters for clarity purposes. Script usage ^^^^^^^^^^^^ :: python measure.py {mouse|ssu| lsu| clean_mouse| clean_ssu| clean_lsu| clean| generate_mouse| generate_ssu| generate_lsu| measure_mouse| measure_ssu| measure_lsu} where * ``mouse|ssu|lsu``: indicates database file used for generating test files, this option generates data and performs measurement. * ``clean_mouse|clean_ssu|clean_lsu``: indicates which generated files will be deleted. * ``clean``: deletes all generated files. * ``generate_mouse|generate_ssu|generate_lsu``: generate database file and do not measure. * ``measure_mouse|measure_ssu|measure_lsu``: do pairings on files generated previously and measure results. Inputs ^^^^^^ We have several databases and several queries. Databases and queries are chosen (pseudo) randomly. Databases will have different selected size (small, medium, big, original size), we have three fasta files (SSU, LSU and mouse) and database files will be generated from them. Note: To improve the speed of measurements itself, we create symlink to whole fasta file when the number of records in database file is less than number of records in input fasta file. We randomly choose N sequences (where N is the ``NUM_OF_FILES`` parameter) from them to form one test file (database). Then we randomly choose from N pre-prepared queries. These sequences will be taken from the database files. This is done by a similar method as generating database files. Tools used ^^^^^^^^^^^ To be able to work with sequences we use the Biopython package (version 1.63 or later) For measuring we use the ``time`` utility from Unix (``os.times`` in python). Measurements results --------------------- .. note:: Blat server has restricted index size, but it was tested that both SSU and LSU can be held by one server. Twenty tests were taken, number of records in generated database file set to 50 000. All results are given in seconds. SSU ^^^^ Results for standalone Blat without ``-ooc`` set: ========================== ================================= Measurement Value ========================== ================================= Average time 0.608 Minimal time 0.13 Maximum time 2.69 Median time 0.13 Variation 0.957750489428 ========================== ================================= Results for standalone Blat with ``-ooc`` set: ========================== ================================= Measurement Value ========================== ================================= Average time 0.13 Minimal time 0.12 Maximum time 0.14 Median time 0.13 Variation 0.00547722557505 ========================== ================================= Results for client/server Blat: ========================== ================================= Measurement Value ========================== ================================= Average time 0.0365 Minimal time 0.0 Maximum time 0.19 Median time 0.0 Variation 0.0732307995313 ========================== ================================= Results for SSAHA: ========================== ================================= Measurement Value ========================== ================================= Average time 0.053 Minimal time 0.04 Maximum time 0.09 Median time 0.05 Variation 0.0167630546142 ========================== ================================= Measurement of Blast for comparison ========================== ================================= Measurement Value ========================== ================================= Average time 0.1565 Minimal time 0.0 Maximum time 0.73 Median time 0.02 Variation 0.280629203755 ========================== ================================= **Additional measurement for server/hash table/db preparation:** Results for starting Blat server and preparing index: ========================== ================================= Measurement Value ========================== ================================= Average time 0.1685 Minimal time 0.16 Maximum time 0.18 Median time 0.17 Variation 0.00653834841531 ========================== ================================= Results for precomputing hash table for SSAHA: ========================== ================================= Measurement Value ========================== ================================= Average time 0.4455 Minimal time 0.43 Maximum time 0.46 Median time 0.44 Variation 0.00864580823289 ========================== ================================= Results for Blast db preparation: ========================== ================================= Measurement Value ========================== ================================= Average time 0.5725 Minimal time 0.55 Maximum time 0.6 Median time 0.575 Variation 0.0113468057179 ========================== ================================= Best result in average case gives client/server: Blat. And a simple comparison of results for Blast and Blat client/server, which are both used by rPredictorDB site, can be seen that Blat is approximatelly four times faster than Blast. If we take into account also (average) times needed for data prepararion, we have: **Blast:** 0.1565 + 0.5725 = 0.729 **Blat:** 0.0365 + 0.1685 = 0.205 LSU ^^^^ Results for standalone Blat without ``-ooc`` set: ========================== ================================= Measurement Value ========================== ================================= Average time 5.447 Minimal time 1.49 Maximum time 11.6 Median time 3.905 Variation 3.62403242259 ========================== ================================= Results for standalone Blat with ``-ooc`` set: ========================== ================================= Measurement Value ========================== ================================= Average time 5.4345 Minimal time 1.49 Maximum time 11.57 Median time 3.925 Variation 3.61336750829 ========================== ================================= Results for client/server Blat: ========================== ================================= Measurement Value ========================== ================================= Average time 0.5825 Minimal time 0.27 Maximum time 0.74 Median time 0.605 Variation 0.128018553343 ========================== ================================= Results for SSAHA: ========================== ================================= Measurement Value ========================== ================================= Average time 0.8935 Minimal time 0.0599999999999 Maximum time 2.57 Median time 0.625 Variation 0.731144821496 ========================== ================================= Measurement of Blast for comparison: ========================== ================================= Measurement Value ========================== ================================= Average time 2.3635 Minimal time 0.87 Maximum time 3.51 Median time 2.565 Variation 0.852439294026 ========================== ================================= **Additional measurement for server/hash table/db preparation:** Results for starting Blat server and preparing index ========================== ================================= Measurement Value ========================== ================================= Average time 1.9285 Minimal time 1.88 Maximum time 2.31 Median time 1.91 Variation 0.0883897618506 ========================== ================================= Results for precomputing hash table for SSAHA: ========================== ================================= Measurement Value ========================== ================================= Average time 5.533 Minimal time 5.4 Maximum time 6.54 Median time 5.475 Variation 0.238644086455 ========================== ================================= Results for Blast db preparation: ========================== ================================= Measurement Value ========================== ================================= Average time 6.068 Minimal time 5.94 Maximum time 6.7 Median time 6.02 Variation 0.164304595188 ========================== ================================= As it can be seen, best result in average case gives client/server Blat. And a simple comparison of results for Blast and Blat client/server, which are both used by rPredictorDB site, shows that Blat is approximately four times faster than Blast. If we take into account also (average) times needed for data prepararion, we have: **Blast:** 2.3635 + 6.068 = 8.4315 **Blat:** 0.5825 + 1.9285 = 2.511 Mouse ^^^^^^ Results for standalone Blat without ``-ooc`` set: ========================== ================================= Measurement Value ========================== ================================= Average time 0.2315 Minimal time 0.19 Maximum time 0.29 Median time 0.24 Variation 0.0276179289593 ========================== ================================= Results for standalone Blat with ``-ooc`` set ========================== ================================= Measurement Value ========================== ================================= Average time 0.2355 Minimal time 0.19 Maximum time 0.31 Median time 0.235 Variation 0.0321675302129 ========================== ================================= Results for client/server Blat: ========================== ================================= Measurement Value ========================== ================================= Average time 0.0005 Minimal time 0.0 Maximum time 0.00999999999999 Median time 0.0 Variation 0.00217944947177 ========================== ================================= Results for SSAHA: ========================== ================================= Measurement Value ========================== ================================= Average time 0.055 Minimal time 0.04 Maximum time 0.07 Median time 0.055 Variation 0.00670820393251 ========================== ================================= Measurement of Blast for comparison: ========================== ================================= Measurement Value ========================== ================================= Average time 0.0395 Minimal time 0.02 Maximum time 0.0599999999999 Median time 0.04 Variation 0.00920597631975 ========================== ================================= **Additional measurement for server/hash table/db preparation:** Results for starting Blat server and preparing index: ========================== ================================= Measurement Value ========================== ================================= Average time 0.2285 Minimal time 0.18 Maximum time 0.31 Median time 0.24 Variation 0.0330567693521 ========================== ================================= Results for precomputing hash table for SSAHA: ========================== ================================= Measurement Value ========================== ================================= Average time 0.604 Minimal time 0.48 Maximum time 0.73 Median time 0.63 Variation 0.0689492567038 ========================== ================================= Results for Blast db preparation: ========================== ================================= Measurement Value ========================== ================================= Average time 1.3615 Minimal time 1.17 Maximum time 1.48 Median time 1.42 Variation 0.107994212808 ========================== ================================= The best result in average case comes from client/server Blat. A simple comparison of results for Blast and Blat client/server, which are both used by rPredictorDB site, shows that Blat is even more than four times faster than Blast in the case of data currently not in use. If we take into account also (average) times needed for data prepararion, we have: **Blast:** 0.0395 + 1.3615 = 1.401 **Blat:** 0.0005 + 0.2285 = 0.229