There are many tools for similarity search in nucleic acid sequences. In this section, we describe how and why our similarity search tools were selected. A detailed speed measurement of the tools is also given in Similarity search tools speed measurement.
Note
See Similarity search tools speed measurement for BLAST speed measurement results.
Blast was selected as the first approach to similarity search because it is the de facto standard in bioinformatics.
We used Blast version 2.2.28+, which was the current version when we integrated Blast into our project. Blast 2.2.29+ was released on 6 January 2014 with changes that are minor in terms of Blast usage (although the version does contain improvements). We still use 2.2.28+ because it is sufficient for our purposes.
There is also an XML format change proposed for summer 2014, which will be reflected later.
We selected one program from the Blast suite: blastn, which is meant for searching nucleotide databases using a nucleotide query. However, there were still choices to be made, as Blast offers multiple deployment options. A special database (not a full-scale DB, just several files in the proper formats) must be generated for Blast to query, and the deployment options differ in which database is used and how.
The performance problems we encountered were not only in the response time, but especially in building and indexing the local Blast database.
Generally, there are three options for running Blast:
First, it is possible to query a remote server which holds the entire database. The advantage of this approach is that we do not need to maintain our own database. However, there are significant disadvantages. First of all, we have no control over the database, so it might contain data that are not desirable for our purposes. This is both a performance problem (we could search faster if only rRNA sequences were in the central database) and an implementation challenge: how do we recognize results from the central Blast database that are not in rData?
A second performance problem is that querying over a network may increase response time. Depending on how exactly blastn running locally on our server implements querying the remote database, this may be a considerable issue, especially combined with the fact that rRNA sequences are only a small subset of the central Blast database. Reliability must also be factored in as a potential issue (although we believe that the NCBI maintains its servers better than we would), as must maintenance once the central database undergoes changes incompatible with our version of Blast.
Second, NCBI offers pre-formatted Blast databases for download. The advantage of this approach is that once the databases are downloaded, Blast can query them locally instead of over the network. These pre-formatted databases are also smaller than the corresponding FASTA files would be if we downloaded those and built our own database on our local server. There is also a maintenance Perl script, update_blastdb.pl, for downloading up-to-date versions of these pre-formatted databases, which simplifies maintenance. However, the significant disadvantage remains that we do not control the content of these files.
Third, we could download the FASTA files from the Silva database used to populate rPredictorDB and create our own local Blast database using the makeblastdb utility. This deployment option has the advantage that the database is built only from the data rPredictorDB holds, and querying still happens locally on our server. Although setting up Blast and integrating it with rPredictorDB is slightly more complicated, and additional care must be taken to keep the Blast database synchronized with rData, this option offers the best performance.
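The third option boils down to two BLAST+ invocations, sketched below by building their command lines (the file and database names here are hypothetical placeholders, not the actual rPredictorDB paths):

```python
# Sketch of the two BLAST+ invocations behind the third option.
# File and database names are hypothetical placeholders.
make_db_cmd = [
    "makeblastdb",
    "-in", "silva_export.fasta",   # FASTA file downloaded from Silva
    "-dbtype", "nucl",             # nucleotide database, for blastn
    "-out", "rpredictor_blastdb",  # name of the local database files
]
search_cmd = [
    "blastn",
    "-db", "rpredictor_blastdb",   # query the local database
    "-query", "query.fasta",       # nucleotide query sequence(s)
    "-outfmt", "5",                # XML output
    "-out", "results.xml",
]
# e.g. subprocess.run(make_db_cmd, check=True) once per database rebuild,
# then subprocess.run(search_cmd, check=True) for each query.
print(" ".join(search_cmd))
```

The database build runs only when rData changes; queries then touch only the local files.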
We therefore chose the third option. However, even with the local database, we encountered performance problems: the search ran somewhat too slowly. We therefore started to look for faster similarity search tools that could be used as a complement or alternative to Blast.
Note
See Similarity search tools speed measurement for speed measurement results.
While Blast is the de facto standard for similarity search in bioinformatics, there are faster tools available today which are based on a similar approach. One of these is Blat, introduced in 2002 by W. J. Kent.
Blat is a heuristic algorithm for finding areas in a target database which are similar to a sequence under examination (a query). It assesses similarity in order to infer homology (i.e., the existence of a common ancestor between a pair of sequences), which in turn helps determine the biological function of the sequences.
Blat tries to quickly detect short sequences which are most likely homologous, align them, and further expand the homologous areas in both directions. The algorithm is similar to Blast in the sense that it first searches for short segments (seeds) in the database and query sequences that have a certain number of identical elements. Blat maintains an indexed list (hash table) for the target database. The target database is indexed by the coordinates of non-overlapping k-mers (a k-mer is simply a word of k letters), except for very frequent k-mers. This non-overlapping property is the approximation that sets Blat apart from Blast. Blat then builds a list of overlapping k-mers from the query sequence and searches for these k-mers in the target database.
For example, suppose the target database contains the text SIMILARITY SEARCH TOOL and k = 3. The non-overlapping 3-mers indexed from the database will then be:
SIM|ILA|RIT|Y S|EAR|CH |TOO
(the trailing L is shorter than k and is dropped).
We search for:
SIMILAR
Blat will build the following k-mers from the query:
SIM|IMI|MIL|ILA|LAR
And the following k-mers will be found:
SIM, ILA
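The indexing scheme in the example above can be sketched in Python. This is a deliberate simplification: real Blat also stores k-mer positions, filters out very frequent k-mers, and performs alignment and extension on the hits afterwards.

```python
def nonoverlapping_kmers(text, k):
    # Index the target database by NON-overlapping k-mers
    # (this non-overlapping property is Blat's approximation).
    return {text[i:i + k]: i for i in range(0, len(text) - k + 1, k)}

def overlapping_kmers(text, k):
    # The query is decomposed into overlapping k-mers, one per position.
    return [text[i:i + k] for i in range(len(text) - k + 1)]

db_index = nonoverlapping_kmers("SIMILARITY SEARCH TOOL", 3)
query_kmers = overlapping_kmers("SIMILAR", 3)  # SIM, IMI, MIL, ILA, LAR
hits = [kmer for kmer in query_kmers if kmer in db_index]
print(hits)  # ['SIM', 'ILA']
```

Because the database index stores only one k-mer per k positions while the query contributes a k-mer at every position, the index stays roughly k times smaller at the cost of some sensitivity.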
There are several strategies for finding candidate areas of homology: exact matches (for example, four k-mers that are exactly the same), near matches allowing one mismatch between two k-mers, or multiple exact matches near each other. When aligning nucleotides, Blat uses the third strategy and requires two exact matches of size 11 (11-mers).
Blat maintains the index of k-mers in memory, which seems to be its biggest speed advantage over Blast. Blast and Blat also use different strategies for extending homologous areas: Blat can extend on multiple exact and near-exact matches, while Blast extends a potential homologous area only if it finds one or two exact matches close to each other. Blat is slightly more sensitive than Blast: it may report as similar some sequences that would receive a lower similarity score under Blast.
Our measurements show that Blat is approximately four times faster than Blast.
This document describes how similarity search tool speed was measured.
We needed to look for complements or alternatives to Blast because its speed was not completely satisfactory. Other heuristic tools with the desired properties (control over data sources, possibility of local installation) were chosen for performance testing.
Blast was measured as the baseline. We measured the speed of three Blat configurations: standalone Blat without the -ooc flag, standalone Blat with -ooc, and client/server Blat.
Note
Client/server Blat does not support setting the -ooc flag.
We also measured SSAHA, comparing it to the Blat results.
Additional information, such as the time taken to prepare data for the tools, was also measured. If the source FASTA files change, we need to download the new files and prepare them for certain tools. For now, we have only two FASTA files as sources, and they seem to be quite static; however, the number of files may grow in the future, and changes to them may become more frequent than they are today. We have to take this into account in the decision process.
All of the tested similarity search tools are heuristic, because the amount of data to be processed is typically huge and users will not wait for exact results that may be only marginally better than the results of heuristic methods. The tools also differ in setup time: the Blast utility makeblastdb, which converts FASTA files into the Blast database format, takes quite a long time, as can be seen in the measurement results.
Every tool needs different preparations before it can be used for searching.
Measurements were made with 50 000 records in the database file, first with 5 and then with 20 queries.
We measured performance on the SSU and LSU Silva datasets, which are currently in use by rPredictorDB. For increased robustness, we also measured results on a mouse dataset, which contains many more sequences.
The assumption that the results on big data (the mouse dataset) would remain the same, in the sense of which tool is the best, was confirmed.
We take the data and, at the beginning of the measurement script, set two variables: NUM_OF_RECORDS_IN_DB_FILE and NUM_OF_FILES (the latter also indicates the number of tests to be run). For clarity, these values are not provided as script parameters.
python measure.py {mouse|ssu|lsu|clean_mouse|clean_ssu|clean_lsu|clean|generate_mouse|generate_ssu|generate_lsu|measure_mouse|measure_ssu|measure_lsu}
where
We have several databases and several queries, chosen pseudo-randomly. Database files of several selected sizes (small, medium, big, original size) are generated from our three FASTA files (SSU, LSU and mouse). Note: to speed up the measurements themselves, we create a symlink to the whole FASTA file when the requested number of records in the database file is not less than the number of records in the input FASTA file.
We randomly choose N sequences (where N is given by the NUM_OF_FILES parameter) from them to form one test file (database).
Then we randomly choose from N pre-prepared queries. These query sequences are taken from the database files, using a method similar to the one used for generating the database files.
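The sampling step can be sketched as follows. This is a self-contained simplification of what the measurement script does; the real script uses Biopython's SeqIO rather than the toy parser here, and the function names are illustrative only.

```python
import random

def parse_fasta(text):
    # Minimal FASTA parser: returns a list of (header, sequence) pairs.
    records, header, seq = [], None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:].strip(), []
        elif line.strip():
            seq.append(line.strip())
    if header is not None:
        records.append((header, "".join(seq)))
    return records

def sample_records(records, n, seed=0):
    # Pseudo-randomly choose n records to form one test database file;
    # a fixed seed keeps a measurement run reproducible.
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))

fasta = ">seq1\nACGU\n>seq2\nGGCC\n>seq3\nAUAU\n"
test_db = sample_records(parse_fasta(fasta), 2)
```

The sampled records are then written back out in FASTA format as one generated database file; query files are produced the same way from the database files.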
To work with sequences, we use the Biopython package (version 1.63 or later).
For measuring, we use the Unix time facilities (os.times in Python).
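Timing one external tool invocation with os.times can be sketched roughly like this (a sketch only; the actual measure.py may structure this differently):

```python
import os
import subprocess
import sys

def timed_run(cmd):
    # Run a command and return its elapsed wall-clock time in seconds,
    # measured via os.times() as in the measurement script.
    start = os.times().elapsed
    subprocess.run(cmd, check=True)
    return os.times().elapsed - start

# Example: time a trivial Python invocation.
t = timed_run([sys.executable, "-c", "pass"])
```

os.times() also reports children_user and children_system, so CPU time of the spawned tool could be measured the same way if wall-clock time were not the metric of interest.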
Note
The Blat server has a restricted index size, but we verified that both SSU and LSU can be held by one server.
Twenty tests were run, with the number of records in the generated database file set to 50 000.
All results are given in seconds.
Results for standalone Blat without -ooc set:
Measurement | Value |
---|---|
Average time | 0.608 |
Minimal time | 0.13 |
Maximum time | 2.69 |
Median time | 0.13 |
Variation | 0.957750489428 |
Results for standalone Blat with -ooc set:
Measurement | Value |
---|---|
Average time | 0.13 |
Minimal time | 0.12 |
Maximum time | 0.14 |
Median time | 0.13 |
Variation | 0.00547722557505 |
Results for client/server Blat:
Measurement | Value |
---|---|
Average time | 0.0365 |
Minimal time | 0.0 |
Maximum time | 0.19 |
Median time | 0.0 |
Variation | 0.0732307995313 |
Results for SSAHA:
Measurement | Value |
---|---|
Average time | 0.053 |
Minimal time | 0.04 |
Maximum time | 0.09 |
Median time | 0.05 |
Variation | 0.0167630546142 |
Measurement of Blast for comparison:
Measurement | Value |
---|---|
Average time | 0.1565 |
Minimal time | 0.0 |
Maximum time | 0.73 |
Median time | 0.02 |
Variation | 0.280629203755 |
Additional measurement for server/hash table/db preparation:
Results for starting Blat server and preparing index:
Measurement | Value |
---|---|
Average time | 0.1685 |
Minimal time | 0.16 |
Maximum time | 0.18 |
Median time | 0.17 |
Variation | 0.00653834841531 |
Results for precomputing hash table for SSAHA:
Measurement | Value |
---|---|
Average time | 0.4455 |
Minimal time | 0.43 |
Maximum time | 0.46 |
Median time | 0.44 |
Variation | 0.00864580823289 |
Results for Blast db preparation:
Measurement | Value |
---|---|
Average time | 0.5725 |
Minimal time | 0.55 |
Maximum time | 0.6 |
Median time | 0.575 |
Variation | 0.0113468057179 |
The best average-case result comes from client/server Blat.
A simple comparison of the results for Blast and client/server Blat, which are both used by the rPredictorDB site, shows that Blat is approximately four times faster than Blast. If we also take into account the (average) times needed for data preparation, we have:
Blast: 0.1565 + 0.5725 = 0.729
Blat: 0.0365 + 0.1685 = 0.205
Results for standalone Blat without -ooc set:
Measurement | Value |
---|---|
Average time | 5.447 |
Minimal time | 1.49 |
Maximum time | 11.6 |
Median time | 3.905 |
Variation | 3.62403242259 |
Results for standalone Blat with -ooc set:
Measurement | Value |
---|---|
Average time | 5.4345 |
Minimal time | 1.49 |
Maximum time | 11.57 |
Median time | 3.925 |
Variation | 3.61336750829 |
Results for client/server Blat:
Measurement | Value |
---|---|
Average time | 0.5825 |
Minimal time | 0.27 |
Maximum time | 0.74 |
Median time | 0.605 |
Variation | 0.128018553343 |
Results for SSAHA:
Measurement | Value |
---|---|
Average time | 0.8935 |
Minimal time | 0.06 |
Maximum time | 2.57 |
Median time | 0.625 |
Variation | 0.731144821496 |
Measurement of Blast for comparison:
Measurement | Value |
---|---|
Average time | 2.3635 |
Minimal time | 0.87 |
Maximum time | 3.51 |
Median time | 2.565 |
Variation | 0.852439294026 |
Additional measurement for server/hash table/db preparation:
Results for starting Blat server and preparing index:
Measurement | Value |
---|---|
Average time | 1.9285 |
Minimal time | 1.88 |
Maximum time | 2.31 |
Median time | 1.91 |
Variation | 0.0883897618506 |
Results for precomputing hash table for SSAHA:
Measurement | Value |
---|---|
Average time | 5.533 |
Minimal time | 5.4 |
Maximum time | 6.54 |
Median time | 5.475 |
Variation | 0.238644086455 |
Results for Blast db preparation:
Measurement | Value |
---|---|
Average time | 6.068 |
Minimal time | 5.94 |
Maximum time | 6.7 |
Median time | 6.02 |
Variation | 0.164304595188 |
As can be seen, the best average-case result again comes from client/server Blat.
A simple comparison of the results for Blast and client/server Blat, which are both used by the rPredictorDB site, shows that Blat is approximately four times faster than Blast.
If we also take into account the (average) times needed for data preparation, we have:
Blast: 2.3635 + 6.068 = 8.4315
Blat: 0.5825 + 1.9285 = 2.511
Results for standalone Blat without -ooc set:
Measurement | Value |
---|---|
Average time | 0.2315 |
Minimal time | 0.19 |
Maximum time | 0.29 |
Median time | 0.24 |
Variation | 0.0276179289593 |
Results for standalone Blat with -ooc set:
Measurement | Value |
---|---|
Average time | 0.2355 |
Minimal time | 0.19 |
Maximum time | 0.31 |
Median time | 0.235 |
Variation | 0.0321675302129 |
Results for client/server Blat:
Measurement | Value |
---|---|
Average time | 0.0005 |
Minimal time | 0.0 |
Maximum time | 0.01 |
Median time | 0.0 |
Variation | 0.00217944947177 |
Results for SSAHA:
Measurement | Value |
---|---|
Average time | 0.055 |
Minimal time | 0.04 |
Maximum time | 0.07 |
Median time | 0.055 |
Variation | 0.00670820393251 |
Measurement of Blast for comparison:
Measurement | Value |
---|---|
Average time | 0.0395 |
Minimal time | 0.02 |
Maximum time | 0.06 |
Median time | 0.04 |
Variation | 0.00920597631975 |
Additional measurement for server/hash table/db preparation:
Results for starting Blat server and preparing index:
Measurement | Value |
---|---|
Average time | 0.2285 |
Minimal time | 0.18 |
Maximum time | 0.31 |
Median time | 0.24 |
Variation | 0.0330567693521 |
Results for precomputing hash table for SSAHA:
Measurement | Value |
---|---|
Average time | 0.604 |
Minimal time | 0.48 |
Maximum time | 0.73 |
Median time | 0.63 |
Variation | 0.0689492567038 |
Results for Blast db preparation:
Measurement | Value |
---|---|
Average time | 1.3615 |
Minimal time | 1.17 |
Maximum time | 1.48 |
Median time | 1.42 |
Variation | 0.107994212808 |
The best average-case result again comes from client/server Blat.
A simple comparison of the results for Blast and client/server Blat, which are both used by the rPredictorDB site, shows that Blat is even more than four times faster than Blast on this dataset, which is not currently in use.
If we also take into account the (average) times needed for data preparation, we have:
Blast: 0.0395 + 1.3615 = 1.401
Blat: 0.0005 + 0.2285 = 0.229