There are many tools for similarity search in nucleic acid sequences. In this section, we describe how and why our similarity search tools were selected. A detailed speed measurement of the tools is also given in Similarity search tools speed measurement.
Note
See Similarity search tools speed measurement for BLAST speed measurement results.
Blast was selected as the first approach to similarity search because it is the de facto standard in bioinformatics.
We used Blast version 2.2.28+, which was the current version when we integrated Blast into our project. Blast 2.2.29+ was released on 6 January 2014 with changes that are minor in terms of Blast usage (although the version does contain improvements). We still use 2.2.28+ because it is sufficient for our purposes.
There is also an XML format change proposed for summer 2014, which will be reflected later.
We selected one program from the Blast suite: blastn, which is meant for searching nucleotide databases using a nucleotide query. However, there were still choices to be made, as Blast offers multiple deployment options. A special database (not a full-scale DB, just several files in the proper formats) must be generated for Blast to query, and the deployment options differ in which database is used and how.
The performance problems we encountered were not only in the response time, but especially in building and indexing the local Blast database.
Generally, there are three options for running Blast:
First, it is possible to query a remote server which holds the entire database. The advantage of this approach is that we do not need to maintain our own database. However, there are significant disadvantages. First of all, we have no control over the database, so it might contain data that are not desirable for our purposes. This is both a performance problem (we could search faster if only rRNA sequences were in the central database) and an implementation challenge: how do we recognize results from the central Blast database that are not in rData?
A second performance problem is that querying over a network may increase response time. Depending on how exactly blastn running locally on our server implements querying the remote database, this may be a considerable issue, especially combined with the fact that rRNA sequences are only a small subset of the central Blast database. Reliability must also be factored in as a potential issue (although we believe that the NCBI maintains its servers better than we would), as must maintenance once the central database undergoes changes incompatible with our version of Blast.
Second, NCBI offers pre-formatted Blast databases for download. The advantage of this approach is that once the databases are downloaded, Blast can query them locally instead of over the network. These pre-formatted databases are also smaller than the corresponding FASTA files would be if we downloaded those and built our own database on our local server. There is also a maintenance Perl script, update_blastdb.pl, for downloading up-to-date versions of these pre-formatted databases, which simplifies maintenance. However, the significant disadvantage remains that we do not control the content of these files.
Third, we could download the FASTA files from the Silva database used to populate rPredictorDB and create our own local Blast database using the makeblastdb utility. This deployment option has the advantage that the database is built only from the data rPredictorDB holds, and querying still happens locally on our server. Although setting up Blast and integrating it with rPredictorDB is slightly more complicated, and additional care must be taken to keep the Blast database synchronized with rData, this option offers the best performance.
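The third option boils down to two BLAST+ invocations, sketched below by building their command lines (the file and database names here are hypothetical placeholders, not the actual rPredictorDB paths):

```python
# Sketch of the two BLAST+ invocations behind the third option.
# File and database names are hypothetical placeholders.
make_db_cmd = [
    "makeblastdb",
    "-in", "silva_export.fasta",   # FASTA file downloaded from Silva
    "-dbtype", "nucl",             # nucleotide database, for blastn
    "-out", "rpredictor_blastdb",  # name of the local database files
]
search_cmd = [
    "blastn",
    "-db", "rpredictor_blastdb",   # query the local database
    "-query", "query.fasta",       # nucleotide query sequence(s)
    "-outfmt", "5",                # XML output
    "-out", "results.xml",
]
# e.g. subprocess.run(make_db_cmd, check=True) once per database rebuild,
# then subprocess.run(search_cmd, check=True) for each query.
print(" ".join(search_cmd))
```

The database build runs only when rData changes; queries then touch only the local files.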
We therefore chose the third option. However, even with the local database, we encountered performance problems: the search ran somewhat too slowly. We therefore started to look for faster similarity search tools that could be used as a complement or alternative to Blast.
Note
See Similarity search tools speed measurement for speed measurement results.
While Blast is the de facto standard for similarity search in bioinformatics, there are faster tools available today which are based on a similar approach. One of these is Blat, introduced in 2002 by W. J. Kent.
Blat is a heuristic algorithm for finding areas in a target database which are similar to a sequence under examination (a query). It assesses similarity in order to infer homology (i.e., the existence of a common ancestor between a pair of sequences), which in turn helps determine the biological function of the sequences.
Blat tries to quickly detect short sequences which are most likely homologous, align them, and further expand the homologous areas in both directions. The algorithm is similar to Blast in the sense that it first searches for short segments (seeds) in the database and query sequences that have a certain number of identical elements. Blat maintains an indexed list (hash table) for the target database. The target database is indexed by the coordinates of non-overlapping k-mers (a k-mer is simply a word of k letters), except for very frequent k-mers. This non-overlapping property is the approximation that sets Blat apart from Blast. Blat then builds a list of overlapping k-mers from the query sequence and searches for these k-mers in the target database.
For example, suppose the target database contains the text SIMILARITY SEARCH TOOL and k = 3. The non-overlapping 3-mers indexed from the database will then be:
SIM|ILA|RIT|Y S|EAR|CH |TOO
(the trailing L is shorter than k and is dropped).
We search for:
SIMILAR
Blat will build the following k-mers from the query:
SIM|IMI|MIL|ILA|LAR
And the following k-mers will be found:
SIM, ILA
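The indexing scheme in the example above can be sketched in Python. This is a deliberate simplification: real Blat also stores k-mer positions, filters out very frequent k-mers, and performs alignment and extension on the hits afterwards.

```python
def nonoverlapping_kmers(text, k):
    # Index the target database by NON-overlapping k-mers
    # (this non-overlapping property is Blat's approximation).
    return {text[i:i + k]: i for i in range(0, len(text) - k + 1, k)}

def overlapping_kmers(text, k):
    # The query is decomposed into overlapping k-mers, one per position.
    return [text[i:i + k] for i in range(len(text) - k + 1)]

db_index = nonoverlapping_kmers("SIMILARITY SEARCH TOOL", 3)
query_kmers = overlapping_kmers("SIMILAR", 3)  # SIM, IMI, MIL, ILA, LAR
hits = [kmer for kmer in query_kmers if kmer in db_index]
print(hits)  # ['SIM', 'ILA']
```

Because the database index stores only one k-mer per k positions while the query contributes a k-mer at every position, the index stays roughly k times smaller at the cost of some sensitivity.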
There are several strategies for finding candidate areas of homology: exact matches (for example, four k-mers that are exactly the same), near matches allowing one mismatch between two k-mers, or multiple exact matches near each other. When aligning nucleotides, Blat uses the third strategy and requires two exact matches of size 11 (11-mers).
Blat maintains the index of k-mers in memory, which seems to be its biggest speed advantage over Blast. Blast and Blat also use different strategies for extending homologous areas: Blat can extend on multiple exact and near-exact matches, while Blast extends a potential homologous area only if it finds one or two exact matches close to each other. Blat is slightly more sensitive than Blast: it may report as similar some sequences that would receive a lower similarity score under Blast.
Our measurements show that Blat is approximately four times faster than Blast.
This document describes how similarity search tool speed was measured.
We needed to look for complements or alternatives to Blast because its speed was not completely satisfactory. Other heuristic tools with the desired properties (control over data sources, possibility of local installation) were chosen for performance testing.
Blast was measured as the baseline. We measured the speed of three Blat configurations: standalone Blat without the -ooc flag, standalone Blat with -ooc, and client/server Blat.
Note
Client/server Blat does not support setting the -ooc flag.
We also measured SSAHA, comparing it to the Blat results.
Additional information, such as the time taken to prepare data for the tools, was also measured. If the source FASTA files change, we need to download the new files and prepare them for certain tools. For now, we have only two FASTA files as sources, and they seem to be quite static; however, the number of files may grow in the future, and changes to them may become more frequent than they are today. We have to take this into account in the decision process.
All of the tested similarity search tools are heuristic, because the amount of data to be processed is typically huge and users will not wait for exact results that may be only marginally better than the results of heuristic methods. The tools also differ in setup time: the Blast utility makeblastdb, which converts FASTA files into the Blast database format, takes quite a long time, as can be seen in the measurement results.
Every tool needs different preparations before it can be used for searching.
Measurements were made with 50 000 records in the database file, first with 5 and then with 20 queries.
We measured performance on the SSU and LSU Silva datasets, which are currently in use by rPredictorDB. For increased robustness, we also measured results on a mouse dataset, which contains many more sequences.
The assumption that the results on big data (the mouse dataset) would remain the same, in the sense of which tool is the best, was confirmed.
We take the data and, at the beginning of the measurement script, set two variables: NUM_OF_RECORDS_IN_DB_FILE and NUM_OF_FILES (the latter also indicates the number of tests to be run). For clarity, these values are not provided as script parameters.
python measure.py {mouse|ssu|lsu|clean_mouse|clean_ssu|clean_lsu|clean|generate_mouse|generate_ssu|generate_lsu|measure_mouse|measure_ssu|measure_lsu}
where
We have several databases and several queries, chosen pseudo-randomly. Database files of several selected sizes (small, medium, big, original size) are generated from our three FASTA files (SSU, LSU and mouse). Note: to speed up the measurements themselves, we create a symlink to the whole FASTA file when the requested number of records in the database file is not less than the number of records in the input FASTA file.
We randomly choose N sequences (where N is given by the NUM_OF_FILES parameter) from them to form one test file (database).
Then we randomly choose from N pre-prepared queries. These query sequences are taken from the database files, using a method similar to the one used for generating the database files.
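The sampling step can be sketched as follows. This is a self-contained simplification of what the measurement script does; the real script uses Biopython's SeqIO rather than the toy parser here, and the function names are illustrative only.

```python
import random

def parse_fasta(text):
    # Minimal FASTA parser: returns a list of (header, sequence) pairs.
    records, header, seq = [], None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:].strip(), []
        elif line.strip():
            seq.append(line.strip())
    if header is not None:
        records.append((header, "".join(seq)))
    return records

def sample_records(records, n, seed=0):
    # Pseudo-randomly choose n records to form one test database file;
    # a fixed seed keeps a measurement run reproducible.
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))

fasta = ">seq1\nACGU\n>seq2\nGGCC\n>seq3\nAUAU\n"
test_db = sample_records(parse_fasta(fasta), 2)
```

The sampled records are then written back out in FASTA format as one generated database file; query files are produced the same way from the database files.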
To work with sequences, we use the Biopython package (version 1.63 or later).
For measuring, we use the Unix time facilities (os.times in Python).
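Timing one external tool invocation with os.times can be sketched roughly like this (a sketch only; the actual measure.py may structure this differently):

```python
import os
import subprocess
import sys

def timed_run(cmd):
    # Run a command and return its elapsed wall-clock time in seconds,
    # measured via os.times() as in the measurement script.
    start = os.times().elapsed
    subprocess.run(cmd, check=True)
    return os.times().elapsed - start

# Example: time a trivial Python invocation.
t = timed_run([sys.executable, "-c", "pass"])
```

os.times() also reports children_user and children_system, so CPU time of the spawned tool could be measured the same way if wall-clock time were not the metric of interest.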
Note
The Blat server has a restricted index size, but we verified that both SSU and LSU can be held by one server.
Twenty tests were run, with the number of records in the generated database file set to 50 000.
All results are given in seconds.
Results for standalone Blat without -ooc set:
Measurement | Value |
---|---|
Average time | 0.608 |
Minimal time | 0.13 |
Maximum time | 2.69 |
Median time | 0.13 |
Variation | 0.957750489428 |
Results for standalone Blat with -ooc set:
Measurement | Value |
---|---|
Average time | 0.13 |
Minimal time | 0.12 |
Maximum time | 0.14 |
Median time | 0.13 |
Variation | 0.00547722557505 |
Results for client/server Blat:
Measurement | Value |
---|---|
Average time | 0.0365 |
Minimal time | 0.0 |
Maximum time | 0.19 |
Median time | 0.0 |
Variation | 0.0732307995313 |
Results for SSAHA:
Measurement | Value |
---|---|
Average time | 0.053 |
Minimal time | 0.04 |
Maximum time | 0.09 |
Median time | 0.05 |
Variation | 0.0167630546142 |
Measurement of Blast for comparison:
Measurement | Value |
---|---|
Average time | 0.1565 |
Minimal time | 0.0 |
Maximum time | 0.73 |
Median time | 0.02 |
Variation | 0.280629203755 |
Additional measurement for server/hash table/db preparation:
Results for starting Blat server and preparing index:
Measurement | Value |
---|---|
Average time | 0.1685 |
Minimal time | 0.16 |
Maximum time | 0.18 |
Median time | 0.17 |
Variation | 0.00653834841531 |
Results for precomputing hash table for SSAHA:
Measurement | Value |
---|---|
Average time | 0.4455 |
Minimal time | 0.43 |
Maximum time | 0.46 |
Median time | 0.44 |
Variation | 0.00864580823289 |
Results for Blast db preparation:
Measurement | Value |
---|---|
Average time | 0.5725 |
Minimal time | 0.55 |
Maximum time | 0.6 |
Median time | 0.575 |
Variation | 0.0113468057179 |
The best average-case result comes from client/server Blat.
A simple comparison of the results for Blast and client/server Blat, which are both used by the rPredictorDB site, shows that Blat is approximately four times faster than Blast. If we also take into account the (average) times needed for data preparation, we have:
Blast: 0.1565 + 0.5725 = 0.729
Blat: 0.0365 + 0.1685 = 0.205
Results for standalone Blat without -ooc set:
Measurement | Value |
---|---|
Average time | 5.447 |
Minimal time | 1.49 |
Maximum time | 11.6 |
Median time | 3.905 |
Variation | 3.62403242259 |
Results for standalone Blat with -ooc set:
Measurement | Value |
---|---|
Average time | 5.4345 |
Minimal time | 1.49 |
Maximum time | 11.57 |
Median time | 3.925 |
Variation | 3.61336750829 |
Results for client/server Blat:
Measurement | Value |
---|---|
Average time | 0.5825 |
Minimal time | 0.27 |
Maximum time | 0.74 |
Median time | 0.605 |
Variation | 0.128018553343 |
Results for SSAHA:
Measurement | Value |
---|---|
Average time | 0.8935 |
Minimal time | 0.06 |
Maximum time | 2.57 |
Median time | 0.625 |
Variation | 0.731144821496 |
Measurement of Blast for comparison:
Measurement | Value |
---|---|
Average time | 2.3635 |
Minimal time | 0.87 |
Maximum time | 3.51 |
Median time | 2.565 |
Variation | 0.852439294026 |
Additional measurement for server/hash table/db preparation:
Results for starting Blat server and preparing index:
Measurement | Value |
---|---|
Average time | 1.9285 |
Minimal time | 1.88 |
Maximum time | 2.31 |
Median time | 1.91 |
Variation | 0.0883897618506 |
Results for precomputing hash table for SSAHA:
Measurement | Value |
---|---|
Average time | 5.533 |
Minimal time | 5.4 |
Maximum time | 6.54 |
Median time | 5.475 |
Variation | 0.238644086455 |
Results for Blast db preparation:
Measurement | Value |
---|---|
Average time | 6.068 |
Minimal time | 5.94 |
Maximum time | 6.7 |
Median time | 6.02 |
Variation | 0.164304595188 |
As can be seen, the best average-case result again comes from client/server Blat.
A simple comparison of the results for Blast and client/server Blat, which are both used by the rPredictorDB site, shows that Blat is approximately four times faster than Blast.
If we also take into account the (average) times needed for data preparation, we have:
Blast: 2.3635 + 6.068 = 8.4315
Blat: 0.5825 + 1.9285 = 2.511
Results for standalone Blat without -ooc set:
Measurement | Value |
---|---|
Average time | 0.2315 |
Minimal time | 0.19 |
Maximum time | 0.29 |
Median time | 0.24 |
Variation | 0.0276179289593 |
Results for standalone Blat with -ooc set:
Measurement | Value |
---|---|
Average time | 0.2355 |
Minimal time | 0.19 |
Maximum time | 0.31 |
Median time | 0.235 |
Variation | 0.0321675302129 |
Results for client/server Blat:
Measurement | Value |
---|---|
Average time | 0.0005 |
Minimal time | 0.0 |
Maximum time | 0.01 |
Median time | 0.0 |
Variation | 0.00217944947177 |
Results for SSAHA:
Measurement | Value |
---|---|
Average time | 0.055 |
Minimal time | 0.04 |
Maximum time | 0.07 |
Median time | 0.055 |
Variation | 0.00670820393251 |
Measurement of Blast for comparison:
Measurement | Value |
---|---|
Average time | 0.0395 |
Minimal time | 0.02 |
Maximum time | 0.06 |
Median time | 0.04 |
Variation | 0.00920597631975 |
Additional measurement for server/hash table/db preparation:
Results for starting Blat server and preparing index:
Measurement | Value |
---|---|
Average time | 0.2285 |
Minimal time | 0.18 |
Maximum time | 0.31 |
Median time | 0.24 |
Variation | 0.0330567693521 |
Results for precomputing hash table for SSAHA:
Measurement | Value |
---|---|
Average time | 0.604 |
Minimal time | 0.48 |
Maximum time | 0.73 |
Median time | 0.63 |
Variation | 0.0689492567038 |
Results for Blast db preparation:
Measurement | Value |
---|---|
Average time | 1.3615 |
Minimal time | 1.17 |
Maximum time | 1.48 |
Median time | 1.42 |
Variation | 0.107994212808 |
The best average-case result again comes from client/server Blat.
A simple comparison of the results for Blast and client/server Blat, which are both used by the rPredictorDB site, shows that Blat is even more than four times faster than Blast on this dataset, which is not currently in use.
If we also take into account the (average) times needed for data preparation, we have:
Blast: 0.0395 + 1.3615 = 1.401
Blat: 0.0005 + 0.2285 = 0.229