2. rPredictorDB setup

First, a disclaimer: rPredictorDB is a somewhat complex [1] infrastructure. Setting it up from scratch will require considerable time and energy; it uses numerous programming languages, third-party libraries and applications.

Moreover, you will not get anything beyond what is already available at the rPredictorDB website. The rPredictorDB infrastructure there is being actively developed. Instead of setting up your own rPredictorDB clone, you may want to consider helping develop the rPredictorDB that is already there. See: Can I help?

If you still really wish to set it up yourself, then this guide should get you going (but consider yourself warned).

Note

If you find this guide unclear or incomplete, do not hesitate to contact us.

2.1. Before starting

Before you embark on the setup process, you will need to get access to the rPredictorDB repository and check it out on your system. To get repository access, contact us.

2.2. Platform

The rPredictorDB infrastructure is known to run on Debian 3.2.54-2 and 8.6 x86_64 GNU/Linux. The infrastructure is untested on other platforms.

2.3. Programming languages

If you wish to set up rPredictorDB, you will need to be able to compile and run source files written in the following languages:

  • PHP
  • Matlab 9.1
  • C#
  • Java
  • C, C++
  • PL/pgSQL

PHP is used as the main programming language for the website.

Matlab 9.1 is the core language of CP-predict.

C# and Java are used for the ETL layer of rPredictorDB.

C and C++ are used for the ETL layer of rPredictorDB and by various third-party bioinformatics tools that need to be compiled.

PL/pgSQL is used for building the database schema and for importing data into it.

2.4. Setup process overview

The whole setup process has four distinct parts, each of which consists of multiple steps:

  1. Installing rWeb and rData requirements
    1. HTTP server
    2. Nette
    3. PostgreSQL
  2. Setting up rWeb
    1. rWeb
    2. Configuring rWeb
  3. Setting up rData
    1. rETL
  4. Installing individual tools: for each tool -
    1. Installing tool & requirements,
    2. Configuring the tool for rPredictorDB

2.5. HTTP server

Because the rPredictorDB application is developed in PHP, the first important step is installing an HTTP server to run it. The application can run on a local development server (if it will be used on one machine only) or on a public web server.

Although several HTTP servers exist, we strongly recommend the Apache server, which can be installed through the default package distribution channel on most GNU/Linux systems.

There are also several packages that bundle Apache together with PHP.

Another option is installing the bare Apache HTTP server. The binaries can be downloaded from the official Apache website.

After installing the Apache HTTP server, PHP must be installed as well. This can also be done through a Linux package distribution channel or from the binaries available on the official PHP website.

2.6. Nette

The application uses the open-source framework Nette, version 2.0. It is necessary to verify the php.ini and .htaccess settings against the Nette requirements. This should not be a problem in most cases, barring minor corrections. The verification can be done directly from the current rPredictorDB pages.

2.7. PostgreSQL

rPredictorDB uses the PostgreSQL database, which can be downloaded from its website. It can also be installed through the standard Linux package distribution channel. For database administration, we recommend the pgAdmin application or phpPgAdmin (a PostgreSQL counterpart of the well-known phpMyAdmin).

In order to access the database from rPredictorDB, access credentials need to be set in the app/BaseModule/config.neon configuration file in the www branch.

Note

It is necessary to set the database datestyle to “European” (ISO DMY). Otherwise, search by publication date will not work. See PostgreSQL documentation for details.
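One possible way to apply the datestyle note above is to set the parameter per database with an ALTER DATABASE statement. The sketch below only builds the SQL text. The database name rpredictordb is a placeholder (an assumption, not the name used by the project), and the value 'ISO, DMY' is our reading of the "European" style named in the note. The setting applies to sessions opened after the statement has run.

```python
def datestyle_statement(database: str, style: str = "ISO, DMY") -> str:
    """Build an ALTER DATABASE statement that pins the datestyle.

    The database name passed in is hypothetical; use your own.
    """
    # Quote the identifier and the value so embedded quotes stay harmless.
    quoted_db = '"' + database.replace('"', '""') + '"'
    quoted_style = "'" + style.replace("'", "''") + "'"
    return f"ALTER DATABASE {quoted_db} SET datestyle TO {quoted_style};"


if __name__ == "__main__":
    # "rpredictordb" is a placeholder database name.
    print(datestyle_statement("rpredictordb"))
```

The printed statement can be run through psql or pgAdmin by a user allowed to alter the database.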

2.8. rETL

See The ETL layer of rPredictorDB for an overview of what needs to be done to run rETL and populate the rPredictorDB database.

Note

Running rETL to populate the database will take a long time (days), since secondary structure predictions need to be computed, structural features extracted and visualization thumbnails created.

2.9. rWeb

To set up the website itself, simply copy the www branch of the repository so that the index.php file in the www folder of the branch is accessible at the URL where you want the website published.

Make sure owners, groups and permissions are set correctly so that all executable files can actually be executed by the server, temporary directories (www/www/files/, temp/ and subdirectories) are writeable, etc.
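A quick way to check the writable directories mentioned above is sketched below. It is only an illustration: the helper name unwritable_dirs is ours, the directory list is taken from the paragraph above, and the check reflects the permissions of the user running the script, so it is meaningful only when run as the web server user.

```python
import os
from pathlib import Path


def unwritable_dirs(root, relative_dirs):
    """Return the directories under root that are missing or not writable."""
    problems = []
    for rel in relative_dirs:
        path = Path(root) / rel
        # A directory needs both write and search (execute) permission
        # before files can be created inside it.
        if not path.is_dir() or not os.access(path, os.W_OK | os.X_OK):
            problems.append(str(path))
    return problems


if __name__ == "__main__":
    # Paths are relative to the checked-out www branch (assumed to be ".").
    for bad in unwritable_dirs(".", ["www/files", "temp"]):
        print("not writable or missing:", bad)
```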

2.9.1. Configuring rWeb

The configuration file www/config.php contains variables that will be accessible to all classes in the app/ infrastructure. This is the preferred way of setting environment variables for rWeb tools.

The most important variables in the configuration file that need to be set are TOOL_DIR and TMP_DIR. See the tool installation instructions for details.

To properly set up the Nette config file (app/BaseModule/config.neon), see the Nette configuration manual. The most important directives are the database DSN string, the correct timezone and the variables in common/parameters.

2.10. Blast

The following paths to executable binaries for Blast and for the source database are currently set in the config.php file:

BLAST_DATABASE = /var/data/blast/data/database
BLAST_PATH = /var/rtools/blast/bin

2.10.1. Requirements

  • The Blast package, version 2.2.28+. The package can be downloaded from the NCBI FTP server. Follow the Blast installation instructions at this NCBI webpage, sections Installation and Configuration (we’ll set up our own database later).

    Note

    This is not the latest version: during rPredictorDB development, version 2.2.29+ was released, with some minor changes. We have NOT tested whether the rPredictorDB Blast tool will run with the new version.

2.10.2. Installation

After the Blast package has been successfully installed and configured (you can verify by running which blastn and getting non-blank output), we’ll need to finish the procedure of installing Blast for rPredictorDB, so that it can be used for searching the rPredictorDB database. To this end, we will need to set up a Blast database over the correct dataset.
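The which blastn check above can also be scripted, which is handy when several binaries have to be present. The following is a minimal sketch: the helper name missing_tools is ours, and the list of programs (blastn and makeblastdb) is simply the two Blast programs this guide uses.

```python
import shutil


def missing_tools(names):
    """Return the program names that cannot be found on the PATH."""
    # shutil.which mirrors the shell's `which`: it returns None when
    # no executable with that name is found.
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(["blastn", "makeblastdb"])
    if missing:
        print("not found on PATH:", ", ".join(missing))
    else:
        print("all Blast programs found")
```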

2.10.2.1. Blast Data

Source data for Blast (and generally for all similarity search tools) can be found here. It is a dump of all sequences in rData (and therefore of all sequences available for searching in rPredictorDB).

2.10.2.2. Blast database setup

The database is created and filled with data using a utility from the Blast package called makeblastdb:

makeblastdb -dbtype nucl -title newDB -in input.fasta

where:

  • -dbtype nucl says that the database will contain sequences of nucleotides (Blast can also work with amino-acids, for databases of proteins),
  • -title newDB sets the title of the newly created database (the database files themselves are named after the input file unless -out is given),
  • -in input.fasta says that the database will be filled with data from the file input.fasta. This is a file containing all the sequences among which we will be searching when the Blast tool is deployed; we created this file in the previous step from the SILVA database exports.

This command produces several files: one with the extension nhr, one with nin and one with nsq. If the input file was named input.fasta, files input.fasta.nhr, input.fasta.nin and input.fasta.nsq will be created.
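The database build can be wrapped in a small script that runs makeblastdb and then checks that the three files named above exist. This is a sketch under assumptions: the function names are ours, the flags are exactly those of the command shown above, and the file check covers only the nhr, nin and nsq files described in this guide.

```python
import subprocess
from pathlib import Path


def makeblastdb_command(fasta, title):
    """Build the makeblastdb argument list used in this guide."""
    return ["makeblastdb", "-dbtype", "nucl", "-title", title, "-in", fasta]


def expected_index_files(fasta):
    """Names of the files makeblastdb is expected to create for fasta."""
    return [f"{fasta}.{ext}" for ext in ("nhr", "nin", "nsq")]


def build_database(fasta, title):
    """Run makeblastdb and report which expected files are missing."""
    # check=True raises CalledProcessError if makeblastdb fails.
    subprocess.run(makeblastdb_command(fasta, title), check=True)
    return [name for name in expected_index_files(fasta)
            if not Path(name).is_file()]
```

Calling build_database("input.fasta", "newDB") should return an empty list when the build succeeded.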

2.10.2.3. Running Blast

Note

This is not a part of the installation process itself, but it is useful to test whether the Blast setup went correctly.

Of the several programs provided in the Blast package, our Blast search will use blastn (the “n” stands for nucleotide), which is intended to work on databases of nucleic acid sequences with nucleic acid queries.

After the database is prepared, blastn is ready to process queries. A query is run with the command:

blastn -db database_to_be_searched -query query_file -outfmt 5 -out output_file.xml

where

  • -db database_to_be_searched names a database previously created by the makeblastdb utility. The value is the database base name without the index extension (in our previous example, input.fasta, which refers to input.fasta.nin and its sibling files),
  • -query query_file specifies the file with the query sequence (not a FASTA file, only the sequence!). Optionally, multiple query sequences can be given, each on its own line,
  • -outfmt 5 specifies that Blast should return its results as an XML file,
  • -out output_file.xml specifies the output file name. The search results will be stored in this file. The appropriate suffix depends on the -outfmt argument.
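To test the setup end to end, it helps to read the XML output back and list the hits. The sketch below uses only the Python standard library and the element names of the Blast XML format produced by -outfmt 5 (Hit, Hit_def, Hit_hsps, Hsp, Hsp_evalue, Hsp_bit-score). The function names are ours, and only the first HSP of each hit is reported, which is a simplification.

```python
import xml.etree.ElementTree as ET


def summarize_hits(root):
    """Return (description, e-value, bit score) for each hit's first HSP."""
    hits = []
    for hit in root.iter("Hit"):
        hsp = hit.find("Hit_hsps/Hsp")
        if hsp is None:
            continue  # a hit without alignments carries no scores
        hits.append((
            hit.findtext("Hit_def"),
            float(hsp.findtext("Hsp_evalue")),
            float(hsp.findtext("Hsp_bit-score")),
        ))
    return hits


def summarize_file(path):
    """Parse a Blast XML result file, e.g. output_file.xml."""
    return summarize_hits(ET.parse(path).getroot())
```

An empty list from summarize_file("output_file.xml") means that the query produced no hits, or that the file was written with a different -outfmt value.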

2.11. Cppredict

CP-predict is the custom rRNA secondary structure prediction algorithm. For a description of what it does, see CP-predict: a two-phase algorithm for rRNA structure prediction.

It uses the following config variables:

  • CPPREDICT2_PATH a path to the rPredictorDB program
  • WILDCARDS_PATH a path to the program that replaces wildcards in the query sequence
  • CPPREDICT2_TEMPLATES a path to templates used for the prediction

2.11.1. Requirements

In order to run Cppredict on your own rPredictorDB infrastructure, you will need:

  • Matlab 9.1
  • the Vienna RNA package
  • clustalw2, which can be found somewhat unorthodoxly at the Help page of its website
  • ggsearch, part of the FASTA package
  • Optionally, if you want to set up your own templates from structures in the PDB, you will need x3dna-dssr
  • replacement, the C++ program, in branches/predict/packages/replacement.
  • Finally, the rPredictorDB Matlab program, in branches/predict/packages/rPredictorDB.

2.11.2. Installation

After installing all prerequisites, install the rPredictorDB program itself. In the package directory (branches/predict/packages/rPredictorDB in the repository), run:

./rPredictorDB_web.install

and follow the instructions. The installer also installs the required Matlab runtime if it is not installed yet. The predictor is then called as:

path/to/run_run_pactool_pairwise_f.sh /usr/local/MATLAB/MATLAB_Runtime/v901/ -sqs=query.fasta -str=template.br -ALM=clustalw2 -EXTEND_MECHANICALLY_LONELY_PAIRS=0 -BOOTSTRAP=0

where:

  • path/to/run_run_pactool_pairwise_f.sh is the prediction tool,
  • /usr/local/MATLAB/MATLAB_Runtime/v901/ is the path to the Matlab runtime,
  • -sqs sets the path to the query sequence whose structure is to be predicted,
  • -str sets the path to the template,
  • -ALM defines the tool used for mapping between the query and the template,
  • -EXTEND_MECHANICALLY_LONELY_PAIRS sets TODO,
  • -BOOTSTRAP denotes whether the z-score should be calculated.
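When the predictor is called from a script, the argument list can be assembled in one place so that the key=value options stay consistent. The sketch below mirrors the call shown above. The function name and its parameter names are ours, and the default values are simply those of the example call, not documented defaults of the tool.

```python
def cppredict_command(script, runtime, query, template,
                      aligner="clustalw2", extend_lonely_pairs=0, bootstrap=0):
    """Build the argument list for the prediction script.

    Defaults copy the example call in this guide; they are assumptions,
    not documented defaults of the tool.
    """
    return [
        script,
        runtime,
        f"-sqs={query}",
        f"-str={template}",
        f"-ALM={aligner}",
        f"-EXTEND_MECHANICALLY_LONELY_PAIRS={extend_lonely_pairs}",
        f"-BOOTSTRAP={bootstrap}",
    ]


if __name__ == "__main__":
    # Paths are the placeholders used in the example call above.
    print(" ".join(cppredict_command(
        "path/to/run_run_pactool_pairwise_f.sh",
        "/usr/local/MATLAB/MATLAB_Runtime/v901/",
        "query.fasta",
        "template.br",
    )))
```

The resulting list can be passed directly to subprocess.run, which avoids shell quoting problems with paths that contain spaces.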

Footnotes

[1] Brutally complicated.