First, a disclaimer: rPredictorDB is a somewhat complex [1] infrastructure. Setting it up from scratch will require considerable time and energy; it uses numerous programming languages, third-party libraries and applications.
Moreover, you will not get anything beyond what is already available at the rPredictorDB website. The rPredictorDB infrastructure there is being actively developed. Instead of setting up your own rPredictorDB clone, you may want to consider helping develop the rPredictorDB that is already there. See: Can I help?
If you still really wish to set it up yourself, then this guide should get you going (but consider yourself warned).
Note
If you find this guide unclear or incomplete, do not hesitate to contact us.
Before you embark on the setup process, you will need to get access to the rPredictorDB repository and check it out on your system. To get repository access, contact us.
The rPredictorDB infrastructure is known to run on Debian 3.2.54-2 and 8.6 x86_64 GNU/Linux. The infrastructure is untested on other platforms.
If you wish to set up rPredictorDB, you will need to be able to compile/run source files from the following languages:
PHP is used as the main programming language for the website.
Matlab 9.1 is the core language for CP-predict.
C# and Java are used for the ETL layer of rPredictorDB.
C and C++ are used for the ETL layer of rPredictorDB and by various third-party bioinformatics tools that need to be compiled.
PL/pgSQL is used for building the database schema and for importing data into it.
The whole setup process has four distinct parts, each of which consists of multiple steps:
Because the rPredictorDB application is developed in PHP, the first important step is installing an HTTP server to run it. If the application is to be used on one machine only, either a development server or a public web server can be used.
Although several HTTP servers exist, we strongly recommend the Apache server, which can be installed through the default package distribution channel on most GNU/Linux systems.
There are also several packages that bundle PHP with Apache, for example:
Another option is installing the bare Apache HTTP server. The binaries can be downloaded from the official Apache website.
After the Apache HTTP server is installed, PHP must be installed as well. This can also be done through a Linux package distribution channel or from the binaries available on the official PHP website.
The application uses the open-source framework Nette, version 2.0. It is necessary to verify the php.ini and .htaccess settings against the Nette requirements. This should not be a problem in most cases, barring minor corrections. The verification can be done directly from the current rPredictorDB pages.
rPredictorDB uses the PostgreSQL database, which can be downloaded from its website. It can also be installed through the standard Linux package distribution channel. For database administration, we recommend the pgAdmin application or phpPgAdmin (modelled on the well-known phpMyAdmin).
In order to access the database from rPredictorDB, access credentials need to be set in the app/BaseModule/config.neon configuration file in the www branch.
Note
It is necessary to set the database datestyle to “European” (ISO DMY). Otherwise, search by publication date will not work. See PostgreSQL documentation for details.
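To illustrate why the field order matters, here is a minimal Python sketch (not part of rPredictorDB) showing that the same date string yields two different dates depending on whether it is read as day-month-year or month-day-year. The function name parse_date and the sample string are hypothetical; in PostgreSQL itself the order is controlled by the datestyle parameter, for example the value 'ISO, DMY'.

```python
from datetime import date, datetime

# Hypothetical helper: parse a slash-separated date under a given field order.
FORMATS = {
    "DMY": "%d/%m/%Y",  # "European" order: day first
    "MDY": "%m/%d/%Y",  # US order: month first
}

def parse_date(text, order):
    """Parse text such as '03/04/2015' assuming the given field order."""
    return datetime.strptime(text, FORMATS[order]).date()

if __name__ == "__main__":
    ambiguous = "03/04/2015"
    print("DMY:", parse_date(ambiguous, "DMY"))  # 3 April 2015
    print("MDY:", parse_date(ambiguous, "MDY"))  # 4 March 2015
```

A search by publication date that assumes one order while the database assumes the other would silently match the wrong records, or fail outright for days above 12.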
See The ETL layer of rPredictorDB for an overview of what needs to be done to run rETL and populate the rPredictorDB database.
Note
Running rETL to populate the database will take a long time (days), since secondary structure predictions need to be computed, structural features extracted and visualization thumbnails created.
To set up the website itself, simply copy the www branch of the repository so that the index.php file in the www folder of the branch is accessible at the URL where you want the web published.
Make sure owners, groups and permissions are set correctly so that all executable files can actually be executed by the server, temporary directories (www/www/files/, temp/ and subdirectories) are writeable, etc.
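The writeability requirement can be checked with a short script. The following is only a sketch: the function name unwritable_dirs is hypothetical, and it assumes that www/www/files and temp are given relative to the directory where the www branch was copied. It has to be run as the user the web server runs under, otherwise the result says nothing about the server's permissions.

```python
import os

def unwritable_dirs(root, relative_dirs):
    """Return directories under root that are missing or not writeable.

    Each entry of relative_dirs is checked together with all of its
    subdirectories. A directory counts as usable only if the current
    user may both write to it and enter it.
    """
    problems = []
    for rel in relative_dirs:
        top = os.path.join(root, rel)
        if not os.path.isdir(top):
            problems.append(top)  # missing directories are reported too
            continue
        for dirpath, _dirnames, _filenames in os.walk(top):
            if not os.access(dirpath, os.W_OK | os.X_OK):
                problems.append(dirpath)
    return problems

if __name__ == "__main__":
    # Assumption: the script is started from the root of the copied branch.
    for path in unwritable_dirs(".", ["www/www/files", "temp"]):
        print("not writeable or missing:", path)
```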
The configuration file www/config.php contains variables that will be accessible to all classes in the app/ infrastructure. This is the preferred way of setting environment variables for rWeb tools.
The most important variables in the configuration file that need to be set are TOOL_DIR and TMP_DIR. See the tool installation instructions for details.
To properly set up the Nette config file (app/BaseModule/config.neon), see the Nette configuration manual. The most important directives are the database DSN string, the correct timezone and the variables in common/parameters.
The following paths to executable binaries for Blast and for the source database are currently set in the config.php file:
BLAST_DATABASE = /var/data/blast/data/database
BLAST_PATH = /var/rtools/blast/bin
The Blast tool requires the Blast package, version 2.2.28+. The package can be downloaded from the NCBI FTP server. Follow the Blast installation instructions at this NCBI webpage, sections Installation and Configuration (we’ll set up our own database later).
Note
This is not the latest version: during rPredictorDB development, version 2.2.29+ was released, with some minor changes. We have NOT tested whether the rPredictorDB Blast tool will run with the new version.
After the Blast package has been successfully installed and configured (you can verify by running which blastn and getting non-blank output), we’ll need to finish the procedure of installing Blast for rPredictorDB, so that it can be used for searching the rPredictorDB database. To this end, we will need to set up a Blast database over the correct dataset.
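The which check can also be scripted, for instance to verify the directory configured as BLAST_PATH. This is a sketch only; the function name find_blast_tools is hypothetical, and the list of tools is an assumption based on the two Blast programs this guide uses.

```python
import os
import shutil

def find_blast_tools(blast_path=None, tools=("blastn", "makeblastdb")):
    """Map each tool name to its resolved executable path, or None.

    If blast_path is given (e.g. the BLAST_PATH value from config.php),
    only that directory is searched; otherwise the PATH of the current
    environment is used, just like the shell's `which`.
    """
    search_path = blast_path if blast_path else os.environ.get("PATH", "")
    return {tool: shutil.which(tool, path=search_path) for tool in tools}

if __name__ == "__main__":
    for tool, location in find_blast_tools("/var/rtools/blast/bin").items():
        print(tool, "->", location if location else "NOT FOUND")
```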
Source data for Blast (and generally for all similarity search tools) can be found here: a dump of all sequences in rData (and therefore all sequences available for searching in rPredictorDB).
The database is created and filled with data using a utility from the Blast package called makeblastdb:
makeblastdb -dbtype nucl -title newDB -in input.fasta
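where -dbtype nucl states that the input contains nucleotide sequences, -title sets the title of the new database and -in names the input FASTA file.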
This command produces several files: one with the extension nhr, one with nin and one with nsq. If the input file was named input.fasta, files input.fasta.nhr, input.fasta.nin and input.fasta.nsq will be created.
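As a quick sanity check after running makeblastdb, one can test that the three files were really created. The sketch below assumes exactly the naming described above (the input file name with nhr, nin and nsq appended); the function names are hypothetical.

```python
import os

# Extensions of the files makeblastdb creates for a nucleotide database,
# as listed in this guide.
NUCL_DB_EXTENSIONS = ("nhr", "nin", "nsq")

def makeblastdb_command(fasta_path, title):
    """Build the makeblastdb call from this guide as an argument list."""
    return ["makeblastdb", "-dbtype", "nucl", "-title", title, "-in", fasta_path]

def missing_db_files(fasta_path):
    """Return the expected database files that do not exist yet."""
    expected = [f"{fasta_path}.{ext}" for ext in NUCL_DB_EXTENSIONS]
    return [path for path in expected if not os.path.isfile(path)]

if __name__ == "__main__":
    print(" ".join(makeblastdb_command("input.fasta", "newDB")))
    missing = missing_db_files("input.fasta")
    print("database complete" if not missing else f"missing: {missing}")
```

The argument list returned by makeblastdb_command can be passed to subprocess.run directly, which avoids shell quoting problems with unusual file names.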
Note
This is not a part of the installation process itself, but it is useful to test whether the Blast setup went correctly.
Of the several programs provided in the Blast package, our Blast search will use blastn (the “n” stands for nucleotide), which is intended to work on databases of nucleic acid sequences with nucleic acid queries.
After the database is prepared, blastn is ready to process queries. A query is run with the command:
blastn -db database_to_be_searched -query query_file -outfmt 5 -out output_file.xml
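where -db names the database to be searched, -query names the file with the query sequence(s), -outfmt 5 selects XML output and -out names the file the results are written to.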
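To check that a test query returned sensible hits, the XML report written by -outfmt 5 can be read with a few lines of Python. This is a minimal sketch, not the parser used by rPredictorDB: the function name blast_hits is hypothetical, and only a handful of elements from the Blast XML report (Hit_id, Hit_def, Hsp_evalue, Hsp_bit-score) are extracted.

```python
import xml.etree.ElementTree as ET

def blast_hits(root):
    """Collect one record per HSP from a parsed Blast XML report.

    root is the root element of the report, e.g. obtained with
    ET.parse("output_file.xml").getroot().
    """
    hits = []
    for hit in root.iter("Hit"):
        for hsp in hit.iter("Hsp"):
            hits.append({
                "hit_id": hit.findtext("Hit_id"),
                "hit_def": hit.findtext("Hit_def"),
                "evalue": float(hsp.findtext("Hsp_evalue")),
                "bit_score": float(hsp.findtext("Hsp_bit-score")),
            })
    return hits

if __name__ == "__main__":
    report = ET.parse("output_file.xml").getroot()
    # List the hits from the lowest (best) e-value upwards.
    for record in sorted(blast_hits(report), key=lambda r: r["evalue"]):
        print(record["hit_def"], record["evalue"], record["bit_score"])
```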
CP-predict is the custom rRNA secondary structure prediction algorithm. For a description of what it does, see CP-predict: a two-phase algorithm for rRNA structure prediction.
It uses the following config variables:
In order to run CP-predict on your own rPredictorDB infrastructure, you will need:
After installing all prerequisites, install the rPredictorDB program itself. In the package directory (branches/predict/packages/rPredictorDB in the repository), run:
./rPredictorDB_web.install
and follow the instructions. The installer also installs the required Matlab runtime if it is not yet installed. The predictor is then called as:
path/to/run_run_pactool_pairwise_f.sh /usr/local/MATLAB/MATLAB_Runtime/v901/ -sqs=query.fasta -str=template.br -ALM=clustalw2 -EXTEND_MECHANICALLY_LONELY_PAIRS=0 -BOOTSTRAP=0
where path/to/run_run_pactool_pairwise_f.sh is the prediction tool, /usr/local/MATLAB/MATLAB_Runtime/v901/ is the path to the Matlab runtime, -sqs sets the path to the query sequence whose structure is to be predicted, -str sets the path to the template, -ALM defines the tool used for mapping between the query and the template, -EXTEND_MECHANICALLY_LONELY_PAIRS is not yet documented (TODO) and -BOOTSTRAP denotes whether the z-score should be calculated.
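When the predictor is called from a script, it can help to assemble the call from named parameters. The sketch below only reproduces the command line shown above; the function name cp_predict_command and its parameter names are hypothetical, and the defaults are simply the values from the example, not recommended settings.

```python
def cp_predict_command(script, runtime_dir, query_fasta, template,
                       aligner="clustalw2", extend_lonely_pairs=0, bootstrap=0):
    """Build the CP-predict call from this guide as an argument list."""
    return [
        script,                      # path/to/run_run_pactool_pairwise_f.sh
        runtime_dir,                 # path to the Matlab runtime
        f"-sqs={query_fasta}",       # query sequence
        f"-str={template}",          # template
        f"-ALM={aligner}",           # query-to-template mapping tool
        f"-EXTEND_MECHANICALLY_LONELY_PAIRS={extend_lonely_pairs}",
        f"-BOOTSTRAP={bootstrap}",   # whether to calculate the z-score
    ]

if __name__ == "__main__":
    command = cp_predict_command(
        "path/to/run_run_pactool_pairwise_f.sh",
        "/usr/local/MATLAB/MATLAB_Runtime/v901/",
        "query.fasta",
        "template.br",
    )
    print(" ".join(command))
```

The returned list can be handed to subprocess.run unchanged once the placeholder script path is replaced by the real one.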
Footnotes
[1] Brutally complicated.