The ETL (Extraction-Transformation-Load) layer of rPredictorDB is responsible for preparing the rPredictorDB database (rDB), downloading RNA data and filling rDB with the RNA data.
The scripts and other data used in rETL are located in the ETL branch of the rPredictorDB repository.
Running the ETL process requires several tools; they are introduced in the individual phases below (most notably PostgreSQL, Mono/.NET, Java, a C++ compiler and the CP-predict prediction tool).
There are four main sources of the data: SILVA, Rfam, ENA and the NCBI Taxonomy database. They are described in turn below.
SILVA is a high-quality ribosomal RNA database. Besides the web frontend, SILVA's data can be exported as a FASTA file. Each FASTA record consists of a header containing identifiers and the taxonomy, followed by the nucleotide sequence.
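For illustration, a record of the exported FASTA file looks schematically like this (the accession, positions and taxonomic path below are made up for the example):

>AB123456.1.1449 Eukaryota;Opisthokonta;...;Example organism
(nucleotide sequence on the following lines)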
SILVA divides its data into two datasets - LSU and SSU, for the large and small ribosomal subunit respectively. The FASTA files are exported per dataset so that there are separate FASTA files for LSU and SSU. (rPredictorDB does not keep this SSU/LSU split.)
Both SILVA datasets have at least two versions - PARC and REF. While the PARC version simply contains everything, the REF version contains high quality data only. For the SSU dataset, there is a further version called NR (non-redundant). This SSU REF NR dataset avoids duplicates.
Currently, rPredictorDB uses data from SILVA’s SSU REF NR datasets exported in FASTA format, which includes identifiers and taxonomy.
Beyond the FASTA files, SILVA provides some extra information. For both datasets there is a tab-separated CSV file which contains information about the quality of each sequence and some other bioinformatic data.
Important files example |
---|
SSURef_NR99_115_tax_silva_trunc.fasta |
SSUParc_115_quality.csv |
Rfam is an RNA database in which RNAs are classified into families according to their type and function. Besides the web frontend, whole families can be exported as FASTA files.
Similarly to SILVA, Rfam also curates a high-quality subset for each family, called the seed. The seeds are exported in a Stockholm file, which also contains additional information about the individual families.
Important files example |
---|
Rfam.seed.gz |
RF00001.gz |
RF00004.gz |
ENA (European Nucleotide Archive) provides information about nucleotide sequences. As mentioned, the SILVA and Rfam outputs contain basic sequence data only. ENA adds more detailed information: annotations. The annotations consist of various attributes such as the sequence authors, references to publications, the sequence description, etc. This data is not necessary for predicting secondary structures, but it can be very useful for understanding the sequence and, in any case, for searching.
ENA provides various types of exports. One of them is an XML export via a REST-based API, which is used by one of the utilities in rPredictorDB's ETL layer.
Taxonomy (a database of the National Center for Biotechnology Information, NCBI) provides a curated taxonomic tree. Although taxonomic information is contained in ENA and SILVA too, in those sources it is often inconsistent, incomplete or even incorrect. The Taxonomy database provides a single consistent taxonomic tree.
Besides the web frontend, the Taxonomy data can be exported as vertical-bar-separated value files. rPredictorDB uses two of them: nodes.dmp, containing the encoded tree and the taxonomic ranks of individual nodes, and names.dmp, containing names, synonyms, misspellings etc. for the nodes defined in nodes.dmp.
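For illustration, each line of nodes.dmp starts with the node id, the parent node id and the taxonomic rank, separated by vertical bars (only the first few fields are sketched here and the values are illustrative):

9606 | 9605 | species | ...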
Important files example |
---|
taxdmp.zip |
The ETL process of rPredictorDB consists of several phases, described in turn below: database preparation, data extraction and filtering, sequence import, annotation import, secondary structure prediction with structural feature extraction, and finalization.
A phase can typically be divided into two or three parts: extracting (downloading) the data, transforming it into an easily importable form, and loading it into the database. (This is where the abbreviation “ETL” comes from.)
Warning
Note that at the end of the import process, it is necessary to delete the Nette cache files in /www/temp/cache, otherwise the new data might not be accessible right away.
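On a typical deployment this can be done with a command like the following (a sketch, assuming the web root layout from the warning above):

rm -rf /www/temp/cache/*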
All of the ETL sub-phases use some common principles and technologies. Let us describe these techniques first.
There are two approaches used in the downloading phase: manual download and automated download through a dedicated utility.
The first approach is used if there are only a couple of large files or no API is available for automated download.
The second approach is used whenever a large number of files needs to be retrieved (hundreds of thousands). A special utility is prepared for this case.
The second subphase is responsible for transforming the downloaded data into a simple format that can easily be imported into the rPredictorDB database. rPredictorDB uses simple CSV files as an intermediate structure (formally called the staging area) to store pre-processed data before loading it into the database.
Prepared CSVs from the processing step are loaded into the database using PostgreSQL's COPY ... FROM command. We performed some benchmarks and chose this approach as the fastest while still being very simple to use. There is typically an intermediate temporary table used before inserting data into the destination table. Unfortunately, PostgreSQL does not support anything like SQL Server's OPENROWSET for querying CSV files “directly”.
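Conceptually, each load function wraps a COPY statement similar to the following sketch (the table name tmp_fasta is only a placeholder; the real statements are hidden inside the database functions described below):

psql rDB -c "COPY tmp_fasta FROM '/ETL/frontend/SSURef_115.csv' WITH (FORMAT csv, DELIMITER '|')"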
The very first rETL phase is the PostgreSQL database preparation. The following assumes that PostgreSQL (at least version 9.1) is installed and working, with a database created (it does not have to be blank). The PL/pgSQL procedural language extension has to be installed and activated (it is a standard part of PostgreSQL's installation).
All of this phase’s work is done by a function called create_rDB_structures.
The execution is pretty simple:
SELECT create_rdb_structures();
function feature | value |
---|---|
name | create_rDB_structures |
arguments | none |
returns | void |
description | creates the rpredict schema and all of rDB’s tables |
The second task is to filter the sequences from SILVA and Rfam and repair their taxonomic paths, as we import reliable sequences only. This step is different for SILVA and Rfam; however, both versions require downloading the Taxonomy (NCBI) database and the ENA files.
Unlike the other sources, ENA offers no suitable prepared export in CSV or any other convenient format.
Note
There are some bulk export files from ENA, all of which can be accessed via FTP. However, these files are very large - depending on the dataset, tens of gigabytes. Since only high-quality sequences are imported from SILVA, the majority of the content of these exports is of no use to rPredictorDB.
Fortunately, the ENA provides a simple REST based API for downloading annotations for individual sequences. In technical terms, the annotation contains information from the EMBL header belonging to the given sequence. From ENA’s point of view sequences are identified by their accession number. Human users can look for an entry using this approach:
http://www.ebi.ac.uk/ena/data/view/AB542923
where AB542923 is the accession number.
This is nice for human users, but not very useful for rETL. The rescue comes with the ability to export the same data as XML:
http://www.ebi.ac.uk/ena/data/view/AB542923&display=xml
This file contains the annotation as well as the nucleotide sequence. At this point, however, the sequences are already imported into rDB. So, finally, let’s export the header only:
http://www.ebi.ac.uk/ena/data/view/AB542923&display=xml&header=true
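Such a request is easy to script; for example, a single header can be fetched with curl (the output path is arbitrary):

curl -o ENA/AB542923.xml 'http://www.ebi.ac.uk/ena/data/view/AB542923&display=xml&header=true'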
With this, we get all the data we need. To do this programmatically, there is a script called download-ENA.sh that downloads the headers for all sequences specified in the input FASTA file. The script can be called as:
./download-ENA.sh Silva/SILVA_128_SSURef_Nr99_tax_silva.fasta ENA/
Taxonomy is a database curated by the NCBI and its dump can be publicly downloaded through FTP. To make the preparation easier, we created a script called download-NCBI.sh that downloads the relevant files and extracts the necessary information into the specified path. This script can be called as:
./download-NCBI.sh NCBI/
The script uses an application called ncbi that is written in C++ and can be compiled as:
g++ -std=c++11 -o ncbi ncbi.cpp
The database is used to make the taxonomic information consistent and validated. To do this, there is an application called taxonomy. The application is written in C++ and can be compiled using:
g++ -std=c++11 -o taxonomy Taxonomy.cpp
The application takes a FASTA file and, for each sequence, uses its accession number to find the corresponding organism name in the corresponding ENA file; the organism name is then used to reconstruct the sequence’s taxonomic path. The application can be called as:
./taxonomy NCBI/taxonomy.db ENA/ ENA-17_03_24 Silva/SILVA_128_SSURef_Nr99_tax_silva.fasta
Sequences are downloaded using a script called download-SILVA.sh that downloads the required version of the files into the specified directory. This script can be executed this way:
./download-SILVA.sh 128 Silva/
The next step is to repair the taxonomic information of each sequence. To do this, we use the previously mentioned taxonomy application.
Then we choose all sequences belonging to a specified taxon (currently Chordata and Diptera). To do this, we use a script called select.sh that can be called as:
./select.sh Silva/SILVA_128_SSURef_Nr99_tax_silva-ENA_2016_10_07.fasta Chordata
In the final step, we group the remaining sequences according to their taxonomic path, filter out all sequences longer than a threshold (the limit is 2000 bases for Chordata and 2100 bases for Diptera) and select a single representative sequence for each group. To do this, we find the length L of the longest sequence in each group and filter out all sequences shorter than 99% of L. Then we sort the sequences by the following keys:

- the number of unordered pieces in the description taken from the corresponding ENA file (ascending);
- whether the record is a working draft or sequencing is still in progress (complete records preferred);
- the sequence and pintail qualities taken from the SILVA quality file (descending);
- the first-public and last-updated dates taken from the corresponding ENA file (descending);
- the length of the sequence (descending).

Finally, we select only the best (i.e. the first) sequence. If more sequences share the same annotation, the representative sequence is chosen arbitrarily. For this purpose there is a script called filter.sh that can be called as:
./filter.sh Data/SILVA_128_SSURef_Nr99_tax_silva-ENA_2016_10_07-Chordata.fasta ENA/ Silva/SILVA_128_SSURef_Nr99.quality 2000
In the case of Rfam, the process is significantly simpler. We pick up all seed sequences for a given Rfam family. That can be done using an application called Seeds, which can be built as:
javac Seeds.java
and called as:
java Seeds Rfam/Rfam.seed Rfam/RF00622.fasta
Then we add a taxonomic path using the previously mentioned taxonomy application.
Note
In some cases, the boundaries of a sequence in the seed file differ from those in the family file (usually by ±1). In such cases, the longer sequence is taken.
The third complex task is to import the sequences and their accompanying data into the database. Simply put, this step consists of three sub-steps: transforming the FASTA files into intermediate CSV files, loading them into the database, and importing the quality data.
Note
Apart from being uploaded to rDB, the sequences are also used to build a BLAST database that is used for the Sequence search. To do this, there is a script blast.sh that builds a new database from the specified files or adds them to an existing one. It can be called as:
./blast.sh rBlast_16-128 "Chordata.fasta Diptera.fasta"
The FASTA files are parsed and transformed into intermediate CSV files. To achieve this, there is a utility called rETL, available through its frontend rETL_frontend.exe. It is a .NET application running under Mono. The utility is composed of several components.
It can be easily built using the included makefile:
make all
that builds all C#-based utilities. If you want to build the rETL utility only, you can simply call the target:
make ETL
The rETL utility requires a single input parameter - the path to a special configuration XML file which defines what has to be done.
A sample XML:
<import>
<FASTA file="LSURef_115_tax_silva_trunc.fasta" dataset="LSU">
<CSV destinationFile="LSURef_115.csv" separator="|" />
</FASTA>
<FASTA file="SSURef_NR99_115_tax_silva_trunc.fasta" dataset="SSU">
<CSV destinationFile="SSURef_115.csv" separator="|"/>
</FASTA>
<taxonomy>
<new file="LSURef_115_tax_silva_trunc.fasta" />
<new file="SSURef_NR99_115_tax_silva_trunc.fasta" />
<old destinationFile="taxonomy.csv" separator="|"/>
<CSV destinationFile="taxonomy2.csv" separator="|"/>
</taxonomy>
</import>
It is necessary to follow this structure!
A high-level look at the semantics:
Each FASTA element defines one input FASTA file and the CSV file into which it is transformed. The rETL utility is designed to allow implementing various types of data destinations, e.g. files or databases; this part of the transformation is handled by the adapter component. The second part, the taxonomy element, says that the taxonomy has to be processed as well.
A detailed look at the XML elements:
name | note |
---|---|
import | top level element, required |
FASTA | represents one processed FASTA file |
taxonomy | signals that the taxonomy is processed as well |
new | source file for taxonomy processing |
old | previous taxonomy file that should be extended |
CSV | CSV file, input or output |
A detailed look at their attributes:
element | name | note |
---|---|---|
FASTA | file | path to the file |
FASTA | dataset | SILVA dataset name |
new | file | path to the file |
old | destinationFile | path to the file |
old | separator | single character used to delimit CSV items |
CSV | destinationFile | path to the file |
CSV | separator | single character used to delimit CSV items |
Finally, the utility can be executed this way:
mono rETL_frontend.exe import.xml
According to the sample XML, the rETL utility creates three CSV files. The CSV file containing the transformed FASTA files is formatted like:
accession number | start | stop | dataset | sequence | taxonomic path |
---|---|---|---|---|---|
And the transformed taxonomic tree CSV:
id | depth | name | path name | parent id |
---|---|---|---|---|
Last, it is necessary to load the prepared data into the rDB database. Since importing a CSV file into a database is a very common task, there is no “magic” behind this step. The Load phase implementation is packed into functions (or stored procedures; all of them return void). The data load itself is preceded by some preparation (it is necessary to create the appropriate temporary table). Just for this purpose there is a function called prepareIntermediateStructure().
function feature | value |
---|---|
name | prepareIntermediateStructure |
arguments | none |
returns | void |
description | creates temporary table |
function feature | value |
---|---|
name | importFASTA |
argument 1 | parsedFastaFile |
argument 2 | separator |
returns | void |
description | imports given CSV (arg 1) into temporary table |
function feature | value |
---|---|
name | transformIntermediateData |
arguments | none |
returns | void |
description | transforms imported parsed FASTA files from temporary tables into their destination tables |
function feature | value |
---|---|
name | importTaxonomy |
argument 1 | CSVfile |
argument 2 | separator |
returns | void |
description | imports taxonomic tree from CSV file (arg 1) into destination table |
And a typical way to carry out this phase:
SELECT prepareIntermediateStructure();
SELECT importFASTA('/ETL/frontend/LSURef_115.csv'::varchar,'|');
SELECT importFASTA('/ETL/frontend/SSURef_115.csv'::varchar,'|');
SELECT transformIntermediateData();
SELECT importTaxonomy('/ETL/frontend/taxonomy.csv'::varchar,'|');
To complete the SILVA/Rfam part of the import, it remains to process the data containing information about sequence qualities. In the case of SILVA, this part is the simplest one: as mentioned earlier, SILVA provides this information in CSV files. Compared to SILVA, processing Rfam requires an additional step before uploading the file, as Rfam does not provide any quality file.
The only processing of the quality files is performed in the database. All performed transformations are purely technical - parsing and converting values into the corresponding data types.
In the case of Rfam, the quality file is generated from the information in the seed file. To achieve this, there is a utility called Qualities that computes the required information. It is a Java application and can be built using:
javac Qualities.java
Finally, the utility can be executed this way:
java Qualities Rfam/Rfam.seed Rfam/RF00622-seeds.fasta
function feature | value |
---|---|
name | importQuality |
argument 1 | qualityCSV |
argument 2 | subunit |
returns | void |
description | imports qualities into destination table |
And execution:
SELECT importQuality('/SSUParc_115_quality.csv','18S ribosomal RNA');
Now the ETL process leaves the territory of sequences and continues by importing annotations from ENA. Like the sequence load, this process can be decomposed into transforming the downloaded EMBL headers into intermediate CSV files and loading those into rDB.
What is an annotation?
An annotation is a set of extra information about the given sequence. In rPredictorDB, annotations contain the following data:
name | description |
---|---|
annotation | general set of information like description or comments |
references | references to the given sequence, e.g. in journals, papers, etc. |
authors | authors of reference |
applicants | patent applicants |
features | features of the sequence |
qualifiers | key-value pairs with detailed information about given feature |
xrefs | references to other databases where the given sequence can be found |
Since the ENA files were already used to repair the taxonomic information in the extraction phase, they should already be downloaded and it is not necessary to download them again. Now we need to extract all the fields and save them to the intermediate CSV files. This is not logically complex.
To do this programmatically, there is a utility called AnnotationsParser (AP). It is written in C# and runs under Mono or the .NET framework. The executable takes the parts of a database connection string as parameters; the connection is needed to get all accession numbers from rDB. The only exception is the first parameter - it says how many parallel threads should be used for downloading missing ENA files.
If you want to build this utility only, you can call make this way:
make annotationsParser
Running the utility can look like this:
mono annotationsParser.exe 4 localhost 5432 rpredictor password rDB
Simply put, the AP gets all accession numbers from rDB and for each of them retrieves the EMBL header XML. The XML is parsed and the results are exported into several files, one per annotation feature.
annotation feature | file |
---|---|
annotations | imp_annotations.csv |
applicants | imp_applicants.csv |
authors | imp_authors.csv |
features | imp_features.csv |
keywords | imp_keywords.csv |
qualifiers | imp_qualifiers.csv |
references | imp_references.csv |
xrefs | imp_xrefs.csv |
Loading the created files into the database is implemented the same way as the previous loads: there are database functions doing all the work. All of them have the same signature, so it is described only once, followed by the list of function names.
feature | description |
---|---|
return type | void |
argument 1 | path to the CSV file |
argument 2 | one character used as fields delimiter |
description | imports the given CSV file (arg 1) into its destination table |
And the promised list of functions: importAnnotations, importFeatures, importQualifiers, importReferences, importAuthors, importApplicants, importXrefs and importKeywords.
And a sample of how to execute the functions:
SELECT importAnnotations('/imp_annotations.csv','|');
SELECT importFeatures('/imp_features.csv','|');
SELECT importQualifiers('/imp_qualifiers.csv','|');
SELECT importReferences('/imp_references.csv','|');
SELECT importAuthors('/imp_authors.csv','|');
SELECT importApplicants('/imp_applicants.csv','|');
SELECT importXrefs('/imp_xrefs.csv','|');
SELECT importKeywords('/imp_keywords.csv','|');
At this point, all data about primary structures (sequences) are present in rDB, and the process now aims to fill the last part of rDB with predicted secondary structures. Like the steps mentioned above, this one is also divided into two substeps: computing the predictions (together with their structural features) and loading them into rDB.
Bulk prediction (= batch prediction) is the trickiest part of the whole ETL process. It takes a list of primary structures (sequences) and predicts secondary structures for them. To support structural analysis, rPredictorDB stores structural features as well.
Warning
This part is highly dependent on the environment and prediction algorithm.
The first thing that has to be done is to replace wildcards in sequences as the current prediction algorithm is not able to handle them.
Note
Wildcards are e.g. N, which represents an uNknown nucleotide, R, which represents adenine OR guanine, or B, which represents anything except adenine.
Of course, the question is which nucleotide should be used to replace a wildcard. This question has no clear answer. Thus we determine the distribution of nucleotides in the sequence and then use the most frequent nucleotide allowed by the wildcard. If more allowed nucleotides have the same frequency, an arbitrary one is chosen. To do this, there is an application called replacement, written in C++, that can be compiled using:
g++ -std=c++11 -o replacement prediction/replacement.cpp
The application processes an input FASTA file and prints the resolved sequences to standard output.
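A typical invocation could therefore look like this (a sketch; the output file name is arbitrary):

./replacement Chordata.fasta > Chordata-resolved.fasta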
For the prediction itself, another tool (or a set of tools) is needed. In our case it is CP-predict, a two-phase algorithm for rRNA structure prediction. The current version of the prediction algorithm is written in MATLAB and is distributed as an installer for UNIX-like systems; it requires the MATLAB runtime (which can be downloaded automatically during installation). The application takes a template and sequences in FASTA format as input, saves each predicted structure into a separate dot-bracket file and prints some statistics to standard output.
Finally, the predicted structures are merged and transformed into one CSV file in the format:
accession number | start position | stop position | structure in dot-paren format |
---|---|---|---|
To make this process easier, there is a script called predict.sh that groups all the steps together. This script takes a template structure and the sequences to be predicted as input and produces one structure file in dot-bracket format for each input sequence and one CSV file with all predicted structures. The script can be called as:
./predict.sh templates/4v6x.br Chordata.fasta
The other goal is to extract Structural features from the created predictions. This is not as time-consuming as the prediction itself, and the workflow is a bit simpler: the features are first extracted from the structure files by the Extract utility and then converted into CSV files by featuresToCSV.exe.
Extract is a utility written in Java and can be built as:
javac Extract.java
The utility takes as arguments a list of directories to process and can be called as:
java Extract working_dir/*/
The featuresToCSV.exe utility can be built with:
make featuresToCSV
The utility takes two arguments:
feature | description |
---|---|
assembly name | featuresToCSV.exe |
argument 1 | path to all_features |
argument 2 | separator character for created CSV files |
output | creates two CSV files for each struct. feature type |
The utility is called as:
mono featuresToCSV.exe /all_features ';'
It creates two CSV files for each structural feature type - one containing the instances of the feature and one storing the positions where each instance appears. In the current version, the output is:
feature | feature file | positions file |
---|---|---|
5-overhang | foverhangs.csv | foverhangsPositions.csv |
3-overhang | toverhangs.csv | toverhangsPositions.csv |
bulge | bulges.csv | bulgesPositions.csv |
hairpin | hairpins.csv | hairpinsPositions.csv |
helix | helices.csv | helicesPositions.csv |
junction | junctions.csv | junctionsPositions.csv |
loop (internal) | loops.csv | loopsPositions.csv |
As in the case of prediction, there is a script called extract.sh that groups all the steps together. It takes the directories in which to look for structure files. The script can be called as:
./extract.sh Chordata/ Diptera/
Once the prediction is over, the predictions can be loaded into rDB - this involves only executing database functions with the proper file paths.
function feature | value |
---|---|
function name | importStructures |
argument 1 | path to the dotparHeap.dp |
argument 2 | character used as CSV separator |
argument 3 | algorithm used for the prediction |
return type | void |
function feature | value |
---|---|
function name | importStructuralFeatures |
argument 1 | path to foverhangs.csv |
argument 2 | path to foverhangsPositions.csv |
argument 3 | path to toverhangs.csv |
argument 4 | path to toverhangsPositions.csv |
argument 5 | path to helices.csv |
argument 6 | path to helicesPositions.csv |
argument 7 | path to bulges.csv |
argument 8 | path to bulgesPositions.csv |
argument 9 | path to loops.csv |
argument 10 | path to loopsPositions.csv |
argument 11 | path to junctions.csv |
argument 12 | path to junctionsPositions.csv |
argument 13 | path to hairpins.csv |
argument 14 | path to hairpinsPositions.csv |
argument 15 | character used as CSV separator |
return type | void |
The last part of the prediction is the visualization of the predicted structures. To do this, we use the external tool RNAplot, which creates the secondary structure visualizations, and Inkscape, which creates the thumbnails. Both tools are called through the script plot.sh, which takes a list of dot-paren files as input and can be called as:
./plot.sh working_dir/*.br
The very last step (well, finally!) of the rETL process is very simple - it only involves the deletion of temporary tables and the creation of auxiliary indices and summary values.
After the rETL process, some unnecessary tables survive. The cleanup function gets rid of them.
function feature | value |
---|---|
function name | cleanup |
arguments | none |
return type | void |
For production usage, indices must be present in every relational database. Some of the rETL functions create indices, but there remain columns which should be indexed. There is a script called 00_CREATE.sql in the 02_INDICES subfolder of the database branch that creates the indices used in our reference implementation (service). However, the right set of indices is highly dependent on the requested behaviour and the most common queries.
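As an illustration only, an additional index can be created like this (the table and column names below are placeholders and have to be adapted to the actual schema and query load):

psql rDB -c "CREATE INDEX idx_sequence_accession ON rpredict.sequence (accession)"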
Warning
This cannot typically be executed in one go. Some of the utilities need data previously imported into rDB and then produce output for the next database function.
The correct order (one of the correct orders) is:
SELECT create_rdb_structures();
SELECT prepareIntermediateStructure();
SELECT importFASTA('/ETL/frontend/SSURef_115.csv'::varchar,'|');
SELECT transformIntermediateData();
SELECT importTaxonomy('/ETL/frontend/taxonomy.csv'::varchar,'|');
SELECT importQuality('/SSUParc_115_quality.csv'::varchar,'18S ribosomal RNA'::varchar);
SELECT importAnnotations('/imp_annotations.csv'::varchar,'|');
SELECT importFeatures('/imp_features.csv'::varchar,'|');
SELECT importQualifiers('/imp_qualifiers.csv'::varchar,'|');
SELECT importReferences('/imp_references.csv'::varchar,'|');
SELECT importAuthors('/imp_authors.csv'::varchar,'|');
SELECT importApplicants('/imp_applicants.csv'::varchar,'|');
SELECT importXrefs('/imp_xrefs.csv'::varchar,'|');
SELECT importKeywords('/imp_keywords.csv'::varchar,'|');
SELECT importStructures('/dotpar_heap.db'::varchar,';','cp-predict v. 16.12.07'::varchar);
SELECT importStructuralFeatures(
'/toverhangs.csv'::varchar, '/toverhangsPositions.csv'::varchar,
'/foverhangs.csv'::varchar, '/foverhangsPositions.csv'::varchar,
'/helices.csv'::varchar, '/helicesPositions.csv'::varchar,
'/bulges.csv'::varchar, '/bulgesPositions.csv'::varchar,
'/loops.csv'::varchar, '/loopsPositions.csv'::varchar,
'/junctions.csv'::varchar, '/junctionsPositions.csv'::varchar,
'/hairpins.csv'::varchar, '/hairpinsPositions.csv'::varchar,
';');
SELECT cleanup();
SELECT setTaxonomyAmounts();