.. _Technical-rDataETL:

The ETL layer of rPredictorDB
*****************************

The ETL (Extraction-Transformation-Load) layer of rPredictorDB is responsible for preparing the rPredictorDB database (rDB), downloading RNA data and filling rDB with the RNA data.

The scripts and other data used in rETL are located in the ``ETL`` branch of the rPredictorDB repository.

.. _Technical-rDataETL-requirements:

Requirements
============

Running the ETL process requires:

* Mono, to run the .NET assemblies
* the Mono compiler (dmcs), to compile the C# sources
* a JRE, to run the .jar assemblies
* a JDK, to compile the Java sources
* a C++ compiler, to compile the C++ sources

.. _Technical-rDataETL-sources:

Data sources
============

There are four main sources of the data:

* :ref:`Technical-rDataETL-sources-silva`
* :ref:`Technical-rDataETL-sources-rfam`
* :ref:`Technical-rDataETL-sources-ena`
* :ref:`Technical-rDataETL-sources-taxonomy`

.. _Technical-rDataETL-sources-silva:

SILVA
-----

SILVA is a high-quality ribosomal RNA database. Besides the web frontend of SILVA, its data can be exported as FASTA files. Each FASTA record consists of a header containing identifiers and the taxonomy, followed by the nucleotide sequence.

SILVA divides its data into two datasets - LSU and SSU, for the large and small ribosomal subunit respectively. The FASTA files are exported per dataset, so there are separate FASTA files for LSU and SSU. (rPredictorDB does not keep this SSU/LSU split.)

Both SILVA datasets have at least two versions - PARC and REF. While the PARC version simply contains everything, the REF version contains high-quality data only. For the SSU dataset, there is a further version called NR (non-redundant). This SSU REF NR dataset avoids duplicates. Currently, rPredictorDB uses data from SILVA's SSU REF NR datasets exported in FASTA format, which includes identifiers and taxonomy.

Beyond the FASTA files, SILVA provides some extra information. For both datasets there is a tab-separated CSV file which contains information about the quality of each sequence and some other bioinformatical data.

========================================================== =
Important files example
========================================================== =
SSURef_NR99_115_tax_silva_trunc.fasta
SSUParc_115_quality.csv
========================================================== =

.. _Technical-rDataETL-sources-rfam:

Rfam
----

Rfam is an RNA database with RNAs classified into families according to their type and function. Besides the web frontend of Rfam, whole families can be exported as FASTA files.

Similar to the SILVA database, Rfam also curates a high-quality subset, called the *seed*, for each family. These seeds are exported in a Stockholm file. In addition to the seeds, this file also contains additional information about the individual families.

========================================================== =
Important files example
========================================================== =
Rfam.seed.gz
RF00001.gz
RF00004.gz
========================================================== =

.. _Technical-rDataETL-sources-ena:

ENA
---

ENA (*European Nucleotide Archive*) provides information about nucleotide sequences. As mentioned, the SILVA and Rfam exports contain basic sequence data only. ENA adds more detailed information: *annotations*.
The annotations consist of various attributes like sequence authors, references to publications, sequence description, etc. This data is not necessary for predicting secondary structures, but it can be very useful for understanding a sequence and, in any case, for searching. ENA provides various types of exports. One of them is an XML export via a REST-based API, which is used by one of the utilities from rPredictorDB's ETL layer.

.. _Technical-rDataETL-sources-taxonomy:

Taxonomy (NCBI)
---------------

Taxonomy, a database maintained by the NCBI (*The National Center for Biotechnology Information*), provides a curated taxonomic tree. Although taxonomic information is contained in ENA and SILVA too, the taxonomic information in those sources is often inconsistent, incomplete or even incorrect. The Taxonomy database provides a single, consistent taxonomic tree. Besides the web frontend of Taxonomy, its data can be exported as vertical-bar-separated values files. rPredictorDB uses two files: *nodes.dmp*, containing the encoded tree and the taxonomic ranks of the individual nodes; and *names.dmp*, containing names, synonyms, misspellings etc. for the nodes defined in *nodes.dmp*.

========================================================== =
Important files example
========================================================== =
taxdmp.zip
========================================================== =

.. _Technical-rDataETL-process:

ETL process phases
==================

The ETL process of rPredictorDB consists of several phases:

#. :ref:`Technical-rDataETL-database-preparation`
#. :ref:`Technical-rDataETL-importing-sequences`
#. :ref:`Technical-rDataETL-importing-qualities`
#. :ref:`Technical-rDataETL-importing-annotations`
#. :ref:`Technical-rDataETL-importing-predictions`

A phase can typically be divided into two or three parts:

#. :ref:`Technical-rDataETL-phase-implementation-extraction`
#. :ref:`Technical-rDataETL-phase-implementation-transformation`
#. :ref:`Technical-rDataETL-phase-implementation-load`

(This is where the abbreviation "ETL" comes from.)

.. warning::

   Note that at the end of the import process, it is necessary to delete the Nette cache files in ``/www/temp/cache``, otherwise the new data might not be accessible right away.

.. _Technical-rDataETL-phase-implementation:

Structure of an ETL phase
=========================

All of the ETL sub-phases use some common principles and technologies. Let us describe these techniques.

.. _Technical-rDataETL-phase-implementation-extraction:

Downloading (Extraction)
------------------------

There are two approaches used in the downloading phase:

* manual file download
* automated file download

The first approach is used when there are only a few large files or when no API is available for automated download. The second approach is used whenever a large number of files needs to be retrieved (hundreds of thousands). For this case a special utility is prepared.

.. _Technical-rDataETL-phase-implementation-transformation:

Processing (Transformation)
---------------------------

The second subphase is responsible for transforming the downloaded data into a simple format that can be easily imported into the rPredictorDB database. rPredictorDB uses simple CSV files as an intermediate structure (formally called the *staging area*) to store pre-processed data before loading it into the database.

.. _Technical-rDataETL-phase-implementation-load:

Importing data into rDB (Load)
------------------------------

Prepared CSVs from the processing step are loaded into the database using PostgreSQL's ``COPY ... FROM`` command. We performed some benchmarks and decided to use this approach as the fastest one while still being very simple to use. There is typically an intermediate temporary table used before inserting the data into the destination table. Unfortunately, PostgreSQL does not support anything like SQL Server's ``OPENROWSET`` for querying CSV files "directly".

.. _Technical-rDataETL-schema:

Complete ETL process schema
===========================

.. figure:: ETL_schema.svg
   :figwidth: 650
   :width: 650
   :align: center

   Schematic diagram displaying the ETL process. The diagram is divided into three parts, as the imports from SILVA and Rfam differ. Sections a) and b) show the import of sequences from SILVA and Rfam, respectively, together with the import of their annotations. Section c) shows the prediction and visualisation phases for both SILVA and Rfam, as these phases are the same. The files in dotted red denote which files are used at the beginning of the prediction phase.

.. _Technical-rDataETL-database-preparation:

Database preparation
====================

The very first rETL phase is the preparation of the PostgreSQL database. The following assumes that PostgreSQL (at least version 9.1) is installed and working, with a database created (it does not have to be blank). The PL/pgSQL procedural language extension has to be installed and activated (it is a standard part of a PostgreSQL installation).

All of this phase's work is done by a function called ``create_rDB_structures``. The execution is pretty simple::

   SELECT create_rdb_structures();

.. _Technical-rDataETL-database-preparation-functions:

Database functions
------------------

========================== =================================
function feature           value
========================== =================================
name                       **create_rDB_structures**
arguments                  none
returns                    void
description                creates the rpredict schema and all of rDB's tables
========================== =================================

.. _Technical-rDataETL-filtering-sequences:

Extracting data
===============

The second task is to filter the sequences from SILVA and Rfam and to repair their taxonomic paths, as we import reliable sequences only. This step differs between SILVA and Rfam; however, both variants require downloading the Taxonomy (NCBI) database and the ENA files.

ENA
---

Unlike the other sources, there is no suitable prepared export in CSV or any other format.

.. note::

   There are some bulk export files from ENA. All of them can be accessed via FTP. However, these files are very (very) large. Depending on the dataset, they can take up tens of gigabytes. Since only high-quality sequences are imported from SILVA, the majority of these exports is of no use for rPredictorDB.

Fortunately, ENA provides a simple REST-based API for downloading the annotations of individual sequences. In technical terms, the annotation contains information from the EMBL header belonging to the given sequence. From ENA's point of view, sequences are identified by their accession number. Human users can look for an entry using this approach::

   http://www.ebi.ac.uk/ena/data/view/AB542923

where ``AB542923`` is the accession number. This is nice for human users, but a bit useless for rETL.

The rescue comes with the ability to export the same data in XML::

   http://www.ebi.ac.uk/ena/data/view/AB542923&display=xml

This file contains the annotation as well as the nucleotide sequence. At this point, however, the sequences are already imported into rDB. So, finally, let's export the header only::

   http://www.ebi.ac.uk/ena/data/view/AB542923&display=xml&header=true

With this we get all the data we need. To do this programmatically, there is a script called *download-ENA.sh* that downloads the headers for all sequences specified in the input FASTA file. The script can be called as::

   ./download-ENA.sh Silva/SILVA_128_SSURef_Nr99_tax_silva.fasta ENA/

Taxonomy
--------

Taxonomy is a database curated by the NCBI and its dump can be publicly downloaded through FTP. To make the preparation easier, we created a script called *download-NCBI.sh* that downloads the relevant files and extracts the necessary information into the specified path. This script can be called as::

   ./download-NCBI.sh NCBI/

The script uses an application called *ncbi* that is written in C++ and can be compiled as::

   g++ -std=c++11 -o ncbi ncbi.cpp

The database is used to make the taxonomic information consistent and validated. For this there is an application called *taxonomy*. The application is written in C++ and can be compiled using::

   g++ -std=c++11 -o taxonomy Taxonomy.cpp

For each sequence in the given FASTA file, the application takes the accession number, finds the corresponding organism name in the matching ENA file, and then uses the organism name to reconstruct the sequence's taxonomic path. The application can be called as::

   ./taxonomy NCBI/taxonomy.db ENA/ ENA-17_03_24 Silva/SILVA_128_SSURef_Nr99_tax_silva.fasta

SILVA
-----

Sequences are downloaded using a script called *download-SILVA.sh* that downloads the required version of the files into the specified directory. This script can be executed this way::

   ./download-SILVA.sh 128 Silva/

The next step is to repair the taxonomic information of each sequence. To do this, we use the previously mentioned application *taxonomy*. Then we choose all sequences belonging to a specified taxon (currently Chordata and Diptera). To do this, we use a script *select.sh* that can be called as::

   ./select.sh Silva/SILVA_128_SSURef_Nr99_tax_silva-ENA_2016_10_07.fasta Chordata

In the final step, we group the remaining sequences according to their taxonomic path, filter out all sequences longer than a threshold (the limit is 2000 bases for Chordata and 2100 bases for Diptera) and select a single representative sequence for each group. To do this, we find the length *L* of the longest sequence in each group and filter out all sequences shorter than 99 % of *L*. Then we sort the sequences according to: the number of unordered pieces in the description taken from the corresponding ENA file (ascending); whether the record is a working draft or sequencing is in progress, or not (complete records preferred); the sequence and pintail qualities taken from the SILVA quality file (descending); the first-public and last-update dates taken from the corresponding ENA file (descending); and the length of the sequence (descending). Finally we select the best sequence (the first one) only. If several sequences have the same annotation, the representative sequence is chosen arbitrarily. For this purpose there is a script called *filter.sh* that can be called as (a sketch of the ordering follows the example)::

   ./filter.sh Data/SILVA_128_SSURef_Nr99_tax_silva-ENA_2016_10_07-Chordata.fasta ENA/ Silva/SILVA_128_SSURef_Nr99.quality 2000

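The multi-key ordering used for picking the representative sequence can be summarised with a small sketch. This is not the actual *filter.sh* implementation; the struct and field names below are made up for illustration, and the exact handling of the draft/in-progress flag may differ. The idea is simply that "descending" keys are negated so a plain ascending tuple comparison yields the described order::

   #include <algorithm>
   #include <string>
   #include <tuple>
   #include <vector>

   // Hypothetical record holding the sort keys described above; the real
   // pipeline reads these values from the ENA files and the SILVA quality file.
   struct Candidate {
       std::string accession;
       int unorderedPieces;     // unordered pieces in the ENA description
       bool draftOrInProgress;  // working draft / sequencing in progress
       double sequenceQuality;  // from the SILVA quality file
       double pintailQuality;   // from the SILVA quality file
       long firstPublic;        // dates encoded e.g. as YYYYMMDD
       long lastUpdated;
       long length;
   };

   typedef std::tuple<int, bool, double, double, long, long, long> Key;

   // Negate the "descending" fields so that comparing keys ascending
   // reproduces the ordering described in the text.
   static Key sortKey(const Candidate& c) {
       return Key(c.unorderedPieces,        // ascending
                  c.draftOrInProgress,      // complete records (false) first
                  -c.sequenceQuality,       // descending
                  -c.pintailQuality,        // descending
                  -c.firstPublic,           // descending
                  -c.lastUpdated,           // descending
                  -c.length);               // descending
   }

   bool betterRepresentative(const Candidate& a, const Candidate& b) {
       return sortKey(a) < sortKey(b);
   }

   // For one taxonomic group, sort its candidates and keep the first (best) one.
   Candidate pickRepresentative(std::vector<Candidate> group) {
       std::sort(group.begin(), group.end(), betterRepresentative);
       return group.front();
   }
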
Rfam
----

In the case of Rfam, the process is significantly simpler. We pick up all seed sequences for a given Rfam family. That can be done using an application called *Seeds* that can be built as::

   javac Seeds.java

and called as::

   java Seeds Rfam/Rfam.seed Rfam/RF00622.fasta

Then we add a taxonomic path using the previously mentioned application *taxonomy*.

.. note::

   In some cases, the boundaries of a sequence in the seed file differ from the sequence in the family file (usually by ±1). In such cases, the longer sequence is taken.

.. _Technical-rDataETL-importing-sequences:

Importing sequences
===================

The third complex task is to import the sequences and the related data into the database. Simply said, this step consists of three sub-steps:

#. parsing FASTA files
#. importing sequences
#. importing taxonomy

.. note::

   Apart from uploading the sequences to rDB, the sequences are also used to build up a BLAST database that is used to :ref:`User-toolkit-search-sequence`. To do this, there is a script *blast.sh* that builds a new database from the specified files or adds them to an existing database. It can be called as::

      ./blast.sh rBlast_16-128 "Chordata.fasta Diptera.fasta"

Parsing FASTA files
-------------------

The FASTA files are parsed and transformed into intermediate CSV files. To achieve this there is a utility called *rETL*, available through its frontend *rETL_frontend.exe*. It is a .NET application running in Mono. The utility consists of several components:

#. common interfaces
#. output adapters
#. a library containing the 'business logic'
#. the frontend

It can easily be built using the included ``makefile``::

   make all

which builds all C#-based utilities. If you want to build the rETL utility only, you can simply call the target::

   make ETL

The rETL utility requires a single input parameter - the path to a special configuration XML file which defines what has to be done. It is necessary to follow the structure described below; an illustrative sample configuration is shown after the overview of its semantics.

A high-level look at the semantics:

* The required single root element is called *import*.
* It contains *FASTA* elements, one for each FASTA file that should be processed. Each FASTA element contains just one subelement, in this case *CSV*. The CSV element states that the destination of the transformation is a CSV file.

The rETL utility is designed to allow implementing various types of data destinations, i.e. files or databases. This part of the transformation is maintained by the adapter component. The second part, enclosed in the *taxonomy* element, says that the taxonomy has to be processed as well.

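The original sample configuration is not reproduced in this document, so the following is only an illustrative reconstruction based on the element and attribute overview below; the file names, dataset names and exact nesting are assumptions and should be checked against a real configuration used with *rETL_frontend.exe*. It shows two FASTA files being transformed into CSV files plus the taxonomy being processed, which would yield the three CSV files mentioned later::

   <import>
       <FASTA file="LSURef_115_tax_silva_trunc.fasta" dataset="LSU">
           <CSV destinationFile="LSURef_115.csv" separator="|" />
       </FASTA>
       <FASTA file="SSURef_NR99_115_tax_silva_trunc.fasta" dataset="SSU">
           <CSV destinationFile="SSURef_115.csv" separator="|" />
       </FASTA>
       <taxonomy>
           <new file="SSURef_NR99_115_tax_silva_trunc.fasta" />
           <old destinationFile="taxonomy.csv" separator="|" />
       </taxonomy>
   </import>
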
A detailed look at the XML elements:

================== ================================================
name               note
================== ================================================
import             top level element, required
FASTA              represents one processed FASTA file
taxonomy           indicates that the taxonomy is processed as well
new                source file for taxonomy processing
old                previous taxonomy file that should be extended
CSV                CSV file, input or output
================== ================================================

A detailed look at their attributes:

================== =================== ===============================================
element            name                note
================== =================== ===============================================
FASTA              file                path to the file
FASTA              dataset             SILVA dataset name
new                file                path to the file
old                destinationFile     path to the file
old                separator           single character used to delimit CSV items
CSV                destinationFile     path to the file
CSV                separator           single character used to delimit CSV items
================== =================== ===============================================

Finally, the utility can be executed this way::

   mono rETL_frontend.exe import.xml

According to the sample XML, the rETL utility creates three CSV files. The CSV file containing the transformed FASTA files is formatted like this:

================ ===== ==== ======= ======== ==============
accession number start stop dataset sequence taxonomic path
================ ===== ==== ======= ======== ==============

And the transformed taxonomic tree CSV:

== ===== ==== ========= =========
id depth name path name parent id
== ===== ==== ========= =========

Last, it is necessary to load the prepared data into the rDB database. Since importing a CSV file into a database is a very common task, there is no "magic" behind this step. The Load phase implementation is packed into functions (or stored procedures; all of them return ``void``). The data load itself is preceded by some preparation (it is necessary to create the appropriate temporary table). For this purpose there is a function called ``prepareIntermediateStructure()``.

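Before the individual functions are listed, here is a minimal sketch of the pattern they encapsulate: create a temporary table, bulk-load the staging CSV with ``COPY ... FROM``, then move the rows into the destination table. The table and column names below are hypothetical and only illustrate the idea; the real functions described in the next section do the equivalent work::

   -- Minimal sketch of the load pattern; table and column names are hypothetical.
   CREATE TEMPORARY TABLE tmp_fasta (
       accession  varchar,
       start_pos  integer,
       stop_pos   integer,
       dataset    varchar,
       sequence   text,
       tax_path   text
   );

   -- Bulk-load the staging CSV produced by the transformation step.
   COPY tmp_fasta FROM '/ETL/frontend/SSURef_115.csv' WITH (FORMAT csv, DELIMITER '|');

   -- Move the data into its destination table (simplified).
   INSERT INTO rpredict.sequence (accession, start_pos, stop_pos, dataset, sequence)
   SELECT accession, start_pos, stop_pos, dataset, sequence
   FROM tmp_fasta;
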
Database functions
------------------

========================== =================================
function feature           value
========================== =================================
name                       **prepareIntermediateStructure**
arguments                  none
returns                    void
description                creates the temporary table
========================== =================================

========================== =================================
function feature           value
========================== =================================
name                       **importFASTA**
argument 1                 parsedFastaFile
argument 2                 separator
returns                    void
description                imports the given CSV (arg 1) into the temporary table
========================== =================================

========================== =================================
function feature           value
========================== =================================
name                       **transformIntermediateData**
arguments                  none
returns                    void
description                transforms the imported parsed FASTA files from the temporary tables into their destination tables
========================== =================================

========================== =================================
function feature           value
========================== =================================
name                       **importTaxonomy**
argument 1                 CSVfile
argument 2                 separator
returns                    void
description                imports the taxonomic tree from the CSV file (arg 1) into the destination table
========================== =================================

And a typical way to perform this phase::

   SELECT prepareIntermediateStructure();
   SELECT importFASTA('/ETL/frontend/LSURef_115.csv'::varchar,'|');
   SELECT importFASTA('/ETL/frontend/SSURef_115.csv'::varchar,'|');
   SELECT transformIntermediateData();
   SELECT importTaxonomy('/ETL/frontend/taxonomy.csv'::varchar,'|');

.. _Technical-rDataETL-importing-qualities:

Importing qualities
===================

To complete the SILVA/Rfam part of the import, it remains to process the data containing information about sequence qualities. In the case of SILVA, this part is the simplest one. As mentioned earlier, SILVA provides this information in CSV files. Compared to SILVA, processing Rfam requires an additional step before uploading the file, as Rfam does not provide any quality file. The only processing of the quality files is performed in the database. All performed transformations are purely technical - parsing and converting into the corresponding datatypes.

Rfam
----

In the case of Rfam, the quality file is generated based on the information in the seed file. To achieve this there is a utility called *Qualities* that computes the required information. It is a Java application and it can be built using::

   javac Qualities.java

Finally, the utility can be executed this way::

   java Qualities Rfam/Rfam.seed Rfam/RF00622-seeds.fasta

Database functions
------------------

========================== =================================
function feature           value
========================== =================================
name                       **importQuality**
argument 1                 qualityCSV
argument 2                 subunit
returns                    void
description                imports qualities into the destination table
========================== =================================

And the execution::

   SELECT importQuality('/LSUParc_115_quality.csv','18S ribosomal RNA');

.. _Technical-rDataETL-importing-annotations:

Importing annotations
=====================

Now the ETL process leaves the territory of sequences and continues with importing annotations from ENA. Like the sequence load, this process can be decomposed as well:

#. downloading and parsing annotations
#. loading annotations into rDB

Downloading and parsing annotations
-----------------------------------

**What is an annotation?** An *annotation* is a set of extra information about the given sequence. In rPredictorDB, annotations contain the following data:

=============================== ===================================
name                            description
=============================== ===================================
annotation                      general set of information like description or comments
references                      references to the given sequence, e.g. in journals, papers, etc.
authors                         authors of a reference
applicants                      patent applicants
features                        features of the sequence
qualifiers                      key-value pairs with detailed information about a given feature
xrefs                           references to other databases where the given sequence can be found
=============================== ===================================

As the ENA files are used to repair the taxonomic information in :ref:`Technical-rDataETL-filtering-sequences`, they should already be downloaded and it is not necessary to download them again. Now we need to extract all the fields and save them to the intermediate CSV files. This is not logically complex. To do this programmatically there is a utility called **AnnotationsParser** (AP). It is written in C# and runs under Mono or the .NET framework. The executable takes the parts of a connection string as parameters; these are needed for getting all accession numbers from rDB. The only exception is the first parameter - it says how many parallel threads should be used for downloading the missing ENA files.

If you want to build this utility only, you can call make this way::

   make annotationsParser

Running the utility can look like this::

   mono annotationsParser.exe 4 localhost 5432 rpredictor password rDB

Simply put, the AP gets all accession numbers from rDB and for each of them retrieves the EMBL header XML. This XML is parsed and the results are exported into several files, depending on the annotation feature.

============================= ==================================
annotation feature            file
============================= ==================================
annotations                   imp_annotations.csv
applicants                    imp_applicants.csv
authors                       imp_authors.csv
features                      imp_features.csv
keywords                      imp_keywords.csv
qualifiers                    imp_qualifiers.csv
references                    imp_references.csv
xrefs                         imp_xrefs.csv
============================= ==================================

Loading annotations into rDB
----------------------------

Loading the created files into the database is implemented in the same way as the previous loads. It means there are database functions doing all the work. All of them have the same signature, so it is described only once, followed by a list of the functions' names.

Database functions
------------------

=========================== =========================================
feature                     description
=========================== =========================================
return type                 void
argument 1                  path to the CSV file
argument 2                  one character used as the fields delimiter
=========================== =========================================

And the promised list of functions:

#. importAnnotations
#. importReferences
#. importFeatures
#. importQualifiers
#. importAuthors
#. importApplicants
#. importXrefs
#. importKeywords

And a sample of how to execute the functions::

   SELECT importAnnotations('/imp_annotations.csv','|');
   SELECT importFeatures('/imp_features.csv','|');
   SELECT importQualifiers('/imp_qualifiers.csv','|');
   SELECT importReferences('/imp_references.csv','|');
   SELECT importAuthors('/imp_authors.csv','|');
   SELECT importApplicants('/imp_applicants.csv','|');
   SELECT importXrefs('/imp_xrefs.csv','|');
   SELECT importKeywords('/imp_keywords.csv','|');

.. _Technical-rDataETL-importing-predictions:

Importing predictions
=====================

Before this step, all data about primary structures (sequences) is present in rDB, and now the process aims to fill the last part of rDB with predicted secondary structures. Like the steps mentioned above, this one is also divided into several sub-steps:

#. bulk prediction
#. loading predictions into rDB
#. plotting predictions

Bulk prediction
---------------

Bulk prediction (= *batch prediction*) is the trickiest part of the whole ETL process. It takes a list of primary structures (sequences) and predicts secondary structures for them. As a feature geared towards structural analysis capabilities, rPredictorDB stores *structural features* as well.

.. warning::

   This part is highly dependent on the environment and on the prediction algorithm.

The first thing that has to be done is to replace wildcards in the sequences, as the current prediction algorithm is not able to handle them.

.. note::

   Wildcards are e.g. *N*, which represents an *uNknown* nucleotide, *R*, which represents *adenine or guanine*, or *B*, which represents *anything except adenine*.

Of course, the question is which nucleotide should be used to replace a wildcard. This question has no clear answer. Thus we determine the distribution of nucleotides in the sequence and then use the most frequent nucleotide that is allowed by the wildcard. If several allowed nucleotides have the same frequency, an arbitrary one is chosen. To do this, there is an application called *replacement* written in C++ that can be compiled using::

   g++ -std=c++11 -o replacement prediction/replacement.cpp

The application processes an input FASTA file and prints the resolved sequences to standard output.

For the prediction itself, another tool (or a set of tools) is needed. In our case it is :ref:`User-cp-predict`. The current version of the prediction algorithm is written in MATLAB. The application is distributed as an installer for UNIX-like systems and requires the MATLAB runtime (which can be downloaded automatically during the installation). The application takes a template and sequences in FASTA format as input, saves each predicted structure into a separate dot-bracket file and prints some statistics to standard output.

Finally, the predicted structures are merged and transformed into one CSV file in the format:

================ ============== ============= ================================
accession number start position stop position structure in dot-paren format
================ ============== ============= ================================

To make this process easier, there is a script called *predict.sh* that groups all the steps together. The script takes a template structure and the sequences to predict as input and produces one structure file in dot-bracket format for each input sequence and one CSV file with all predicted structures. The script can be called as::

   ./predict.sh templates/4v6x.br Chordata.fasta

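Returning to the wildcard-replacement step described above, the idea can be illustrated with a short sketch. This is not the actual *replacement* tool; it is a minimal, self-contained illustration assuming IUPAC one-letter wildcards and a plain RNA sequence string (ties are broken by the order in which the allowed nucleotides are listed)::

   #include <iostream>
   #include <map>
   #include <string>

   // Which concrete nucleotides each IUPAC wildcard allows (illustrative subset).
   static const std::map<char, std::string> kAllowed = {
       {'N', "ACGU"}, {'R', "AG"}, {'Y', "CU"}, {'B', "CGU"},
       {'D', "AGU"},  {'H', "ACU"}, {'V', "ACG"}, {'K', "GU"},
       {'M', "AC"},   {'S', "CG"},  {'W', "AU"}
   };

   // Replace every wildcard with the most frequent nucleotide (in this sequence)
   // that the wildcard allows, as described in the text above.
   std::string resolveWildcards(std::string seq) {
       std::map<char, long> counts = {{'A', 0}, {'C', 0}, {'G', 0}, {'U', 0}};
       for (char c : seq) {
           auto it = counts.find(c);
           if (it != counts.end()) ++it->second;
       }
       for (char& c : seq) {
           auto wild = kAllowed.find(c);
           if (wild == kAllowed.end()) continue;  // ordinary nucleotide, keep it
           char best = wild->second[0];
           for (char candidate : wild->second)
               if (counts[candidate] > counts[best]) best = candidate;
           c = best;
       }
       return seq;
   }

   int main() {
       // 'N' becomes the overall most frequent nucleotide,
       // 'R' becomes the more frequent of A and G in this sequence.
       std::cout << resolveWildcards("GGGGACGUNCGRAU") << std::endl;
       return 0;
   }
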
The other goal is to extract :ref:`User-structural-features` from the created predictions. This is not as time-consuming as the prediction itself. Also, the workflow is a bit simpler:

#. the ``Extract`` utility extracts the structural features from the dot-paren files of each structure and stores them in a file,
#. all results are concatenated into one file,
#. the ``featuresToCSV.exe`` utility transforms the results into intermediate CSV files.

*Extract* is a utility written in Java and can be built as::

   javac Extract.java

The utility takes a list of directories that should be processed as arguments and can be called as::

   java Extract working_dir/*/

The **featuresToCSV.exe** utility can be built with::

   make featuresToCSV

The utility takes two arguments:

============================ ===================================================
feature                      description
============================ ===================================================
assembly name                featuresToCSV.exe
argument 1                   path to all_features
argument 2                   separator character for the created CSV files
output                       creates two CSV files for each struct. feature type
============================ ===================================================

The utility is called as::

   mono featuresToCSV.exe /all_features ;

It creates two CSV files for each structural feature type - one contains the instances of the feature and the other stores the positions where each instance appears. In the current version the output is:

===================== ============================ ============================
feature               feature file                 positions file
===================== ============================ ============================
5-overhang            foverhangs.csv               foverhangsPositions.csv
3-overhang            toverhangs.csv               toverhangsPositions.csv
bulge                 bulges.csv                   bulgesPositions.csv
hairpin               hairpins.csv                 hairpinsPositions.csv
helix                 helices.csv                  helicesPositions.csv
junction              junctions.csv                junctionsPositions.csv
loop (internal)       loops.csv                    loopsPositions.csv
===================== ============================ ============================

As in the case of the prediction, there is a script **extract.sh** that groups all the steps together. It takes as arguments the directories in which to look for structure files. The script can be called as::

   ./extract.sh Chordata/ Diptera/

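To give an idea of what such feature extraction involves, the following sketch finds hairpin loops in a dot-bracket string (a run of unpaired positions directly enclosed by one base pair). It is not the actual ``Extract`` utility, which handles the full set of features listed above; it is only a minimal, self-contained example::

   #include <iostream>
   #include <stack>
   #include <string>
   #include <utility>
   #include <vector>

   // Find hairpin loops in a dot-bracket structure: a run of '.' positions that
   // is directly enclosed by a matching '(' ... ')' pair with nothing else in
   // between. Returns (start, end) index pairs of the runs (0-based, inclusive).
   std::vector<std::pair<std::size_t, std::size_t>> findHairpinLoops(const std::string& db) {
       std::vector<std::pair<std::size_t, std::size_t>> loops;
       std::stack<std::size_t> open;                    // positions of unmatched '('
       std::size_t lastStructural = std::string::npos;  // last '(' or ')' seen
       for (std::size_t i = 0; i < db.size(); ++i) {
           if (db[i] == '(') {
               open.push(i);
               lastStructural = i;
           } else if (db[i] == ')') {
               if (!open.empty()) {
                   std::size_t opening = open.top();
                   open.pop();
                   // Hairpin loop: the matching '(' is the last structural
                   // symbol, so everything between it and this ')' is unpaired.
                   if (opening == lastStructural && i > opening + 1)
                       loops.push_back({opening + 1, i - 1});
               }
               lastStructural = i;
           }
       }
       return loops;
   }

   int main() {
       // Two hairpin loops: the runs of dots at positions 4-7 and 14-16 (0-based).
       const std::string structure = "((((....))..((...))))";
       for (const auto& loop : findHairpinLoops(structure))
           std::cout << "hairpin loop at " << loop.first << "-" << loop.second << std::endl;
       return 0;
   }
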
Loading predictions into rDB
----------------------------

Once the prediction is over, the predictions can be loaded into rDB - this only involves executing database functions with the proper file paths.

Database functions
------------------

========================== ===================================
function feature           value
========================== ===================================
function name              **importStructures**
argument 1                 path to the dotparHeap.dp
argument 2                 character used as the CSV separator
argument 3                 algorithm used for the prediction
return type                void
========================== ===================================

========================== =================================
function feature           value
========================== =================================
function name              **importStructuralFeatures**
argument 1                 path to foverhangs.csv
argument 2                 path to foverhangsPositions.csv
argument 3                 path to toverhangs.csv
argument 4                 path to toverhangsPositions.csv
argument 5                 path to helices.csv
argument 6                 path to helicesPositions.csv
argument 7                 path to bulges.csv
argument 8                 path to bulgesPositions.csv
argument 9                 path to loops.csv
argument 10                path to loopsPositions.csv
argument 11                path to junctions.csv
argument 12                path to junctionsPositions.csv
argument 13                path to hairpins.csv
argument 14                path to hairpinsPositions.csv
argument 15                character used as the CSV separator
return type                void
========================== =================================

Plot predictions
----------------

The last part of the prediction phase is the visualisation of the predicted structures. To do this, we use the external tool ``RNAplot``, which creates the secondary structure visualisations, and ``inkscape``, which creates the thumbnails. Both tools are called through the script *plot.sh*, which takes a list of dot-paren files as input and can be called as::

   ./plot.sh working_dir/*.br

Finalization
============

The very last step (well, finally!) of the rETL process is very simple - it only involves the deletion of temporary tables and the creation of auxiliary indices and summary counts.

Clean up
--------

After the rETL process, some unnecessary tables remain. This function gets rid of them.

========================== =================================
function feature           value
========================== =================================
function name              **cleanup**
arguments                  none
return type                void
========================== =================================

Indices
-------

For production usage, indices must be present in every relational database. Some of the rETL functions create some indices; however, there remain columns which should be indexed. There is a script called **00_CREATE.sql** in the ``database`` branch, in the ``02_INDICES`` subfolder, that creates the indices used in our reference implementation (service). But indices are highly dependent on the requested behaviour and the most common queries.

Taxonomy amounts
----------------

:ref:`User-toolkit-reference-taxonomy-taxonomy` shows the number of species in a given taxon; this function pre-computes these counts, as it would not be efficient to compute them on every request.

========================== =================================
function feature           value
========================== =================================
function name              **settaxonomyamounts**
arguments                  none
return type                void
========================== =================================

Sample: How to execute all database functions in the correct order
===================================================================

.. warning::

   This typically cannot be executed in one go. Some of the utilities need data previously imported into rDB and then produce the output for the next database function.

The correct order (one of the correct orders) is::

   SELECT create_rdb_structures();
   SELECT prepareIntermediateStructure();
   SELECT importFASTA('/ETL/frontend/SSURef_115.csv'::varchar,'|');
   SELECT transformIntermediateData();
   SELECT importTaxonomy('/ETL/frontend/taxonomy.csv'::varchar,'|');
   SELECT importQuality('/SSUParc_115_quality.csv'::varchar,'18S ribosomal RNA'::varchar);
   SELECT importAnnotations('/imp_annotations.csv'::varchar,'|');
   SELECT importFeatures('/imp_features.csv'::varchar,'|');
   SELECT importQualifiers('/imp_qualifiers.csv'::varchar,'|');
   SELECT importReferences('/imp_references.csv'::varchar,'|');
   SELECT importAuthors('/imp_authors.csv'::varchar,'|');
   SELECT importApplicants('/imp_applicants.csv'::varchar,'|');
   SELECT importXrefs('/imp_xrefs.csv'::varchar,'|');
   SELECT importKeywords('/imp_keywords.csv'::varchar,'|');
   SELECT importStructures('/dotpar_heap.db'::varchar,';','cp-predict v. 16.12.07'::varchar);
   SELECT importStructuralFeatures(
       '/toverhangs.csv'::varchar,
       '/toverhangsPositions.csv'::varchar,
       '/foverhangs.csv'::varchar,
       '/foverhangsPositions.csv'::varchar,
       '/helices.csv'::varchar,
       '/helicesPositions.csv'::varchar,
       '/bulges.csv'::varchar,
       '/bulgesPositions.csv'::varchar,
       '/loops.csv'::varchar,
       '/loopsPositions.csv'::varchar,
       '/junctions.csv'::varchar,
       '/junctionsPositions.csv'::varchar,
       '/hairpins.csv'::varchar,
       '/hairpinsPositions.csv'::varchar,
       ';');
   SELECT cleanup();
   SELECT setTaxonomyAmounts();