The ETL (Extraction-Transformation-Load) layer of rPredictorDB is responsible for preparing the rPredictorDB database (rDB), downloading RNA data and filling rDB with the RNA data.
The scripts and other data used in rETL are located in the ETL branch of the rPredictorDB repository.
Running the ETL process requires several tools; they are introduced in the individual phases below (most notably PostgreSQL, Mono/.NET, Java, a C++ compiler and the CP-predict prediction tool).
There are four main sources of the data: SILVA, Rfam, ENA and the NCBI Taxonomy database. They are described in turn below.
SILVA is a high-quality ribosomal RNA database. Besides the web frontend, SILVA's data can be exported as a FASTA file. Each FASTA record consists of a header containing identifiers and the taxonomy, followed by the nucleotide sequence.
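For illustration, a record of the exported FASTA file looks schematically like this (the accession, positions and taxonomic path below are made up for the example):

>AB123456.1.1449 Eukaryota;Opisthokonta;...;Example organism
(nucleotide sequence on the following lines)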
SILVA divides its data into two datasets - LSU and SSU, for the large and small ribosomal subunit respectively. The FASTA files are exported per dataset so that there are separate FASTA files for LSU and SSU. (rPredictorDB does not keep this SSU/LSU split.)
Both SILVA datasets have at least two versions - PARC and REF. While the PARC version simply contains everything, the REF version contains high quality data only. For the SSU dataset, there is a further version called NR (non-redundant). This SSU REF NR dataset avoids duplicates.
Currently, rPredictorDB uses data from SILVA’s SSU REF NR datasets exported in FASTA format, which includes identifiers and taxonomy.
Beyond the FASTA files, SILVA provides some extra information. For both datasets there is a tab-separated CSV file which contains information about the quality of each sequence and some other bioinformatic data.
Important files example |
---|
SSURef_NR99_115_tax_silva_trunc.fasta |
SSUParc_115_quality.csv |
Rfam is an RNA database in which RNAs are classified into families according to their type and function. Besides the web frontend, whole families can be exported as FASTA files.
Similarly to SILVA, Rfam also curates a high-quality subset for each family, called the seed. The seeds are exported in a Stockholm file, which also contains additional information about the individual families.
Important files example |
---|
Rfam.seed.gz |
RF00001.gz |
RF00004.gz |
ENA (European Nucleotide Archive) provides information about nucleotide sequences. As mentioned, the SILVA and Rfam outputs contain basic sequence data only. ENA adds more detailed information: annotations. The annotations consist of various attributes such as the sequence authors, references to publications, the sequence description, etc. This data is not necessary for predicting secondary structures, but it can be very useful for understanding the sequence and, in any case, for searching.
ENA provides various types of exports. One of them is an XML export via a REST-based API, which is used by one of the utilities in rPredictorDB's ETL layer.
Taxonomy (a database of the National Center for Biotechnology Information, NCBI) provides a curated taxonomic tree. Although taxonomic information is contained in ENA and SILVA too, in those sources it is often inconsistent, incomplete or even incorrect. The Taxonomy database provides a single consistent taxonomic tree.
Besides the web frontend, the Taxonomy data can be exported as vertical-bar-separated value files. rPredictorDB uses two of them: nodes.dmp, containing the encoded tree and the taxonomic ranks of individual nodes, and names.dmp, containing names, synonyms, misspellings etc. for the nodes defined in nodes.dmp.
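For illustration, each line of nodes.dmp starts with the node id, the parent node id and the taxonomic rank, separated by vertical bars (only the first few fields are sketched here and the values are illustrative):

9606 | 9605 | species | ...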
Important files example |
---|
taxdmp.zip |
The ETL process of rPredictorDB consists of several phases, described in turn below: database preparation, data extraction and filtering, sequence import, annotation import, secondary structure prediction with structural feature extraction, and finalization.
A phase can typically be divided into two or three parts: extracting (downloading) the data, transforming it into an easily importable form, and loading it into the database. (This is where the abbreviation “ETL” comes from.)
Warning
Note that at the end of the import process, it is necessary to delete the Nette cache files in /www/temp/cache, otherwise the new data might not be accessible right away.
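On a typical deployment this can be done with a command like the following (a sketch, assuming the web root layout from the warning above):

rm -rf /www/temp/cache/*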
All of the ETL sub-phases use some common principles and technologies. Let us describe these techniques first.
There are two approaches used in the downloading phase: manual download and automated download through a dedicated utility.
The first approach is used if there are only a couple of large files or no API is available for automated download.
The second approach is used whenever a large number of files needs to be retrieved (hundreds of thousands). A special utility is prepared for this case.
The second subphase is responsible for transforming the downloaded data into a simple format that can easily be imported into the rPredictorDB database. rPredictorDB uses simple CSV files as an intermediate structure (formally called the staging area) to store pre-processed data before loading it into the database.
Prepared CSVs from the processing step are loaded into the database using PostgreSQL's COPY ... FROM command. We performed some benchmarks and chose this approach as the fastest while still being very simple to use. There is typically an intermediate temporary table used before inserting data into the destination table. Unfortunately, PostgreSQL does not support anything like SQL Server's OPENROWSET for querying CSV files “directly”.
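Conceptually, each load function wraps a COPY statement similar to the following sketch (the table name tmp_fasta is only a placeholder; the real statements are hidden inside the database functions described below):

psql rDB -c "COPY tmp_fasta FROM '/ETL/frontend/SSURef_115.csv' WITH (FORMAT csv, DELIMITER '|')"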
The very first rETL phase is the PostgreSQL database preparation. The following assumes that PostgreSQL (at least version 9.1) is installed and working, with a database created (it does not have to be blank). The PL/pgSQL procedural language extension has to be installed and activated (it is a standard part of PostgreSQL's installation).
All of this phase’s work is done by a function called create_rDB_structures.
The execution is pretty simple:
SELECT create_rdb_structures();
function feature | value |
---|---|
name | create_rDB_structures |
arguments | none |
returns | void |
description | creates the rpredict schema and all of rDB’s tables |
The second task is to filter the sequences from SILVA and Rfam and repair their taxonomic paths, as we import reliable sequences only. This step is different for SILVA and Rfam; however, both versions require downloading the Taxonomy (NCBI) database and the ENA files.
Unlike the other sources, ENA offers no suitable prepared export in CSV or any other convenient format.
Note
There are some bulk export files from ENA, all of which can be accessed via FTP. However, these files are very large - depending on the dataset, tens of gigabytes. Since only high-quality sequences are imported from SILVA, the majority of the content of these exports is of no use to rPredictorDB.
Fortunately, the ENA provides a simple REST based API for downloading annotations for individual sequences. In technical terms, the annotation contains information from the EMBL header belonging to the given sequence. From ENA’s point of view sequences are identified by their accession number. Human users can look for an entry using this approach:
http://www.ebi.ac.uk/ena/data/view/AB542923
where AB542923 is the accession number.
This is nice for human users, but not very useful for rETL. The rescue comes with the ability to export the same data as XML:
http://www.ebi.ac.uk/ena/data/view/AB542923&display=xml
This file contains the annotation as well as the nucleotide sequence. At this point, however, the sequences are already imported into rDB. So, finally, let’s export the header only:
http://www.ebi.ac.uk/ena/data/view/AB542923&display=xml&header=true
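Such a request is easy to script; for example, a single header can be fetched with curl (the output path is arbitrary):

curl -o ENA/AB542923.xml 'http://www.ebi.ac.uk/ena/data/view/AB542923&display=xml&header=true'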
With this, we get all the data we need. To do this programmatically, there is a script called download-ENA.sh that downloads the headers for all sequences specified in the input FASTA file. The script can be called as:
./download-ENA.sh Silva/SILVA_128_SSURef_Nr99_tax_silva.fasta ENA/
Taxonomy is a database curated by the NCBI and its dump can be publicly downloaded through FTP. To make the preparation easier, we created a script called download-NCBI.sh that downloads the relevant files and extracts the necessary information into the specified path. This script can be called as:
./download-NCBI.sh NCBI/
The script uses an application called ncbi that is written in C++ and can be compiled as:
g++ -std=c++11 -o ncbi ncbi.cpp
The database is used to make the taxonomic information consistent and validated. To do this, there is an application called taxonomy. The application is written in C++ and can be compiled using:
g++ -std=c++11 -o taxonomy Taxonomy.cpp
The application takes a FASTA file and, for each sequence, uses its accession number to find the corresponding organism name in the corresponding ENA file; the organism name is then used to reconstruct the sequence’s taxonomic path. The application can be called as:
./taxonomy NCBI/taxonomy.db ENA/ ENA-17_03_24 Silva/SILVA_128_SSURef_Nr99_tax_silva.fasta
Sequences are downloaded using a script called download-SILVA.sh that downloads the required version of the files into the specified directory. This script can be executed this way:
./download-SILVA.sh 128 Silva/
The next step is to repair the taxonomic information of each sequence. To do this, we use the previously mentioned taxonomy application.
Then we choose all sequences belonging to a specified taxon (currently Chordata and Diptera). To do this, we use a script called select.sh that can be called as:
./select.sh Silva/SILVA_128_SSURef_Nr99_tax_silva-ENA_2016_10_07.fasta Chordata
In the final step, we group the remaining sequences according to their taxonomic path, filter out all sequences longer than a threshold (the limit is 2000 bases for Chordata and 2100 bases for Diptera) and select a single representative sequence for each group. To do this, we find the length L of the longest sequence in each group and filter out all sequences shorter than 99% of L. Then we sort the sequences by the following keys:

- the number of unordered pieces in the description taken from the corresponding ENA file (ascending);
- whether the record is a working draft or sequencing is still in progress (complete records preferred);
- the sequence and pintail qualities taken from the SILVA quality file (descending);
- the first-public and last-updated dates taken from the corresponding ENA file (descending);
- the length of the sequence (descending).

Finally, we select only the best (i.e. the first) sequence. If more sequences share the same annotation, the representative sequence is chosen arbitrarily. For this purpose there is a script called filter.sh that can be called as:
./filter.sh Data/SILVA_128_SSURef_Nr99_tax_silva-ENA_2016_10_07-Chordata.fasta ENA/ Silva/SILVA_128_SSURef_Nr99.quality 2000
In the case of Rfam, the process is significantly simpler. We pick up all seed sequences for a given Rfam family. That can be done using an application called Seeds, which can be built as:
javac Seeds.java
and called as:
java Seeds Rfam/Rfam.seed Rfam/RF00622.fasta
Then we add a taxonomic path using the previously mentioned taxonomy application.
Note
In some cases, the boundaries of a sequence in the seed file differ from those in the family file (usually by ±1). In such cases, the longer sequence is taken.
The third complex task is to import the sequences and their accompanying data into the database. Simply put, this step consists of three sub-steps: transforming the FASTA files into intermediate CSV files, loading them into the database, and importing the quality data.
Note
Apart from being uploaded to rDB, the sequences are also used to build a BLAST database that is used for the Sequence search. To do this, there is a script blast.sh that builds a new database from the specified files or adds them to an existing one. It can be called as:
./blast.sh rBlast_16-128 "Chordata.fasta Diptera.fasta"
The FASTA files are parsed and transformed into intermediate CSV files. To achieve this, there is a utility called rETL, available through its frontend rETL_frontend.exe. It is a .NET application running under Mono. The utility is composed of several components.
It can be easily built using the included makefile:
make all
that builds all C#-based utilities. If you want to build the rETL utility only, you can simply call the target:
make ETL
The rETL utility requires a single input parameter - the path to a special configuration XML file which defines what has to be done.
A sample XML:
<import>
<FASTA file="LSURef_115_tax_silva_trunc.fasta" dataset="LSU">
<CSV destinationFile="LSURef_115.csv" separator="|" />
</FASTA>
<FASTA file="SSURef_NR99_115_tax_silva_trunc.fasta" dataset="SSU">
<CSV destinationFile="SSURef_115.csv" separator="|"/>
</FASTA>
<taxonomy>
<new file="LSURef_115_tax_silva_trunc.fasta" />
<new file="SSURef_NR99_115_tax_silva_trunc.fasta" />
<old destinationFile="taxonomy.csv" separator="|"/>
<CSV destinationFile="taxonomy2.csv" separator="|"/>
</taxonomy>
</import>
It is necessary to follow this structure!
A high-level look at the semantics:
Each FASTA element defines one input FASTA file and the CSV file into which it is transformed. The rETL utility is designed to allow implementing various types of data destinations, e.g. files or databases; this part of the transformation is handled by the adapter component. The second part, the taxonomy element, says that the taxonomy has to be processed as well.
A detailed look at the XML elements:
name | note |
---|---|
import | top level element, required |
FASTA | represents one processed FASTA file |
taxonomy | signals that the taxonomy is processed as well |
new | source file for taxonomy processing |
old | previous taxonomy file that should be extended |
CSV | CSV file, input or output |
A detailed look at their attributes:
element | name | note |
---|---|---|
FASTA | file | path to the file |
FASTA | dataset | SILVA dataset name |
new | file | path to the file |
old | destinationFile | path to the file |
old | separator | single character used to delimit CSV items |
CSV | destinationFile | path to the file |
CSV | separator | single character used to delimit CSV items |
Finally, the utility can be executed this way:
mono rETL_frontend.exe import.xml
According to the sample XML, the rETL utility creates three CSV files. The CSV file containing the transformed FASTA files is formatted like:
accession number | start | stop | dataset | sequence | taxonomic path |
---|---|---|---|---|---|
And the transformed taxonomic tree CSV:
id | depth | name | path name | parent id |
---|---|---|---|---|
Last, it is necessary to load the prepared data into the rDB database. Since importing a CSV file into a database is a very common task, there is no “magic” behind this step. The Load phase implementation is packed into functions (or stored procedures; all of them return void). The data load itself is preceded by some preparation (it is necessary to create the appropriate temporary table). Just for this purpose there is a function called prepareIntermediateStructure().
function feature | value |
---|---|
name | prepareIntermediateStructure |
arguments | none |
returns | void |
description | creates temporary table |
function feature | value |
---|---|
name | importFASTA |
argument 1 | parsedFastaFile |
argument 2 | separator |
returns | void |
description | imports given CSV (arg 1) into temporary table |
function feature | value |
---|---|
name | transformIntermediateData |
arguments | none |
returns | void |
description | transforms imported parsed FASTA files from temporary tables into their destination tables |
function feature | value |
---|---|
name | importTaxonomy |
argument 1 | CSVfile |
argument 2 | separator |
returns | void |
description | imports taxonomic tree from CSV file (arg 1) into destination table |
And a typical way to carry out this phase:
SELECT prepareIntermediateStructure();
SELECT importFASTA('/ETL/frontend/LSURef_115.csv'::varchar,'|');
SELECT importFASTA('/ETL/frontend/SSURef_115.csv'::varchar,'|');
SELECT transformIntermediateData();
SELECT importTaxonomy('/ETL/frontend/taxonomy.csv'::varchar,'|');
To complete the SILVA/Rfam part of the import, it remains to process the data containing information about sequence qualities. In the case of SILVA, this part is the simplest one: as mentioned earlier, SILVA provides this information in CSV files. Compared to SILVA, processing Rfam requires an additional step before uploading the file, as Rfam does not provide any quality file.
The only processing of the quality files is performed in the database. All performed transformations are purely technical - parsing and converting values into the corresponding data types.
In the case of Rfam, the quality file is generated from the information in the seed file. To achieve this, there is a utility called Qualities that computes the required information. It is a Java application and can be built using:
javac Qualities.java
Finally, the utility can be executed this way:
java Qualities Rfam/Rfam.seed Rfam/RF00622-seeds.fasta
function feature | value |
---|---|
name | importQuality |
argument 1 | qualityCSV |
argument 2 | subunit |
returns | void |
description | imports qualities into destination table |
And execution:
SELECT importQuality('/SSUParc_115_quality.csv','18S ribosomal RNA');
Now the ETL process leaves the territory of sequences and continues by importing annotations from ENA. Like the sequence load, this process can be decomposed into transforming the downloaded EMBL headers into intermediate CSV files and loading those into rDB.
What is an annotation?
An annotation is a set of extra information about the given sequence. In rPredictorDB, annotations contain the following data:
name | description |
---|---|
annotation | general set of information like description or comments |
references | references to the given sequence, e.g. in journals, papers, etc. |
authors | authors of reference |
applicants | patent applicants |
features | features of the sequence |
qualifiers | key-value pairs with detailed information about given feature |
xrefs | references to other databases where the given sequence can be found |
Since the ENA files were already used to repair the taxonomic information in the extraction phase, they should already be downloaded and it is not necessary to download them again. Now we need to extract all the fields and save them to the intermediate CSV files. This is not logically complex.
To do this programmatically, there is a utility called AnnotationsParser (AP). It is written in C# and runs under Mono or the .NET framework. The executable takes the parts of a database connection string as parameters; the connection is needed to get all accession numbers from rDB. The only exception is the first parameter - it says how many parallel threads should be used for downloading missing ENA files.
If you want to build this utility only, you can call make this way:
make annotationsParser
Running the utility can look like this:
mono annotationsParser.exe 4 localhost 5432 rpredictor password rDB
Simply put, the AP gets all accession numbers from rDB and for each of them retrieves the EMBL header XML. The XML is parsed and the results are exported into several files, one per annotation feature.
annotation feature | file |
---|---|
annotations | imp_annotations.csv |
applicants | imp_applicants.csv |
authors | imp_authors.csv |
features | imp_features.csv |
keywords | imp_keywords.csv |
qualifiers | imp_qualifiers.csv |
references | imp_references.csv |
xrefs | imp_xrefs.csv |
Loading the created files into the database is implemented the same way as the previous loads: there are database functions doing all the work. All of them have the same signature, so it is described only once, followed by the list of function names.
feature | description |
---|---|
return type | void |
argument 1 | path to the CSV file |
argument 2 | one character used as fields delimiter |
description | imports the given CSV file (arg 1) into its destination table |
And the promised list of functions: importAnnotations, importFeatures, importQualifiers, importReferences, importAuthors, importApplicants, importXrefs and importKeywords.
And a sample of how to execute the functions:
SELECT importAnnotations('/imp_annotations.csv','|');
SELECT importFeatures('/imp_features.csv','|');
SELECT importQualifiers('/imp_qualifiers.csv','|');
SELECT importReferences('/imp_references.csv','|');
SELECT importAuthors('/imp_authors.csv','|');
SELECT importApplicants('/imp_applicants.csv','|');
SELECT importXrefs('/imp_xrefs.csv','|');
SELECT importKeywords('/imp_keywords.csv','|');
At this point, all data about primary structures (sequences) are present in rDB, and the process now aims to fill the last part of rDB with predicted secondary structures. Like the steps mentioned above, this one is also divided into two substeps: computing the predictions (together with their structural features) and loading them into rDB.
Bulk prediction (= batch prediction) is the trickiest part of the whole ETL process. It takes a list of primary structures (sequences) and predicts secondary structures for them. To support structural analysis, rPredictorDB stores structural features as well.
Warning
This part is highly dependent on the environment and prediction algorithm.
The first thing that has to be done is to replace wildcards in sequences as the current prediction algorithm is not able to handle them.
Note
Wildcards are e.g. N, which represents an uNknown nucleotide, R, which represents adenine OR guanine, or B, which represents anything except adenine.
Of course, the question is which nucleotide should be used to replace a wildcard. This question has no clear answer. Thus we determine the distribution of nucleotides in the sequence and then use the most frequent nucleotide allowed by the wildcard. If more allowed nucleotides have the same frequency, an arbitrary one is chosen. To do this, there is an application called replacement, written in C++, that can be compiled using:
g++ -std=c++11 -o replacement prediction/replacement.cpp
The application processes an input FASTA file and prints the resolved sequences to standard output.
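A typical invocation could therefore look like this (a sketch; the output file name is arbitrary):

./replacement Chordata.fasta > Chordata-resolved.fasta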
For the prediction itself, another tool (or a set of tools) is needed. In our case it is CP-predict, a two-phase algorithm for rRNA structure prediction. The current version of the prediction algorithm is written in MATLAB and is distributed as an installer for UNIX-like systems; it requires the MATLAB runtime (which can be downloaded automatically during installation). The application takes a template and sequences in FASTA format as input, saves each predicted structure into a separate dot-bracket file and prints some statistics to standard output.
Finally, the predicted structures are merged and transformed into one CSV file in the format:
accession number | start position | stop position | structure in dot-paren format |
---|---|---|---|
To make this process easier, there is a script called predict.sh that groups all the steps together. This script takes a template structure and the sequences to be predicted as input and produces one structure file in dot-bracket format for each input sequence and one CSV file with all predicted structures. The script can be called as:
./predict.sh templates/4v6x.br Chordata.fasta
The other goal is to extract Structural features from the created predictions. This is not as time-consuming as the prediction itself, and the workflow is a bit simpler: the features are first extracted from the structure files by the Extract utility and then converted into CSV files by featuresToCSV.exe.
Extract is a utility written in Java and can be built as:
javac Extract.java
The utility takes as arguments a list of directories to process and can be called as:
java Extract working_dir/*/
The featuresToCSV.exe utility can be built with:
make featuresToCSV
The utility takes two arguments:
feature | description |
---|---|
assembly name | featuresToCSV.exe |
argument 1 | path to all_features |
argument 2 | separator character for created CSV files |
output | creates two CSV files for each struct. feature type |
The utility is called as:
mono featuresToCSV.exe /all_features ';'
It creates two CSV files for each structural feature type - one containing the instances of the feature and one storing the positions where each instance appears. In the current version, the output is:
feature | feature file | positions file |
---|---|---|
5-overhang | foverhangs.csv | foverhangsPositions.csv |
3-overhang | toverhangs.csv | toverhangsPositions.csv |
bulge | bulges.csv | bulgesPositions.csv |
hairpin | hairpins.csv | hairpinsPositions.csv |
helix | helices.csv | helicesPositions.csv |
junction | junctions.csv | junctionsPositions.csv |
loop (internal) | loops.csv | loopsPositions.csv |
As in the case of prediction, there is a script called extract.sh that groups all the steps together. It takes the directories in which to look for structure files. The script can be called as:
./extract.sh Chordata/ Diptera/
Once the prediction is over, the predictions can be loaded into rDB - this involves only executing database functions with the proper file paths.
function feature | value |
---|---|
function name | importStructures |
argument 1 | path to the dotparHeap.dp |
argument 2 | character used as CSV separator |
argument 3 | algorithm used for the prediction |
return type | void |
function feature | value |
---|---|
function name | importStructuralFeatures |
argument 1 | path to foverhangs.csv |
argument 2 | path to foverhangsPositions.csv |
argument 3 | path to toverhangs.csv |
argument 4 | path to toverhangsPositions.csv |
argument 5 | path to helices.csv |
argument 6 | path to helicesPositions.csv |
argument 7 | path to bulges.csv |
argument 8 | path to bulgesPositions.csv |
argument 9 | path to loops.csv |
argument 10 | path to loopsPositions.csv |
argument 11 | path to junctions.csv |
argument 12 | path to junctionsPositions.csv |
argument 13 | path to hairpins.csv |
argument 14 | path to hairpinsPositions.csv |
argument 15 | character used as CSV separator |
return type | void |
The last part of the prediction is the visualization of the predicted structures. To do this, we use the external tool RNAplot, which creates the secondary structure visualizations, and Inkscape, which creates the thumbnails. Both tools are called through the script plot.sh, which takes a list of dot-paren files as input and can be called as:
./plot.sh working_dir/*.br
The very last step (well, finally!) of the rETL process is very simple - it only involves the deletion of temporary tables and the creation of auxiliary indices and summary values.
After the rETL process, some unnecessary tables survive. The cleanup function gets rid of them.
function feature | value |
---|---|
function name | cleanup |
arguments | none |
return type | void |
For production usage, indices must be present in every relational database. Some of the rETL functions create indices, but there remain columns which should be indexed. There is a script called 00_CREATE.sql in the 02_INDICES subfolder of the database branch that creates the indices used in our reference implementation (service). However, the right set of indices is highly dependent on the requested behaviour and the most common queries.
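As an illustration only, an additional index can be created like this (the table and column names below are placeholders and have to be adapted to the actual schema and query load):

psql rDB -c "CREATE INDEX idx_sequence_accession ON rpredict.sequence (accession)"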
Warning
This cannot typically be executed in one go. Some of the utilities need data previously imported into rDB and then produce output for the next database function.
The correct order (one of the correct orders) is:
SELECT create_rdb_structures();
SELECT prepareIntermediateStructure();
SELECT importFASTA('/ETL/frontend/SSURef_115.csv'::varchar,'|');
SELECT transformIntermediateData();
SELECT importTaxonomy('/ETL/frontend/taxonomy.csv'::varchar,'|');
SELECT importQuality('/SSUParc_115_quality.csv'::varchar,'18S ribosomal RNA'::varchar);
SELECT importAnnotations('/imp_annotations.csv'::varchar,'|');
SELECT importFeatures('/imp_features.csv'::varchar,'|');
SELECT importQualifiers('/imp_qualifiers.csv'::varchar,'|');
SELECT importReferences('/imp_references.csv'::varchar,'|');
SELECT importAuthors('/imp_authors.csv'::varchar,'|');
SELECT importApplicants('/imp_applicants.csv'::varchar,'|');
SELECT importXrefs('/imp_xrefs.csv'::varchar,'|');
SELECT importKeywords('/imp_keywords.csv'::varchar,'|');
SELECT importStructures('/dotpar_heap.db'::varchar,';','cp-predict v. 16.12.07'::varchar);
SELECT importStructuralFeatures(
'/toverhangs.csv'::varchar, '/toverhangsPositions.csv'::varchar,
'/foverhangs.csv'::varchar, '/foverhangsPositions.csv'::varchar,
'/helices.csv'::varchar, '/helicesPositions.csv'::varchar,
'/bulges.csv'::varchar, '/bulgesPositions.csv'::varchar,
'/loops.csv'::varchar, '/loopsPositions.csv'::varchar,
'/junctions.csv'::varchar, '/junctionsPositions.csv'::varchar,
'/hairpins.csv'::varchar, '/hairpinsPositions.csv'::varchar,
';');
SELECT cleanup();
SELECT setTaxonomyAmounts();