This chapter should give you a brief idea of what kind of data is available in rPredictorDB. It is a short, high-level overview only; for the detailed description of what rPredictorDB’s dataset offers, see rPredictorDB record detail.
To talk about the rPredictorDB data, we first need to clear up some terms.
Here, we will talk only about the contents of the rPredictorDB dataset, leaving the technical aspects aside. The rDB database is described in the technical part of the documentation, The Data of rPredictorDB; the dataset representations used by other tools are described in the individual tools’ setup instructions.
Note
A complete overview of the individual fields available from the database is given in rPredictorDB record detail.
For an overview of available export formats, see Exporting results.
The dataset generally contains information of the following types:
Four external sources of data are combined in the dataset, together with secondary structure data predicted in-house during ETL (Extraction-Transformation-Load, the process that assembles the dataset into rDB; a detailed description of the process is in section The ETL layer of rPredictorDB). The external sources are SILVA, Rfam, ENA (European Nucleotide Archive) and the NCBI Taxonomy database.
The SILVA database provides the core of rDB for ribosomal RNA: primary structures (nucleotide sequences), their unique identification by accession numbers, and sequence quality measures. In this area, SILVA is well curated and has a comprehensive quality control system. It also provides taxonomic information for the sequences, but the quality of that information is insufficient for our needs.
The current publication for SILVA is the 2013 article The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. This article describes in detail the meaning of various quality indicators.
Rfam provides primary structures for RNA other than ribosomal RNA, together with their accession numbers. Rfam does not provide quality measures as ENA does; however, for each family it provides so-called ‘seeds’ (curated subsets of representative sequences) and a consensus sequence. Thus, we use sequence similarity as a quality measure.
The current publication for Rfam is the 2014 article Rfam 12.0: updates to the RNA families database.
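To illustrate how sequence similarity can serve as a quality measure, here is a minimal sketch that scores a sequence by its fraction of identical positions against a family consensus. It assumes the two sequences are already aligned to the same length with `-` as the gap character; the function name `percent_identity` and the choice of plain identity as the similarity score are illustrative assumptions, not necessarily the measure rPredictorDB computes.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Return the fraction of aligned columns with identical nucleotides.

    Hypothetical helper: both inputs must already be aligned to the same
    length, with '-' marking gaps. Columns where both sequences have a
    gap carry no information and are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")

    matches = 0
    compared = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" and y == "-":
            continue  # gap in both sequences: ignore this column
        compared += 1
        if x == y:
            matches += 1  # same nucleotide in both sequences
    return matches / compared if compared else 0.0


if __name__ == "__main__":
    consensus = "GGCAUC-G"   # made-up consensus, for illustration only
    candidate = "GGCUUCAG"   # made-up candidate sequence
    print(f"identity to consensus: {percent_identity(consensus, candidate):.2f}")
```

A score close to 1.0 would mean the sequence closely matches the family consensus; where to draw the line between acceptable and poor quality is a separate decision that this sketch does not make.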
ENA provides rPredictorDB with a wealth of additional annotation about each sequence: references to scientific literature, classification by source molecule type, the method of obtaining the sequence, etc. The structure of ENA records is much more complicated than that of SILVA records (which are relatively flat); ENA itself integrates data from various sources. (For the purposes of rData, we use the ENA REST API to retrieve only the records of interest.)
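Retrieving only the records of interest, rather than mirroring the whole archive, can be sketched as one HTTP request per accession number. The example below assumes the public ENA Browser API URL layout (`https://www.ebi.ac.uk/ena/browser/api/<format>/<accession>`); the endpoint used by the rPredictorDB ETL may differ, and the function names `ena_record_url` and `fetch_ena_record` are hypothetical.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: base URL of the public ENA Browser API, not necessarily
# the endpoint that the rPredictorDB ETL layer calls.
ENA_BROWSER_API = "https://www.ebi.ac.uk/ena/browser/api"


def ena_record_url(accession: str, fmt: str = "embl") -> str:
    """Build the URL of a single ENA sequence record (hypothetical helper)."""
    if fmt not in ("embl", "fasta"):
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"{ENA_BROWSER_API}/{fmt}/{quote(accession, safe='')}"


def fetch_ena_record(accession: str, fmt: str = "embl", timeout: float = 30.0) -> str:
    """Download one record as text; requires network access."""
    with urlopen(ena_record_url(accession, fmt), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only the URL is printed here, so the example runs offline.
    print(ena_record_url("AB000001"))
```

Fetching per accession keeps the download proportional to the number of sequences already selected from the primary sources.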
As its name suggests, the Taxonomy database provides taxonomic classification for all sequences in our dataset, because the information in the primary databases is often discontinuous and inconsistent.
The fifth source from which the dataset is built is an in-house secondary structure prediction method. For the current release of rPredictorDB, we use the second version of the custom rRNA secondary structure prediction algorithm to create the predictions. (See: CP-predict: a two-phase algorithm for rRNA structure prediction)
In addition to the predicted structure, a list of structural features is computed for each predicted structure. Structural features describe certain basic “building blocks”: secondary structure motifs of several nucleotides each. (They are described in detail in the section Structural features.)
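As a rough illustration of how such features can be derived from a predicted structure, the sketch below finds hairpin loops in a structure written in dot-bracket notation. Both the notation and the choice of hairpin loops as the example motif are assumptions made for this sketch; the features actually stored in rPredictorDB, and the way they are computed, are described in the section Structural features. The function names are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_pairs(structure: str) -> Dict[int, int]:
    """Map each paired position to its partner in a dot-bracket string."""
    stack: List[int] = []
    pairs: Dict[int, int] = {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)          # remember the opening position
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            j = stack.pop()
            pairs[i], pairs[j] = j, i
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


def hairpin_loops(structure: str) -> List[Tuple[int, int]]:
    """Return the closing base pairs (i, j) of all hairpin loops.

    A base pair closes a hairpin loop when every position strictly
    between i and j is unpaired. Positions are 0-based.
    """
    pairs = parse_pairs(structure)
    return sorted(
        (i, j)
        for i, j in pairs.items()
        if i < j and all(k not in pairs for k in range(i + 1, j))
    )


if __name__ == "__main__":
    for i, j in hairpin_loops("((..))..(...)"):
        print(f"hairpin closed by pair ({i}, {j}), loop size {j - i - 1}")
```

Other motifs, such as internal loops or bulges, could be identified from the same pair table by inspecting which paired and unpaired positions each base pair encloses.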