.. _Technical-rData: The Data of rPredictorDB ************************* This section documents the main part of rPredictorDB's backend that stores the data - the rDB database. General information ====================== The rDB is architected as relational (or object-relational) database. It is implemented in Postgresql 9.5. However, the principles are the same for most of enterprise RDBMS (ie. Oracle or SQL Server) so it could be maintained using some other system as well. Of course it has to be compatible with other parts of the rPredictorDB system. Database schema ==================== The database can be divided into 4 logical parts. This does not affect the database schema on a syntactic level, but it is important for understanding its semantics. The parts are: ======================== = rDB logical parts ======================== = Primary structures Secondary structures Annotations Miscellaneous ======================== = All database tables belong to the schema called *rpredict*. Complete database diagram: .. figure:: database_schema.svg :figwidth: 650 :width: 650 :align: center Naming conventions ---------------------- * Names of entities are in plural. * Strong entities have self-describing names. * Weak entities representing many-to-many relations have dash separated two words names. Both of words are names of tables in the relation. Primary structures -------------------- Tables overview ^^^^^^^^^^^^^^^^^^^^^^^ ======================== = Table name ======================== = rpredict.Accessions rpredict.Qualities rpredict.Sequences rpredict.Taxonomy ======================== = Tables details ^^^^^^^^^^^^^^^^^^ rpredict.Accessions #################### Contains sequence identifiers. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier accession VARACHAR(255) Biological identifier start INTEGER Sequence region start stop INTEGER Sequence region stop state INTEGER Allows to classify items dataset VARCHAR(50) Source dataset insertion TIMESTAMP Timestamp when sequence has been imported fk_taxonomy INTEGER Reference to taxonomy fk_annotation INTEGER Reference to sequence's annotation ======================== ============== =============================== rpredict.Qualities #################### Contains information about sequence quality. ======================== ============== ====================================== Column name Type Note ======================== ============== ====================================== id INTEGER Internal identifier fk_accession INTEGER Reference to accession annotation_source VARCHAR(255) The source(s) where ENA took this data from sequence_quality NUMERIC(20,7) SILVA quality indicator (combination of ambig./homopol./vec.contam.) ambiguities NUMERIC(20,7) % of ambiguously sequenced nucleotides homopolymers NUMERIC(20,7) % of homopolymers longer than 4 nt. vector_contamination NUMERIC(20,7) % of sequence contaminated by vector alignment_quality NUMERIC(20,7) Conservancy indicator base_pair_score INTEGER SILVA-specific: possible base pairs in common with SILVA ref. structure aligned_bases INTEGER Number of seq. bases aligned to within rRNA gene. pintail_quality INTEGER Confidence that sequence is not anomalous state INTEGER Allows to classify items insertion TIMESTAMP Timestamp when quality has been imported ======================== ============== ====================================== rpredict.Sequences #################### Contains strings of rRNA nucleotide sequences. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier fk_accession INTEGER Reference to accession sequence TEXT String of nucleotide sequence ======================== ============== =============================== rpredict.Taxonomy #################### Contains taxonomy tree. It is implemented as a tree using self-reference to parent node. It is possible there is no one root only. More formally said, it is a forest. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier level INTEGER Number of edges from a root fk_parent INTEGER Self-reference to parent node name VARCHAR(255) Name of current node path_name VARCHAR(4000) Contains names of all of node's ancestors recursively from the root. By default separated by a semicolon. is_specie BOOLEAN Flag if the node is a leaf subtreesize INTEGER Number of leafs in the subtree ======================== ============== =============================== Annotations ---------------- Tables overview ^^^^^^^^^^^^^^^^^ ================================ = Table name ================================ = rpredict.Annotations rpredict.Annotation_keywords rpredict.Applicants rpredict.Applicants_references rpredict.Authors rpredict.Authors_references rpredict.Features rpredict.Keywords rpredict.Qualifiers rpredict.References rpredict.Xrefs ================================ = Tables details ^^^^^^^^^^^^^^^^^^^^^ rpredict.Annotations #################### Contains information about sequence annotation. ======================== ============== ================================== Column name Type Note ======================== ============== ================================== id INTEGER Internal identifier molecule_type VARCHAR(4000) Sequence source molecule type (rRNA, mRNA, DNA, etc.) topology VARCHAR(4000) Whether the source is linear, or circular RNA data_class VARCHAR(4000) ENA data class annotation version INTEGER Version of the record in ENA database first_public TIMESTAMP Date when it was first published last_updated TIMESTAMP Date of the last update last_updated_release VARCHAR(255) Identifier of the last release within this record has been updated description TEXT Often contains 16S/23S/... info comment TEXT Author's comment taxonomic_division VARCHAR(4000) ENA taxonomic division annotation ======================== ============== ================================== rpredict.Annotation_keywords ############################# Weak entity connecting annotations and keywords. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== fk_annotation INTEGER Reference to annotation fk_keyword INTEGER Reference to keyword ======================== ============== =============================== rpredict.Applicants #################### Contains list of patent applicants mentioned in references. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier name VARCHAR(4000) Name of the applicant ======================== ============== =============================== rpredict.Applicants_references #################################### Weak entity connection applicants and references. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== fk_applicant INTEGER Reference to applicant fk_reference INTEGER Reference to annotation reference ======================== ============== =============================== rpredict.Authors #################### Contains list of authors mentioned in annotation references. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier name VARCHAR(4000) Name of the author ======================== ============== =============================== rpredict.Authors_references #################################### Weak entity connection authors and references. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== fk_author INTEGER Reference to author fk_reference INTEGER Reference to annotation reference ======================== ============== =============================== rpredict.Features #################### Unions some properties of a region of a nucleotide sequence (part of a gene). ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier name VARCHAR(4000) location VARCHAR(255) The sequence region fk_annotation INTEGER Reference to annotation ======================== ============== =============================== rpredict.Keywords #################### Contains keywords mentioned in annotation references. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier keyword VARCHAR(255) Keyword ======================== ============== =============================== rpredict.Qualifiers #################### Contains key-value pairs with sequence's feature's information. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier key VARCHAR(255) Key value TEXT Value fk_feature INTEGER Reference to sequence feature ======================== ============== =============================== rpredict.References #################### Contains information about references (ie. an article, paper,...). ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier title VARCHAR(4000) consortium VARCHAR(4000) submission_date TIMESTAMP journal VARCHAR(4000) year INTEGER volume VARCHAR(4000) issue VARCHAR(4000) first_page VARCHAR(255) last_page VARCHAR(255) comment TEXT reference_location TEXT type VARCHAR(255) Contains: patent, thesis, unpublished, article, submission, book number INTEGER location VARCHAR(4000) fk_annotation INTEGER Reference to annotation ======================== ============== =============================== rpredict.Xrefs #################### Contains data about references (links) to external databases. Reference can be stored for annotaion (reference to a DB) or annotation reference (journal, webpage) as well. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier db VARCHAR(255) Database name db_id VARCHAR(255) Id in destination database secondary_id VARCHAR(255) Secondary id in destination DB fk_annotation INTEGER Reference to annotation fk_reference INTEGER Reference to annotation reference ======================== ============== =============================== Secondary structures -------------------------- Tables overview ^^^^^^^^^^^^^^^^^^^ ================================ = Table name ================================ = rpredict.Predictions rpredict.Structures rpredict.Toverhangs rpredict.Toverhangs_positions rpredict.Foverhangs rpredict.Foverhangs_positions rpredict.Helices rpredict.Helices_positions rpredict.Bulges rpredict.Bulges_positions rpredict.Loops rpredict.Loops_positions rpredict.Junctions rpredict.Junctions_positions rpredict.Hairpins rpredict.Hairpins_positions ================================ = Tables details ^^^^^^^^^^^^^^^^^ rpredict.Predictions #################### Contains information about secondary structure prediction. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier fk_accession INTEGER Reference to accessions algorithm VARCHAR(255) Algorithm used for prediction template VARCHAR(255) Template used for prediction similarity NUMERIC(20,7) Sequence similarity to the template z_score NUMERIC(20,7) Quality of RNAdistance score with respect to a random sequence ======================== ============== =============================== rpredict.Structures #################### Contains rRNA secondary structures as strings in dot-paren format. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier fk_prediction INTEGER Reference to predictions structure TEXT Secondary structure in dot-paren string ======================== ============== =============================== rpredict.Foverhangs #################### Contains instances of 5-overhangs. .. note:: All of following tables are containing info about structural features. Informally said, structural feature is a substring of structure. These tables have the same "interface". No inheritance is maintained so tables are defined in very similar way. The pattern is two tables per structural feature. The first table's name likes structural feature's name and represents one instance of the feature in given structure. The other table, with the suffix _positions constains information where in structure this instance of structural feature appears.* ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier fk_prediction INTEGER Reference to predictions ======================== ============== =============================== rpredict.Foverhangs_positions ################################ Contains positions of 5-overhangs. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier fk_foverhang INTEGER Reference to foverhangs start INTEGER Start position stop INTEGER Stop position ======================== ============== =============================== rpredict.Toverhangs #################### Contains instances of 3-overhangs. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier fk_prediction INTEGER Reference to predictions ======================== ============== =============================== rpredict.Toverhangs_positions ################################ Contains positions of 3-overhangs. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier fk_toverhang INTEGER Reference to toverhangs start INTEGER Start position stop INTEGER Stop position ======================== ============== =============================== rpredict.Helices #################### Contains instances of helices. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier fk_prediction INTEGER Reference to predictions ======================== ============== =============================== rpredict.Helices_positions ################################ Contains positions of helices. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier fk_helice INTEGER Reference to helices start INTEGER Start position stop INTEGER Stop position ======================== ============== =============================== rpredict.Bulges #################### Contains instances of bulges. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier fk_prediction INTEGER Reference to predictions ======================== ============== =============================== rpredict.Bulges_positions ################################ Contains positions of bulges. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier fk_bulge INTEGER Reference to bulges start INTEGER Start position stop INTEGER Stop position ======================== ============== =============================== rpredict.Loops #################### Contains instances of loops. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier fk_prediction INTEGER Reference to predictions ======================== ============== =============================== rpredict.Loops_positions ################################ Contains positions of loops. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier fk_loop INTEGER Reference to loops start INTEGER Start position stop INTEGER Stop position ======================== ============== =============================== rpredict.Junctions #################### Contains instances of junctions. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier fk_prediction INTEGER Reference to predictions ======================== ============== =============================== rpredict.Junctions_positions ################################ Contains positions of junctions. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier fk_junction INTEGER Reference to junctions start INTEGER Start position stop INTEGER Stop position ======================== ============== =============================== rpredict.Hairpins #################### Contains instances of hairpins. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier fk_prediction INTEGER Reference to predictions ======================== ============== =============================== rpredict.Hairpins_positions ################################ Contains positions of hairpins. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier fk_hairpin INTEGER Reference to hairpins start INTEGER Start position stop INTEGER Stop position ======================== ============== =============================== Miscellaneous ---------------- This part is somewhat removed from the bionformatical bag of tricks. At this time it contains a single table used as storage for rPredictorDB's RSS channel. Tables overview ^^^^^^^^^^^^^^^^^ ================================ = Table name ================================ = rpredict.rss ================================ = Tables details ^^^^^^^^^^^^^^^^^ rpredict.rss #################### This table is an underlying storage for rPredictorDB's RSS channel. ======================== ============== =============================== Column name Type Note ======================== ============== =============================== id INTEGER Internal identifier title VARCHAR(50) Title of the new description VARCHAR(255) The new's desription link VARCHAR(100) Hyperlink to the new published TIMESTAMP When the new has been published ======================== ============== =============================== Functions ================== There are some database functions prepared. All of them together with usage are described in the ETL process documentation. List of the database functions: * create_rdb_structures * prepareIntermediateStructure * importFASTA * transformIntermediateData * importTaxonomy * importQuality * importAnnotations * importFeatures * importQualifiers * importReferences * importAuthors * importApplicants * importXrefs * importKeywords * importStructures * importStructuralFeature * importStructuralFeatures * setTaxonomyAmounts * cleanup Views =================== Currently, there is one view - ``rpredictor.vw_acc_ann_tax``. This view is simple left join of table ``rpredictor.accessions`` with tables ``rpredictor.annotations`` and ``rpredictor.taxonomy``. This results in shorter application code when performing search. The subversion repository ========================================= The database branch in rPredictorDB's repository contains folders named by the purpose of scripts it each contains. The name's prefix is a number - the order according to the number is also one of correct order of execution it's contents. ======================= ============================= folder description ======================= ============================= 01_FCEPROC database functions 02_INDICES some index definitions 03_VIEWS definiton of views 09_DEPLOY script with correct execution order ======================= =============================