.. _User-glossary:

rPredictorDB glossary
*********************

This is a list of terminology related to rPredictorDB. It aims to explain *how certain terms are used in the context of rPredictorDB*, not what these terms mean in general, because -- sadly -- not all terminology is used consistently across bioinformatics sites.

.. note::

  We do our best, but the glossary might not be complete. If you think that something is missing here, please let us know!

.. note::

  If you find any explanation here unclear, please do not hesitate to contact us.
  
  
Biology
=======

Nucleotide
----------

A nucleotide is a basic building block of nucleic acids such as RNA. The basic nucleotides are named after their bases: adenine, cytosine, guanine, thymine and uracil (denoted A, C, G, T, U). Of those, T occurs only in DNA molecules and is replaced by U during transcription (when the messenger RNA molecule "copies" the information from DNA), so RNA molecules are composed of A, C, G and U.


Residue
-------

A residue is a more general term than nucleotide. Nucleic acid residues are nucleotides; protein residues are amino acids. The scheme used to describe a structure in `PDB <http://www.rcsb.org/pdb>`_, which is the de facto standard, is model --> chain --> **residue** --> atom, from the highest to the lowest level of description.

Sequence
--------

A string that assigns a nucleotide to each position of a molecule. It looks like 'AAUGUUGACCGUGGACAG...'. Sequences are most often represented in the ``FASTA`` format, although many other formats exist.

Primary structure
-----------------

Synonymous to *sequence*.

Secondary structure
-------------------

The description of a nucleic acid molecule at the level of base pairs. For each position in the molecule's sequence, the secondary structure of the given molecule says whether the nucleotide at that position is paired or not. If it is paired, the structure also gives the position of the nucleotide to which it is paired. This includes pseudoknots, non-canonical base pairs and anything else that can be described in terms of base pairs, although some websites (notably the Comparative rRNA Web) refer to these base pairings as "tertiary interactions".

The secondary structure is typically represented either in a *dot-paren* format, or as a list of base pairs (optionally with some additional information). The two most common base pair list formats are called ``*.bpseq`` and ``*.ct``.
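
The relationship between the two representations can be sketched in a few lines of Python. This is only an illustration, not code used by rPredictorDB: the function names are hypothetical, positions are assumed to be numbered from 1, and only the symbols ``(``, ``)`` and ``.`` are handled, so pseudoknots (which need additional bracket types) are out of scope here.

```python
def dot_paren_to_pairs(structure):
    """Return a sorted list of 1-based (i, j) base pairs with i < j.

    Assumes plain dot-paren notation: '(' opens a pair, ')' closes
    the most recently opened one, '.' marks an unpaired position.
    """
    stack = []   # positions of '(' still waiting for a partner
    pairs = []
    for pos, symbol in enumerate(structure, start=1):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unsupported symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def pairs_to_partners(pairs, length):
    """Return, for each position, the partner position (0 = unpaired).

    Using 0 for unpaired positions is an assumption of this sketch.
    """
    partners = [0] * length
    for i, j in pairs:
        partners[i - 1] = j
        partners[j - 1] = i
    return partners


structure = "((..))"
pairs = dot_paren_to_pairs(structure)
partners = pairs_to_partners(pairs, len(structure))
```

The per-position partner list corresponds to the description above: for every position it says whether the nucleotide is paired and, if so, with which position.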

.. note::

  There is some confusion in literature with respect to what "secondary structure" means. In the Protein Data Bank and other data sources that work with the 3-D structure of molecules, secondary structure refers to some typical, standard elements of the 3-D structure that have similar names as features recognized in the base-pair-defined secondary structure: helices, hairpins, etc. We use the term secondary structure EXCLUSIVELY for the description of base pairs.
	
.. note::

  The reference secondary structures are obtained by a non-trivial algorithm from the measurements of individual atom positions in the rRNA molecule. Unless other sources are cited for reference structures, they were obtained from PDB 3-D structure measurements by the tool DSSR. 


(Bio)Informatics
================

Guide tree
----------

When building a multiple sequence alignment, the first step in many algorithms is to determine the order in which sequences are aligned to each other. This ordering is encoded by the *guide tree*: a tree graph where edge lengths represent how different the sequences are from each other. The closer the sequences are to each other, the sooner they are aligned. The guide tree can also be used as an estimate of sequence similarities.

If you want to see an example, the ClustalW2 multiple sequence alignment algorithm will generate a guide tree in *.dnd* format.


rPredictorDB-specific
=====================

Reference structure
-------------------

A reference structure is a secondary structure derived from an experimentally verified 3-D rRNA structure. See the question "How do you get reference structures?" in the :ref:`User-FAQ`.


Region
------

A region of a molecule is a set of adjacent residues. However, adjacency of residues is a non-trivial term: usually, it means residues connected to each other by the sugar-phosphate backbone, but the backbone can sometimes be broken and we still use the term "region", even if it includes the break. In rPredictorDB, for all intents and purposes, adjacency is defined by whichever numbering scheme is chosen: two residues are adjacent if they have numbers X and X+1.

Regions include their bounds: a region 5-7 will include residues 5, 6 and 7.
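
As a minimal sketch of the inclusive bounds, the following Python snippet lists the residue numbers in a region and extracts the matching part of a sequence. The function names are hypothetical, and the snippet assumes a simple numbering scheme in which residues are numbered consecutively starting at 1; other numbering schemes would need a lookup from residue number to string index.

```python
def region_residues(start, end):
    """Return the residue numbers in the region start-end, bounds included."""
    if start > end:
        raise ValueError("region start must not exceed region end")
    return list(range(start, end + 1))


def region_subsequence(sequence, start, end):
    """Return the part of `sequence` covered by the region start-end.

    Assumes residue number 1 is the first character of the string,
    so the inclusive region maps to the slice [start - 1:end].
    """
    if start < 1 or end > len(sequence) or start > end:
        raise ValueError("region lies outside the sequence")
    return sequence[start - 1:end]


residues = region_residues(5, 7)                   # [5, 6, 7]
fragment = region_subsequence("AAUGUUGACC", 5, 7)  # "UUG"
```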


Structural feature
------------------

A structural feature is defined as a subset of the structure that fulfills some constraints on the base pairs it contains (and typically is a maximal such subset). For instance, an *internal loop* is a set of two intervals in the structure such that the 5'-end of one is paired with the 3'-end of the other and vice versa and no other residues are paired. The structural features are defined to represent some common elements of secondary structures, such as helices, hairpin loops, etc. For definitions and examples of structural features, see :ref:`User-structural-features`.


.. _User-glossary-dot-paren-file-string:

Dot-paren file string
---------------------

A dot-paren file string is the content of a dot-paren file. The dot-paren file format is a way of storing secondary structure information. It looks like this::

  >FASTA header of a sequence
  AAACGCUAGCAGGAGUGCUUUGCACCGGAGAUCUCUGGAUAAGCACGGCGCGCAUCUCAGGAC
  ...(.((...))).((((((....((((((..))))))..))))))(.(..)).(((..))).
  
The first line is a FASTA header, the second line is the sequence and the third
line is a dot-paren representation of the secondary structure of that sequence.
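
A dot-paren file string with this three-line layout can be split into its parts as sketched below. The function name is hypothetical and the checks are assumptions of this sketch (exactly three non-empty lines, a header starting with ``>``, sequence and structure of equal length); it is not the parser used by rPredictorDB.

```python
def parse_dot_paren_string(text):
    """Split a dot-paren file string into (header, sequence, structure).

    Blank lines are ignored; the leading '>' is stripped from the header.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) != 3:
        raise ValueError(f"expected 3 non-empty lines, found {len(lines)}")
    header, sequence, structure = lines
    if not header.startswith(">"):
        raise ValueError("first line must be a FASTA header starting with '>'")
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    return header[1:], sequence, structure


# A short made-up record, used only to illustrate the layout.
example = """>toy hairpin
GGGAAACCC
(((...)))
"""
header, sequence, structure = parse_dot_paren_string(example)
```

Checking that the sequence and the structure have the same length catches the most common formatting mistake early, because every position in the sequence needs exactly one dot-paren symbol.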