.. _Technical-cp-predict:

CP-predict: setup & customization
**********************************

This is a guide to setting up, customizing and running CP-predict. For an overview of what it is and what it does, read :ref:`User-cp-predict`.

.. _Technical-cp-predict-requirements:

Requirements
============

To use CP-predict, you must have the following installed:

* the MATLAB Runtime

* the `Vienna RNA package <http://tbi.univie.ac.at/RNA>`_
 
* ``clustalw2``, which can be found somewhat unorthodoxly at the `Help page of its website <http://www.ebi.ac.uk/Tools/msa/clustalw2/help/>`_ 

* Optionally, if you want to set up your own templates from structures in the PDB, you will need `x3dna-dssr <http://x3dna.org>`_, Python 2.7 and the latest Biopython from the `development branch on GitHub <http://github.com/biopython/biopython>`_ (or, if not the latest, at least a version that includes the ``Bio.Phylo.TreeConstruction`` module).

* If you wish to compile the API documentation, you will need `Sphinx <http://sphinx-doc.org>`__.
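
Before going further, it can help to check which of these dependencies are visible from your environment. The following is a minimal sketch, not part of CP-predict. It assumes Python 3 (for ``shutil.which``) and assumes the executables are named ``RNAfold`` (shipped with the Vienna RNA package), ``clustalw2`` and ``x3dna-dssr``; adjust the names if your installation differs. It does not check for the MATLAB Runtime.

.. code-block:: python

   import importlib
   import shutil

   # Executable names are assumptions; adjust them to your installation.
   REQUIRED_TOOLS = ["RNAfold", "clustalw2"]
   OPTIONAL_TOOLS = ["x3dna-dssr"]


   def missing_tools(names):
       """Return the names that cannot be found on the current PATH."""
       return [name for name in names if shutil.which(name) is None]


   def has_module(module_name):
       """Return True if the given Python module can be imported."""
       try:
           importlib.import_module(module_name)
       except ImportError:
           return False
       return True


   if __name__ == "__main__":
       print("missing required tools:", missing_tools(REQUIRED_TOOLS))
       print("missing optional tools:", missing_tools(OPTIONAL_TOOLS))
       print("Bio.Phylo.TreeConstruction available:",
             has_module("Bio.Phylo.TreeConstruction"))

An empty list means every tool in that group was found. Note that the module check only reflects the interpreter that runs the snippet, which may not be the Python 2.7 installation used for template setup.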

.. _Technical-cp-predict-setup:

Setup
=====

If you want to use CP-predict with the default infrastructure, simply run::

   rPredictorDB_web.install
   
This will install the prediction script and, optionally, the MATLAB Runtime that CP-predict needs in order to run.

If you wish to customize your installation and/or to understand what CP-predict is doing behind the scenes, read on.


Converting from PDB to secondary structure
------------------------------------------

The conversion from the 3D-structure measurements recorded in PDB files to secondary structure is a non-trivial task. It is handled by the ``pdb2dp.py`` script. Most of the "heavy lifting" is done by ``x3dna-dssr``, which is the program that determines which residues form base pairs based on the positions of their atoms. 

However, there are additional concerns which DSSR does not address. Some residues are not measured in the PDB files. While unmeasured residues would only impede template selection for rather closely related templates, they might distort conservation statistics. More generally, if the sequence is known, we are discarding information by not taking it into account.

Re-inserting unmeasured residues into the structure predicted by DSSR is implemented in ``pdb2dp.py``; the script is essentially a wrapper for DSSR that additionally performs this re-insertion. It uses the ``*.ct`` output that DSSR provides and scans the input PDB file for REMARK 465 and REMARK 470 records, which mark residues that were not measured at all and residues from which some atoms are missing, respectively.
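
To illustrate the idea, here is a minimal sketch of scanning for REMARK 465 records and re-inserting the missing residues as unpaired positions. It is not the code used by ``pdb2dp.py``: the function names are hypothetical, it works on a dot-bracket string rather than on the ``*.ct`` output, it ignores insertion codes, and it does not handle REMARK 470.

.. code-block:: python

   import re

   # Data lines look like "REMARK 465       G A     1", with an optional model
   # number before the residue name; header lines do not match this pattern.
   REMARK_465 = re.compile(
       r"^REMARK 465\s+(?:(\d+)\s+)?(\S{1,3})\s+(\S)\s+(-?\d+)([A-Za-z]?)\s*$"
   )


   def unmeasured_residues(pdb_lines, chain):
       """Return sorted sequence numbers of residues missing from ``chain``."""
       numbers = set()
       for line in pdb_lines:
           match = REMARK_465.match(line)
           if match and match.group(3) == chain:
               numbers.add(int(match.group(4)))
       return sorted(numbers)


   def reinsert_unmeasured(measured, missing):
       """Merge measured residues with unmeasured ones, marked as unpaired.

       ``measured`` maps sequence number -> dot-bracket symbol.
       """
       merged = dict.fromkeys(missing, ".")
       merged.update(measured)  # measured residues win if both are present
       return "".join(merged[number] for number in sorted(merged))


   pdb_lines = [
       "REMARK 465 MISSING RESIDUES",
       "REMARK 465   M RES C SSSEQI",
       "REMARK 465       G A     1",
       "REMARK 465       C A     4",
       "REMARK 465       U B     2",
   ]
   missing = unmeasured_residues(pdb_lines, "A")
   structure = reinsert_unmeasured({2: "(", 3: ")"}, missing)

In this toy input, residues 1 and 4 of chain ``A`` are unmeasured, so the measured pair at positions 2 and 3 is padded with one unpaired position on each side.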

The other critical functionality that ``pdb2dp.py`` provides during the conversion process is *untangling*: the 3D structures were measured not on individual molecules but on the entire ribosome. In this setting, some residues form base pairs with residues of another rRNA molecule in the ribosomal subunit. These base pairs must be filtered out when working with individual molecules.
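
The filtering step can be sketched as follows. This is an illustration under stated assumptions, not the implementation in ``pdb2dp.py``: the function name ``untangle`` and the representation of a base pair as two ``(chain_id, sequence_number)`` tuples are hypothetical.

.. code-block:: python

   def untangle(base_pairs, chain):
       """Split base pairs into those inside ``chain`` and those leaving it.

       Each base pair is a tuple of two (chain_id, sequence_number) residues.
       Returns the intra-chain pairs and the sorted residue numbers of
       ``chain`` that pair with a different molecule.
       """
       intra, tangled = [], []
       for first, second in base_pairs:
           if first[0] == chain and second[0] == chain:
               intra.append((first[1], second[1]))
           elif chain in (first[0], second[0]):
               # Record which residue of ``chain`` pairs with another molecule.
               own = first if first[0] == chain else second
               tangled.append(own[1])
       return intra, sorted(tangled)


   pairs = [
       (("A", 1), ("A", 10)),  # within molecule A: kept
       (("A", 2), ("B", 7)),   # A pairs with B: filtered out, residue 2 noted
       (("B", 3), ("B", 8)),   # entirely in B: irrelevant for A
   ]
   intra, tangled = untangle(pairs, "A")

Keeping the list of tangled residues, rather than silently dropping the pairs, preserves the information that those positions are paired in the full ribosome even though they appear unpaired in the single-molecule structure.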

The ``pdb2dp.py`` script is the primary way of acquiring template structures. The ``setup_cp_predict.py`` script uses it internally. To run ``pdb2dp.py`` separately, use::

  pdb2dp.py -r $CP_ROOT --standard_with_dssr ABCD
  
to extract the secondary structure together with information about unmeasured and tangled regions.

.. note::

  Running ``x3dna-dssr`` may take up to several minutes.
  
.. note::

  ``pdb2dp.py`` is also very useful when you want to obtain reference structures for evaluating CP-predict performance.


.. _Technical-cp-predict-running:

Running CP-predict
===================

Running CP-predict yourself is not necessary for rPredictorDB operation. However, it may be useful for testing that the installation went correctly. Run::

  ./cp_predict.sh /usr/local/MATLAB/MATLAB_Runtime/v901/ -sqs=test.fasta -str=template.br \
                  -ALM=clustalw2 -EXTEND_MECHANICALLY_LONELY_PAIRS=0 -BOOTSTRAP=1
  
for a test run that exercises all parts of the prediction algorithm. ``/usr/local/MATLAB/MATLAB_Runtime/v901/`` is the location of the MATLAB Runtime, ``test.fasta`` is a FASTA file with the RNA sequence whose structure should be predicted, and ``template.br`` is the template in dot-paren format.
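
If a test run fails, one thing worth ruling out is a malformed template structure. The sketch below checks that a structure string in the dot-bracket style notation mentioned above is balanced and lists its base pairs. It is a generic helper, not part of CP-predict: the function name is hypothetical, it accepts only ``.``, ``(`` and ``)`` (no pseudoknot brackets), and it makes no assumption about the layout of a ``.br`` file beyond the structure string itself.

.. code-block:: python

   def parse_dot_bracket(structure):
       """Return base pairs as 0-based (i, j) index tuples with i < j."""
       stack = []
       pairs = []
       for index, symbol in enumerate(structure):
           if symbol == "(":
               stack.append(index)
           elif symbol == ")":
               if not stack:
                   raise ValueError("unmatched ')' at position %d" % index)
               pairs.append((stack.pop(), index))
           elif symbol != ".":
               raise ValueError(
                   "unexpected symbol %r at position %d" % (symbol, index)
               )
       if stack:
           raise ValueError("unmatched '(' at position %d" % stack[-1])
       return sorted(pairs)


   pairs = parse_dot_bracket("((..))")

A ``ValueError`` points at the first offending position, which is usually enough to locate a truncated or hand-edited structure line.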