This document specifies the export formats for rPredictorDB records.
Note
See Exporting for a walkthrough of the exporting functionality.
There are several supported formats for export:
Warning
FASTA/dot-paren export currently does not provide access to the entire record from the rPredictorDB dataset.
An example is worth a thousand words. The following is the JSON-format export that lists all entities exported from rPredictorDB. All other formats are simply a different syntax for the same set of key-value pairs, in the same ordering.
For the sake of (relative) brevity, we leave out some repetitive parts: most of the sequence and structure strings, and the full list of exported Structural features, which take up the overwhelming majority of the exported file. The structural features are described in full in the next example.
{ "sequence":"CCCGC...CACCGCACG",
"blastMatchingSubsequence":null,
"blastMatchingScore":null,
"blatMatchingSubsequence":null,
"blatMatchingScore":null,
"accession":"GM049488",
"startPosition":"1455",
"stopPosition":"4900",
"regionLength":"3446",
"silvaDataset":"LSU",
"sequenceQuality":"68.08",
"ambiguities":"0",
"homopolymers":"1.92",
"vectorContamination":"0",
"alignmentQuality":"35.79",
"basePairScore":"89",
"alignedBases":"3446",
"pintailQuality":"100",
"description":"Sequence 1324 from Patent WO2007026255.",
"moleculeType":"unassigned DNA",
"dataClass":"PAT",
"taxonomicDivision":"HUM",
"version":"1",
"firstPublic":"14. 12. 2008",
"lastUpdated":"14. 12. 2008",
"lastUpdatedRelease":"98",
"comment":null,
"name":"Homo sapiens (human)",
"pathName":"Eukaryota; Metazoa; (... taxonomy ...) Hominidae; Homo; ; Homo sapiens (human)",
"annotationSource":"RNAmmer; ",
"state":"1",
"references":
[
{ "id":"561199",
"title":"Dedifferentiated cells and methods of making and using dedifferentiated cells",
"consortium":null,
"submission_date":null,
"journal":null,
"year":null,
"volume":null,
"issue":null,
"first_page":null,
"last_page":null,
"comment":null,
"reference_location":"Patent number WO2007026255-A2/1324, 08-MAR-2007.",
"type":"patent",
"number":"1",
"location":null,
"authors":
[ "Freberg T.C.",
"Collas P."
],
"applicants":
[ "Universitetet I Oslo (NO)"
]
}
],
"features":
[
{ "id":"268023",
"name":"source",
"location":"1..5025",
"qualifiers":
{ "organism":"Homo sapiens"
}
}
],
"xrefs":
[
{ "id":"1214295",
"db":"SILVA-LSU",
"db_id":"GM049488",
"secondary_id":null,
"fk_annotation":"268023",
"fk_reference":null
}
],
"predictions":
[
{ "structure":"((((((((.( ... )).)).))))))). ",
"algorithm":"rf-predict",
"structure_id":"7796",
"bulges":
[
[0,1,7962846],
[1072,1074,7962846],
[17,19,7962847],
[1054,1055,7962847],
(... more bulges ...)
],
"foverhangs":
[
[0,0,339357]
],
"hairpins":
[
(...a list of hairpins...)
],
"helices": [ (...a list of helices...) ],
"junctions": [ (...a list of junctions...) ],
"loops": [ (...a list of internal loops...) ],
"toverhangs": [ (...the 3'-overhang of the structure...) ]
}
]}
There can be any number of entities in the "references", "features" and "xrefs" lists. (The set of structural feature lists is fixed, but the number of regions within each list is not; see Structural feature entities.)
The "references" and "xrefs" entities themselves, however, follow the format shown above: each member of the "references" list always has an "authors", a "title" and a "year" entry, although many of them may be null. The only exception would be an error in ENA, the source database of rPredictorDB, in which case an error when processing the export is to be expected.
The "features" entities, on the other hand, cannot be relied upon to contain any specific set of entries: they are provided by the authors of the original database records (ENA).
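The difference between the two kinds of entities can be sketched in Python as follows. This is only an illustration, not part of rPredictorDB: the record is a heavily trimmed copy of the JSON example, and the helper names `summarize_references` and `feature_qualifier` are hypothetical. Reference keys are read directly, while feature lookups fall back to a default.

```python
import json

# Heavily trimmed copy of the JSON example; only the keys used below are kept.
record = json.loads("""
{
  "accession": "GM049488",
  "references": [
    {"title": "Dedifferentiated cells and methods of making and using dedifferentiated cells",
     "year": null,
     "authors": ["Freberg T.C.", "Collas P."]}
  ],
  "features": [
    {"name": "source", "location": "1..5025",
     "qualifiers": {"organism": "Homo sapiens"}}
  ]
}
""")


def summarize_references(rec):
    """Return (title, year, authors) per reference; individual values may be None."""
    return [(r["title"], r["year"], r["authors"]) for r in rec["references"]]


def feature_qualifier(rec, feature_name, qualifier, default=None):
    """Look up a qualifier without assuming the feature or the qualifier exists."""
    for feature in rec.get("features", []):
        if feature.get("name") == feature_name:
            return (feature.get("qualifiers") or {}).get(qualifier, default)
    return default


print(summarize_references(record))
print(feature_qualifier(record, "source", "organism"))
print(feature_qualifier(record, "gene", "product", default="n/a"))
```

The "gene"/"product" lookup is an invented example of a feature that is absent from the record; it returns the default instead of raising an error.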
The structural feature entities are the list members of the predictions section: "bulges", "foverhangs", "hairpins", "helices", "junctions", "loops", "toverhangs". Each structural feature denotes a certain set of regions of the secondary structure. Each triplet [start,stop,ID] represents one region that makes up a structural feature; a structural feature is represented as a list of these regions.
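As a minimal Python sketch of working with these triplets, the regions can be collected per feature. It assumes that triplets sharing the same ID belong to the same structural feature, which is how the bulges in the JSON example appear to be paired; it makes no assumption about whether the stop position is inclusive. The function name `group_regions` is hypothetical.

```python
from collections import defaultdict

# Prediction entry trimmed to the regions shown in the JSON example.
prediction = {
    "structure_id": "7796",
    "bulges": [[0, 1, 7962846], [1072, 1074, 7962846],
               [17, 19, 7962847], [1054, 1055, 7962847]],
    "foverhangs": [[0, 0, 339357]],
}


def group_regions(triplets):
    """Map feature ID -> list of (start, stop) regions, in input order.

    Assumption: triplets with the same ID describe the same feature.
    """
    grouped = defaultdict(list)
    for start, stop, feature_id in triplets:
        grouped[feature_id].append((start, stop))
    return dict(grouped)


bulges = group_regions(prediction["bulges"])
for feature_id, regions in bulges.items():
    print(feature_id, regions)
```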
For a thorough description of what structural features are, see: Structural features.
In the CSV export, the fields of the record are separated by commas. If a field contains subfields (as is the case with Features, for example), the subfields are separated by vertical bars; the number of vertical bars denotes the nesting depth.
CUGGUUGAUC...GCGGAAGGAU,,AJ391735,1,2100,2100,"18S ribosomal RNA",97.62,0,0.14,0,94,105,2097,20,"Lutzomyia toroensis 18S rRNA gene",...
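A CSV line of this kind could be read in Python roughly as sketched below. The row and its nested value are invented for illustration, and the reading of the vertical-bar convention (a run of exactly n bars separates subfields at depth n) is an assumption, not something the export specification confirms. The helper name `split_subfields` is hypothetical.

```python
import csv
import io
import re


def split_subfields(value, depth=1):
    """Split on runs of exactly `depth` vertical bars (assumed convention)."""
    pattern = r"(?<!\|)" + r"\|" * depth + r"(?!\|)"
    return re.split(pattern, value)


# Hypothetical row: the field layout and the nested value are made up.
line = 'ACGU,AJ391735,"18S ribosomal RNA, partial","source||organism|gene||product"\n'

# csv.reader handles quoted fields that themselves contain commas.
row = next(csv.reader(io.StringIO(line)))

top_level = split_subfields(row[3], depth=1)
nested = [split_subfields(part, depth=2) for part in top_level]
print(row[:3])
print(nested)
```

Using the `csv` module rather than `str.split(",")` matters here because quoted fields such as descriptions may contain commas.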
See the description of the JSON format on Wikipedia.
The rPredictorDB JSON export looks like:
{"sequence":"AGTCTGATGTGAAAGCCTTCGGCTCAACCGAAACTGGGAAACTTGA",
"accession":"M58827",
"startPosition":"622", ... }
The complete list of exported fields mirrors the complete list of rPredictorDB record fields. This includes structural features and cross-references: the total number of exported fields varies between molecules.
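Numeric fields such as "startPosition" appear as strings in the examples, so a consumer will probably want to convert them. A minimal Python sketch, assuming the export holds a single record object as in the example; the helper `as_int` is hypothetical, and the elided fields are simply left out of the input.

```python
import json

# Trimmed from the example; elided fields are omitted rather than guessed.
text = ('{"sequence":"AGTCTGATGTGAAAGCCTTCGGCTCAACCGAAACTGGGAAACTTGA",'
        ' "accession":"M58827", "startPosition":"622"}')
record = json.loads(text)


def as_int(rec, key):
    """Convert a string-encoded numeric field to int; missing or null stays None."""
    value = rec.get(key)
    return None if value is None else int(value)


start = as_int(record, "startPosition")
missing = as_int(record, "stopPosition")  # not present in the trimmed input
print(record["accession"], start, missing)
```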
In the dot-paren export, the first line is a FASTA header containing the accession number and the name of the organism, the second line is the sequence, and the third line is the predicted structure.
>M58827.Lactobacillus plantarum
UAAUUUGAGAGUUUGAUCCUGGCUCAGGACGAACGCUGGCGGCGUGCCUAAUACAUGCAAGUCGAACGAACUC...
........(((((..(((((......((((.....((((.((....(((((((((.((.(..(.(.((((.((...
Warning
This is currently the only format that does not export the entire rPredictorDB record.
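A dot-paren export can be read and turned into base pairs with a simple stack, as in the following Python sketch. It assumes that the header separates the accession from the organism name with the first ".", as in the example above, and that the structure uses only ".", "(" and ")". The function names and the toy record are hypothetical.

```python
def parse_dot_paren(text):
    """Split a three-line record into (accession, organism, sequence, structure)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    header, sequence, structure = lines[0], lines[1], lines[2]
    if not header.startswith(">"):
        raise ValueError("expected a FASTA header starting with '>'")
    # Assumption: the first '.' separates the accession from the organism name.
    accession, _, organism = header[1:].partition(".")
    return accession, organism, sequence, structure


def base_pairs(structure):
    """Return sorted 0-based (open, close) index pairs of matching parentheses."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


# Toy record, invented for illustration (not an rPredictorDB entry).
toy = ">X00001.Example organism\nGGGAAACCC\n(((...)))\n"
accession, organism, sequence, structure = parse_dot_paren(toy)
pairs = base_pairs(structure)
print(accession, organism, pairs)
```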