Protein Tertiary Structures Zaman_Molecules20

Citation Author(s):
Ahmed Bin
Zaman
George Mason University
Parastoo
Kamranfar
George Mason University
Carlotta
Domeniconi
George Mason University
Amarda
Shehu
George Mason University
Submitted by:
Ahmed Bin Zaman
Last updated:
Mon, 04/20/2020 - 09:40
DOI:
10.21227/gq2v-8k24
Data Format:
License:
0
0 ratings - Please login to submit your rating.

Abstract 

 

 

Controlling the quality of tertiary structures computed for a protein molecule remains a central challenge in de-novo protein structure prediction. The rule of thumb is to generate as many structures as can be afforded, effectively acknowledging that having more structures increases the likelihood that some will reside near the sought biologically-active structure. A major drawback with this approach is that computing a large number of structures imposes time and space costs. This dataset is associated with our paper, "Reducing Ensembles of Protein Tertiary Structures Generated De Novo via Clustering", where we propose a novel clustering-based approach which we demonstrate to significantly reduce an ensemble of generated structures without sacrificing quality. This dataset provides the biologically-active structures of the protein targets used for evaluation, necessary data (sequence, fragment files) for generating structure ensembles, generated ensembles, reduced ensembles, and data necessary for generating and plotting the results in the paper. The paper is under review and we will update the link to the paper once it is published. The codes associated with this dataset can be found in, https://github.com/psp-codes/reduced-decoy-ensemble

Instructions: 

The .zip file contains 6 folders when unzipped. We provide the details of each folder below.

 

“Proteins” folder: Contains 20 protein targets organized into two folders (Benchmark and CASP) depending on the family each target belongs to. Data for each protein is provided in a subfolder named with its id. Each such subfolder contains the following 4 files.

  1. A .fasta file containing the amino-acid sequence of the protein.

  2. A .pdb file containing the native tertiary structure coordinates. Detailed format for a .pdb file can be found in http://www.wwpdb.org/documentation/file-format

  3. A .frag3 file containing the fragments of length 3 for the protein sequence generated from http://old.robetta.org/

  4. A .frag9 file containing the fragments of length 9 for the protein sequence generated from http://old.robetta.org/

 

“Generation” folder: Contains the generated ensembles for the protein targets in 20 subfolders, one for each target, named with their ids. Each subfolder contains 5 files, each containing the generated ensemble for one run. Each such file contains 14 columns and each row represents one generated structure. The first column provides the Rosetta score4 energy, the second column provides the lRMSD to the native structure, and each of the rest of the 12 columns provides one USR feature for the structure.

 

“Reduced” folder: Contains the reduced ensembles for each clustering technique in separate folders. Each such folder contains 20 subfolders, one for each target, named with their ids. Each such subfolder contains 5 files, each containing the reduced ensemble for one run. Each such file contains 2 columns and each row represents one structure in the reduced ensemble. The first column provides the Rosetta score4 energy and the second column provides the lRMSD to the native structure.

 

“Truncation” folder: Contains the reduced ensembles via truncation for the protein targets in 20 subfolders, one for each target, named with their ids. Each such subfolder contains 5 files, each containing the reduced ensemble for one run. Each such file contains 2 columns and each row represents one structure in the reduced ensemble. The first column provides the Rosetta score4 energy and the second column provides the lRMSD to the native structure.

 

“Ks” folder: Contains 4 separate files, one for each clustering technique, containing the number of clusters for each run of each protein target. These files can be used to plot the distributions for the number of clusters.

 

“Bars” folder: Contains 3 separate subfolders containing the information needed to plot the bar charts for the minimum, average, and standard deviation of lRMSDs to the native structure for the CASP targets. Each subfolder contains 10 files, one for each target. Each file contains 6 rows that provide the lRMSD value for original ensemble, reduced ensemble for hierarchical clustering, reduced ensemble for k-means clustering, reduced ensemble for GMM clustering, reduced ensemble for gmx-cluster clustering, and reduced ensemble for truncation, respectively.

Dataset Files

LOGIN TO ACCESS DATASET FILES
Open Access dataset files are accessible to all logged in  users. Don't have a login?  Create a free IEEE account.  IEEE Membership is not required.