AI制药 - RCSB PDB 数据集的多维度分析与整理 (1)

文章列表

欢迎关注我的CSDN：https://spike.blog.csdn.net/
本文地址：https://blog.csdn.net/caroline_wendy/article/details/130089781

整体：
AI制药 - RCSB PDB 数据集的多维度分析与整理 (1)

RCSB PDB 数据集是一个收集了蛋白质的三维结构信息的数据库，是世界蛋白质数据库（wwPDB）的成员之一，也是生物学和医学领域第一个开放访问的数字数据资源库。RCSB PDB 数据集不仅提供了来自蛋白质数据银行（PDB）档案的实验确定的3D结构，还提供了来自 AlphaFold DB 和 ModelArchive 的计算结构模型（CSM）。用户可以利用 RCSB PDB 数据集提供的各种工具和资源，根据序列、结构和功能的注释进行简单和高级搜索，可视化、下载和分析这些分子，并且在外部注释的背景下，探索生物学的结构视角。

关于 Complex 和 Multimer 的差别：

组合在一起形成功能性基团的复合物，通常会叫Complex，比如Antibody-antigen Complex 或 Ligand-receptor Complex；Multimer通常指堆在一起的非单体情况，不一定有真正的结合或可以发挥功能，只是结构上在一起，比如aggregation发生时，通常会提到Monomer/Multimer。总体，Multimer在生物学上用的不多。以上我的理解和习惯，也可能不同的文章中会有人混用，尤其是非母语文章，也不能算是错误。

RCSB：Research Collaboratory for Structural Bioinformatics，即结构生物信息学的研究合作实验室。

官网：https://www.pdbus.org/
目前，已经有 202,467 (2023.3.21) 个PDB结构。
Vision：To expand the frontiers of fundamental biology, biomedicine, energy sciences, and biotechnology through open and sustainable access to the 3D structure, function, and evolution of biological macromolecules contained in the Protein Data Bank (PDB) archive.
愿景：扩展基础生物学、生物医学、能源科学和生物技术的前沿方向，通过开放的和可持续的访问 PDB 档案中所包括的生物大分子的 3D 结构、功能和进化。

AI制药 - RCSB PDB 数据集的多维度分析与整理 (1)

1. RCSB PDB

PDB全量数据：最后更新日期 2022.4.13

全量链数：593491，约59万
全量PDB数：183653个PDB结构，实际包括182703个，略有差异，相差950个，约18万

标签如下：

id：行号，从0开始
pdb：PDB编号，例如 3eo1
resolution：分辨率，例如 3.1
release_date：发布日期，例如 2008-12-02
seq：序列，例如 ETVLTQSPGT…
len：序列长度，例如 215
chain_type：链的类型，例如 k是kappa型轻链，l是lamda
bcr_or_tcr：BCR或TCR或none

第1个示例，PDB 3eo1，包括0-11，一共12行，即12个链。具体数据如下：

bid,pdb,chain,resolution,release_date,seq,len,chain_type,bcr_or_tcr
0,3eo1,A,3.1,2008-12-02,ETVLTQSPGTLSLSPGERATLSCRASQSLGSSYLAWYQQKPGQAPRLLIYGASSRAPGIPDRFSGSGSGTDFTLTISRLEPEDFAVYYCQQYADSPITFGQGTRLEIKRTVAAPSVFIFPPSDEQLKSGTASVVCLLNNFYPREAKVQWKVDNALQSGNSQESVTEQDSKDSTYSLSSTLTLSKADYEKHKVYACEVTHQGLSSPVTKSFNRGEC,215,k,BCR
1,3eo1,B,3.1,2008-12-02,QVQLVQSGAEVKKPGSSVKVSCKASGYTFSSNVISWVRQAPGQGLEWMGGVIPIVDIANYAQRFKGRVTITADESTSTTYMELSSLRSEDTAVYYCASTLGLVLDAMDYWGQGTLVTVSSASTKGPSVFPLAPCSESTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTKTYTCNVDHKPSNTKVDKRVES,216,h,BCR
2,3eo1,C,3.1,2008-12-02,ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHSTVLGLYNTLNPEASASPCCVPQDLEPLTILYYVGRTPKVEQLSNMVVKSCKCS,112,protein,none
3,3eo1,D,3.1,2008-12-02,ETVLTQSPGTLSLSPGERATLSCRASQSLGSSYLAWYQQKPGQAPRLLIYGASSRAPGIPDRFSGSGSGTDFTLTISRLEPEDFAVYYCQQYADSPITFGQGTRLEIKRTVAAPSVFIFPPSDEQLKSGTASVVCLLNNFYPREAKVQWKVDNALQSGNSQESVTEQDSKDSTYSLSSTLTLSKADYEKHKVYACEVTHQGLSSPVTKSFNRGEC,215,k,BCR
4,3eo1,E,3.1,2008-12-02,QVQLVQSGAEVKKPGSSVKVSCKASGYTFSSNVISWVRQAPGQGLEWMGGVIPIVDIANYAQRFKGRVTITADESTSTTYMELSSLRSEDTAVYYCASTLGLVLDAMDYWGQGTLVTVSSASTKGPSVFPLAPCSESTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTKTYTCNVDHKPSNTKVDKRVES,216,h,BCR
5,3eo1,F,3.1,2008-12-02,ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHSTVLGLYNTLNPEASASPCCVPQDLEPLTILYYVGRTPKVEQLSNMVVKSCKCS,112,protein,none
6,3eo1,G,3.1,2008-12-02,ETVLTQSPGTLSLSPGERATLSCRASQSLGSSYLAWYQQKPGQAPRLLIYGASSRAPGIPDRFSGSGSGTDFTLTISRLEPEDFAVYYCQQYADSPITFGQGTRLEIKRTVAAPSVFIFPPSDEQLKSGTASVVCLLNNFYPREAKVQWKVDNALQSGNSQESVTEQDSKDSTYSLSSTLTLSKADYEKHKVYACEVTHQGLSSPVTKSFNRGEC,215,k,BCR
7,3eo1,H,3.1,2008-12-02,QVQLVQSGAEVKKPGSSVKVSCKASGYTFSSNVISWVRQAPGQGLEWMGGVIPIVDIANYAQRFKGRVTITADESTSTTYMELSSLRSEDTAVYYCASTLGLVLDAMDYWGQGTLVTVSSASTKGPSVFPLAPCSESTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTKTYTCNVDHKPSNTKVDKRVES,216,h,BCR
8,3eo1,I,3.1,2008-12-02,ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHSTVLGLYNTLNPEASASPCCVPQDLEPLTILYYVGRTPKVEQLSNMVVKSCKCS,112,protein,none
9,3eo1,J,3.1,2008-12-02,ETVLTQSPGTLSLSPGERATLSCRASQSLGSSYLAWYQQKPGQAPRLLIYGASSRAPGIPDRFSGSGSGTDFTLTISRLEPEDFAVYYCQQYADSPITFGQGTRLEIKRTVAAPSVFIFPPSDEQLKSGTASVVCLLNNFYPREAKVQWKVDNALQSGNSQESVTEQDSKDSTYSLSSTLTLSKADYEKHKVYACEVTHQGLSSPVTKSFNRGEC,215,k,BCR
10,3eo1,K,3.1,2008-12-02,QVQLVQSGAEVKKPGSSVKVSCKASGYTFSSNVISWVRQAPGQGLEWMGGVIPIVDIANYAQRFKGRVTITADESTSTTYMELSSLRSEDTAVYYCASTLGLVLDAMDYWGQGTLVTVSSASTKGPSVFPLAPCSESTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTKTYTCNVDHKPSNTKVDKRVES,216,h,BCR
11,3eo1,L,3.1,2008-12-02,ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHSTVLGLYNTLNPEASASPCCVPQDLEPLTILYYVGRTPKVEQLSNMVVKSCKCS,112,protein,none

PDB提取的FASTA文件与标签文件一致，例如3EO1 PDB：
AI制药 - RCSB PDB 数据集的多维度分析与整理 (1)

2. Resolution

Resolution Range: 0.48 ~ 70.0

Resolution (chain-level)：

bins: [15729,1119,150876,272060,111062,21324,4037,4357,3500,2636,6791], sum: 593491
Empty: 15729, 2.65%
High(0~3): 1119+150876+272060 = 424055, 71.45%
Middle(3~5): 111062+21324 = 132386, 22.31%
Low(>5): 4037+4357+3500+2636+6791 = 21321, 3.59%

AI制药 - RCSB PDB 数据集的多维度分析与整理 (1)
Resolution (PDB-level)：

bins: [12347,810,70325,79230,15929,2179,346,375,307,197,658], sum: 182703
Empty: 12347, 6.76%
High(0~3): 810+70325+79230 = 150365, 82.30%
Middle(3~5): 15929+2179 = 18108, 9.91%
Low(>5): 346+375+307+197+658 = 1883, 1.03%

AI制药 - RCSB PDB 数据集的多维度分析与整理 (1)

3. Seq. Len.

链长分布：0 ~ 4433。

异常数据，链长是0，包括38447个，数据如下：

id   pdb chain  resolution release_date  seq  len chain_type bcr_or_tcr
19  6dts     C         1.5   2018-09-19  NaN    0    protein       none
20  6dts     D         1.5   2018-09-19  NaN    0    protein       none
69  6v8x     M         3.0   2020-02-05  NaN    0    protein       none
70  6v8x     N         3.0   2020-02-05  NaN    0    protein       none
71  6v8x     O         3.0   2020-02-05  NaN    0    protein       none

数据分布：

标签 0: 81019, 100: 140791, 200: 132580, 300: 75717, 400: 42730, 500: 19507, 600: 8027, 700: 6335, 800: 2849, 900: 2017, 1000: 6730
len >= 20, 518302, 87.33%；len < 20, 75189, 12.66%
Short(20~100): 81019, 15.63%
Normal(100~300): 140791+132580 = 273371, 52.75%
Long(300~500): 75717+42730 = 118447, 22.85%
Very Long(>500): 19507+8027+6335+2849+2017+6730 = 45465, 8.77%

AI制药 - RCSB PDB 数据集的多维度分析与整理 (1)

蛋白质的链长大于20

蛋白质至少包含一个长多肽。短多肽，含有少于20-30个残基，很少被认为是蛋白质，通常被称为肽。

4. Antibody

chain_type: [‘k’ ‘h’ ‘protein’ ‘l’ ‘a’ ‘b’ ‘d’ ‘g’]

其中，k和l是轻链，protein是抗原或其他蛋白质
a\\b\\d\\g是TCR的链
a: 721, b: 824, d: 54, g: 28, h: 10762, k: 6831, l: 2143, protein: 572128
Percentage of DB: 21363/593491 = 3.60%
BCR (19736)；TCR (1627)

lypz 实例如下：

g：F和H是 T cell Receptor Gamma Chain，T细胞受体 $γ\\gamma$ 链
d：E和G是 T cell Receptor Delta，T细胞受体 $δ\\delta$ 链

标签：

22355       22355  1ypz     A         3.4   2005-04-12  GSHSLRYFYTAVSRPGLGEPWFIIVGYVDDMQVLRFSSKEETPRMA...  260    protein       none
22356       22356  1ypz     B         3.4   2005-04-12  ADPIQRTPKIQVYSRHPAENGKSNFLNCYVSGFHPSDIEVDLLKNG...  102    protein       none
22357       22357  1ypz     C         3.4   2005-04-12  GSHSLRYFYTAVSRPGLGEPWFIIVGYVDDMQVLRFSSKEETPRMA...  260    protein       none
22358       22358  1ypz     D         3.4   2005-04-12  ADPIQRTPKIQVYSRHPAENGKSNFLNCYVSGFHPSDIEVDLLKNG...  102    protein       none
22359       22359  1ypz     E         3.4   2005-04-12  GDQVEQSPSALSLHEGTDSALRCNFTTTMRSVQWFRQNSRGSLISL...  207          d        TCR
22360       22360  1ypz     F         3.4   2005-04-12  HGKLEQPEISISRPRDETAQISCKVFIESFRSVTIHWYRQKPNQGL...  230          g        TCR
22361       22361  1ypz     G         3.4   2005-04-12  GDQVEQSPSALSLHEGTDSALRCNFTTTMRSVQWFRQNSRGSLISL...  207          d        TCR
22362       22362  1ypz     H         3.4   2005-04-12  HGKLEQPEISISRPRDETAQISCKVFIESFRSVTIHWYRQKPNQGL...  230          g        TCR
22363       22363  1ypz     I         3.4   2005-04-12                                                NaN    0    protein       none
22364       22364  1ypz     J         3.4   2005-04-12                                                NaN    0    protein       none
22365       22365  1ypz     K         3.4   2005-04-12                                                NaN    0    protein       none
22366       22366  1ypz     L         3.4   2005-04-12                                                NaN    0    protein       none
22367       22367  1ypz     M         3.4   2005-04-12                                                NaN    0    protein       none

LYPZ PDB结构：
AI制药 - RCSB PDB 数据集的多维度分析与整理 (1)

Chain Type 数据分布：
AI制药 - RCSB PDB 数据集的多维度分析与整理 (1)
BCR or TCR：

bcr or tcr type: [‘none’ ‘BCR’ ‘TCR’]
BCR: 3308, TCR: 186, none: 179209
Percentage of DB: 3494/182703 = 1.91%

AI制药 - RCSB PDB 数据集的多维度分析与整理 (1)

5. Complex / Multimer

Chain 清洗前593491，清洗后357216；PDB 清洗前182703，清洗后140320。清洗方法：

df = df.loc[df['len'] >= 20]
df = df.loc[df['len'] <= 500]
df = df.loc[df["resolution"].fillna(-1).astype(int) > 0]
df = df.loc[df["resolution"] <= 3]

具体分析：

complex chain range: 1 ~ 55
clean pdb (357216)：20 <= seq len <=500；resolution <= 3
链长范围：1: 57033, 2: 46973, 3: 6594, 4: 17094, 5: 1141, 6: 4703, 7: 301, 8: 2801, 9: 224, 10: 3456, sum: 140320
Monomer: 57033, 40.64%
Multimer(2~4): 46973+6594+17094 = 70661, 50.35%
Multimer(>=5): 1141+4703+301+2801+224+3456 = 12626, 9.00%

AI制药 - RCSB PDB 数据集的多维度分析与整理 (1)

在全部的复合物 (83287) 中，包括同源多聚体和异源多聚体：

Homo Multimer: 21721, 26.08%
Hetero Multimer: 83287, 73.92%

AI制药 - RCSB PDB 数据集的多维度分析与整理 (1)

6. 参考

Stack Overfolw - Convert floats to ints in Pandas?
Stack Overfolw - histogram: setting y-axis label for pandas
Stack Overfolw - Matplotlib histogram with multiple legend entries
PDB - Resolution
Pandas: How to Combine Rows with Same Column Values
Stack Overflow - Selecting multiple columns in a Pandas dataframe
Stack Overflow - How to center labels in histogram plot
Control the color of barplots built with matplotlib
Display percentage above bar chart in Matplotlib
Stack Overflow - Get statistics for each group (such as count, mean, etc) using pandas GroupBy?
Stack Overflow - How to get unique values from multiple columns in a pandas groupby
Pandas Groupby – Count of rows in each group

7. 源码

#!/usr/bin/env python
# -- coding: utf-8 --
"""
Copyright (c) 2022. All rights reserved.
Created by C. L. Wang on 2023/4/10
"""import os
import sys
from time import timeimport numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from matplotlib.patches import Rectanglep = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if p not in sys.path:sys.path.append(p)from myutils.project_utils import traverse_dir_files, write_list_to_file, mkdir_if_not_exist
from root_dir import DATA_DIRclass RcsbProcessor(object):"""RCSB数据集分析"""def __init__(self):self.rcsb_dir = os.path.join(DATA_DIR, "rcsb")mkdir_if_not_exist(self.rcsb_dir)# 输入self.rcsb_full_dir = "[PDB文件夹]"self.profiling_protein_path = os.path.join(self.rcsb_dir, "profiling_protein_593491.csv")# 输出rcsb_full_prefix = "rcsb_pdb_all"self.rcsb_all_pdb_format = os.path.join(self.rcsb_dir, f"{rcsb_full_prefix}" + "_{}.txt")# 读取PDBpaths_list = traverse_dir_files(self.rcsb_dir)is_traverse = Falsefor path in paths_list:base_name = os.path.basename(path)if rcsb_full_prefix in base_name:is_traverse = Trueif not is_traverse:self.init_full_paths()  # 初始化全部路径else:print("[Info] 已经初始化完成PDB全部路径!")def init_full_paths(self):print(f"[Info] 初始化路径开始!")s_time = time()print(f"[Info] 数据集路径: {self.rcsb_full_dir}")paths_list = traverse_dir_files(self.rcsb_full_dir)rcsb_all_pdb_path = self.rcsb_all_pdb_format.format(len(paths_list))print(f"[Info] 输出路径: {self.rcsb_full_dir}")write_list_to_file(rcsb_all_pdb_path, paths_list)print(f"[Info] 写入完成! {rcsb_all_pdb_path}, 耗时: {time()-s_time}")@staticmethoddef draw_resolution(data_list, save_path):"""绘制分辨率，分辨率的范围是-1到10，划分11个bin其中，-1是empty、[1,2,3]是high、其余是low:param data_list:   数据列表:param save_path:   存储路径:return:  绘制图像"""labels, counts = np.unique(np.array(data_list), return_counts=True)labels_str = []for vl in labels:if vl == -1:label = "empty"else:label = f"{vl} ~ {vl+1}"labels_str.append(label)labels_str.pop(-1)labels_str.append(f">{labels[-1]}")# 颜色设置cmap = plt.get_cmap('jet')empty, high, middle, low = cmap(0.2), cmap(0.4), cmap(0.6), cmap(0.8)color = [empty, high, high, high, middle, middle, low, low, low, low, low, low]graph = plt.bar(labels_str, counts, align='center', color=color, edgecolor='black')plt.gca().set_xticks(labels_str)handles = [Rectangle((0, 0), 1, 1, color=c, ec="k") for c in [empty, high, middle, low]]color_labels = ["empty", "high", "middle", "low"]plt.legend(handles, color_labels)# 绘制百分比count_sum = sum(counts)percentage_list = []for count in counts:pct = (count / count_sum) * 100percentage_list.append(round(pct, 2))i = 0max_height = max([p.get_height() for p in graph])for p in graph:width = p.get_width()height = p.get_height()x, y = p.get_xy()plt.text(x + width / 2,y + height + max_height*0.01,str(percentage_list[i]) + '%',size=8,ha='center',weight='bold')i += 1# label设置plt.xlabel("Resolution")plt.ylabel("Frequency")# 尺寸以及存储fig = plt.gcf()fig.set_size_inches(10, 6)if save_path:plt.savefig(save_path, bbox_inches='tight', pad_inches=0.1)else:plt.show()plt.close()@staticmethoddef draw_seq_len(data_list, save_path=None):"""绘制序列长度的分布:param data_list: 序列数据集:param save_path: 图像存储:return: None"""labels, counts = np.unique(np.array(data_list), return_counts=True)labels_str = []for vl in labels:if vl == -1:label = "empty"else:label = f"{vl}~{vl+100}"labels_str.append(label)labels_str[-1] = f">{labels[-1]}"labels_str[0] = f"20~100"counts = list(counts)graph = plt.bar(labels_str, counts, align='center', edgecolor='black')plt.gca().set_xticks(labels_str)# label设置plt.xlabel("Seq. Len.")plt.ylabel("Frequency")# 颜色设置cmap = plt.get_cmap('jet')short, normal, long, v_long = cmap(0.2), cmap(0.4), cmap(0.6), cmap(0.8)color = [short, normal, normal, long, long, v_long, v_long, v_long, v_long, v_long, v_long]graph = plt.bar(labels_str, counts, align='center', color=color, edgecolor='black')plt.gca().set_xticks(labels_str)handles = [Rectangle((0, 0), 1, 1, color=c, ec="k") for c in [short, normal, long, v_long]]color_labels = ["short", "normal", "long", "very long"]plt.legend(handles, color_labels)# 绘制百分比count_sum = sum(counts)percentage_list = []for count in counts:pct = (count / count_sum) * 100percentage_list.append(round(pct, 2))i = 0max_height = max([p.get_height() for p in graph])for p in graph:width = p.get_width()height = p.get_height()x, y = p.get_xy()plt.text(x + width / 2,y + height + max_height*0.01,str(percentage_list[i]) + '%',size=8,ha='center',weight='bold')i += 1# 尺寸以及存储fig = plt.gcf()fig.set_size_inches(12, 6)if save_path:plt.savefig(save_path, bbox_inches='tight', pad_inches=0.1)else:plt.show()plt.close()@staticmethoddef draw_chain_type(data_list, save_path=None):labels, counts = np.unique(np.array(data_list), return_counts=True)graph = plt.bar(labels, counts, align='center', edgecolor='black')# label设置plt.xlabel("Chain Type")plt.ylabel("Frequency")plt.gca().set_xticks(labels)# 绘制百分比count_sum = sum(counts)percentage_list = []for count in counts:pct = (count / count_sum) * 100percentage_list.append(round(pct, 2))i = 0max_height = max([p.get_height() for p in graph])for p in graph:width = p.get_width()height = p.get_height()x, y = p.get_xy()plt.text(x + width / 2,y + height + max_height*0.01,str(percentage_list[i]) + '%',size=8,ha='center',weight='bold')i += 1# 尺寸以及存储fig = plt.gcf()fig.set_size_inches(12, 6)if save_path:plt.savefig(save_path, bbox_inches='tight', pad_inches=0.1)else:plt.show()plt.close()plt.show()@staticmethoddef draw_bcr_or_tcr_type(data_list, save_path=None):labels, counts = np.unique(np.array(data_list), return_counts=True)graph = plt.bar(labels, counts, align='center', edgecolor='black')# label设置plt.xlabel("BCR TCR")plt.ylabel("Frequency")plt.gca().set_xticks(labels)# 绘制百分比count_sum = sum(counts)percentage_list = []for count in counts:pct = (count / count_sum) * 100percentage_list.append(round(pct, 2))i = 0max_height = max([p.get_height() for p in graph])for p in graph:width = p.get_width()height = p.get_height()x, y = p.get_xy()plt.text(x + width / 2,y + height + max_height*0.01,str(percentage_list[i]) + '%',size=8,ha='center',weight='bold')i += 1# 尺寸以及存储fig = plt.gcf()fig.set_size_inches(6, 6)if save_path:plt.savefig(save_path, bbox_inches='tight', pad_inches=0.1)else:plt.show()plt.close()plt.show()@staticmethoddef draw_complex_counts(data_list, x_label, save_path=None):"""绘制复合物的链数"""labels, counts = np.unique(np.array(data_list), return_counts=True)labels_str = [str(l) for l in labels]labels_str[-1] = f">={labels_str[-1]}"counts = list(counts)graph = plt.bar(labels_str, counts, align='center', edgecolor='black')plt.gca().set_xticks(labels_str)# label设置plt.xlabel(x_label)plt.ylabel("Frequency")# 绘制百分比count_sum = sum(counts)percentage_list = []for count in counts:pct = (count / count_sum) * 100percentage_list.append(round(pct, 2))i = 0max_height = max([p.get_height() for p in graph])for p in graph:width = p.get_width()height = p.get_height()x, y = p.get_xy()plt.text(x + width / 2,y + height + max_height*0.01,str(percentage_list[i]) + '%',size=8,ha='center',weight='bold')i += 1# 尺寸以及存储fig = plt.gcf()if len(labels_str) > 2:fig.set_size_inches(12, 6)else:fig.set_size_inches(6, 6)if save_path:plt.savefig(save_path, bbox_inches='tight', pad_inches=0.1)else:plt.show()plt.close()def process_resolution(self, df):"""处理分辨率"""out_dir = os.path.join(self.rcsb_dir, "charts")mkdir_if_not_exist(out_dir)df_resolution_unique = df["resolution"].unique()df_resolution_unique = sorted(df_resolution_unique)print(f"[Info] resolution range: {df_resolution_unique[0]} ~ {df_resolution_unique[-1]}")df_resolution = df["resolution"].fillna(-1).astype(int)df_resolution[df_resolution >= 10] = 10self.draw_resolution(df_resolution, os.path.join(out_dir, "resolution_chain.png"))agg_functions = {'pdb': 'first', 'resolution': 'mean'}df_resolution_pdb = df.groupby(df['pdb']).aggregate(agg_functions)df_resolution_pdb = df_resolution_pdb["resolution"].fillna(-1).astype(int)df_resolution_pdb[df_resolution_pdb >= 10] = 10self.draw_resolution(df_resolution_pdb, os.path.join(out_dir, "resolution_pdb.png"))@staticmethoddef show_value_counts(data_list):labels, counts = np.unique(np.array(data_list), return_counts=True)label_res_str = ""for label, count in zip(labels, counts):label_res_str += f"{label}: {count}, "label_res_str = label_res_str[:-2]print(f"[Info] value_counts: {label_res_str}, sum: {sum(counts)}")def process_seq_len(self, df):"""处理序列长度"""df_len_unique = df["len"].unique()df_len_unique = sorted(df_len_unique)print(f"[Info] seq len range: {df_len_unique[0]} ~ {df_len_unique[-1]}")df_len_all = df.loc[df['len'] >= 20]print(f"[Info] len > 20: {len(df_len_all)}, len < 20: {len(df.loc[df['len'] < 20])}")df_len = df_len_all["len"].astype(int)df_len[df_len >= 1000] = 1000df_len = (df_len / 100).astype(int)df_len = (df_len * 100).astype(int)self.show_value_counts(df_len)out_dir = os.path.join(self.rcsb_dir, "charts")mkdir_if_not_exist(out_dir)self.draw_seq_len(df_len, os.path.join(out_dir, "seq_len.png"))def process_chain_type(self, df):df_chain_type = df["chain_type"]print(f"[Info] chain_type: {df_chain_type.unique()}")self.show_value_counts(df_chain_type)df_chain_type = df.loc[df["chain_type"] != "protein"]["chain_type"]out_dir = os.path.join(self.rcsb_dir, "charts")mkdir_if_not_exist(out_dir)self.draw_chain_type(df_chain_type, os.path.join(out_dir, "chain_type.png"))agg_functions = {'pdb': 'first', 'bcr_or_tcr': 'first'}df_btcr = df.groupby(df['pdb']).aggregate(agg_functions)df_btcr_type = df_btcr["bcr_or_tcr"]print(f"[Info] bcr or tcr type: {df_btcr_type.unique()}")self.show_value_counts(df_btcr_type)df_btcr_type = df_btcr.loc[df_btcr["bcr_or_tcr"] != "none"]["bcr_or_tcr"]self.draw_bcr_or_tcr_type(df_btcr_type, os.path.join(out_dir, "bcr_or_tcr.png"))def process_complex(self, df):df_pre_len = len(df)df = df.loc[df['len'] >= 20]df = df.loc[df['len'] <= 500]df = df.loc[df["resolution"].fillna(-1).astype(int) > 0]df = df.loc[df["resolution"] <= 3]df_post_len = len(df)print(f"[Info] df_pre_len: {df_pre_len}, df_post_len: {df_post_len}")df_pdb = df["pdb"].unique()print(f"[Info] Clean PDB样本总数: {len(df_pdb)}")df_multimer = df.groupby(['pdb']).size().reset_index(name='counts')df_multimer_unique = df_multimer['counts'].unique()print(f"[Info] multimer: {df_multimer_unique[0]} - {df_multimer_unique[-1]}")df_multimer_counts = df_multimer["counts"].astype(int)df_multimer_counts[df_multimer_counts >= 10] = 10self.show_value_counts(df_multimer_counts)out_dir = os.path.join(self.rcsb_dir, "charts")mkdir_if_not_exist(out_dir)save_path = os.path.join(out_dir, "complex_chain_num.png")self.draw_complex_counts(df_multimer_counts, x_label="Complex Chain Num", save_path=save_path)# 同源或异源df_multimer_1 = df.groupby(['pdb']).size().reset_index(name='counts')df_multimer_2 = df.groupby(['pdb'])['seq'].apply(lambda x: len(set(x))).reset_index(name='unique')# print(f"{len(df_multimer_1)}, {len(df_multimer_2)}")# df_multimer_3 = df_multimer_2.loc[df_multimer_1["counts"] > 1]df_multimer = pd.merge(df_multimer_1, df_multimer_2, on='pdb')  # 根据PDB合并# print(f"{len(df_multimer)}")# print(f"[Info] df_multimer: \\n{df_multimer[:5]}")df_multimer_unique = df_multimer.loc[df_multimer["counts"] > 1]["unique"]df_multimer_unique = df_multimer_unique.astype(int)df_multimer_unique[df_multimer_unique >= 2] = 2self.show_value_counts(df_multimer_unique)save_path = os.path.join(out_dir, "multimer_unique_num.png")self.draw_complex_counts(df_multimer_unique, x_label="Multimer Unique Num", save_path=save_path)def process_profiling(self, csv_path):print(f"[Info] csv文件: {csv_path}")df = pd.read_csv(csv_path)# print(df.info())df_pdb = df["pdb"].unique()print(f"[Info] PDB样本总数: {len(df_pdb)}")df_chain = df["chain"].unique()print(f"[Info] chain: {sorted(df_chain)}")df_release_date = df["release_date"].unique()df_release_date = sorted(df_release_date)print(f"[Info] release_date {df_release_date[0]} - {df_release_date[-1]}")self.process_resolution(df)self.process_seq_len(df)self.process_chain_type(df)self.process_complex(df)def process(self):self.process_profiling(self.profiling_protein_path)def main():rp = RcsbProcessor()rp.process()if __name__ == '__main__':main()

AI制药 - RCSB PDB 数据集的多维度分析与整理 (1)

1. RCSB PDB

2. Resolution

3. Seq. Len.

4. Antibody

5. Complex / Multimer

6. 参考

7. 源码

相关问题

公告

标签