> 文章列表 > AI制药 - AlphaFold DB PDB 数据集的多维度分析与整理 (2)

AI制药 - AlphaFold DB PDB 数据集的多维度分析与整理 (2)

AI制药 - AlphaFold DB PDB 数据集的多维度分析与整理 (2)

欢迎关注我的CSDN:https://spike.blog.csdn.net/
本文地址:https://blog.csdn.net/caroline_wendy/article/details/130118339

AI制药 - AlphaFold DB PDB 数据集的多维度分析与整理 (2)
数据集:AlphaFold Protein Structure Database
下载地址:https://alphafold.ebi.ac.uk/download

AlphaFold DB 数据量,由最初214M,清洗高置信度为61M,再聚类为5M。

1. AlphaFold DB 官网

数据包含2个部分:

  1. Compressed prediction files
  2. UniProt

1.1 Compressed prediction files

The AlphaFold DB website currently provides bulk downloads for the 48 organisms listed below, as well as the majority of Swiss-Prot.

  • AlphaFold DB 网站目前提供下列 48 种生物以及大部分 Swiss-Prot 的批量下载。
  • Swiss Institute of Bioinformatics,瑞士生物信息学研究所,Prot -> Protein

UniProtKB/Swiss-Prot is the expertly curated component of UniProtKB (produced by the UniProt consortium). It contains hundreds of thousands of protein descriptions, including function, domain structure, subcellular location, post-translational modifications and functionally characterized variants.
UniProtKB/Swiss-Prot 是 UniProtKB(由 UniProt 联盟制作)的专业策划组件,包含数十万种蛋白质描述,包括功能、域结构、亚细胞定位、翻译后修饰和功能特征变异。

结构包括 PDB 和 mmCIF 两种类型,数据集包括:

  1. Compressed prediction files for model organism proteomes,模型生物蛋白质组 的压缩预测文件
  2. Compressed prediction files for global health proteomes,全球健康蛋白质组 的压缩预测文件
  3. Compressed prediction files for Swiss-Prot,Swiss-Prot 的压缩预测文件
  4. MANE Select dataset,MANE 选择数据集,MANE,Matched Annotation (匹配标注) from NCBI and EMBL-EBI

NCBI,The National Center for Biotechnology Information,国家生物技术信息中心
EMBL-EBI,European Molecular Biology Laboratory - European Bioinformatics Institute,欧洲分子生物学实验室 - 欧洲生物信息学研究所

目前版本最近是v4,下载地址:https://ftp.ebi.ac.uk/pub/databases/alphafold/

AI制药 - AlphaFold DB PDB 数据集的多维度分析与整理 (2)

1.2 UniProt

全量数据集:Full dataset download for AlphaFold Database - UniProt (214M),即2.14亿

AI制药 - AlphaFold DB PDB 数据集的多维度分析与整理 (2)

数据集说明:

  • 全量数据共214M个样本,保存为 1015808 tar 文件,共23TB,解压后得到3*214M个gz文件,再次解压后得到214M个cif和418M个json
  • 每个样本共包含三个文件(1个cif,两个json)
    • model_v3.cif – contains the atomic coordinates for the predicted protein structure, along with some metadata. Useful references for this file format are the ModelCIF and PDBx/mmCIF project sites.
    • confidence_v3.json – contains a confidence metric output by AlphaFold called pLDDT. This provides a number for each residue, indicating how confident AlphaFold is in the local surrounding structure. pLDDT ranges from 0 to 100, where 100 is most confident. This is also contained in the CIF file.
    • predicted_aligned_error_v3.json – contains a confidence metric output by AlphaFold called PAE. This provides a number for every pair of residues, which is lower when AlphaFold is more confident in the relative position of the two residues. PAE is more suitable than pLDDT for judging confidence in relative domain placements.

1.3 单个结构

You can download a prediction for an individual UniProt accession by visiting the corresponding structure page.

  • 您可以通过访问相应的结构页面,下载单个 UniProt 新增的预测。

预测的单体PDB结构,例如AF-F4HVG8-F1-model_v4.pdb

  • Chloroplast sensor kinase, chloroplastic:叶绿体传感器激酶, 叶绿体

AI制药 - AlphaFold DB PDB 数据集的多维度分析与整理 (2)
在AF2 PDB中,Temperature factor (B factor) 温度因子的位置,就是pLDDT的预测值,即

MODEL        1                                                                  
ATOM      1  N   MET A   1     -15.359  18.253 -11.695  1.00 27.33           N  
ATOM      2  CA  MET A   1     -15.812  17.432 -12.846  1.00 27.33           C  
ATOM      3  C   MET A   1     -15.487  15.976 -12.539  1.00 27.33           C  
ATOM      4  CB  MET A   1     -15.064  17.802 -14.142  1.00 27.33           C  
ATOM      5  O   MET A   1     -14.426  15.760 -11.977  1.00 27.33           O  
ATOM      6  CG  MET A   1     -15.223  19.246 -14.625  1.00 27.33           C  
ATOM      7  SD  MET A   1     -14.329  19.504 -16.180  1.00 27.33           S  
ATOM      8  CE  MET A   1     -14.334  21.313 -16.299  1.00 27.33           C  
ATOM      9  N   LEU A   2     -16.290  14.967 -12.875  1.00 26.17           N  
ATOM     10  CA  LEU A   2     -17.714  14.928 -13.250  1.00 26.17           C  
ATOM     11  C   LEU A   2     -18.221  13.489 -12.989  1.00 26.17           C  
ATOM     12  CB  LEU A   2     -17.913  15.315 -14.736  1.00 26.17           C  
ATOM     13  O   LEU A   2     -17.420  12.559 -12.940  1.00 26.17           O  
ATOM     14  CG  LEU A   2     -18.870  16.504 -14.945  1.00 26.17           C  
ATOM     15  CD1 LEU A   2     -18.802  16.990 -16.390  1.00 26.17           C  
ATOM     16  CD2 LEU A   2     -20.319  16.141 -14.614  1.00 26.17           C  

相对于:

AI制药 - AlphaFold DB PDB 数据集的多维度分析与整理 (2)

2. AlphaFold DB 清洗

清洗规则:

  1. 序列长度在 [100, 1000]
  2. 全局pLDDT在 [90, 100],即 Model Confidence Very High,来自于官网。
  3. 与 RCSB数据集 去重

遍历一次需要:7775s = 2.15h,样本数量61775031个,7.6G的路径文件,以前2个字母,建立子文件夹。

即数据量由 214M 降至 61M,即由2.14亿降低为6177万

计算PDB的序列长度和pLDDT的值,例如ff文件夹的fffffffb38338e27427f7fef20b3c53f_A.pdb

获取每个原子的pLDDT,参考:BioPython - Bio.PDB.PDBParser:

def get_plddt_from_pdb(self, pdb_path):p = Bio.PDB.PDBParser()structure = p.get_structure('', pdb_path)for a in structure.get_atoms():print(f"[Info] plddt: {a.get_bfactor()}")break

获取每个残基的pLDDT,进而计算PDB中残基pLDDT的均值,作为PDB的pLDDT:

def get_plddt_from_pdb(self, pdb_path):p = Bio.PDB.PDBParser()structure = p.get_structure('input', pdb_path)plddt_list = []for a in structure.get_residues():b = a.get_unpacked_list()if len(b) > 0:plddt_list.append(b[0].get_bfactor())plddt = np.average(np.array(plddt_list))plddt = round(plddt, 4)return plddt

3. AlphaFold DB 聚类

清洗规则:

  1. 使用mmseq2对6000万进行聚类
  2. 与v3数据集(plddt>=90的部分)合并,参考 https://ftp.ebi.ac.uk/pub/databases/alphafold/v3/

即由6177万,降低为5040445,即504万,再与v3数据集合并去重,增加至5334594,即533万。

目前版本最近是v4。

4. 参考

  • StackOverflow - Iterate over a very large number of files in a folder

英语口语