Query
Fetch PDB entries and validate peptide–protein complexes with constraint-based filters, from Python or the command line.
Quick overview
What
Retrieve PDB/CIF files and screen peptide–protein complexes using chain-length, canonical-residue, HETATM, and experimental-method constraints.
Why
Screen before download to save bandwidth and focus on high-quality, ligand-free complexes for modelling.
How
Use the Python API for pipelines, or the CLI for batch/streaming workflows (NDJSON → jq → xargs/parallel).
Fetch a single PDB
API:
from pepkit.query.request import retrieve_pdb
# Writes ./7qwv.pdb or ./7qwv.cif, depending on the requested format
retrieve_pdb(pdb_id="7qwv", outdir="./", format="pdb")
CLI:
python -m pepkit.query.request 7qwv --output ./ --format pdb
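For orientation, the sketch below shows how a fetch by ID can map onto a download URL and an output path. It assumes the public RCSB file server pattern https://files.rcsb.org/download/<ID>.<ext>. The helper names build_download_url and fetch_structure are hypothetical, and this is not necessarily how retrieve_pdb works internally.

```python
from pathlib import Path
from urllib.request import urlopen

# Assumption: the public RCSB download endpoint, not a pepkit setting.
RCSB_DOWNLOAD = "https://files.rcsb.org/download"


def build_download_url(pdb_id: str, fmt: str = "pdb") -> str:
    """Return the assumed RCSB download URL for a PDB ID and file format."""
    if fmt not in {"pdb", "cif"}:
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"{RCSB_DOWNLOAD}/{pdb_id.upper()}.{fmt}"


def fetch_structure(pdb_id: str, outdir: str = "./", fmt: str = "pdb") -> Path:
    """Download one entry to <outdir>/<pdb_id>.<fmt> (hypothetical helper)."""
    target = Path(outdir) / f"{pdb_id.lower()}.{fmt}"
    target.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(build_download_url(pdb_id, fmt), timeout=30) as response:
        target.write_bytes(response.read())
    return target


print(build_download_url("7qwv", "pdb"))
```

Keeping URL construction separate from the download makes the mapping easy to check without network access.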
Filter PDB
Online validation of PDB entries using simple constraints (run checks before downloading).
Single validation
Lightweight check for one PDB (no file written):
from pepkit.query.filter import validate_complex_pdb
valid, valid_chains, result, valid_sequences = validate_complex_pdb(
    pdb_id="6A90",
    length_cutoff=50,
    canonical_check=False,
    hetatm_check=False,
)
Example output
(False,
None,
None,
{'B': 'SAKDGDVEGPAGCKKYDVECDSGECCQKQYLWYKWRPLDCRCLKSGFFSSKCVCRDV',
'A': 'MADNSPLIREERQRLFRPYTRAMLTAPSAQPAKENGKTEENKDNSRDKGRGANKD..'})
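The last element of the returned tuple maps chain IDs to sequences. As a rough illustration of what a canonical-residue check over such a mapping can look like, the sketch below tests each sequence against the 20 standard one-letter amino acid codes. The function names are hypothetical and the logic is an assumption about what canonical_check means, not pepkit's implementation.

```python
from __future__ import annotations

# The 20 standard amino acids in one-letter code.
CANONICAL_AA = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_canonical(sequence: str) -> bool:
    """True if the sequence is non-empty and uses only standard residues."""
    return bool(sequence) and set(sequence.upper()) <= CANONICAL_AA


def non_canonical_chains(sequences: dict[str, str]) -> list[str]:
    """Return the sorted chain IDs whose sequence fails the check."""
    return sorted(cid for cid, seq in sequences.items() if not is_canonical(seq))


# Toy input: chain B contains 'X', which is not a standard residue code.
chains = {"A": "GPRRPRCPGDD", "B": "ACDXK"}
print(non_canonical_chains(chains))  # ['B']
```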
Batch validation
Validate many PDBs in parallel; returns a mapping from pdb_id → result:
from pepkit.query.filter import validate_complex_pdbs
pdb_ids = ["6A90", "8S6A", "9H3D", "8CBP"]
results = validate_complex_pdbs(
    pdb_ids=pdb_ids,
    length_cutoff=50,
    canonical_check=True,
    hetatm_check=True,
    n_jobs=4,
)
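To make the pdb_id → result mapping and the role of n_jobs concrete, here is a minimal sketch of a parallel fan-out built on the standard library. The names validate_many and placeholder_check are hypothetical. The check is an arbitrary stand-in, so its output says nothing about the real entries, and pepkit may parallelise differently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, TypeVar

R = TypeVar("R")


def validate_many(
    pdb_ids: Iterable[str],
    validate_one: Callable[[str], R],
    n_jobs: int = 4,
) -> Dict[str, R]:
    """Run validate_one on every ID with n_jobs threads; map ID -> result."""
    ids = list(pdb_ids)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # pool.map preserves input order, so zip pairs each ID with its result.
        results = list(pool.map(validate_one, ids))
    return dict(zip(ids, results))


def placeholder_check(pdb_id: str) -> bool:
    """Arbitrary stand-in rule for illustration; not a real validation."""
    return pdb_id.startswith("8")


results = validate_many(["6A90", "8S6A", "9H3D", "8CBP"], placeholder_check, n_jobs=4)
passed = [pdb_id for pdb_id, ok in results.items() if ok]
print(passed)
```

Threads suit this kind of workload because validation is dominated by network requests rather than CPU time.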
Command line interface
Validate a set of PDBs (parallel, NDJSON-friendly):
python -m pepkit.query.filter --pdb_ids 6A90 8S6A 9H3D 8CBP \
--length_cutoff 50 --canonical_check --n_jobs 4
Flag reference
--pdb_ids: space-separated PDB IDs (required)
--length_cutoff: minimal peptide/protein chain length (int)
--canonical_check: require canonical amino acids (flag)
--hetatm_check: reject structures with HETATM near the peptide (flag)
--n_jobs: number of parallel workers
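The flag types above translate directly into an argparse definition. The sketch below mirrors them to show how the list-valued, integer, and boolean flags parse. The default values and the program name are assumptions chosen for illustration, not pepkit's actual parser.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (defaults are assumed)."""
    parser = argparse.ArgumentParser(prog="filter-sketch")
    # nargs="+" collects one or more space-separated IDs into a list.
    parser.add_argument("--pdb_ids", nargs="+", required=True)
    parser.add_argument("--length_cutoff", type=int, default=50)
    # store_true flags are False unless the flag is present.
    parser.add_argument("--canonical_check", action="store_true")
    parser.add_argument("--hetatm_check", action="store_true")
    parser.add_argument("--n_jobs", type=int, default=1)
    return parser


args = build_parser().parse_args(
    ["--pdb_ids", "6A90", "8S6A", "--length_cutoff", "50",
     "--canonical_check", "--n_jobs", "4"]
)
print(args.pdb_ids, args.canonical_check, args.hetatm_check)
```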
Constraint-based PDB query
Run filtered harvests using quality, experimental-method, release-date, and sequence constraints. Depending on the call, results are returned as a DataFrame or written to CSV and FASTA files.
Python API example:
from pepkit.query.query import query
import pandas as pd
query(
    quality=3.0,
    exp_method="X-RAY DIFFRACTION",
    release_date={"from": "2018-01-01", "to": "2018-01-08"},
    length_cutoff=50,
    canonical_check=True,
    hetatm_check=True,
    csv_path="demo.csv",
    fasta_path="demo.fasta",
    receptor_only=True,
    n_jobs=4,
)
df = pd.read_csv("demo.csv")
print(df.head())
Example output (contents of demo.csv)
pdb_id,peptide_chain,peptide_sequence,protein_chain,protein_sequences,peptide_length,protein_lengths
5WHB,B,GPRRPRCPGDDASIEDLHEYWARLWNYLYRVA,"['A','J']","MTEYKLVVVGAVGVGKSALTIQLIQNHFVDEYDPTIEDSYR...",32,"{'A':166,'J':166}"
5WHA,B,GPRRPRCPGDDASIEDLHEYWARLWNYLYAVA,"['A','D','G']","MTEYKLVVVGAVGVGKSALTIQLIQNHFVDEYDPTIEDSYR...",32,"{'A':166,'D':166,'G':166}"
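In the rows above, protein_chain and protein_lengths are stored as Python-literal strings, so they come back from the CSV as plain text. A minimal sketch for turning them into real lists and dicts, using only the standard library and one of the sample rows (the load_rows helper is hypothetical):

```python
import ast
import csv
import io

# One sample row copied from the output above (sequence left truncated as shown).
SAMPLE = '''pdb_id,peptide_chain,peptide_sequence,protein_chain,protein_sequences,peptide_length,protein_lengths
5WHB,B,GPRRPRCPGDDASIEDLHEYWARLWNYLYRVA,"['A','J']","MTEYKLVVVGAVGVGKSALTIQLIQNHFVDEYDPTIEDSYR...",32,"{'A':166,'J':166}"
'''


def load_rows(handle):
    """Read the CSV and decode the literal-encoded columns."""
    rows = []
    for row in csv.DictReader(handle):
        # literal_eval safely parses list/dict literals without executing code.
        row["protein_chain"] = ast.literal_eval(row["protein_chain"])
        row["protein_lengths"] = ast.literal_eval(row["protein_lengths"])
        row["peptide_length"] = int(row["peptide_length"])
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE))
print(rows[0]["protein_chain"], rows[0]["protein_lengths"]["A"])
```

The same function works on a real file by passing open("demo.csv", newline="") instead of the in-memory sample.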
Module entry point:
python -m pepkit.query.query \
--quality 3.0 \
--exp_method "X-RAY DIFFRACTION" \
--core_release_date 2018-01-01 2018-01-08 \
--length_cutoff 50 \
--canonical_check \
--hetatm_check \
--core_csv_path demo.csv \
--core_fasta_path demo.fasta \
--receptor_only \
--n_jobs 4
Installed CLI:
pepkit query \
--quality 3.0 \
--exp_method "X-RAY DIFFRACTION" \
--core_release_date 2018-01-01 2018-01-08 \
--length_cutoff 50 \
--canonical_check \
--hetatm_check \
--core_csv_path demo.csv \
--core_fasta_path demo.fasta \
--receptor_only \
--n_jobs 4
quality: numeric threshold for the chosen structure-quality metric.
exp_method: match the PDB experimental method string (case-insensitive).
release_date: dict {"from": "YYYY-MM-DD", "to": "YYYY-MM-DD"} (inclusive). The CLI accepts two dates after --core_release_date.
length_cutoff: minimal peptide length.
canonical_check / hetatm_check: booleans to ensure canonical residues and ligand-free peptides.
receptor_only: only output receptor chains paired to peptides (useful for receptor-centric pipelines).
n_jobs: parallel workers for network/validation tasks.
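Because the release_date window is inclusive at both ends, entries released on the "from" or "to" date itself are kept. The sketch below spells out that comparison with ISO dates. in_release_window is a hypothetical helper for illustration, not part of pepkit.

```python
from datetime import date
from typing import Dict


def in_release_window(released: str, window: Dict[str, str]) -> bool:
    """True if an ISO date falls inside the inclusive from/to window."""
    start = date.fromisoformat(window["from"])
    end = date.fromisoformat(window["to"])
    return start <= date.fromisoformat(released) <= end


window = {"from": "2018-01-01", "to": "2018-01-08"}
print(in_release_window("2018-01-08", window))  # True: the end date is included
print(in_release_window("2018-01-09", window))  # False: one day past the window
```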
See also
Getting Started — example pipeline: query → validate → standardize → featurize.
Chemical Modeling — sequence ↔ SMILES conversion and chemical filtering.
Graph — representing validated chains as graphs (ITS/graph processing).
API Reference — full programmatic reference for the query module.