Query
=====

Fetch PDB entries and validate peptide–protein complexes with constraint-driven filters, from Python or the command line.

Highlights: fetch PDB, validate & filter, batch / parallel, CLI friendly.
Quick overview
--------------

What
   Retrieve PDB/CIF files and screen peptide–protein complexes using length, canonical, HETATM and experimental constraints.

Why
   Screen before download to save bandwidth and focus on high-quality, ligand-free complexes for modelling.

How
   Use the Python API for pipelines, or the CLI for batch/streaming workflows (NDJSON → jq → xargs/parallel).
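
The NDJSON stage of that CLI workflow can also be handled in Python instead of ``jq``. The sketch below is only an illustration and does not describe pepkit's documented output format: it assumes one JSON object per line, and the field names ``pdb_id`` and ``valid`` (as well as the ``XXXX`` record) are hypothetical placeholders to adapt to the real records.

```python
import io
import json
from typing import Iterable, Iterator


def valid_ids(
    lines: Iterable[str],
    id_key: str = "pdb_id",   # hypothetical field name
    flag_key: str = "valid",  # hypothetical field name
) -> Iterator[str]:
    """Yield the ID of every NDJSON record whose flag is truthy."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if record.get(flag_key):
            yield record[id_key]


# Made-up records standing in for real CLI output.
sample = io.StringIO(
    '{"pdb_id": "6A90", "valid": false}\n'
    "\n"
    '{"pdb_id": "XXXX", "valid": true}\n'
)
selected = list(valid_ids(sample))
print(selected)
```

In a shell pipeline, ``sys.stdin`` can be passed as ``lines`` and each selected ID printed on its own line for ``xargs`` or ``parallel``.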

Fetch a single PDB
------------------

API:

.. code-block:: python

   from pepkit.query.request import retrieve_pdb

   # Writes ./7qwv.pdb or ./7qwv.cif, depending on the requested format
   retrieve_pdb(pdb_id="7qwv", outdir="./", format="pdb")

CLI:

.. code-block:: console

   python -m pepkit.query.request 7qwv --output ./ --format pdb

Filter PDB
----------

Online validation of PDB entries using simple constraints (run checks before downloading).

Single validation
^^^^^^^^^^^^^^^^^

Lightweight check for one PDB (no file written):

.. code-block:: python

   from pepkit.query.filter import validate_complex_pdb

   valid, valid_chains, result, valid_sequences = validate_complex_pdb(
       pdb_id="6A90",
       length_cutoff=50,
       canonical_check=False,
       hetatm_check=False,
   )

.. admonition:: Example output
   :class: note

   .. code-block:: text

      (False, None, None, {'B': 'SAKDGDVEGPAGCKKYDVECDSGECCQKQYLWYKWRPLDCRCLKSGFFSSKCVCRDV', 'A': 'MADNSPLIREERQRLFRPYTRAMLTAPSAQPAKENGKTEENKDNSRDKGRGANKD..'})

Batch validation
^^^^^^^^^^^^^^^^

Validate many PDBs in parallel; returns a mapping from pdb_id → result:

.. code-block:: python

   from pepkit.query.filter import validate_complex_pdbs

   pdb_ids = ["6A90", "8S6A", "9H3D", "8CBP"]
   results = validate_complex_pdbs(
       pdb_ids=pdb_ids,
       length_cutoff=50,
       canonical_check=True,
       hetatm_check=True,
       n_jobs=4,
   )

Command line interface
^^^^^^^^^^^^^^^^^^^^^^

Validate a set of PDBs (parallel, NDJSON-friendly):

.. code-block:: console

   python -m pepkit.query.filter --pdb_ids 6A90 8S6A 9H3D 8CBP \
       --length_cutoff 50 --canonical_check --n_jobs 4

Flag reference
~~~~~~~~~~~~~~

- ``--pdb_ids``: space-separated PDB IDs (required)
- ``--length_cutoff``: chain-length cutoff for peptide/protein chains (int)
- ``--canonical_check``: require canonical amino acids (flag)
- ``--hetatm_check``: reject structures with HETATM near the peptide (flag)
- ``--n_jobs``: number of parallel workers

Constraint-based PDB query
--------------------------

Run filtered harvests using quality, experimental-method, release-date, and sequence constraints. Results are written to CSV and FASTA; the call returns a DataFrame (or only writes files, depending on how it is called).

Python API example:

.. code-block:: python

   import pandas as pd

   from pepkit.query.query import query

   query(
       quality=3.0,
       exp_method="X-RAY DIFFRACTION",
       release_date={"from": "2018-01-01", "to": "2018-01-08"},
       length_cutoff=50,
       canonical_check=True,
       hetatm_check=True,
       csv_path="demo.csv",
       fasta_path="demo.fasta",
       receptor_only=True,
       n_jobs=4,
   )

   # Inspect the CSV written by the query
   df = pd.read_csv("demo.csv")
   print(df.head())

Example CSV
^^^^^^^^^^^

.. code-block:: csv

   pdb_id,peptide_chain,peptide_sequence,protein_chain,protein_sequences,peptide_length,protein_lengths
   5WHB,B,GPRRPRCPGDDASIEDLHEYWARLWNYLYRVA,"['A','J']","MTEYKLVVVGAVGVGKSALTIQLIQNHFVDEYDPTIEDSYR...",32,"{'A':166,'J':166}"
   5WHA,B,GPRRPRCPGDDASIEDLHEYWARLWNYLYAVA,"['A','D','G']","MTEYKLVVVGAVGVGKSALTIQLIQNHFVDEYDPTIEDSYR...",32,"{'A':166,'D':166,'G':166}"

Command line interface
^^^^^^^^^^^^^^^^^^^^^^

Module entry point:

.. code-block:: console

   python -m pepkit.query.query \
       --quality 3.0 \
       --exp_method "X-RAY DIFFRACTION" \
       --core_release_date 2018-01-01 2018-01-08 \
       --length_cutoff 50 \
       --canonical_check \
       --hetatm_check \
       --core_csv_path demo.csv \
       --core_fasta_path demo.fasta \
       --receptor_only \
       --n_jobs 4

Installed CLI:

.. code-block:: console

   pepkit query \
       --quality 3.0 \
       --exp_method "X-RAY DIFFRACTION" \
       --core_release_date 2018-01-01 2018-01-08 \
       --length_cutoff 50 \
       --canonical_check \
       --hetatm_check \
       --core_csv_path demo.csv \
       --core_fasta_path demo.fasta \
       --receptor_only \
       --n_jobs 4

Flag reference
^^^^^^^^^^^^^^

- ``quality``: numeric threshold for the chosen structure quality metric.
- ``exp_method``: match the PDB experimental method string (case-insensitive).
- ``release_date``: dict ``{"from":"YYYY-MM-DD","to":"YYYY-MM-DD"}`` (inclusive). The CLI accepts two dates after ``--core_release_date``.
- ``length_cutoff``: chain-length cutoff for peptides. The example above uses 50 and returns 32-residue peptides, so the value acts as an upper bound on peptide length there.
- ``canonical_check`` / ``hetatm_check``: booleans to ensure canonical residues and ligand-free peptides.
- ``receptor_only``: only output receptor chains paired to peptides (useful for receptor-centric pipelines).
- ``n_jobs``: parallel workers for network/validation tasks.

See also
--------

- :doc:`getting_started`: example pipeline (query → validate → standardize → featurize).
- :doc:`chem`: sequence ↔ SMILES conversion and chemical filtering.
- :doc:`graph`: representing validated chains as graphs (ITS/graph processing).
- :doc:`api`: full programmatic reference for the query module.
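
Returning to the example CSV from the constraint-based query: the ``protein_chain`` and ``protein_lengths`` columns hold Python-style literals as quoted strings, so a plain CSV reader loads them as text. The sketch below is one possible way to decode them, assuming every row follows the layout of the two example rows; it uses only the standard library and keeps the truncated protein sequences exactly as shown.

```python
import ast
import csv
import io

# Two rows copied from the example CSV (protein sequences truncated as shown there).
csv_text = '''pdb_id,peptide_chain,peptide_sequence,protein_chain,protein_sequences,peptide_length,protein_lengths
5WHB,B,GPRRPRCPGDDASIEDLHEYWARLWNYLYRVA,"['A','J']","MTEYKLVVVGAVGVGKSALTIQLIQNHFVDEYDPTIEDSYR...",32,"{'A':166,'J':166}"
5WHA,B,GPRRPRCPGDDASIEDLHEYWARLWNYLYAVA,"['A','D','G']","MTEYKLVVVGAVGVGKSALTIQLIQNHFVDEYDPTIEDSYR...",32,"{'A':166,'D':166,'G':166}"
'''

rows = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # literal_eval only evaluates literals, so it is safer than eval here.
    row["protein_chain"] = ast.literal_eval(row["protein_chain"])
    row["protein_lengths"] = ast.literal_eval(row["protein_lengths"])
    row["peptide_length"] = int(row["peptide_length"])
    rows.append(row)

for row in rows:
    # Sanity check: the stored length matches the peptide sequence.
    assert row["peptide_length"] == len(row["peptide_sequence"])
    print(row["pdb_id"], row["protein_chain"], row["protein_lengths"])
```

With pandas, the same conversion can be passed through the ``converters`` argument of ``pd.read_csv``.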