API Reference
This page contains the full API reference generated from docstrings. If you are new to PepKit, start with Getting Started and the module guides.
Module map
Chem Module: parsing/standardization, conversion, properties, descriptors
Query Module: fetch, constraint-based filtering
Chem Module
Conversion (pepkit.chem.conversion.conversion)
Tools for parsing peptide representations (FASTA/SMILES), standardizing sequences, and filtering non-canonical FASTA records.
Convenience conversion functions exported by pepkit.chem.
- pepkit.chem.conversion.conversion.smiles_to_fasta(smiles, header=None, split=False)[source]
Convert peptide SMILES to FASTA or raw sequence.
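By default this returns a FASTA-formatted string:
>[header] SEQUENCE
If split=True, the function returns the raw one-letter sequence (e.g. “GPG”) without any FASTA header.
- Parameters:
smiles – Peptide SMILES string to convert.
header – Optional header for the FASTA output; defaults to None.
split – If True, return the raw one-letter sequence instead of a FASTA-formatted string.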
- Returns:
FASTA-formatted string (default) or raw sequence (if split=True).
- Raises:
ValueError – On parse/decoding failure.
- Return type:
str
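The two output modes described above can be sketched as follows. This is a minimal illustration of the formatting step only, not the library's implementation: format_sequence_output is a hypothetical helper, and placing the header and the sequence on separate lines (as in standard FASTA) is an assumption.

```python
from typing import Optional


def format_sequence_output(
    sequence: str, header: Optional[str] = None, split: bool = False
) -> str:
    """Hypothetical helper: format an already-decoded one-letter sequence."""
    if split:
        # Raw one-letter sequence, without any FASTA header.
        return sequence
    # Assumption: header line and sequence line are separated by a newline.
    return f">{header or ''}\n{sequence}"


print(format_sequence_output("GPG", header="pep1"))
print(format_sequence_output("GPG", split=True))
```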
- pepkit.chem.conversion.conversion.fasta_to_smiles(fasta)[source]
Convert one-letter FASTA (no header) to canonical SMILES using RDKit.
Rejects non-canonical sequences containing the placeholder ‘X’.
- Parameters:
fasta (str) – Amino-acid sequence in one-letter code.
- Returns:
Canonical SMILES string.
- Raises:
ValueError – If the sequence contains ‘X’ or RDKit cannot parse it.
- Return type:
str
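A rough sketch of the behaviour documented above, written directly against RDKit. It assumes RDKit is installed and uses RDKit's Chem.MolFromFASTA and Chem.MolToSmiles; the function name fasta_to_smiles_sketch is hypothetical, and PepKit's actual implementation may differ.

```python
def fasta_to_smiles_sketch(sequence: str) -> str:
    """Convert a one-letter sequence (no header) to canonical SMILES."""
    # Reject the non-canonical placeholder before involving RDKit.
    if "X" in sequence:
        raise ValueError("sequence contains the placeholder 'X'")

    # Imported lazily so the check above works even without RDKit installed.
    from rdkit import Chem

    mol = Chem.MolFromFASTA(sequence)
    if mol is None:
        # RDKit signals a parse failure by returning None.
        raise ValueError(f"RDKit could not parse sequence: {sequence!r}")
    return Chem.MolToSmiles(mol)
```

Calling fasta_to_smiles_sketch("GPG") should return a canonical SMILES string when RDKit is available, while any sequence containing 'X' raises ValueError.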
Descriptor (pepkit.chem.desc.descriptor)
Calculation of molecular descriptors and physicochemical properties.
- class pepkit.chem.desc.descriptor.Descriptor(engine='peptides', fasta_key='peptide_sequence', id_key='id', smiles_key='smiles')[source]
Bases: object
Compute molecular or peptide descriptors for a collection of records.
This class provides descriptor calculation for peptides or small molecules, supporting two engines:
‘peptides’: Uses the peptides Python package for peptide descriptors.
‘rdkit’: Uses RDKit for general molecular descriptors from SMILES.
- Parameters:
engine (str) – Descriptor engine to use (‘peptides’ for peptide descriptors, ‘rdkit’ for molecular descriptors).
fasta_key (str) – Key for the peptide sequence in input records or DataFrame.
id_key (str) – Key for unique record identifiers in input.
smiles_key (str) – Key for SMILES string in input records (used only by ‘rdkit’).
Example
>>> descriptor = Descriptor(engine='peptides')
>>> records = [{'id': 1, 'peptide_sequence': 'ACDE'}]
>>> df_out = descriptor.calculate(records, n_jobs=2)
>>> descriptor = Descriptor(engine='rdkit')
>>> records = [{'id': 1, 'smiles': 'CC(=O)O'}]
>>> df_out = descriptor.calculate(records, n_jobs=4)
- SUPPORTED_ENGINES = {'peptides', 'rdkit'}
- calculate(data, n_jobs=1, verbose=0)[source]
Compute descriptors in parallel for all records in data.
The output type matches the input type: if you provide a DataFrame, you get a DataFrame; if you provide a list of dicts, you get a list.
- Parameters:
data (Union[pd.DataFrame, List[Dict[str, Any]]]) – Input records, with fields for sequence/SMILES and ID.
n_jobs (int) – Number of parallel jobs (joblib; -1 uses all available cores).
verbose (int) – Verbosity for joblib parallel execution.
- Returns:
Descriptor results, in the same format as the input.
- Return type:
Union[pd.DataFrame, List[Dict[str, Any]]]
- Raises:
TypeError – If input is not a DataFrame or list of dicts.
KeyError – If required keys are missing in input records.
ValueError – If SMILES cannot be parsed by RDKit.
Example
>>> descriptor = Descriptor(engine='peptides')
>>> df = pd.DataFrame([{'id': 1, 'peptide_sequence': 'ACDE'}])
>>> result = descriptor.calculate(df, n_jobs=1)
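The per-record, parallel pattern that calculate describes can be sketched with the standard library alone. Everything here is illustrative: toy_descriptors and calculate_sketch are hypothetical names, the two "descriptors" are placeholders rather than anything computed by the peptides package or RDKit, a thread pool stands in for joblib, and only the list-of-dicts input branch is covered.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List


def toy_descriptors(record: Dict[str, Any], fasta_key: str, id_key: str) -> Dict[str, Any]:
    """Hypothetical stand-in for a real descriptor engine."""
    seq = record[fasta_key]  # raises KeyError if the key is missing
    return {
        id_key: record[id_key],
        "length": len(seq),
        "n_acidic": sum(seq.count(aa) for aa in "DE"),
    }


def calculate_sketch(
    data: List[Dict[str, Any]],
    fasta_key: str = "peptide_sequence",
    id_key: str = "id",
    n_jobs: int = 1,
) -> List[Dict[str, Any]]:
    """Compute toy descriptors for each record, preserving input order."""
    if not isinstance(data, list) or not all(isinstance(r, dict) for r in data):
        raise TypeError("expected a list of dicts")
    # Threads are used here only as a stand-in for joblib workers.
    with ThreadPoolExecutor(max_workers=max(1, n_jobs)) as pool:
        return list(pool.map(lambda r: toy_descriptors(r, fasta_key, id_key), data))


rows = calculate_sketch([{"id": 1, "peptide_sequence": "ACDE"}], n_jobs=2)
print(rows)
```

A list goes in and a list comes out, which mirrors the "output type matches the input type" rule for the list-of-dicts case.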
Standardize (pepkit.chem.standardize)
Utilities for standardizing peptide sequences and molecular representations.
- class pepkit.chem.standardize.Standardizer(remove_non_canonical=False, charge_by_pH=False, pH=7.4, logger=None)[source]
Bases: object
Utility for processing peptide/protein sequences:
- Validate canonical sequences
- Convert FASTA to SMILES
- Add pH-dependent charges
- Batch and dict/DataFrame-based processing
- Parameters:
remove_non_canonical (bool) – If True, filter out non-canonical sequences
charge_by_pH (bool) – If True, adjust SMILES charges at given pH
pH (float) – pH value for charge adjustment
logger (logging.Logger) – Logger instance for status messages
- static is_canonical_sequence(sequence)[source]
Check if a sequence contains only canonical amino acids.
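A canonical-sequence check of this kind can be sketched as a set comparison against the 20 standard one-letter codes. The name is_canonical_sketch is hypothetical, and two behaviours are assumptions rather than documented facts: the check is case-sensitive (upper-case only), and an empty string is treated as non-canonical.

```python
# The 20 standard amino acids in one-letter code.
CANONICAL_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_canonical_sketch(sequence: str) -> bool:
    """Return True if every residue is one of the 20 standard amino acids."""
    # Assumption: empty input is rejected and lower-case letters do not count.
    return bool(sequence) and set(sequence) <= CANONICAL_AMINO_ACIDS


print(is_canonical_sketch("ACDE"))
print(is_canonical_sketch("AXDE"))
```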
- static add_charge_by_pH(smi, pH=7.4)[source]
Adjust the protonation state of a SMILES string for a given pH.
- static process_fasta(fasta, remove_non_canonical=False, charge_by_pH=False, pH=7.4)[source]
Convert a FASTA sequence to a SMILES string, with optional filtering and charging.
- Parameters:
fasta (str) – FASTA sequence to convert
remove_non_canonical (bool) – If True, filter out sequences with non-canonical residues
charge_by_pH (bool) – If True, adjust SMILES at specified pH
pH (float) – pH for protonation adjustment
- Returns:
Generated SMILES or None if filtered out
- Return type:
Optional[str]
- static dict_process(data, fasta_key, remove_non_canonical=False, charge_by_pH=False, pH=7.4)[source]
Process a list of dictionaries, converting FASTA sequences to SMILES.
- Parameters:
data (List[Dict[str, Any]]) – List of records (dicts) containing FASTA sequences
fasta_key (str) – Key in each dict for the FASTA sequence
remove_non_canonical (bool) – Remove non-canonical sequences if True
charge_by_pH (bool) – Adjust SMILES at specified pH if True
pH (float) – pH value for charge adjustment
- Returns:
New list of dicts with ‘smiles’ field added
- Return type:
List[Dict[str, Any]]
- Raises:
KeyError – If fasta_key is missing in any record
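The record-level contract of dict_process (a new list, a ‘smiles’ field added, KeyError on a missing key) can be sketched as below. The names dict_process_sketch and fake_converter are hypothetical, the converter is a placeholder rather than a real FASTA-to-SMILES conversion, and keeping filtered-out records with smiles set to None is an assumption; the library may drop them instead.

```python
from typing import Any, Callable, Dict, List, Optional


def dict_process_sketch(
    data: List[Dict[str, Any]],
    fasta_key: str,
    to_smiles: Callable[[str], Optional[str]],
) -> List[Dict[str, Any]]:
    """Return new records with a 'smiles' field; inputs are left unmodified."""
    out = []
    for record in data:
        smiles = to_smiles(record[fasta_key])  # KeyError if fasta_key is missing
        out.append({**record, "smiles": smiles})  # shallow copy plus new field
    return out


def fake_converter(seq: str) -> Optional[str]:
    """Placeholder converter: returns None for sequences containing 'X'."""
    return None if "X" in seq else f"<smiles for {seq}>"


records = [{"id": 1, "fasta": "GPG"}, {"id": 2, "fasta": "GXG"}]
processed = dict_process_sketch(records, "fasta", fake_converter)
print(processed)
```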
- data_process(data, fasta_key='fasta', n_jobs=-1)[source]
Process FASTA data in a DataFrame or list of dicts, adding SMILES output.
- Parameters:
data (Union[pd.DataFrame, List[Dict[str, Any]]]) – DataFrame or list of dicts containing FASTA sequences
fasta_key (str) – Column/key for FASTA sequences in the data
n_jobs (int) – Number of parallel jobs for charge adjustment
- Returns:
DataFrame or list of dicts with ‘smiles’ column/field
- Return type:
Union[pd.DataFrame, List[Dict[str, Any]]]
Query Module
Request (pepkit.query.request)
Filter (pepkit.query.filter)
- pepkit.query.filter.validate_complex_pdb(pdb_id, length_cutoff=50, canonical_check=False, hetatm_check=False)
Constraint-based query (pepkit.query.query)
- pepkit.query.query.query(quality, exp_method, release_date, length_cutoff, canonical_check, hetatm_check, csv_path, fasta_path, receptor_only, n_jobs)[source]
Query, validate, and extract peptide–protein complexes from RCSB.
This function performs an end-to-end workflow:
Query RCSB for candidate peptide–protein complexes using metadata constraints (resolution, experimental method, release date).
Validate each PDB entry using structural and sequence-based criteria (peptide detection, length cutoff, canonical residues, HETATM presence).
Write a metadata table (CSV) describing valid complexes.
Extract corresponding sequences into a FASTA file for downstream modeling (e.g., AF-Multimer, docking, ML pipelines).
The function is side-effect driven: results are written to disk (CSV + FASTA) and not returned explicitly.
- Parameters:
quality (float) – Maximum allowed experimental resolution (in Å) used to query RCSB. Lower values correspond to higher-quality structures. Example: 3.0.
exp_method (str) – Experimental method used to solve the structure. Must match RCSB metadata exactly. Example: "X-RAY DIFFRACTION".
release_date (dict or str) – Release date constraint for RCSB query. Can be either a dict with {"from": YYYY-MM-DD, "to": YYYY-MM-DD}, or a single date string (interpreted as lower bound).
length_cutoff (int) – Maximum allowed sequence length used for peptide/protein filtering. Typically peptides are expected to be short (e.g. ≤ 50 residues).
canonical_check (bool) – If True, discard complexes containing non-canonical amino acids (e.g., X) in any retained chain.
hetatm_check (bool) – If True, discard PDB entries containing HETATM records (e.g., ligands, cofactors, modified residues).
csv_path (str or pathlib.Path) – Output path for the CSV metadata table describing valid peptide–protein complexes.
fasta_path (str or pathlib.Path) – Output path for the FASTA file containing extracted sequences. The exact content depends on receptor_only.
receptor_only (bool) – If True, only receptor (protein) chains are written to FASTA. If False, both peptide and protein chains are included.
n_jobs (int) – Number of parallel workers used for PDB validation. Passed to joblib.Parallel.
- Raises:
RuntimeError – If RCSB query fails or no valid complexes are found.
IOError – If output files cannot be written.
- Side effects:
Writes csv_path (CSV metadata)
Writes fasta_path (FASTA sequences)
- Example:
>>> query(
...     quality=3.0,
...     exp_method="X-RAY DIFFRACTION",
...     release_date={"from": "2018-01-01", "to": "2018-01-08"},
...     length_cutoff=50,
...     canonical_check=True,
...     hetatm_check=True,
...     csv_path="demo.csv",
...     fasta_path="demo.fasta",
...     receptor_only=True,
...     n_jobs=4,
... )
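The two accepted forms of release_date (a from/to dict, or a single string treated as a lower bound) could be normalized into one shape before building a query. The sketch below is an assumption about how such normalization might look, not the library's code: normalize_release_date is a hypothetical name, and both the ISO-format parsing and the from/to ordering check are additions made for illustration.

```python
from datetime import date
from typing import Dict, Optional, Tuple, Union


def normalize_release_date(
    release_date: Union[str, Dict[str, str]],
) -> Tuple[date, Optional[date]]:
    """Return (lower_bound, upper_bound); upper_bound is None if open-ended."""
    if isinstance(release_date, str):
        # A single date string is interpreted as the lower bound only.
        return date.fromisoformat(release_date), None
    if isinstance(release_date, dict):
        start = date.fromisoformat(release_date["from"])
        end = date.fromisoformat(release_date["to"])
        if end < start:
            # Illustrative sanity check, not documented library behaviour.
            raise ValueError("'to' must not be earlier than 'from'")
        return start, end
    raise TypeError("release_date must be a dict or a date string")


print(normalize_release_date({"from": "2018-01-01", "to": "2018-01-08"}))
print(normalize_release_date("2018-01-01"))
```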
Modelling Module
Analysis (pepkit.modelling.af.post.analysis)
- class pepkit.modelling.af.post.analysis.AnalysisInputs(json_path: 'Optional[Path]', pdb_path: 'Optional[Path]')[source]
Bases:
object
- class pepkit.modelling.af.post.analysis.EntryMeta(length: 'Optional[int]', processing_time: 'Optional[float]')[source]
Bases:
object
- class pepkit.modelling.af.post.analysis.BatchStats(ok: 'int' = 0, empty: 'int' = 0, error: 'int' = 0, dockq_ok: 'int' = 0, dockq_fail: 'int' = 0)[source]
Bases:
object
- class pepkit.modelling.af.post.analysis.ProgressLogger(total, step_pct)[source]
Bases: object
Log at K% increments (10%, 20%, …).
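Logging at fixed percentage increments can be sketched as a counter that reports each threshold once as it is crossed. The class below is an illustration only: PercentProgress, update, and the reported list are hypothetical names and are not part of the documented ProgressLogger interface.

```python
import logging
from typing import List, Optional


class PercentProgress:
    """Report progress each time another step_pct percent of total is reached."""

    def __init__(self, total: int, step_pct: int = 10,
                 logger: Optional[logging.Logger] = None) -> None:
        self.total = max(1, total)  # avoid division-style edge cases for total=0
        self.step_pct = step_pct
        self.done = 0
        self.next_pct = step_pct
        self.logger = logger or logging.getLogger(__name__)
        self.reported: List[int] = []  # thresholds already logged

    def update(self, n: int = 1) -> None:
        self.done += n
        # Integer comparison: done/total >= next_pct/100, without floats.
        while self.next_pct <= 100 and self.done * 100 >= self.next_pct * self.total:
            self.logger.info("progress: %d%% (%d/%d)",
                             self.next_pct, self.done, self.total)
            self.reported.append(self.next_pct)
            self.next_pct += self.step_pct


progress = PercentProgress(total=4, step_pct=50)
for _ in range(4):
    progress.update()
print(progress.reported)
```

The while loop matters when a single update crosses several thresholds at once, for example when total is smaller than the number of steps.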
- class pepkit.modelling.af.post.analysis.Analysis(json_path=None, pdb_path=None, peptide_chain_position='last', distance_cutoff=8.0, round_digits=2, *, pdockq2_d0=10.0, pdockq2_sym_pae=True)[source]
Bases: BaseFeature
High-level feature aggregation for AF(-Multimer) outputs.
- DockQ integration (via dockq.py):
Provide --mapping_csv with pdb_id,mapping to enable DockQ.
DockQ is computed for EACH entry and EACH rank.
- Written inside each rank dict:
rankXXX["total_dockq"]
rankXXX["avg_dockq"]
- Parameters:
- batch_analysis(batch_dir, *, delete_zips=True, mapping_by_pdbid=None, native_pdb_dir=None, progress_step_pct=10)[source]
progress_step_pct=10 => log at 10%,20%,…,100%