API Reference

This page contains the full API reference generated from docstrings. If you are new to PepKit, start with Getting Started and the module guides.

Module map 

Chemical Modeling	Parsing/standardization, conversion, properties, descriptors
Query	Fetch, constraint-based filtering

Chem Module 

Conversion (`pepkit.chem.conversion.conversion`)

Tools for parsing peptide representations (FASTA/SMILES), standardizing sequences, and filtering non-canonical FASTA records.

Convenience conversion functions exported by pepkit.chem.

pepkit.chem.conversion.conversion.smiles_to_fasta(smiles, header=None, split=False)[source]

Convert peptide SMILES to FASTA or raw sequence.

By default this returns a FASTA-formatted string:

    >[header]
    SEQUENCE

If split=True the function returns the raw one-letter sequence (e.g. “GPG”) without any FASTA header.

Parameters:

smiles (str) – Input SMILES representing a linear peptide.
header (str | None) – Optional header (without ‘>’). Ignored when split=True.
split (bool) – If True, return the raw sequence string instead of FASTA.

Returns:

FASTA-formatted string (default) or raw sequence (if split=True).

Raises:

ValueError – On parse/decoding failure.

Return type:

str
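The two return shapes described above can be illustrated with a small stand-alone sketch. It does not parse SMILES and is not PepKit's implementation: `format_sequence` is a hypothetical helper, and rendering a missing header as a bare `>` line is an assumption.

```python
from typing import Optional


def format_sequence(sequence: str, header: Optional[str] = None, split: bool = False) -> str:
    """Hypothetical helper that mirrors only the documented return shapes."""
    if split:
        # Raw one-letter sequence; any header is ignored.
        return sequence
    # FASTA record: '>' plus the header on line one, the sequence on line two.
    # Assumption: a missing header is rendered as a bare '>' line.
    return f">{header or ''}\n{sequence}"


print(format_sequence("GPG", header="pep1"))  # two lines: ">pep1" then "GPG"
print(format_sequence("GPG", split=True))     # "GPG"
```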

pepkit.chem.conversion.conversion.fasta_to_smiles(fasta)[source]

Convert one-letter FASTA (no header) to canonical SMILES using RDKit.

Rejects non-canonical sequences containing the placeholder ‘X’.

Parameters: fasta (str) – Amino-acid sequence in one-letter code.
Returns: Canonical SMILES string.
Raises: ValueError – If the sequence contains ‘X’ or RDKit cannot parse it.
Return type: str

Descriptor (`pepkit.chem.desc.descriptor`)

Calculation of molecular descriptors and physicochemical properties.

class pepkit.chem.desc.descriptor.Descriptor(engine='peptides', fasta_key='peptide_sequence', id_key='id', smiles_key='smiles')[source]

Bases: object

Compute molecular or peptide descriptors for a collection of records.

This class provides descriptor calculation for peptides or small molecules, supporting two engines:

‘peptides’: Uses the peptides Python package for peptide descriptors.

‘rdkit’: Uses RDKit for general molecular descriptors from SMILES.

Parameters:

engine (str) – Descriptor engine (‘peptides’ for peptide descriptors, ‘rdkit’ for molecular descriptors).
fasta_key (str) – Key for the peptide sequence in input records or DataFrame.
id_key (str) – Key for unique record identifiers in input.
smiles_key (str) – Key for SMILES string in input records (used only by ‘rdkit’).

Example

>>> descriptor = Descriptor(engine='peptides')
>>> records = [{'id': 1, 'peptide_sequence': 'ACDE'}]
>>> df_out = descriptor.calculate(records, n_jobs=2)
>>> descriptor = Descriptor(engine='rdkit')
>>> records = [{'id': 1, 'smiles': 'CC(=O)O'}]
>>> df_out = descriptor.calculate(records, n_jobs=4)

SUPPORTED_ENGINES = {'peptides', 'rdkit'}

calculate(data, n_jobs=1, verbose=0)[source]

Compute descriptors in parallel for all records in data.

The output type matches the input type: if you provide a DataFrame, you get a DataFrame; if you provide a list of dicts, you get a list.

Parameters:

data (DataFrame | List[Dict[str, Any]]) – Input data (pandas DataFrame or list of dicts), with fields for sequence/SMILES and ID.
n_jobs (int) – Number of parallel jobs (joblib, -1 uses all available cores).
verbose (int) – Verbosity for joblib parallel execution.

Returns:

Descriptor results, in the same format as the input.

Raises:

TypeError – If input is not a DataFrame or list of dicts.
KeyError – If required keys are missing in input records.
ValueError – If SMILES cannot be parsed by RDKit.

Return type:

DataFrame | List[Dict[str, Any]]

Example

>>> descriptor = Descriptor(engine='peptides')
>>> df = pd.DataFrame([{'id': 1, 'peptide_sequence': 'ACDE'}])
>>> result = descriptor.calculate(df, n_jobs=1)
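As a rough illustration of the list-of-dicts contract (the ID is carried through, descriptor fields are added, and bad input raises `TypeError` or `KeyError`), the sketch below uses only the standard library. `calculate_records` and the `compute` callable are hypothetical stand-ins. Which columns PepKit actually returns is not specified in this reference, so the output layout here is an assumption.

```python
from typing import Any, Callable, Dict, List


def calculate_records(
    records: List[Dict[str, Any]],
    compute: Callable[[str], Dict[str, Any]],
    fasta_key: str = "peptide_sequence",
    id_key: str = "id",
) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for the list-of-dicts path of a descriptor run."""
    if not isinstance(records, list) or not all(isinstance(r, dict) for r in records):
        raise TypeError("expected a list of dicts")
    results = []
    for record in records:
        # Direct indexing raises KeyError when a required key is missing.
        row = {id_key: record[id_key]}
        row.update(compute(record[fasta_key]))
        results.append(row)
    return results


# Toy descriptor (sequence length only), standing in for a real engine.
out = calculate_records(
    [{"id": 1, "peptide_sequence": "ACDE"}],
    lambda seq: {"length": len(seq)},
)
print(out)  # [{'id': 1, 'length': 4}]
```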

Standardize (`pepkit.chem.standardize`)

Utilities for standardizing peptide sequences and molecular representations.

class pepkit.chem.standardize.Standardizer(remove_non_canonical=False, charge_by_pH=False, pH=7.4, logger=None)[source]

Bases: object

Utility for processing peptide/protein sequences:

Validate canonical sequences
Convert FASTA to SMILES
Add pH-dependent charges
Batch and dict/DataFrame-based processing

Parameters:

remove_non_canonical (bool) – If True, filter out non-canonical sequences
charge_by_pH (bool) – If True, adjust SMILES charges at given pH
pH (float) – pH value for charge adjustment
logger (logging.Logger) – Logger instance for status messages

static is_canonical_sequence(sequence)[source]

Check if a sequence contains only canonical amino acids.

Parameters: sequence (str) – FASTA-style one-letter amino acid sequence
Returns: True if all residues are canonical
Return type: bool
Raises: TypeError – If sequence is not a string
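A minimal sketch of such a check, assuming "canonical" means the 20 standard amino acids in upper-case one-letter code. The helper name `is_canonical` and the handling of lower-case input and empty strings are assumptions, not confirmed PepKit behaviour.

```python
# The 20 standard amino acids in one-letter code.
CANONICAL_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_canonical(sequence: str) -> bool:
    """Return True if every residue is one of the 20 standard amino acids."""
    if not isinstance(sequence, str):
        raise TypeError(f"sequence must be a str, got {type(sequence).__name__}")
    # Assumption: lower-case letters are treated as non-canonical, and an
    # empty string trivially passes because it has no offending residue.
    return all(residue in CANONICAL_AMINO_ACIDS for residue in sequence)


print(is_canonical("ACDE"))  # True
print(is_canonical("ACXE"))  # False ('X' is a placeholder, not a residue)
```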

static add_charge_by_pH(smi, pH=7.4)[source]

Adjust the protonation state of a SMILES string for a given pH.

Parameters:

smi (str) – Input SMILES string
pH (float) – Target pH for protonation correction

Returns:

pH-corrected SMILES string

Return type:

str

static process_fasta(fasta, remove_non_canonical=False, charge_by_pH=False, pH=7.4)[source]

Convert a FASTA sequence to a SMILES string, with optional filtering and charging.

Parameters:

fasta (str) – FASTA-style amino acid sequence
remove_non_canonical (bool) – If True, skip sequences containing non-canonical residues
charge_by_pH (bool) – If True, adjust SMILES at specified pH
pH (float) – pH for protonation adjustment

Returns:

Generated SMILES or None if filtered out

Return type:

str | None

static dict_process(data, fasta_key, remove_non_canonical=False, charge_by_pH=False, pH=7.4)[source]

Process a list of dictionaries, converting FASTA sequences to SMILES.

Parameters:

data (List[Dict[str, Any]]) – List of records (dicts) containing FASTA sequences
fasta_key (str) – Key in each dict for the FASTA sequence
remove_non_canonical (bool) – Remove non-canonical sequences if True
charge_by_pH (bool) – Adjust SMILES at specified pH if True
pH (float) – pH value for charge adjustment

Returns:

New list of dicts with ‘smiles’ field added

Return type:

List[Dict[str, Any]]

Raises:

KeyError – If fasta_key is missing in any record
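The "new list with a ‘smiles’ field" behaviour can be sketched without RDKit by injecting the converter. `add_smiles` and `to_smiles` are hypothetical names, and copying each record so that the input stays unmodified is an assumption based on the word "new" above.

```python
from typing import Any, Callable, Dict, List, Optional


def add_smiles(
    data: List[Dict[str, Any]],
    fasta_key: str,
    to_smiles: Callable[[str], Optional[str]],
) -> List[Dict[str, Any]]:
    """Return copies of the records with a 'smiles' field added."""
    processed = []
    for record in data:
        new_record = dict(record)  # shallow copy; the input record is untouched
        # record[fasta_key] raises KeyError if the key is missing.
        new_record["smiles"] = to_smiles(record[fasta_key])
        processed.append(new_record)
    return processed


records = [{"id": "p1", "fasta": "GG"}]
# A fake converter keeps the sketch independent of any chemistry toolkit.
result = add_smiles(records, "fasta", lambda seq: f"smiles({seq})")
print(result)   # [{'id': 'p1', 'fasta': 'GG', 'smiles': 'smiles(GG)'}]
print(records)  # [{'id': 'p1', 'fasta': 'GG'}]
```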

process_list_fasta(sequences, n_jobs=-1)[source]

Process a list of FASTA sequences in parallel.

Parameters:

sequences (List[str]) – List of FASTA sequences
n_jobs (int) – Number of parallel jobs for processing

Returns:

List of resulting SMILES or None values

Return type:

List[Optional[str]]
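To show the shape of the result (one entry per input, in input order, with None for sequences that were filtered out), here is a sketch built on the standard library's `concurrent.futures` thread pool. This is a swapped-in technique, not necessarily the backend PepKit uses. `process_many` and `fake_convert` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def process_many(
    sequences: List[str],
    convert: Callable[[str], Optional[str]],
    max_workers: Optional[int] = None,
) -> List[Optional[str]]:
    """Apply `convert` to every sequence in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(convert, sequences))


def fake_convert(seq: str) -> Optional[str]:
    # Stand-in converter: sequences containing 'X' are filtered out (None).
    return None if "X" in seq else f"smiles({seq})"


print(process_many(["AC", "AXC", "GG"], fake_convert))
# ['smiles(AC)', None, 'smiles(GG)']
```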

data_process(data, fasta_key='fasta', n_jobs=-1)[source]

Process FASTA data in a DataFrame or list of dicts, adding SMILES output.

Parameters:

data (DataFrame | List[Dict[str, Any]]) – Input pandas DataFrame or list of dicts with FASTA sequences
fasta_key (str) – Column/key for FASTA sequences in the data
n_jobs (int) – Number of parallel jobs for charge adjustment

Returns:

DataFrame or list of dicts with ‘smiles’ column/field

Return type:

DataFrame | List[Dict[str, Any]]

Query Module 

Request (`pepkit.query.request`)

pepkit.query.request.retrieve_pdb(pdb_id, outdir='.', format='pdb')[source]

Download a .pdb file from RCSB by PDB ID.

Parameters:

pdb_id (str)
outdir (str | Path)
format (str)

Return type:

Path
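The reference does not say which endpoint or file name is used. As a hedged sketch, the helper below only builds a download URL and a target path, with no network access. It uses RCSB's public `files.rcsb.org/download` URL pattern. `plan_download`, the upper-casing of the ID, and the `<ID>.<format>` file name are assumptions.

```python
from pathlib import Path
from typing import Tuple, Union

# RCSB's public file-download pattern, e.g. .../download/1ABC.pdb
RCSB_DOWNLOAD_URL = "https://files.rcsb.org/download/{pdb_id}.{fmt}"


def plan_download(
    pdb_id: str, outdir: Union[str, Path] = ".", fmt: str = "pdb"
) -> Tuple[str, Path]:
    """Return (url, target_path) for a PDB entry without downloading it."""
    pdb_id = pdb_id.strip().upper()  # assumption: IDs are normalised to upper case
    url = RCSB_DOWNLOAD_URL.format(pdb_id=pdb_id, fmt=fmt)
    target = Path(outdir) / f"{pdb_id}.{fmt}"  # assumption: file named after the ID
    return url, target


url, target = plan_download("1abc", outdir="structures")
print(url)  # https://files.rcsb.org/download/1ABC.pdb
```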

Filter (`pepkit.query.filter`)

pepkit.query.filter.validate_complex_pdb(pdb_id, length_cutoff=50, canonical_check=False, hetatm_check=False)

Parameters:

pdb_id (str)
length_cutoff (int)
canonical_check (bool)
hetatm_check (bool)

pepkit.query.filter.validate_complex_pdbs(pdb_ids, length_cutoff=50, canonical_check=False, hetatm_check=False, n_jobs=8)

Parameters:

pdb_ids (list)
length_cutoff (int)
canonical_check (bool)
hetatm_check (bool)
n_jobs (int)

Constraint-based query (`pepkit.query.query`)

pepkit.query.query.query(quality, exp_method, release_date, length_cutoff, canonical_check, hetatm_check, csv_path, fasta_path, receptor_only, n_jobs)[source]

Query, validate, and extract peptide–protein complexes from RCSB.

This function performs an end-to-end workflow:

Query RCSB for candidate peptide–protein complexes using metadata constraints (resolution, experimental method, release date).
Validate each PDB entry using structural and sequence-based criteria (peptide detection, length cutoff, canonical residues, HETATM presence).
Write a metadata table (CSV) describing valid complexes.
Extract corresponding sequences into a FASTA file for downstream modeling (e.g., AF-Multimer, docking, ML pipelines).

The function is side-effect driven: results are written to disk (CSV + FASTA) and not returned explicitly.

Parameters:

quality (float) – Maximum allowed experimental resolution (in Å) used to query RCSB. Lower values correspond to higher-quality structures. Example: 3.0.
exp_method (str) – Experimental method used to solve the structure. Must match RCSB metadata exactly. Example: "X-RAY DIFFRACTION".
release_date (dict or str) – Release date constraint for the RCSB query. Either a dict of the form {"from": "YYYY-MM-DD", "to": "YYYY-MM-DD"} or a single date string (interpreted as a lower bound).
length_cutoff (int) – Maximum allowed sequence length used for peptide/protein filtering. Typically peptides are expected to be short (e.g. ≤ 50 residues).
canonical_check (bool) – If True, discard complexes containing non-canonical amino acids (e.g., X) in any retained chain.
hetatm_check (bool) – If True, discard PDB entries containing HETATM records (e.g., ligands, cofactors, modified residues).
csv_path (str or pathlib.Path) – Output path for the CSV metadata table describing valid peptide–protein complexes.
fasta_path (str or pathlib.Path) – Output path for the FASTA file containing extracted sequences. The exact content depends on receptor_only.
receptor_only (bool) – If True, only receptor (protein) chains are written to FASTA. If False, both peptide and protein chains are included.
n_jobs (int) – Number of parallel workers used for PDB validation. Passed to joblib.Parallel.

Raises:

RuntimeError – If RCSB query fails or no valid complexes are found.
IOError – If output files cannot be written.

Side effects:

Writes csv_path (CSV metadata)
Writes fasta_path (FASTA sequences)

Example:

>>> query(
...     quality=3.0,
...     exp_method="X-RAY DIFFRACTION",
...     release_date={"from": "2018-01-01", "to": "2018-01-08"},
...     length_cutoff=50,
...     canonical_check=True,
...     hetatm_check=True,
...     csv_path="demo.csv",
...     fasta_path="demo.fasta",
...     receptor_only=True,
...     n_jobs=4,
... )
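The two accepted shapes of release_date (a from/to dict, or a single string treated as a lower bound) could be normalized as in the sketch below. `normalize_release_date` is a hypothetical helper, and the ordering check is an assumption rather than documented behaviour.

```python
from datetime import date
from typing import Dict, Optional, Tuple, Union


def normalize_release_date(
    release_date: Union[str, Dict[str, str]],
) -> Tuple[date, Optional[date]]:
    """Return (lower_bound, upper_bound); upper_bound is None when open-ended."""
    if isinstance(release_date, str):
        # A single date string is interpreted as a lower bound only.
        return date.fromisoformat(release_date), None
    if isinstance(release_date, dict):
        start = date.fromisoformat(release_date["from"])
        end = date.fromisoformat(release_date["to"])
        if end < start:  # assumption: an inverted range is rejected
            raise ValueError("'to' must not be earlier than 'from'")
        return start, end
    raise TypeError("release_date must be a str or a dict")


print(normalize_release_date({"from": "2018-01-01", "to": "2018-01-08"}))
# (datetime.date(2018, 1, 1), datetime.date(2018, 1, 8))
```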

Modelling Module 

Analysis (`pepkit.modelling.af.post.analysis`)

class pepkit.modelling.af.post.analysis.AnalysisInputs(json_path: 'Optional[Path]', pdb_path: 'Optional[Path]')[source]

Bases: object

Parameters:

json_path (Path | None)
pdb_path (Path | None)

json_path: Path | None

pdb_path: Path | None

class pepkit.modelling.af.post.analysis.EntryMeta(length: 'Optional[int]', processing_time: 'Optional[float]')[source]

Bases: object

Parameters:

length (int | None)
processing_time (float | None)

length: int | None

processing_time: float | None

class pepkit.modelling.af.post.analysis.BatchStats(ok: 'int' = 0, empty: 'int' = 0, error: 'int' = 0, dockq_ok: 'int' = 0, dockq_fail: 'int' = 0)[source]

Bases: object

Parameters:

ok (int)
empty (int)
error (int)
dockq_ok (int)
dockq_fail (int)

ok: int = 0

empty: int = 0

error: int = 0

dockq_ok: int = 0

dockq_fail: int = 0

class pepkit.modelling.af.post.analysis.ProgressLogger(total, step_pct)[source]

Bases: object

Log progress at step_pct% increments (e.g. 10%, 20%, … when step_pct=10).

Parameters:

total (int)
step_pct (int)

tick(i)[source]

Parameters: i (int)
Return type: None
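One way percentage-step logging like this can work is sketched below. `PercentTicker` is a hypothetical class, not PepKit's `ProgressLogger`. It assumes `i` is the 1-based count of completed items and records each threshold in a list so the behaviour is easy to inspect.

```python
import logging
from typing import List

logger = logging.getLogger(__name__)


class PercentTicker:
    """Report progress once per crossed percentage threshold."""

    def __init__(self, total: int, step_pct: int) -> None:
        if step_pct <= 0:
            raise ValueError("step_pct must be positive")
        self.total = total
        self.step_pct = step_pct
        self._next_pct = step_pct      # next threshold still to be reported
        self.reported: List[int] = []  # thresholds reported so far

    def tick(self, i: int) -> None:
        # Assumption: i is the number of items completed so far (1-based).
        if self.total <= 0:
            return
        pct = 100 * i // self.total
        while self._next_pct <= 100 and pct >= self._next_pct:
            logger.info("progress: %d%% (%d/%d)", self._next_pct, i, self.total)
            self.reported.append(self._next_pct)
            self._next_pct += self.step_pct


ticker = PercentTicker(total=20, step_pct=10)
for done in range(1, 21):
    ticker.tick(done)
print(ticker.reported)  # [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
```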

class pepkit.modelling.af.post.analysis.Analysis(json_path=None, pdb_path=None, peptide_chain_position='last', distance_cutoff=8.0, round_digits=2, *, pdockq2_d0=10.0, pdockq2_sym_pae=True)[source]

Bases: BaseFeature

High-level feature aggregation for AF(-Multimer) outputs.

DockQ integration (via dockq.py):

Provide --mapping_csv with pdb_id,mapping to enable DockQ.
DockQ is computed for each entry and each rank.
Written inside each rank dict:
rankXXX["total_dockq"]
rankXXX["avg_dockq"]

Parameters:

json_path (Optional[str])
pdb_path (Optional[str])
peptide_chain_position (str)
distance_cutoff (float)
round_digits (int)
pdockq2_d0 (float)
pdockq2_sym_pae (bool)

single_analysis()[source]

Return type: Dict[str, Any]

all_analysis(dir_path)[source]

Parameters: dir_path (str | Path)
Return type: Dict[str, Any]

batch_analysis(batch_dir, *, delete_zips=True, mapping_by_pdbid=None, native_pdb_dir=None, progress_step_pct=10)[source]

With progress_step_pct=10, progress is logged at 10%, 20%, …, 100%.

Parameters:

batch_dir (str | Path)
delete_zips (bool)
mapping_by_pdbid (Dict[str, Dict[str, str]] | None)
native_pdb_dir (Path | None)
progress_step_pct (int)

Return type:

Dict[str, Any]

static args()[source]

Return type: ArgumentParser

pepkit.modelling.af.post.analysis.main()[source]

Return type: None
