API Reference

This page contains the full API reference generated from docstrings. If you are new to PepKit, start with Getting Started and the module guides.

Module map

  • Chemical Modeling – Parsing/standardization, conversion, properties, descriptors

  • Query – Fetch, constraint-based filtering

Chem Module

Conversion (pepkit.chem.conversion.conversion)

Tools for parsing peptide representations (FASTA/SMILES), standardizing sequences, and filtering non-canonical FASTA records.

Convenience conversion functions exported by pepkit.chem.

pepkit.chem.conversion.conversion.smiles_to_fasta(smiles, header=None, split=False)[source]

Convert peptide SMILES to FASTA or raw sequence.

By default this returns a FASTA-formatted string:

>[header]
SEQUENCE

If split=True the function returns the raw one-letter sequence (e.g. “GPG”) without any FASTA header.

Parameters:
  • smiles (str) – Input SMILES representing a linear peptide.

  • header (str | None) – Optional header (without ‘>’). Ignored when split=True.

  • split (bool) – If True, return the raw sequence string instead of FASTA.

Returns:

FASTA-formatted string (default) or raw sequence (if split=True).

Raises:

ValueError – On parse/decoding failure.

Return type:

str
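The two return shapes can be illustrated without RDKit. The sketch below is not PepKit's implementation: format_sequence is a hypothetical helper that takes an already-decoded sequence and mirrors only the header/split behaviour described above. Placing the sequence on the line after the header follows the usual FASTA convention and is an assumption here, as is the empty header when header is None.

```python
from typing import Optional


def format_sequence(sequence: str, header: Optional[str] = None, split: bool = False) -> str:
    """Hypothetical helper: format a decoded one-letter sequence."""
    if split:
        # Raw sequence only; the header is ignored in this mode.
        return sequence
    # Assumption: header line first, sequence on the following line.
    return f">{header or ''}\n{sequence}"


fasta_record = format_sequence("GPG", header="pep1")
raw_sequence = format_sequence("GPG", header="pep1", split=True)
```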

pepkit.chem.conversion.conversion.fasta_to_smiles(fasta)[source]

Convert one-letter FASTA (no header) to canonical SMILES using RDKit.

Rejects non-canonical sequences containing the placeholder ‘X’.

Parameters:

fasta (str) – Amino-acid sequence in one-letter code.

Returns:

Canonical SMILES string.

Raises:

ValueError – If the sequence contains ‘X’ or RDKit cannot parse it.

Return type:

str

Descriptor (pepkit.chem.desc.descriptor)

Calculation of molecular descriptors and physicochemical properties.

class pepkit.chem.desc.descriptor.Descriptor(engine='peptides', fasta_key='peptide_sequence', id_key='id', smiles_key='smiles')[source]

Bases: object

Compute molecular or peptide descriptors for a collection of records.

This class provides descriptor calculation for peptides or small molecules, supporting two engines:

  • ‘peptides’: Uses the peptides Python package for peptide descriptors.

  • ‘rdkit’: Uses RDKit for general molecular descriptors from SMILES.

Parameters:
  • engine (str) – Descriptor engine (‘peptides’ for peptide descriptors, ‘rdkit’ for molecular descriptors).

  • fasta_key (str) – Key for the peptide sequence in input records or DataFrame.

  • id_key (str) – Key for unique record identifiers in input.

  • smiles_key (str) – Key for SMILES string in input records (used only by ‘rdkit’).

Example

>>> descriptor = Descriptor(engine='peptides')
>>> records = [{'id': 1, 'peptide_sequence': 'ACDE'}]
>>> results = descriptor.calculate(records, n_jobs=2)
>>> descriptor = Descriptor(engine='rdkit')
>>> records = [{'id': 1, 'smiles': 'CC(=O)O'}]
>>> results = descriptor.calculate(records, n_jobs=4)

SUPPORTED_ENGINES = {'peptides', 'rdkit'}

calculate(data, n_jobs=1, verbose=0)[source]

Compute descriptors in parallel for all records in data.

The output type matches the input type: if you provide a DataFrame, you get a DataFrame; if you provide a list of dicts, you get a list.

Parameters:
  • data (DataFrame | List[Dict[str, Any]]) – Input data (pandas DataFrame or list of dicts), with fields for sequence/SMILES and ID.

  • n_jobs (int) – Number of parallel jobs (joblib, -1 uses all available cores).

  • verbose (int) – Verbosity for joblib parallel execution.

Returns:

Descriptor results, in the same format as the input.

Return type:

DataFrame | List[Dict[str, Any]]

Raises:
  • TypeError – If input is not a DataFrame or list of dicts.

  • KeyError – If required keys are missing in input records.

  • ValueError – If SMILES cannot be parsed by RDKit.

Example

>>> descriptor = Descriptor(engine='peptides')
>>> df = pd.DataFrame([{'id': 1, 'peptide_sequence': 'ACDE'}])
>>> result = descriptor.calculate(df, n_jobs=1)
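
The error conditions documented for calculate suggest checking records before starting a long parallel run. The following is a minimal sketch for the list-of-dicts case only; check_records is a hypothetical helper, not part of PepKit, and the key names are the defaults from the Descriptor signature.

```python
from typing import Any, Dict, List


def check_records(records: Any, required_keys: List[str]) -> List[Dict[str, Any]]:
    """Hypothetical pre-flight check for list-of-dicts input."""
    if not isinstance(records, list) or not all(isinstance(r, dict) for r in records):
        raise TypeError("expected a list of dicts")
    for index, record in enumerate(records):
        missing = [key for key in required_keys if key not in record]
        if missing:
            raise KeyError(f"record {index} is missing keys: {missing}")
    return records


records = check_records(
    [{"id": 1, "peptide_sequence": "ACDE"}],
    required_keys=["id", "peptide_sequence"],
)
```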

Standardize (pepkit.chem.standardize)

Utilities for standardizing peptide sequences and molecular representations.

class pepkit.chem.standardize.Standardizer(remove_non_canonical=False, charge_by_pH=False, pH=7.4, logger=None)[source]

Bases: object

Utility for processing peptide/protein sequences:

  • Validate canonical sequences

  • Convert FASTA to SMILES

  • Add pH-dependent charges

  • Batch and dict/DataFrame-based processing

Parameters:
  • remove_non_canonical (bool) – If True, filter out non-canonical sequences

  • charge_by_pH (bool) – If True, adjust SMILES charges at given pH

  • pH (float) – pH value for charge adjustment

  • logger (logging.Logger) – Logger instance for status messages

static is_canonical_sequence(sequence)[source]

Check if a sequence contains only canonical amino acids.

Parameters:

sequence (str) – FASTA-style one-letter amino acid sequence

Returns:

True if all residues are canonical

Return type:

bool

Raises:

TypeError – If sequence is not a string
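
As a rough illustration of this check, and not the library's actual code, the sketch below assumes that "canonical" means the 20 standard one-letter amino-acid codes. CANONICAL and looks_canonical are hypothetical names. Whether PepKit normalizes case or how it treats an empty string is not stated here, so this sketch is case-sensitive and accepts an empty string.

```python
# Assumption: the canonical alphabet is the 20 standard residues.
CANONICAL = frozenset("ACDEFGHIKLMNPQRSTVWY")


def looks_canonical(sequence: str) -> bool:
    """Hypothetical check: True if every residue is in CANONICAL."""
    if not isinstance(sequence, str):
        raise TypeError("sequence must be a string")
    return all(residue in CANONICAL for residue in sequence)


is_ok = looks_canonical("ACDE")
has_placeholder = looks_canonical("ACXE")
```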

static add_charge_by_pH(smi, pH=7.4)[source]

Adjust the protonation state of a SMILES string for a given pH.

Parameters:
  • smi (str) – Input SMILES string

  • pH (float) – Target pH for protonation correction

Returns:

pH-corrected SMILES string

Return type:

str
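
For background on what a pH adjustment implies, the Henderson-Hasselbalch relation gives the fraction of an acidic group that is deprotonated at a given pH. The sketch below is general chemistry, not a description of how add_charge_by_pH is implemented; deprotonated_fraction is a hypothetical name and the pKa of 4.0 is an illustrative value for a carboxylic acid.

```python
def deprotonated_fraction(pH: float, pKa: float) -> float:
    """Fraction of an acid HA present as A-, from Henderson-Hasselbalch."""
    return 1.0 / (1.0 + 10.0 ** (pKa - pH))


# Illustrative pKa; real side-chain and terminal pKa values vary.
fraction_at_default_pH = deprotonated_fraction(pH=7.4, pKa=4.0)
```

When the pH equals the pKa the fraction is exactly one half; well above the pKa the group is almost entirely deprotonated.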

static process_fasta(fasta, remove_non_canonical=False, charge_by_pH=False, pH=7.4)[source]

Convert a FASTA sequence to a SMILES string, with optional filtering and charging.

Parameters:
  • fasta (str) – FASTA-style amino acid sequence

  • remove_non_canonical (bool) – If True, skip sequences containing non-canonical residues

  • charge_by_pH (bool) – If True, adjust SMILES at specified pH

  • pH (float) – pH for protonation adjustment

Returns:

Generated SMILES or None if filtered out

Return type:

str | None

static dict_process(data, fasta_key, remove_non_canonical=False, charge_by_pH=False, pH=7.4)[source]

Process a list of dictionaries, converting FASTA sequences to SMILES.

Parameters:
  • data (List[Dict[str, Any]]) – List of records (dicts) containing FASTA sequences

  • fasta_key (str) – Key in each dict for the FASTA sequence

  • remove_non_canonical (bool) – Remove non-canonical sequences if True

  • charge_by_pH (bool) – Adjust SMILES at specified pH if True

  • pH (float) – pH value for charge adjustment

Returns:

New list of dicts with ‘smiles’ field added

Return type:

List[Dict[str, Any]]

Raises:

KeyError – If fasta_key is missing in any record
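
The record-level behaviour can be sketched with a stand-in converter. add_field and fake_convert below are hypothetical and do no chemistry; they only show the shape of the operation: a new list is returned, each output record gains a 'smiles' field, and a missing key raises KeyError. Copying each record and keeping records whose conversion yields None are assumptions of this sketch, not documented behaviour.

```python
from typing import Any, Callable, Dict, List, Optional


def add_field(
    data: List[Dict[str, Any]],
    fasta_key: str,
    convert: Callable[[str], Optional[str]],
) -> List[Dict[str, Any]]:
    """Hypothetical helper: return new records with a 'smiles' field."""
    out = []
    for record in data:
        if fasta_key not in record:
            raise KeyError(fasta_key)
        # Copy so the caller's dicts are left unmodified (assumption).
        out.append({**record, "smiles": convert(record[fasta_key])})
    return out


def fake_convert(sequence: str) -> Optional[str]:
    # Stand-in for a real FASTA-to-SMILES conversion.
    return None if "X" in sequence else f"<smiles for {sequence}>"


records = [{"id": 1, "fasta": "ACDE"}, {"id": 2, "fasta": "AXDE"}]
processed = add_field(records, "fasta", fake_convert)
```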

process_list_fasta(sequences, n_jobs=-1)[source]

Process a list of FASTA sequences in parallel.

Parameters:
  • sequences (List[str]) – List of FASTA sequences

  • n_jobs (int) – Number of parallel jobs for processing

Returns:

List of resulting SMILES or None values

Return type:

List[Optional[str]]
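
A minimal sketch of order-preserving parallel mapping follows. It uses the standard library's ThreadPoolExecutor rather than whatever backend PepKit uses, and convert_or_none is a hypothetical stand-in for the real conversion, so treat it only as an illustration of the List[Optional[str]] result shape.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional


def convert_or_none(sequence: str) -> Optional[str]:
    # Stand-in conversion: None marks a filtered-out sequence.
    return None if "X" in sequence else sequence.lower()


def map_in_parallel(sequences: List[str], max_workers: int = 4) -> List[Optional[str]]:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order.
        return list(pool.map(convert_or_none, sequences))


results = map_in_parallel(["ACDE", "AXDE", "GPG"])
```

Because results come back in input order, a None at position i can be matched to the i-th input sequence.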

data_process(data, fasta_key='fasta', n_jobs=-1)[source]

Process FASTA data in a DataFrame or list of dicts, adding SMILES output.

Parameters:
  • data (DataFrame | List[Dict[str, Any]]) – Input pandas DataFrame or list of dicts with FASTA sequences

  • fasta_key (str) – Column/key for FASTA sequences in the data

  • n_jobs (int) – Number of parallel jobs for charge adjustment

Returns:

DataFrame or list of dicts with ‘smiles’ column/field

Return type:

DataFrame | List[Dict[str, Any]]

Query Module

Request (pepkit.query.request)

pepkit.query.request.retrieve_pdb(pdb_id, outdir='.', format='pdb')[source]

Download a .pdb file from RCSB by PDB ID.

Return type:

Path
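
RCSB serves entry files at https://files.rcsb.org/download/<ID>.<ext>. The sketch below only builds the URL and the local target path, without touching the network. download_target is a hypothetical helper; that retrieve_pdb uses this URL, upper-cases the ID, or names the output file this way are all assumptions.

```python
from pathlib import Path
from typing import Tuple


def download_target(pdb_id: str, outdir: str = ".", fmt: str = "pdb") -> Tuple[str, Path]:
    """Hypothetical helper: return (download URL, local output path)."""
    entry = pdb_id.upper()
    url = f"https://files.rcsb.org/download/{entry}.{fmt}"
    return url, Path(outdir) / f"{entry}.{fmt}"


url, path = download_target("1abc", outdir="structures")
```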

Filter (pepkit.query.filter)

pepkit.query.filter.validate_complex_pdb(pdb_id, length_cutoff=50, canonical_check=False, hetatm_check=False)
Parameters:
  • pdb_id (str)

  • length_cutoff (int)

  • canonical_check (bool)

  • hetatm_check (bool)

pepkit.query.filter.validate_complex_pdbs(pdb_ids, length_cutoff=50, canonical_check=False, hetatm_check=False, n_jobs=8)
Parameters:
  • pdb_ids (list)

  • length_cutoff (int)

  • canonical_check (bool)

  • hetatm_check (bool)

  • n_jobs (int)

Constraint-based query (pepkit.query.query)

pepkit.query.query.query(quality, exp_method, release_date, length_cutoff, canonical_check, hetatm_check, csv_path, fasta_path, receptor_only, n_jobs)[source]

Query, validate, and extract peptide–protein complexes from RCSB.

This function performs an end-to-end workflow:

  1. Query RCSB for candidate peptide–protein complexes using metadata constraints (resolution, experimental method, release date).

  2. Validate each PDB entry using structural and sequence-based criteria (peptide detection, length cutoff, canonical residues, HETATM presence).

  3. Write a metadata table (CSV) describing valid complexes.

  4. Extract corresponding sequences into a FASTA file for downstream modeling (e.g., AF-Multimer, docking, ML pipelines).

The function is side-effect driven: results are written to disk (CSV + FASTA) and not returned explicitly.

Parameters:
  • quality (float) – Maximum allowed experimental resolution (in Å) used to query RCSB. Lower values correspond to higher-quality structures. Example: 3.0.

  • exp_method (str) – Experimental method used to solve the structure. Must match RCSB metadata exactly. Example: "X-RAY DIFFRACTION".

  • release_date (dict or str) – Release date constraint for the RCSB query. Either a dict of the form {"from": "YYYY-MM-DD", "to": "YYYY-MM-DD"} or a single date string (interpreted as the lower bound).

  • length_cutoff (int) – Maximum allowed sequence length used for peptide/protein filtering. Typically peptides are expected to be short (e.g. ≤ 50 residues).

  • canonical_check (bool) – If True, discard complexes containing non-canonical amino acids (e.g., X) in any retained chain.

  • hetatm_check (bool) – If True, discard PDB entries containing HETATM records (e.g., ligands, cofactors, modified residues).

  • csv_path (str or pathlib.Path) – Output path for the CSV metadata table describing valid peptide–protein complexes.

  • fasta_path (str or pathlib.Path) – Output path for the FASTA file containing extracted sequences. The exact content depends on receptor_only.

  • receptor_only (bool) – If True, only receptor (protein) chains are written to FASTA. If False, both peptide and protein chains are included.

  • n_jobs (int) – Number of parallel workers used for PDB validation. Passed to joblib.Parallel.

Raises:
  • RuntimeError – If RCSB query fails or no valid complexes are found.

  • IOError – If output files cannot be written.

Side effects:
  • Writes csv_path (CSV metadata)

  • Writes fasta_path (FASTA sequences)

Example:
>>> query(
...     quality=3.0,
...     exp_method="X-RAY DIFFRACTION",
...     release_date={"from": "2018-01-01", "to": "2018-01-08"},
...     length_cutoff=50,
...     canonical_check=True,
...     hetatm_check=True,
...     csv_path="demo.csv",
...     fasta_path="demo.fasta",
...     receptor_only=True,
...     n_jobs=4,
... )
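
Because the results live on disk, a downstream step usually starts by reading them back. The sketch below parses FASTA text with the standard library; read_fasta is a hypothetical helper, and the two-record sample stands in for the contents of demo.fasta, whose real header format is not specified here.

```python
from typing import Dict, Iterable, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Hypothetical helper: map each FASTA header to its sequence."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may wrap; concatenate them.
            records[header] += line
    return records


# Made-up sample; not actual output of query().
sample = ">entry_A\nACDEFG\nHIK\n>entry_B\nGPG\n"
sequences = read_fasta(sample.splitlines())
```

With a real file, pass an open handle instead of the sample lines, for example inside `with open("demo.fasta") as handle:`.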

Modelling Module

Analysis (pepkit.modelling.af.post.analysis)

class pepkit.modelling.af.post.analysis.AnalysisInputs(json_path: 'Optional[Path]', pdb_path: 'Optional[Path]')[source]

Bases: object

Parameters:
  • json_path (Path | None)

  • pdb_path (Path | None)

json_path: Path | None
pdb_path: Path | None

class pepkit.modelling.af.post.analysis.EntryMeta(length: 'Optional[int]', processing_time: 'Optional[float]')[source]

Bases: object

Parameters:
  • length (int | None)

  • processing_time (float | None)

length: int | None
processing_time: float | None

class pepkit.modelling.af.post.analysis.BatchStats(ok: 'int' = 0, empty: 'int' = 0, error: 'int' = 0, dockq_ok: 'int' = 0, dockq_fail: 'int' = 0)[source]

Bases: object

Parameters:
  • ok (int)

  • empty (int)

  • error (int)

  • dockq_ok (int)

  • dockq_fail (int)

ok: int = 0
empty: int = 0
error: int = 0
dockq_ok: int = 0
dockq_fail: int = 0

class pepkit.modelling.af.post.analysis.ProgressLogger(total, step_pct)[source]

Bases: object

Log at K% increments (10%, 20%, …).

tick(i)[source]
Parameters:

i (int)

Return type:

None
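
Percentage-step logging can be sketched as follows. StepLogger is a hypothetical class, not PepKit's ProgressLogger; it assumes that i is the number of items completed so far (1-based) and records each reported threshold in a list so the behaviour is easy to inspect.

```python
import logging
from typing import List

logger = logging.getLogger("progress_sketch")


class StepLogger:
    """Hypothetical logger that reports once per step_pct threshold."""

    def __init__(self, total: int, step_pct: int) -> None:
        self.total = total
        self.step_pct = step_pct
        self.next_pct = step_pct
        self.reported: List[int] = []

    def tick(self, i: int) -> None:
        # Guard against division by zero and a non-advancing threshold.
        if self.total <= 0 or self.step_pct <= 0:
            return
        pct = 100 * i // self.total
        # Emit every threshold crossed since the previous call.
        while self.next_pct <= pct and self.next_pct <= 100:
            logger.info("%d%% (%d/%d)", self.next_pct, i, self.total)
            self.reported.append(self.next_pct)
            self.next_pct += self.step_pct


progress = StepLogger(total=20, step_pct=25)
for done in range(1, 21):
    progress.tick(done)
```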

class pepkit.modelling.af.post.analysis.Analysis(json_path=None, pdb_path=None, peptide_chain_position='last', distance_cutoff=8.0, round_digits=2, *, pdockq2_d0=10.0, pdockq2_sym_pae=True)[source]

Bases: BaseFeature

High-level feature aggregation for AF(-Multimer) outputs.

DockQ integration (via dockq.py):
  • Provide --mapping_csv with pdb_id,mapping to enable DockQ.

  • DockQ is computed for EACH entry and EACH rank.

  • Written inside each rank dict:

    rankXXX["total_dockq"]
    rankXXX["avg_dockq"]

Parameters:
  • json_path (Optional[str])

  • pdb_path (Optional[str])

  • peptide_chain_position (str)

  • distance_cutoff (float)

  • round_digits (int)

  • pdockq2_d0 (float)

  • pdockq2_sym_pae (bool)

single_analysis()[source]
Return type:

Dict[str, Any]

all_analysis(dir_path)[source]
Parameters:

dir_path (str | Path)

Return type:

Dict[str, Any]

batch_analysis(batch_dir, *, delete_zips=True, mapping_by_pdbid=None, native_pdb_dir=None, progress_step_pct=10)[source]

Setting progress_step_pct=10 logs progress at 10%, 20%, …, 100%.

Return type:

Dict[str, Any]

static args()[source]
Return type:

ArgumentParser

pepkit.modelling.af.post.analysis.main()[source]
Return type:

None