Chemical Modeling

The pepkit.chem package contains utilities for peptide sequences and peptide-like molecules:

  • FASTA/SMILES conversion (linear peptides)

  • Standardization & filtering (drop non-canonical residues, batch/DataFrame processing)

  • Peptide properties (net charge, molecular weight, pI)

  • Descriptor calculation for ML pipelines

At a glance

Inputs

FASTA strings, peptide sequences, SMILES, or pandas DataFrames

Outputs

Canonical SMILES, cleaned/standardized sequences, property dicts, descriptor tables

Where to look next

API Reference for full function/class docs

Sequence ⇄ SMILES conversion

Convert a peptide sequence → SMILES

Convert a sequence to a canonical SMILES string:

from pepkit.chem.conversion import fasta_to_smiles

fasta = "ACDE"
smiles = fasta_to_smiles(fasta)
print(smiles)

Example output

SMILES: C[C@H](N)C(=O)N[C@@H](CS)C(=O)N[C@@H](CC(=O)O)C(=O)N[C@@H](CCC(=O)O)C(=O)O

Convert SMILES → sequence

Convert a peptide-like SMILES back to a FASTA/sequence (when the SMILES corresponds to a peptide-like backbone):

from pepkit.chem.conversion import smiles_to_fasta

seq = smiles_to_fasta(smiles, header="peptide1")
print(seq)

Example output

>peptide1
ACDE

Round-trip quick check

from pepkit.chem.conversion import fasta_to_smiles, smiles_to_fasta
seq = "ACDEFGHIK"
ok = (smiles_to_fasta(fasta_to_smiles(seq), split=True) == seq)
print("Round Trip:", ok)

Note

  • Intended for linear peptides (RDKit-style). Modified/cyclic peptides or arbitrary small-molecule SMILES may not round-trip.

  • RDKit is required.

Standardization & filtering

Typical workflow: validate → standardize → featurize

Standardize a list of sequences (remove non-canonical residues; optional pH charge model):

from pepkit.chem.standardize import Standardizer

std = Standardizer(remove_non_canonical=True, charge_by_pH=True, pH=7.0)
seqs = ["ACDEFGHIK", "XYZ"]
standardized = std.process_list_fasta(seqs)
print(standardized)
# Example output (list): [SMILES, None]  # second entry removed / mapped

For pandas DataFrames (vectorized, returns a new DataFrame):

import pandas as pd
from pepkit.chem.standardize import Standardizer

df = pd.DataFrame({"id": [1, 2], "fasta": ["ACDEFGHIK", "XYZ"]})
std = Standardizer(remove_non_canonical=True, charge_by_pH=True, pH=7.0)
df_std = std.data_process(df, fasta_key="fasta")
print(df_std.head())

Example DataFrame (after standardization)

id    fasta            smiles                  valid
1     ACDEFGHIK        CC[C@H](C)[C@H]...      True
2     XYZ              None                    False

Warning

When remove_non_canonical=True, records containing non-canonical residues will be filtered out or set to None depending on the API. Confirm how your downstream pipeline handles missing values before using this option.

Descriptors

Generate ML features using peptide-sequence descriptors or RDKit molecular descriptors.

from pepkit.chem.desc import Descriptor

# Peptide sequence descriptors
data_pep = [{"id": "pep1", "peptide_sequence": "ACDE"}]
desc_pep = Descriptor(engine="peptides").calculate(data_pep)
print(desc_pep[0])

# RDKit molecular descriptors
data_mol = [{"id": "mol1", "smiles": "CCO"}]
desc_mol = Descriptor(engine="rdkit").calculate(data_mol)
print(desc_mol[0])

Note

The peptides engine requires the third-party package peptides. Install with pip install peptides when using engine="peptides".

See also

  • Getting Started — quickstart example that chains standardization → descriptors

  • Chemical Modeling — full chemistry module overview

  • doc:

    api — full API reference