Chemical Modeling
The pepkit.chem package contains utilities for peptide sequences and peptide-like molecules:
FASTA/SMILES conversion (linear peptides)
Standardization & filtering (drop non-canonical residues, batch/DataFrame processing)
Peptide properties (net charge, molecular weight, pI)
Descriptor calculation for ML pipelines
At a glance
Inputs | FASTA strings, peptide sequences, SMILES, or pandas DataFrames
Outputs | Canonical SMILES, cleaned/standardized sequences, property dicts, descriptor tables
Where to look next | API Reference for full function/class docs
Sequence ⇄ SMILES conversion
Convert a peptide sequence → SMILES
Convert a sequence to a canonical SMILES string:
from pepkit.chem.conversion import fasta_to_smiles
fasta = "ACDE"
smiles = fasta_to_smiles(fasta)
print("SMILES:", smiles)
Example output
SMILES: C[C@H](N)C(=O)N[C@@H](CS)C(=O)N[C@@H](CC(=O)O)C(=O)N[C@@H](CCC(=O)O)C(=O)O
Convert SMILES → sequence
Convert a peptide-like SMILES back to a FASTA/sequence (when the SMILES corresponds to a peptide-like backbone):
from pepkit.chem.conversion import smiles_to_fasta
# `smiles` is the string returned by fasta_to_smiles in the previous example
seq = smiles_to_fasta(smiles, header="peptide1")
print(seq)
Example output
>peptide1
ACDE
Round-trip quick check
from pepkit.chem.conversion import fasta_to_smiles, smiles_to_fasta
seq = "ACDEFGHIK"
ok = (smiles_to_fasta(fasta_to_smiles(seq), split=True) == seq)
print("Round Trip:", ok)
Note
Intended for linear peptides (RDKit-style). Modified/cyclic peptides or arbitrary small-molecule SMILES may not round-trip.
RDKit is required.
Standardization & filtering
Typical workflow: validate → standardize → featurize
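The validate step can be approximated before any pepkit call is made. The following is a minimal sketch, not part of pepkit: it assumes that "canonical" means the 20 standard one-letter amino-acid codes, and the helper name non_canonical_positions is hypothetical.

```python
# Assumption: "canonical" = the 20 standard one-letter amino-acid codes.
CANONICAL_RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWY")


def non_canonical_positions(sequence):
    """Return (index, letter) pairs for letters outside the canonical set.

    Hypothetical helper for illustration; it is not a pepkit function.
    """
    return [
        (i, aa)
        for i, aa in enumerate(sequence.upper())
        if aa not in CANONICAL_RESIDUES
    ]


for seq in ["ACDEFGHIK", "XYZ"]:
    bad = non_canonical_positions(seq)
    status = "ok" if not bad else f"non-canonical at {bad}"
    print(seq, "->", status)
```

Reporting positions rather than a plain True/False makes it easier to see why a record would be rejected. Here "XYZ" fails because of X and Z; Y (tyrosine) is a canonical residue.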
Standardize a list of sequences (remove non-canonical residues; optional pH charge model):
from pepkit.chem.standardize import Standardizer
std = Standardizer(remove_non_canonical=True, charge_by_pH=True, pH=7.0)
seqs = ["ACDEFGHIK", "XYZ"]
standardized = std.process_list_fasta(seqs)
print(standardized)
# Example output (list): [SMILES, None]
# The second entry ("XYZ") contains non-canonical residues, so it is removed or mapped to None.
For pandas DataFrames (vectorized, returns a new DataFrame):
import pandas as pd
from pepkit.chem.standardize import Standardizer
df = pd.DataFrame({"id": [1, 2], "fasta": ["ACDEFGHIK", "XYZ"]})
std = Standardizer(remove_non_canonical=True, charge_by_pH=True, pH=7.0)
df_std = std.data_process(df, fasta_key="fasta")
print(df_std.head())
Example DataFrame (after standardization)
id  fasta      smiles              valid
1   ACDEFGHIK  CC[C@H](C)[C@H]...  True
2   XYZ        None                False
Warning
When remove_non_canonical=True, records containing non-canonical residues
are either filtered out or set to None, depending on the API. Confirm how
your downstream pipeline handles missing values before using this option.
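One way to make missing values explicit is to separate accepted and rejected records right after standardization. This sketch assumes the result list is aligned one-to-one with the input list and uses None for rejected entries, as the list example above suggests. The helper split_results and the placeholder SMILES string are hypothetical and stand in for real Standardizer output.

```python
def split_results(sequences, results):
    """Pair each input with its result and separate out the None entries.

    Hypothetical helper; assumes `results` is aligned with `sequences`.
    """
    if len(sequences) != len(results):
        raise ValueError("sequences and results must have the same length")
    kept = [(s, r) for s, r in zip(sequences, results) if r is not None]
    dropped = [s for s, r in zip(sequences, results) if r is None]
    return kept, dropped


seqs = ["ACDEFGHIK", "XYZ"]
# Placeholder standing in for the output of std.process_list_fasta(seqs).
results = ["<SMILES for ACDEFGHIK>", None]

kept, dropped = split_results(seqs, results)
print("kept:", [s for s, _ in kept])
print("dropped:", dropped)
```

Keeping the dropped inputs in their own list lets you log or review them instead of losing them silently downstream.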
Descriptors
Generate ML features using peptide-sequence descriptors or RDKit molecular descriptors.
from pepkit.chem.desc import Descriptor
# Peptide sequence descriptors
data_pep = [{"id": "pep1", "peptide_sequence": "ACDE"}]
desc_pep = Descriptor(engine="peptides").calculate(data_pep)
print(desc_pep[0])
# RDKit molecular descriptors
data_mol = [{"id": "mol1", "smiles": "CCO"}]
desc_mol = Descriptor(engine="rdkit").calculate(data_mol)
print(desc_mol[0])
Note
The peptides engine requires the third-party package peptides.
Install with pip install peptides when using engine="peptides".
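Because the peptides engine depends on an optional package, a script can check for it before choosing an engine. The sketch below uses only the standard library. The helper choose_engine is hypothetical, and it assumes that each engine name matches the import name of its backing package.

```python
import importlib.util


def choose_engine(preferred="peptides", fallback="rdkit"):
    """Return `preferred` if a module of that name is importable, else `fallback`.

    Hypothetical helper; assumes engine names equal top-level module names.
    """
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback


engine = choose_engine()
print("Using descriptor engine:", engine)
```

The chosen name can then be passed as Descriptor(engine=engine). The two engines expect different input keys (peptide_sequence versus smiles), so the input records need to match the engine that was selected.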
See also
Getting Started — quickstart example that chains standardization → descriptors
Chemical Modeling — full chemistry module overview
API Reference: full API reference