.. _chem:

Chemical Modeling
=================

The :mod:`pepkit.chem` package contains utilities for peptide sequences and peptide-like molecules:

- **FASTA/SMILES conversion** (linear peptides)
- **Standardization & filtering** (drop non-canonical residues, batch/DataFrame processing)
- **Peptide properties** (net charge, molecular weight, pI)
- **Descriptor calculation** for ML pipelines

.. contents:: On this page
   :local:
   :depth: 2

At a glance
-----------

.. list-table::
   :widths: 25 75
   :header-rows: 0

   * - **Inputs**
     - FASTA strings, peptide sequences, SMILES, or pandas DataFrames
   * - **Outputs**
     - Canonical SMILES, cleaned/standardized sequences, property dicts, descriptor tables
   * - **Where to look next**
     - :doc:`api` for full function/class docs

Sequence ⇄ SMILES conversion
----------------------------

Convert a peptide sequence → SMILES
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Convert a sequence to a canonical SMILES string:

.. code-block:: python

   from pepkit.chem.conversion import fasta_to_smiles

   fasta = "ACDE"
   smiles = fasta_to_smiles(fasta)
   print("SMILES:", smiles)

.. admonition:: Example output
   :class: note

   .. code-block:: text

      SMILES: C[C@H](N)C(=O)N[C@@H](CS)C(=O)N[C@@H](CC(=O)O)C(=O)N[C@@H](CCC(=O)O)C(=O)O


Convert SMILES → sequence
^^^^^^^^^^^^^^^^^^^^^^^^^

Convert a SMILES string back to a FASTA record. This works only when the
SMILES has a peptide-like backbone:

.. code-block:: python

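   from pepkit.chem.conversion import fasta_to_smiles, smiles_to_fasta

   smiles = fasta_to_smiles("ACDE")  # SMILES from the previous example
   # header sets the record name written on the ">" line
   seq = smiles_to_fasta(smiles, header="peptide1")
   print(seq)

.. admonition:: Example output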
   :class: note

   .. code-block:: text

      >peptide1
      ACDE
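
The output above is a FASTA record: a ``>`` header line followed by the
sequence. A minimal sketch of splitting such a string back into its header and
sequence, using only the standard library. ``parse_fasta_record`` is a
hypothetical helper written for this page, not part of ``pepkit``, and it
assumes the string holds exactly one record:

.. code-block:: python

   def parse_fasta_record(text):
       """Split a single FASTA record into (header, sequence).

       Hypothetical helper, not part of pepkit. Assumes exactly one record.
       """
       lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
       if not lines or not lines[0].startswith(">"):
           raise ValueError("expected a FASTA header line starting with '>'")
       header = lines[0][1:].strip()
       # Sequence lines may be wrapped; join them into one string.
       sequence = "".join(lines[1:])
       return header, sequence

   header, sequence = parse_fasta_record(">peptide1\nACDE")
   print(header, sequence)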

Round-trip quick check
^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   from pepkit.chem.conversion import fasta_to_smiles, smiles_to_fasta

   seq = "ACDEFGHIK"
   ok = smiles_to_fasta(fasta_to_smiles(seq), split=True) == seq
   print("Round Trip:", ok)

.. note::

   - Intended for **linear peptides** (RDKit-style). Modified/cyclic peptides
     or arbitrary small-molecule SMILES may not round-trip.
   - RDKit is required.
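
To run the same check over many sequences and collect the ones that fail, a
small helper can take the two converters as arguments. This is only a sketch:
``failed_round_trips`` is a hypothetical name, and the stub converters below
merely stand in for ``fasta_to_smiles`` and ``smiles_to_fasta`` so that the
snippet runs without RDKit.

.. code-block:: python

   def failed_round_trips(seqs, to_smiles, to_seq):
       """Return the sequences that do not survive seq -> SMILES -> seq."""
       failures = []
       for seq in seqs:
           try:
               recovered = to_seq(to_smiles(seq))
           except Exception:
               # Treat any conversion error as a failed round trip.
               recovered = None
           if recovered != seq:
               failures.append(seq)
       return failures

   # Stub converters for illustration only (not real chemistry):
   # the "SMILES" is just the reversed sequence, and "B" is rejected.
   def stub_to_smiles(seq):
       if "B" in seq:
           raise ValueError("unsupported residue")
       return seq[::-1]

   def stub_to_seq(smiles):
       return smiles[::-1]

   print(failed_round_trips(["ACDE", "ABC"], stub_to_smiles, stub_to_seq))

With ``pepkit`` installed, the real converters would presumably be passed as
``fasta_to_smiles`` and ``lambda s: smiles_to_fasta(s, split=True)``, mirroring
the quick check above.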


Standardization & filtering
---------------------------

**Typical workflow:** **validate → standardize → featurize**
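
The *validate* step can be as simple as checking each sequence against the 20
standard one-letter amino-acid codes before it reaches the standardizer. A
minimal sketch using only the standard library; ``CANONICAL`` and
``is_canonical`` are hypothetical names for illustration and are not
necessarily the check that ``Standardizer`` performs internally:

.. code-block:: python

   # The 20 standard amino acids in one-letter code.
   CANONICAL = frozenset("ACDEFGHIKLMNPQRSTVWY")

   def is_canonical(seq):
       """True if seq is non-empty and uses only the 20 standard residues."""
       seq = seq.strip().upper()
       return bool(seq) and set(seq) <= CANONICAL

   for seq in ["ACDEFGHIK", "XYZ"]:
       print(seq, is_canonical(seq))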

Standardize a list of sequences (remove non-canonical residues; optional pH charge model):

.. code-block:: python

   from pepkit.chem.standardize import Standardizer

   std = Standardizer(remove_non_canonical=True, charge_by_pH=True, pH=7.0)
   seqs = ["ACDEFGHIK", "XYZ"]
   standardized = std.process_list_fasta(seqs)
   print(standardized)
   # Example output: [<SMILES string>, None]
   # "XYZ" contains non-canonical residues, so its entry is None.

For pandas DataFrames (vectorized, returns a new DataFrame):

.. code-block:: python

   import pandas as pd
   from pepkit.chem.standardize import Standardizer

   df = pd.DataFrame({"id": [1, 2], "fasta": ["ACDEFGHIK", "XYZ"]})
   std = Standardizer(remove_non_canonical=True, charge_by_pH=True, pH=7.0)
   df_std = std.data_process(df, fasta_key="fasta")
   print(df_std.head())

.. admonition:: Example DataFrame (after standardization)
   :class: note

   .. code-block:: text

      id    fasta            smiles                  valid
      1     ACDEFGHIK        CC[C@H](C)[C@H]...      True
      2     XYZ              None                    False

.. warning::

   When ``remove_non_canonical=True``, records containing non-canonical residues
   are either dropped or returned as ``None``, depending on the method used (the
   examples above return ``None``). Confirm how your downstream pipeline handles
   missing values before enabling this option.
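
One way to handle ``None`` entries is to pair each result with its input and
keep only the successful conversions, so that inputs and outputs stay aligned.
The sketch below uses placeholder values standing in for the output of
``process_list_fasta``; the SMILES string is a dummy, not a real conversion, and
the list is assumed to have one entry per input:

.. code-block:: python

   seqs = ["ACDEFGHIK", "XYZ"]
   # Placeholder for std.process_list_fasta(seqs); the first value is a dummy.
   standardized = ["<SMILES for ACDEFGHIK>", None]

   # Keep (sequence, SMILES) pairs only where conversion succeeded.
   kept = [(seq, smi) for seq, smi in zip(seqs, standardized) if smi is not None]
   # Record which inputs were dropped so they can be reported or inspected.
   dropped = [seq for seq, smi in zip(seqs, standardized) if smi is None]

   print(len(kept), dropped)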

Descriptors
-----------

Generate ML features using peptide-sequence descriptors or RDKit molecular descriptors.

.. code-block:: python

   from pepkit.chem.desc import Descriptor

   # Peptide sequence descriptors
   data_pep = [{"id": "pep1", "peptide_sequence": "ACDE"}]
   desc_pep = Descriptor(engine="peptides").calculate(data_pep)
   print(desc_pep[0])

   # RDKit molecular descriptors
   data_mol = [{"id": "mol1", "smiles": "CCO"}]
   desc_mol = Descriptor(engine="rdkit").calculate(data_mol)
   print(desc_mol[0])

.. note::

   The ``peptides`` engine requires the third-party package ``peptides``.
   Install with ``pip install peptides`` when using ``engine="peptides"``.
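
For ML pipelines the per-record results usually need to become a fixed-width
numeric matrix. The sketch below assumes each result is a flat ``dict`` holding
an ``id`` plus numeric descriptor values. That shape, the helper name
``to_feature_matrix``, and the sample descriptor names are all assumptions for
illustration, so check the actual return type of ``Descriptor.calculate``
before relying on it:

.. code-block:: python

   import math

   def to_feature_matrix(records, id_key="id"):
       """Turn a list of descriptor dicts into (ids, column names, rows)."""
       def is_number(value):
           # bool is a subclass of int, so exclude it explicitly.
           return isinstance(value, (int, float)) and not isinstance(value, bool)

       # Union of numeric keys across records, sorted for a stable column order.
       names = sorted({k for r in records for k, v in r.items()
                       if k != id_key and is_number(v)})
       ids = [r[id_key] for r in records]
       # Missing values become NaN so every row has the same width.
       rows = [[float(r[n]) if is_number(r.get(n)) else math.nan for n in names]
               for r in records]
       return ids, names, rows

   # Made-up records; the names and values are not real pepkit output.
   records = [
       {"id": "pep1", "length": 4, "charge": -2.0},
       {"id": "pep2", "length": 9},
   ]
   ids, names, rows = to_feature_matrix(records)
   print(names)
   print(rows)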

See also
--------

- :doc:`getting_started`: quickstart example that chains standardization → descriptors
- :doc:`chem`: full chemistry module overview
- :doc:`api`: full API reference