The canprot package calculates chemical metrics of proteins from amino acid compositions. This vignette was compiled on 2024-03-28 with canprot version 2.0.0.

Previous vignette: Introduction to canprot

More details on chemical metrics

Carbon oxidation state (ZC) is calculated from elemental ratios, but stoichiometric hydration state (nH2O) depends on a specific choice of thermodynamic components (or basis species). In canprot, nH2O is calculated from a theoretical reaction to form proteins from the following set of basis species, abbreviated as QEC:

• glutamine (Q)
• glutamic acid (E)
• cysteine (C)
• H2O
• O2

The rationale for this choice is described in Dick et al. (2020) and Dick (2022).

To see how it works, consider the formation reaction of alanylglycine, which can be written using functions in the CHNOSZ package:

``CHNOSZ::basis("QEC")``
``````##           C  H N O S ispecies logact state
## C5H10N2O3 5 10 2 3 0     1723   -3.2    aq
## C5H9NO4   5  9 1 4 0     1724   -4.5    aq
## C3H7NO2S  3  7 1 2 1     1721   -3.6    aq
## H2O       0  2 0 1 0        1    0.0   liq
## O2        0  0 0 2 0     2679  -80.0   gas``````
``CHNOSZ::subcrt("alanylglycine", 1)\$reaction``
``````##      coeff          name   formula state ispecies model
## 2665     1 alanylglycine C5H10N2O3    cr     2665   CGL
## 1723    -1     glutamine C5H10N2O3    aq     1723   HKF``````

As it turns out, alanylglycine has the same elemental formula as glutamine, which is one of the basis species, so there are no other basis species in the reaction. That includes H2O, so we can deduce that nH2O is zero.

Let’s do the calculation with the `nH2O()` function to see this result. We have to specify `terminal_H2O = 1` in order to account for the terminal -H and -OH groups on alanlglycine. As described below, the default for this setting is 0 because, most of the time, we want to measure the per-amino-acid contributions for proteins.

``````AG <- data.frame(Ala = 1, Gly = 1)
nH2O(AG, terminal_H2O = 1)``````
``## [1] 0``

Now let’s try an actual protein, chicken egg-white lysozome, which has the name LYSC_CHICK in UniProt with accession number P00698. The amino acid compositions of this and selected other proteins are available in the CHNOSZ package. This gets the amino acid composition and prints the protein length and ZC and nH2O:

``````AA <- CHNOSZ::pinfo(CHNOSZ::pinfo("LYSC_CHICK"))
plength(AA)``````
``````##   6
## 129``````
``Zc(AA)``
``````##          6
## 0.01631321``````
``nH2O(AA)``
``````##          6
## -0.8868217``````

By the way, lysozyme (a secreted protein) is highly oxidized compared to cytoplasmic proteins, most of which have negative ZC.

To see where the value of nH2O comes from, we can write the formation reaction of LYSC_CHICK from the QEC basis species. Recall that we set the basis species above with `CHNOSZ::basis("QEC")`; that setting is used here to balance the reaction.

``(reaction <- CHNOSZ::subcrt("LYSC_CHICK", 1)\$reaction) # print the reaction``
``````##      coeff          name             formula state ispecies
## 3487   1.0    LYSC_CHICK C613H959N193O185S10    aq     3487
## 1723 -66.4     glutamine           C5H10N2O3    aq     1723
## 1724 -50.2 glutamic acid             C5H9NO4    aq     1724
## 1721 -10.0      cysteine            C3H7NO2S    aq     1721
## 1    113.4         water                 H2O   liq        1
## 2679  60.8        oxygen                  O2   gas     2679
##               model
## 3487            HKF
## 1723            HKF
## 1724            HKF
## 1721            HKF
## 1    water.SUPCRT92
## 2679            CGL``````
``(H2Ocoeff <- with(reaction, coeff[name == "water"])) # print the coefficient on H2O``
``## [1] 113.4``

That says 113.4, but the value for nH2O above was -0.8868217. What happened? Let’s step through the logic:

• The reaction shows that 113.4 water molecules are released (the coefficient is positive) in the theoretical formation of one molecule of the protein.
• The protein has terminal -H and -OH groups. This means that shorter proteins are “wetter”, and we don’t want that complication. So, we take one H2O away from the protein and add it to the number of water molecules released, giving 113.4 + 1 = 114.4 H2O. This step was not performed by Dick et al. (2020) but is the default in canprot since version 2.0.0.
• We use the opposite of this value because we are counting how many H2O units go into forming the protein.
• Finally, we divide by the length of the protein to get the stoichiometric hydration state normalized per residue: -114.4 / 129 = -0.8868217

A general observation: This is a theoretical reaction in terms of thermodynamic components, so we are not dealing with biochemical mechanisms here. That’s one reason for calling this approach geochemical biology.

Implementation details

To save time, `nH2O()` does not calculate the formation reaction for each protein but instead uses precomputed values of nH2O for each amino acid. The two methods give equivalent results, as described in Dick et al. (2020).

Similarly, `Zc()` uses precomputed values of ZC and nC (number of carbon atoms) for each amino acid. NOTE: Calculating ZC of proteins from amino acid frequencies (i.e. abundances or counts in a protein sequence) requires weighting the amino-acid ZC by the number of carbon atoms in each amino acid, in addition to weighting by amino acid frequency. Using the unweighted mean of ZC of amino acids leads to artificially higher values for proteins.

GRAVY, pI, and molecular weight

There are also functions for calculating the grand average of hydropathy (GRAVY) and isoelectric point (pI) of proteins. Below, values for representative proteins are compared with results from the ProtParam tool (Gasteiger et al., 2005) in UniProt.

We first look up a few proteins in CHNOSZ’s list of proteins, then get the amino acid compositions.

``````iprotein <- CHNOSZ::pinfo(c("LYSC_CHICK", "RNAS1_BOVIN", "AMYA_PYRFU", "CSG_HALJP"))
AAcomp <- CHNOSZ::pinfo(iprotein)``````

Calculate GRAVY and compare it to reference values from https://web.expasy.org/protparam/.

``````G_calc <- GRAVY(AAcomp)
# https://web.expasy.org/cgi-bin/protparam/protparam1?P00698@19-147@
# https://web.expasy.org/cgi-bin/protparam/protparam1?P61823@27-150@
# https://web.expasy.org/cgi-bin/protparam/protparam1?P49067@2-649@
G_ref <- c(-0.472, -0.663, -0.325)
stopifnot(all.equal(round(G_calc[1:3], 3), G_ref, check.attributes = FALSE))``````

Calculate pI and compare it to reference values.

``````pI_calc <- pI(AAcomp)
# Reference values calculated with ProtParam
# LYSC_CHICK: residues 19-147 (sequence v1)
# RNAS1_BOVIN: residues 27-150 (sequence v1)
# AMYA_PYRFU: residues 2-649 (sequence v2)
# CSG_HALJP: residues 35-862 (sequence v1)
pI_ref <- c(9.32, 8.64, 5.46, 3.37)
stopifnot(all.equal(as.numeric(pI_calc), pI_ref))``````

Calculate molecular weight (`MW`) and compare it to reference values

``````# Per-residue molecular weight multiplied by number of residues
MWcalc <- MW(AAcomp) * plength(AAcomp)