Tutorial
In this section, the user can find a tutorial on how to use EpiK for inference of sequence-function relationships on a small dataset, how to interpret the inferred hyperparameters of the different kernels, and how to use EpiK to make calibrated phenotypic predictions in unobserved sequences.
Import required libraries and functions
[1]:
import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr
from epik.utils import encode_seqs, split_training_test
from epik.model import EpiK
from epik.kernel import ConnectednessKernel, ExponentialKernel, JengaKernel
[KeOps] Warning : Cuda libraries were not detected on the system or could not be loaded ; using cpu only mode
Load SMN1 data
We will use a previously published dataset, in which the activity of nearly every possible GU/GC 5´splice site sequence was measured in parallel in a single experiment (Wong et al. 2018). The measured phenotype y is the percent spliced in, this is, the percentage of times exon 7 is included in the mature mRNA. The assay was performed in different biological replicates, providing a measure of uncertainty of our estimates given by y_var, as
previously analyzed (Zhou et al. 2022)
We start by loading the data from the pre-processed CSV file
[2]:
data = pd.read_csv('../data/smn1.csv', index_col=0)
data
[2]:
| y | y_var | |
|---|---|---|
| seq | ||
| AAACAAAA | 0.223881 | 0.001738 |
| AAACAAAC | 0.514961 | 0.009195 |
| AAACAAAG | 0.208009 | 0.001500 |
| AAACAAAU | 0.235378 | 0.002109 |
| AAACAACA | 0.357162 | 0.004423 |
| ... | ... | ... |
| UUUUUUGU | 0.199017 | 0.001373 |
| UUUUUUUA | 0.074571 | 0.000536 |
| UUUUUUUC | 0.193181 | 0.001294 |
| UUUUUUUG | 0.121894 | 0.000536 |
| UUUUUUUU | 0.152185 | 0.000803 |
30732 rows × 2 columns
Prepare data for model fitting
EpiK models assume that sequences are provided in a one-hot encoding X, so we provide the utility function encode_seqs to generate this encoding from the raw sequences
[3]:
X = encode_seqs(data.index.values, alphabet='ACGU')
y = data['y'].values
y_var = data['y_var'].values
We next split the data into training and test subsets with split_training_test. To keep this tutorial as simple and fast to run as possible, we will use only a small subset of the dataset for both training and testing.
[4]:
splits = split_training_test(X, y, y_var, ptrain=0.075, dtype=torch.float32)
train_x, train_y, test_x, test_y, train_y_var = splits
# Subsample test data
idx = np.random.uniform(size=test_y.shape[0]) < 0.05
test_x, test_y = test_x[idx, :], test_y[idx]
Fit Connectedness model regression
To define an EpiK model, we first need to specify the kernel function that we want to use for Gaussian process regression. In this case, we will use the ConnectednessKernel, in which the predictability of mutational effects decreases depending on the sites at which two sequences differ.
As the number of parameters is equal to sequence length, this kernel provides a good compromise in datasets with relatively few measurements, such as this one. For every kernel class, we need to specify the configuration of sequence space, consisting of the number of alleles n_alleles and sequence length seq_length.
[5]:
kernel = ConnectednessKernel(n_alleles=4, seq_length=8)
model = EpiK(kernel, track_progress=True)
model.set_data(train_x, train_y, train_y_var)
After defining the EpiK model, we load the training data in the model and optimize the kernel hyperparameters by maximizing the marginal likelihood using the Adam optimizer for a number of predefined iterations. The argument track_progress=True allows us to follow the progress of the optimization and track how the loss function decreases
[6]:
model.fit(n_iter=100, learning_rate=0.1)
Optimizing hyperparameters: 100%|██████████| 100/100 [00:14<00:00, 6.91it/s, MLL=-8918.879, Mem(alloc/res)=0.00/0.00MB]
Make predictions on test sequences
Once we have optimized the hyperparameters of the kernel function, we can make predictions by computing the posterior distribution over the sequence-function relationship evaluated at sequences of interest. This allows us to obtain point estimates test_y_pred that maximize this posterior probability. However, we can additionally compute the posterior variance test_y_var quantifying the uncertainty of the predictions by setting calc_variance=True.
Lets now make predictions in the held-out test sequences.
[7]:
pred = model.predict(test_x, calc_variance=True)
pred
[7]:
| coef | stderr | lower_ci | upper_ci | |
|---|---|---|---|---|
| 0 | 2.116530 | 12.501842 | -22.887154 | 27.120216 |
| 1 | 1.571670 | 12.174002 | -22.776333 | 25.919674 |
| 2 | -0.843107 | 12.838286 | -26.519680 | 24.833466 |
| 3 | 3.053030 | 12.459997 | -21.866964 | 27.973024 |
| 4 | -2.211174 | 12.297050 | -26.805273 | 22.382925 |
| ... | ... | ... | ... | ... |
| 1492 | 0.437592 | 12.282001 | -24.126410 | 25.001593 |
| 1493 | -2.043804 | 12.466360 | -26.976524 | 22.888916 |
| 1494 | -3.416660 | 12.331523 | -28.079706 | 21.246386 |
| 1495 | -0.705414 | 12.133599 | -24.972612 | 23.561785 |
| 1496 | 1.321175 | 12.068440 | -22.815706 | 25.458055 |
1497 rows × 4 columns
Once we have the predictions, we can compare them with the measurements for these held-out sequences. We can evaluate the accuracy of the point estimates by computing the \(R^2\) and uncertainty calibration by calculating the percentage of times the measurement is within the 95% posterior predictive interval.
[8]:
r2 = pearsonr(pred['coef'], test_y)[0] ** 2
sd = pred['stderr']
obs = test_y.numpy()
calibration = ((obs > pred['coef'] - 2 * sd) & (obs < pred['coef'] + 2 * sd)).mean()
[9]:
fig, axes = plt.subplots(1, 1, figsize=(4, 3.5))
lims = (-20, 200)
axes.scatter(pred['coef'], obs, s=5, c='black', lw=0, alpha=0.5)
axes.plot(lims, lims, lw=0.75, linestyle='--', alpha=0.5)
axes.grid(alpha=0.2)
axes.text(0.05, 0.98, '$R^2$=' + '{:.2f}\nCalibration = {:.2f}'.format(r2, calibration),
transform=axes.transAxes, ha='left', va='top')
axes.set(xlabel='Predicted PSI (%)', xlim=lims,
ylabel='Measured PSI (%)', ylim=lims, aspect='equal')
[9]:
[Text(0.5, 0, 'Predicted PSI (%)'),
(-20.0, 200.0),
Text(0, 0.5, 'Measured PSI (%)'),
(-20.0, 200.0),
None]
This scatterplot demonstrates the relatively good performance of the model in predicting the phenotype of held-out sequences in terms of \(R^2\). Importantly, the model provides well-calibrated uncertainties, as the posterior distribution includes the actual measurements the expected number of times. This offers an estimate of how confident we can be about predictions for specific sequences with unknown phenotypes, aiding in decision-making.
Compare different models
When analyzing a new dataset, we often want to fit different models and compare them. This is easy to do by just providing a different kernel to the EpiK model. Here, we fit models with kernels of increasing complexity and compute the \(R^2\) of the predictions in held out data to track model performance.
[10]:
kernels = [ExponentialKernel, ConnectednessKernel, JengaKernel]
labels = ['Exponential', 'Connectedness', 'Jenga']
r2s = []
for label, kernel_type in zip(labels, kernels):
kernel = kernel_type(n_alleles=4, seq_length=8)
model = EpiK(kernel, track_progress=True, max_n_lanczos_iterations=200)
model.set_data(train_x, train_y, train_y_var)
model.fit(n_iter=100, learning_rate=0.1)
test_y_pred = model.predict(test_x, calc_variance=False)['coef']
r2s.append(pearsonr(test_y_pred, test_y)[0] ** 2)
results = pd.DataFrame({'kernel': labels, 'r2': r2s})
results
Optimizing hyperparameters: 100%|██████████| 100/100 [00:39<00:00, 2.53it/s, MLL=-9160.443, Mem(alloc/res)=0.00/0.00MB]
Optimizing hyperparameters: 100%|██████████| 100/100 [00:44<00:00, 2.26it/s, MLL=-8903.045, Mem(alloc/res)=0.00/0.00MB]
Optimizing hyperparameters: 100%|██████████| 100/100 [00:46<00:00, 2.16it/s, MLL=-8276.521, Mem(alloc/res)=0.00/0.00MB]
[10]:
| kernel | r2 | |
|---|---|---|
| 0 | Exponential | 0.559659 |
| 1 | Connectedness | 0.577626 |
| 2 | Jenga | 0.653812 |
Thus, this analysis shows that mutations at different sites affect the predictability of other mutations in a different manner, given the improved performance of the ConnectednessKernel over the ExponentialKernel. However, the JengaKernel provides even better predictive power, suggesting that the decay in predictability of mutational effects also depends on the specific alleles that are mutated at each site.
Interpreting Jenga model parameters
Once we have settled with the Jenga model, we aim use the inferred hyperparameters of the kernel function to gain some insigths about the sequence-function relationship we are studying. We can extract the allele and site specific decay rates with the get_decay_rates method as follows
[11]:
alleles = ['A', 'C', 'G', 'U']
positions = ['-3', '-2', '-1', '+2', '+3', '+4', '+5', '+6']
delta = kernel.get_delta().detach().numpy()
delta = pd.DataFrame(delta, index=positions, columns=alleles)
delta.loc['+2', ['A', 'G']] = np.nan
delta
[11]:
| A | C | G | U | |
|---|---|---|---|---|
| -3 | 0.020853 | 0.025342 | 0.438904 | 0.170491 |
| -2 | 0.288113 | 0.039503 | 0.612494 | 0.129223 |
| -1 | 0.076758 | 0.081544 | 0.820145 | 0.430954 |
| +2 | NaN | 0.141537 | NaN | 0.713322 |
| +3 | 0.774562 | 0.136644 | 0.406120 | 0.128953 |
| +4 | 0.284240 | 0.052869 | 0.256599 | 0.123743 |
| +5 | 0.016301 | 0.034593 | 0.713966 | 0.014079 |
| +6 | 0.013091 | 0.012482 | 0.041129 | 0.073866 |
We can visualize these values in a heatmap
[13]:
fig, axes = plt.subplots(1, 1, figsize=(4.5, 2.))
axes.set_facecolor('lightgrey')
sns.heatmap(delta.T * 100, cmap='Blues', ax=axes,
cbar_kws={'label': 'Decay rate (%)'},
vmin=0, vmax=100)
axes.set(xlabel='5´ splice site position',
ylabel='Allele')
axes.set_yticklabels(alleles, rotation=0)
sns.despine(left=False, bottom=False, top=False, right=False)
The decay rate of a mutation is the percent decrease in the predictability of other mutatiob. These effects are constant and thus accumulate as discount coupons when two sequences differ by sevelal mutations. In the Jenga model, a mutation decreases the predictability of other mutations by accumulating the decay rates associated to the alleles that are exchanged. Next, we show examples of mutations, even at the same position, that have very different effects on the predictability of other mutations under this prior.
[14]:
d = (1 - (1 - delta.loc['-1', 'G']) * (1 - delta.loc['-1', 'U'])) * 100
print('For instance, a G-1U mutation is expected to decrease the predictability of other mutations by {:.2f}%'.format(d))
For instance, a G-1U mutation is expected to decrease the predictability of other mutations by 89.77%
[15]:
d = (1 - (1 - delta.loc['-1', 'A']) * (1 - delta.loc['-1', 'C'])) * 100
print('In contrast, a mutation at the same position, A-1C, is expected to decrease the predictability of other mutations by only {:.2f}%'.format(d))
In contrast, a mutation at the same position, A-1C, is expected to decrease the predictability of other mutations by only 15.20%
In general, we see that the positions close to the exon-intron boundary, which are known to be more relevant for the recognition of the 5´splice site during the splicing reaction, have stronger decay factors. But we also see that, within each position, alleles with weak decay factors are more likely to be generally exchangeable without affecting the effects of other mutations.