API Reference
Model
- class epik.model.EpiK(kernel, device='cpu', train_mean=False, train_noise=False, preconditioner_size=0, cg_tol=1.0, num_trace_samples=50, max_n_lanczos_iterations=50, track_progress=False)
Gaussian process regression model for inferring sequence-function relationships from experimental measurements using GPyTorch and KeOps backend.
- Parameters:
- kernelepik.kernel.Kernel
An instance of a kernel class that defines the covariance structure between pairs of sequences for Gaussian process regression.
- devicestr, optional
The device on which computations will be performed. Options are “cpu” or “cuda”. Default is “cpu”.
- train_meanbool, optional
Whether to optimize the mean function of the Gaussian Process. By default, it assumes a zero-mean function. Default is False.
- train_noisebool, optional
Whether to learn an additional noise parameter for the Gaussian Process. By default, it assumes that the provided error estimates are reliable. Default is False.
- preconditioner_sizeint, optional
The size of the preconditioner used to accelerate conjugate gradient convergence. By default, no preconditioner is computed. Default is 0.
- cg_tolfloat, optional
The tolerance level for the conjugate gradient solver. This parameter controls the precision of the solver. Lower values result in higher precision but may increase computation time. Default is 1.0.
- num_trace_samplesint, optional
The number of samples used to estimate the trace of a matrix during computations. Increasing this value improves the accuracy of the trace estimation but increases computational cost. Default is 50.
- max_n_lanczos_iterationsint, optional
The maximum number of Lanczos iterations to perform during matrix decompositions. Higher values may improve accuracy but increase computation time. Default is 50.
- track_progressbool, optional
Whether to display a progress bar during model fitting. Default is False.
Methods
fit([n_iter, learning_rate, optimizer])Optimize model hyperparameters by maximizing the marginal likelihood.
load(fpath, **kwargs)Load model parameters from a file.
make_contrasts(contrast_matrix, X[, ...])Compute phenotypic contrasts across sets of genotypes using the Gaussian process model.
predict(X[, calc_variance, labels])Function to make phenotypic predictions under the Gaussian process model
predict_epistatic_coeffs(seq0, alleles[, ...])Compute epistatic coefficients across sets of genotypes using the Gaussian process model.
predict_mut_effects(seq0, alleles[, ...])Predict the effects of single mutations on the phenotype using the Gaussian process model.
save(fpath)Save the model parameters to a file for future use.
set_data(X, y[, y_var])Set the training data for the model.
simulate(X[, n, sigma2])Sample random sequence-function relationships from the prior evaluated at the input sequences
- fit(n_iter=100, learning_rate=0.1, optimizer='Adam')
Optimize model hyperparameters by maximizing the marginal likelihood.
This process adjusts kernel parameters, as well as optional mean and additional noise parameters, to improve the model’s performance.
- Parameters:
- n_iterint, optional (default=100)
Number of iterations for the optimization process.
- learning_ratefloat, optional (default=0.1)
Learning rate for the optimizer.
- optimizerstr, optional (default=”Adam”)
Optimizer to use for hyperparameter optimization. Options are “Adam” or “SGD”.
- Raises:
- ValueError
If the specified optimizer is not recognized.
- load(fpath, **kwargs)
Load model parameters from a file.
- Parameters:
- fpathstr
Path to the file containing the stored model parameters.
- **kwargsdict, optional
Additional arguments to pass to torch.load for loading the parameters.
- make_contrasts(contrast_matrix, X, calc_variance=False)
Compute phenotypic contrasts across sets of genotypes using the Gaussian process model.
- Parameters:
- contrast_matrixtorch.Tensor of shape (n_contrasts, n_sequences)
A tensor representing the linear combinations of sequences encoded by X to compute the posterior distribution of the contrasts.
- Xtorch.Tensor of shape (n_sequences, n_features)
A tensor containing the one-hot encoded sequences for which predictions are to be made.
- calc_variancebool, optional (default=False)
If True, computes the posterior (co)-variance in addition to the posterior mean.
- Returns:
- outputtorch.Tensor or tuple of torch.Tensor
If calc_variance=False, returns a tensor containing the phenotypic predictions for the desired sequences. If calc_variance=True, returns a tuple where the first element is the phenotypic predictions and the second element is the covariance matrix of the posterior contrasts.
- predict(X, calc_variance=False, labels=None)
Function to make phenotypic predictions under the Gaussian process model
- Parameters:
- Xtorch.Tensor of shape (n_sequences, n_features)
Tensor containing the one-hot encoding of the sequences to make predictions
- calc_variancebool (False)
Option to compute the posterior variance in addition to the posterior mean reported by default
- labelsarray-like of shape (n_sequences,) or None
Sequence labels to use as rownames in the output pd.DataFrame
- Returns:
- outputpd.DataFrame of shape (n_sequences, 1 or 4)
DataFrame containing phenotypic predictions at the desired sequences. If calc_variance=True, posterior standard deviations and 95% credible interval bounds are added
- predict_epistatic_coeffs(seq0, alleles, calc_variance=False, max_size=100)
Compute epistatic coefficients across sets of genotypes using the Gaussian process model.
- Parameters:
- seq0str
The reference sequence for which epistatic coefficients are to be predicted.
- alleleslist of str
A list of possible alleles for each position in the sequence.
- calc_variancebool, optional (default=False)
If True, computes the posterior (co)-variance in addition to the posterior mean.
- max_sizeint, optional (default=100)
The maximum number of contrasts to process in a single batch. Larger values may increase memory usage.
- Returns:
- pd.DataFrame
A DataFrame containing the predicted epistatic coefficients for the desired sequences. If calc_variance=True, the DataFrame includes posterior standard deviations and 95% credible interval bounds.
- predict_mut_effects(seq0, alleles, calc_variance=False, max_size=100)
Predict the effects of single mutations on the phenotype using the Gaussian process model.
- Parameters:
- seq0str
The reference sequence for which single mutation effects are to be predicted.
- alleleslist of str
A list of possible alleles for each position in the sequence.
- calc_variancebool, optional (default=False)
If True, computes the posterior variance in addition to the posterior mean.
- max_sizeint, optional (default=100)
The maximum number of contrasts to process in a single batch. Larger values may increase memory usage.
- Returns:
- pd.DataFrame
A DataFrame containing the predicted effects of single mutations on the phenotype. If calc_variance=True, the DataFrame includes posterior standard deviations and 95% credible interval bounds.
- save(fpath)
Save the model parameters to a file for future use.
- Parameters:
- fpathstr
The file path where the model parameters will be saved.
- set_data(X, y, y_var=None)
Set the training data for the model.
- Parameters:
- Xtorch.Tensor
A tensor of shape (n_sequences, n_features) containing the one-hot encoded input sequences.
- ytorch.Tensor
A tensor of shape (n_sequences,) containing the phenotypic measurements corresponding to each sequence in X.
- y_vartorch.Tensor, optional
A tensor of shape (n_sequences,) representing the variance of the measurements in y. If None, it is assumed that there is no uncertainty in the measurements.
- simulate(X, n=1, sigma2=0.0001)
Sample random sequence-function relationships from the prior evaluated at the input sequences
- Parameters:
- Xtorch.Tensor of shape (n_sequence, n_features)
Tensor containing the one-hot encoding of the sequences to make predictions
- nint (1)
Number of sequence-function relationships to sample from the prior
- sigma2float (1e-4)
Additional random noise to add to the simulated landscapes
- Returns:
- ytorch.Tensor of shape (n_sequences, n)
Tensor containing the simulated landscapes evaluated in the input sequences
Kernels
- class epik.kernel.AdditiveKernel(n_alleles, seq_length, log_lambdas0=None, **kwargs)
Additive kernel for functions on sequence space.
This kernel computes the covariance between two sequences as a linear function of the Hamming distance separating them. The parameters are derived from the variance contributions of the constant and additive components.
\[K(x, y) = c_0 + c_1 \cdot d(x, y)\]where:
\[c_0 = \lambda_0 + \ell \cdot (\alpha - 1) \cdot \lambda_1\]\[c_1 = -\alpha \cdot \lambda_1\]Here, \(\lambda_0\) and \(\lambda_1\) are variance parameters, \(\ell\) is the sequence length, and \(\alpha\) is the number of alleles.
When applied to one-hot encoded sequence embeddings \(x_1\) and \(x_2\), this kernel returns a linear operator that facilitates efficient matrix-vector products without explicitly constructing the full covariance matrix.
Methods
forward(x1, x2[, diag])Computes the covariance between \(\mathbf x_1\) and \(\mathbf x_2\).
- forward(x1, x2, diag=False, **kwargs)
Computes the covariance between \(\mathbf x_1\) and \(\mathbf x_2\). This method should be implemented by all Kernel subclasses.
- Parameters:
x1 – First set of data (… x N x D).
x2 – Second set of data (… x M x D).
diag – Should the Kernel compute the whole kernel, or just the diag? If True, it must be the case that x1 == x2. (Default: False.)
last_dim_is_batch – If True, treat the last dimension of x1 and x2 as another batch dimension. (Useful for additive structure over the dimensions). (Default: False.)
- Returns:
The kernel matrix or vector. The shape depends on the kernel’s evaluation mode:
full_covar: … x N x M
full_covar with last_dim_is_batch=True: … x K x N x M
diag: … x N
diag with last_dim_is_batch=True: … x K x N
- class epik.kernel.PairwiseKernel(n_alleles, seq_length, log_lambdas0=None, **kwargs)
Pairwise kernel for functions on sequence space.
The covariance between two sequences is quadratic in the Hamming distance that separates them, with coefficients determined by the variance explained by the constant, additive and pairwise components.
\[K(x, y) = c_0 + c_1 \cdot d(x, y) + c_2 \cdot d(x, y)^2\]These coefficients result from expanding the Krawtchouk polynomials of order 2, as in the additive kernel, and allows computing the covariance matrix easily for any number of sequences of any length.
Methods
forward(x1, x2[, diag])Computes the covariance between \(\mathbf x_1\) and \(\mathbf x_2\).
- forward(x1, x2, diag=False, **kwargs)
Computes the covariance between \(\mathbf x_1\) and \(\mathbf x_2\). This method should be implemented by all Kernel subclasses.
- Parameters:
x1 – First set of data (… x N x D).
x2 – Second set of data (… x M x D).
diag – Should the Kernel compute the whole kernel, or just the diag? If True, it must be the case that x1 == x2. (Default: False.)
last_dim_is_batch – If True, treat the last dimension of x1 and x2 as another batch dimension. (Useful for additive structure over the dimensions). (Default: False.)
- Returns:
The kernel matrix or vector. The shape depends on the kernel’s evaluation mode:
full_covar: … x N x M
full_covar with last_dim_is_batch=True: … x K x N x M
diag: … x N
diag with last_dim_is_batch=True: … x K x N
- class epik.kernel.VarianceComponentKernel(n_alleles, seq_length, log_lambdas0=None, max_k=None, **kwargs)
Variance Component Kernel for functions on sequence space.
This kernel computes the covariance between two sequences using Krawtchouk polynomials.
\[K(x, y) = \sum_{k=0}^{\ell} \lambda_k \cdot K_k(x, y)\]To ensure differentiability in PyTorch, the covariance for each distance class is precomputed, and a kernel interpolation approach is used to compute the covariance between input sequence pairs.
Methods
forward(x1, x2[, diag])Computes the covariance between \(\mathbf x_1\) and \(\mathbf x_2\).
- forward(x1, x2, diag=False, **kwargs)
Computes the covariance between \(\mathbf x_1\) and \(\mathbf x_2\). This method should be implemented by all Kernel subclasses.
- Parameters:
x1 – First set of data (… x N x D).
x2 – Second set of data (… x M x D).
diag – Should the Kernel compute the whole kernel, or just the diag? If True, it must be the case that x1 == x2. (Default: False.)
last_dim_is_batch – If True, treat the last dimension of x1 and x2 as another batch dimension. (Useful for additive structure over the dimensions). (Default: False.)
- Returns:
The kernel matrix or vector. The shape depends on the kernel’s evaluation mode:
full_covar: … x N x M
full_covar with last_dim_is_batch=True: … x K x N x M
diag: … x N
diag with last_dim_is_batch=True: … x K x N
- class epik.kernel.ExponentialKernel(n_alleles, seq_length, log_var0=None, theta0=None, **kwargs)
Exponential Kernel for functions on sequence space.
This kernel computes the covariance between two sequences as a geometrically decaying function of the Hamming distance separating them.
\[K(x, y) = \left( \frac{ 1-\rho }{ 1 + (\alpha - 1)\rho } \right)^d\]where:
\(\rho\) is a parameter controlling the decay rate.
\(\alpha\) is the number of alleles.
\(d\) is the Hamming distance between sequences \(x\) and \(y\).
Methods
forward(x1, x2[, diag])Computes the covariance between \(\mathbf x_1\) and \(\mathbf x_2\).
- forward(x1, x2, diag=False, **kwargs)
Computes the covariance between \(\mathbf x_1\) and \(\mathbf x_2\). This method should be implemented by all Kernel subclasses.
- Parameters:
x1 – First set of data (… x N x D).
x2 – Second set of data (… x M x D).
diag – Should the Kernel compute the whole kernel, or just the diag? If True, it must be the case that x1 == x2. (Default: False.)
last_dim_is_batch – If True, treat the last dimension of x1 and x2 as another batch dimension. (Useful for additive structure over the dimensions). (Default: False.)
- Returns:
The kernel matrix or vector. The shape depends on the kernel’s evaluation mode:
full_covar: … x N x M
full_covar with last_dim_is_batch=True: … x K x N x M
diag: … x N
diag with last_dim_is_batch=True: … x K x N
- class epik.kernel.ConnectednessKernel(n_alleles, seq_length, log_var0=None, theta0=None, **kwargs)
Connectedness Kernel for functions on sequence space.
This kernel computes the covariance between two sequences where mutations at different sites have different effects on the predictability of other mutations
\[K(x, y) = \prod_p^{\ell}\frac{1-\rho_p}{1 + (\alpha - 1)\rho_p}\]where:
\(\rho_p\) is a parameter controlling the decay rate of site \(p\).
\(\alpha\) is the number of alleles.
\(\ell\) is the sequence length.
Methods
forward(x1, x2[, diag])Computes the covariance between \(\mathbf x_1\) and \(\mathbf x_2\).
Compute the decay factors of the kernel.
- forward(x1, x2, diag=False, **kwargs)
Computes the covariance between \(\mathbf x_1\) and \(\mathbf x_2\). This method should be implemented by all Kernel subclasses.
- Parameters:
x1 – First set of data (… x N x D).
x2 – Second set of data (… x M x D).
diag – Should the Kernel compute the whole kernel, or just the diag? If True, it must be the case that x1 == x2. (Default: False.)
last_dim_is_batch – If True, treat the last dimension of x1 and x2 as another batch dimension. (Useful for additive structure over the dimensions). (Default: False.)
- Returns:
The kernel matrix or vector. The shape depends on the kernel’s evaluation mode:
full_covar: … x N x M
full_covar with last_dim_is_batch=True: … x K x N x M
diag: … x N
diag with last_dim_is_batch=True: … x K x N
- get_delta()
Compute the decay factors of the kernel.
The decay factors represent the percentage decrease in predictability when introducing a specific mutation.
- Returns:
- deltatorch.Tensor
A tensor containing the decay factors.
- class epik.kernel.JengaKernel(n_alleles, seq_length, log_var0=None, theta0=None, **kwargs)
Jenga Kernel for functions on sequence space.
This kernel computes the covariance between two sequences as the product of allele- and site-specific factors at the alleles where they differ.
\[K(x, y) = \prod_{p: x_p \neq y_p} \sqrt{\frac{1-\rho_p}{1 + \frac{1-\pi_p^{x_p}}{\pi_p^{x_p}}\rho_p}} \sqrt{\frac{1-\rho_p}{1 + \frac{1-\pi_p^{y_p}}{\pi_p^{y_p}}\rho_p}}\]where:
\(\rho_p\) is a parameter controlling the decay rate at site \(p\).
\(\pi_p^{x_p}\) and \(\pi_p^{y_p}\) are site and allele specific probabilities.
\(\ell\) is the sequence length.
Methods
forward(x1, x2[, diag])Computes the covariance between \(\mathbf x_1\) and \(\mathbf x_2\).
Compute the decay factors of the kernel.
Compute the mutation-specific decay factors of the kernel.
- forward(x1, x2, diag=False, **kwargs)
Computes the covariance between \(\mathbf x_1\) and \(\mathbf x_2\). This method should be implemented by all Kernel subclasses.
- Parameters:
x1 – First set of data (… x N x D).
x2 – Second set of data (… x M x D).
diag – Should the Kernel compute the whole kernel, or just the diag? If True, it must be the case that x1 == x2. (Default: False.)
last_dim_is_batch – If True, treat the last dimension of x1 and x2 as another batch dimension. (Useful for additive structure over the dimensions). (Default: False.)
- Returns:
The kernel matrix or vector. The shape depends on the kernel’s evaluation mode:
full_covar: … x N x M
full_covar with last_dim_is_batch=True: … x K x N x M
diag: … x N
diag with last_dim_is_batch=True: … x K x N
- get_delta()
Compute the decay factors of the kernel.
The decay factors represent the percentage decrease in predictability when introducing a specific mutation.
- Returns:
- deltatorch.Tensor
A tensor containing the decay factors.
- get_mutation_delta()
Compute the mutation-specific decay factors of the kernel.
The decay factors represent the percentage decrease in predictability when introducing a specific mutation.
- Returns:
- deltatorch.Tensor
A tensor containing the decay factors for each possible mutation.
- class epik.kernel.GeneralProductKernel(n_alleles, seq_length, theta0=None, **kwargs)
General Product Kernel for sequence data.
This kernel computes the covariance between two sequences as the product of site-specific kernels, where each site kernel is parameterized by the Cholesky factor of a correlation matrix.
\[K(x, y) = \prod_{p=1}^\ell K_p(x_p, y_p),\]where:
\[K_p = L L^T,\]and \(L\) is the Cholesky factor of the correlation matrix, parameterized using the LKJ transform.
Methods
forward(x1, x2[, diag])Computes the covariance between \(\mathbf x_1\) and \(\mathbf x_2\).
Compute the decay factors of the kernel.
- forward(x1, x2, diag=False, **kwargs)
Computes the covariance between \(\mathbf x_1\) and \(\mathbf x_2\). This method should be implemented by all Kernel subclasses.
- Parameters:
x1 – First set of data (… x N x D).
x2 – Second set of data (… x M x D).
diag – Should the Kernel compute the whole kernel, or just the diag? If True, it must be the case that x1 == x2. (Default: False.)
last_dim_is_batch – If True, treat the last dimension of x1 and x2 as another batch dimension. (Useful for additive structure over the dimensions). (Default: False.)
- Returns:
The kernel matrix or vector. The shape depends on the kernel’s evaluation mode:
full_covar: … x N x M
full_covar with last_dim_is_batch=True: … x K x N x M
diag: … x N
diag with last_dim_is_batch=True: … x K x N
- get_delta()
Compute the decay factors of the kernel.
The decay factors represent the percentage decrease in predictability when introducing a specific mutation.
- Returns:
- deltatorch.Tensor
A tensor containing the decay factors.
Utilities
- epik.utils.split_training_test(X, y, y_var=None, ptrain=0.8, dtype=None)
Splits a dataset into training and test subsets
- Parameters:
- Xtorch.Tensor of shape (n_sequences, n_features)
Tensor with the numerical encoding of the input sequences
- ytorch.Tensor of shape (n_sequence,)
Tensor containing the phenotypic measurements for each sequence in X
- y_vartorch.Tensor of shape (n_sequence,) or None
If y_var=None it is assumed that there is no uncertainty in the measurements. Otherwise, Tensor containing the variance of the measurements in y.
- ptrainfloat (0.8)
Proportion of the dataset to keep as training set
- dtypetorch.dtype (torch.float32)
dtype to use for storing the output tensors
- Returns:
- output: tuple of size (5,)
Tuple containing (train_X, train_y, test_X, test_y, test_y_var), where X is the matrix encoding the input sequences, y the associated measurements and y_var the variance of those measurements.
- epik.utils.encode_seqs(seqs, alphabet, encoding_type='one_hot', max_n=500)
Returns a torch.Tensor with the encoding of the provided sequences
- Parameters:
- seqsarray-like of shape (n_sequences,)
Array containing a list of sequences
- alphabetarray-like of shape (n_alleles, )
Iterable object with the list of alleles in the sequences
- encoding_type‘one_hot’ or ‘binary’
Type of encoding to use. If encoding_type=’one_hot’, sequences will be encoded under the one-hot encoding with one column for every allele and site. If encoding_type=’binary’ and there are only two possible alleles, then sequences will be encoded as binary, with a single column for every site taking values (-1, 1)
- max_nint (500)
Maximum number of features to allow in the encoding
- Returns:
- Xtorch.Tensor of shape (n_sequences, n_features)
Tensor with the numerical encoding of the input sequences