6. RMSF Baseline Models¶
In this package we rely heavily on the RMSF value as a means of selecting the residues that best separate the Agonist and Antagonist classes. To evaluate our residue selection techniques we need a way to quantify that separability.
For this reason we created some simple baseline models that can be used like any ML model (fit, predict) and from which we can calculate metrics such as accuracy.
The module contains an abstract base class called BaselineClassifier, which provides some basic helper methods for reading and preparing the RMSF values. Any subclass must implement the fit and predict methods.
We suggest reading this example to get familiar with the flow.
Warning
These models should not be used as final classification models because they are too simple. Their goal is to evaluate the residue selection techniques.
The module also includes a method for bootstrapping the dataset of ligands.
-
class
MDSimsEval.rmsf_baseline_models.
AggregatedResidues
(start, stop, rmsf_cache, method=<function mean>)¶ This simple model fits on the training data by calculating one aggregated RMSF value for the agonist set and one for the antagonist set. The RMSF is calculated on the given residues: for each class we first compute the average/median value of each residue, and then we aggregate those values again, ending with a single scalar per class.
-
start
¶ The starting frame of the window
- Type
int
-
stop
¶ The last frame of the window
- Type
int
-
method
¶ Function used to summarize each residue, e.g. np.mean, np.median
- Type
func
-
agonist_baseline
¶ The aggregated RMSF value of the agonist class
- Type
float
-
antagonist_baseline
¶ The aggregated RMSF value of the antagonist class
- Type
float
-
selected_residues
¶ A list of size total_residues, with True at the indexes of the selected residue ids
- Type
List[boolean]
-
fit
(train_analysis_actors, residues)¶ The function that initializes the aggregated RMSF value for each class.
- Parameters
train_analysis_actors –
{ "Agonists": List[AnalysisActor.class], "Antagonists": List[AnalysisActor.class] }
residues –
A list of residue ids that the model will use. To use all the residues, pass np.arange(290). Can also be a dictionary with residue ids as keys and a list of [start, stop] windows as values. This allows us to use a residue in more than one window, which is used in the residue cherry picking part of the thesis.
Example
# This will use the window saved as an attribute when we created the model object
residues = [10, 11, 27, 52, 83]
# or
residues = {
    115: [[0, 500], [2000, 2500]],
    117: [[2000, 2500]],
    81: [[2000, 2500]],
    78: [[1000, 1500], [1500, 2000]],
    254: [[0, 500], [1500, 2000]],
}
-
predict
(ligand)¶ Checks the distance of the unknown ligand from the agonist and antagonist averages and returns as a label the class that is closest.
- Parameters
ligand (AnalysisActorClass) – The ligand whose class we want to predict
- Returns
The class label, 1 for Agonist, 0 for Antagonist.
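The aggregation and nearest-baseline rule described for this class can be sketched in plain Python. This is only an illustration under stated assumptions, not the package's implementation: the helper names `aggregate_class` and `predict_label` are hypothetical, `statistics.mean` stands in for `np.mean`, the RMSF numbers are made up, and resolving a tie in favour of the agonist class is an arbitrary choice.

```python
from statistics import mean


def aggregate_class(rmsf_rows, method=mean):
    """Summarize each residue across the ligands, then aggregate again to one scalar."""
    per_residue = [method(column) for column in zip(*rmsf_rows)]
    return method(per_residue)


def predict_label(ligand_rmsf, agonist_baseline, antagonist_baseline, method=mean):
    """Return 1 (agonist) or 0 (antagonist), whichever class baseline is closer."""
    value = method(ligand_rmsf)
    # Assumption: ties go to the agonist class.
    return 1 if abs(value - agonist_baseline) <= abs(value - antagonist_baseline) else 0


# Hypothetical RMSF values: one row per ligand, one column per selected residue.
agonists = [[1.0, 2.0, 3.0], [1.2, 2.2, 3.2]]
antagonists = [[0.2, 0.4, 0.6], [0.4, 0.6, 0.8]]

agonist_baseline = aggregate_class(agonists)
antagonist_baseline = aggregate_class(antagonists)
label = predict_label([0.9, 2.1, 2.7], agonist_baseline, antagonist_baseline)
```

Because each class collapses to a single number, this baseline only tells us whether the selected residues differ in overall RMSF magnitude between the two classes.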
-
-
class
MDSimsEval.rmsf_baseline_models.
BaselineClassifier
(start, stop, rmsf_cache, method)¶ An abstract class that the baseline classifier models extend. The subclasses must implement the abstract methods fit and predict.
The class provides helper methods for stacking the RMSF of the selected residues for training and for getting the RMSF of the unknown ligand whose class we want to predict.
If you want to add a new baseline model, we suggest reading the other classes in this module to better understand how fit and predict are implemented and how the helper methods of this class are used.
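The fit / predict contract of an abstract base class can be illustrated with Python's `abc` module. The sketch below is a hypothetical stand-in and does not reproduce BaselineClassifier: the class names `BaselineSketch` and `AlwaysAgonist` are invented for illustration, and the helper methods and the rmsf_cache argument are left out.

```python
from abc import ABC, abstractmethod


class BaselineSketch(ABC):
    """Hypothetical stand-in for an abstract baseline; not the package's code."""

    def __init__(self, start, stop, method):
        self.start = start      # first frame of the window
        self.stop = stop        # last frame of the window
        self.method = method    # function summarizing each residue

    @abstractmethod
    def fit(self, train_data, residues):
        """Learn the class baselines from the training ligands."""

    @abstractmethod
    def predict(self, ligand):
        """Return 1 for Agonist, 0 for Antagonist."""


class AlwaysAgonist(BaselineSketch):
    """Trivial subclass that satisfies the contract."""

    def fit(self, train_data, residues):
        self.selected_residues = list(residues)
        return self

    def predict(self, ligand):
        return 1


model = AlwaysAgonist(start=0, stop=500, method=max).fit({}, [10, 11, 27])
```

Instantiating `BaselineSketch` directly raises a `TypeError`, which is how the abstract methods force every subclass to provide both `fit` and `predict`.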
-
class
MDSimsEval.rmsf_baseline_models.
KSDistance
(start, stop, rmsf_cache, method=<function mean>)¶ Used for evaluating residue selections based on their RMSF.
We calculate for each class the average/median RMSF of each residue. When we receive an unknown ligand, we use the Kolmogorov-Smirnov (K-S) test, which returns the “distance” between the distribution of the unknown ligand and that of the class. We calculate the distance to the other class too and assign the ligand to the class with the smallest distance.
-
start
¶ The starting frame of the window
- Type
int
-
stop
¶ The last frame of the window
- Type
int
-
method
¶ Function used to summarize each residue, e.g. np.mean, np.median
- Type
func
-
agonist_residue_baseline
¶ A list of the aggregated RMSF value of the agonists of each residue
- Type
List[float]
-
antagonist_residue_baseline
¶ A list of the aggregated RMSF value of the antagonists of each residue
- Type
List[float]
-
selected_residues
¶ A list of size total_residues, with True at the indexes of the selected residue ids
- Type
List[boolean]
-
fit
(train_analysis_actors, residues)¶ The function that initializes the aggregated RMSF value for each residue for each class.
- Parameters
train_analysis_actors –
{ "Agonists": List[AnalysisActor.class], "Antagonists": List[AnalysisActor.class] }
residues –
A list of residue ids that the model will use. To use all the residues, pass np.arange(290). Can also be a dictionary with residue ids as keys and a list of [start, stop] windows as values. This allows us to use a residue in more than one window, which is used in the residue cherry picking part of the thesis.
Example
# This will use the window saved as an attribute when we created the model object
residues = [10, 11, 27, 52, 83]
# or
residues = {
    115: [[0, 500], [2000, 2500]],
    117: [[2000, 2500]],
    81: [[2000, 2500]],
    78: [[1000, 1500], [1500, 2000]],
    254: [[0, 500], [1500, 2000]],
}
-
predict
(ligand)¶ Performs the K-S test against both classes. If the outcomes disagree (one class accepts the ligand and the other rejects it), we classify the ligand as the class that accepted it. Otherwise, we classify using the K-S distance.
- Parameters
ligand (AnalysisActorClass) – The ligand whose class we want to predict
- Returns
The class label, 1 for Agonist, 0 for Antagonist
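To make the K-S "distance" concrete, the sketch below computes the two-sample K-S statistic (the largest gap between two empirical CDFs) in plain Python and classifies by the smaller distance. It covers only the distance-based fallback; the accept / reject step based on the test outcome is omitted because the significance level is not stated here. The function names, the tie rule and the RMSF numbers are assumptions made for illustration.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    points = sorted(set(sample_a) | set(sample_b))
    gap = 0.0
    for x in points:
        cdf_a = sum(v <= x for v in sample_a) / len(sample_a)
        cdf_b = sum(v <= x for v in sample_b) / len(sample_b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def classify_by_ks(ligand_rmsf, agonist_residue_baseline, antagonist_residue_baseline):
    """Return 1 (agonist) or 0 (antagonist), whichever baseline distribution is closer."""
    d_agonist = ks_statistic(ligand_rmsf, agonist_residue_baseline)
    d_antagonist = ks_statistic(ligand_rmsf, antagonist_residue_baseline)
    # Assumption: ties go to the agonist class.
    return 1 if d_agonist <= d_antagonist else 0


# Hypothetical per-residue baselines and an unknown ligand's per-residue RMSF.
agonist_residue_baseline = [1.0, 1.5, 2.0, 2.5]
antagonist_residue_baseline = [0.2, 0.4, 0.6, 0.8]
ligand_rmsf = [1.1, 1.4, 2.1, 2.4]

label = classify_by_ks(ligand_rmsf, agonist_residue_baseline, antagonist_residue_baseline)
```

Unlike a single aggregated scalar, this comparison looks at the whole distribution of per-residue values, so two classes with similar averages but different spreads can still be told apart.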
-
-
class
MDSimsEval.rmsf_baseline_models.
MDStoKNN
(start, stop, rmsf_cache, metric, neighbors)¶ This is a model that classifies ligands based on their K nearest neighbors on an MDS 2D projection.
We first calculate the pairwise distances of all the ligands, creating an (agons_numb + antagons_numb) x (agons_numb + antagons_numb) matrix. We then perform a non-linear projection using MDS, transforming the matrix to an (agons_numb + antagons_numb) x 2 shape.
We provide the indexes of the ligands whose labels are known. These are considered our train set, and we fit a KNN model on them. We then provide the index of the ligand we want to predict. For example, we may have 20 agonists and 20 antagonists and want to predict an unknown ligand. The transformed shape of our 2D projection will be (20 + 20 + 1) x 2. The indexes [0, 1, …, 39] will form the train set and will be passed to the fit of the KNN. We then pass the index 40 to the predict method in order to predict the unknown ligand.
Note
This model is more complex than the others, which follow a straightforward approach similar to the sklearn models. The main idea is that the model knows all the data points a priori in order to create the 2D mapping. We then reveal to the model the labels of the known ligands in order to predict the unknown ligands.
- Parameters
start (int) – The starting frame of the window
stop (int) – The last frame of the window
metric – A method used to calculate the pairwise distance of the ligands. Possible metrics are K-S distance and Spearman’s r
neigh (KNeighborsClassifier) – The KNN model used for predicting the unknown ligands
-
choose_known_ligands
(agonist_inds, antagonists_inds)¶ Receives the indexes of the agonist and antagonist ligands that will form the train set and fits the KNN model using the transformed pairwise distances.
- Parameters
agonist_inds – The indexes of the train agonists
antagonists_inds – The indexes of the train antagonists
-
create_pairwise_distances
(analysis_actors_dict, residues)¶ Creates the pairwise distance matrix of all the input ligands that will be projected on a 2D manifold using MDS
- Parameters
analysis_actors_dict –
{ "Agonists": List[AnalysisActor.class], "Antagonists": List[AnalysisActor.class] }
residues – A list of residue ids that the model will use. To use all the residues, pass np.arange(290).
- Returns
A DataFrame of shape (agons_numb + antagons_numb) x (agons_numb + antagons_numb)
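The shape of the pairwise distance matrix can be illustrated with a small plain-Python sketch. It is not the package's implementation: a nested list stands in for the DataFrame, the K-S statistic is used as the metric, and the function names and RMSF values are hypothetical.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    points = sorted(set(sample_a) | set(sample_b))
    return max(
        abs(sum(v <= x for v in sample_a) / len(sample_a)
            - sum(v <= x for v in sample_b) / len(sample_b))
        for x in points
    )


def pairwise_distances(rmsf_by_ligand, metric=ks_statistic):
    """Square matrix whose entry [i][j] is the distance between ligand i and ligand j."""
    names = list(rmsf_by_ligand)
    return [[metric(rmsf_by_ligand[a], rmsf_by_ligand[b]) for b in names] for a in names]


# Hypothetical per-residue RMSF values for three ligands.
rmsf_by_ligand = {
    "agonist_0": [1.0, 1.5, 2.0, 2.5],
    "agonist_1": [1.1, 1.4, 2.1, 2.4],
    "antagonist_0": [0.2, 0.4, 0.6, 0.8],
}
matrix = pairwise_distances(rmsf_by_ligand)
```

The matrix is symmetric with zeros on the diagonal, which is the form a distance-based projection such as MDS expects as input.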
-
fit
(analysis_actors_dict, residues)¶ Create the pairwise distance matrix and perform MDS to transform it to a 2D matrix.
- Parameters
analysis_actors_dict –
{ "Agonists": List[AnalysisActor.class], "Antagonists": List[AnalysisActor.class] }
residues – A list of residue ids that the model will use. To use all the residues, pass np.arange(290).
-
predict
(ligand_ind)¶ Given the index of the unknown ligand, predicts its class using the fitted KNN model
- Parameters
ligand_ind – The index of the unknown ligand
- Returns
The class label, 1 for Agonist, 0 for Antagonist
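The index-based flow of this model (all points projected first, labels revealed for the known indexes, one index predicted) can be sketched as follows. The sketch assumes the 2D projection has already been computed, so the MDS step is omitted, and a plain-Python nearest-neighbor vote stands in for KNeighborsClassifier. The coordinates and the helper name `knn_predict` are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(points, labels_by_index, query_index, neighbors=3):
    """Majority label among the `neighbors` labelled points closest to the query point."""
    ranked = sorted(labels_by_index, key=lambda i: dist(points[i], points[query_index]))
    votes = Counter(labels_by_index[i] for i in ranked[:neighbors])
    return votes.most_common(1)[0][0]


# Hypothetical 2D projection of 3 agonists, 3 antagonists and 1 unknown ligand.
projection = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.3),  # agonists, indexes 0-2
    (2.0, 2.1), (2.2, 1.9), (1.9, 2.3),  # antagonists, indexes 3-5
    (0.3, 0.2),                          # unknown ligand, index 6
]

# Reveal the labels of the known ligands: 1 for Agonist, 0 for Antagonist.
agonist_inds, antagonist_inds = [0, 1, 2], [3, 4, 5]
known = {i: 1 for i in agonist_inds}
known.update({i: 0 for i in antagonist_inds})

label = knn_predict(projection, known, query_index=6)
```

The unknown ligand takes part in the projection but never contributes a label, which mirrors the idea that the model sees all data points a priori and only the labels are withheld.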
-
class
MDSimsEval.rmsf_baseline_models.
ResidueMajority
(start, stop, rmsf_cache, method=<function mean>)¶ Used for evaluating residue selections based on their RMSF.
This is a simple baseline model able to quantify how good our residue selection is. Given a k - k training set of agonists - antagonists, for an unknown ligand we iterate through the residues. If the ligand's RMSF on a residue is closer to the median/average RMSF of the training agonists on that residue than to that of the training antagonists, the residue votes that the ligand is an agonist. Otherwise it votes that the ligand is an antagonist.
At the end we see which class had the most votes.
-
start
¶ The starting frame of the window
- Type
int
-
stop
¶ The last frame of the window
- Type
int
-
method
¶ Function used to summarize each residue, e.g. np.mean, np.median
- Type
func
-
agonist_residue_baseline
¶ A list of the aggregated RMSF value of the agonists of each residue
- Type
List[float]
-
antagonist_residue_baseline
¶ A list of the aggregated RMSF value of the antagonists of each residue
- Type
List[float]
-
selected_residues
¶ A list of size total_residues, with True at the indexes of the selected residue ids
- Type
List[boolean]
-
fit
(train_analysis_actors, residues)¶ The function that initializes the aggregated RMSF value for each residue for each class.
- Parameters
train_analysis_actors –
{ "Agonists": List[AnalysisActor.class], "Antagonists": List[AnalysisActor.class] }
residues –
A list of residue ids that the model will use. To use all the residues, pass np.arange(290). Can also be a dictionary with residue ids as keys and a list of [start, stop] windows as values. This allows us to use a residue in more than one window, which is used in the residue cherry picking part of the thesis.
Example
# This will use the window saved as an attribute when we created the model object
residues = [10, 11, 27, 52, 83]
# or
residues = {
    115: [[0, 500], [2000, 2500]],
    117: [[2000, 2500]],
    81: [[2000, 2500]],
    78: [[1000, 1500], [1500, 2000]],
    254: [[0, 500], [1500, 2000]],
}
-
predict
(ligand)¶ Performs the majority voting and returns the predicted class.
- Parameters
ligand (AnalysisActorClass) – The ligand whose class we want to predict
- Returns
The class label, 1 for Agonist, 0 for Antagonist.
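The per-residue voting described for this class can be sketched in plain Python. This is an illustration under assumptions rather than the package's code: `statistics.mean` stands in for `np.mean`, the helper names and RMSF values are hypothetical, and an equal number of votes is resolved as antagonist because no tie rule is stated here.

```python
from statistics import mean


def residue_baselines(rmsf_rows, method=mean):
    """Summarize each residue across the training ligands of one class."""
    return [method(column) for column in zip(*rmsf_rows)]


def majority_vote(ligand_rmsf, agonist_residue_baseline, antagonist_residue_baseline):
    """Each residue votes for the class whose baseline is closer; return the winning label."""
    agonist_votes = sum(
        abs(value - ago) < abs(value - antago)
        for value, ago, antago in zip(
            ligand_rmsf, agonist_residue_baseline, antagonist_residue_baseline
        )
    )
    antagonist_votes = len(ligand_rmsf) - agonist_votes
    # Assumption: an equal number of votes is resolved as antagonist.
    return 1 if agonist_votes > antagonist_votes else 0


# Hypothetical RMSF values: one row per ligand, one column per selected residue.
agonists = [[1.0, 2.0, 3.0], [1.2, 2.2, 3.2]]
antagonists = [[0.2, 0.4, 3.0], [0.4, 0.6, 3.4]]

agonist_residue_baseline = residue_baselines(agonists)
antagonist_residue_baseline = residue_baselines(antagonists)
label = majority_vote([1.0, 1.9, 3.3], agonist_residue_baseline, antagonist_residue_baseline)
```

In this example the first two residues vote agonist and the third votes antagonist, so the ligand is labelled as an agonist by two votes to one.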
-
-
MDSimsEval.rmsf_baseline_models.
bootstrap_dataset
(analysis_actors_dict, samples, sample_size)¶ Creates a given number of bootstrapped samples of the Agonist - Antagonist dataset. The remaining ligands are also returned as a validation set. E.g. if sample_size = 20 and each class has 27 ligands, then we create a dict of 20 - 20 unique ligands and the remaining 7 ligands are returned as a validation set.
- Parameters
analysis_actors_dict –
{ "Agonists": List[AnalysisActor.class], "Antagonists": List[AnalysisActor.class] }
samples (int) – Number of bootstrapped samples generated
sample_size (int) – How many ligands of each class the training set will have
- Returns
A tuple of (train_dicts, test_dicts)
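A minimal sketch of this kind of splitting, assuming that each sample draws sample_size unique ligands per class and leaves the rest for validation, could look like the following. The function name `bootstrap_sketch`, the `seed` argument and the string placeholders used instead of AnalysisActor objects are hypothetical and are not part of the package.

```python
import random


def bootstrap_sketch(ligands_by_class, samples, sample_size, seed=None):
    """Return (train_dicts, test_dicts), one train / validation split per sample."""
    rng = random.Random(seed)
    train_dicts, test_dicts = [], []
    for _ in range(samples):
        train, test = {}, {}
        for label, ligands in ligands_by_class.items():
            # Draw unique positions for training; the rest become validation ligands.
            picked = set(rng.sample(range(len(ligands)), sample_size))
            train[label] = [lig for i, lig in enumerate(ligands) if i in picked]
            test[label] = [lig for i, lig in enumerate(ligands) if i not in picked]
        train_dicts.append(train)
        test_dicts.append(test)
    return train_dicts, test_dicts


# Placeholder ligands: 27 per class, as in the example above.
dataset = {
    "Agonists": [f"agonist_{i}" for i in range(27)],
    "Antagonists": [f"antagonist_{i}" for i in range(27)],
}
train_dicts, test_dicts = bootstrap_sketch(dataset, samples=5, sample_size=20, seed=0)
```

Each returned training dict keeps the {"Agonists": [...], "Antagonists": [...]} layout, so under these assumptions it could be passed to the fit method of a baseline model, with the matching validation dict used to compute accuracy.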