6. RMSF Baseline Models

In this package we focus heavily on the RMSF value as a means of selecting the residues that best separate the Agonist and Antagonist classes. To evaluate our residue selection techniques we need a way to quantify that separability.

For this reason we created some simple baseline models that can be used like any ML model (fit, predict) and from which we can calculate values such as accuracy.
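As a rough illustration of how accuracy could be computed from a model that exposes fit and predict, here is a minimal sketch. The `AlwaysAgonist` model and the `accuracy` helper are hypothetical and not part of MDSimsEval; the ligands are plain strings standing in for AnalysisActor objects.

```python
from typing import Dict, List


class AlwaysAgonist:
    """Hypothetical stand-in model: predicts 1 (Agonist) for every ligand."""

    def fit(self, train_analysis_actors, residues):
        return self

    def predict(self, ligand):
        return 1


def accuracy(model, validation_actors: Dict[str, List[object]]) -> float:
    """Fraction of validation ligands whose predicted label matches their class."""
    labelled = [(ligand, 1) for ligand in validation_actors["Agonists"]]
    labelled += [(ligand, 0) for ligand in validation_actors["Antagonists"]]
    correct = sum(model.predict(ligand) == label for ligand, label in labelled)
    return correct / len(labelled)


# Strings stand in for AnalysisActor objects (assumption for illustration only)
validation = {"Agonists": ["a1", "a2", "a3"], "Antagonists": ["b1"]}
model = AlwaysAgonist().fit(train_analysis_actors={}, residues=[])
score = accuracy(model, validation)
```

Any of the baseline models described in this section could take the place of the toy model, since they share the same fit and predict interface.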

In the module we have an abstract base class called BaselineClassifier which contains some basic helper methods for reading and preparing the RMSF values. Any subclass must implement the fit and predict methods.

[Figure: UML diagram of the baseline models (image missing)]

We suggest reading this example to become familiar with the flow.

Warning

These models should not be used as the final classification models due to their simplicity. Their goal is to evaluate the residue selection techniques.

The module also includes a method for bootstrapping the dataset of ligands.

class MDSimsEval.rmsf_baseline_models.AggregatedResidues(start, stop, rmsf_cache, method=<function mean>)

This simple model fits on the training data by calculating one aggregated RMSF value for the agonist set and one for the antagonist set. The RMSF is calculated on the given residues and aggregated to a single value: for each class we first calculate the average/median value of each residue across its ligands and then aggregate those per-residue values again, ending with a scalar.

start

The starting frame of the window

Type

int

stop

The last frame of the window

Type

int

method

Function that we will use for summarizing each residue, e.g. np.mean, np.median

Type

func

agonist_baseline

The aggregated RMSF value of the agonist class

Type

float

antagonist_baseline

The aggregated RMSF value of the antagonist class

Type

float

selected_residues

A list of size total_residues, with True at the indexes of the selected residue ids

Type

List[boolean]

fit(train_analysis_actors, residues)

The function that initializes the aggregated RMSF value for each class.

Parameters
  • train_analysis_actors – { "Agonists": List[AnalysisActor.class], "Antagonists": List[AnalysisActor.class] }

  • residues

    A list of residue ids that the model will use. To use all the residues pass np.arange(290). Can also be a dictionary with residue ids as keys and, as values, a list of [start, stop] windows. The dictionary form allows us to use a residue in more than one window. This is used in the residue cherry picking part of the thesis.

    Example

    # A plain list uses the window saved as an attribute when we created the model object
    residues = [10, 11, 27, 52, 83]

    # or, alternatively, a dictionary that gives each residue its own window(s)
    residues = {
        115: [[0, 500], [2000, 2500]],
        117: [[2000, 2500]],
        81: [[2000, 2500]],
        78: [[1000, 1500], [1500, 2000]],
        254: [[0, 500], [1500, 2000]],
    }
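
To make the two accepted forms of residues concrete, the sketch below flattens either form into (residue id, window start, window stop) triples. The helper `expand_residues` is hypothetical and is not claimed to mirror how MDSimsEval handles the argument internally; it only assumes, as stated above, that the list form falls back to the model's own window.

```python
from typing import Dict, List, Sequence, Tuple, Union

Residues = Union[Sequence[int], Dict[int, List[List[int]]]]


def expand_residues(residues: Residues, start: int, stop: int) -> List[Tuple[int, int, int]]:
    """Return (residue_id, window_start, window_stop) triples for either input form."""
    if isinstance(residues, dict):
        # Dictionary form: each residue carries its own list of [start, stop] windows
        return [
            (res_id, win_start, win_stop)
            for res_id, windows in residues.items()
            for win_start, win_stop in windows
        ]
    # List form: every residue uses the window stored on the model (start, stop)
    return [(int(res_id), start, stop) for res_id in residues]


from_list = expand_residues([10, 11], start=0, stop=2500)
from_dict = expand_residues(
    {115: [[0, 500], [2000, 2500]], 117: [[2000, 2500]]}, start=0, stop=2500
)
```

In the dictionary case residue 115 appears twice, once per window, which is what allows a residue to be used in more than one window.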
    

predict(ligand)

Checks the distance of the unknown ligand from the agonist and antagonist baselines and returns the label of the closest class.

Parameters

ligand (AnalysisActorClass) – The ligand whose class we want to predict

Returns

The class label, 1 for Agonist, 0 for Antagonist.
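
The aggregate-then-compare idea behind AggregatedResidues can be sketched on plain lists of RMSF values. This is an illustration under assumptions, not the package's implementation: the function names are hypothetical, the second aggregation step and the ligand's own aggregation are assumed to be a mean, and ties are assumed to go to the antagonist class.

```python
from statistics import mean
from typing import Callable, List, Sequence

# Hypothetical input: one sequence of per-residue RMSF values per ligand
RmsfTable = List[Sequence[float]]


def class_baseline(ligand_rmsfs: RmsfTable, method: Callable = mean) -> float:
    """Summarise each residue across ligands with `method`, then reduce to one scalar."""
    per_residue = [method(column) for column in zip(*ligand_rmsfs)]
    return mean(per_residue)  # assumption: the second aggregation is a mean


def predict_label(ligand_rmsf: Sequence[float], agonist_baseline: float,
                  antagonist_baseline: float) -> int:
    """Return 1 (Agonist) or 0 (Antagonist), whichever baseline is closer."""
    value = mean(ligand_rmsf)
    closer_to_agonists = abs(value - agonist_baseline) < abs(value - antagonist_baseline)
    return 1 if closer_to_agonists else 0  # assumption: ties go to the antagonists


agonists = [[1.0, 2.0], [3.0, 4.0]]
antagonists = [[5.0, 6.0], [7.0, 8.0]]

agonist_baseline = class_baseline(agonists)
antagonist_baseline = class_baseline(antagonists)

label_low = predict_label([2.0, 2.0], agonist_baseline, antagonist_baseline)
label_high = predict_label([6.0, 8.0], agonist_baseline, antagonist_baseline)
```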

class MDSimsEval.rmsf_baseline_models.BaselineClassifier(start, stop, rmsf_cache, method)

An abstract class that the baseline classifier models extend. The subclasses must implement the abstract methods fit and predict.

The class provides helper methods for stacking the RMSF of the selected residues for training and for getting the RMSF of the unknown ligand whose class we want to predict.

If you want to extend the class, we suggest reading the other classes in this module. This will help you understand how fit and predict are implemented and how the helper methods of this class are used.
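
The contract that subclasses must provide fit and predict can be sketched with Python's `abc` module. Whether BaselineClassifier itself is built on `abc` is an assumption; `BaselineSketch` and `ConstantLabel` are hypothetical names, the helper methods are omitted, and an empty dict stands in for the `rmsf_cache`, whose real type is not described here.

```python
from abc import ABC, abstractmethod


class BaselineSketch(ABC):
    """Hypothetical stand-in for BaselineClassifier, without its RMSF helper methods."""

    def __init__(self, start, stop, rmsf_cache, method):
        self.start = start
        self.stop = stop
        self.rmsf_cache = rmsf_cache
        self.method = method

    @abstractmethod
    def fit(self, train_analysis_actors, residues):
        ...

    @abstractmethod
    def predict(self, ligand):
        ...


class ConstantLabel(BaselineSketch):
    """Toy subclass: remembers the larger training class and always predicts it."""

    def fit(self, train_analysis_actors, residues):
        agonists = len(train_analysis_actors["Agonists"])
        antagonists = len(train_analysis_actors["Antagonists"])
        self.label = 1 if agonists >= antagonists else 0
        return self

    def predict(self, ligand):
        return self.label


# The abstract class cannot be instantiated until fit and predict are implemented
try:
    BaselineSketch(0, 2500, {}, max)
    abstract_is_instantiable = True
except TypeError:
    abstract_is_instantiable = False

toy = ConstantLabel(0, 2500, {}, max).fit(
    {"Agonists": ["a"], "Antagonists": ["b", "c"]}, residues=[]
)
```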

class MDSimsEval.rmsf_baseline_models.KSDistance(start, stop, rmsf_cache, method=<function mean>)

Used for evaluating residue selections based on their RMSF.

For each class we calculate the average/median RMSF of each residue. When we receive an unknown ligand we use the K-S test, which returns the “distance” between the distribution of the unknown ligand and that of the class. We calculate the distance for the other class too and classify the ligand as the class with the smallest distance.
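
The “distance” returned by the two-sample K-S test is the largest gap between the two empirical cumulative distribution functions. The plain-Python sketch below computes only that statistic, without a p-value; `ks_statistic` is a hypothetical helper written for illustration, and the RMSF numbers are made up. In practice a library routine such as `scipy.stats.ks_2samp` returns both the statistic and a p-value, although whether MDSimsEval uses it is an assumption.

```python
from typing import Sequence


def ks_statistic(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Two-sample K-S statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The gap can only change at observed values, so checking those is enough
    for x in a + b:
        cdf_a = sum(value <= x for value in a) / len(a)
        cdf_b = sum(value <= x for value in b) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Made-up per-residue RMSF values, for illustration only
agonist_residue_baseline = [1.0, 1.1, 1.2, 1.3]
antagonist_residue_baseline = [2.0, 2.1, 2.2, 2.3]
unknown_ligand = [1.05, 1.15, 1.25, 2.05]

to_agonists = ks_statistic(unknown_ligand, agonist_residue_baseline)
to_antagonists = ks_statistic(unknown_ligand, antagonist_residue_baseline)
label = 1 if to_agonists < to_antagonists else 0
```

Here the unknown ligand is closer to the agonist baseline, so it is labelled 1.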

start

The starting frame of the window

Type

int

stop

The last frame of the window

Type

int

method

Function that we will use for summarizing each residue, e.g. np.mean, np.median

Type

func

agonist_residue_baseline

A list of the aggregated RMSF value of the agonists of each residue

Type

List[float]

antagonist_residue_baseline

A list of the aggregated RMSF value of the antagonists of each residue

Type

List[float]

selected_residues

A list of size total_residues, with True at the indexes of the selected residue ids

Type

List[boolean]

fit(train_analysis_actors, residues)

The function that initializes the aggregated RMSF value for each residue for each class.

Parameters
  • train_analysis_actors – { "Agonists": List[AnalysisActor.class], "Antagonists": List[AnalysisActor.class] }

  • residues

    A list of residue ids that the model will use. To use all the residues pass np.arange(290). Can also be a dictionary with residue ids as keys and, as values, a list of [start, stop] windows. The dictionary form allows us to use a residue in more than one window. This is used in the residue cherry picking part of the thesis.

    Example

    # A plain list uses the window saved as an attribute when we created the model object
    residues = [10, 11, 27, 52, 83]

    # or, alternatively, a dictionary that gives each residue its own window(s)
    residues = {
        115: [[0, 500], [2000, 2500]],
        117: [[2000, 2500]],
        81: [[2000, 2500]],
        78: [[1000, 1500], [1500, 2000]],
        254: [[0, 500], [1500, 2000]],
    }
    

predict(ligand)

Performs the K-S test. If the outcomes mismatch (one class accepts the ligand, the other rejects it) we classify the ligand as the class that accepted it. Otherwise, we classify using the K-S distance.

Parameters

ligand (AnalysisActorClass) – The ligand whose class we want to predict

Returns

The class label, 1 for Agonist, 0 for Antagonist
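
The two-stage decision described for predict can be sketched as a small function over precomputed test results. Several details are assumptions made for illustration: `ks_decision` is a hypothetical name, a class is taken to "accept" the ligand when the p-value exceeds a significance level `alpha`, and `alpha` defaults to 0.05. The numbers in the usage lines are made up.

```python
from typing import Tuple

KsResult = Tuple[float, float]  # (K-S statistic, p-value)


def ks_decision(agonist_test: KsResult, antagonist_test: KsResult,
                alpha: float = 0.05) -> int:
    """Return 1 (Agonist) or 0 (Antagonist) from two K-S test results."""
    agonist_stat, agonist_p = agonist_test
    antagonist_stat, antagonist_p = antagonist_test

    # Assumption: "accepts" means the same-distribution hypothesis is not rejected
    agonist_accepts = agonist_p > alpha
    antagonist_accepts = antagonist_p > alpha

    if agonist_accepts != antagonist_accepts:
        # Mismatch of outcomes: follow the class that accepted the ligand
        return 1 if agonist_accepts else 0
    # Same outcome for both classes: fall back to the smaller K-S distance
    return 1 if agonist_stat < antagonist_stat else 0


mismatch_label = ks_decision((0.20, 0.90), (0.80, 0.01))
both_accept_label = ks_decision((0.50, 0.20), (0.20, 0.60))
both_reject_label = ks_decision((0.60, 0.01), (0.90, 0.001))
```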

class MDSimsEval.rmsf_baseline_models.MDStoKNN(start, stop, rmsf_cache, metric, neighbors)

This is a model that classifies ligands based on the K nearest neighbors on an MDS 2D projection.

We first calculate the pairwise distances of all the ligands, creating an (agons_numb + antagons_numb) x (agons_numb + antagons_numb) matrix. We then perform a non-linear projection using MDS, transforming the matrix to an (agons_numb + antagons_numb) x 2 shape.

We provide the indexes of the ligands whose labels are known. These are considered our train set, and we fit a KNN model on them. We then provide the index of the ligand we want to predict. For example, we may have 20 agonists and 20 antagonists and want to predict an unknown ligand. The transformed shape of our 2D projection will be (20 + 20 + 1) x 2. The indexes [0, 1, …, 39] will form the train set and will be passed to the fit of the KNN. We then pass the index 40 to the predict method in order to predict the unknown ligand.

Note

This model is more complex than the others, which have a straightforward approach similar to the sklearn models. The main idea is that the model knows all the data points a priori in order to create the 2D mapping. We then reveal to the model the labels of the known ligands in order to predict the unknown ligands.

Parameters
  • start (int) – The starting frame of the window

  • stop (int) – The last frame of the window

  • metric – A method used to calculate the pairwise distance of the ligands. Possible metrics are K-S distance and Spearman’s r

  • neigh (KNeighborsClassifier) – The KNN model used for predicting the unknown ligands

choose_known_ligands(agonist_inds, antagonists_inds)

Takes the indexes of the agonist and antagonist ligands that will form the train set and fits the KNN model using the transformed pairwise distances.

Parameters
  • agonist_inds – The indexes of the train agonists

  • antagonists_inds – The indexes of the train antagonists

create_pairwise_distances(analysis_actors_dict, residues)

Creates the pairwise distance matrix of all the input ligands that will be projected on a 2D manifold using MDS

Parameters
  • analysis_actors_dict – { "Agonists": List[AnalysisActor.class], "Antagonists": List[AnalysisActor.class] }

  • residues – A list of residue ids that the model will use. For all the residues give np.arange(290).

Returns

A DataFrame of shape (agons_numb + antagons_numb) x (agons_numb + antagons_numb)
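
The shape of that matrix can be illustrated with a small plain-Python sketch. It returns nested lists rather than a DataFrame, places agonists before antagonists (an assumption about row order), and uses a mean absolute difference as a stand-in metric instead of the K-S distance or Spearman's r. Both function names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Rmsf = Sequence[float]


def pairwise_distances(rmsf_by_class: Dict[str, List[Rmsf]],
                       metric: Callable[[Rmsf, Rmsf], float]) -> List[List[float]]:
    """Symmetric distance matrix; agonists take the first rows, antagonists the rest."""
    ligands = list(rmsf_by_class["Agonists"]) + list(rmsf_by_class["Antagonists"])
    size = len(ligands)
    matrix = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1, size):
            # Fill both halves at once, since the metric is assumed symmetric
            matrix[i][j] = matrix[j][i] = metric(ligands[i], ligands[j])
    return matrix


def mean_abs_difference(a: Rmsf, b: Rmsf) -> float:
    """Stand-in metric for illustration only."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


rmsf_by_class = {
    "Agonists": [[1.0, 2.0], [1.0, 4.0]],
    "Antagonists": [[3.0, 2.0]],
}
matrix = pairwise_distances(rmsf_by_class, mean_abs_difference)
```

With two agonists and one antagonist the result is a 3 x 3 matrix with zeros on the diagonal.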

fit(analysis_actors_dict, residues)

Create the pairwise distance matrix and perform MDS to transform it to a 2D matrix.

Parameters
  • analysis_actors_dict – { "Agonists": List[AnalysisActor.class], "Antagonists": List[AnalysisActor.class] }

  • residues – A list of residue ids that the model will use. For all the residues give np.arange(290).

predict(ligand_ind)

Given the index of the unknown ligand, predicts its class using the fitted KNN model

Parameters

ligand_ind – The index of the unknown ligand

Returns

The class label, 1 for Agonist, 0 for Antagonist
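
The index-based workflow (all points known up front, labels revealed afterwards) can be sketched without the MDS step by starting from ready-made 2D coordinates. `IndexKNN` is a hypothetical plain-Python stand-in for the KNeighborsClassifier step, the coordinates are made up, and three neighbors is an arbitrary choice.

```python
from collections import Counter
from math import dist
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


class IndexKNN:
    """Hypothetical stand-in for the KNN step; `points` plays the role of the 2D projection."""

    def __init__(self, points: List[Point], neighbors: int = 3):
        self.points = points
        self.neighbors = neighbors
        self.labels: Dict[int, int] = {}

    def choose_known_ligands(self, agonist_inds: Sequence[int],
                             antagonist_inds: Sequence[int]) -> None:
        """Reveal the labels of the training ligands by index."""
        self.labels = {i: 1 for i in agonist_inds}
        self.labels.update({i: 0 for i in antagonist_inds})

    def predict(self, ligand_ind: int) -> int:
        """Majority label among the nearest labelled points."""
        target = self.points[ligand_ind]
        nearest = sorted(self.labels, key=lambda i: dist(self.points[i], target))
        votes = Counter(self.labels[i] for i in nearest[: self.neighbors])
        return votes.most_common(1)[0][0]


# Indexes 0-2: agonists, 3-5: antagonists, 6-7: unknown ligands (made-up coordinates)
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (0.5, 0.5), (5.5, 5.5)]
knn = IndexKNN(points, neighbors=3)
knn.choose_known_ligands(agonist_inds=[0, 1, 2], antagonist_inds=[3, 4, 5])
first_unknown = knn.predict(6)
second_unknown = knn.predict(7)
```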

class MDSimsEval.rmsf_baseline_models.ResidueMajority(start, stop, rmsf_cache, method=<function mean>)

Used for evaluating residue selections based on their RMSF.

This is a simple baseline model able to quantify how good our residue selection is. Given a k - k training set of agonists - antagonists, for an unknown ligand we iterate through the residues. If the ligand's RMSF on a residue is closer to the median/average of the training agonists (of the RMSF values of the specific residue) then the residue votes that the ligand is an agonist. Else it votes that the ligand is an antagonist.

At the end we see which class had the most votes.

start

The starting frame of the window

Type

int

stop

The last frame of the window

Type

int

method

Function that we will use for summarizing each residue, e.g. np.mean, np.median

Type

func

agonist_residue_baseline

A list of the aggregated RMSF value of the agonists of each residue

Type

List[float]

antagonist_residue_baseline

A list of the aggregated RMSF value of the antagonists of each residue

Type

List[float]

selected_residues

A list of size total_residues, with True at the indexes of the selected residue ids

Type

List[boolean]

fit(train_analysis_actors, residues)

The function that initializes the aggregated RMSF value for each residue for each class.

Parameters
  • train_analysis_actors – { "Agonists": List[AnalysisActor.class], "Antagonists": List[AnalysisActor.class] }

  • residues

    A list of residue ids that the model will use. To use all the residues pass np.arange(290). Can also be a dictionary with residue ids as keys and, as values, a list of [start, stop] windows. The dictionary form allows us to use a residue in more than one window. This is used in the residue cherry picking part of the thesis.

    Example

    # A plain list uses the window saved as an attribute when we created the model object
    residues = [10, 11, 27, 52, 83]

    # or, alternatively, a dictionary that gives each residue its own window(s)
    residues = {
        115: [[0, 500], [2000, 2500]],
        117: [[2000, 2500]],
        81: [[2000, 2500]],
        78: [[1000, 1500], [1500, 2000]],
        254: [[0, 500], [1500, 2000]],
    }
    

predict(ligand)

Performs the majority voting and returns the predicted class.

Parameters

ligand (AnalysisActorClass) – The ligand whose class we want to predict

Returns

The class label, 1 for Agonist, 0 for Antagonist.
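
The per-residue voting can be sketched on plain lists of RMSF values. The function names are hypothetical, the RMSF numbers are made up, and the handling of an overall tie (here resolved in favour of the antagonist class) is an assumption.

```python
from statistics import median
from typing import Callable, List, Sequence


def fit_residue_baselines(ligand_rmsfs: List[Sequence[float]],
                          method: Callable = median) -> List[float]:
    """One summarised RMSF value per residue, taken across the ligands of a class."""
    return [method(column) for column in zip(*ligand_rmsfs)]


def predict_by_majority(ligand_rmsf: Sequence[float],
                        agonist_residue_baseline: Sequence[float],
                        antagonist_residue_baseline: Sequence[float]) -> int:
    """Each residue votes for the class whose baseline is closer; return the winner."""
    agonist_votes = sum(
        abs(value - ago) < abs(value - antago)
        for value, ago, antago in zip(
            ligand_rmsf, agonist_residue_baseline, antagonist_residue_baseline
        )
    )
    antagonist_votes = len(ligand_rmsf) - agonist_votes
    return 1 if agonist_votes > antagonist_votes else 0  # assumption: ties go to 0


# Made-up training data: 3 ligands per class, 3 selected residues
agonists = [[1.0, 5.0, 2.0], [1.2, 5.2, 2.2], [1.4, 5.4, 2.4]]
antagonists = [[2.0, 4.0, 3.0], [2.2, 4.2, 3.2], [2.4, 4.4, 3.4]]

agonist_residue_baseline = fit_residue_baselines(agonists)
antagonist_residue_baseline = fit_residue_baselines(antagonists)

label = predict_by_majority(
    [1.3, 4.3, 2.1], agonist_residue_baseline, antagonist_residue_baseline
)
```

In this example two of the three residues vote agonist, so the ligand is labelled 1.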

MDSimsEval.rmsf_baseline_models.bootstrap_dataset(analysis_actors_dict, samples, sample_size)

Creates a given number of bootstrapped samples of the Agonist - Antagonist dataset. The remaining ligands are also returned as a validation set. E.g. if sample_size = 20 and each class has 27 ligands, then we create a dict of 20 - 20 unique ligands and the remaining 7 ligands of each class are returned as a validation set.

Parameters
  • analysis_actors_dict – { "Agonists": List[AnalysisActor.class], "Antagonists": List[AnalysisActor.class] }

  • samples (int) – Number of bootstrapped samples generated

  • sample_size (int) – How many ligands of each class the training set will have

Returns

A tuple of (train_dicts, test_dicts)
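
A minimal sketch of the splitting behaviour described above follows. Because the description speaks of unique ligands, each sample is drawn without replacement within a class; that reading, the `seed` parameter, and the name `bootstrap_sketch` are assumptions, and strings stand in for AnalysisActor objects.

```python
import random
from typing import Dict, List, Tuple

ActorsDict = Dict[str, List[object]]


def bootstrap_sketch(analysis_actors_dict: ActorsDict, samples: int, sample_size: int,
                     seed: int = 0) -> Tuple[List[ActorsDict], List[ActorsDict]]:
    """Return (train_dicts, test_dicts) with `samples` entries each."""
    rng = random.Random(seed)  # hypothetical seed, added only for reproducibility
    train_dicts, test_dicts = [], []
    for _ in range(samples):
        train, test = {}, {}
        for label, ligands in analysis_actors_dict.items():
            # Pick sample_size distinct positions; the rest form the validation set
            chosen = set(rng.sample(range(len(ligands)), sample_size))
            train[label] = [lig for i, lig in enumerate(ligands) if i in chosen]
            test[label] = [lig for i, lig in enumerate(ligands) if i not in chosen]
        train_dicts.append(train)
        test_dicts.append(test)
    return train_dicts, test_dicts


# 27 ligands per class, as in the example above
actors = {
    "Agonists": [f"agonist_{i}" for i in range(27)],
    "Antagonists": [f"antagonist_{i}" for i in range(27)],
}
train_dicts, test_dicts = bootstrap_sketch(actors, samples=3, sample_size=20)
```

Each of the three samples then holds 20 training ligands and 7 validation ligands per class, with no ligand appearing in both.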