This tutorial guides the user through the process of doing multiple linear regression and data exploration on 16 p38 MAP kinase inhibitors with the software package R. Explorative data analysis is carried out on this dataset, containing precalculated physicochemical descriptors. Multiple linear regression and correlation analysis are utilized to identify descriptors influencing koff.
In addition the following R packages are needed for this tutorial.
The code snippet below installs all packages needed.
We recommend the excellent book R for Data Science by Hadley Wickham and Garrett Grolemund for a guided tour into R for Data Science.
For advanced classification and regression analysis, the R package caret developed by Max Kuhne is recommended. More on this topic
The input file has to be provided as a csv file. Each row represents a compound and the columns describe its physicochemical and kinetic parameters.
We extracted kinetic parameters for 16 MAP 38 alpha inhibitors from Regan et al. For these 16 inhibitors we calculated a set of five physicochemical descriptors using the MOE software. However any non-proprietary software can be used to generate the descriptors of interest, e.g. RDkit. This data is stored in the map38.csv file.
column name | description |
---|---|
compound_key | compound identifier taken from Regan et al. |
pkon | −log10 transformed kon value, as reported in Regan et al. |
pkoff | −log10 transformed koff value, as reported in Regan et al. |
pkD | −log10 transformed KD value, as reported in Regan et al. |
vsa_positiv | postitive surface area; descr. Q_VSA_POS in MOE |
vsa_negative | negative surface area; descr. Q_VSA_NEG in MOE |
vsa_polar | polar surface area; descr. Q_VSA_POL in MOE |
lipophilicity | lipophilicity; descr. logP(o/w) in MOE |
weight | Molecular weight, Weight in MOE |
Download the provided map38.csv file on your computer and assign the file path to path_to_file. We then use the read.csv function to read the map38.csv file into R assign it to the map38 variable.
compound_key |
pKD |
pkon |
pkoff |
vsa_positiv | vsa_negativ | lipophilicity | vsa_polar | weight |
---|---|---|---|---|---|---|---|---|
2 | 10.01 | -4.93 | 5.08 | 369.66 | 183.17 | 4.94 | 147.38 | 527.67 |
3 | 5.92 | -5.07 | 0.85 | 204.59 | 125.31 | 2.97 | 81.18 | 306.80 |
5 | 10.01 | -5.19 | 4.82 | 342.38 | 193.22 | 4.64 | 147.38 | 513.64 |
6 | 8.23 | -4.81 | 3.42 | 315.11 | 193.22 | 4.29 | 147.38 | 499.61 |
7 | 7.80 | -4.40 | 3.40 | 371.93 | 190.94 | 3.74 | 180.42 | 543.67 |
8 | 7.64 | -5.16 | 2.48 | 357.09 | 125.24 | 2.99 | 170.62 | 451.57 |
column | n | min | mean | max | range |
---|---|---|---|---|---|
pKD | 16 | 5.92 | 8.201875 | 10.01 | 4.09 |
pkon | 16 | -5.19 | -4.517500 | -3.20 | 1.99 |
pkoff | 16 | 0.85 | 3.685625 | 5.08 | 4.23 |
vsa_positiv | 16 | 204.59 | 343.281250 | 459.19 | 254.60 |
vsa_negativ | 16 | 125.24 | 173.359375 | 193.22 | 67.98 |
lipophilicity | 16 | 2.97 | 4.463750 | 5.69 | 2.72 |
vsa_polar | 16 | 57.94 | 143.048125 | 207.05 | 149.11 |
weight | 16 | 306.80 | 489.707500 | 584.77 | 277.97 |
The pipe operator %>% is handy way to run a sequence of functions. The broom::tidy function provides us with a summary of the dataset. The dplyr::select function is used to select a set of columns, it can also be used to exclude columns using the "-" sign as a prefix, e.g. select(-compound_key)
Copy the code snippet below into R or R Studio. The output of this section is a kinetic map, plotting pkon values on the x-axis against pkoff values on the y-axis. This is a quick way to draw a kinetic map, using ggplot2 . It allows visual inspection of the kinetic data distribution and facilitates the detection of outliers. Values of xintercept and yintercept can be altered according to covered pkon and pkoff ranges.
ggplot2::geom_abline draws diagonal lines
ggplot2::geom_vline draws vertical lines
ggplot2::geom_hline draws horizontal lines
ggplot2::ggrepel::geom_text_repel is used to avoid olverlapping labels
The following code snippets provides a routine to standardise your data using scale.
pKD | pkon | pkoff | vsa_positiv | vsa_negativ | lipophilicity | vsa_polar | weight |
---|---|---|---|---|---|---|---|
1.6740729 | -0.7547556 | 1.2434457 | 0.3460330 | 0.4508013 | 0.5678886 | 0.0885297 | 0.4607009 |
-2.1126997 | -1.0109151 | -2.5286925 | -1.8193335 | -2.2078841 | -1.7811728 | -1.2643874 | -2.2197076 |
1.6740729 | -1.2304803 | 1.0115885 | -0.0118225 | 0.9126020 | 0.2101635 | 0.0885297 | 0.2904373 |
0.0260398 | -0.5351903 | -0.2368733 | -0.3695467 | 0.9126020 | -0.2071824 | 0.0885297 | 0.1201736 |
-0.3720805 | 0.2149910 | -0.2547085 | 0.3758105 | 0.8078353 | -0.8630118 | 0.7637621 | 0.6548718 |
-0.5202183 | -1.1755890 | -1.0751263 | 0.1811414 | -2.2111006 | -1.7573245 | 0.5634813 | -0.4628246 |
To prepare for the boxplot the data needs to be transformed using the tidyr::gather function. More on data transformation in R can be found here
A scatterplot matrix can be used to identify correlations between variables. The R package GGally provides the ggpairs function which outputs a scatterplot and correlation matrix from a data frame.
The leaps package includes the regsubsets function which provides forward, backward and exhaustive search for the best variable subset of a linear regression model. In this example we used the exhaustive method which tries all possible variable combinations. The output of the regbusbets function can be forwarded to the plot function, which sorts the variable combinations according to the adjusted R2.
According to the regsubsets function the best variable combination for a linear model is vsa_polar and weight.
This combination yields in an adjusted R2 = 0.79.
The R function lm returns a linear regression model. From the regsubsets function in the prior step, we found that weight and vsa_polar yielded in the best adjusted R2.
The model shows significant coefficients, however due to the strong intercorrelation between weight and vsa_polar one should be careful about interpretation. The ggcoef function can be used to visualize the coeffients of a regression model, including their standard errors.
The plotly package allows the production of interactive 3D plots. The R code below is translated into HMTL and Javascript and thus enable interactive behaviour. Here, we created a point plot using the variables vsa_polar, weight and pkoff.
This plot can only be viewed in HTML mode and WebGL enabled browsers.
An excellent introduction on regression diagnostics can be found here. More on regression diagnostics in R. The stastistical significance tests for the coefficients and the regression model are based on assumptions of the underlying data and error terms. The ggfortify provides a set of diagnostic plots to evaluate our obtained regression model.