---
title: "Preparation of the example data"
author: "Joris Meys"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Preparation of the example data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The example data used in this package was originally published by [Yamanishi et al, 2008](https://doi.org/10.1093/bioinformatics/btn162). They used [the KEGG database](https://www.kegg.jp/) to get information on drug-target interactions for different groups of enzymes. We used their supplementary material as the basis for the example data provided with this package. Their supplementary datasets can be downloaded from [here](http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/).

## Obtaining the original data

The original adjacency matrix and the similarity matrix of the targets were downloaded from that website using the following code:

```{r getFiles, eval = FALSE}
adjAddress <- "http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/nr_admat_dgc.txt"
targetAddress <- "http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/nr_simmat_dg.txt"

# Both files are tab-separated, with names in the first row and first column
drugTargetInteraction <- as.matrix(
  read.table(adjAddress, header = TRUE, row.names = 1, sep = "\t")
)
targetSim <- as.matrix(
  read.table(targetAddress, header = TRUE, row.names = 1, sep = "\t")
)
```

This data was used as is from the website.

## Processing the drug similarities

In the original paper the authors relied on the SIMCOMP algorithm to compute the similarities between drugs. This method returns non-symmetric matrices, so the original drug similarities cannot be used in a meaningful way for a two-step kernel ridge regression. We therefore decided to recreate the similarities between the different drugs, this time using the algorithms provided in the [fmcsR package v1.20.0](https://bioconductor.org/packages/release/bioc/html/fmcsR.html).

The code used to obtain and process the drug similarities is heavily based on code kindly provided by Dr. Thomas Girke on the [Bioconductor support forum](https://support.bioconductor.org/p/106712/#106744).

### Obtaining the data

To read in the structural data for all compounds, we create a function that constructs the actual link and retrieves the data from KEGG. This function is based on the tools provided in the [ChemmineR package v2.30.2](http://bioconductor.org/packages/ChemmineR/):

```{r importKEGG, eval = FALSE}
library(ChemmineR)

importKEGG <- function(ids){
  sdfset <- SDFset() # creates an empty SDF set
  # We use the link format for obtaining the data
  urlp <- "http://www.genome.jp/dbget-bin/www_bget?-f+m+drug+"
  # Combine everything in an sdfset
  for(i in ids){
    url <- paste0(urlp, i)
    tmp <- as(read.SDFset(url), "SDFset")
    cid(tmp) <- i # use the KEGG id as compound id
    sdfset <- c(sdfset, tmp)
  }
  return(sdfset)
}

# Now read the SDF information for all compounds in the research
keggsdf <- importKEGG(colnames(drugTargetInteraction))
```

### Calculating the similarities

The `fmcs` function in the `fmcsR` package computes a similarity score between two compounds. It returns a few different similarity measures, including the Tanimoto coefficient. This coefficient turns out to be a valid kernel for chemical similarities ([Ralaivola et al, 2005](https://doi.org/10.1016/j.neunet.2005.07.009), [Bajusz et al, 2015](https://doi.org/10.1186/s13321-015-0069-3)). So in this example we continue with the Tanimoto coefficients.

```{r tanimoto, eval = FALSE}
library(fmcsR)

# Keep in mind this needs some time to run!
# fmcsBatch compares one query compound against the whole set
drugSim <- sapply(cid(keggsdf), function(x){
  fmcsBatch(keggsdf[x], keggsdf, au = 0, bu = 0)[, "Tanimoto_Coefficient"]
})
```

All data is stored in the package and can be accessed using

```{r}
data(drugtarget)
```
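
As noted in the section on drug similarities, a similarity matrix has to be symmetric before it can serve as a kernel matrix. The chunk below is a minimal sketch of how such a check could look in base R. It is not part of the original processing: the toy matrices `simAsym` and `simSym`, their values, and the helper `isValidKernel` are hypothetical and only serve as illustration.

```{r kernelCheck, eval = FALSE}
# Hypothetical toy similarities between three drugs (not real data).
drugs <- c("D1", "D2", "D3")

# A non-symmetric matrix, comparable to what a non-symmetric method returns
simAsym <- matrix(
  c(1.0, 0.8, 0.3,
    0.6, 1.0, 0.5,
    0.2, 0.4, 1.0),
  nrow = 3, byrow = TRUE, dimnames = list(drugs, drugs)
)

# A symmetric matrix with the same layout
simSym <- matrix(
  c(1.00, 0.70, 0.25,
    0.70, 1.00, 0.45,
    0.25, 0.45, 1.00),
  nrow = 3, byrow = TRUE, dimnames = list(drugs, drugs)
)

# Hypothetical helper: TRUE when K is symmetric and all its eigenvalues
# are non-negative (up to a small tolerance), FALSE otherwise.
isValidKernel <- function(K, tol = 1e-8){
  if(!isSymmetric(K, tol = tol)) return(FALSE)
  ev <- eigen(K, symmetric = TRUE, only.values = TRUE)$values
  all(ev >= -tol)
}

isValidKernel(simAsym) # fails the symmetry test
isValidKernel(simSym)  # symmetric and positive definite
```

Note that `isSymmetric` also requires the row and column names to match, so a similarity matrix whose rows and columns are ordered differently will fail the check as well.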