---
title: "Preparation of the example data"
author: "Joris Meys"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Preparation of the example data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The example data used in this package was originally published by
[Yamanishi et al, 2008](https://doi.org/10.1093/bioinformatics/btn162). They 
used [the KEGG data base](https://www.kegg.jp/) to get information drug-target
interaction for different groups of enzymes. We used their supplementary
material as a basis for the example data provided to the package. Their
supplementary datasets can be downloaded from [here](http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/).

## Obtaining the original data

The original adjacency matrix and similarity of the targets
were downloaded from that website using the following code:

```{r getFiles, eval = FALSE}
adjAddress <- "http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/nr_admat_dgc.txt"

targetAddress <- "http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/nr_simmat_dg.txt"

drugTargetInteraction <- as.matrix(
  read.table(adjAddress, header = TRUE, row.names = 1, sep = "\t")
)
targetSim <- as.matrix(
  read.table(targetAddress, header =TRUE, row.names = 1, sep = "\t")
)
```

This data was used as is from the website.

## Processing the drug similarities

In the original paper the authors relied on the SIMCOMP algorithm, but this
method returns non-symmetric matrices and hence the original data cannot
be used in a meaningful way for a two-step kernel ridge regression. Hence we decided to recreate the similarities between the different drugs,
this time using the algorithms provided in the 
[fmcsR package v1.20.0](https://bioconductor.org/packages/release/bioc/html/fmcsR.html). 
The code used to obtain and process the drug similarities is heavily based on
code kindly provided by Dr. Thomas Girke on the 
[BioConductor support forum](https://support.bioconductor.org/p/106712/#106744).

### Obtaining the data

To read in the structural data for all compounds we create a function that
constructs the actual link and retrieves the data from KEGG. This function
is based on the tools provided in the 
[ChemmineR package v2.30.2](http://bioconductor.org/packages/ChemmineR/):

```{r importKEGG, eval = FALSE}
library(ChemmineR)
importKEGG <- function(ids){
  sdfset <- SDFset() # creates an empty SDF set
  
  # We use the link format for obtaining the data
  urlp <- "http://www.genome.jp/dbget-bin/www_bget?-f+m+drug+"
  
  # Combine everything in an sdfset
  for(i in ids){
    url <- paste0(urlp, i)
    tmp <- as(read.SDFset(url), "SDFset")
    cid(tmp) <- i
    sdfset <- c(sdfset, tmp)
  }
  return(sdfset)
}
# Now read the SDF information for all compounds in the research
keggsdf <- importKEGG(colnames(drugTargetInteraction))
```

### Calculating the similarities

The `fmcs` function in the `fmcsR` package allows to compute a similarity
score between two compounds. It returns a few different similarity measures,
including the Tanimoto coefficient. This coefficient turns out to be a 
valid kernel for chemical similarities 
([Ralaivola et al, 2005](https://doi.org/10.1016/j.neunet.2005.07.009) ,
[Bajusz et al, 2015](https://doi.org/10.1186/s13321-015-0069-3)). So
in this example we continue with the Tanimoto coefficients.

```{r tanimoto, eval = FALSE}
# Keep in mind this needs some time to run! 
drugSim <- sapply(cid(keggsdf),
                  function(x){
                    fmcsBatch(keggsdf[x], keggsdf,
                              au = 0, bu = 0)[,"Tanimoto_Coefficient"]
                  })
```

All data is stored in the package and can be accessed using

```{r}
data(drugtarget)
```