qmtsvr is a R package for fitting (Quasi) Multi-task Support Vector Regression methods with a focus on the prediction of complex traits using genomic data. The package expands the commonly used ɛ -insensitive loss function for augmenting the number of observations for a target variable (the trait of interest) by aggregating the complementary information of correlated traits. The high computational burden for the hyperparameter fine-tuning is alleviated by considering trait-common values for the model bias, regularization parameter, and epsilon constant, leading to an optimization problem that can be handled in standard quadratic programming libraries. The H-dimensional hyperparameter space (with , where t is the number of traits) is then optimized via a stochastic evolutionary algorithm.
The package stable version is not hosted in the CRAN yet. However, the development version can be downloaded directly from the github page.
#Remember to properly install devtools first
library(devtools)
install_github('alvesand/qmtsvr')
The following data augmentation scheme is employed:
where t refers to the number of traits, and are the vectors of positive Lagrange multipliers related to the trait k, is a t-dimensional vector of ones, ⊗ is the Kronecker product, are vectors of observed values for t target variables standardized to the same scale, and Q is a block symmetric matrix, partitioned into blocks of dimension N x N. The Q matrix in (1) is a multi-task scaled RBF kernel, proposed as follows:
Using the dual formulation in eq. (1), the function for predicting yet-to-be-seen observations depends solely on the mapping kernel and can be written as , for k = 1, 2, … t (3), in which maps the relationship in the feature space between the individual i and every column j in the matrix .
The predicted value is then returned to the original scale by using the proper math. Equations 1, 2 and 3 are automatically used with the function qmtsvr.fit. Notice that the bias value, the regularization hyperparameter (), and the epsilon threshold () are assumed to be constant for all traits in equations 1 and 3, consequently, one can fit this extended SVR model with the standard SMO algorithm, available in most kernel-based methods libraries, just by precomputing the block matrix . Our implementation uses the kernlab package (Karatzoglou et al., 2004) as the backend for solving the quadratic programming problem. The model has hyperparameters to be tuned, the global and constants, besides the kernel bandwidth hyperparameters () and the association constants () in .
The function qmtsvr.GA uses a genetic algorithm (GA) for finding an optimal hyperparameter combination that maximizes the QMTSVR prediction ability. This stochastic evolutionary algorithm compares a population of candidate models (with binary arrays representing different hyperparameter combinations or the chromosome in the GA terminology) according to their fitness scores (fs). The arrays for the models with the best fs are then selected and crossed for composing the next generation. The resulting child arrays inherit features from both parent chromosomes from the previous generation.
The relevant parameters in the present GA implementation are the population size (PS), the number of generations (NG), crossover rate (CR), mutation rate (MR), and Tournament Size (tsize). The PS indicates the number of models tested per generation; the CR controls the rate that a child’s chromosome will result from the crossing-over of two parents instead of being an identical copy of one of them. The MR is the probability of a single bit (or gene) on the binary array changing randomly, implying slight modifications to the current model. The selection operator in the GA was the Tournament Selection, i.e., for each child chromosome to be created, tsize individuals are drawn at random from the current population and the one with the highest fs is selected for integrating the pair of crossing chromosomes.
The GA-based fine-tuning of the QMTSVR hyperparameters is illustrated in Figure 1.
Figure 1. The optimization process of the QMTSVR method via genetic algorithm (GA). During the GA-based fine-tuning each specific hyperparameter set is coded as a binary array (the chromosome). The fitness function is computed for each chromosome based on the predictive ability that this specific hyperparameter set achieves in the validation population (omitted during the GA-based optimization). The tournament selection operator selects a pair of individuals with the best fitness scores of two subsets randomly drawn with replacements from the current population. A child chromosome is created for each chromosome pair sampled with the selection operator and P new chromosomes are created for the next generation using the two-point crossing-over and mutation processes. The v worst individuals (set of hyperparameters) of the current generation are replaced with the v best individuals from the previous generation. The GA algorithm repeats the process for T generations. The best hyperparameter set () is then used for retraining the model with all available observations.
The package main functions are:
Returns an (weighted) Euclidean Distance Matrix (EDM) based on a numeric matrix .
the qmtsvr.dist function has the following arguments:
x
| A matrix of genotypes or other features (rows for individuals, columns for predictor variables)w
| A vector of size containing the weighting factors to be assigned to the features in during the EDM computation (Optional).u
| the power of the final EDM values (default = 2).verbose
A logical argument indicating if the time elapsed for computing the EDM must be printed (default = FALSE).vardiag
| A logical argument specifying if EDM diagonal elements must have variation around zero (default=FALSE). If vardiag
= TRUE, the zero values in EDM diagonals will be replaced with in which is the diagonal of and m is the median of .Fits a QMTSVR model with with user pre-defined hyper parameters. Returns an numeric prediction matrix.
the qmtsvr.fit function has the following arguments:
Y
| An numeric matrix of correlated response variables, where is the number of observations and is the number of traits (missing values are coded as NA).X
| A matrix of features (e.g., SNP markers), in which is the number of features.set_hyper
| a list specifying the QMTSVR model hyperparameters.w
| A list containing the trait-specific and trait-common weighting factors to be assigned to the features in during the EDM computation (Optional).D
| a list with all pre-computed EDM. It can be passed as an argument alternatively to the matrix (Optional).verbose
| A logical argument indicating if the function log must be printed at the console.vardiag
| A logical argument specifying if diagonal elements of the EMD should have variation around zero. (default=FALSE). If vardiag
= TRUE, the zero values in the EMD diagonal will be replaced with: , in which is the diagonal of and m is the median of . This quantity yields a variation around 1 for the diagonal elements of the RBF kernel, computed as , where is a hyperparameter constant and is the Euclidean Distance Matrix. When the SNP matrix is used as input, the diagonal elements in correlate with those from the matrix presented in VanRaden (2008).Optimizes the QMTSVR hyperparameters using a Genetic Algorithm.
the qmtsvr.GA function has the following arguments:
Y
| A matrix of response variables, where N is the number of observations and t is the number of traits (missing values are coded as NA).X
| matrix of features (e.g., SNP markers ), in which is the number of columns.hyper
| A list object containing the name specification and range for each hyperparameter to be optimized in the QMTSVR model.ngen
| The number of generations to run in the Genetic Algorithm.popsize
The population size (PS) to grow, it indicates the number of models tested per generation.mut_rate
| The mutation rate (e.g., 0.05) of the genetic algorithm. It is the probability of a single bit (or gene) on the binary array changing randomly, implying slight modifications to the current model.cross_rate
| The crossing over (CR) rate for the genetic algorithm. The CR controls the rate that a child’s chromosome will result from the crossing-over of two parents instead of being an identical copy of one of them.tsize
| The tournament size to be used in the GA(default = 5). For each child chromosome to be created, tsize
individuals are drawn at random from the current population and the one with the highest fitness scores is selected for integrating the pair of crossing chromosomes.elitism
| The number of models with the best performance in the current generation to keep for the next generarion (default = 1).cost
| The cost function to be optimized in the GA. The values accepted are: cor, rmse and mae for Pearson correlation between observed and predicted values, Root-mean squared error and Mean absolute error, respectively. The current version only accepts cost functions for continuous target variables.w
| A vector of size with the weighting factors to be assigned to the features in for the EDM computation (Optional).tgTrait
| A constant for indexing the column where the target variable is located at Y (default = 1).val_pop
| A character object indicating the validation strategy adopted in the GA. cross indicates cross-validation, custom indicates that the cost for an specific group must be optimized and closest indicates that the target population to be optimized will be composed by the closest individuals (based on the Euclidean distance of ) to those with NA values in the target variable.nfolds
| If val_pop = cross, nfolds
is the number of folds to be used in the cross-validation.k
| If val_pop
= closest, k
is the number closest neighbors to be identified in the EDM for each NA value.custom_val
| If val_pop
= custom, custom_val
is a vector of size N assuming 0 (train) and 1 (validation) values for indicating which individual will compose the validation population in the GA.vartype
| The target variable class. The current version only accepts continuous.MRCR
| it can be fixed (default) or dynamic; if the second option is chosen, the GA will gradatively reduce CR and increase MR when the algorithm get stuck in local minima.verbose
| A logical argument indicating if the GA output must be printed at the console.dopar
A logical argument indicating if the GA will be performed using a parallel process. It will speed up the GA run time. Attention: in the current version it will only work in Linux machines.ncores
| If dopar
= TRUE, the number of cores to be used must be specified (default = number of cores available - 1).In this section, we provide some examples on how to use each function of the qmtsvr package.
library(qmtsvr)
# Create toy data
values = rnorm(50, 0, 1)
X = matrix(values, 5, 10)
D = qmtsvr.dist (x=X)
D
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.000000 4.733590 5.138939 4.242613 3.660351
#> [2,] 4.733590 0.000000 3.754960 3.951034 5.198133
#> [3,] 5.138939 3.754960 0.000000 4.023742 6.386911
#> [4,] 4.242613 3.951034 4.023742 0.000000 4.351771
#> [5,] 3.660351 5.198133 6.386911 4.351771 0.000000
#Computing the distance with the R base function
D2 = dist(X,method = "euclidean",diag = T)
D2
#> 1 2 3 4 5
#> 1 0.000000
#> 2 4.733590 0.000000
#> 3 5.138939 3.754960 0.000000
#> 4 4.242613 3.951034 4.023742 0.000000
#> 5 3.660351 5.198133 6.386911 4.351771 0.000000
# Give more weight for the columns 1 and 5
w = rep(1,10)
w[1] = 5
w[5] = 15.5
D2 = qmtsvr.dist (x=X, w = w)
D2
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.000000 12.62712 10.646467 7.016424 3.783053
#> [2,] 12.627117 0.00000 5.062200 7.698930 12.493760
#> [3,] 10.646467 5.06220 0.000000 5.519211 10.862839
#> [4,] 7.016424 7.69893 5.519211 0.000000 6.637414
#> [5,] 3.783053 12.49376 10.862839 6.637414 0.000000
library(qmtsvr)
#Load dataset
pheno = data_ex$mydata[1:300,5:7]
head(pheno)
#> EBV_A EBV_B EBV_C
#> 1 4.8772368 -19.8837656 -15.63683
#> 2 0.4379619 -0.8912274 -60.40955
#> 3 -33.5370259 -6.0417011 -56.89957
#> 4 -13.8551773 55.6229989 17.61085
#> 5 -11.7870029 -7.9005613 22.94181
#> 6 -34.1528759 13.1498998 -60.99747
X = data_ex$SNP[1:300,]
obs = pheno #copy the observations matrix
pheno[200:300,1] = NA #omit the last 100 observations
tail(pheno)
#> EBV_A EBV_B EBV_C
#> 295 NA 3.2237448 23.52403
#> 296 NA 48.9010131 -37.82736
#> 297 NA 13.3215306 -62.29969
#> 298 NA -0.7356337 -49.03513
#> 299 NA -20.8358373 -31.73936
#> 300 NA -1.1012199 -68.66523
#Setting arbitrary hyperparameters values for a single trait SVR model
hyper_st = list("C" = 2, "eps" = 0.009, "b1" = 2.5)
#Predict the values
YHAT = qmtsvr.fit(Y=as.matrix(pheno[,1]), X = X, set_hyper = hyper_st,
verbose = F, vardiag = T)
#> Loading required package: kernlab
#Setting arbitrary hyperparameters values for a multi-trait SVR model with 3 traits
hyper_mt = list("C" = 0.2, "eps" = 0.09, "b1" = 0.15,
"b2"= 6.67, "b3" = 2.21,"b4" = 4.26,
"b5" = .53, "b6" = 1.4, "r1" = -0.4,
"r2" = 0.7, "r3" = 0.14)
#The sequence of bandwidth parameters (b) according to the traits (t1, t2, t3) is:
# t1 t2 t3
# t1: b1 b2 b3
# t2: . b4 b5
# t3: . . b6
#The sequence of weights r according to the traits (t1, t2, t3) is:
# t1 t2 t3
# t1: . r1 r2
# t2: . . r3
# t3: . . .
#Predict the values
YHAT2 = qmtsvr.fit(Y=pheno, X = X, set_hyper = hyper_mt,
verbose = T, vardiag = T)
#>
#>
#> #------- The University of Wisconsin - Madison --------#
#> #------ Department of Animal and Dairy Sciences -------#
#> # /) ( #
#> # .-._((.~~.))_.-. #
#> # `-. @@ .-.- #
#> # / .o--o. #
#> # ( ( .__. ) ) #
#> #------------------------------------------------------#
#> # QMTSVR v.0.1.4 (beta) - June 2022 #
#> # (Quasi)Multitask support vector regression #
#> # Anderson A.C. Alves (alves.zootecnista@gmail.com) #
#> #------------------------------------------------------#
#>
#> #--| Euclidean Distance Matrices (EDM) not provided |--#
#> Computing EDMs...
#> Done...
#> Building the kernel blocks...
#> Done...
#Correlate the predictions with the observed values
cor(YHAT[200:300,1],obs[200:300,1]) #Single Task
#> [1] 0.1557956
cor(YHAT2[200:300,1],obs[200:300,1]) #Multiple Task
#> [1] 0.3726077
#compute Root-mean squared error
mean(sqrt((YHAT[200:300,1] - obs[200:300,1])^2))#Single Task
#> [1] 23.76864
mean(sqrt((YHAT2[200:300,1] - obs[200:300,1])^2))#Multiple Task
#> [1] 22.60743
#plot the results
par(mfrow=c(1,2))
plot(YHAT[200:300,1],obs[200:300,1], pch = 19,
lty = 2, xlab = "Predicted values",
ylab = "Observed values",
col = rgb(0.7,0.5,0.20,0.7), main = "Single Task")
plot(YHAT2[200:300,1],obs[200:300,1],
pch = 19, lty = 2, xlab = "Predicted values",
ylab = "Observed values", col = rgb(0.7,0.5,0.20,0.7), main = "Multiple Task")
library(qmtsvr)
pheno = data_ex$mydata[1:300,5:7]
X = data_ex$SNP[1:300,]
obs = pheno #copy the observations matrix
pheno[200:300,1] = NA #omit the last 100 observations
hyper_st = list(c("C",0.01,5,128), c("eps",0.001,0.2,128), c("b1",0.1,7,128))
#The regularization parameter ("C") will range from 0.01 to 5 (128 values tested)
#The epsilon constant ("eps") will range from 0.001 to 0.2 (128 values tested)
#The bandwidth parameter ("b1") will range from 0.1 to 7 (128 values tested)
st_svr = qmtsvr.GA(Y = as.matrix(pheno[1:300,1]),
X = X[1:300,], hyper = hyper_st,
ngen = 30, popsize = 50, mut_rate = 0.05,
cross_rate = 0.95, elitism = 2,
cost = "cor", tsize = 5,
val_pop = "closest",
k = 5,vardiag=T, verbose = F)
#'val_pop' = closest, with k = 5
#This means that the 5 closest neighbors for each NA value
#will be used as internal validation (based on the Euclidean Distance).
#Since the different NA observations may have the same neighbors,
#the size of the internal validation population is
#generally smaller than the number of NAs
#Get the list with optimized hyperparameters
st_svr$set_hyper
#> C eps b1
#> 4.1748819 0.0010000 0.3173228
#QMT-SVR with 3 response variables
#Define the range for hyperparameters in a multi-task SVR with 3 variables
hyper_mt = list(c("C",0.1,5,128), c("eps",0.001,0.3,128),
c("b1",0.1,7,128), c("b2",0.1,7,128), c("b3",0.1,7,128),
c("b4",0.1,7,128), c("b5",0.1,7,128), c("b6",0.1,7,128),
c("r1",-0.9,0.9,128),c("r2",-0.9,0.9,128),c("r3",-0.9,0.9,128))
#Run the GA for optimizing the model hyperparameters
mt_svr = qmtsvr.GA(Y = pheno, X = X[1:300,],
hyper = hyper_mt, ngen = 30,
popsize = 50,mut_rate = 0.05,
cross_rate = 0.95, elitism = 2,
cost = "cor", tsize = 5, val_pop = "closest",
k=5, vardiag=T, verbose = F)
#Summary of the GA parameters
mt_svr$GA_parameters
#> Parameter value
#> 1 Number of generations 30
#> 2 Population size 50
#> 3 Mutation Rate 0.05
#> 4 Crossing-over Rate 0.95
#> 5 Tournament size 5
#> 6 Elitism 2
#> 7 MRCR fixed
#> 8 Cost cor
#Get the list with optimized hyperparameters
mt_svr$set_hyper
#> C eps b1 b2 b3 b4
#> 4.92283465 0.02454331 1.62125984 2.16456693 4.98976378 6.51102362
#> b5 b6 r1 r2 r3
#> 5.69606299 5.09842520 -0.33307087 0.44645669 0.55984252
#Plot GA history
par(mfrow = c(1,2))
plot.GA(st_svr)
plot.GA(mt_svr)
Karatzoglou A., Smola A., Hornik K., and Zeileis A. (2004). kernlab - An S4 Package for Kernel Methods in R. J. Stat. Soft. 11:1–20. doi: https://doi.org/10.18637/jss.v011.i09
VanRaden P. M. (2008). Efficient Methods to Compute Genomic Predictions. J. Dairy Sci. 91:4414-4423. doi: https://doi.org/10.3168/jds.2007-0980