Kernel Ridge Regression in R (for Drug-Target Interaction) - r

It's very hard to find any information on implementing KRR, so any input, however minor, would be highly appreciated.
I want to run Kernel Ridge Regression on a set of kernels I have computed, but I do not know how to do this in R. I found the constructKRRLearner function from the CVST package, but the manual is not clear to me, as I am a complete beginner in Machine Learning. The function needs an x and a y, but I have no idea what to pass there, as I only have a data frame holding the pairwise kernel, computed as the Kronecker product between drugs and proteins.
How can I do a Kernel Ridge Regression task in R?
Ideally I also want to visualize my data points and then illustrate the regression line on the plot! For instance like this:
http://scikit-learn.org/stable/_images/plot_kernel_ridge_regression_0011.png
MORE INFO ON MY DATASET
I have a drug-target interaction (DTI) data set. It comprises 100 drug compounds (rows) and 100 protein kinase targets (columns). There are some NaNs (missing values) in this data set. The values reflect how tightly a compound binds to a target.
I have drugs' SMILES and CHEMBL IDs.
I have the protein's (targets) sequences and UNIPROT IDs.
For drugs [100 drugs]: I converted the drug SMILES to an SDFset, and then computed the fingerprints for each drug using OpenBabel. Based on these fingerprints I computed Tanimoto kernels for all possible pairs of drugs (using the "fpSim" function), e.g. Drug 1 with Drug 2, 3, 4, ... 100, then Drug 2 with Drug 1, 3, 4, ... 100, and so on until Drug 99 with Drug 100. I named this BASE_DRUG_KERNELS.
For proteins: I had the protein sequences, so I computed Smith-Waterman scores for all protein pairs, e.g. Protein 1 with Protein 2, 3, ... 100, then Protein 2 with Protein 1, 3, 4, ... 100, and so on until Protein 99 with Protein 100. I named this BASE_PROTEIN_KERNELS.
Then I computed the Kronecker product of BASE_DRUG_KERNELS and BASE_PROTEIN_KERNELS, which gave me a matrix of 100,000,000 elements (10,000 x 10,000). I named this matrix KRONECKER_PRODUCTS.
I wish to run Kernel Ridge Regression on the matrix KRONECKER_PRODUCTS.
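With a precomputed kernel, KRR reduces to solving one linear system, so it can be sketched in base R without a dedicated package. The following is a minimal sketch, not a definitive implementation. The toy kernels, the binding matrix Y and the value of lambda are all made-up stand-ins for BASE_DRUG_KERNELS, BASE_PROTEIN_KERNELS and the real data. It assumes both base kernels are symmetric and positive semi-definite, and it trains only on the observed (non-missing) drug-protein pairs.

```r
set.seed(1)

# Toy stand-ins for the real data (all hypothetical): 5 drugs, 4 proteins.
n_d <- 5
n_p <- 4
Fd <- matrix(rnorm(n_d * 3), n_d)   # fake drug features
Fp <- matrix(rnorm(n_p * 3), n_p)   # fake protein features
Kd <- tcrossprod(Fd)                # drug kernel, n_d x n_d
Kp <- tcrossprod(Fp)                # protein kernel, n_p x n_p

# Binding matrix: drugs in rows, proteins in columns, some entries missing.
Y <- matrix(rnorm(n_d * n_p), n_d, n_p)
Y[c(3, 8, 14)] <- NA

# Pairwise kernel over all (drug, protein) pairs.
K <- kronecker(Kd, Kp)

# kronecker(Kd, Kp) orders the pairs as (d1,p1), (d1,p2), ..., (d2,p1), ...
# so the response has to be flattened row by row to line up with it.
y <- as.vector(t(Y))

obs <- !is.na(y)   # pairs with a measured binding value
lambda <- 0.1      # ridge penalty (assumed value; tune by cross-validation)

# Dual KRR solution: alpha = (K_obs + lambda * I)^(-1) y_obs
alpha <- solve(K[obs, obs] + lambda * diag(sum(obs)), y[obs])

# Predictions for every pair, including the ones that were missing.
y_hat <- as.vector(K[, obs] %*% alpha)
Y_hat <- matrix(y_hat, n_d, n_p, byrow = TRUE)   # back to drugs x proteins
```

Here x is never needed explicitly: the kernel matrix takes its place, and y is the vector of binding values. Raw Smith-Waterman score matrices are not guaranteed to be positive semi-definite, so that assumption is worth checking (for example via the eigenvalues) before using them as a kernel.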

Related

Is it possible to write a function in R to perform a discriminant analysis with a cumulative, variable number of factors?

I am attempting to perform a linear discriminant analysis on geometric morphometric data. Because geometric morphometric data typically produce large numbers of variables, and discriminant analyses require more data points than variables to accurately classify specimens, a common solution in the literature is to perform a principal component analysis and then use, as input for the LDA, the number of PCs that represents less than 99% of the cumulative variance but returns the highest reclassification rate.
Right now I do this by running the LDA in R (using the functions in the Morpho and MASS packages) for every possible number of PCs and noting the classification accuracy by hand until I find the lowest number of PCs that returns the highest accuracy, but this is highly inefficient.
I was wondering if there was any way to write a function that would run an LDA for all possible numbers of the first N PCs (up to a certain, user defined level representing 99% of the cumulative variance) and return the percent reclassification rate for each level, producing something like the following:
PCs percent_accuracy
20 72.2
19 76.3
18 77.4
17 80.1
16 75.4
15 50.7
... ...
1 20.2
So row 1 would be the reclassification rate when the first 20 PCs are used, row 2 is the rate when the first 19 PCs are used, and so on and so forth.
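A loop over the number of PCs can produce such a table. This is a minimal sketch under stated assumptions: the iris measurements stand in for the morphometric variables, the helper name lda_by_pcs is hypothetical, the upper limit is taken as the fewest PCs that reach 99% cumulative variance, and accuracy is the leave-one-out cross-validated rate from lda(..., CV = TRUE) in MASS rather than a plain resubstitution rate.

```r
library(MASS)  # provides lda()

# Stand-in data: iris measurements play the role of the shape variables.
X <- as.matrix(iris[, 1:4])
groups <- iris$Species

pca <- prcomp(X)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Largest number of PCs to try: the fewest that reach 99% cumulative variance.
max_pcs <- which(cum_var >= 0.99)[1]

# Hypothetical helper: LDA on the first k PCs for k = 1..max_pcs,
# returning the leave-one-out classification accuracy for each k.
lda_by_pcs <- function(scores, groups, max_pcs) {
  acc <- sapply(seq_len(max_pcs), function(k) {
    fit <- lda(scores[, 1:k, drop = FALSE], grouping = groups, CV = TRUE)
    100 * mean(fit$class == groups)
  })
  out <- data.frame(PCs = seq_len(max_pcs), percent_accuracy = round(acc, 1))
  out[order(-out$PCs), ]   # highest number of PCs first, as in the question
}

result <- lda_by_pcs(pca$x, groups, max_pcs)
print(result)
```

The smallest number of PCs with the best accuracy can then be read off with something like result$PCs[result$percent_accuracy == max(result$percent_accuracy)].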

Differential expression gene analysis: how to do t.test on an expression matrix with groups defined in a separate clinical matrix?

I am beginning my PhD in transcriptome (Affymetrix assay) analysis.
I have an expression matrix (trans_data: 32000 genes x 620 samples), and a clinical matrix (clin_data: 620 samples x 42 clinical characteristics).
Each sample belongs to 1 of the 4 populations A-B-C-D. I'd like to compare gene expression between populations A and B without trying to bind the two matrices.
I'd like to obtain a matrix with the mean expression of each gene in the two populations, then the p-value, then the adjusted p-value.
Then I could select only the differentially expressed genes (padj < 0.05).
thanks for your help.
Alain
Can't answer your question directly without a clear reproducible example but you might want to check out the rather excellent tableone package
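In the absence of a reproducible example, the comparison described in the question can be sketched on simulated data. This is only an illustration under assumptions: the toy trans_data and clin_data are made up, the column name population is hypothetical, samples are matched by name, and a Welch t-test with Benjamini-Hochberg adjustment is one possible choice among several.

```r
set.seed(42)

# Toy stand-ins (hypothetical): 50 genes x 12 samples, plus a clinical table.
trans_data <- matrix(rnorm(50 * 12), nrow = 50,
                     dimnames = list(paste0("gene", 1:50), paste0("s", 1:12)))
clin_data <- data.frame(population = rep(c("A", "B", "C", "D"), each = 3),
                        row.names = colnames(trans_data))

# Look up each sample's population by name; the two tables are never merged.
pop <- clin_data[colnames(trans_data), "population"]
in_A <- pop == "A"
in_B <- pop == "B"

# Welch t-test for every gene (row), comparing A samples with B samples.
p_value <- apply(trans_data, 1, function(g) t.test(g[in_A], g[in_B])$p.value)

res <- data.frame(
  mean_A  = rowMeans(trans_data[, in_A]),
  mean_B  = rowMeans(trans_data[, in_B]),
  p_value = p_value,
  p_adj   = p.adjust(p_value, method = "BH")   # Benjamini-Hochberg FDR
)

# Genes called differentially expressed at the 5% level.
de_genes <- res[res$p_adj < 0.05, ]
```

For real microarray data, dedicated packages such as limma are commonly used instead of gene-by-gene t-tests.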

r - Estimate selection-unbiased allele frequencies with linear regression systems

I have a few data sets consisting of frequencies for i distinct alleles/SNPs of some populations. Additionally I recorded some factors that are suspected of having changed the frequencies of these alleles within the populations in the past through selection. It is assumed that the impact of selection can be described by a simple linear regression for every selection factor.
Now I'd like to estimate how the allele frequencies are expected to be under identical selectional forces (thus, I set selection=1). These new allele frequencies a'_i are derived as
a'_i = a_i - function[a_i|selection=1]
with the current frequency a_i of the allele i of a population and function[a_i|selection=1] as the estimated allele frequency under the absence of selectional forces.
However, there are some constraints for the whole process:
The minimal values of a'_i allowed is 0.
The sum of all allele frequencies a'_i has to be 1.
Usually I'd solve this problem by applying multiple linear regressions. But then the constraints are not fulfilled ...
Any idea how to approach this analysis with constraints (maybe using linear equation/regression systems or structural equation modelling)?
Here is an example data set containing allele frequencies for the ABO major allele groups (p, q, r) as well as the selection variables (x, y, z).
Although this example file only contains 3 alleles and 3 influential variables, all my data sets contain up to ~1050 alleles/SNPs and always 8 selection variables that may have (but don't have to) an impact on the allele frequencies ...
Many thanks in advance for ideas, code snippets and hints!
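One crude way to enforce the two constraints after an unconstrained adjustment is to clamp negative values at 0 and rescale each population so its frequencies sum to 1. The sketch below shows only that post hoc step, not a constrained regression. The matrix adj of already adjusted frequencies and the helper name apply_constraints are hypothetical.

```r
# Hypothetical adjusted frequencies a'_i: rows = populations, columns = alleles.
adj <- rbind(pop1 = c(p = 0.30, q = -0.05, r = 0.70),
             pop2 = c(p = 0.25, q =  0.10, r = 0.60))

# Hypothetical helper: clamp at zero, then rescale each row to sum to 1.
apply_constraints <- function(freq) {
  freq <- pmax(freq, 0)    # constraint 1: no frequency below 0
  freq / rowSums(freq)     # constraint 2: frequencies of a population sum to 1
}

adj_constrained <- apply_constraints(adj)
print(round(adj_constrained, 4))
```

Clamping and rescaling changes the fitted values after the fact, so it is a rough fix rather than an estimate that respects the constraints during fitting.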

Computing ROC curves without scores

I am working with a gene prioritization program and I want to analyze its performance using ROC curves and the ROCR package for R. The problem is that the program does not give me a score for each gene, it only orders the genes.
ROCR only uses continuous data as predictions. Is it possible to assign each gene a number in descending order? For example, I have these genes ordered:
Gene A
Gene B
Gene C
Could I assign these values?
Gene A: 3
Gene B: 2
Gene C: 1
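A ROC curve depends only on the ordering of the predictions, so rank-derived scores like these are enough. The following base-R sketch illustrates this on a made-up ranked list with made-up true labels (both hypothetical). With ROCR, the same score and label vectors would be passed to prediction() and then performance(pred, "tpr", "fpr").

```r
# Hypothetical ranked output (best candidate first) and hypothetical labels:
# 1 = gene is a known true positive, 0 = it is not.
genes   <- c("A", "B", "C", "D", "E", "F")
is_true <- c(1, 1, 0, 1, 0, 0)

# Turn the ordering into scores: the top gene gets the highest number.
score <- rev(seq_along(genes))   # 6, 5, 4, 3, 2, 1

# Sweep the threshold from the top of the list downwards.
ord <- order(score, decreasing = TRUE)
tpr <- c(0, cumsum(is_true[ord]) / sum(is_true))
fpr <- c(0, cumsum(1 - is_true[ord]) / sum(1 - is_true))

# Area under the curve by the trapezoid rule.
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

plot(fpr, tpr, type = "s", xlab = "False positive rate",
     ylab = "True positive rate")
```

Any strictly decreasing transformation of the rank would give the same curve, since only the order of the genes matters.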

simulation of genetic data in R

I am looking for the best way or best package available for simulating a genetic association between a specific SNP and a quantitative phenotype, with the simulated data being the most similar to my real data, except that I know the causal variant.
All of the packages I saw in R seem to be specialised in pedigree data or in population data where coalescence and other evolutionary factors are specified, but I don't have any experience in population genetics and I only want to simulate the simple case of a European population with characteristics similar to my real data (i.e. normal distribution for the trait and an additive effect for the genotype, similar allele frequencies…)
So for example if my genetic data is X and my quantitative variable is Y:
X <-rbinom(1000,2,0.4)
Y <- rnorm(1000,1,0.4)
I am looking for something in R similar to the function in Plink where one needs to specify a range of allele frequencies, a range for the phenotype, and a specific variant which should turn out to be associated with the phenotype (this is important because I need to repeat these associations in different datasets with the causal variant being the same).
Can someone please help me?
If the genotype changes only the mean of the phenotype, this is very simple.
phenotype.means <- c(5, 15, 20) # phenotype means for genotypes 0, 1, and 2
phenotype.sd <- 5
X <- rbinom(1000,2,0.4)
Y <- rnorm(1000, phenotype.means[X + 1], phenotype.sd) # X is 0/1/2, R indexes from 1
This will lead to Y containing 1000 normally distributed values, where those with homozygous recessive genotypes (aa, or 0) will have a mean of 5, those with heterozygous genotypes (Aa, or 1) will have a mean of 15, and those with homozygous dominant genotypes (AA, or 2) will have a mean of 20.
If you want a more traditional 2 setting phenotype (AA/Aa versus aa), just set phenotype.means to something like c(5, 20, 20).
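For the purely additive case mentioned in the question, the phenotype mean can be written as a linear function of the allele count. This is a minimal sketch with assumed parameter values (maf, beta0, beta and sigma are all hypothetical), followed by a regression to check that the simulated effect is recovered.

```r
set.seed(123)

n     <- 1000
maf   <- 0.4   # allele frequency of the causal SNP (assumed)
beta0 <- 1     # mean phenotype for genotype 0 (assumed)
beta  <- 0.3   # additive effect per copy of the allele (assumed)
sigma <- 0.4   # residual standard deviation (assumed)

X <- rbinom(n, 2, maf)                              # genotypes coded 0/1/2
Y <- rnorm(n, mean = beta0 + beta * X, sd = sigma)  # additive model

# Check: the slope of Y on X should be close to beta.
fit <- lm(Y ~ X)
print(coef(summary(fit)))
```

Repeating the simulation with a different seed gives a new dataset in which the same variant is causal with the same effect size.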
