I am working with a gene prioritization program and I want to analyze its performance using ROC curves and the ROCR package for R. The problem is that the program does not give me a score for each gene; it only orders the genes.
ROCR only uses continuous data as predictions. Is it possible to assign to each gene a number in descending order? For example, I have these genes ordered:
Gene A
Gene B
Gene C
Could I assign these values?
Gene A: 3
Gene B: 2
Gene C: 1
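A ROC curve depends only on the ordering of the predictions, so any scores that decrease down the ranked list give the same curve. Below is a minimal sketch of that idea, assuming a hypothetical list of six genes and made-up true/false labels (neither comes from the question). The AUC is computed in base R from the rank-sum identity, and the same two vectors are passed to ROCR if the package is installed.

```r
# Genes in the order returned by the prioritization tool (best first).
# Gene names and labels are hypothetical, for illustration only.
ranked_genes <- c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE", "GeneF")
labels <- c(1, 1, 0, 1, 0, 0)   # 1 = true positive gene, 0 = negative

# Turn list positions into scores: the top gene gets the highest value.
scores <- rev(seq_along(ranked_genes))

# AUC from the rank-sum (Mann-Whitney) identity, base R only.
n_pos <- sum(labels == 1)
n_neg <- sum(labels == 0)
auc <- (sum(rank(scores)[labels == 1]) - n_pos * (n_pos + 1) / 2) /
  (n_pos * n_neg)

# The same vectors can be handed to ROCR when it is available.
if (requireNamespace("ROCR", quietly = TRUE)) {
  pred <- ROCR::prediction(scores, labels)
  perf <- ROCR::performance(pred, "tpr", "fpr")   # ROC curve, e.g. plot(perf)
  auc_rocr <- ROCR::performance(pred, "auc")@y.values[[1]]
}
```

If the tool returns tied positions, giving tied genes the same score keeps the curve consistent with the ranking.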
I have a dataset containing about 500 curves. In my dataset every row is a curve (which comes from some experimental measurements) and in the columns there are the measurement intervals (I don't think it's important, but intervals are not time measurements but frequency measurements).
Here you can find the data:
https://drive.google.com/file/d/1q1F1any8RlCIrn-CcQEzLWyrsyTBCCNv/view?usp=sharing
curves    t1        t2
1         -57.48    -57.56
2         -56.22    -56.28
3         -57.06    -57.12
I want to divide this dataset into 2 - 4 homogeneous groups of curves.
I've seen that there are some packages in R (fda and funHDDC) that allow you to find clusters but I don't know how to create the list with which to start the analysis, and I also don't understand why the initial dataset doesn't fit. How can I transform the data I have into a list suitable for processing with the above packages?
What results should I expect?
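One likely reason the raw table does not fit is orientation: the fda smoothing functions expect one column per curve and one row per measurement point, while the dataset has one row per curve. The sketch below uses simulated curves and a hypothetical frequency grid (the real file is not read here), and the choice of 15 B-spline basis functions is arbitrary. The fda calls run only if the package is installed.

```r
set.seed(42)

# Simulated stand-in for the real data: 20 curves (rows) x 50 frequencies.
freq <- seq(1, 100, length.out = 50)   # hypothetical frequency grid
curves <- t(sapply(1:20, function(i) {
  -57 + sin(freq / 15 + (i > 10)) + rnorm(length(freq), sd = 0.05)
}))

# fda expects one column per curve, so transpose the table:
# rows become measurement points, columns become curves.
y <- t(curves)

if (requireNamespace("fda", quietly = TRUE)) {
  # B-spline basis over the observed frequency range (nbasis is a guess).
  basis <- fda::create.bspline.basis(rangeval = range(freq), nbasis = 15)
  # Smooth every column at once; $fd is the functional data object.
  curves_fd <- fda::smooth.basis(argvals = freq, y = y, fdParobj = basis)$fd
  # curves_fd is the kind of object funHDDC takes as input, for example
  # funHDDC::funHDDC(curves_fd, K = 2:4); check the package manual for
  # the exact arguments before relying on this.
}
```

The clustering step would then return one group label per curve, which can be compared against plots of the curves colored by group.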
I am beginning my PhD in transcriptome (Affymetrix assay) analysis.
I have an expression matrix (trans_data: 32000 genes x 620 samples) and a clinical matrix (clin_data: 620 samples x 42 clinical characteristics).
Each sample belongs to one of four populations (A, B, C or D). I'd like to compare gene expression between populations A and B without trying to bind the two matrices together.
I'd like to obtain a matrix with the mean expression of each gene in the two populations, then the p-value, then the adjusted p-value.
Then I could select only the differentially expressed genes (padj < 0.05).
Thanks for your help.
Alain
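The comparison described above can be sketched in base R without merging the two tables: the clinical table is used only to look up which sample names belong to each population. The data below are simulated and much smaller than the real matrices, the column name `population` is an assumption, and a per-gene Welch t-test stands in for whatever test is finally chosen (dedicated packages such as limma are commonly used for microarray data).

```r
set.seed(1)

# Simulated stand-ins for trans_data and clin_data.
n_genes <- 100
n_samples <- 40
sample_ids <- paste0("s", seq_len(n_samples))
gene_ids <- paste0("gene", seq_len(n_genes))
trans_data <- matrix(rnorm(n_genes * n_samples), n_genes, n_samples,
                     dimnames = list(gene_ids, sample_ids))
# "population" is a hypothetical column name.
clin_data <- data.frame(population = rep(c("A", "B", "C", "D"), each = 10),
                        row.names = sample_ids)
trans_data[1:5, 1:10] <- trans_data[1:5, 1:10] + 3  # 5 genes higher in A

# Look up sample names per population, then subset the expression matrix.
in_A <- rownames(clin_data)[clin_data$population == "A"]
in_B <- rownames(clin_data)[clin_data$population == "B"]
expr_A <- trans_data[, in_A, drop = FALSE]
expr_B <- trans_data[, in_B, drop = FALSE]

# One Welch t-test per gene.
pval <- sapply(seq_len(nrow(trans_data)), function(i) {
  t.test(expr_A[i, ], expr_B[i, ])$p.value
})

result <- data.frame(mean_A = rowMeans(expr_A),
                     mean_B = rowMeans(expr_B),
                     pvalue = pval,
                     padj = p.adjust(pval, method = "BH"),
                     row.names = rownames(trans_data))

# Differentially expressed genes at padj < 0.05.
de_genes <- result[result$padj < 0.05, ]
```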
I can't answer your question directly without a clear reproducible example, but you might want to check out the rather excellent tableone package.
This is the result of cluster analysis through the k-means function.
>weseg2<-read.csv("WE_SEG DATA.csv",header=TRUE)
>training.data2<-scale(weseg2)
>aaaa<-kmeans(training.data2, centers=4, iter.max=10000, nstart=20)
I want to know what characteristics each cluster has,
so I computed the mean of each variable within each cluster.
This is the code I used to calculate the variable means:
mean of first cluster
>rank1<-colMeans(training.data2[aaaa$cluster==1,])
mean of second cluster
>rank2<-colMeans(training.data2[aaaa$cluster==2,])
mean of third cluster
>rank3<-colMeans(training.data2[aaaa$cluster==3,])
mean of fourth cluster
>rank4<-colMeans(training.data2[aaaa$cluster==4,])
Given that, what code should I enter to rank the clusters for each variable?
For example, with the variables a, b and c, I want the rank of the four clusters on variable a, and likewise on variables b and c.
Use the apply and rank functions,
like this:
>rank5<-cbind(rank1,rank2,rank3,rank4)
>apply(rank5,1,rank)
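Each column of the result then holds the ranks of the four clusters for one variable (1 = smallest mean).
If you want the ranks in decreasing order instead, rank the negated values, for example apply(-rank5,1,rank); note that order() returns sorting positions, not ranks.
Good luck.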
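The same approach can be written without one line per cluster. This is a minimal sketch that uses the built-in iris measurements in place of the "WE_SEG DATA.csv" file, which is not available here; the seed and the names `cluster_means`, `ranks_increasing` and `ranks_decreasing` are illustrative choices.

```r
set.seed(1)

# Built-in data as a stand-in for the scaled survey data.
training_data <- scale(iris[, 1:4])
km <- kmeans(training_data, centers = 4, iter.max = 10000, nstart = 20)

# Variables x clusters matrix of within-cluster means.
cluster_means <- sapply(1:4, function(k) {
  colMeans(training_data[km$cluster == k, , drop = FALSE])
})
colnames(cluster_means) <- paste0("cluster", 1:4)

# apply() over rows ranks the four clusters within each variable.
# The result is transposed: rows are clusters, columns are variables.
ranks_increasing <- apply(cluster_means, 1, rank)    # 1 = smallest mean
ranks_decreasing <- apply(-cluster_means, 1, rank)   # 1 = largest mean
```

Reading down one column of `ranks_decreasing` shows which cluster has the highest mean for that variable.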
It's very hard to find any information on implementing KRR, so any input will be highly appreciated.
I want to run Kernel Ridge Regression on a set of kernels I have computed, but I do not know how to do this in R. I found the constructKRRLearner function from the CVST package, but the manual is not clear at all, especially for a complete beginner in machine learning like me. The function needs an x and a y, but I have no idea what to pass there, as I only have a data frame containing the pairwise kernel computed as the Kronecker product between drugs and proteins.
How can I do a Kernel Ridge Regression task in R?
Ideally I also want to visualize my data points and then illustrate the regression line on the plot! For instance like this:
http://scikit-learn.org/stable/_images/plot_kernel_ridge_regression_0011.png
MORE INFO ON MY DATASET
I have a drug-target interaction (DTI) data set. The data set comprises 100 drug compounds (rows) and 100 protein kinase targets (columns). There are some NaNs (missing values) in this data set. Values in this data set reflect how tightly a compound binds to a target.
I have drugs' SMILES and CHEMBL IDs.
I have the protein's (targets) sequences and UNIPROT IDs.
For drugs [100 drugs]: I converted the drug SMILES to an SDFset, and then I computed the fingerprints for each drug using OpenBabel. Based on these fingerprints I computed Tanimoto kernels for all possible pairs of drugs (using the "fpSim" function), e.g. Drug 1 with Drug 2, 3, 4, ... 100, then Drug 2 with Drug 1, 3, 4, ... 100, and so on until Drug 99 with Drug 100. I named this BASE_DRUG_KERNELS.
For proteins: I had the protein sequences, so I computed Smith-Waterman scores for all pairs of proteins, e.g. Protein 1 with Protein 2, 3, ... 100, then Protein 2 with Protein 1, 3, 4, ... 100, and so on until Protein 99 with Protein 100. I named this BASE_PROTEIN_KERNELS.
Then I computed the Kronecker product of BASE_DRUG_KERNELS and BASE_PROTEIN_KERNELS, which gave me a matrix of 100,000,000 elements (10,000 x 10,000). I named this matrix KRONECKER_PRODUCTS.
I wish to run Kernel Ridge Regression on the matrix KRONECKER_PRODUCTS.
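With a precomputed kernel, kernel ridge regression reduces to solving one linear system, (K + lambda * I) alpha = y, over the drug-target pairs whose binding value is observed. The sketch below shows this in base R on a tiny simulated problem, not the real 100 x 100 data: the two base kernels are made up from random features, and the value of `lambda` is arbitrary and would normally be chosen by cross-validation. It does not use the CVST package.

```r
set.seed(7)
n_drugs <- 6
n_prot <- 5

# Simulated base kernels (symmetric, positive semi-definite stand-ins).
drug_kernel <- tcrossprod(matrix(rnorm(n_drugs * 3), n_drugs))
prot_kernel <- tcrossprod(matrix(rnorm(n_prot * 3), n_prot))

# Pairwise kernel: row (i - 1) * n_prot + j is the pair (drug i, protein j).
K <- kronecker(drug_kernel, prot_kernel)

# Simulated binding matrix (drugs x proteins) with a few missing values.
Y <- matrix(rnorm(n_drugs * n_prot), n_drugs, n_prot)
Y[sample(length(Y), 4)] <- NA

# Flatten in the same pair order as K: protein index varies fastest.
y <- as.vector(t(Y))
obs <- !is.na(y)

# Fit on observed pairs only: solve (K_obs + lambda * I) alpha = y_obs.
lambda <- 0.1   # arbitrary ridge penalty for this sketch
alpha <- solve(K[obs, obs] + lambda * diag(sum(obs)), y[obs])

# Predict every pair, including the missing ones, and reshape.
pred <- as.vector(K[, obs] %*% alpha)
Y_hat <- matrix(pred, n_drugs, n_prot, byrow = TRUE)
```

For the full 10,000 x 10,000 kernel the same solve is feasible in memory but slow, so it is worth testing on a subset of pairs first.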
I have a file with the results of a microarray expression experiment. The first column holds the gene names. The next 15 columns are 7 samples from the post-mortem brain of people with Down's syndrome, and 8 from people not having Down's syndrome. The data are normalized. I would like to know which genes are differentially expressed between the groups.
There are two groups and the data is nearly normally distributed, so a t-test has been performed for each gene. The p-values were added in another column at the end. Afterwards, I did a correction for multiple testing.
I need to cluster the data to see if the differentially expressed genes (FDR<0.05) can discriminate between the groups.
Also, I would like to visualize the clustering using a heatmap with gene names on the rows and some meaningful names on the samples (columns).
This is the code I have written so far:
ds <- read.table("down_syndroms.txt", header = TRUE)
names(ds) <- c("Gene", paste0("Down", 1:7), paste0("Control", 1:8), "pvalues")
# Adjust the p-values (the column is named "pvalues", not "pvalue")
pvadj <- p.adjust(ds$pvalues, method = "BH")
# How many genes do we get with an FDR <= 0.05?
sum(pvadj<=0.05)
[1] 5641
# Cluster the data
ds_matrix<-as.matrix(ds[,2:18])
ds_dist_matrix<-dist(ds_matrix)
my_clustering<-hclust(ds_dist_matrix)
# Heatmap
library(gplots)
hm <- heatmap.2(ds_matrix, trace='none', margins=c(12,12))
The heatmap I have done doesn't look the way I would like. Also, I think I should remove the pvalues from it. Besides, R usually crashes when I try to plot the clustering (probably due to the big size of the data file, with more than 22 thousand genes).
How could I make a better-looking tree (clustering) and heatmap?
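One way to approach this is to keep only the significant genes and only the 15 sample columns before clustering, which both removes the p-values from the plot and shrinks the matrix that has to be drawn. The sketch below uses simulated data in place of "down_syndroms.txt", with 20 genes deliberately shifted in the Down group, and base R's heatmap() rather than gplots; the output file name and colors are illustrative choices.

```r
set.seed(3)

# Simulated stand-in: 200 genes, 7 Down + 8 Control samples.
sample_names <- c(paste0("Down", 1:7), paste0("Control", 1:8))
expr <- matrix(rnorm(200 * 15), 200, 15,
               dimnames = list(paste0("gene", 1:200), sample_names))
expr[1:20, 1:7] <- expr[1:20, 1:7] + 4   # 20 genes truly different

# Per-gene t-test and BH correction, as in the question.
pvalues <- apply(expr, 1, function(v) t.test(v[1:7], v[8:15])$p.value)
pvadj <- p.adjust(pvalues, method = "BH")

# Keep significant genes only; the matrix holds samples, not p-values.
sig_matrix <- expr[pvadj <= 0.05, , drop = FALSE]

# Cluster the samples on row-scaled expression and cut into two groups.
scaled <- t(scale(t(sig_matrix)))
sample_tree <- hclust(dist(t(scaled)))
sample_groups <- cutree(sample_tree, k = 2)

# Draw to a file, which is lighter than an on-screen device for big plots.
out_file <- file.path(tempdir(), "heatmap_significant_genes.pdf")
pdf(out_file, width = 7, height = 9)
heatmap(sig_matrix, scale = "row", labCol = sample_names,
        ColSideColors = rep(c("firebrick", "steelblue"), c(7, 8)),
        margins = c(8, 8))
dev.off()
```

With several thousand significant genes the row labels become unreadable, so setting labRow = NA or plotting only the top genes by adjusted p-value may give a cleaner figure.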