I'm a beginner with R and I need help automating these analyses and getting a summary output with the results.
I have 4 different data frames like the one below, all with the same headers and the same values in the Threshold column:
Set Threshold R2 P Coefficient Standard.Error Num_SNP
Base 0.0001 0.000233304 0.66047 0.0332613 0.0757204 47
Base 0.001 0.000387268 0.571772 -0.0438782 0.0775996 475
Base 0.05 0.00302399 0.114364 0.129474 0.082004 14164
Base 0.1 0.00252797 0.14897 0.117391 0.0813418 24616
Base 0.2 0.00481908 0.0465384 0.163571 0.0821767 41524
Base 0.3 0.00514761 0.0398082 0.170058 0.0827237 55307
Base 0.4 0.00699506 0.0166685 0.200571 0.083783 66943
Base 0.5 0.00634181 0.0226301 0.192314 0.0843623 76785
For each matching value in the Threshold columns, I would like to use the package metafor in R to meta-analyse the corresponding effect sizes (in the Coefficient column) and standard errors over the 4 data frames.
Using the metafor package:
rma.uni(yi = c(Coefficient_1, Coefficient_2, Coefficient_3, Coefficient_4),
        sei = c(Standard.Error_1, Standard.Error_2, Standard.Error_3, Standard.Error_4),
        measure = "GEN", method = "FE", intercept = TRUE,
        weights = c(sample_size1, sample_size2, sample_size3, sample_size4))
How could I automate the analyses and get a summary data frame with the results for each Threshold?
Hi there, this should get you started. Essentially you can loop over all thresholds,
extract the rows matching each threshold from all 4 data frames into a new data frame, and run your meta-analysis:
library(metafor)
# Make some fake data resembling your own
df1 = data.frame(Set=rep("Base",8), Threshold=c(0.0001,0.001,0.05,seq(0.1,0.5,0.1)),
R2=runif(8,0.001,0.005),P=runif(8,0.001,1),Coefficient=runif(8,-0.1,0.2),
Standard.Error=runif(8,0.07,0.08),Num_SNP=sample(1:1000,8))
df2 = data.frame(Set=rep("Base",8), Threshold=c(0.0001,0.001,0.05,seq(0.1,0.5,0.1)),
R2=runif(8,0.001,0.005),P=runif(8,0.001,1),Coefficient=runif(8,-0.1,0.2),
Standard.Error=runif(8,0.07,0.08),Num_SNP=sample(1:1000,8))
df3 = data.frame(Set=rep("Base",8), Threshold=c(0.0001,0.001,0.05,seq(0.1,0.5,0.1)),
R2=runif(8,0.001,0.005),P=runif(8,0.001,1),Coefficient=runif(8,-0.1,0.2),
Standard.Error=runif(8,0.07,0.08),Num_SNP=sample(1:1000,8))
df4 = data.frame(Set=rep("Base",8), Threshold=c(0.0001,0.001,0.05,seq(0.1,0.5,0.1)),
R2=runif(8,0.001,0.005),P=runif(8,0.001,1),Coefficient=runif(8,-0.1,0.2),
Standard.Error=runif(8,0.07,0.08),Num_SNP=sample(1:1000,8))
Thresholds = unique(df1$Threshold)
Results <- NULL
for(i in 1:length(Thresholds)){
  idf <- rbind(df1[df1$Threshold==Thresholds[i],],
               df2[df2$Threshold==Thresholds[i],],
               df3[df3$Threshold==Thresholds[i],],
               df4[df4$Threshold==Thresholds[i],])
  i.meta <- rma.uni(yi=idf$Coefficient, sei=idf$Standard.Error, measure="GEN", method='FE',
                    intercept=T, weights=idf$Num_SNP)
  Results <- rbind(Results, c(Threshold=Thresholds[i], beta=i.meta$beta, se=i.meta$se,
                              zval=i.meta$zval, pval=i.meta$pval, ci.lb=i.meta$ci.lb,
                              ci.ub=i.meta$ci.ub, QEp=i.meta$QEp))
}
Results <- data.frame(Results)
Results
This should give you:
Threshold beta se zval pval ci.lb ci.ub QEp
1 1e-04 -0.012079013 0.04715546 -0.2561530 0.79783270 -0.104502022 0.0803440 0.08700919
2 1e-03 0.068932388 0.04006086 1.7206917 0.08530678 -0.009585452 0.1474502 0.22294419
3 5e-02 0.050069503 0.04094881 1.2227340 0.22143020 -0.030188694 0.1303277 0.07342661
4 1e-01 0.102598016 0.04188183 2.4497022 0.01429744 0.020511132 0.1846849 0.07380669
5 2e-01 0.069482160 0.04722693 1.4712401 0.14122619 -0.023080930 0.1620452 0.95494364
6 3e-01 0.009793206 0.05098346 0.1920859 0.84767489 -0.090132542 0.1097190 0.12191340
7 4e-01 0.030432884 0.03967771 0.7670021 0.44308028 -0.047333994 0.1081998 0.86270334
8 5e-01 0.073511575 0.03997485 1.8389458 0.06592316 -0.004837683 0.1518608 0.12333557
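If you would rather not repeat the four subset calls, the same loop can be written over a list of the data frames. Here is a sketch using the objects created above and the same rma.uni arguments:
dfs <- list(df1, df2, df3, df4)
Results2 <- do.call(rbind, lapply(Thresholds, function(th){
  # stack the rows for this threshold from all data frames
  idf <- do.call(rbind, lapply(dfs, function(d) d[d$Threshold == th, ]))
  m <- rma.uni(yi = idf$Coefficient, sei = idf$Standard.Error,
               measure = "GEN", method = "FE", weights = idf$Num_SNP)
  data.frame(Threshold = th, beta = as.numeric(m$beta), se = m$se, zval = m$zval,
             pval = m$pval, ci.lb = m$ci.lb, ci.ub = m$ci.ub, QEp = m$QEp)
}))
Results2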
I'm new to R, so I apologize if this is a straightforward question, however I've done quite a bit of searching this evening and can't seem to figure it out. I've got a data frame with a whole slew of variables, and what I'd like to do is create a table of the correlations among a subset of these, basically the equivalent of "pwcorr" in Stata, or "correlations" in SPSS. The one key to this is that not only do I want the r, but I also want the significance associated with that value.
Any ideas? This seems like it should be very simple, but I can't seem to figure out a good way.
Bill Venables offers this solution in an answer on the R mailing list, to which I've made some slight modifications:
cor.prob <- function(X, dfr = nrow(X) - 2) {
  R <- cor(X)
  above <- row(R) < col(R)            # indices of the upper triangle
  r2 <- R[above]^2
  Fstat <- r2 * dfr / (1 - r2)        # F statistic for each correlation
  R[above] <- 1 - pf(Fstat, 1, dfr)   # replace the upper triangle with p-values
  cor.mat <- t(R)                     # the p-values now sit in the lower triangle
  cor.mat[upper.tri(cor.mat)] <- NA   # blank out the upper triangle
  cor.mat
}
So let's test it out:
set.seed(123)
data <- matrix(rnorm(100), 20, 5)
cor.prob(data)
[,1] [,2] [,3] [,4] [,5]
[1,] 1.0000000 NA NA NA NA
[2,] 0.7005361 1.0000000 NA NA NA
[3,] 0.5990483 0.6816955 1.0000000 NA NA
[4,] 0.6098357 0.3287116 0.5325167 1.0000000 NA
[5,] 0.3364028 0.1121927 0.1329906 0.5962835 1
Does that line up with cor.test?
cor.test(data[,2], data[,3])
Pearson's product-moment correlation
data: data[, 2] and data[, 3]
t = 0.4169, df = 18, p-value = 0.6817
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3603246 0.5178982
sample estimates:
cor
0.09778865
Seems to work ok.
Here is something I just made. I stumbled on this post because I was looking for a way to take every pair of variables and get a tidy n-by-4 data frame: column 1 is one variable, column 2 is the other, and columns 3 and 4 are the absolute and signed correlations. Just pass the function a data frame of numeric and integer columns.
pairwiseCor <- function(dataframe){
  pairs <- combn(names(dataframe), 2, simplify=FALSE)
  df <- data.frame(Variable1=rep(0,length(pairs)), Variable2=rep(0,length(pairs)),
                   AbsCor=rep(0,length(pairs)), Cor=rep(0,length(pairs)))
  for(i in 1:length(pairs)){
    df[i,1] <- pairs[[i]][1]
    df[i,2] <- pairs[[i]][2]
    df[i,3] <- round(abs(cor(dataframe[,pairs[[i]][1]], dataframe[,pairs[[i]][2]])),4)
    df[i,4] <- round(cor(dataframe[,pairs[[i]][1]], dataframe[,pairs[[i]][2]]),4)
  }
  pairwiseCorDF <- df
  pairwiseCorDF <- pairwiseCorDF[order(pairwiseCorDF$AbsCor, decreasing=TRUE),]
  row.names(pairwiseCorDF) <- 1:length(pairs)
  pairwiseCorDF <<- pairwiseCorDF   # also writes the result to the global environment
  pairwiseCorDF
}
This is what the output is:
> head(pairwiseCorDF)
Variable1 Variable2 AbsCor Cor
1 roll_belt accel_belt_z 0.9920 -0.9920
2 gyros_dumbbell_x gyros_dumbbell_z 0.9839 -0.9839
3 roll_belt total_accel_belt 0.9811 0.9811
4 total_accel_belt accel_belt_z 0.9752 -0.9752
5 pitch_belt accel_belt_x 0.9658 -0.9658
6 gyros_dumbbell_z gyros_forearm_z 0.9491 0.9491
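If you also want the significance values the question asks about, cor.test() can be folded into the same pairwise loop. Here is a sketch along the same lines (the function and column names are my own):
pairwiseCorTest <- function(dataframe){
  pairs <- combn(names(dataframe), 2, simplify = FALSE)
  rows <- lapply(pairs, function(p){
    ct <- cor.test(dataframe[[p[1]]], dataframe[[p[2]]])   # Pearson by default
    data.frame(Variable1 = p[1], Variable2 = p[2],
               Cor = round(unname(ct$estimate), 4), P.value = ct$p.value)
  })
  res <- do.call(rbind, rows)
  res[order(abs(res$Cor), decreasing = TRUE), ]             # strongest correlations first
}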
I've found that the R package picante does a nice job dealing with the problem that you have. You can easily pass your dataset to the cor.table function and get a table of correlations and p-values for all of your variables. You can specify Pearson's r or Spearman in the function. See this link for help:
http://www.inside-r.org/packages/cran/picante/docs/cor.table
Also remember to remove any non-numeric columns from your dataset prior to running the function. Here's an example piece of code:
install.packages("picante")
library(picante)
#Insert the name of your dataset in the code below
cor.table(dataset, cor.method="pearson")
You can use the sjt.corr function from the sjPlot package, which gives you a nicely formatted correlation table, ready for use in your Office application.
Simplest function call is just to pass the data frame:
sjt.corr(df)
See examples here.
I am interested in finding Pearson correlation coefficients between a list of genes. Basically, I have an Affymetrix gene-level expression matrix (genes in the rows and sample IDs in the columns), and annotation data from the microarray experiment with sample IDs in the rows and descriptive variables in the columns.
data:
> expr_mat[1:8, 1:3]
Tarca_001_P1A01 Tarca_003_P1A03 Tarca_004_P1A04
1_at 6.062215 6.125023 5.875502
10_at 3.796484 3.805305 3.450245
100_at 5.849338 6.191562 6.550525
1000_at 3.567779 3.452524 3.316134
10000_at 6.166815 5.678373 6.185059
100009613_at 4.443027 4.773199 4.393488
100009676_at 5.836522 6.143398 5.898364
10001_at 6.330018 5.601745 6.137984
> anodat[1:8, 1:3]
V1 V2 V3
1 SampleID GA Batch
2 Tarca_001_P1A01 11 1
3 Tarca_013_P1B01 15.3 1
4 Tarca_025_P1C01 21.7 1
5 Tarca_037_P1D01 26.7 1
6 Tarca_049_P1E01 31.3 1
7 Tarca_061_P1F01 32.1 1
8 Tarca_051_P1E03 19.7 1
goal:
I intend to see how each gene's expression across samples correlates with the GA value of the corresponding samples in the annotation data, and then generate a sub-expression matrix keeping only the genes that are highly correlated with anodat$GA.
my attempt:
gene_corrs <- function(expr_mat, anno_mat){
stopifnot(ncol(expr_mat)==nrow(anno_mat))
res <- list()
lapply(colnames(expr_mat), function(x){
lapply(x, rownames(y){
if(colnames(x) %in% rownames(anno_mat)){
cor_mat <- stats::cor(y, anno_mat$GA, method = "pearson")
ncor <- ncol(cor_mat)
cmatt <- col(cor_mat)
ord <- order(-cmat, cor_mat, decreasing = TRUE)- (ncor*cmatt - ncor)
colnames(ord) <- colnames(cor_mat)
res <- cbind(ID=c(cold(ord), ID2=c(ord)))
res <- as.data.frame(cbind(out, cor=cor_mat[res]))
res <- cbind(res, cor=cor_mat[out])
res <- as.dara.frame(res)
}
})
})
return(res)
}
However, my implementation above didn't return what I expected. I need to filter the genes by keeping only those that have a strong correlation with anodat$GA.
Another attempt:
I read a few posts about similar issues, and some people discussed using the limma package. Here is my attempt with limma, using anodat$GA as a covariate to fit a linear model:
library(limma)
fit <- limma::lmFit(expr_mat, design = model.matrix( ~ 0 + anodat$GA))
fit <- eBayes(fit)
topTable(fit, coef=2)
I am expecting to get a correlation matrix from the above code, and would like to do the following in order to get the filtered sub-expression matrix:
idx <- which( (abs(cor) > 0.8) & (upper.tri(cor)), arr.ind=TRUE)
idx <- unique(c(idx[, 1], idx[, 2]))
correlated.genes <- matrix[idx, ]
But I still didn't get the right answer. I am fairly confident about the limma approach, but I couldn't figure out what went wrong in the code above. Can anyone point out how to make this work? Is there an efficient way to do this?
I don't have your data, so it's hard to double-check, but in the abstract I would try this:
library(matrixTests)
cors <- row_cor_pearson(expr_mat, anodat$GA)
which(cors$cor > 0.9) # to get the indices of genes with correlation > 0.9
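From there, assuming anodat$GA is a numeric vector whose order matches the columns of expr_mat, a minimal sketch of the filtering step could be:
idx <- which(abs(cors$cor) > 0.9)    # use abs() if strong negative correlations should also be kept
correlated.genes <- expr_mat[idx, ]  # sub-expression matrix of the highly correlated genes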
I have found a solution in Python, but was looking for a solution in R. Is there an R equivalent of chi2.isf(p, df)? I know the R equivalent of chi2.sf(x, df) is 1 - pchisq(x, df).
There is a family of functions in R for the chi-square distribution: dchisq(), pchisq(), qchisq(), rchisq(). For your case you need qchisq(), which converts probabilities and degrees of freedom into chi-square statistics. The direct equivalent of chi2.isf(p, df) is qchisq(p, df, lower.tail = FALSE), which is the same as qchisq(1 - p, df). For a lower-tail quantile you simply call, for example:
qchisq(p = 0.01, df = 7)
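As a quick check that the two forms agree, with p = 0.01 and df = 7:
qchisq(0.01, df = 7, lower.tail = FALSE)  # 18.47531
qchisq(0.99, df = 7)                      # 18.47531, the same value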
To build a matrix with qchisq, I would do something like this. Feel free to change p-values and degrees of freedom as you need.
# Set p-values
p <- c(0.995, 0.99, 0.975, 0.95, 0.90, 0.10, 0.05, 0.025, 0.01, 0.005)
# Set degrees of freedom
df <- seq(1,20)
# Calculate a matrix of chisq statistics
m <- outer(p, df, function(x,y) qchisq(x,y))
# Transpose for a better view
m <- t(m)
# Set column and row names
colnames(m) <- p
rownames(m) <- df
m
0.995 0.99 0.975 0.95 0.9 0.1 0.05 0.025 0.01 0.005
1 7.879439 6.634897 5.023886 3.841459 2.705543 0.01579077 0.00393214 0.0009820691 0.0001570879 0.00003927042
2 10.596635 9.210340 7.377759 5.991465 4.605170 0.21072103 0.10258659 0.0506356160 0.0201006717 0.01002508365
3 12.838156 11.344867 9.348404 7.814728 6.251389 0.58437437 0.35184632 0.2157952826 0.1148318019 0.07172177459
4 14.860259 13.276704 11.143287 9.487729 7.779440 1.06362322 0.71072302 0.4844185571 0.2971094805 0.20698909350
5 16.749602 15.086272 12.832502 11.070498 9.236357 1.61030799 1.14547623 0.8312116135 0.5542980767 0.41174190383
6 18.547584 16.811894 14.449375 12.591587 10.644641 2.20413066 1.63538289 1.2373442458 0.8720903302 0.67572677746
7 20.277740 18.475307 16.012764 14.067140 12.017037 2.83310692 2.16734991 1.6898691807 1.2390423056 0.98925568313
8 21.954955 20.090235 17.534546 15.507313 13.361566 3.48953913 2.73263679 2.1797307473 1.6464973727 1.34441308701
9 23.589351 21.665994 19.022768 16.918978 14.683657 4.16815901 3.32511284 2.7003895000 2.0879007359 1.73493290500
10 25.188180 23.209251 20.483177 18.307038 15.987179 4.86518205 3.94029914 3.2469727802 2.5582121602 2.15585648130
11 26.756849 24.724970 21.920049 19.675138 17.275009 5.57778479 4.57481308 3.8157482522 3.0534841066 2.60322189052
12 28.299519 26.216967 23.336664 21.026070 18.549348 6.30379606 5.22602949 4.4037885070 3.5705689706 3.07382363809
13 29.819471 27.688250 24.735605 22.362032 19.811929 7.04150458 5.89186434 5.0087505118 4.1069154715 3.56503457973
14 31.319350 29.141238 26.118948 23.684791 21.064144 7.78953361 6.57063138 5.6287261030 4.6604250627 4.07467495740
15 32.801321 30.577914 27.488393 24.995790 22.307130 8.54675624 7.26094393 6.2621377950 5.2293488841 4.60091557173
16 34.267187 31.999927 28.845351 26.296228 23.541829 9.31223635 7.96164557 6.9076643535 5.8122124701 5.14220544304
17 35.718466 33.408664 30.191009 27.587112 24.769035 10.08518633 8.67176020 7.5641864496 6.4077597777 5.69721710150
18 37.156451 34.805306 31.526378 28.869299 25.989423 10.86493612 9.39045508 8.2307461948 7.0149109012 6.26480468451
19 38.582257 36.190869 32.852327 30.143527 27.203571 11.65091003 10.11701306 8.9065164820 7.6327296476 6.84397144548
20 39.996846 37.566235 34.169607 31.410433 28.411981 12.44260921 10.85081139 9.5907773923 8.2603983325 7.43384426293
If you want to create the contingency table, you can use the function table(). With it you get a table of this type:
table(var1, var2)
       var2$1 var2$2
var1$1      n      n
var1$2      n      n
where $1 and $2 stand for the levels of variables 1 and 2 (the first argument fills the rows, the second the columns).
If the variables are columns of a data frame, call it as table(namedf$var1, namedf$var2).
If you want to do the chi-squared test, the function is chisq.test():
chisq.test(namedf$var1, namedf$var2, correct=FALSE)
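A minimal made-up example (hypothetical smoker/disease vectors, just to show the workflow) might look like this:
# two hypothetical categorical variables
smoker  <- c("yes", "no", "no", "yes", "no", "yes", "no", "no")
disease <- c("yes", "yes", "no", "yes", "no", "no", "no", "yes")
tab <- table(smoker, disease)      # contingency table: smoker in rows, disease in columns
tab
chisq.test(tab, correct = FALSE)   # with counts this small R will warn that the approximation may be inaccurate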
I made a function g to calculate the log-likelihood of a multivariate normal distribution in order to classify the rows of a data frame test into 34 classes. So I need to apply this function over a data frame with several different sets of parameters. My code works but it's too slow. I want to make it faster, maybe by removing the for loop and using some apply-family function (with which I don't have any experience).
g = function(x,p,mu,Sigma){
  log(p) - log(det(Sigma)) - as.matrix(x-mu)%*%solve(Sigma)%*%t(as.matrix(x-mu))
}
library(doBy)                            # for summaryBy()
mu = summaryBy(.~class,train,FUN=mean)   # per-class means
Sigma = by(train[,1:8],train$class,cov)  # per-class covariance matrices
n = nrow(train)                          # n was not defined in the original snippet; total training rows assumed
p = as.data.frame(table(train$class))
p$Freq = p$Freq/n                        # class prior probabilities
k = length(levels(train$class))
logver = NULL
for(j in 1:k){
  logver = cbind(logver,apply(test,1,g,p=p$Freq[j],mu=mu[j,-1],Sigma=Sigma[[j]]))
}
preds = apply(logver,1,which.max)
The output logver must have one row for each row of test and one column for each j, so in this case 340000 rows and 34 columns.
p$Freq is a numeric vector of length 34. mu is a 34x9 data frame (the first column is the class factor generated by summaryBy). Sigma is a list of 34 elements, each an 8x8 covariance matrix. test is a 340000x8 data frame.
head(test)
band1 band2 band3 band4 band5 band6 band7 band9
1 2.0592 4.3630 6.6506 10.5952 18.4566 37.3683 36.9154 33.9467
2 2.5772 4.0766 6.0116 10.1476 18.8585 36.7654 36.2221 33.3717
3 2.8240 4.0766 6.4183 9.6813 18.5148 37.3113 35.7318 33.8367
4 2.8999 4.4317 7.4529 10.2842 18.4566 37.2513 37.3219 33.8367
5 2.8684 3.5324 7.5845 10.9021 19.2262 37.4758 36.0219 33.3472
6 2.4069 4.3256 6.0241 10.6668 20.0381 36.7203 36.4816 33.3472
head(train)
band1 band2 band3 band4 band5 band6 band7 band9 class
1 5.1224 8.1723 11.6837 15.6408 22.5884 33.9782 32.2985 32.1805 Green
2 5.4430 6.1158 9.8344 14.7719 23.2234 34.1247 32.0722 32.4367 Dry
3 4.5048 7.7364 11.9494 15.7740 22.6291 33.7642 32.4599 32.5217 Conifer
4 5.3120 6.9558 9.8344 14.3223 22.8088 34.0513 32.2985 32.4045 Snow
5 5.2907 6.6837 10.5367 16.0684 22.6291 33.7156 32.1650 32.4900 Ice
6 5.3120 6.8131 10.9727 15.7114 22.8088 34.6136 32.4772 32.4367 Soil
Thank you all
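One way to speed this up (a sketch, not tested on data of this size) is to replace the per-row quadratic form with base R's mahalanobis(), which computes (x - mu)' Sigma^{-1} (x - mu) for every row of test at once, leaving only the loop over the 34 classes:
# one column per class, one row per row of test
logver <- sapply(1:k, function(j){
  log(p$Freq[j]) - log(det(Sigma[[j]])) -
    mahalanobis(test, center = unlist(mu[j, -1]), cov = Sigma[[j]])
})
preds <- apply(logver, 1, which.max)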
I have a fitted binomial logit model and want to calculate the cumulative probability of experiencing an event at covariate values <= some cutoff.
For example, if I have a fitted model that predicts an outcome based on a continuous distance range (0-8.5 km), I might want to find the cumulative probability for distance <= 4.5 km.
I have vectors of estimated probabilities and the associated distances, as below:
dat <- structure(list(km = c(0, 0.447368421052632, 0.894736842105263,
1.34210526315789, 1.78947368421053, 2.23684210526316, 2.68421052631579,
3.13157894736842, 3.57894736842105, 4.02631578947368, 4.47368421052632,
4.92105263157895, 5.36842105263158, 5.81578947368421, 6.26315789473684,
6.71052631578947, 7.15789473684211, 7.60526315789474, 8.05263157894737,
8.5), prob = c(0.99010519543441, 0.985413663823809, 0.97854588563623,
0.968547716962174, 0.954108659036907, 0.933496091194704, 0.904551377544634,
0.864833064332603, 0.81202174997839, 0.744668375529677, 0.663191827576796,
0.570704402277059, 0.47300143764816, 0.377323442817887, 0.290336664745317,
0.216433162546689, 0.157174982015906, 0.111825887625402, 0.0783449309507567,
0.054275681518511)), .Names = c("km", "prob"), row.names = c(NA,
-20L), class = "data.frame")
What I ultimately want to say is "x% of observations within distance x are predicted to experience an event". Is this the right way to go about that?
Also, is there an easy way to calculate which distance (from 0 upwards) encompasses 50% of the cumulative probability?
Thanks, Tim
There is probably some way to extract this from your model, but if you were doing it from scratch I would try to fit your data to a distribution, then extract your relevant data points.
First define an error function:
rmse <- function(x,y) sqrt(sum((x-y)^2)/length(x)) # or some other error fxn
Now let's say your data sort of looks like a gamma distribution, so try:
gdf <- function(x, d=dat$km) pgamma(d,shape=x[1], scale=x[2])
So the function to optimize is the error between your data and the fitted distribution:
error_fxn <- function(x) rmse(rev(dat$prob),gdf(x)) # rev data to make ascending
Now optimize this function to get your parameters for the distribution of interest:
rr <- optim(c(1,1),error_fxn)
And let's see how good the fit is (just OK...):
rr
# $par
# [1] 3.108392 1.112584
# $value
# [1] 0.0333369
# $counts
# function gradient
#      119       NA
# $convergence
# [1] 0
# $message
# NULL
Or graphically:
with(dat,plot(km,prob,xlim=c(10,0)))
with(dat,lines(rev(km),pgamma(km,shape=rr$par[1], scale=rr$par[2]),col='red'))
Take a look at the values for the CDF:
kms <- seq(0,8.5,0.5)
data.frame(dist = kms, cdf = pgamma(kms,shape=rr$par[1], scale=rr$par[2]))
# dist cdf
# 1 0.0 0.000000000
# 2 0.5 0.008634055
# 3 1.0 0.053615340
# 4 1.5 0.137291689
# 5 2.0 0.245961242
# 6 2.5 0.363956061
# 7 3.0 0.479070721
# 8 3.5 0.583659363
# 9 4.0 0.673982194
# 10 4.5 0.749075757
# 11 5.0 0.809691054
# 12 5.5 0.857478086
# 13 6.0 0.894431622
# 14 6.5 0.922551998
# 15 7.0 0.943661710
# 16 7.5 0.959325076
# 17 8.0 0.970830577
# 18 8.5 0.979207658
And to answer your final question, get the distance at 50% of the CDF:
qgamma(0.5,shape=rr$par[1], scale=rr$par[2])
# [1] 3.095395
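As a sanity check, plugging the median of the fitted distribution back into its CDF returns 0.5:
q50 <- qgamma(0.5, shape = rr$par[1], scale = rr$par[2])
pgamma(q50, shape = rr$par[1], scale = rr$par[2])
# [1] 0.5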