I am running PCR (principal component regression) on a data set, but the summary is giving me the same values for both CV and adjCV. Is this correct, or is there something wrong with the data?
Here is my code:
pcr <- pcr(F1~., data = data, scale = TRUE, validation = "CV")
summary(pcr)
validationplot(pcr)
validationplot(pcr, val.type = "MSEP")
validationplot(pcr, val.type = "R2")
predplot(pcr)
coefplot(pcr)
set.seed(123)
ind <- sample(2, nrow(data), replace = TRUE,
prob = c(0.8,0.2))
train <- data[ind ==1,]
test <- data[ind ==2,]
pcr_train <- pcr(F1~., data = train, scale =TRUE, validation = "CV")
y_test <- test[, 1]
pcr_pred <- predict(pcr, test, ncomp = 4)
mean((pcr_pred - y_test) ^2)
And I get this warning when I run the mean() command:
Warning in mean.default((pcr_pred - y_test)^2) :
argument is not numeric or logical: returning NA
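Could the problem be that y_test is not a numeric vector? With a regular data frame test[, 1] drops to a vector, but with a tibble it stays a one-column data frame, and mean() then warns exactly like this. A minimal check I would try (also predicting from pcr_train, which is presumably the model meant for the test set):
class(y_test)  # if this is "data.frame" or "tbl_df", that explains the warning
y_test <- as.numeric(test[[1]])                  # or test$F1
pcr_pred <- predict(pcr_train, test, ncomp = 4)  # model fitted on the training set
mean((pcr_pred - y_test)^2)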
Sample data:
F1 F2 F3 F4 F5
4.378 2.028 -5.822 -3.534 -0.546
4.436 2.064 -5.872 -3.538 -0.623
4.323 1.668 -5.954 -3.304 -0.782
5.215 3.319 -5.863 -4.139 -0.632
4.074 1.497 -6.018 -3.176 -0.697
4.403 1.761 -6 -3.339 -0.847
4.99 3.105 -5.985 -3.97 -0.638
4.783 2.968 -5.94 -3.903 -0.481
4.361 1.786 -5.866 -3.397 -0.685
4.594 1.958 -5.985 -3.457 -0.91
0.858 -4.734 -6.104 -0.692 -0.87
0.878 -3.846 -6.289 -1.064 -0.618
0.876 -4.479 -6.148 -0.803 -0.801
0.937 -5.498 -5.958 -0.376 -1.184
0.953 -4.71 -6.123 -0.705 -0.96
0.738 -5.386 -5.877 -0.444 -0.884
0.833 -5.562 -5.937 -0.343 -1.104
1.184 -3.52 -6.221 -1.234 -0.38
1.3 -4.129 -6.168 -0.963 -0.73
3.359 -3.618 -5.302 0.481 -0.649
3.483 -2.938 -5.361 0.157 -0.482
3.673 -3.779 -5.326 0.516 -1.053
2.521 -6.577 -4.499 1.861 -1.374
2.52 -4.757 -4.866 1.182 -0.736
2.482 -4.732 -4.857 1.142 -0.708
2.543 -6.699 -4.496 1.947 -1.426
2.458 -3.182 -5.219 0.514 -0.255
2.558 -5.66 -4.757 1.558 -1.142
2.627 -1.806 -5.313 -1.808 1.054
3.773 -0.526 -5.236 -0.6 -0.23
3.65 -0.954 -4.97 -0.361 -0.413
3.816 -1.18 -5.228 -0.284 -0.575
3.752 -0.522 -5.346 -0.562 -0.293
3.961 -0.24 -5.423 -0.69 -0.408
3.734 -0.711 -5.307 -0.479 -0.347
4.094 -0.415 -5.103 -0.729 -0.35
3.894 -0.957 -5.133 -0.435 -0.457
3.741 -0.484 -5.363 -0.574 -0.279
3.6 -0.698 -5.422 -0.435 -0.306
3.845 -0.351 -5.306 -0.666 -0.269
3.886 -0.481 -5.332 -0.596 -0.39
3.552 -2.106 -5.043 0.128 -0.634
4.336 -10.323 -2.95 3.346 -3.494
3.918 -0.809 -5.315 -0.442 -0.567
3.757 -0.502 -5.347 -0.572 -0.288
3.712 -0.627 -5.353 -0.505 -0.314
3.954 -0.72 -5.492 -0.428 -0.691
4.088 -0.588 -5.412 -0.53 -0.688
3.728 -0.641 -5.338 -0.505 -0.321
I'm trying to run a meta-analysis on many genes using the R package "metafor". I know how to do it one gene at a time, but it would be impractical to do that for thousands of genes. Could somebody help me out? I appreciate any suggestions!
I have the se and HR results for all the genes stored in objects named 'se_summary' and 'HR_summary', respectively.
I need to use both the se and HR of these genes from five studies (ICGC, TCGA, G71, G62, G8) as input for the meta-analysis.
The code I used to do the meta-analysis for a single gene (using gene AAK1 as an example) is:
library(metafor)
se.AAK1 <- as.numeric(se_summary[rownames(se_summary) == 'AAK1',][,-1])
HR.AAK1 <- as.numeric(HR_summary[rownames(HR_summary) == 'AAK1',][,-1])
beta.AAK1 <- log(HR.AAK1)
#### First, I need to fit the random-effects model to see whether the test for heterogeneity is significant or not.
pool.AAK1 <- rma(beta.AAK1, sei=se.AAK1)
summary(pool.AAK1)
#### and this gives the following output:
#>Random-Effects Model (k = 5; tau^2 estimator: REML)
#> logLik deviance AIC BIC AICc
#> -2.5686 5.1372 9.1372 7.9098 21.1372
#>tau^2 (estimated amount of total heterogeneity): 0.0870 (SE = 0.1176)
#>tau (square root of estimated tau^2 value): 0.2950
#>I^2 (total heterogeneity / total variability): 53.67%
#>H^2 (total variability / sampling variability): 2.16
#>Test for Heterogeneity:
#>Q(df = 4) = 8.5490, p-val = 0.0734
#>Model Results:
#>estimate se zval pval ci.lb ci.ub
#> -0.3206 0.1832 -1.7500 0.0801 -0.6797 0.0385 .
#>---
#>Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#### If I^2 > 50%, we keep the random-effects model, but if I^2 <= 50%, we use the fixed-effect model instead.
pool.AAK1 <- rma(beta.AAK1, sei=se.AAK1, method="FE")
summary(pool.AAK1)
####this gives the following output:
#>Fixed-Effects Model (k = 5)
#> logLik deviance AIC BIC AICc
#> -2.5793 8.5490 7.1587 6.7681 8.4920
#>Test for Heterogeneity:
#>Q(df = 4) = 8.5490, p-val = 0.0734
#>Model Results:
#>estimate se zval pval ci.lb ci.ub
#> -0.2564 0.1191 -2.1524 0.0314 -0.4898 -0.0229 *
#>---
#>Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This works just fine for a single gene, but I need to do it for all the genes at once and then export the output, including the heterogeneity p-value and all the model results (estimate, se, zval, pval, ci.lb, ci.ub), to one .txt file with one row per gene. The output should look like this:
Gene_symbol Heterogeneity_p-val estimate se zval pval ci.lb ci.ub
AAK1 0.0734 -0.2564 0.1191 -2.1524 0.0314 -0.4898 -0.0229
A2M 0.9664 0.1688 0.1173 1.4388 0.1502 -0.0611 0.3987
In case it is needed, here is a piece of the sample data "se_summary":
Gene_symbol ICGC_se TCGA_se G71_se G62_se G8_se
A1CF 0.312 0.21 0.219 0.292 0.381
A2M 0.305 0.21 0.219 0.292 0.387
A2ML1 0.314 0.211 0.222 0.289 0.389
A4GALT 0.305 0.21 0.225 0.288 0.388
A4GNT 0.306 0.211 0.222 0.288 0.385
AAAS 0.308 0.213 0.223 0.298 0.38
AACS 0.307 0.209 0.221 0.287 0.38
AADAC 0.302 0.212 0.221 0.293 0.404
AADAT 0.308 0.214 0.22 0.288 0.391
AAK1 0.304 0.209 0.22 0.303 0.438
AAMP 0.303 0.211 0.222 0.288 0.394
And a piece of the sample data "HR_summary":
Gene_symbol ICGC_HR TCGA_HR G71_HR G62_HR G8_HR
A1CF 1.689 1.427 0.864 1.884 1.133
A2M 1.234 1.102 1.11 1.369 1.338
A2ML1 0.563 0.747 0.535 1.002 0.752
A4GALT 0.969 0.891 0.613 0.985 0.882
A4GNT 1.486 0.764 1.051 1.317 1.465
AAAS 1.51 1.178 1.076 0.467 0.681
AACS 1.4 1.022 1.255 1.006 1.416
AADAC 0.979 0.642 1.236 1.581 1.234
AADAT 1.366 1.405 1.18 1.057 1.408
AAK1 1.04 0.923 0.881 0.469 0.329
AAMP 1.122 0.639 1.473 0.964 1.284
Point 1: if your data were collected from different populations, you should not use a fixed-effect model, because the HR could differ among the populations.
Point 2: if you convert HR to log(HR), then the SE should also be calculated on the log(HR) scale.
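For example, if a study only reports the HR with a 95% confidence interval, the SE on the log scale is usually recovered as (log(upper) - log(lower)) / (2 * 1.96). A quick sketch with made-up numbers:
HR <- 1.40   # hypothetical reported hazard ratio
lo <- 1.05   # hypothetical lower 95% limit
hi <- 1.87   # hypothetical upper 95% limit
beta <- log(HR)                                  # effect on the log scale
se   <- (log(hi) - log(lo)) / (2 * qnorm(0.975)) # se of log(HR)
c(beta = beta, se = se)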
your data:
se_summary=data.frame(
Gene_symbol=c("A1CF","A2M","A2ML1","A4GALT","A4GNT","AAAS","AACS","AADAC","AADAT","AAK1","AAMP"),
ICGC_se=c(0.312,0.305,0.314,0.305,0.306,0.308,0.307,0.302,0.308,0.304,0.303),
TCGA_se=c(0.21,0.21,0.211,0.21,0.211,0.213,0.209,0.212,0.214,0.209,0.211),
G71_se=c(0.219,0.219,0.222,0.225,0.222,0.223,0.221,0.221,0.22,0.22,0.222),
G62_se=c(0.292,0.292,0.289,0.288,0.288,0.298,0.287,0.293,0.288,0.303,0.288),
G8_se=c(0.381,0.387,0.389,0.388,0.385,0.38,0.38,0.404,0.391,0.438,0.394))
and
HR_summary=data.frame(
Gene_symbol=c("A1CF","A2M","A2ML1","A4GALT","A4GNT","AAAS","AACS","AADAC","AADAT","AAK1","AAMP"),
ICGC_HR=c(1.689,1.234,0.563,0.969,1.486,1.51,1.4,0.979,1.366,1.04,1.122),
TCGA_HR=c(1.427,1.102,0.747,0.891,0.764,1.178,1.022,0.642,1.405,0.923,0.639),
G71_HR=c(0.864,1.11,0.535,0.613,1.051,1.076,1.255,1.236,1.18,0.881,1.473),
G62_HR=c(1.884,1.369,1.002,0.985,1.317,0.467,1.006,1.581,1.057,0.469,0.964),
G8_HR=c(1.133,1.338,0.752,0.882,1.465,0.681,1.416,1.234,1.408,0.329,1.284))
1) Merge the data
data=cbind(se_summary,log(HR_summary[,-1]))
2) A function to calculate the pooled (meta-analysis) log HR
met=function(x) {
  # columns 7:11 hold log(HR), columns 2:6 hold the corresponding SEs
  y=rma(as.numeric(x[7:11]), sei=as.numeric(x[2:6]))
  # collect the pooled estimate and related statistics (b and beta are identical)
  y=c(y$b,y$beta,y$se,y$zval,y$pval,y$ci.lb,y$ci.ub,y$tau2,y$I2)
  y
}
3) Apply the function to all rows
results=data.frame(t(apply(data,1,met)))
rownames(results)=data$Gene_symbol
colnames(results)=c("b","beta","se","zval","pval","ci.lb","ci.ub","tau2","I2")
4) Results
> results
b beta se zval pval
A1CF 0.27683114 0.27683114 0.1538070 1.7998601 0.071882735
A2M 0.16877042 0.16877042 0.1172977 1.4388214 0.150201136
A2ML1 -0.37676308 -0.37676308 0.1182825 -3.1852811 0.001446134
A4GALT -0.18975044 -0.18975044 0.1179515 -1.6087159 0.107678477
A4GNT 0.09500277 0.09500277 0.1392486 0.6822528 0.495079085
AAAS -0.07012629 -0.07012629 0.2000932 -0.3504680 0.725987468
AACS 0.15333550 0.15333550 0.1170061 1.3104915 0.190029610
AADAC 0.04902471 0.04902471 0.1738017 0.2820727 0.777887764
AADAT 0.23785528 0.23785528 0.1181503 2.0131593 0.044097875
AAK1 -0.32062727 -0.32062727 0.1832183 -1.7499744 0.080122725
AAMP 0.02722082 0.02722082 0.1724461 0.1578512 0.874574077
ci.lb ci.ub tau2 I2
A1CF -0.024625107 0.57828740 0.04413257 37.89339
A2M -0.061128821 0.39866965 0.00000000 0.00000
A2ML1 -0.608592552 -0.14493360 0.00000000 0.00000
A4GALT -0.420931120 0.04143024 0.00000000 0.00000
A4GNT -0.177919527 0.36792508 0.02455208 25.35146
AAAS -0.462301836 0.32204926 0.12145183 62.23915
AACS -0.075992239 0.38266324 0.00000000 0.00000
AADAC -0.291620349 0.38966978 0.07385974 50.18761
AADAT 0.006285038 0.46942552 0.00000000 0.00000
AAK1 -0.679728455 0.03847392 0.08700387 53.66905
AAMP -0.310767314 0.36520895 0.07266674 50.07330
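The OP also asked for the heterogeneity p-value and a .txt export; the function above could be extended to return the Q-test p-value (QEp) and the table written out, for example (column names and file name are my own choice):
met2 = function(x) {
  y = rma(as.numeric(x[7:11]), sei = as.numeric(x[2:6]))
  c(estimate = y$b, se = y$se, zval = y$zval, pval = y$pval,
    ci.lb = y$ci.lb, ci.ub = y$ci.ub, tau2 = y$tau2, I2 = y$I2,
    Heterogeneity_pval = y$QEp)
}
results2 = data.frame(Gene_symbol = data$Gene_symbol, t(apply(data, 1, met2)))
# one row per gene, tab-delimited
write.table(results2, "meta_results.txt", sep = "\t", row.names = FALSE, quote = FALSE)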
Put the data in long format, with the effect sizes and the se values side by side, then split by gene and apply rma() to each piece. You can write your own version of broom's tidy() function just for rma objects.
library(metafor)
library(reshape)     # melt()
library(dplyr)       # %>%, mutate(), rename()
library(data.table)  # rbindlist()
se_summary<-read.table(text="
Gene_symbol ICGC_se TCGA_se G71_se G62_se G8_se
AADAT 0.308 0.214 0.22 0.288 0.391
AAK1 0.304 0.209 0.22 0.303 0.438
AAMP 0.303 0.211 0.222 0.288 0.394
",header=T)
HR_summary<-read.table(text="
Gene_symbol ICGC_HR TCGA_HR G71_HR G62_HR G8_HR
AADAT 1.366 1.405 1.18 1.057 1.408
AAK1 1.04 0.923 0.881 0.469 0.329
AAMP 1.122 0.639 1.473 0.964 1.284
",header=T)
HR_summary <- melt(HR_summary, id.vars = "Gene_symbol") %>%
  mutate(., variable = sapply(strsplit(as.character(variable), split = '_', fixed = TRUE), function(x) (x[1]))) %>%
  rename(study = variable)
se_summary <- melt(se_summary, id.vars = "Gene_symbol") %>%
  mutate(., variable = sapply(strsplit(as.character(variable), split = '_', fixed = TRUE), function(x) (x[1]))) %>%
  rename(study = variable)
HR_summary <- merge(HR_summary, se_summary, by = c("Gene_symbol", "study"), suffixes = c(".HR", ".se"))
tidy.rma <- function(x) {
  # the main stuff: overall effect size, its se/z/p, CI, number of studies (k),
  # and the p-value of the test for heterogeneity (QEp)
  return(data.frame(estimate = x$b, se = x$se, zval = x$zval, pval = x$pval,
                    ci.lb = x$ci.lb, ci.ub = x$ci.ub, k = x$k, Heterog_pv = x$QEp))
}
rbindlist(lapply(split(HR_summary, droplevels(HR_summary$Gene_symbol)),
                 function(x) with(x, tidy.rma(rma(yi = log(value.HR), sei = value.se, method = "FE")))),
          idcol = "Gene_symbol")
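To get this into the OP's requested one-row-per-gene .txt file, the combined table (a data.table) can be assigned and written out, e.g. (the file name is arbitrary):
res <- rbindlist(lapply(split(HR_summary, droplevels(HR_summary$Gene_symbol)),
                 function(x) with(x, tidy.rma(rma(yi = log(value.HR), sei = value.se, method = "FE")))),
                 idcol = "Gene_symbol")
fwrite(res, "meta_results_FE.txt", sep = "\t")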
I have to plot data from immunized animals in a way that visualizes possible correlations with protection. As background, when we vaccinate an animal it produces antibodies, which may or may not be linked to protection. We immunized cattle with 9 different proteins and measured antibody titers, which go up to 1.5 (optical density, O.D.). We also measured tick load, which goes up to 5000. Each animal has different titers for each protein and a different tick load; maybe some proteins are more important for protection than others, and we think a heatmap could illustrate this.
TL;DR: Plot a heatmap with one variable (Ticks) that ranges from 6 up to 5000, and another set of variables (Prot1 to Prot9) that go up to 1.5.
A sample of my data:
Animal Group Ticks Prot1 Prot2 Prot3 Prot4 Prot5 Prot6 Prot7 Prot8 Prot9
G1-54-102 control 3030 0.734 0.402 0.620 0.455 0.674 0.550 0.654 0.508 0.618
G1-130-102 control 5469 0.765 0.440 0.647 0.354 0.528 0.525 0.542 0.481 0.658
G1-133-102 control 2070 0.367 0.326 0.386 0.219 0.301 0.231 0.339 0.247 0.291
G3-153-102 vaccinated 150 0.890 0.524 0.928 0.403 0.919 0.593 0.901 0.379 0.647
G3-200-102 vaccinated 97 1.370 0.957 1.183 0.658 1.103 0.981 1.051 0.534 1.144
G3-807-102 vaccinated 606 0.975 0.706 1.058 0.626 1.135 0.967 0.938 0.428 1.035
I have little knowledge of R, but I'm really excited to learn more about it. So feel free to post whatever code you want and I will try my best to understand it.
Thank you in advance.
Luiz
Here is an option using the ggplot2 package to create a heatmap. You will need to convert your data frame from wide format to long format. It is also important to convert the Ticks column from numeric to factor if the numbers are to be treated as discrete categories.
library(tidyverse)
library(viridis)
dat2 <- dat %>%
gather(Prot, Value, starts_with("Prot"))
ggplot(dat2, aes(x = factor(Ticks), y = Prot, fill = Value)) +
geom_tile() +
scale_fill_viridis()
DATA
dat <- read.table(text = "Animal Group Ticks Prot1 Prot2 Prot3 Prot4 Prot5 Prot6 Prot7 Prot8 Prot9
'G1-54-102' control 3030 0.734 0.402 0.620 0.455 0.674 0.550 0.654 0.508 0.618
'G1-130-102' control 5469 0.765 0.440 0.647 0.354 0.528 0.525 0.542 0.481 0.658
'G1-133-102' control 2070 0.367 0.326 0.386 0.219 0.301 0.231 0.339 0.247 0.291
'G3-153-102' vaccinated 150 0.890 0.524 0.928 0.403 0.919 0.593 0.901 0.379 0.647
'G3-200-102' vaccinated 97 1.370 0.957 1.183 0.658 1.103 0.981 1.051 0.534 1.144
'G3-807-102' vaccinated 606 0.975 0.706 1.058 0.626 1.135 0.967 0.938 0.428 1.035",
header = TRUE, stringsAsFactors = FALSE)
In the newest version of ggplot2 / the tidyverse, you don't even need to explicitly load the viridis package; the scale is included via scale_fill_viridis_c(). Exciting times!
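For example, the same heatmap without loading viridis (assuming ggplot2 >= 3.0.0):
ggplot(dat2, aes(x = factor(Ticks), y = Prot, fill = Value)) +
  geom_tile() +
  scale_fill_viridis_c()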
I am using the effects R package and its effect() function on a Cox model. There is a default method for this function, so in principle it should work for any model.
When I try to use this function I get this error:
> eff_cf <- effect("TP53:MDM2", model)
Error in mod.matrix %*% mod$coefficients[!is.na(coef(mod))] :
non-conformable arguments
My model looks like this:
> model
Call:
coxph(formula = Surv(times, patient.vital_status) ~ TP53 + MDM2 +
TP53:MDM2, data = clinForPlot)
coef exp(coef) se(coef) z p
TP53Other -0.163 0.850 0.217 -0.752 4.5e-01
TP53WILD -1.086 0.337 0.277 -3.928 8.6e-05
MDM2(1183.7,1674.7] -0.669 0.512 0.235 -2.851 4.4e-03
MDM2(1674.7,2248.5] -0.744 0.475 0.305 -2.444 1.5e-02
MDM2(2248.5,50339] -0.867 0.420 0.375 -2.308 2.1e-02
TP53Other:MDM2(1183.7,1674.7] 0.394 1.483 0.412 0.958 3.4e-01
TP53WILD:MDM2(1183.7,1674.7] 0.133 1.142 0.413 0.323 7.5e-01
TP53Other:MDM2(1674.7,2248.5] -0.192 0.825 0.517 -0.372 7.1e-01
TP53WILD:MDM2(1674.7,2248.5] 0.546 1.726 0.433 1.260 2.1e-01
TP53Other:MDM2(2248.5,50339] -0.140 0.869 0.650 -0.215 8.3e-01
TP53WILD:MDM2(2248.5,50339] 0.786 2.195 0.484 1.623 1.0e-01
Likelihood ratio test=72.8 on 11 df, p=3.54e-11 n= 1321, number of events= 258
The model and the data.frame used for the model can be reproduced using this code:
library(archivist)
model <- loadFromGithubRepo("68eeefba87be70364eb3801cec58eb3d",
user = "MarcinKosinski",
repo = "Museum",
value = TRUE)
clinForPlot <- loadFromGithubRepo("cfa5145e6b98964d5f8b760bf749e426",
user = "MarcinKosinski",
repo = "Museum",
value = TRUE)
Any idea how to fix this and what is wrong?
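In the meantime, I can at least look at the TP53:MDM2 combinations directly with predict.coxph() (assuming TP53 and MDM2 are factors in clinForPlot), but I would still like effect() to work:
nd <- expand.grid(TP53 = levels(clinForPlot$TP53),
                  MDM2 = levels(clinForPlot$MDM2))
# linear predictor and its standard error for every TP53/MDM2 combination
pr <- predict(model, newdata = nd, type = "lp", se.fit = TRUE)
cbind(nd, lp = pr$fit, se = pr$se.fit)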
Using R, what is the best way to read a symmetric matrix from a file that omits the upper triangular part? For example,
1.000
.505 1.000
.569 .422 1.000
.602 .467 .926 1.000
.621 .482 .877 .874 1.000
.603 .450 .878 .894 .937 1.000
I have tried read.table, but haven't been successful.
Here's a read.table solution, with no loops and no *apply calls:
txt <- "1.000
.505 1.000
.569 .422 1.000
.602 .467 .926 1.000
.621 .482 .877 .874 1.000
.603 .450 .878 .894 .937 1.000"
# Could use clipboard or read this from a file as well.
mat <- data.matrix( read.table(text=txt, fill=TRUE, col.names=paste0("V", 1:6)) )
mat[upper.tri(mat)] <- t(mat)[upper.tri(mat)]
> mat
V1 V2 V3 V4 V5 V6
[1,] 1.000 0.505 0.569 0.602 0.621 0.603
[2,] 0.505 1.000 0.422 0.467 0.482 0.450
[3,] 0.569 0.422 1.000 0.926 0.877 0.878
[4,] 0.602 0.467 0.926 1.000 0.874 0.894
[5,] 0.621 0.482 0.877 0.874 1.000 0.937
[6,] 0.603 0.450 0.878 0.894 0.937 1.000
I copied your text, and then used tt <- file('clipboard','rt') to import it. For a standard file:
tt <- file("yourfile.txt",'rt')
a <- readLines(tt)
b <- strsplit(a," ") #insert delimiter here; can use regex
b <- lapply(b,function(x) {
x <- as.numeric(x)
length(x) <- max(unlist(lapply(b,length)));
return(x)
})
b <- do.call(rbind,b)
b[is.na(b)] <- 0
#kinda kludgy way to get the symmetric matrix
b <- b + t(b) - diag(b[1,1],nrow=dim(b)[1],ncol=dim(b)[2])
I'm posting this anyway, but I like Blue Magister's approach way better. Maybe there's something in it that's of use.
mat <- readLines(n=6)
1.000
.505 1.000
.569 .422 1.000
.602 .467 .926 1.000
.621 .482 .877 .874 1.000
.603 .450 .878 .894 .937 1.000
nmat <- lapply(mat, function(x) unlist(strsplit(x, "\\s+")))
lens <- sapply(nmat, length)
dlen <- max(lens) -lens
bmat <- lapply(seq_along(nmat), function(i) {
as.numeric(c(nmat[[i]], rep(NA, dlen[i])))
})
mat <- do.call(rbind, bmat)
mat[upper.tri(mat)] <- t(mat)[upper.tri(mat)]
mat
Here is an approach which also works if the dimensions of the matrix are unknown.
# read file as a vector
mat <- scan("file.txt", what = numeric())
# calculate the number of columns (and rows): a lower triangle with n columns
# holds n * (n + 1) / 2 values, so solve for n
ncol <- (sqrt(8 * length(mat) + 1) - 1) / 2
# index of the diagonal values
diag_idx <- cumsum(seq.int(ncol))
# generate split index
split_idx <- cummax(sequence(seq.int(ncol)))
split_idx[diag_idx] <- split_idx[diag_idx] - 1
# split vector into list of rows
splitted_rows <- split(mat, f = split_idx)
# generate matrix
mat_full <- suppressWarnings(do.call(rbind, splitted_rows))
mat_full[upper.tri(mat_full)] <- t(mat_full)[upper.tri(mat_full)]
mat_full
[,1] [,2] [,3] [,4] [,5] [,6]
0 1.000 0.505 0.569 0.602 0.621 0.603
1 0.505 1.000 0.422 0.467 0.482 0.450
2 0.569 0.422 1.000 0.926 0.877 0.878
3 0.602 0.467 0.926 1.000 0.874 0.894
4 0.621 0.482 0.877 0.874 1.000 0.937
5 0.603 0.450 0.878 0.894 0.937 1.000
This won't work in the OP's case because the diagonal is 1, but if the diagonal is zero or missing, then you can use as.dist() followed by as.matrix() to copy the lower triangle to the upper triangle and set the diagonal to zero:
input=" Pop0 Pop1 Pop2
Pop0
Pop1 0.015
Pop2 0.079 0.083
Pop3 0.014 0.016 0.073"
as.matrix(as.dist(cbind(read.table(text=input,fill=T),NA)))
Result:
Pop0 Pop1 Pop2 Pop3
Pop0 0.000 0.015 0.079 0.014
Pop1 0.015 0.000 0.083 0.016
Pop2 0.079 0.083 0.000 0.073
Pop3 0.014 0.016 0.073 0.000
In my case the input had column names, so read.table(fill=T) was automatically able to determine the number of columns, and IRTFM's trick of specifying col.names=1:4 was not needed.