Dear Stack Overflow users,
Here is an example table similar to what I have; mine has over 1000 proteins, and here I've included 2:
# for stack overflow
Accession <- rep(c("AT1G01320.1", "AT1G01050.1"), each =14)
Description<- rep(c("protein1", "protein2"), each = 14)
genotype <- c("WT", "WT","WT", "WT", "m", "m", "m", "f", "f", "f", "f", "ntrc", "ntrc", "ntrc")
variable <- c("WT1", "WT2","WT3", "WT4", "m1", "m2", "m3", "f1", "f2", "f3", "f4", "ntrc1", "ntrc2", "ntrc3", "WT1", "WT2","WT3", "WT4", "m1", "m2", "m3", "f1", "f2", "f3", "f4", "ntrc1", "ntrc2", "ntrc3")
value <- c(5535705, 8034106, 4879639, 6817736, 23109581, 3778870, 6020611, 4480108, 6131362, 4210275, 27630841, 4702864,2966520, 9065916, 151903.67, 417423.81, 2895121.80, 810620.92, 822284.83, 6477122.14, 12266704.79, 11196940.77, 12143974.82, 1040832.60, 136497.86, 9294097.54, 506386.62, 32266.71)
prot<- data.frame(Accession, Description, genotype, variable, value)
> prot
Accession Description genotype variable value
1 AT1G01320.1 protein1 WT WT1 5535705.00
2 AT1G01320.1 protein1 WT WT2 8034106.00
3 AT1G01320.1 protein1 WT WT3 4879639.00
4 AT1G01320.1 protein1 WT WT4 6817736.00
5 AT1G01320.1 protein1 m m1 23109581.00
6 AT1G01320.1 protein1 m m2 3778870.00
7 AT1G01320.1 protein1 m m3 6020611.00
8 AT1G01320.1 protein1 f f1 4480108.00
9 AT1G01320.1 protein1 f f2 6131362.00
10 AT1G01320.1 protein1 f f3 4210275.00
11 AT1G01320.1 protein1 f f4 27630841.00
12 AT1G01320.1 protein1 ntrc ntrc1 4702864.00
13 AT1G01320.1 protein1 ntrc ntrc2 2966520.00
14 AT1G01320.1 protein1 ntrc ntrc3 9065916.00
15 AT1G01050.1 protein2 WT WT1 151903.67
16 AT1G01050.1 protein2 WT WT2 417423.81
17 AT1G01050.1 protein2 WT WT3 2895121.80
18 AT1G01050.1 protein2 WT WT4 810620.92
19 AT1G01050.1 protein2 m m1 822284.83
20 AT1G01050.1 protein2 m m2 6477122.14
21 AT1G01050.1 protein2 m m3 12266704.79
22 AT1G01050.1 protein2 f f1 11196940.77
23 AT1G01050.1 protein2 f f2 12143974.82
24 AT1G01050.1 protein2 f f3 1040832.60
25 AT1G01050.1 protein2 f f4 136497.86
26 AT1G01050.1 protein2 ntrc ntrc1 9294097.54
27 AT1G01050.1 protein2 ntrc ntrc2 506386.62
28 AT1G01050.1 protein2 ntrc ntrc3 32266.71
>
I want to write a loop that first splits the original data frame (>1000 entries) into subsets by protein ID, then runs a one-way ANOVA and Tukey's HSD on each subset, extracts the adjusted p-values (p adj) from the Tukey results, and finally prints them to a PDF.
So far I have:
IDs <- unique(prot$Accession)
tukey_fullAA <- list()
table_fullAA <- NULL
for (i in 1:length(IDs)){
temp <- prot[(prot$Accession)==IDs[i],]
AV<- summary(aov(temp$value ~ temp$genotype))
tukey_fullAA <- list(TukeyHSD(aov(temp$value ~ temp$genotype)))
}
for(j in 1:length(tukey_fullAA))## important loop over whole list
{
tukey <- tukey_fullAA[[j]]
factor_table <- unlist(lapply(tukey, function(x) nrow(x)))
factor_table <- rep(names(factor_table), factor_table)
tukey_bound <- NULL
for (k in 1:length(tukey))
{
tukey_bound <- rbind(tukey_bound, tukey[[k]])
}
pairs <- rownames(tukey_bound)
rownames(tukey_bound) <- NULL
tukey_bound <- as.data.frame(tukey_bound)
tukey_bound$parameter <- factor_table
tukey_bound$pairs <- pairs
table_fullAA <- rbind(table_fullAA, tukey_bound)
}
As it stands, it doesn't loop over all proteins, and I struggled to get the Tukey HSD output into a table. Once I have the table, I want to find the significant p adj values. I am also unsure how to add a column saying which protein each row belongs to; I imagine it as a first column that repeats the protein name for as many rows as that protein's output needs.
Thanks a lot!
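One possible way to get this working is sketched below in base R. It is a sketch under assumptions rather than a definitive answer: it stores each Tukey result by index (so earlier proteins are not overwritten), fits the model with `data = temp` so the term is simply called `genotype`, and renames the `p adj` column to `p_adj` for convenience. The names `tukey_list`, `tukey_table` and `significant`, and the 0.05 cutoff, are illustrative choices.

```r
# Example data from the question (two proteins, 14 samples each)
Accession   <- rep(c("AT1G01320.1", "AT1G01050.1"), each = 14)
Description <- rep(c("protein1", "protein2"), each = 14)
genotype    <- rep(rep(c("WT", "m", "f", "ntrc"), times = c(4, 3, 4, 3)), times = 2)
value <- c(5535705, 8034106, 4879639, 6817736, 23109581, 3778870, 6020611,
           4480108, 6131362, 4210275, 27630841, 4702864, 2966520, 9065916,
           151903.67, 417423.81, 2895121.80, 810620.92, 822284.83, 6477122.14,
           12266704.79, 11196940.77, 12143974.82, 1040832.60, 136497.86,
           9294097.54, 506386.62, 32266.71)
prot <- data.frame(Accession, Description, genotype, value)

IDs <- unique(prot$Accession)
tukey_list <- vector("list", length(IDs))

for (i in seq_along(IDs)) {
  # Subset to one protein and run the one-way ANOVA on it
  temp <- prot[prot$Accession == IDs[i], ]
  fit  <- aov(value ~ genotype, data = temp)

  # TukeyHSD() returns one matrix per model term; here there is only "genotype"
  tuk <- as.data.frame(TukeyHSD(fit)$genotype)
  names(tuk)[names(tuk) == "p adj"] <- "p_adj"

  # Assign by index so results accumulate, and label rows with the protein ID
  tukey_list[[i]] <- data.frame(Accession = IDs[i], pairs = rownames(tuk), tuk)
}

# Stack all proteins into one table and keep only the significant comparisons
tukey_table <- do.call(rbind, tukey_list)
rownames(tukey_table) <- NULL
significant <- tukey_table[tukey_table$p_adj < 0.05, ]
```

The combined `tukey_table` has one row per genotype pair per protein, with the protein ID repeated in the first column.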
I'm carrying out a meta-analysis of within-subject (crossover) studies. I've read some papers that used the esc package (more precisely, the esc_mean_sd function) to calculate Hedges' g for this purpose. However, its output doubles the n of each study.
Please note that n = 12 for all three studies in the data, while the output shows n = 24.
ID mean_exp mean_con sd_exp sd_con n
1 A 150 130 15 22 12
2 B 166 145 10 8 12
3 C 179 165 11 14 12
# What I did:
e1 <- esc_mean_sd(data[1,2],data[1,4],data[1,6],
data[1,3],data[1,5],data[1,6],
r = .9,es.type = "g")
e2 <- esc_mean_sd(data[2,2],data[2,4],data[2,6],
data[2,3],data[2,5],data[2,6],
r = .9,es.type = "g")
e3 <- esc_mean_sd(data[3,2],data[3,4],data[3,6],
data[3,3],data[3,5],data[3,6],
r = .9,es.type = "g")
data2 <- combine_esc(e1, e2, e3)
colnames(data2) <- c("study","es","weight","n","se","var","lCI","uCI","measure")
head(data2, 3)
# study es weight n se var lCI uCI measure
# 1 1.80 4.18 24 0.489 0.239 0.842 2.76 g
# 2 4.53 1.60 24 0.791 0.626 2.983 6.08 g
# 3 2.14 3.71 24 0.519 0.269 1.126 3.16 g
I am very new to R and I need some advice about a very basic issue.
I want to create a new column that is the sum of existing columns in my data frame Data4.
The full code is this:
Data4$E<-(Data4$E1+Data4$E2+Data4$E3+Data4$E4+Data4$E5)
I would like to simplify the code and find a way to avoid writing out every column name each time.
I tried this, but it is wrong:
Data4$E<-(Data4$E[1:5])
Do you know a way to do it?
Thank you!
Among your options are:
set.seed(12)
Data4 <- data.frame(replicate(5, rnorm(5, 10, 1)))
colnames(Data4) <- paste0("E", 1:5)
# base R
Data4$E <- rowSums(Data4) # if there are just columns E1 to E5
Data4$E_option2 <- rowSums(subset(Data4, select = paste0("E", 1:5))) # if there are other columns ..
# "tidy"
library(tidyverse)
Data4 <- Data4 %>%
mutate(E_option3 = pmap_dbl(Data4 %>%
select(E1:E5),
sum))
# E1 E2 E3 E4 E5 E E_option2 E_option3
#1 8.519432 9.727704 9.222280 9.296536 10.223641 46.98959 46.98959 46.98959
#2 11.577169 9.684651 8.706118 11.188879 12.007201 53.16402 53.16402 53.16402
#3 9.043256 9.371745 9.220433 10.340512 11.011979 48.98793 48.98793 48.98793
#4 9.079995 9.893536 10.011952 10.506968 9.697541 49.18999 49.18999 49.18999
#5 8.002358 10.428015 9.847584 9.706695 8.974755 46.95941 46.95941 46.95941
Use functions like sum or rowSums. It seems you want row sums. These functions are better than + because they have an na.rm argument that controls whether or not NAs are ignored.
Data4$E <- rowSums(Data4[, c("E1", "E2", "E3", "E4", "E5")], na.rm = TRUE)
An easy way to generate column names is to paste them with numbers. Equivalently, we could write it so we can reuse this for other such operations:
E_col_names <- sprintf("E%d", 1:5)
Data4$E <- rowSums(Data4[, E_col_names], na.rm = TRUE)
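To make the na.rm point concrete, here is a small sketch on hypothetical toy data (three columns and one missing value, not the asker's real Data4). It compares `+`, `rowSums()` with its default, and `rowSums()` with `na.rm = TRUE`.

```r
# Hypothetical toy data with one missing value in E2
Data4 <- data.frame(E1 = c(1, 2), E2 = c(3, NA), E3 = c(5, 6))
E_col_names <- sprintf("E%d", 1:3)

# `+` propagates the NA: the second row becomes NA
sum_plus <- Data4$E1 + Data4$E2 + Data4$E3

# rowSums() behaves the same way by default (na.rm = FALSE)
sum_strict <- rowSums(Data4[, E_col_names])

# With na.rm = TRUE the missing value is skipped, so row 2 is 2 + 6 = 8
sum_na_rm <- rowSums(Data4[, E_col_names], na.rm = TRUE)
```

Whether skipping NAs is appropriate depends on the data; a row sum that silently ignores a missing column is not always what you want.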
One more way to do it in dplyr, demonstrated on the toy data created in one of the answers above: use E1:E5 inside c_across(). You can also use select helper functions such as starts_with() here.
#toy_data
set.seed(12)
Data4 <- data.frame(replicate(5, rnorm(5, 10, 1)))
colnames(Data4) <- paste0("E", 1:5)
library(dplyr)
Data4 %>% rowwise() %>%
mutate(E = sum(c_across(E1:E5)))
#> # A tibble: 5 x 6
#> # Rowwise:
#> E1 E2 E3 E4 E5 E
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 8.52 9.73 9.22 9.30 10.2 47.0
#> 2 11.6 9.68 8.71 11.2 12.0 53.2
#> 3 9.04 9.37 9.22 10.3 11.0 49.0
#> 4 9.08 9.89 10.0 10.5 9.70 49.2
#> 5 8.00 10.4 9.85 9.71 8.97 47.0
Created on 2021-05-25 by the reprex package (v2.0.0)
When I adjust p-values inside the pairs() function and when I extract the p-values and adjust them myself, I get different results. What is actually happening inside pairs()? Which is the right way to do it?
rm(list=ls())
id <- rep(1:5, each=3)
trt <- rep(LETTERS[1:3],5)
set.seed(1)
q1 <- runif(15)
set.seed(2)
q2 <- runif(15)
set.seed(3)
q3 <- runif(15)
df <- data.frame(id,trt,q1,q2,q3)
library(lme4)     # lmer()
library(car)      # Anova()
library(emmeans)  # emmeans(), pairs()
lm <- lmer(formula= df[,3] ~trt+ (1|id), data=df)
Anova(lm)
emm <- emmeans(lm,"trt")
a <- pairs(emm) # no adjustment
a
contrast estimate SE df t.ratio p.value
A - B 0.2085 0.134 8 1.560 0.3159
A - C -0.0359 0.134 8 -0.269 0.9612
B - C -0.2444 0.134 8 -1.829 0.2211
Note: contrasts are still on the [ scale
Degrees-of-freedom method: kenward-roger
P value adjustment: tukey method for comparing a family of 3 estimates
b <- pairs(emm, adjust="bonferroni") # adjust with bonferroni
b
contrast estimate SE df t.ratio p.value
A - B 0.2085 0.134 8 1.560 0.4721
A - C -0.0359 0.134 8 -0.269 1.0000
B - C -0.2444 0.134 8 -1.829 0.3145
Note: contrasts are still on the [ scale
Degrees-of-freedom method: kenward-roger
P value adjustment: bonferroni method for 3 tests
e <- data.frame(a)
e <- e$p.value
e <- p.adjust(e, method="bonferroni") # extract p-values and adjust
e
[1] 0.9475749 1.0000000 0.6633197
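A likely explanation, offered as a sketch rather than a definitive diagnosis: the footer of the first table says "P value adjustment: tukey method", so the p-values in `a` are already adjusted, and `p.adjust()` is then applied on top of them. The base-R check below rebuilds approximate unadjusted p-values from the printed `t.ratio` and `df` columns (these are rounded, so the results are only approximate) and compares the two routes. The variable names are illustrative.

```r
# t ratios and df copied from the printed pairs() output (rounded values)
t_ratio <- c("A - B" = 1.560, "A - C" = -0.269, "B - C" = -1.829)
df_kr   <- 8

# Approximate unadjusted two-sided p-values
p_raw <- 2 * pt(-abs(t_ratio), df = df_kr)

# Bonferroni on the raw p-values: should land close to the p-values in `b`
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Bonferroni on the Tukey-adjusted p-values printed in `a`:
# this is what the manual route computed, and it matches `e`
p_tukey  <- c(0.3159, 0.9612, 0.2211)
p_double <- p.adjust(p_tukey, method = "bonferroni")

round(rbind(p_bonf, p_double), 4)
```

If this reading is right, extracting the p-values from `pairs(emm, adjust = "none")` before calling `p.adjust()` should make the manual result agree with `pairs(emm, adjust = "bonferroni")`.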
I have two data frames. One of them has codes (1 or -1) for different IDs.
data.1 <- read.table(header = TRUE, text = "
IDs qt1 qt2 qt3
pl1 -1 -1 -1
pl2 1 -1 1
pl3 1 1 1
pl4 -1 -1 -1
pl5 1 1 1
pl6 1 1 1
pl7 1 -1 1
pl8 1 1 1
pl9 -1 -1 -1
pl0 -1 -1 -1
")
And I have another data frame with three columns: variable, parameter and estimate.
Data.2 <- read.table(header = TRUE, text = "
variable parameter estimate
varA a0 2.3
varA a1 0.859
varA a2 0.527
varA a3 0.774
VarB b0 19.08
VarB b1 0.412
VarB b2 0.022
VarB b3 0.448
VarC c0 5.4
VarC c1 0.492
VarC c2 0.094
VarC c3 0.971
")
For each ID, I need to estimate the value of each variable. For example, for pl1 and varA, the value I need to calculate is a0 + (a1*qt1) + (a2*qt2) + (a3*qt3).
The expected result for each of the IDs would be something like this:
Of course this is a mock-up example; I have hundreds of IDs and variables, so I need an automatic way to do this. I was exploring options with dplyr::rowwise and trying to write a function, but couldn't come up with sensible code.
Any help would be really appreciated.
Thanks
You can split the qt values by row and insert a 1 as the first value, split the estimates by variable and then multiply and sum:
qt_vals <- split(cbind(qt0 = 1, data.1[-1]), f = data.1$IDs)
vals <- split(Data.2$estimate, f = Data.2$variable)
sapply(vals, function(x) sapply(qt_vals, function(y) sum(x * y)))
varA VarB VarC
pl0 0.140 18.198 3.843
pl1 0.140 18.198 3.843
pl2 3.406 19.918 6.769
pl3 4.460 19.962 6.957
pl4 0.140 18.198 3.843
pl5 4.460 19.962 6.957
pl6 4.460 19.962 6.957
pl7 3.406 19.918 6.769
pl8 4.460 19.962 6.957
pl9 0.140 18.198 3.843
Note that you have pl10 in the image but pl0 in the example data which is the source of the discrepancy between the image and the result above.
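Since the formula a0 + (a1*qt1) + (a2*qt2) + (a3*qt3) is a dot product, the same calculation can also be sketched as a single matrix multiplication. This assumes every variable has the same number of parameters, listed in the same order (intercept first) in Data.2; the names `X`, `B` and `result` are illustrative.

```r
data.1 <- read.table(header = TRUE, text = "
IDs qt1 qt2 qt3
pl1 -1 -1 -1
pl2 1 -1 1
pl3 1 1 1
pl4 -1 -1 -1
pl5 1 1 1
pl6 1 1 1
pl7 1 -1 1
pl8 1 1 1
pl9 -1 -1 -1
pl0 -1 -1 -1
")
Data.2 <- read.table(header = TRUE, text = "
variable parameter estimate
varA a0 2.3
varA a1 0.859
varA a2 0.527
varA a3 0.774
VarB b0 19.08
VarB b1 0.412
VarB b2 0.022
VarB b3 0.448
VarC c0 5.4
VarC c1 0.492
VarC c2 0.094
VarC c3 0.971
")

# Design matrix: a column of 1s for the intercept, then qt1..qt3 (one row per ID)
X <- cbind(qt0 = 1, as.matrix(data.1[-1]))
rownames(X) <- as.character(data.1$IDs)

# Coefficient matrix: one column per variable, rows in the order given in Data.2
B <- do.call(cbind, split(Data.2$estimate, Data.2$variable))

# One row per ID, one column per variable
result <- X %*% B
```

With hundreds of IDs and variables this stays a single `%*%` call, and individual values can be looked up by name, for example `result["pl1", "varA"]`.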
Consider a cross join merge between the data frames after slight reshaping to wide format. Then, run your specified calculation without any loops.
# ADD COLUMN + RESHAPE WIDE
wide_data.2 <- reshape(transform(Data.2, var_letter=gsub("[a-z]", "", parameter)),
idvar = "variable", v.names = "estimate", drop = "parameter",
timevar = "var_letter", direction = "wide")
# CROSS JOIN MERGE + CALCULATION
merge_data <- within(merge(wide_data.2, data.1, by=NULL), {
calc_value <- estimate.0 + (estimate.1*qt1) + (estimate.2*qt2) + (estimate.3*qt3)
})
# RESHAPE WIDE
wide_merge_data <- reshape(merge_data[c("IDs", "calc_value", "variable")],
idvar = "IDs", v.names = "calc_value",
timevar = "variable", new.row.names = 1:nrow(data.1),
direction = "wide")
wide_merge_data
# IDs calc_value.varA calc_value.VarB calc_value.VarC
# 1 pl1 0.140 18.198 3.843
# 2 pl2 3.406 19.918 6.769
# 3 pl3 4.460 19.962 6.957
# 4 pl4 0.140 18.198 3.843
# 5 pl5 4.460 19.962 6.957
# 6 pl6 4.460 19.962 6.957
# 7 pl7 3.406 19.918 6.769
# 8 pl8 4.460 19.962 6.957
# 9 pl9 0.140 18.198 3.843
# 10 pl0 0.140 18.198 3.843
To create the data frame:
library(lme4)     # glmer()
library(lsmeans)  # lsmeans()
num <- sample(1:25, 20)
x <- data.frame("Day_eclosion" = num,
                "Developmental" = c("AP", "MA", "JU", "L"),
                "Replicate" = 1:5)
model <- glmer(Day_eclosion ~ Developmental + (1 | Replicate),
               family = "poisson", data = x)
I get this return from:
a <- lsmeans(model, pairwise~Developmental, adjust = "tukey")
a$contrasts
contrast estimate SE df z.ratio p.value
AP - JU 0.2051 0.0168 Inf 12.172 <.0001
AP - L 0.3009 0.0212 Inf 14.164 <.0001
AP - MA 0.3889 0.0209 Inf 18.631 <.0001
JU - L 0.0958 0.0182 Inf 5.265 <.0001
JU - MA 0.1839 0.0177 Inf 10.387 <.0001
L - MA 0.0881 0.0222 Inf 3.964 0.0004
I am looking for a simple way to turn this output (just p values) into:
     AP   MA       JU       L
AP   -    <.0001   <.0001   <.0001
MA   -    -        <.0001   0.0004
JU   -    -        -        <.0001
L    -    -        -
I have about 20 sets of these that I need to turn into tables, so the simpler and more general the better.
Bonus points if the output is tab-delimited or similar, so that I can easily paste it into Word/Excel.
Thanks!
Here's a function that works...
pvmat = function(emm, ...) {
emm = update(emm, by = NULL) # need to work harder otherwise
pv = test(pairs(emm, reverse = TRUE, ...)) $ p.value
fmtpv = sprintf("%6.4f", pv)
fmtpv[pv < 0.0001] = "<.0001"
lbls = do.call(paste, emm@grid[emm@misc$pri.vars])
n = length(lbls)
mat = matrix("", nrow = n, ncol = n, dimnames = list(lbls, lbls))
mat[upper.tri(mat)] = fmtpv
idx = seq_len(n - 1)
mat[idx, 1 + idx] # trim off last row and 1st col
}
Illustration:
require(emmeans)
> warp.lm = lm(breaks ~ wool * tension, data = warpbreaks)
> warp.emm = emmeans(warp.lm, ~ wool * tension)
> warp.emm
wool tension emmean SE df lower.CL upper.CL
A L 44.6 3.65 48 37.2 51.9
B L 28.2 3.65 48 20.9 35.6
A M 24.0 3.65 48 16.7 31.3
B M 28.8 3.65 48 21.4 36.1
A H 24.6 3.65 48 17.2 31.9
B H 18.8 3.65 48 11.4 26.1
Confidence level used: 0.95
> pm = pvmat(warp.emm, adjust = "none")
> print(pm, quote=FALSE)
    B L    A M    B M    A H    B H
A L 0.0027 0.0002 0.0036 0.0003 <.0001
B L        0.4170 0.9147 0.4805 0.0733
A M               0.3589 0.9147 0.3163
B M                      0.4170 0.0584
A H                             0.2682
Notes
As provided, this does not support by variables. Accordingly, the first line of the function disables them.
Using pairs(..., reverse = TRUE) generates the P values in the correct order needed later for upper.tri()
You can pass arguments to test() via ...
To create a tab-delimited version, use the clipr package:
clipr::write_clip(pm)
What you need is now in the clipboard and ready to paste into a spreadsheet.
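If the clipr package is not available, a base-R alternative is to write the matrix as tab-separated text with write.table(). The sketch below uses a small hypothetical character matrix, `pm_demo`, standing in for the `pm` result above, and captures the output as lines of text so it can be inspected.

```r
# Hypothetical 2 x 2 character matrix standing in for the pvmat() result
pm_demo <- matrix(c("0.0027", "", "0.0002", "0.4170"), nrow = 2,
                  dimnames = list(c("A L", "B L"), c("B L", "A M")))

# file = "" sends the output to the console; capture.output() collects it.
# col.names = NA adds an empty header cell above the row labels.
tsv_lines <- capture.output(
  write.table(pm_demo, file = "", sep = "\t", quote = FALSE, col.names = NA)
)
tsv_lines
```

Replacing `file = ""` with a file name such as "pvalues.tsv" (an illustrative name) writes the same text to disk, where it can be opened directly in a spreadsheet.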
Addendum
Answering this question inspired me to add a new function pwpm() to the emmeans package. It will appear in the next CRAN release, and is available now from the github site. It displays means and differences as well as P values; but the user may select which to include.
> pwpm(warp.emm)
wool = A
       L      M      H
L [44.6] 0.0007 0.0009
M 20.556 [24.0] 0.9936
H 20.000 -0.556 [24.6]
wool = B
       L      M      H
L [28.2] 0.9936 0.1704
M -0.556 [28.8] 0.1389
H  9.444 10.000 [18.8]
Row and column labels: tension
Upper triangle: P values adjust = "tukey"
Diagonal: [Estimates] (emmean)
Lower triangle: Comparisons (estimate) earlier vs. later