In my dataframe, the three responses (yes, maybe, no) to a question are printed as three separate variables (a binary outcome of each possible response).
I want to combine the three binary responses into one variable, showing which response was selected.
The following piece of code does this:
data$var1 <- ifelse(data$var1.Yes, 0,
ifelse(data$var1.Maybe, 1,
ifelse(data$var1.No,2, NA)))
However, because I have many variables (e.g., var1, var2, var3, etc.), I want to write a function or loop that runs this code for multiple variables whose column names contain ascending numbers.
I thought of the following function:
fun <- function(i){
paste0("data$var", i) <- ifelse(paste0("data$var", i, ".Yes"), 0,
ifelse(paste0("data$var",i,".Maybe"), 1,
ifelse(paste0("data$var",i,".No"),2, NA)))
}
fun(1:3)
Unfortunately, this does not work. How can I apply this function to several variables at once?
dput(test)
structure(list(var1.Yes = c(0, 0, 1, 0, 1, 1, 1, 0, NA, 1),
var1.Maybe = c(1, 0, 0, 1, 0, 0, 0, 0, NA, 0),
var1.No= c(0, 1, 0, 0, 0, 0, 0, 1, NA, 1),
var2.Yes = c(0, 0, 1, NA, 1, 1, 1, 0, 0, 1),
var2.Maybe = c(0, 1, 1, NA, 0, 0, 0, 0, 0, 0),
var2.No= c(1, 0, 0, NA, 0, 0, 0, 1, 1, 0),
var3.Yes = c(0, 1, 0, 0, 0, 0, 0, NA, 0, 1),
var3.Maybe = c(0, 0, 0, 0, 1, 1, 1, NA, 1, 0),
var3.No = c(1, 0, 1, 1, 0, 0, 0, NA, 0, 0)),
row.names = c(NA, -10L), class = "data.frame")
You can loop over the columns in groups of three:
lapply_results <- lapply(1:(ncol(test)/3), function(col) ifelse(test[,col*3-2], 0,
ifelse(test[,col*3-1], 1,
ifelse(test[,col*3], 2, NA))))
lapply_results
# [[1]]
# [1] 1 2 0 1 0 0 0 2 NA 0
#
# [[2]]
# [1] 2 1 0 NA 0 0 0 2 2 0
#
# [[3]]
# [1] 2 0 2 2 1 1 1 NA 1 0
This can be merged with your data:
cbind(test, matrix(unlist(lapply_results), nrow = nrow(test)))
Data:
data.frame(
var1.Yes = c(0, 0, 1, 0, 1, 1, 1, 0, NA, 1),
var1.Maybe= c(1, 0, 0, 1, 0, 0, 0, 0, NA, 0),
var1.No = c(0, 1, 0, 0, 0, 0, 0, 1, NA, 1),
var2.Yes = c(0, 0, 1, NA, 1, 1, 1, 0, 0, 1),
var2.Maybe= c(0, 1, 1, NA, 0, 0, 0, 0, 0, 0),
var2.No = c(1, 0, 0, NA, 0, 0, 0, 1, 1, 0),
var3.Yes = c(0, 1, 0, 0, 0, 0, 0, NA, 0, 1),
var3.Maybe= c(0, 0, 0, 0, 1, 1, 1, NA, 1, 0),
var3.No = c(1, 0, 1, 1, 0, 0, 0, NA, 0, 0)) -> test
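If you would rather address the columns by name than by position (which is closer to what the question's `fun` was attempting), a minimal sketch could look like the one below. It assumes every variable has `.Yes`, `.Maybe` and `.No` columns coded 0/1, and `combine_responses` is a hypothetical helper name, not something from the question. The key point is that `paste0()` only builds a string, so the column has to be looked up with `[[ ]]` rather than pasted into `data$...`.

```r
# Example data, copied from the answer above.
test <- data.frame(
  var1.Yes   = c(0, 0, 1, 0, 1, 1, 1, 0, NA, 1),
  var1.Maybe = c(1, 0, 0, 1, 0, 0, 0, 0, NA, 0),
  var1.No    = c(0, 1, 0, 0, 0, 0, 0, 1, NA, 1),
  var2.Yes   = c(0, 0, 1, NA, 1, 1, 1, 0, 0, 1),
  var2.Maybe = c(0, 1, 1, NA, 0, 0, 0, 0, 0, 0),
  var2.No    = c(1, 0, 0, NA, 0, 0, 0, 1, 1, 0),
  var3.Yes   = c(0, 1, 0, 0, 0, 0, 0, NA, 0, 1),
  var3.Maybe = c(0, 0, 0, 0, 1, 1, 1, NA, 1, 0),
  var3.No    = c(1, 0, 1, 1, 0, 0, 0, NA, 0, 0)
)

# Hypothetical helper: collapse <prefix>.Yes / .Maybe / .No into one code
# (0 = yes, 1 = maybe, 2 = no, NA if none is set or the response is missing).
combine_responses <- function(df, prefix) {
  yes   <- df[[paste0(prefix, ".Yes")]]
  maybe <- df[[paste0(prefix, ".Maybe")]]
  no    <- df[[paste0(prefix, ".No")]]
  ifelse(yes == 1, 0,
         ifelse(maybe == 1, 1,
                ifelse(no == 1, 2, NA)))
}

# Add one combined column per variable: var1, var2, var3.
for (i in 1:3) {
  prefix <- paste0("var", i)
  test[[prefix]] <- combine_responses(test, prefix)
}

test[, c("var1", "var2", "var3")]
```

Like the nested `ifelse` in the question, this gives precedence to Yes when more than one indicator is 1 in the same row.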
Participants in an experiment took a test that has a rule that says "once a participant has gotten 6 items wrong in a window of 8 items, you stop running the test". However, some experimenters kept testing past this point. I now need to find a way in which I can automatically see where the test should have been stopped, and change all values following the end to 0 (= item wrong). I am not even sure if this is something that can be done in R.
To be clear, I would like to go row by row (which are the participants) and once there are six 0s in a given window of 8 columns (items), I would need all values after the sixth 0 to be 0 too.
While the reproducible data is below, here is a visualization of what I would need, where the blue cells are the ones that should change to 0:
Pre-changes
Post-changes
Reproducible data:
structure(list(Participant_ID = c("E01P01", "E01P02", "E01P03",
"E01P04", "E01P05", "E01P06", "E01P07", "E01P08", "E02P01", "E02P02"
), A2 = c(1, 1, 1, 0, 0, 1, 1, 1, 1, 1), A3 = c(1, 1, 0, 0, 0,
1, 0, 0, 0, 0), B1 = c(1, 1, 1, 0, 0, 1, 0, 0, 1, 1), B2 = c(1,
1, 1, 1, 1, 1, 0, 0, 0, 1), C3 = c(1, 0, 0, 1, 0, 1, 0, 0, 0,
1), C4 = c(1, 0, 0, 0, 0, 1, 0, 0, 1, 1), D1 = c(1, 0, 0, 0,
0, 1, 0, 0, 0, 0), D3 = c(1, 1, 1, 1, 0, 0, 1, 0, 0, 1), E1 = c(1,
0, 0, 0, 0, 1, 0, 0, 0, 1), E3 = c(1, 1, 0, 1, 0, 1, 0, 0, 0,
0), F1 = c(1, 0, 0, 0, 1, 0, 0, 1, 0, 0), F4 = c(1, 1, 1, 1,
0, 1, 0, 1, 1, 0), G1 = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 1), G2 = c(0,
0, 0, 0, 1, 1, 1, 0, 1, 1)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
Any help is highly appreciated!
Here is a solution that involves some pivoting, rollsum, cumsum, if_else logic, then pivoting back. Let me know if it works.
library(tidyverse)
library(zoo)
structure(list(Participant_ID = c("E01P01", "E01P02", "E01P03",
"E01P04", "E01P05", "E01P06", "E01P07", "E01P08", "E02P01", "E02P02"
), A2 = c(1, 1, 1, 0, 0, 1, 1, 1, 1, 1), A3 = c(1, 1, 0, 0, 0,
1, 0, 0, 0, 0), B1 = c(1, 1, 1, 0, 0, 1, 0, 0, 1, 1), B2 = c(1,
1, 1, 1, 1, 1, 0, 0, 0, 1), C3 = c(1, 0, 0, 1, 0, 1, 0, 0, 0,
1), C4 = c(1, 0, 0, 0, 0, 1, 0, 0, 1, 1), D1 = c(1, 0, 0, 0,
0, 1, 0, 0, 0, 0), D3 = c(1, 1, 1, 1, 0, 0, 1, 0, 0, 1), E1 = c(1,
0, 0, 0, 0, 1, 0, 0, 0, 1), E3 = c(1, 1, 0, 1, 0, 1, 0, 0, 0,
0), F1 = c(1, 0, 0, 0, 1, 0, 0, 1, 0, 0), F4 = c(1, 1, 1, 1,
0, 1, 0, 1, 1, 0), G1 = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 1), G2 = c(0,
0, 0, 0, 1, 1, 1, 0, 1, 1)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame")) %>%
as_tibble() %>%
pivot_longer(-1) %>%
group_by(Participant_ID) %>%
mutate(running_total = zoo::rollsumr(value==0, k = 8, fill = 0),
should_terminate = cumsum(running_total >= 6),
value = if_else(should_terminate > 0, 0, value)) %>%
ungroup() %>%
select(Participant_ID, name, value) %>%
pivot_wider(names_from = name, values_from = value)
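For comparison, here is a base R sketch of the same stopping rule applied to one participant's scores at a time. `apply_stop_rule` is a hypothetical helper name, and the sketch assumes scores are coded 0/1 with no missing values. One difference from the pipeline above, as far as I can tell: `fill = 0` there means windows shorter than 8 items are never counted, whereas this version also checks the partial windows at the start of the test, so six wrong answers within the first six or seven items already trigger the stop.

```r
# Hypothetical helper: once `max_wrong` zeros occur within any run of up to
# `window` consecutive items, set every later item to 0.
apply_stop_rule <- function(x, window = 8, max_wrong = 6) {
  wrong <- x == 0
  n <- length(x)
  for (i in seq_len(n)) {
    start <- max(1, i - window + 1)      # window ends at item i
    if (sum(wrong[start:i]) >= max_wrong) {
      if (i < n) x[(i + 1):n] <- 0       # everything after the stop point
      break
    }
  }
  x
}

# Two made-up participants (rows) with ten items each.
items <- rbind(
  c(1, 0, 0, 0, 0, 0, 0, 1, 1, 1),  # sixth wrong answer is item 7
  c(1, 1, 0, 1, 1, 0, 1, 1, 0, 1)   # never reaches six wrong in eight
)

# apply() returns one column per row of `items`, hence the transpose.
fixed <- t(apply(items, 1, apply_stop_rule))
fixed
```

For a data frame like the one in the question, the same call could be run on the item columns only (everything except Participant_ID) and the result assigned back.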
I'm doing a differential expression analysis for RNA-seq data with limma-voom. My data is about a cancer drug, 49 samples in total; some of them are responders and some are not. I need some help building the contrast. I'm dealing with only one factor here, so two groups only.
I know it's the simplest type of data, but I'm getting most of the data as differentially expressed (which should not be the case); only 13% is not differentially expressed, and I think the problem has to do with the contrast. This is the design I made, with 1 or 0.
1 for NoResponse means there was no response, and 1 for Response means there was a response.
using dput:
structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), .Dim = c(49L,
2L), .Dimnames = list(c("Pt1", "Pt10", "Pt103", "Pt106", "Pt11",
"Pt17", "Pt2", "Pt24", "Pt26", "Pt27", "Pt28", "Pt29", "Pt31",
"Pt36", "Pt37", "Pt38", "Pt39", "Pt4", "Pt46", "Pt47", "Pt5",
"Pt52", "Pt59", "Pt62", "Pt65", "Pt66", "Pt67", "Pt77", "Pt78",
"Pt79", "Pt8", "Pt82", "Pt84", "Pt85", "Pt89", "Pt9", "Pt90",
"Pt92", "Pt98", "Pt101", "Pt18", "Pt3", "Pt30", "Pt34", "Pt44",
"Pt48", "Pt49", "Pt72", "Pt94"), c("NoResponse", "Response")), assign = c(1L,
1L), contrasts = list(Response = "contr.treatment"))
And here is my code for the analysis itself:
d0 <- DGEList(rawdata)
d0 <- calcNormFactors(d0)
Voom <- voom(d0, design, plot = TRUE)
vfit <- lmFit(Voom, design)
contrast <- makeContrasts(Response - NoResponse,
levels = colnames(coef(vfit)))
vfit <- contrasts.fit(vfit, contrasts = contrast)
efit <- eBayes(vfit)
plotSA(efit, main = 'final model: Mean-Variance trend')
The bioconductor guide didn't help.
Note: The problem is not with the data. The voom plot is very good, I'm just stuck with the contrast which is (I think) making all the mess.
I have a 3185x90 dataset of binary values and want to do a chi-squared test of independence, comparing all column variables against each other.
I've tried different variations of code from Google searches with chisq.test() and some for loops, but none of them have worked so far.
How do I do this?
This is the frame I've tinkered with. My dataset is oak.
chi_trial <- data.frame(a = c(0,1), b = c(0,1))
for(row in 1:nrow(oak)){
print(row)
print(chisq.test(c(oak[row,1],d[row,2])))
}
I also tried this:
apply(d, 1, chisq.test)
which gives me the error: Error in FUN(newX[, i], ...) :
all entries of 'x' must be nonnegative and finite
dput(oak[1:2],)
structure(list(post_flu = structure(c(1, 1, 1, 1, 1, 0, 0, 0,
0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
label = "Receipt of Flu Vaccine - Encounter Survey", format.stata = "%10.0g")), row.names = c(NA,
-3185L), class = c("tbl_df", "tbl", "data.frame"), label = "Main Oakland Clinic Analysis Dataset")
I added a sample of my data with the final lines of the output. The portion of the dataset is small, but it all looks like this.
You could use something like the code below, which is similar to R's cor function. I don't have your data, so I'm simulating some. Note that I get one significant p-value, using the traditional cut-off of 0.05.
set.seed(3)
nr=3185; nc=3
oak <- as.data.frame(matrix(sample(0:1, size=nr*nc, replace=TRUE), ncol=nc))
head(oak)
mult.chi <- function(data){
nc <- ncol(data)
res <- matrix(0, nrow=nc, ncol=nc) # or NA
for(i in 1:(nc-1))
for(j in (i+1):nc)
res[i,j] <- suppressWarnings(chisq.test(data[,i], data[,j])$p.value)
rownames(res) <- colnames(data)
colnames(res) <- colnames(data)
res
}
mult.chi(oak)
# V1 V2 V3
# V1 0 0.7847063 0.32012466
# V2 0 0.0000000 0.01410326
# V3 0 0.0000000 0.00000000
So consider applying a multiple testing adjustment as mentioned in the comments.
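As a rough illustration of such an adjustment, the sketch below collects the pairwise p-values into a vector and passes them to `p.adjust()`. It uses simulated data again, and the choice of the Holm method is only an assumption for the example; pick whichever correction suits your analysis.

```r
set.seed(3)
nr <- 3185; nc <- 3
oak <- as.data.frame(matrix(sample(0:1, size = nr * nc, replace = TRUE), ncol = nc))

# All pairs of column indices, one pair per column of `pairs`.
pairs <- combn(ncol(oak), 2)

# Raw p-value of the chi-squared test for each pair of columns.
p_raw <- apply(pairs, 2, function(k) {
  suppressWarnings(chisq.test(oak[[k[1]]], oak[[k[2]]])$p.value)
})

# Adjust for the number of tests performed (Holm chosen only as an example).
p_adj <- p.adjust(p_raw, method = "holm")

data.frame(var1 = names(oak)[pairs[1, ]],
           var2 = names(oak)[pairs[2, ]],
           p_raw = p_raw,
           p_adj = p_adj)
```

With 90 columns there are choose(90, 2) = 4005 pairs, so some raw p-values below 0.05 are expected by chance alone.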
Here is a solution with combn to get all pairwise combinations of column numbers. Tested with the data in @Edward's answer.
chisq2cols <- function(X){
y <- matrix(0, ncol(X), ncol(X))
cmb <- combn(ncol(X), 2)
y[upper.tri(y)] <- apply(cmb, 2, function(k){
tbl <- table(X[k])
chisq.test(tbl)$p.value
})
y
}
chisq2cols(oak)
# [,1] [,2] [,3]
#[1,] 0 0.7847063 0.32012466
#[2,] 0 0.0000000 0.01410326
#[3,] 0 0.0000000 0.00000000
I am making a heat map, but I would like to separate the columns and add a line between each row. I am well aware that doing so makes this, well, not a heat map. But this is how my boss envisions it.
Below is my code for the current heat map. Any advice on separating the columns & adding a line between each "person" would be much appreciated.
x11 <- c(0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1)
x22 <- c(1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1)
x <- rbind(x11, x22)
hv <- heatmap(t(x), col = c("cornflowerblue", "hotpink"), margins = c(4, 12), Colv = NA, Rowv = NA, scale = "none", xlab ="", ylab ="", main = "", labCol=c("BP", "Cx"), cexCol =2)
legend("topright", c("No Osteomyelitis", "Osteomyelitis"), col=c("cornflowerblue", "hotpink"), bty="n", fill=c("cornflowerblue", "hotpink"))
Yes I ended up using the code below. Thank you for the answer. I used gplots & got rid of the color key & histogram & added my own legend instead.
hv <- heatmap.2(t(x), key=FALSE, trace="none", colsep = seq(1,nrow(x)-1),
rowsep = seq(1,ncol(x)-1),
sepcolor = "white",
sepwidth = c(0.1, 0.0005), col = c("cornflowerblue", "hotpink"), margins = c(4, 12), Colv = NA, Rowv = NA, scale = "none", xlab ="", ylab ="", main = "", labCol=c("BP", "Cx"), cexCol =2)
legend("topleft", c("No Osteomyelitis", "Osteomyelitis"), col=c("cornflowerblue", "hotpink"), bty="n", fill=c("cornflowerblue", "hotpink"))
I suggest you swap to gplots::heatmap.2() which allows greater control over plotting with mostly the same arguments.
Building on your good example (+1 btw) by adding the colsep, rowsep, sepcolor and sepwidth arguments to control the separation between the rows and columns (and trace = 'none' because I don't like it) gives:
x11 <- c(0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1)
x22 <- c(1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1)
x <- rbind(x11, x22)
hv <- heatmap(t(x), col = c("cornflowerblue", "hotpink"), margins = c(4, 12), Colv = NA, Rowv = NA, scale = "none", xlab ="", ylab ="", main = "", labCol=c("BP", "Cx"), cexCol =2)
legend("topright", c("No Osteomyelitis", "Osteomyelitis"), col=c("cornflowerblue", "hotpink"), bty="n", fill=c("cornflowerblue", "hotpink"))
library(gplots)
heatmap.2(t(x),
col = c("cornflowerblue", "hotpink"),
margins = c(4, 12),
Colv = NA, Rowv = NA,
scale = "none",
xlab ="",
ylab ="",
main = "",
labCol=c("BP", "Cx"),
cexCol = 2,
trace = 'none',
colsep = seq(1,nrow(x)-1),
rowsep = seq(1,ncol(x)-1),
sepcolor = "white",
sepwidth = c(0.1, 0.05))
To separate the columns more, increase the first element of sepwidth; to separate the rows more, increase the second.