How can I run t-tests for all possible pairs of groups, separately for each feature, with a minimal number of lines of code?
My example:
3 features: 1, 2, 3
4 groups: A, B, C, D
Aim: For each feature test all pairs of groups:
1(A-B,A-C,A-D,B-C,B-D,C-D)
2(A-B,A-C,A-D,B-C,B-D,C-D)
3(A-B,A-C,A-D,B-C,B-D,C-D)
= 18 T-tests
At the moment I am using ddply with an lapply inside it:
library(plyr)
groupVector <- c(rep("A",10),rep("B",10),rep("C",10),rep("D",10))
featureVector <- rep(1:3,each=40)
mydata <- data.frame(feature=featureVector,group=groupVector,value=rnorm(120,0,1))
ddply(mydata,.(feature),function(x){
grid <- combn(unique(x$group),2, simplify = FALSE)
df <- lapply(grid,function(p){
sub <- subset(x,group %in% p)
pval <- t.test(sub$value ~ sub$group)$p.value
data.frame(groupA=p[1],groupB=p[2],pval=pval)
})
res <- do.call("rbind",df)
return(res)
})
Here's my take, although it's arguable whether it's 'better':
split.data <- split(mydata, mydata$feature)
pairs <- as.data.frame(matrix(combn(unique(mydata$group), 2), nrow=2))
library(tidyverse)
map_df(split.data, function(x) map_df(pairs, function(y) tibble(groupA = y[1], groupB = y[2],
pval = t.test(value ~ group, data = x, subset = which(x$group %in% y))$p.value)), .id="feature")
Output
# # A tibble: 18 x 4
# feature groupA groupB pval
# <chr> <chr> <chr> <dbl>
# 1 1 A B 0.28452419
# 2 1 A C 0.65114472
# 3 1 A D 0.77746420
# 4 1 B C 0.42546791
# 5 1 B D 0.39876582
# 6 1 C D 0.88079645
# 7 2 A B 0.57843592
# 8 2 A C 0.30726571
# 9 2 A D 0.55457986
# 10 2 B C 0.74871464
# 11 2 B D 0.24017130
# 12 2 C D 0.04252878
# 13 3 A B 0.01355117
# 14 3 A C 0.08746756
# 15 3 A D 0.24527519
# 16 3 B C 0.15130684
# 17 3 B D 0.09172577
# 18 3 C D 0.64206517
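If base R is acceptable, stats::pairwise.t.test already loops over all group pairs, so only the split by feature is left to write. Below is a minimal sketch, assuming data laid out like mydata above (rebuilt here so the block runs on its own). pool.sd = FALSE and p.adjust.method = "none" are set so that the p-values match the separate Welch t.test calls; by default the function pools the standard deviation and applies a Holm correction, which may well be what you want with 18 tests.

```r
set.seed(1)
# Same layout as the question: 3 features x 4 groups x 10 values
mydata <- data.frame(
  feature = rep(1:3, each = 40),
  group   = rep(rep(c("A", "B", "C", "D"), each = 10), times = 3),
  value   = rnorm(120)
)

# One lower-triangular matrix of pairwise p-values per feature
pvals <- lapply(split(mydata, mydata$feature), function(x) {
  pairwise.t.test(x$value, x$group,
                  pool.sd = FALSE, p.adjust.method = "none")$p.value
})

pvals[["1"]]  # rows B, C, D against columns A, B, C
```

The result is a list of matrices rather than one long data frame, so it is less convenient if you need the tidy groupA/groupB/pval layout.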
I have a dataset with names spread across several columns. Each column represents the time at which the names were entered into the system. Is it possible to find how many times each name appears and the most recent column in which it appears? I added a picture to show how the dataset works.
Here's one method:
library(dplyr)
set.seed(42)
dat <- setNames(as.data.frame(replicate(4, sample(letters, size = 10, replace = TRUE))), 1:4)
dat
# 1 2 3 4
# 1 q x c c
# 2 e g i z
# 3 a d y a
# 4 y y d j
# 5 j e e x
# 6 d n m k
# 7 r t e o
# 8 z z t v
# 9 q r b z
# 10 o o h h
tidyverse
library(dplyr)
library(tidyr)
pivot_longer(dat, everything(), names_to = "colname", values_to = "word") %>%
mutate(colname = as.integer(colname)) %>%
group_by(word) %>%
summarize(n = n(), latest = max(colname), .groups = "drop")
# # A tibble: 20 x 3
# word n latest
# <chr> <int> <int>
# 1 a 2 4
# 2 b 1 3
# 3 c 2 4
# 4 d 3 3
# 5 e 4 3
# 6 g 1 2
# 7 h 2 4
# 8 i 1 3
# 9 j 2 4
# 10 k 1 4
# 11 m 1 3
# 12 n 1 2
# 13 o 3 4
# 14 q 2 1
# 15 r 2 2
# 16 t 2 3
# 17 v 1 4
# 18 x 2 4
# 19 y 3 3
# 20 z 4 4
data.table
library(data.table)
melt(as.data.table(dat), integer(0), variable.name = "colname", value.name = "word")[
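# melt() returns colname as a factor; go through character so we get the column name, not the level code
, colname := as.integer(as.character(colname))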
][, .(n = .N, latest = max(colname)), by = .(word) ]
(Though the result is not sorted by word, the values are the same.)
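For completeness, the same count and latest-column summary can be done without any packages. This is only a sketch on a small made-up data frame (the column names "1", "2", "3" and the words are illustrative, not the data above): stack the columns into a long word/colname table, then use tapply per word.

```r
# Small illustrative data: columns are named by entry time
dat <- data.frame(`1` = c("a", "b", "c"),
                  `2` = c("b", "d", "a"),
                  `3` = c("a", "e", "b"),
                  check.names = FALSE, stringsAsFactors = FALSE)

# Long format: one row per cell, remembering which column it came from
long <- data.frame(
  word    = unlist(dat, use.names = FALSE),
  colname = rep(as.integer(names(dat)), each = nrow(dat))
)

n_by_word      <- tapply(long$colname, long$word, length)  # occurrences
latest_by_word <- tapply(long$colname, long$word, max)     # most recent column

res <- data.frame(word   = names(n_by_word),
                  n      = as.vector(n_by_word),
                  latest = as.vector(latest_by_word))
res
```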
I have a large data frame and want to create a new variable which depends on two other variables.
Here is a short example:
v1 <- rep(c(1:5),each=3)
v2 <- c('X','A','Y','X','Y','B','X','Y','C','X','Y','C','X','Y','A')
dat <- data.frame(v1,v2)
#create a new var which contains either A,B, or C depending on what is found in v2
#desired output
v3 <- rep(c('A','B','C','C','A'),each=3)
data.frame(v1,v2,v3)
Any ideas on how to do this with a short code?
I tried this, but it's far from the solution. Too many missings. :(
dat$v3[dat$v2 %in% c('A','B','C')] <- dat$v2[dat$v2 %in% c('A','B','C')]
library(tidyverse)
dat %>% group_by(v1) %>% mutate(v3 = intersect(v2, c("A", "B", "C")))
# A tibble: 15 x 3
# Groups: v1 [5]
# v1 v2 v3
# <int> <fct> <chr>
# 1 1 X A
# 2 1 A A
# 3 1 Y A
# 4 2 X B
# 5 2 Y B
# 6 2 B B
# 7 3 X C
# 8 3 Y C
# 9 3 C C
# 10 4 X C
# 11 4 Y C
# 12 4 C C
# 13 5 X A
# 14 5 Y A
# 15 5 A A
This is assuming that only one of A, B, C can appear in a group given by v1.
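If that assumption might not hold, a variant that takes the first match per group and returns NA when there is none can be sketched in base R with ave. The sixth group below is made up to show the no-match case, and keys is just an illustrative name.

```r
dat <- data.frame(
  v1 = rep(1:6, each = 3),
  v2 = c("X", "A", "Y",  "X", "Y", "B",  "X", "Y", "C",
         "X", "Y", "C",  "X", "Y", "A",  "X", "Y", "Z"),
  stringsAsFactors = FALSE
)
keys <- c("A", "B", "C")

# Within each v1 group, take the first value of v2 that is one of the keys;
# indexing an empty vector with [1] gives NA, which covers groups with no key
dat$v3 <- ave(dat$v2, dat$v1, FUN = function(g) g[g %in% keys][1])
dat
```

If a group contains several keys this silently keeps the first one, so check for that separately if it matters.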
I have the mydf data frame below. I want to split any cell that contains comma-separated data and put the pieces into rows. I am looking for a data frame similar to y below. How could I do it efficiently in a few steps? Currently I am using the cSplit function on one column at a time.
I tried cSplit(mydf, c("name","new"), ",", direction = "long"), but that didn't work.
library(splitstackshape)
mydf=data.frame(name = c("AB,BW","x,y,z"), AB = c('A','B'), new=c("1,2,3","4,5,6,7"))
mydf
x=cSplit(mydf, c("name"), ",", direction = "long")
x
y=cSplit(x, c("new"), ",", direction = "long")
y
There are times when a for loop is totally fine to work with in R. This is one of those times. Try:
library(splitstackshape)
cols <- c("name", "new")
for (i in cols) {
mydf <- cSplit(mydf, i, ",", "long")
}
mydf
## name AB new
## 1: AB A 1
## 2: AB A 2
## 3: AB A 3
## 4: BW A 1
## 5: BW A 2
## 6: BW A 3
## 7: x B 4
## 8: x B 5
## 9: x B 6
## 10: x B 7
## 11: y B 4
## 12: y B 5
## 13: y B 6
## 14: y B 7
## 15: z B 4
## 16: z B 5
## 17: z B 6
## 18: z B 7
Here's a small test using slightly bigger data:
# concat.test = sample data from "splitstackshape"
test <- do.call(rbind, replicate(5000, concat.test, FALSE))
fun1 <- function() {
cols <- c("Likes", "Siblings")
for (i in cols) {
test <- cSplit(test, i, ",", "long")
}
test
}
fun2 <- function() {
test %>%
separate_rows("Likes") %>%
separate_rows("Siblings")
}
system.time(fun1())
# user system elapsed
# 3.205 0.056 3.261
system.time(fun2())
# user system elapsed
# 11.598 0.066 11.662
We can use the separate_rows function from the tidyr package.
library(tidyr)
mydf2 <- mydf %>%
separate_rows("name") %>%
separate_rows("new")
mydf2
# AB name new
# 1 A AB 1
# 2 A AB 2
# 3 A AB 3
# 4 A BW 1
# 5 A BW 2
# 6 A BW 3
# 7 B x 4
# 8 B x 5
# 9 B x 6
# 10 B x 7
# 11 B y 4
# 12 B y 5
# 13 B y 6
# 14 B y 7
# 15 B z 4
# 16 B z 5
# 17 B z 6
# 18 B z 7
If you don't want to call the separate_rows function more than once, we can write a function that applies separate_rows iteratively.
expand_fun <- function(df, vars){
while (length(vars) > 0){
df <- df %>% separate_rows(vars[1])
vars <- vars[-1]
}
return(df)
}
expand_fun takes two arguments. The first argument, df, is the original data frame. The second argument, vars, is a character vector with the names of the columns we want to expand. Here is an example using the function.
mydf3 <- expand_fun(mydf, vars = c("name", "new"))
mydf3
# AB name new
# 1 A AB 1
# 2 A AB 2
# 3 A AB 3
# 4 A BW 1
# 5 A BW 2
# 6 A BW 3
# 7 B x 4
# 8 B x 5
# 9 B x 6
# 10 B x 7
# 11 B y 4
# 12 B y 5
# 13 B y 6
# 14 B y 7
# 15 B z 4
# 16 B z 5
# 17 B z 6
# 18 B z 7
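The same "apply one split per column, in order" idea can also be written as a fold with Reduce. The sketch below is base R only; split_rows is a hypothetical helper written for this example (it is not part of tidyr or splitstackshape), and it assumes a plain "," separator with no surrounding spaces.

```r
# Hypothetical helper: expand one delimited column into rows
split_rows <- function(df, col, sep = ",") {
  parts <- strsplit(as.character(df[[col]]), sep, fixed = TRUE)
  # Repeat each row once per piece, then overwrite the column with the pieces
  out <- df[rep(seq_len(nrow(df)), lengths(parts)), , drop = FALSE]
  out[[col]] <- unlist(parts)
  rownames(out) <- NULL
  out
}

mydf <- data.frame(name = c("AB,BW", "x,y,z"),
                   AB   = c("A", "B"),
                   new  = c("1,2,3", "4,5,6,7"),
                   stringsAsFactors = FALSE)

# Fold over the column names, starting from mydf
res <- Reduce(split_rows, c("name", "new"), mydf)
nrow(res)
```

Unlike the outputs above, the column order is left unchanged and the split values stay character; rows whose cell is an empty string are dropped.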
I have a set of data with duplicates:
x <- tibble(num=c(1,2,3,2,5,5,8), alph=NA)
And separate sources giving their corresponding values.
y <- tibble(num=1:4, alph=LETTERS[1:4])
z <- tibble(num=5:10, alph=LETTERS[5:10])
Normally, one would use this code to update x$alph with data from y.
x$alph <- y$alph[match(x$num,y$num)]
Doing the same for z would nonetheless overwrite what was already in place from y and replace them with NAs.
How can I code so that data can be cumulatively updated? Using:
x$alph[which(x$num %in% z$num)] <- z$alph[which(z$num %in% x$num)]
doesn't work because of the duplicate.
Here are three options using tidyverse. x2, x4, and x5 are the final outputs.
We can create a combined data frame from y and z, and then perform a join with x.
# Load packages
library(tidyverse)
# Create example data frames
x <- tibble(num=c(1,2,3,2,5,5,8), alph=NA)
y <- tibble(num=1:4, alph=LETTERS[1:4])
z <- tibble(num=5:10, alph=LETTERS[5:10])
# Create combined table from y and z
yz <- bind_rows(y, z)
# Perform join
x2 <- x %>%
select(-alph) %>%
left_join(yz, by = "num")
x2
# # A tibble: 7 x 2
# num alph
# <dbl> <chr>
# 1 1 A
# 2 2 B
# 3 3 C
# 4 2 B
# 5 5 E
# 6 5 E
# 7 8 H
Or use reduce to merge all data frames, then select the one that is not NA to construct a new data frame.
x3 <- reduce(list(x, y, z), left_join, by = "num")
x4 <- tibble(num = x3$num,
alph = apply(x3[, -1], 1, function(x) x[!is.na(x)]))
x4
# # A tibble: 7 x 2
# num alph
# <dbl> <chr>
# 1 1 A
# 2 2 B
# 3 3 C
# 4 2 B
# 5 5 E
# 6 5 E
# 7 8 H
Or, after the reduce and join, use gather to remove the NA values.
x3 <- reduce(list(x, y, z), left_join, by = "num")
x5 <- x3 %>%
gather(Type, alph, -num, na.rm = TRUE) %>%
select(-Type)
x5
# # A tibble: 7 x 2
# num alph
# <dbl> <chr>
# 1 1 A
# 2 2 B
# 3 3 C
# 4 2 B
# 5 5 E
# 6 5 E
# 7 8 H
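If you would rather stay close to the match() idiom from the question, the cumulative update can be sketched in base R by only filling the positions that are still NA. This uses plain data frames instead of tibbles and assumes each num appears at most once within a source table.

```r
x <- data.frame(num = c(1, 2, 3, 2, 5, 5, 8), alph = NA)
y <- data.frame(num = 1:4,  alph = LETTERS[1:4],  stringsAsFactors = FALSE)
z <- data.frame(num = 5:10, alph = LETTERS[5:10], stringsAsFactors = FALSE)

for (src in list(y, z)) {
  idx <- match(x$num, src$num)          # position of each x$num in the source
  hit <- !is.na(idx) & is.na(x$alph)    # found in source and not yet filled
  x$alph[hit] <- src$alph[idx[hit]]
}
x
```

Because earlier sources are never overwritten, the order of list(y, z) decides which table wins when a num appears in both.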
There are many answers on how to split a data frame, for example "How to split a data frame?".
However, I'd like to split a dataframe so that the smaller dataframes contain the last row of the previous dataframe and the first row of the following dataframe.
Here's an example
n <- 1:9
group <- rep(c("a","b","c"), each = 3)
data.frame(n = n, group)
n group
1 1 a
2 2 a
3 3 a
4 4 b
5 5 b
6 6 b
7 7 c
8 8 c
9 9 c
I'd like the output to look like:
d1 <- data.frame(n = 1:4, group = c(rep("a",3),"b"))
d2 <- data.frame(n = 3:7, group = c("a",rep("b",3),"c"))
d3 <- data.frame(n = 6:9, group = c("b",rep("c",3)))
d <- list(d1, d2, d3)
d
[[1]]
n group
1 1 a
2 2 a
3 3 a
4 4 b
[[2]]
n group
1 3 a
2 4 b
3 5 b
4 6 b
5 7 c
[[3]]
n group
1 6 b
2 7 c
3 8 c
4 9 c
What is an efficient way to accomplish this task?
Suppose DF is the original data.frame, the one with columns n and group. Let n be the number of rows in DF. Now define a function extract which given a sequence of indexes ix enlarges it to include the one prior to the first and after the last and then returns those rows of DF. Now that we have defined extract, split the vector 1, ..., n by group and apply extract to each component of the split.
n <- nrow(DF)
extract <- function(ix) DF[seq(max(1, min(ix) - 1), min(n, max(ix) + 1)), ]
lapply(split(seq_len(n), DF$group), extract)
$a
n group
1 1 a
2 2 a
3 3 a
4 4 b
$b
n group
3 3 a
4 4 b
5 5 b
6 6 b
7 7 c
$c
n group
6 6 b
7 7 c
8 8 c
9 9 c
Or why not try good ol' by, which "[a]ppl[ies] a Function to a Data Frame Split by Factors [INDICES]".
by(data = df, INDICES = df$group, function(x){
id <- c(min(x$n) - 1, x$n, max(x$n) + 1)
na.omit(df[id, ])
})
# df$group: a
# n group
# 1 1 a
# 2 2 a
# 3 3 a
# 4 4 b
# --------------------------------------------------------------------------------
# df$group: b
# n group
# 3 3 a
# 4 4 b
# 5 5 b
# 6 6 b
# 7 7 c
# --------------------------------------------------------------------------------
# df$group: c
# n group
# 6 6 b
# 7 7 c
# 8 8 c
# 9 9 c
Although the print method of by creates a 'fancy' output, the (default) result is a list, with elements named by the levels of the grouping variable (just try str and names on the resulting object).
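To make that concrete, here is a small self-contained sketch (df is rebuilt from the question's example) that stores the by result and inspects it like a list:

```r
df <- data.frame(n = 1:9, group = rep(c("a", "b", "c"), each = 3))

res <- by(data = df, INDICES = df$group, function(x) {
  # Row before the group, the group itself, row after the group;
  # index 0 is silently dropped and the out-of-range row is removed by na.omit
  id <- c(min(x$n) - 1, x$n, max(x$n) + 1)
  na.omit(df[id, ])
})

class(res)   # the "by" class only affects printing
names(res)   # one element per level of group
res[[2]]     # plain data frame for group "b", including its two neighbours
```

Note that this indexes rows by the values of n, so it relies on n matching the row numbers as it does in the example.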
I was going to comment under @cdeterman's answer, but it's too late now.
You can generalize his approach using data.table::shift (or dplyr::lag) to find the indices where the group changes and then run a simple lapply over the ranges, something like
library(data.table) # v1.9.6+
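# shift() pads the first position with NA by default, and which() ignores NA,
# so only the rows where the group actually changes are returned
indx <- setDT(df)[, which(group != shift(group))]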
lapply(Map(`:`, c(1L, indx - 1L), c(indx, nrow(df))), function(x) df[x,])
# [[1]]
# n group
# 1: 1 a
# 2: 2 a
# 3: 3 a
# 4: 4 b
#
# [[2]]
# n group
# 1: 3 a
# 2: 4 b
# 3: 5 b
# 4: 6 b
# 5: 7 c
#
# [[3]]
# n group
# 1: 6 b
# 2: 7 c
# 3: 8 c
# 4: 9 c
Could be done with data.frame as well, but is there ever a reason not to use data.table? Also this has the option to be executed with parallelism.
library(data.table)
n <- 1:9
group <- rep(c("a","b","c"), each = 3)
df <- data.table(n = n, group)
df[, `:=` (group = factor(df$group))]
df[, `:=` (group_i = seq_len(.N), group_N = .N), by = "group"]
library(doParallel)
groups <- unique(df$group)
foreach(i = seq(groups)) %do% {
df[group == groups[i] | (as.integer(group) == i + 1 & group_i == 1) | (as.integer(group) == i - 1 & group_i == group_N), c("n", "group"), with = FALSE]
}
[[1]]
n group
1: 1 a
2: 2 a
3: 3 a
4: 4 b
[[2]]
n group
1: 3 a
2: 4 b
3: 5 b
4: 6 b
5: 7 c
[[3]]
n group
1: 6 b
2: 7 c
3: 8 c
4: 9 c
Here is another dplyr way:
library(dplyr)
data =
tibble(n = n, group = group) %>%  # data_frame() is deprecated in favour of tibble()
group_by(group)
firsts =
data %>%
slice(1) %>%
ungroup %>%
mutate(new_group = lag(group)) %>%
slice(-1)
lasts =
data %>%
slice(n()) %>%
ungroup %>%
mutate(new_group = lead(group)) %>%
slice(-n())
bind_rows(firsts, data, lasts) %>%
mutate(final_group =
ifelse(is.na(new_group),
group,
new_group) ) %>%
arrange(final_group, n) %>%
group_by(final_group)