I have a large data frame and want to create a new variable which depends on two other variables.
Here is a short example:
v1 <- rep(c(1:5),each=3)
v2 <- c('X','A','Y','X','Y','B','X','Y','C','X','Y','C','X','Y','A')
dat <- data.frame(v1,v2)
#create a new var which contains either A,B, or C depending on what is found in v2
#desired output
v3 <- rep(c('A','B','C','C','A'),each=3)
data.frame(v1,v2,v3)
Any ideas on how to do this with short code?
I tried this, but it's far from the solution. Too many missings. :(
dat$v3[dat$v2 %in% c('A','B','C')] <- dat$v2[dat$v2 %in% c('A','B','C')]
library(tidyverse)
dat %>% group_by(v1) %>% mutate(v3 = intersect(v2, c("A", "B", "C")))
# A tibble: 15 x 3
# Groups: v1 [5]
# v1 v2 v3
# <int> <fct> <chr>
# 1 1 X A
# 2 1 A A
# 3 1 Y A
# 4 2 X B
# 5 2 Y B
# 6 2 B B
# 7 3 X C
# 8 3 Y C
# 9 3 C C
# 10 4 X C
# 11 4 Y C
# 12 4 C C
# 13 5 X A
# 14 5 Y A
# 15 5 A A
This is assuming that only one of A, B, C can appear in a group given by v1.
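For comparison, here is a base R sketch of the same idea, under the same assumption that each v1 group contains at most one of A, B, C. If a group held several, match() would silently pick the first one, and a group with none would get NA. The name key is introduced here purely for illustration.

```r
v1 <- rep(1:5, each = 3)
v2 <- c("X", "A", "Y", "X", "Y", "B", "X", "Y", "C", "X", "Y", "C", "X", "Y", "A")
dat <- data.frame(v1, v2, stringsAsFactors = FALSE)

# Rows whose v2 is one of the labels of interest (one per v1 group here)
key <- dat[dat$v2 %in% c("A", "B", "C"), ]

# For every row, look up the label that was found in the same v1 group
dat$v3 <- key$v2[match(dat$v1, key$v1)]
dat
```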
I know there are several ways to create a column based on another column; however, I would like to know how to do it while creating a data frame.
For example this works but is not the way I want to use it.
v1 = rnorm(10)
sample_df <- data.frame(v1 = v1,
cs = cumsum(v1))
This does not work:
sample_df2 <- data.frame(v2 = rnorm(10),
cs = cumsum(v2))
Is there a way to do it directly in the data.frame function? Thanks in advance.
It cannot be done using data.frame, but package tibble implements a data.frame analogue with the functionality that you want.
library("tibble")
tib <- tibble(x = 1:6, y = cumsum(x))
tib
# # A tibble: 6 × 2
# x y
# <int> <int>
# 1 1 1
# 2 2 3
# 3 3 6
# 4 4 10
# 5 5 15
# 6 6 21
In most cases, the resulting object (called a "tibble") can be treated as if it were a data frame, but if you truly need a data frame, then you can do this:
dat <- as.data.frame(tib)
dat
# x y
# 1 1 1
# 2 2 3
# 3 3 6
# 4 4 10
# 5 5 15
# 6 6 21
You can wrap everything in a function if you like:
f <- function(...) as.data.frame(tibble(...))
f(x = 1:6, y = cumsum(x))
# x y
# 1 1 1
# 2 2 3
# 3 3 6
# 4 4 10
# 5 5 15
# 6 6 21
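If adding tibble as a dependency is not an option, one base R workaround is to build the frame first and add the derived column with transform(), which evaluates its arguments inside the data frame. This is only a sketch: data.frame() itself still cannot do it in one call. set.seed() is there just to make the example reproducible.

```r
set.seed(1)  # only for reproducibility of the example

# Create the frame with v2, then let transform() see v2 when computing cs
sample_df2 <- transform(data.frame(v2 = rnorm(10)), cs = cumsum(v2))
sample_df2
```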
Consider the following named vector vec and tibble df:
vec <- c("1" = "a", "2" = "b", "3" = "c")
df <- tibble(col = rep(1:3, c(4, 2, 5)))
df
# # A tibble: 11 x 1
# col
# <int>
# 1 1
# 2 1
# 3 1
# 4 1
# 5 2
# 6 2
# 7 3
# 8 3
# 9 3
# 10 3
# 11 3
I would like to replace the values in the col column with the corresponding named values in vec.
I'm looking for a tidyverse approach, that doesn't involve converting vec as a tibble.
I tried the following, without success:
df %>%
mutate(col = map(
vec,
~ str_replace(col, names(.x), .x)
))
Expected output:
# A tibble: 11 x 1
col
<chr>
1 a
2 a
3 a
4 a
5 b
6 b
7 c
8 c
9 c
10 c
11 c
You could index vec by the values of col, converted to character so that they match the names of vec:
df$col1 <- vec[as.character(df$col)]
Or in mutate :
library(dplyr)
df %>% mutate(col1 = vec[as.character(col)])
# col col1
# <int> <chr>
# 1 1 a
# 2 1 a
# 3 1 a
# 4 1 a
# 5 2 b
# 6 2 b
# 7 3 c
# 8 3 c
# 9 3 c
#10 3 c
#11 3 c
We can also use data.table
library(data.table)
setDT(df)[, col1 := vec[as.character(col)]]
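The snippets above add a new column col1. If the goal is to overwrite col itself, as in the expected output, the same lookup can be assigned back to col. A minimal base R sketch, using a plain data frame so that it runs without extra packages; unname() is used only to drop the names carried over from vec:

```r
vec <- c("1" = "a", "2" = "b", "3" = "c")
df <- data.frame(col = rep(1:3, c(4, 2, 5)))

# Look up each value by name, then drop the names of the result
df$col <- unname(vec[as.character(df$col)])
df
```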
I have a set of data with duplicates:
x <- tibble(num=c(1,2,3,2,5,5,8), alph=NA)
And separate sources giving their corresponding values.
y <- tibble(num=1:4, alph=LETTERS[1:4])
z <- tibble(num=5:10, alph=LETTERS[5:10])
Normally, one would use this code to update x$alph with data from y.
x$alph <- y$alph[match(x$num,y$num)]
Doing the same for z would nonetheless overwrite what was already in place from y and replace them with NAs.
How can I code so that data can be cumulatively updated? Using:
x$alph[which(x$num %in% z$num)] <- y$alph[which(z$num %in% x$num)]
doesn't work because of the duplicate.
Here are three options using tidyverse. x2, x4, and x5 are the final outputs.
We can create a combined data frame from y and z, and then perform a join with x.
# Load packages
library(tidyverse)
# Create example data frames
x <- tibble(num=c(1,2,3,2,5,5,8), alph=NA)
y <- tibble(num=1:4, alph=LETTERS[1:4])
z <- tibble(num=5:10, alph=LETTERS[5:10])
# Create combined table from y and z
yz <- bind_rows(y, z)
# Perform join
x2 <- x %>%
select(-alph) %>%
left_join(yz, by = "num")
x2
# # A tibble: 7 x 2
# num alph
# <dbl> <chr>
# 1 1 A
# 2 2 B
# 3 3 C
# 4 2 B
# 5 5 E
# 6 5 E
# 7 8 H
Or use reduce to merge all data frames, then select the one that is not NA to construct a new data frame.
x3 <- reduce(list(x, y, z), left_join, by = "num")
x4 <- tibble(num = x3$num,
alph = apply(x3[, -1], 1, function(x) x[!is.na(x)]))
x4
# # A tibble: 7 x 2
# num alph
# <dbl> <chr>
# 1 1 A
# 2 2 B
# 3 3 C
# 4 2 B
# 5 5 E
# 6 5 E
# 7 8 H
Or, after the reduce and join, use gather to remove NA values.
x3 <- reduce(list(x, y, z), left_join, by = "num")
x5 <- x3 %>%
gather(Type, alph, -num, na.rm = TRUE) %>%
select(-Type)
x5
# # A tibble: 7 x 2
# num alph
# <dbl> <chr>
# 1 1 A
# 2 2 B
# 3 3 C
# 4 2 B
# 5 5 E
# 6 5 E
# 7 8 H
How can I perform a multifactorial t-test for all possible pairs of groups with a minimal number of lines of code?
My example:
3x features : 1,2,3
4x groups: : A,B,C,D
Aim: For each feature test all pairs of groups:
1(A-B,A-C,A-D,B-C,B-D,C-D)
2(A-B,A-C,A-D,B-C,B-D,C-D)
3(A-B,A-C,A-D,B-C,B-D,C-D)
= 18 T-tests
At the moment I am using ddply with lapply inside it:
library(plyr)
groupVector <- c(rep("A",10),rep("B",10),rep("C",10),rep("D",10))
featureVector <- rep(1:3,each=40)
mydata <- data.frame(feature=featureVector,group=groupVector,value=rnorm(120,0,1))
ddply(mydata,.(feature),function(x){
grid <- combn(unique(x$group),2, simplify = FALSE)
df <- lapply(grid,function(p){
sub <- subset(x,group %in% p)
pval <- t.test(sub$value ~ sub$group)$p.value
data.frame(groupA=p[1],groupB=p[2],pval=pval)
})
res <- do.call("rbind",df)
return(res)
})
Here's my take, although it's arguable whether it's 'better'
split.data <- split(mydata, mydata$feature)
pairs <- as.data.frame(matrix(combn(unique(mydata$group), 2), nrow=2))
library(tidyverse)
map_df(split.data, function(x) map_df(pairs, function(y) tibble(groupA = y[1], groupB = y[2],
pval = t.test(value ~ group, data = x, subset = which(x$group %in% y))$p.value)), .id="feature")
Output
# # A tibble: 18 x 4
# feature groupA groupB pval
# <chr> <chr> <chr> <dbl>
# 1 1 A B 0.28452419
# 2 1 A C 0.65114472
# 3 1 A D 0.77746420
# 4 1 B C 0.42546791
# 5 1 B D 0.39876582
# 6 1 C D 0.88079645
# 7 2 A B 0.57843592
# 8 2 A C 0.30726571
# 9 2 A D 0.55457986
# 10 2 B C 0.74871464
# 11 2 B D 0.24017130
# 12 2 C D 0.04252878
# 13 3 A B 0.01355117
# 14 3 A C 0.08746756
# 15 3 A D 0.24527519
# 16 3 B C 0.15130684
# 17 3 B D 0.09172577
# 18 3 C D 0.64206517
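Another option worth considering is pairwise.t.test() from the stats package, which runs all pairwise comparisons in one call. The sketch below assumes that unadjusted Welch tests are wanted, matching the plain t.test() calls above, hence p.adjust.method = "none" and pool.sd = FALSE. The result is one lower-triangular matrix of p-values per feature rather than a long table, and the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
mydata <- data.frame(
  feature = rep(1:3, each = 40),
  group   = rep(rep(c("A", "B", "C", "D"), each = 10), times = 3),
  value   = rnorm(120)
)

# One matrix of pairwise p-values per feature (rows B-D, columns A-C)
res <- lapply(split(mydata, mydata$feature), function(d) {
  pairwise.t.test(d$value, d$group,
                  p.adjust.method = "none", pool.sd = FALSE)$p.value
})
res[["1"]]
```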
I would like to transform a list like this:
l <- list(x = c(1, 2), y = c(3, 4, 5))
into a tibble like this:
Name Value
x 1
x 2
y 3
y 4
y 5
I think nothing will be easier than using the stack-function from base R:
df <- stack(l)
gives you a dataframe back:
> df
values ind
1 1 x
2 2 x
3 3 y
4 4 y
5 5 y
Because you asked for tibble as output, you can do as_tibble(df) (from the tibble-package) to get that.
Or more directly: df <- as_tibble(stack(l)).
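Note that stack() names its columns values and ind and returns ind as a factor. To get the Name and Value layout from the question, the columns can be reordered and renamed afterwards. A small base R sketch (converting Name to character is optional and shown only as an assumption about the desired type):

```r
l <- list(x = c(1, 2), y = c(3, 4, 5))

# Put the name column first, then rename both columns
df <- setNames(stack(l)[, c("ind", "values")], c("Name", "Value"))
df$Name <- as.character(df$Name)  # stack() returns a factor here
df
```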
Another pure base R method:
df <- data.frame(ind = rep(names(l), lengths(l)), value = unlist(l), row.names = NULL)
which gives a similar result:
> df
ind value
1 x 1
2 x 2
3 y 3
4 y 4
5 y 5
The row.names = NULL isn't strictly needed, but it replaces the names carried over from unlist(l) with plain row numbers.
Update
I found a better solution.
This works both in case of simple and complicated lists like the one I posted before (below)
l %>% map_dfr(~ .x %>% as_tibble(), .id = "name")
gives us
# A tibble: 5 x 2
name value
<chr> <dbl>
1 x 1.
2 x 2.
3 y 3.
4 y 4.
5 y 5.
==============================================
Original answer
From tidyverse:
l %>%
map(~ as_tibble(.x)) %>%
map2(names(.), ~ add_column(.x, Name = rep(.y, nrow(.x)))) %>%
bind_rows()
gives us
# A tibble: 5 × 2
value Name
<dbl> <chr>
1 1 x
2 2 x
3 3 y
4 4 y
5 5 y
The stack function from base R is great for simple lists as Jaap showed.
However, with more complicated lists like:
l <- list(
a = list(num = 1:3, let_a = letters[1:3]),
b = list(num = 101:103, let_b = letters[4:6]),
c = list()
)
we get
stack(l)
values ind
1 1 a
2 2 a
3 3 b
4 a b
5 b a
6 c a
7 101 b
8 102 b
9 103 a
10 d a
11 e b
12 f b
which is wrong.
The tidyverse solution shown above works fine, keeping the data from different elements of the nested list separated:
# A tibble: 6 × 4
num let_a Name let_b
<int> <chr> <chr> <chr>
1 1 a a <NA>
2 2 b a <NA>
3 3 c a <NA>
4 101 <NA> b d
5 102 <NA> b e
6 103 <NA> b f
We can use melt from reshape2
library(reshape2)
melt(l)
# value L1
#1 1 x
#2 2 x
#3 3 y
#4 4 y
#5 5 y