dplyr mutate new dynamic variables with case_when - r

I'm aware of similar questions here and here, but I haven't been able to figure out the right solution for my specific situation. Some of what I'm finding are solutions which use mutate_, etc but I understand these are now obsolete. I'm new to dynamic usages of dplyr.
I have a dataframe which includes some variables with two different prefixes, alpha and beta:
df <- data.frame(alpha.num = c(1, 3, 5, 7),
alpha.char = c("a", "c", "e", "g"),
beta.num = c(2, 4, 6, 8),
beta.char = c("b", "d", "f", "h"),
which.to.use = c("alpha", "alpha", "beta", "beta"))
I want to create new variables with the prefix "chosen." which are copies of either the "alpha" or "beta" columns depending on which is named for that row in the "which.to.use" column. The desired output would be:
desired.df <- data.frame(alpha.num = c(1, 3, 5, 7),
alpha.char = c("a", "c", "e", "g"),
beta.num = c(2, 4, 6, 8),
beta.char = c("b", "d", "f", "h"),
which.to.use = c("alpha", "alpha", "beta", "beta"),
chosen.num = c(1, 3, 6, 8),
chosen.char = c("a", "c", "f", "h"))
My failed attempt:
varnames <- c("num", "char")
df %<>%
mutate(as.name(paste0("chosen.", varnames)) := case_when(
which.to.use == "alpha" ~ paste0("alpha.", varnames),
which.to.use == "beta" ~ paste0("beta.", varnames)
))
I'd prefer a pure dplyr solution, and even better would be one which could be included in a longer pipe modifying the df (i.e. no need to stop to create "varnames"). Thanks for your help.

Using some fun rlang stuff & purrr:
library(rlang)
library(purrr)
library(dplyr)
df <- data.frame(alpha.num = c(1, 3, 5, 7),
alpha.char = c("a", "c", "e", "g"),
beta.num = c(2, 4, 6, 8),
beta.char = c("b", "d", "f", "h"),
which.to.use = c("alpha", "alpha", "beta", "beta"),
stringsAsFactors = FALSE)
c("num", "char") %>%
map(~ mutate(df, !!sym(paste0("chosen.", .x)) :=
case_when(
which.to.use == "alpha" ~ !!sym(paste0("alpha.", .x)),
which.to.use == "beta" ~ !!sym(paste0("beta.", .x))
))) %>%
reduce(full_join)
Result:
alpha.num alpha.char beta.num beta.char which.to.use chosen.num chosen.char
1 1 a 2 b alpha 1 a
2 3 c 4 d alpha 3 c
3 5 e 6 f beta 6 f
4 7 g 8 h beta 8 h
Without reduce(full_join):
c("num", "char") %>%
map_dfc(~ mutate(df, !!sym(paste0("chosen.", .x)) :=
case_when(
which.to.use == "alpha" ~ !!sym(paste0("alpha.", .x)),
which.to.use == "beta" ~ !!sym(paste0("beta.", .x))
))) %>%
select(-ends_with("1"))
alpha.num alpha.char beta.num beta.char which.to.use chosen.num chosen.char
1 1 a 2 b alpha 1 a
2 3 c 4 d alpha 3 c
3 5 e 6 f beta 6 f
4 7 g 8 h beta 8 h
Explanation:
(Note: I do not fully or even kind of get rlang. Maybe others can give a better explanation ;).)
Using paste0 by itself produces a string, when we need a bare name for mutate to know it is referring to a variable name.
If we wrap paste0 in sym, it evaluates to a bare name:
> x <- varnames[1]
> sym(paste0("alpha.", x))
alpha.num
But mutate does not know to look that name up as a column; on its own, the result is just a value of type symbol:
> typeof(sym(paste0("alpha.", x)))
[1] "symbol"
The "bang bang" !! operator unquotes: it evaluates the sym() call first and inserts the resulting name into the expression. Compare:
> expr(mutate(df, var = sym(paste0("alpha.", x))))
mutate(df, var = sym(paste0("alpha.", x)))
> expr(mutate(df, var = !!sym(paste0("alpha.", x))))
mutate(df, var = alpha.num)
So with !!sym we can use paste0 to build variable names dynamically and have dplyr treat them as columns.
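As a hedged alternative to !!sym, newer dplyr releases (1.0 or later is assumed here) also accept glue-style names on the left of := and the .data pronoun for string lookups on the right. A minimal sketch on the question's data; the loop variable v is only illustrative, and if_else covers just the two prefixes shown:

```r
library(dplyr)

df <- data.frame(alpha.num = c(1, 3, 5, 7),
                 alpha.char = c("a", "c", "e", "g"),
                 beta.num = c(2, 4, 6, 8),
                 beta.char = c("b", "d", "f", "h"),
                 which.to.use = c("alpha", "alpha", "beta", "beta"),
                 stringsAsFactors = FALSE)

# One mutate() per suffix; "chosen.{v}" is expanded to the new column name
for (v in c("num", "char")) {
  df <- df %>%
    mutate("chosen.{v}" := if_else(which.to.use == "alpha",
                                   .data[[paste0("alpha.", v)]],
                                   .data[[paste0("beta.", v)]]))
}
df
```

Because each column is picked with .data[[...]], the new columns keep the types of their sources (numeric and character).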

This is a nest()/map() strategy that should be pretty fast. It stays in the tidyverse, but doesn't go into rlang land.
library(tidyverse)
df %>%
nest(-which.to.use) %>%
mutate(new_data = map2(data, which.to.use,
~ select(..1, matches(..2)) %>%
rename_all(~ gsub(".*\\.", "choosen.", .x)))) %>%
unnest()
which.to.use alpha.num alpha.char beta.num beta.char choosen.num choosen.char
1 alpha 1 a 2 b 1 a
2 alpha 3 c 4 d 3 c
3 beta 5 e 6 f 6 f
4 beta 7 g 8 h 8 h
It grabs all columns other than which.to.use, not just num and char. But that seems like what you (I) would want IRL. You could add a select(matches('(var1|var2|etc)')) line before you call nest() if you wanted to pull only specific variables.
EDIT:
My original suggestion of using select() to drop unneeded columns would require a join to bring them back later. If instead you adjust the nest parameters, you can achieve this on only certain columns.
I added new bool columns here, but they will be ignored for the "choosen" selection:
new_df <- data.frame(alpha.num = c(1, 3, 5, 7),
alpha.char = c("a", "c", "e", "g"),
alpha.bool = FALSE,
beta.num = c(2, 4, 6, 8),
beta.char = c("b", "d", "f", "h"),
beta.bool = TRUE,
which.to.use = c("alpha", "alpha", "beta", "beta"),
stringsAsFactors = FALSE)
new_df %>%
nest(matches("num|char")) %>% # only columns that match this pattern get nested, allows you to save others
mutate(new_data = map2(data, which.to.use,
~ select(..1, matches(..2)) %>%
rename_all(~ gsub(".*\\.", "choosen.", .x)))) %>%
unnest()
alpha.bool beta.bool which.to.use alpha.num alpha.char beta.num beta.char choosen.num choosen.char
1 FALSE TRUE alpha 1 a 2 b 1 a
2 FALSE TRUE alpha 3 c 4 d 3 c
3 FALSE TRUE beta 5 e 6 f 6 f
4 FALSE TRUE beta 7 g 8 h 8 h

A base R approach using apply with MARGIN = 1: for each row, we select the columns whose names match the value in the which.to.use column and take that row's values from them.
df[c("chosen.num", "chosen.char")] <-
t(apply(df, 1, function(x) x[grepl(x["which.to.use"], names(df))]))
df
# alpha.num alpha.char beta.num beta.char which.to.use chosen.num chosen.char
#1 1 a 2 b alpha 1 a
#2 3 c 4 d alpha 3 c
#3 5 e 6 f beta 6 f
#4 7 g 8 h beta 8 h
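One caveat with apply() on a data frame: it first converts the data to a matrix, and with mixed column types that matrix is character, so chosen.num ends up stored as text. A minimal base R sketch that keeps the original types is below; pick_by_prefix is a hypothetical helper name, not part of the answer above.

```r
df <- data.frame(alpha.num = c(1, 3, 5, 7),
                 alpha.char = c("a", "c", "e", "g"),
                 beta.num = c(2, 4, 6, 8),
                 beta.char = c("b", "d", "f", "h"),
                 which.to.use = c("alpha", "alpha", "beta", "beta"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: for each row, read the value from "<prefix>.<suffix>",
# where the prefix comes from that row's which.to.use entry
pick_by_prefix <- function(data, suffix) {
  cols <- paste(data$which.to.use, suffix, sep = ".")
  mapply(function(col, i) data[[col]][i],
         cols, seq_len(nrow(data)), USE.NAMES = FALSE)
}

df$chosen.num <- pick_by_prefix(df, "num")
df$chosen.char <- pick_by_prefix(df, "char")
str(df[c("chosen.num", "chosen.char")])
```

Since each suffix is handled separately, the numeric column never passes through a character matrix, and the helper works for any number of prefixes.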

You can also try a gather/spread approach
df %>%
rownames_to_column() %>%
gather(k,v,-which.to.use,-rowname) %>%
separate(k,into = c("k1", "k2"), sep="[.]") %>%
filter(which.to.use == k1) %>%
mutate(k1="chosen") %>%
unite(k, k1, k2,sep=".") %>%
spread(k,v) %>%
select(.,chosen.num, chosen.char) %>%
bind_cols(df, .)
alpha.num alpha.char beta.num beta.char which.to.use chosen.num chosen.char
1 1 a 2 b alpha 1 a
2 3 c 4 d alpha 3 c
3 5 e 6 f beta 6 f
4 7 g 8 h beta 8 h
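The same reshape idea can be sketched with pivot_longer(), which supersedes gather()/spread() in tidyr 1.0 and later (assumed installed here). The ".value" sentinel keeps num and char as separate columns, so their types are not merged into one character column; the helper column row is only illustrative.

```r
library(dplyr)
library(tidyr)

df <- data.frame(alpha.num = c(1, 3, 5, 7),
                 alpha.char = c("a", "c", "e", "g"),
                 beta.num = c(2, 4, 6, 8),
                 beta.char = c("b", "d", "f", "h"),
                 which.to.use = c("alpha", "alpha", "beta", "beta"),
                 stringsAsFactors = FALSE)

chosen <- df %>%
  mutate(row = row_number()) %>%
  # "alpha.num" becomes prefix = "alpha" plus a value column named "num"
  pivot_longer(-c(which.to.use, row),
               names_to = c("prefix", ".value"),
               names_sep = "\\.") %>%
  filter(prefix == which.to.use) %>%   # keep only the prefix named for each row
  arrange(row) %>%
  select(chosen.num = num, chosen.char = char)

out <- bind_cols(df, chosen)
out
```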

Related

Assign value to new column based on values in 2 other columns

Here is an example code:
Group <- c("A", "A", "A", "A", "A", "B", "B", "B","B", "B")
Actor <- c(1, 3, 6, 4, 1, 2, 2, 6, 4, 3)
df <- data.frame(Group,Actor)
df
Now, what I want to do is to create three new columns (Sex, Status, SexStat) based on the data in the Group and Actor columns.
For example, if Group = A and Actor = 1, then Sex = M, Status = Dom, and SexStat = DomM. If Group = A and Actor = 3, then Sex = F, Status = Med, and SexStat = MedF (and so on).
The numbers do not always align with the same rank/sexes in every group, and with 5500 lines of data, I would love it if there was a way to not do this manually! Any help would be much appreciated.
You can create conditions for Sex and Status and then paste them to create SexStat
library(dplyr)
Group <- c("A", "A", "A", "A", "A", "B", "B", "B","B", "B")
Actor <- c(1, 3, 6, 4, 1, 2, 2, 6, 4, 3)
df <- data.frame(Group,Actor)
df
df %>%
mutate(
Sex = case_when(
Group == "A" & Actor == 1 ~ "M",
Group == "A" & Actor == 3 ~ "F",
TRUE ~ ""
),
Status = case_when(
Group == "A" & Actor == 1 ~ "Dom",
Group == "A" & Actor == 3 ~ "Med",
TRUE ~ ""
),
SexStat = paste0(Status,Sex)
)
Group Actor Sex Status SexStat
1 A 1 M Dom DomM
2 A 3 F Med MedF
3 A 6
4 A 4
5 A 1 M Dom DomM
6 B 2
7 B 2
8 B 6
9 B 4
10 B 3
We may do this with a key/value dataset by joining
library(dplyr)
library(tidyr)
library(stringr)
keydat <- tibble(Group = "A", Actor = c(1, 3), Sex = c("M", "F"), Status = c("Dom", "Med"))
df %>%
left_join(keydat, by = c("Group", "Actor")) %>%
mutate(across(c(Sex, Status), ~ replace_na(.x, "")),
SexStat = str_c(Status, Sex))
Output:
Group Actor Sex Status SexStat
1 A 1 M Dom DomM
2 A 3 F Med MedF
3 A 6
4 A 4
5 A 1 M Dom DomM
6 B 2
7 B 2
8 B 6
9 B 4
10 B 3

How to pivot_wider only a single condition using a single command in R

Let's say I test 3 drugs (A, B, C) at 3 conditions (0, 1, 2), and then I want to compare two of the conditions (1, 2) to a reference condition (0). This is the plot I would like to get:
First: I do get there, but my solution seems overly complex.
# The data I have
df <- data.frame(
drug = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
cond = c(0, 1, 2, 0, 1, 2, 0, 1, 2),
result = c(1, 2, 3, 2, 4, 6, 3, 6, 9)
)
# The data I want
df_wider0 <- data.frame(
drug = c("A", "A", "B", "B", "C", "C"),
result0 = c(1, 1, 2, 2, 3, 3),
cond = c(1, 2, 1, 2, 1, 2),
result = c(2, 3, 4, 6, 6, 9)
)
# This pivots also condition 1 and 2 ...
df_wider <- tidyr::pivot_wider(
df,
names_from = cond,
values_from = result
)
# ... so I pivot out these two again ...
colnames(df_wider)[colnames(df_wider) == "0"] <- "result0"
df_wider0 <- tidyr::pivot_longer(
df_wider,
cols = c("1", "2"),
names_to = "cond",
values_to = "result"
)
# ... so that I can use this ggplot command:
library(ggplot2)
ggplot(df_wider0, aes(x = result0, y = result, label = drug)) +
geom_label() +
facet_wrap("cond")
As you can see, I use a sequence of pivot_wider and pivot_longer to do a selective pivot_wider (by inverting some of its effects later). Is there an integrated command that I can use to achieve this more elegantly?
This can also be a strategy. (It works even if the number of conditions differs between groups.)
df %>%
filter(cond != 0) %>%
right_join(df %>% filter(cond == 0), by = "drug", suffix = c("", "0")) %>%
select(-cond0)
Revised df used for this answer:
df <- data.frame(
drug = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D"),
cond = c(0, 1, 2, 0, 1, 2, 0, 1, 2, 0),
result = c(1, 2, 3, 2, 4, 6, 3, 6, 9, 10)
)
Result of the code above:
drug cond result result0
1 A 1 2 1
2 A 2 3 1
3 B 1 4 2
4 B 2 6 2
5 C 1 6 3
6 C 2 9 3
7 D NA NA 10
You may also fill cond afterwards if desired.
You can do this without any pivot statement at all.
library(dplyr)
library(ggplot2)
df_wider0 <- df %>%
# look up each drug's cond == 0 result by matching against the reference rows
mutate(result0 = result[cond == 0][match(drug, drug[cond == 0])]) %>%
filter(cond != 0)
df_wider0
# drug cond result result0
#1 A 1 2 1
#2 A 2 3 1
#3 B 1 4 2
#4 B 2 6 2
#5 C 1 6 3
#6 C 2 9 3
Plot the data :
ggplot(df_wider0, aes(x = result0, y = result, label = drug)) +
geom_label() +
facet_wrap("cond")

Insert specified values in R grouped df and fill up missing values using another df (R)

I have 2 dfs : df & xdf.
df <- tibble(id = c("a", "a", "a", "a", "b", "b", "b", "b"),
x = c(1, 2, 3, 4, 1, 2, 3, 4),
y = c(0.2, 0, 0.9, 7, 1, 0.3, 5, 5.1))
xdf <- tibble(id = c("a", "b"),
x = c(2, 3.5))
In df, within the "id" column, for each group (a & b), I would like to insert only the row of xdf whose id matches that group. How can I do this? I have tried the following commands, but all of the values of xdf$x are inserted for each group.
ndf <- df %>%
group_by(id) %>%
do(add_row(., id = .$id[1], x = xdf$x))
> ndf
# A tibble: 12 x 3
# Groups: id [2]
id x y
<chr> <dbl> <dbl>
1 a 1 0.2
2 a 2 0
3 a 3 0.9
4 a 4 7
5 a 2 NA
6 a 3.5 NA
7 b 1 1
8 b 2 0.3
9 b 3 5
10 b 4 5.1
11 b 2 NA
12 b 3.5 NA
# expected result should be : ndf <- ndf[c(-6,-11),]
My end goal is to fill these newly created NA values in ndf with the approx() function. But my issue remains because I'm using xout = xdf$x, which brings in the extra values. How can I overcome this? Can you help me write a function that makes xout vary by group?
f <- function(z)
{
fdf <- approx(z$x, z$y, xout = xdf$x, method = "linear")
return(data.frame(nx= fdf$x, y.out = fdf$y, id = unique(z$id)))
}
jdf <- as.data.frame(ddply(ndf, .(id), f))
zdf <- subset(jdf, select = c(id, nx, y.out))
> zdf
id nx y.out
1 a 2.0 0.00
2 a 3.5 3.95
3 b 2.0 0.30
4 b 3.5 5.05
# expected results
id nx y.out
1 a 2.0 0.00
2 b 3.5 5.05
Any helpful tips to this is welcome. Many thanks!
library(dplyr)
df <- tibble(id = c("a", "a", "a", "a", "b", "b", "b", "b"),
x = c(1, 2, 3, 4, 1, 2, 3, 4),
y = c(0.2, 0, 0.9, 7, 1, 0.3, 5, 5.1))
xdf <- tibble(id = c("a", "b"),
x = c(2, 3.5))
ndf <- df %>%
bind_rows(xdf) %>%
arrange(id)
zdf <- ndf %>%
group_by(id) %>%
group_modify(~mutate(., y_approx = approx(.$x, .$y, .$x, method = "linear")[["y"]])) %>%
ungroup() %>%
filter(is.na(y)) %>%
select(id, y_approx)
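If the goal is only the interpolated values, xout can also be varied per group directly, without inserting rows first. A minimal sketch, assuming dplyr 0.8.1 or later for group_modify(); the column names nx and y.out follow the question's expected output:

```r
library(dplyr)

df <- tibble(id = c("a", "a", "a", "a", "b", "b", "b", "b"),
             x = c(1, 2, 3, 4, 1, 2, 3, 4),
             y = c(0.2, 0, 0.9, 7, 1, 0.3, 5, 5.1))
xdf <- tibble(id = c("a", "b"),
              x = c(2, 3.5))

zdf <- df %>%
  group_by(id) %>%
  group_modify(function(d, key) {
    # only the xout values that belong to the current group
    xout <- xdf$x[xdf$id == key$id]
    tibble(nx = xout,
           y.out = approx(d$x, d$y, xout = xout, method = "linear")$y)
  }) %>%
  ungroup()
zdf
```

Here key is the one-row tibble of grouping values that group_modify() passes to the function, so each group sees only its own entry from xdf.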

dplyr mutate to replace specific values in a data frame

I have a data frame that consists of characters "a", "b", "x", "y".
df <- data.frame(v1 = c("a", "b", "x", "y"),
v2 = c("a", "b", "a", "y"))
Now I want to replace all values with the following scheme and also convert the whole data frame to numeric.
"a" -> 0
"b" -> 1
"x" -> 1
"y" -> 2
I know this must be somehow possible with mutate_all but I cannot figure out how
df %>% mutate_all(replace("a", 1)) %>%
mutate_all(is.character, as.numeric)
One solution could be with case_when:
df %>%
mutate_all(~ case_when(. == "a" ~ 0,
. %in% c("b", "x") ~ 1,
. == "y" ~ 2,
TRUE ~ NA_real_))
# v1 v2
# 1 0 0
# 2 1 1
# 3 1 0
# 4 2 2
Create a named vector with mappings and then subset it using mutate_all
vec <- c(a = 0, b = 1, x = 1, y = 2)
library(dplyr)
df %>% mutate_all(~vec[.])
# v1 v2
#1 0 0
#2 1 1
#3 1 0
#4 2 2
In base R that would be just
df[] <- vec[unlist(df)]
data
df <- data.frame(v1 = c("a", "b", "x", "y"),
v2 = c("a", "b", "a", "y"), stringsAsFactors = FALSE)
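In current dplyr, mutate_all() is superseded by across(). The named-vector lookup could be sketched as follows, assuming dplyr 1.0 or later; res is only an illustrative name:

```r
library(dplyr)

df <- data.frame(v1 = c("a", "b", "x", "y"),
                 v2 = c("a", "b", "a", "y"), stringsAsFactors = FALSE)

# mapping from character codes to numbers
vec <- c(a = 0, b = 1, x = 1, y = 2)

res <- df %>%
  # index the named vector by each column's values; unname() drops the labels
  mutate(across(everything(), ~ unname(vec[.x])))
res
```

The lookup relies on the columns being character; with factor columns, vec[.x] would index by the integer codes instead of the labels.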

How to assign a value to a data.frame filtered by dplyr?

I am trying to modify a data.frame filtered by dplyr but I don't quite seem to grasp what I need to do. In the following example, I am trying to filter the data frame z and then assign a new value to the third column -- I give two examples, one with "9" and one with "NA".
require(dplyr)
z <- data.frame(w = c("a", "a", "a", "b", "c"), x = 1:5, y = c("a", "b", "c", "d", "e"))
z %>% filter(w == "a" & x == 2) %>% select(y)
z %>% filter(w == "a" & x == 2) %>% select(y) <- 9 # Should be similar to z[z$w == "a" & z$ x == 2, 3] <- 9
z %>% filter(w == "a" & x == 3) %>% select(y) <- NA # Should be similar to z[z$w == "a" & z$ x == 3, 3] <- NA
Yet, it doesn't work: I get the following error message:
Error in z %>% filter(w == "a" & x == 3) %>% select(y) <- NA : could not find function "%>%<-"
I know that I can use the old data.frame notation, but what would be the solution for dplyr?
Thanks!
Filtering will subset the data frame. If you want to keep the whole data frame, but modify part of it, you can, for example use mutate with ifelse. I've added stringsAsFactors=FALSE to your sample data so that y will be a character column.
z <- data.frame(w = c("a", "a", "a", "b", "c"), x = 1:5, y = c("a", "b", "c", "d", "e"),
stringsAsFactors=FALSE)
z %>% mutate(y = ifelse(w=="a" & x==2, 9, y))
w x y
1 a 1 a
2 a 2 9
3 a 3 c
4 b 4 d
5 c 5 e
Or with replace:
z %>% mutate(y = replace(y, w=="a" & x==2, 9),
y = replace(y, w=="a" & x==3, NA))
w x y
1 a 1 a
2 a 2 9
3 a 3 <NA>
4 b 4 d
5 c 5 e
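For several targeted replacements at once, dplyr 1.0 and later also provides rows_update(), which patches matching rows from a second data frame. A minimal sketch, assuming that version is installed; the names patch and out are only illustrative:

```r
library(dplyr)

z <- data.frame(w = c("a", "a", "a", "b", "c"), x = 1:5,
                y = c("a", "b", "c", "d", "e"),
                stringsAsFactors = FALSE)

# key columns (w, x) identify the rows; y carries the new values
patch <- data.frame(w = c("a", "a"), x = 2:3,
                    y = c("9", NA),
                    stringsAsFactors = FALSE)

out <- rows_update(z, patch, by = c("w", "x"))
out
```

Like mutate(), this returns a modified copy, so assign the result back to z if the change should persist.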
It is my impression that the dplyr package is philosophically opposed to modifying your underlying data. You might find the data.table package friendlier for this operation:
library(data.table)
z <- data.table(w = c("a", "a", "a", "b", "c"), x = 1:5, y = c("a", "b", "c", "d", "e"))
m <- data.table(w = c("a","a"), x = c(2,3), new_y = c("9", NA))
z[m, y := new_y, on=c("w","x")]
w x y
1: a 1 a
2: a 2 9
3: a 3 NA
4: b 4 d
5: c 5 e
I'm sure there's a way in base R as well, but I don't know it. In particular, I can't get merge or match to do the job.
