dplyr ifelse mutate reference to variable outside the data frame - r

I have a simple problem but I haven't figured out the solution yet. I don't know how to reference a variable outside the data frame when I'm using dplyr. Here is a small chunk of code:
library(dplyr)
var <- 1
df <- data.frame(col1 = c("a", "b", "c"), col2 = c(1, 2, 3))
df %>% mutate(col2 = ifelse(var == 1, col2 + var, col2))
Result:
col1 col2
1 a 2
2 b 2
3 c 2
Desired output:
col1 col2
1 a 2
2 b 3
3 c 4

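This is not a dplyr-specific issue. ifelse is vectorized and returns a result the same length as its test, so a length-1 condition such as var == 1 yields a single value, which mutate then recycles down the whole column. When the condition has length 1, use if and else instead of ifelse.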
library(dplyr)
df %>% mutate(col2 = if(var == 1) col2 + var else col2)
# col1 col2
#1 a 2
#2 b 3
#3 c 4
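As a rough illustration of the difference outside of dplyr (the names res_ifelse and res_if exist only for this sketch):

```r
var <- 1
col2 <- c(1, 2, 3)

# ifelse() shapes its result like the test; the test has length 1,
# so only the first element of col2 + var is returned
res_ifelse <- ifelse(var == 1, col2 + var, col2)

# if/else evaluates the chosen branch as a whole vector
res_if <- if (var == 1) col2 + var else col2

print(res_ifelse)
print(res_if)
```

Inside mutate, the single value from ifelse is recycled to every row, which is why the original attempt shows 2 in all three rows.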

We could use rowwise and sum
df %>%
  rowwise() %>%
  mutate(col2 = ifelse(var == 1, sum(col2, var), col2))
col1 col2
<chr> <dbl>
1 a 2
2 b 3
3 c 4

If the goal is instead to update only the rows where col2 equals var, we could use base R:
i1 <- df$col2 == var
df$col2[i1] <- df$col2[i1] + var
-output
> df
col1 col2
1 a 2
2 b 2
3 c 3
Or use data.table
library(data.table)
setDT(df)[col2 == var, col2 := col2 + var]

Related

how to simplify repetitive mutate conditions

I have an example df:
df <- data.frame(
col1 = c(4,5,6,11),
col2 = c('b','b','c', 'b')
)
> df
col1 col2
1 4 b
2 5 b
3 6 c
4 11 b
and I mutate based on these conditions:
df2 <- df %>%
  mutate(col3 = case_when(
    col2 == 'b' & col1 == 4 ~ 10,
    col2 == 'b' & col1 == 5 ~ 15,
    col2 == 'b' & col1 == 11 ~ 20,
    col2 == 'c' & col1 == 6 ~ 7)
  )
> df2
col1 col2 col3
1 4 b 10
2 5 b 15
3 6 c 7
4 11 b 20
You can see that the first 3 conditions are repetitive in that they require col2 == 'b'. Is there some syntax or another package or more efficient way of combining same/similar conditions so that I don't need to repeat col2 == 'b'? Like a one liner that if col2 == 'b' then do these transformations.
You can write nested case_when
df %>% mutate(col3 = case_when(
  col2 == "b" ~ case_when(
    col1 == 4 ~ 10,
    col1 == 5 ~ 15,
    col1 == 11 ~ 20),
  col2 == "c" ~ case_when(
    col1 == 6 ~ 7)))
Another solution could be to use an auxiliary lookup table. This gives up some of the flexibility of case_when and only works for equality matches, but I suspect it is a lot faster.
library(dplyr)
df <- data.frame(
col1 = c(4,5,6,11),
col2 = c('b','b','c', 'b')
)
choices <- data.frame(
col2 = c(rep("b",3),"c"),
col1 = c(4,5,11,6),
col3 = c(10,15,20,7)
)
df %>% left_join(choices, by = c("col1"="col1", "col2"="col2"))
#> col1 col2 col3
#> 1 4 b 10
#> 2 5 b 15
#> 3 6 c 7
#> 4 11 b 20
Be sure to handle the unmatched cases: rows with no match in choices get NA in col3 after the left join.
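One way to handle unmatched rows can be sketched with coalesce. The row with col1 = 99 and the default value -1 are made up for this illustration and are not part of the original question:

```r
library(dplyr)

# Same shape as the example data, plus one row (99, "b") with no entry in the lookup table
df <- data.frame(
  col1 = c(4, 5, 6, 99),
  col2 = c("b", "b", "c", "b")
)

choices <- data.frame(
  col2 = c("b", "b", "b", "c"),
  col1 = c(4, 5, 11, 6),
  col3 = c(10, 15, 20, 7)
)

res <- df %>%
  left_join(choices, by = c("col1", "col2")) %>%
  # unmatched rows come back as NA; replace them with a hypothetical default
  mutate(col3 = coalesce(col3, -1))

print(res)
```

If NA is the desired result for unmatched rows, the coalesce step can simply be dropped.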

R - identify cols that contain any of a values set

I have a dataframe like this
df <- data.frame(col1 = c(letters[1:4],"a"),col2 = 1:5,col3 = letters[10:14])
df
col1 col2 col3
1 a 1 j
2 b 2 k
3 c 3 l
4 d 4 m
5 a 5 n
I would like to identify the columns that contain any value from the following vector:
vals=c("a","b","n","w")
A tidy solution would be awesome!
We may use select
library(dplyr)
df %>%
  select(where(~ any(. %in% vals, na.rm = TRUE)))
-output
col1 col3
1 a j
2 b k
3 c l
4 d m
5 a n
A similar option in base R is with Filter
Filter(\(x) any(x %in% vals, na.rm = TRUE), df)
col1 col3
1 a j
2 b k
3 c l
4 d m
5 a n
Another tidyverse option is to use keep() from purrr.
library(purrr)
df %>%
  keep(~ any(.x %in% vals))
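If only the names of the matching columns are needed rather than the columns themselves, a base R sketch along the same lines could look like this (has_val and hits are names chosen for the illustration):

```r
df <- data.frame(col1 = c(letters[1:4], "a"), col2 = 1:5, col3 = letters[10:14])
vals <- c("a", "b", "n", "w")

# One logical per column: does the column contain any of the values?
has_val <- vapply(df, function(x) any(x %in% vals), logical(1))

# Names of the columns that matched
hits <- names(df)[has_val]
print(hits)
```

The logical vector has_val can also be used to subset directly, for example df[has_val].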

Declaring variables inside mutate

I am trying to refer to a column through a variable inside mutate using all_of, but I am not getting the expected output.
asd <- data.frame(Col1 = c("A","B"), Col2 = c("R","E"))
a1 <- "Col1"
When I perform below operations, I get invalid output
asd %>% mutate(q1 = case_when(all_of(a1) == "A" ~ 1))
Col1 Col2 q1
1 A R NA
2 B E NA
Expected Output
asd %>% mutate(q1 = case_when(Col1 == "A" ~ 1))
Col1 Col2 q1
1 A R 1
2 B E NA
We could use glue::glue; just bear in mind that whatever you put inside the curly braces will be evaluated as R code:
library(glue)
asd %>%
  mutate(q1 = case_when(
    eval(parse(text = glue("{a1}"))) == "A" ~ 1
  ))
Col1 Col2 q1
1 A R 1
2 B E NA
Wrap it in get()
R> asd %>% mutate(q1 = case_when(all_of(get(a1)) == "A" ~ 1))
Col1 Col2 q1
1 A R 1
2 B E NA
We could use across
library(dplyr)
library(stringr)
asd %>%
  mutate(across(all_of(a1), ~ case_when(. == 'A' ~ 1),
                .names = "{str_replace(.col, '.*', 'q1')}"))
Col1 Col2 q1
1 A R 1
2 B E NA
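Another option, not mentioned in the answers above, is the .data pronoun that dplyr re-exports from rlang, which looks up a column by a name stored in a string. A minimal sketch, assuming a reasonably recent dplyr:

```r
library(dplyr)

asd <- data.frame(Col1 = c("A", "B"), Col2 = c("R", "E"))
a1 <- "Col1"

# .data[[a1]] refers to the column of asd whose name is stored in a1
res <- asd %>%
  mutate(q1 = case_when(.data[[a1]] == "A" ~ 1))

print(res)
```

This avoids eval(parse()) and get(), and it fails with an error if the named column does not exist in the data frame.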

How to delete duplicate rows (the shorter ones) based on certain columns?

Suppose I have the following df
df <- data.frame(col1 = c(1, 3, 1), col2 = c(2, 4, 2), col3 = c(NA, NA, "c"))
> df
col1 col2 col3
1 1 2 <NA>
2 3 4 <NA>
3 1 2 c
My goal is to delete all duplicate rows based on col1 and col2 such that the longer row "survives". In this case, the first row should be deleted. I tried
df[duplicated(df[, 1:2]), ]
but this gives me only the third row (and not the third and the second one). How to do it properly?
EDIT: The real df has 15 columns, of which the first 13 are used for identifying duplicates. In the last two columns roughly 2/3 of the rows are filled with NAs (the first 13 columns do not contain any NAs). Thus, my example df was misleading in the sense that there are two columns to be excluded for identifying the duplicates. I am sorry for that.
You can try this:
library(dplyr)
df %>% group_by(col1, col2) %>%
  slice(which.min(is.na(col3)))
or this :
df %>%
  group_by(col1, col2) %>%
  arrange(col3) %>%
  slice(1)
# # A tibble: 2 x 3
# # Groups: col1, col2 [2]
# col1 col2 col3
# <dbl> <dbl> <fctr>
# 1 1 2 c
# 2 3 4 NA
A GENERAL SOLUTION
With this most general solution there can be only one row per value of col1; see the comment in the code below to add col2 to the grouping variables. It assumes all NAs are on the right.
df %>% mutate(nna = df %>% is.na %>% rowSums) %>%
  group_by(col1) %>% # or group_by(col1,col2)
  slice(which.min(nna)) %>%
  select(-nna)
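Another option in base R is to sort by col3, which puts NAs last, and then drop the duplicates on the first two columns:
df <- data.frame(col1 = c(1, 3, 1), col2 = c(2, 4, 2), col3 = c(NA, NA, "c"))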
df <- df[order(df$col3),]
duplicates <- duplicated(df[,1:2])
duplicates_sub <- subset(df , duplicates == FALSE)
> duplicates_sub
col1 col2 col3
3 1 2 c
2 3 4 <NA>
EDIT: Keep all non-NA rows
df <- data.frame(col1 = c(1, 3, 1,3, 1), col2 = c(2, 4, 2,4, 2), col3 = c("a", NA, "c",NA, "b"))
df <- df[order(df$col3),]
duplicates <- duplicated(df[,1:2]) & is.na(df[,3])
duplicates_sub <- subset(df , duplicates == FALSE)
> duplicates_sub
col1 col2 col3
1 1 2 a
5 1 2 b
3 1 2 c
2 3 4 <NA>
You can sort NAs to the top or bottom before dropping dupes:
# in base, which puts NAs last
odf = df[do.call(order, df), ]
odf[!duplicated(odf[, c("col1", "col2")]), ]
# col1 col2 col3
# 3 1 2 c
# 2 3 4 <NA>
# or with data.table, which puts NAs first
library(data.table)
DF = setorder(data.table(df))
unique(DF, by=c("col1", "col2"), fromLast=TRUE)
# col1 col2 col3
# 1: 1 2 c
# 2: 3 4 NA
This approach cannot be taken with dplyr, which doesn't offer "sort by all columns" in arrange, nor fromLast in distinct.
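For the case described in the EDIT of the question (many key columns and more than one trailing column with NAs), the sorting idea can be sketched in base R by ordering on the number of NAs per row. key_cols and nna are names introduced for this illustration; with the real data, key_cols would hold the 13 identifying columns:

```r
df <- data.frame(
  col1 = c(1, 3, 1),
  col2 = c(2, 4, 2),
  col3 = c(NA, NA, "c"),
  stringsAsFactors = FALSE
)

# Columns that define a duplicate (hypothetical name)
key_cols <- c("col1", "col2")

# Number of missing values in each row
nna <- rowSums(is.na(df))

# Put the most complete rows first, then keep the first row per key combination
ord <- df[order(nna), ]
res <- ord[!duplicated(ord[, key_cols]), ]

print(res)
```

Because order is stable for ties, rows with the same number of NAs keep their original relative order, so the first of them is the one that survives.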

Only Keep Certain Combinations of Predictors in a Dataframe

Imagine that I have a data frame like this:
> col1 <- rep(1:3,10)
> col2 <- rep(c("a","b"),15)
> col3 <- rnorm(30,10,2)
> sample_df <- data.frame(col1 = col1, col2 = col2, col3 = col3)
> head(sample_df)
col1 col2 col3
1 1 a 13.460322
2 2 b 3.404398
3 3 a 8.952066
4 1 b 11.148271
5 2 a 9.808366
6 3 b 9.832299
I only want to keep combinations of predictors which, together, have a col3 standard deviation below 2. I can find the combinations using ddply, but I don't know how to backtrack to the original DF and select the correct levels.
> sample_df_summ <- ddply(sample_df, .(col1, col2), summarize, sd = sd(col3), count = length(col3))
> head(sample_df_summ)
col1 col2 sd count
1 1 a 2.702328 5
2 1 b 1.032371 5
3 2 a 2.134151 5
4 2 b 3.348726 5
5 3 a 2.444884 5
6 3 b 1.409477 5
For clarity, in this example I'd like to keep the rows of the DF with col1 = 3, col2 = b and with col1 = 1, col2 = b. How would I do this?
You can add a "keep" column that is TRUE only if the standard deviation is below 2. Then, you can use a left join (merge) to add the "keep" column to the initial dataframe. In the end, you just select with keep equal to TRUE.
# add the keep column
sample_df_summ$keep <- sample_df_summ$sd < 2
sample_df_summ$sd <- NULL
sample_df_summ$count <- NULL
# join and select the rows
sample_df_keep <- merge(sample_df, sample_df_summ, by = c("col1", "col2"), all.x = TRUE, all.y = FALSE)
sample_df_keep <- sample_df_keep[sample_df_keep$keep, ]
sample_df_keep$keep <- NULL
Using dplyr:
library(dplyr)
sample_df %>% group_by(col1, col2) %>% mutate(sd = sd(col3)) %>% filter(sd < 2)
You get:
#Source: local data frame [6 x 4]
#Groups: col1, col2
#
# col1 col2 col3 sd
#1 1 a 10.516437 1.4984853
#2 1 b 11.124843 0.8652206
#3 2 a 7.585740 1.8781241
#4 3 b 9.806124 1.6644076
#5 1 a 7.381209 1.4984853
#6 1 b 9.033093 0.8652206
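If the extra sd column is not wanted in the result, the condition can go straight into a grouped filter. A minimal sketch with small made-up data, so that the outcome does not depend on rnorm:

```r
library(dplyr)

# Made-up data: group (1, "a") has a small spread, group (2, "b") a large one
sample_df <- data.frame(
  col1 = c(1, 1, 1, 2, 2, 2),
  col2 = c("a", "a", "a", "b", "b", "b"),
  col3 = c(10, 10.5, 11, 1, 10, 20)
)

res <- sample_df %>%
  group_by(col1, col2) %>%
  # sd(col3) is computed once per group; the whole group is kept or dropped
  filter(sd(col3) < 2) %>%
  ungroup()

print(res)
```

Only the rows of the (1, "a") group remain, since its standard deviation is 0.5 while the other group's is far above 2.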
