Mutate, across, and case_when - r

I am having some trouble getting mutate, across, and case_when to work together. I've recreated a simple version of my problem here:
library(dplyr)
a <- c(1:10)
b <- c(2:11)
c <- c(3:12)
test <- tibble(a, b, c)
# A tibble: 10 x 3
a b c
<int> <int> <int>
1 1 2 3
2 2 3 4
3 3 4 5
4 4 5 6
5 5 6 7
6 6 7 8
7 7 8 9
8 8 9 10
9 9 10 11
10 10 11 12
My goal is to replace all of the 3's with 4's, and keep everything else the same. I have the following code:
test_1 <-
test %>%
mutate(across(a:c, ~ case_when(. == 3 ~ 4)))
# A tibble: 10 x 3
a b c
<dbl> <dbl> <dbl>
1 NA NA 4
2 NA 4 NA
3 4 NA NA
4 NA NA NA
5 NA NA NA
6 NA NA NA
7 NA NA NA
8 NA NA NA
9 NA NA NA
10 NA NA NA
It's close but I get NA values where I want to maintain the value in the original tibble. How do I maintain the original values using the mutate across structure?
Thank you in advance!

What about this?
> test %>%
+ mutate(across(a:c, ~ case_when(. == 3 ~ 4, TRUE ~ 1 * (.))))
# A tibble: 10 x 3
a b c
<dbl> <dbl> <dbl>
1 1 2 4
2 2 4 4
3 4 4 5
4 4 5 6
5 5 6 7
6 6 7 8
7 7 8 9
8 8 9 10
9 9 10 11
10 10 11 12
or
> test %>%
+ replace(. == 3, 4)
# A tibble: 10 x 3
a b c
<int> <int> <int>
1 1 2 4
2 2 4 4
3 4 4 5
4 4 5 6
5 5 6 7
6 6 7 8
7 7 8 9
8 8 9 10
9 9 10 11
10 10 11 12
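For context on why the original attempt produced NA: case_when() returns NA for every element that matches none of its conditions, so a catch-all branch is needed to pass the other values through. A minimal sketch, assuming dplyr is installed; the integer literals 3L and 4L are my own choice, used so the columns stay <int> instead of being converted to <dbl>:

```r
library(dplyr)

test <- tibble(a = 1:10, b = 2:11, c = 3:12)

# The TRUE branch is the fallback: values that are not 3 are returned unchanged.
# Using 4L (integer) keeps both branches the same type as the input columns.
test_fixed <- test %>%
  mutate(across(a:c, ~ case_when(.x == 3L ~ 4L, TRUE ~ .x)))

test_fixed
```

The same fallback idea underlies the `TRUE ~ 1 * (.)` branch above; multiplying by 1 there converts the integer column to double so that it matches the double literal 4.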

In base R, we can do:
test[test == 3] <- 4

This also works:
library(dplyr)
library(purrr)  # modify() comes from purrr
a <- c(1:10)
b <- c(2:11)
c <- c(3:12)
tibble(a, b, c) %>%
modify(~ ifelse(. == 3, 4, .))
# A tibble: 10 x 3
a b c
<dbl> <dbl> <dbl>
1 1 2 4
2 2 4 4
3 4 4 5
4 4 5 6
5 5 6 7
6 6 7 8
7 7 8 9
8 8 9 10
9 9 10 11
10 10 11 12

Related

Create lagged variables for several columns group by two conditions in r

I would like to create lagged variables for several columns that are grouped by two conditions.
Here is the dataset:
df <- data.frame(id = c(rep(1,4),rep(2,4)), tp = rep(1:4,2), x1 = 1:8, x2 = 2:9, x3 = 3:10, x4 = 4:11)
> df
id tp x1 x2 x3 x4
1 1 1 1 2 3 4
2 1 2 2 3 4 5
3 1 3 3 4 5 6
4 1 4 4 5 6 7
5 2 1 5 6 7 8
6 2 2 6 7 8 9
7 2 3 7 8 9 10
8 2 4 8 9 10 11
I want to lag x1, x2, x3, x4 that are grouped by id and tp and create new variables x1_lag1, x2_lag1, x3_lag1, x4_lag1, like this:
> df
id tp x1 x2 x3 x4 x1_lag1 x2_lag1 x3_lag1 x4_lag1
1 1 1 1 2 3 4 2 3 4 5
2 1 2 2 3 4 5 3 4 5 6
3 1 3 3 4 5 6 4 5 6 7
4 1 4 4 5 6 7 NA NA NA NA
5 2 1 5 6 7 8 6 7 8 9
6 2 2 6 7 8 9 7 8 9 10
7 2 3 7 8 9 10 8 9 10 11
8 2 4 8 9 10 11 NA NA NA NA
How to achieve that?
Your result doesn't seem to be grouped by tp at all. It is grouped by id and ordered by tp within the id grouping.
Generally a "lag" is a variable that takes the value from the previous row. The columns you want labeled as "lag" columns take the value from the next row, so we use the lead function.
library(dplyr)
df %>%
group_by(id) %>%
mutate(across(starts_with("x"), lead, .names = "{.col}_lag1")) %>%
ungroup()
# A tibble: 8 × 10
id tp x1 x2 x3 x4 x1_lag1 x2_lag1 x3_lag1 x4_lag1
<dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 1 1 2 3 4 2 3 4 5
2 1 2 2 3 4 5 3 4 5 6
3 1 3 3 4 5 6 4 5 6 7
4 1 4 4 5 6 7 NA NA NA NA
5 2 1 5 6 7 8 6 7 8 9
6 2 2 6 7 8 9 7 8 9 10
7 2 3 7 8 9 10 8 9 10 11
8 2 4 8 9 10 11 NA NA NA NA
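To make the lag/lead distinction above concrete, here is a small sketch on a plain vector, assuming dplyr is loaded (the variable names are illustrative only):

```r
library(dplyr)

x <- 1:4

lagged <- lag(x)   # value from the previous position: NA 1 2 3
led    <- lead(x)  # value from the next position:     2 3 4 NA
```

The requested "x1_lag1" columns follow the second pattern, which is why the answer uses lead() while keeping the column names from the question.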

How to get the selected max/min value (i.e. second largest/smallest) across row by dplyr

As the title says, how do I get the second/third largest/smallest value across rows with dplyr? Is there an elegant way to achieve it?
a <- data.frame(gp1=c(3:11), gp2=c(1:9), gp3=c(8,8,2,6,6,6,12,12,6))
## the max/min value is very simple
a %>%
rowwise() %>%
mutate(max1=max(gp1, gp2, gp3))
#
# # A tibble: 9 × 4
# # Rowwise:
# gp1 gp2 gp3 max1
# <int> <int> <dbl> <dbl>
# 1 3 1 8 8
# 2 4 2 8 8
# 3 5 3 2 5
# 4 6 4 6 6
# 5 7 5 6 7
# 6 8 6 6 8
# 7 9 7 12 12
# 8 10 8 12 12
# 9 11 9 6 11
The result should be similar to this:
#
# # A tibble: 9 × 4
# # Rowwise:
# gp1 gp2 gp3 max1 max2
# <int> <int> <dbl> <dbl> <dbl>
# 1 3 1 8 8 3
# 2 4 2 8 8 4
# 3 5 3 2 5 3
# 4 6 4 6 6 6
# 5 7 5 6 7 6
# 6 8 6 6 8 6
# 7 9 7 12 12 9
# 8 10 8 12 12 12
# 9 11 9 6 11 9
You can use c_across along with sort. The use of rev here reverses the sorted data, making it easy to select the largest value with index 1, the second-largest with index 2, etc.
Note that the "max2" column in your example output is wrong in row 8: the second-largest of 10, 8, and 12 is 10, not 12 (I think you may have included the "max1" column there).
a %>%
rowwise() %>%
mutate(
max1 = max(gp1, gp2, gp3),
max2 = rev(sort(c_across(c(gp1, gp2, gp3))))[2]
)
gp1 gp2 gp3 max1 max2
<int> <int> <dbl> <dbl> <dbl>
1 3 1 8 8 3
2 4 2 8 8 4
3 5 3 2 5 3
4 6 4 6 6 6
5 7 5 6 7 6
6 8 6 6 8 6
7 9 7 12 12 9
8 10 8 12 12 10
9 11 9 6 11 9
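The sort-and-index idea generalises to any rank. A base R sketch, assuming the same data frame as in the question; `nth_largest` is a hypothetical helper name, not a function from dplyr:

```r
a <- data.frame(gp1 = 3:11, gp2 = 1:9, gp3 = c(8, 8, 2, 6, 6, 6, 12, 12, 6))

# Hypothetical helper: n-th largest element of a vector (tied values are kept).
nth_largest <- function(x, n) sort(x, decreasing = TRUE)[n]

# apply() with MARGIN = 1 passes each row to the helper as a numeric vector.
a$max2 <- unname(apply(a[c("gp1", "gp2", "gp3")], 1, nth_largest, n = 2))
a$max2
```

Changing `n` gives the third-largest value, and dropping `decreasing = TRUE` gives the n-th smallest instead.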
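A solution with pmap_dbl, which does not involve rowwise (pmap_dbl returns a numeric vector, whereas plain pmap would create a list column):
library(dplyr)
library(purrr)
a %>%
mutate(max1 = pmax(gp1, gp2, gp3),
max2 = pmap_dbl(., ~ rev(sort(c(..1, ..2, ..3)))[2]))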
gp1 gp2 gp3 max1 max2
1 3 1 8 8 3
2 4 2 8 8 4
3 5 3 2 5 3
4 6 4 6 6 6
5 7 5 6 7 6
6 8 6 6 8 6
7 9 7 12 12 9
8 10 8 12 12 10
9 11 9 6 11 9
I am sure there is a shorter way to automate it, but here is a quick solution for now:
library(dplyr)
library(tidyr)  # for unnest_wider()
library(slider)
a %>%
rowwise() %>%
mutate(output = list(slide_dfc(sort(c_across(everything()), decreasing = TRUE), max, .before = 1, .complete = TRUE))) %>%
unnest_wider(output) %>%
rename_with(~ sub('\\.+(\\d)', 'Max_\\1', .), contains('.')) %>%
suppressMessages()
# A tibble: 9 × 5
gp1 gp2 gp3 Max_1 Max_2
<int> <int> <dbl> <dbl> <dbl>
1 3 1 8 8 3
2 4 2 8 8 4
3 5 3 2 5 3
4 6 4 6 6 6
5 7 5 6 7 6
6 8 6 6 8 6
7 9 7 12 12 9
8 10 8 12 12 10
9 11 9 6 11 9
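An option with pmax. Note that this blanks out every occurrence of the row maximum, so in a row where the maximum is tied (row 4 here) it returns the next distinct value (4) rather than the tied value (6):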
library(dplyr)
a %>%
mutate(max1 = do.call(pmax, across(everything())),
across(starts_with('gp'), ~ replace(.x, .x == max1, NA))) %>%
transmute(max2 = do.call(pmax, c(across(starts_with('gp')), na.rm = TRUE))) %>%
bind_cols(a, .)
Output:
gp1 gp2 gp3 max2
1 3 1 8 3
2 4 2 8 4
3 5 3 2 3
4 6 4 6 4
5 7 5 6 6
6 8 6 6 6
7 9 7 12 9
8 10 8 12 10
9 11 9 6 9
Or in base R
a$max2 <- do.call(pmax, c(replace(a, cbind(seq_len(nrow(a)),
max.col(a, 'first')), NA), na.rm = TRUE))
a$max2
[1] 3 4 3 6 6 6 9 10 9
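The answers in this thread differ on how they treat a tied maximum. A small base R sketch using row 4 of the data (6, 4, 6); the variable names are mine:

```r
row4 <- c(gp1 = 6, gp2 = 4, gp3 = 6)

# Ties kept: the two 6s occupy positions 1 and 2 of the sorted vector.
second_with_ties <- unname(sort(row4, decreasing = TRUE)[2])

# Ties collapsed: unique() drops the repeated 6 before sorting.
second_distinct <- sort(unique(row4), decreasing = TRUE)[2]

second_with_ties  # 6
second_distinct   # 4
```

Which of the two is "the second largest" depends on the use case, so it is worth deciding before picking one of the solutions.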

conditionally adding columns to a list of dataframes

I have a list of dataframes with either 2 or 4 columns.
a <- data.frame(a=1:10,
b=1:10,
c=1:10,
d=1:10)
b <- data.frame(a=1:10,
b=1:10)
list_of_df <- list(a,b)
I want to add 2 empty columns to each dataframe with only 2 columns.
I've tried this lapply approach:
lapply(list_of_df, function(x) ifelse(ncol(x) < 4,x%>%add_column(empty=NA),x <- x))
Unfortunately, this does not work. How can I fix this?
I came up with something similar:
add_col <- function(x){
col_to_add <- 4 - ncol(x)
if(col_to_add == 0) return(x)
z <- rep(NA, nrow(x))
for (i in 1:col_to_add){
x <- cbind(x, z)
}
x
}
lapply(list_of_df, add_col)
I would use a for loop to avoid copying the whole list:
for (i in seq_along(list_of_df)) {
n_columns = ncol(list_of_df[[i]])
if (n_columns == 2L) {
list_of_df[[i]][c('empty1', 'empty2')] <- NA
}
}
Result:
> list_of_df
[[1]]
a b c d
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
9 9 9 9 9
10 10 10 10 10
[[2]]
a b empty1 empty2
1 1 1 NA NA
2 2 2 NA NA
3 3 3 NA NA
4 4 4 NA NA
5 5 5 NA NA
6 6 6 NA NA
7 7 7 NA NA
8 8 8 NA NA
9 9 9 NA NA
10 10 10 NA NA
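As for why the attempt in the question fails: ifelse() is vectorised and returns a result the same length as its test, which here is a single TRUE or FALSE, so it cannot hand back a whole data frame. A plain if inside lapply avoids that. A minimal sketch, assuming the new columns should be called empty1 and empty2 as in the loop above:

```r
list_of_df <- list(
  data.frame(a = 1:10, b = 1:10, c = 1:10, d = 1:10),
  data.frame(a = 1:10, b = 1:10)
)

padded <- lapply(list_of_df, function(x) {
  # if evaluates one condition and lets us return the whole data frame
  if (ncol(x) < 4) {
    x[c("empty1", "empty2")] <- NA
  }
  x
})

sapply(padded, ncol)
```

Unlike the for loop, this leaves list_of_df untouched and returns a new list.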
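We could use bind_rows, which fills the columns missing from the 2-column data frames with NA (so they end up named c and d rather than empty), and then group_split and map from purrr to split the rows back into a list and remove the helper id_Group column: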
library(dplyr)
library(purrr)
bind_rows(list_of_df) %>%
group_split(id_Group =cumsum(a==1)) %>%
map(., ~ (.x %>% ungroup() %>%
select(-id_Group)))
[[1]]
# A tibble: 10 x 4
a b c d
<int> <int> <int> <int>
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
9 9 9 9 9
10 10 10 10 10
[[2]]
# A tibble: 10 x 4
a b c d
<int> <int> <int> <int>
1 1 1 NA NA
2 2 2 NA NA
3 3 3 NA NA
4 4 4 NA NA
5 5 5 NA NA
6 6 6 NA NA
7 7 7 NA NA
8 8 8 NA NA
9 9 9 NA NA
10 10 10 NA NA
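The cumsum(a == 1) grouping above relies on every data frame containing a == 1 exactly once, in its first row. A sketch that tracks the source explicitly instead, assuming dplyr is installed; the `src` column name is hypothetical and must not clash with a real column:

```r
library(dplyr)

list_of_df <- list(
  data.frame(a = 1:10, b = 1:10, c = 1:10, d = 1:10),
  data.frame(a = 1:10, b = 1:10)
)

# .id records which list element each row came from.
combined <- bind_rows(list_of_df, .id = "src")

# Fix the level order so the pieces come back in their original order.
groups <- factor(combined$src, levels = unique(combined$src))

padded <- lapply(unname(split(combined, groups)), function(x) {
  x$src <- NULL  # drop the helper column again
  x
})

sapply(padded, ncol)
```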

Create a function to impute values from one data frame into another

The NA values in column A should be filled by the A value from the dat data frame and so on for the other variables.
id <- factor(rep(letters[1:2], each=5))
A <- c(1,2,NA,6,8,9,0,6,7,9)
B <- c(5,6,1,9,8,1,NA,9,7,4)
C <- c(2,3,5,NA,NA,2,7,6,4,6)
D <- c(6,5,8,3,2,9,NA,2,6,8)
df <- data.frame(id, A, B,C,D)
df
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a NA 1 5 8
4 a 6 9 NA 3
5 a 8 8 NA 2
6 b 9 1 2 9
7 b 0 NA 7 NA
8 b 6 9 6 2
9 b 7 7 4 6
10 b 9 4 6 8
dat <- data.frame(col=c("A","B","C","D"), value=c(23,45,26,89))
dat
col value
1 A 23
2 B 45
3 C 26
4 D 89
It should look like:
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a 23 1 5 8
4 a 6 9 26 3
5 a 8 8 26 2
6 b 9 1 2 9
7 b 0 45 7 89
8 b 6 9 6 2
9 b 7 7 4 6
10 b 9 4 6 8
I was thinking of something like this, but I don't know how to connect the two data frames in a function:
test <- function(i){
df[,i][is.na(df[,i])] <- dat$value
}
test(2)
If you want to keep it in your format, look up the value by column name and assign to the global df with <<-:
test <- function(i){
df[,i][is.na(df[,i])] <<- dat$value[dat$col==i]
}
test("A")
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a 23 1 5 8
4 a 6 9 NA 3
5 a 8 8 NA 2
6 b 9 1 2 9
7 b 0 NA 7 NA
8 b 6 9 6 2
9 b 7 7 4 6
10 b 9 4 6 8
One approach is to iterate over the columns and values and use coalesce():
library(dplyr)
library(purrr)
df[-1] <- map2_df(df[-1], dat$value, coalesce)
df
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a 23 1 5 8
4 a 6 9 26 3
5 a 8 8 26 2
6 b 9 1 2 9
7 b 0 45 7 89
8 b 6 9 6 2
9 b 7 7 4 6
10 b 9 4 6 8
Or same using replace():
map2_df(df[-1], dat$value, ~ replace(.x, is.na(.x), .y))
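Both map2_df() calls pair columns and values by position, so they assume the rows of dat are in the same order as the columns of df. A base R sketch that matches by name instead and returns a modified copy; `impute_from` is a hypothetical helper name:

```r
df <- data.frame(
  id = factor(rep(letters[1:2], each = 5)),
  A = c(1, 2, NA, 6, 8, 9, 0, 6, 7, 9),
  B = c(5, 6, 1, 9, 8, 1, NA, 9, 7, 4),
  C = c(2, 3, 5, NA, NA, 2, 7, 6, 4, 6),
  D = c(6, 5, 8, 3, 2, 9, NA, 2, 6, 8)
)
dat <- data.frame(col = c("A", "B", "C", "D"), value = c(23, 45, 26, 89))

# Hypothetical helper: for each row of `lookup`, fill the NAs of the column
# named in lookup$col with lookup$value, then return the modified copy.
impute_from <- function(data, lookup) {
  for (i in seq_len(nrow(lookup))) {
    col <- as.character(lookup$col[i])
    data[[col]][is.na(data[[col]])] <- lookup$value[i]
  }
  data
}

imputed <- impute_from(df, dat)
imputed
```

Because the function takes both data frames as arguments, it does not need `<<-` and leaves the original df unchanged.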

R how to fill in NA with rules

data=data.frame(person=c(1,1,1,2,2,2,2,3,3,3,3),
t=c(3,NA,9,4,7,NA,13,3,NA,NA,12),
WANT=c(3,6,9,4,7,10,13,3,6,9,12))
Basically, I want to create a new variable 'WANT' that takes the previous value in t and adds 3 to it; if there are several NAs in a row, it keeps adding 3. My attempt is:
library(dplyr)
data %>%
group_by(person) %>%
mutate(WANT_TRY = fill(t) + 3)
Here's one way -
data %>%
group_by(person) %>%
mutate(
# cs = cumsum(!is.na(t)), # creates index for reference value; uncomment if interested
w = case_when(
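# rle() gives the running length of NA; t[!is.na(t)] holds the known values,
# indexed by how many of them have been seen so far
is.na(t) ~ t[!is.na(t)][cumsum(!is.na(t))] + 3*sequence(rle(is.na(t))$lengths),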
TRUE ~ t
)
) %>%
ungroup()
# A tibble: 11 x 4
person t WANT w
<dbl> <dbl> <dbl> <dbl>
1 1 3 3 3
2 1 NA 6 6
3 1 9 9 9
4 2 4 4 4
5 2 7 7 7
6 2 NA 10 10
7 2 13 13 13
8 3 3 3 3
9 3 NA 6 6
10 3 NA 9 9
11 3 12 12 12
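To see how the cumsum() and rle() pieces fit together, here is a base R sketch on a single group. It indexes into the non-missing values (the `known` vector and the other names are mine), and it assumes the group starts with a non-missing value:

```r
t <- c(3, NA, NA, 12)   # person 3 from the question

known <- t[!is.na(t)]                       # observed values: 3 12
idx   <- cumsum(!is.na(t))                  # observed values seen so far: 1 1 1 2
run   <- sequence(rle(is.na(t))$lengths)    # position within each run: 1 1 2 1

# For an NA, start from the last observed value and add 3 per step in the run.
filled <- ifelse(is.na(t), known[idx] + 3 * run, t)
filled
```

Here `known[idx]` is the last observed value at each position, and `run` counts how far into the current stretch of NAs we are, so consecutive NAs get +3, +6, and so on.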
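Here is another way. We can do linear interpolation on t with the imputeTS package; this matches WANT here because the known values in each group are evenly spaced, 3 apart per row.
library(dplyr)
library(imputeTS)
data2 <- data %>%
group_by(person) %>%
# na_interpolation() is the name used by newer imputeTS releases for na.interpolation()
mutate(WANT2 = na_interpolation(t)) %>%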
ungroup()
data2
# # A tibble: 11 x 4
# person t WANT WANT2
# <dbl> <dbl> <dbl> <dbl>
# 1 1 3 3 3
# 2 1 NA 6 6
# 3 1 9 9 9
# 4 2 4 4 4
# 5 2 7 7 7
# 6 2 NA 10 10
# 7 2 13 13 13
# 8 3 3 3 3
# 9 3 NA 6 6
# 10 3 NA 9 9
# 11 3 12 12 12
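Linear interpolation and the "previous value plus 3" rule are different rules that happen to agree on this data. A base R sketch of the difference, using approx() as a stand-in for the imputeTS call; `interp` is a hypothetical helper name:

```r
t_even   <- c(3, NA, NA, 12)  # known values 3 apart per step, as in the question
t_uneven <- c(3, NA, 20)      # made-up values that are not 3 apart

# Hypothetical helper: linear interpolation over positions 1..n.
# approx() drops the NA pairs and interpolates between the remaining points.
interp <- function(v) approx(seq_along(v), v, xout = seq_along(v))$y

even_result   <- interp(t_even)    # 3 6 9 12, same as the "+3" rule
uneven_result <- interp(t_uneven)  # 3 11.5 20, whereas "+3" would give 6
```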
This is harder than it seems because of the double NA at the end. If it weren't for that, then the following:
ifelse(is.na(data$t), c(0, data$t[-nrow(data)])+3, data$t)
...would give you what you want. The simplest way, which uses the same logic but doesn't look very clever (sorry!), would be:
.impute <- function(x) ifelse(is.na(x), c(0, x[-length(x)])+3, x)
.impute(.impute(data$t))
...which just cheats by doing it twice. Does that help?
You can use functional programming from purrr and "NA-safe" addition from hablar:
library(hablar)
library(dplyr)
library(purrr)
data %>%
group_by(person) %>%
mutate(WANT2 = accumulate(t, ~.x %plus_% 3))
Result
# A tibble: 11 x 4
# Groups: person [3]
person t WANT WANT2
<dbl> <dbl> <dbl> <dbl>
1 1 3 3 3
2 1 NA 6 6
3 1 9 9 9
4 2 4 4 4
5 2 7 7 7
6 2 NA 10 10
7 2 13 13 13
8 3 3 3 3
9 3 NA 6 6
10 3 NA 9 9
11 3 12 12 12
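One caveat: the lambda `~.x %plus_% 3` uses only the running value `.x`, so it adds 3 at every step regardless of what was observed. That reproduces WANT here because each observed value already sits 3 above its predecessor. A sketch that keeps observed values and only fills the gaps, using base R's Reduce() in place of purrr::accumulate(); `fill_step` is a hypothetical helper name:

```r
# Hypothetical helper: when the current value is NA, use the previous filled
# value plus `step`; otherwise keep the observed value.
fill_step <- function(v, step = 3) {
  Reduce(function(prev, cur) if (is.na(cur)) prev + step else cur,
         v, accumulate = TRUE)
}

question_case <- fill_step(c(3, NA, NA, 12))  # 3 6 9 12
uneven_case   <- fill_step(c(4, 8, NA))       # 4 8 11
```

Inside the grouped pipeline this could be used as mutate(WANT2 = fill_step(t)); it assumes each group starts with a non-missing value.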
