Exclude column in `dplyr` `mutate_at` while using data in this column - r

I want to rescale all variables (but year and gender) in a df by one specific year, grouped by gender:
set.seed(1)
df <- data.frame(gender = c(rep("m", 5), rep("f", 5)), year = rep(1:5, 2), var_a = 1:10, var_b = 0:9)
df
gender year var_a var_b
1 m 1 1 0
2 m 2 2 1
3 m 3 3 2
4 m 4 4 3
5 m 5 5 4
6 f 1 6 5
7 f 2 7 6
8 f 3 8 7
9 f 4 9 8
10 f 5 10 9
I can generate what I expect using:
df %>% group_by(gender) %>% mutate(var_a = ifelse(year == 3, 0, var_a - var_a[year == 3])) %>%
mutate(var_b = ifelse(year == 3, 0, var_b - var_b[year == 3]))
gender year var_a var_b
<fct> <int> <dbl> <dbl>
1 m 1 -2 -2
2 m 2 -1 -1
3 m 3 0 0
4 m 4 1 1
5 m 5 2 2
6 f 1 -2 -2
7 f 2 -1 -1
8 f 3 0 0
9 f 4 1 1
10 f 5 2 2
However, this is not an option since I have too many columns.
So I tried (with no success):
df %>% group_by(gender) %>% mutate_at(vars(-gender, -year), ifelse(year == 3, 0, var_a - var_a[year == 3]))
Error in ifelse(year == 3, 0, var_a - var_a[year == 3]) : object
'year' not found
How can I exclude column names in mutate_at (or an alternative) using vars(-col_name) while still reading the data in those columns?

Select the columns by position in mutate_at:
library(dplyr)
df %>%
group_by(gender) %>%
mutate_at(-c(1, 2), ~ifelse(year == 3, 0, . - .[year == 3]))
# gender year var_a var_b
# <fct> <int> <dbl> <dbl>
# 1 m 1 -2 -2
# 2 m 2 -1 -1
# 3 m 3 0 0
# 4 m 4 1 1
# 5 m 5 2 2
# 6 f 1 -2 -2
# 7 f 2 -1 -1
# 8 f 3 0 0
# 9 f 4 1 1
#10 f 5 2 2
If you do not know the column positions beforehand, you can look them up first:
cols <- which(names(df) %in% c("gender", "year"))
df %>%
group_by(gender) %>%
mutate_at(-cols, ~ifelse(year == 3, 0, . - .[year == 3]))
Or select the columns by name prefix with starts_with():
df %>%
group_by(gender) %>%
mutate_at(vars(starts_with("var")), ~ifelse(year == 3, 0, . - .[year == 3]))

If you add a ~ before the function you should get the wanted output.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
set.seed(1)
df <- data.frame(gender = c(rep("m", 5),
rep("f", 5)),
year = rep(1:5, 2), var_a = 1:10, var_b = 0:9)
df
#> gender year var_a var_b
#> 1 m 1 1 0
#> 2 m 2 2 1
#> 3 m 3 3 2
#> 4 m 4 4 3
#> 5 m 5 5 4
#> 6 f 1 6 5
#> 7 f 2 7 6
#> 8 f 3 8 7
#> 9 f 4 9 8
#> 10 f 5 10 9
df %>%
group_by(gender) %>%
mutate_at(vars(-gender, -year),
~ifelse(year == 3, 0, . - .[year == 3]))
#> # A tibble: 10 x 4
#> # Groups: gender [2]
#> gender year var_a var_b
#> <fct> <int> <dbl> <dbl>
#> 1 m 1 -2 -2
#> 2 m 2 -1 -1
#> 3 m 3 0 0
#> 4 m 4 1 1
#> 5 m 5 2 2
#> 6 f 1 -2 -2
#> 7 f 2 -1 -1
#> 8 f 3 0 0
#> 9 f 4 1 1
#> 10 f 5 2 2
Created on 2019-04-29 by the reprex package (v0.2.1)
EDIT:
In older versions of dplyr you would use funs(), but it is soft deprecated as of dplyr 0.8.0
df %>%
group_by(gender) %>%
mutate_at(vars(-gender, -year),
funs(ifelse(year == 3, 0, . - .[year == 3])))
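In dplyr 1.0.0 and later, the scoped verbs such as mutate_at() are superseded by across(). A minimal sketch of the same rescaling, assuming dplyr >= 1.0.0 and exactly one row with year == 3 per gender (the name out is only for illustration):

```r
library(dplyr)

df <- data.frame(gender = c(rep("m", 5), rep("f", 5)),
                 year = rep(1:5, 2), var_a = 1:10, var_b = 0:9)

# across() skips the grouping column (gender) automatically, so only
# year has to be excluded by name. Inside the lambda, .x is the current
# column and year is still visible as a regular data column.
out <- df %>%
  group_by(gender) %>%
  mutate(across(-year, ~ .x - .x[year == 3])) %>%
  ungroup()

out
```

Subtracting the reference value already yields 0 in the year-3 row, so the ifelse() wrapper is not needed under that assumption.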

Related

How to flag the last row of a data frame group?

Suppose we start with the below dataframe df:
ID <- c(1, 1, 1, 5, 5)
Period <- c(1,2,3,1,2)
Value <- c(10,12,11,4,6)
df <- data.frame(ID, Period, Value)
ID Period Value
1 1 1 10
2 1 2 12
3 1 3 11
4 5 1 4
5 5 2 6
Now using dplyr I add a "Calculate" column that multiplies Period and Value of each row, giving me the following:
> df %>% mutate(Calculate = Period * Value)
ID Period Value Calculate
1 1 1 10 10
2 1 2 12 24
3 1 3 11 33
4 5 1 4 4
5 5 2 6 12
I'd like to modify the above "Calculate" to give me a value of 0, when reaching the last row for a given ID, so that the data frame output looks like:
ID Period Value Calculate
1 1 1 10 10
2 1 2 12 24
3 1 3 11 0
4 5 1 4 4
5 5 2 6 0
I was going to use the lead() function to peek at the next row and see whether the ID changes, but I wasn't sure what happens when it reaches the end of the data frame.
How could this be accomplished using dplyr?
You can group_by ID and replace the last row for each ID with 0.
library(dplyr)
df %>%
mutate(Calculate = Period * Value) %>%
group_by(ID) %>%
mutate(Calculate = replace(Calculate, n(), 0)) %>%
ungroup
# ID Period Value Calculate
# <dbl> <dbl> <dbl> <dbl>
#1 1 1 10 10
#2 1 2 12 24
#3 1 3 11 0
#4 5 1 4 4
#5 5 2 6 0
Yet another possibility:
library(tidyverse)
ID <- c(1, 1, 1, 5, 5)
Period <- c(1,2,3,1,2)
Value <- c(10,12,11,4,6)
df <- data.frame(ID, Period, Value)
df %>%
mutate(Calculate = Period * Value) %>%
group_by(ID) %>%
mutate(Calculate = if_else(row_number() == n(), 0, Calculate)) %>%
ungroup
#> # A tibble: 5 × 4
#> ID Period Value Calculate
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 10 10
#> 2 1 2 12 24
#> 3 1 3 11 0
#> 4 5 1 4 4
#> 5 5 2 6 0
ID <- c(1, 1, 1, 5, 5)
Period <- c(1,2,3,1,2)
Value <- c(10,12,11,4,6)
df <- data.frame(ID, Period, Value)
library(tidyverse)
df %>%
mutate(Calculate = Period * Value * duplicated(ID, fromLast = TRUE))
#> ID Period Value Calculate
#> 1 1 1 10 10
#> 2 1 2 12 24
#> 3 1 3 11 0
#> 4 5 1 4 4
#> 5 5 2 6 0
Created on 2022-01-09 by the reprex package (v2.0.1)
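This should work. You can most likely use Period in place of rownum, as long as Period increases within each ID. Note that rownum has to be numeric: rownames(df) returns character strings, and max() on strings compares them alphabetically, so "9" would rank above "10".
library(dplyr)
ID <- c(1, 1, 1, 5, 5)
Period <- c(1,2,3,1,2)
Value <- c(10,12,11,4,6)
df <- data.frame(ID, Period, Value)
df = df %>% mutate(Calculate = Period * Value)
df$rownum = seq_len(nrow(df)) # numeric row index, so max() compares numbers
df = df %>%
group_by(ID) %>%
mutate(Calculate = ifelse(rownum == max(rownum), 0, Calculate)) %>%
ungroup()
A tibble: 5 × 5
ID Period Value Calculate rownum
<dbl> <dbl> <dbl> <dbl> <int>
1 1 1 10 10 1
2 1 2 12 24 2
3 1 3 11 0 3
4 5 1 4 4 4
5 5 2 6 0 5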
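On the lead() idea from the question: within a group, dplyr's lead() returns NA once it runs past the last row, so that NA can serve as the "last row" flag. A small sketch, assuming Period itself contains no missing values (out is just an illustrative name):

```r
library(dplyr)

df <- data.frame(ID = c(1, 1, 1, 5, 5),
                 Period = c(1, 2, 3, 1, 2),
                 Value = c(10, 12, 11, 4, 6))

out <- df %>%
  group_by(ID) %>%
  # lead(Period) is NA only on the last row of each ID group
  mutate(Calculate = if_else(is.na(lead(Period)), 0, Period * Value)) %>%
  ungroup()

out
```

Because the data frame is grouped by ID, lead() never looks into the next ID, which also covers the very last row of the data frame.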

How to move all values of the same ID to a new column if any of the values is greater than 5 in R

I'm struggling with a problem in R. I'm trying to move all values in the RL column that share the same ID in the TRIAL column into a new column, provided that any of those RL values is greater than 5.
I have a data set like this:
dt <- tibble(
TRIAL = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
RL = c(1, 2, 3, 1, 6, 3, 2, 3, 1),
SL = c(1, 1.5, 1, 0, 0, 1, 1, 1.5, 0)
)
# # A tibble: 9 x 3
# TRIAL RL SL
# <chr> <dbl> <dbl>
# 1 A 1 1
# 2 A 2 1.5
# 3 A 3 1
# 4 B 1 0
# 5 B 6 0
# 6 B 3 1
# 7 C 2 1
# 8 C 3 1.5
# 9 C 1 0
This is what I want to achieve: I want all values from one column in a group to be moved to a new column if the max value for that group is greater than 5, see example below.
# # A tibble: 9 x 4
# TRIAL RL SL RLCT
# <chr> <dbl> <dbl> <dbl>
# 1 A 1 1 NA
# 2 A 2 1.5 NA
# 3 A 3 1 NA
# 4 B NA 0 1
# 5 B NA 0 6
# 6 B NA 1 3
# 7 C 2 1 NA
# 8 C 3 1.5 NA
# 9 C 1 0 NA
When I run this code, I do not get the expected output:
dt %>% group_by("TRIAL") %>% mutate(RLCT = case_when ("RL"> 5 ~ "RL"))
# # A tibble: 9 x 5
# # Groups: "TRIAL" [1]
# TRIAL RL SL `"TRIAL"` RLCT
# <chr> <dbl> <dbl> <chr> <chr>
# 1 A 1 1 TRIAL RL
# 2 A 2 1.5 TRIAL RL
# 3 A 3 1 TRIAL RL
# 4 B 1 0 TRIAL RL
# 5 B 6 0 TRIAL RL
# 6 B 3 1 TRIAL RL
# 7 C 2 1 TRIAL RL
# 8 C 3 1.5 TRIAL RL
# 9 C 1 0 TRIAL RL
Surely not the most straightforward solution, but it seems to work:
dt0 <- dt %>%
mutate(RLCT = NA) %>%
group_by(TRIAL) %>%
filter(!any(RL > 5))
dt %>%
group_by(TRIAL) %>%
filter(any(RL > 5)) %>%
mutate(RLCT = RL) %>%
rbind(dt0, .) %>%
mutate(RL = ifelse(!is.na(RLCT), NA, RL))
# A tibble: 9 x 4
# Groups: TRIAL [3]
TRIAL RL SL RLCT
<chr> <dbl> <dbl> <dbl>
1 A 1 1 NA
2 A 2 1.5 NA
3 A 3 1 NA
4 C 2 1 NA
5 C 3 1.5 NA
6 C 1 0 NA
7 B NA 0 1
8 B NA 0 6
9 B NA 1 3
Append %>% arrange(TRIAL) to the pipeline to restore alphabetical ordering.
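A shorter route that keeps the original row order is to compute a per-group flag and switch the two columns with if_else(). This is only a sketch, assuming dplyr >= 1.0.0; the helper column move and the result name out are made up for illustration:

```r
library(dplyr)

dt <- tibble(
  TRIAL = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  RL = c(1, 2, 3, 1, 6, 3, 2, 3, 1),
  SL = c(1, 1.5, 1, 0, 0, 1, 1, 1.5, 0)
)

out <- dt %>%
  group_by(TRIAL) %>%
  mutate(
    move = any(RL > 5),                # TRUE for every row of a flagged group
    RLCT = if_else(move, RL, NA_real_), # copy RL into the new column
    RL   = if_else(move, NA_real_, RL)  # then blank RL for flagged groups
  ) %>%
  ungroup() %>%
  select(-move)

out
```

Note that the column names are unquoted here: group_by("TRIAL") in the question groups by the constant string "TRIAL" rather than by the column, and "RL" > 5 compares a string with a number.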

How to create combinations of values of one variable by group using tidyverse in R

I am using the combn function in R to get all the combinations of the values of variable y taking each time 2 values, grouping by the values of x. My expected final result is the tibble c.
But when I try to do it in tidyverse something is (very) wrong.
library(tidyverse)
df <- tibble(x = c(1, 1, 1, 2, 2, 2, 2),
y = c(8, 9, 7, 3, 5, 2, 1))
# This is what I want
a <- combn(df$y[df$x == 1], 2)
a <- rbind(a, rep(1, ncol(a)))
b <- combn(df$y[df$x == 2], 2)
b <- rbind(b, rep(2, ncol(b)))
c <- cbind(a, b)
c <- tibble(c)
c <- t(c)
# but using tidyverse it does not work
df %>% group_by(x) %>% mutate(z = combn(y, 2))
#> Error: Problem with `mutate()` input `z`.
#> x Input `z` can't be recycled to size 3.
#> i Input `z` is `combn(y, 2)`.
#> i Input `z` must be size 3 or 1, not 2.
#> i The error occurred in group 1: x = 1.
Created on 2020-11-18 by the reprex package (v0.3.0)
Try do() with combn, transposing the result so that each combination becomes a row:
out = df %>% group_by(x) %>% do(data.frame(t(combn(.$y, 2))))
# A tibble: 9 x 3
# Groups: x [2]
x X1 X2
<dbl> <dbl> <dbl>
1 1 8 9
2 1 8 7
3 1 9 7
4 2 3 5
5 2 3 2
6 2 3 1
7 2 5 2
8 2 5 1
9 2 2 1
If you have dplyr v1.0.2 or later, you can use group_modify():
df %>% group_by(x) %>% group_modify(~as_tibble(t(combn(.$y, 2L))))
Output
# A tibble: 9 x 3
# Groups: x [2]
x V1 V2
<dbl> <dbl> <dbl>
1 1 8 9
2 1 8 7
3 1 9 7
4 2 3 5
5 2 3 2
6 2 3 1
7 2 5 2
8 2 5 1
9 2 2 1
An option with summarise and unnest
library(dplyr)
library(tidyr)
df %>%
group_by(x) %>%
summarise(y = list(as.data.frame(t(combn(y, 2)))), .groups = 'drop') %>%
unnest(c(y))
# A tibble: 9 x 3
# x V1 V2
# <dbl> <dbl> <dbl>
#1 1 8 9
#2 1 8 7
#3 1 9 7
#4 2 3 5
#5 2 3 2
#6 2 3 1
#7 2 5 2
#8 2 5 1
#9 2 2 1
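The reason mutate() fails while the answers above work can be seen in base R: combn() returns a matrix with one column per combination, so its shape has nothing to do with the number of rows in the group. A short sketch (the names y1, m and pairs are illustrative only):

```r
df <- data.frame(x = c(1, 1, 1, 2, 2, 2, 2),
                 y = c(8, 9, 7, 3, 5, 2, 1))

y1 <- df$y[df$x == 1]
m <- combn(y1, 2)   # 2 rows (the pair), one column per combination
dim(m)

pairs <- t(m)       # transpose: one row per combination
pairs

# Group 2 has 4 values, hence choose(4, 2) combinations
ncol(combn(df$y[df$x == 2], 2))
```

mutate() expects a result with as many rows as the group (or a single value), which is why the per-group result has to go through do(), group_modify() or summarise() with unnest() instead.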

How to use group_by with summarise and summarise_all?

x y
1 1 1
2 3 2
3 2 3
4 3 4
5 2 5
6 4 6
7 5 7
8 2 8
9 1 9
10 1 10
11 3 11
12 4 12
The above is part of the input.
Let's suppose that it also has a bunch of other columns
I want to:
group_by x
summarise y by sum
And for all other columns, I want to summarise_all by just taking the first value
Here's an approach that breaks it into two problems and combines them:
library(dplyr)
left_join(
# Here we want to treat column y specially
df %>%
group_by(x) %>%
summarize(sum_y = sum(y)),
# Here we exclude y and apply a different summary function (first) to all the remaining columns
df %>%
group_by(x) %>%
select(-y) %>%
summarise_all(first)
)
# A tibble: 5 x 3
x sum_y z
<int> <int> <int>
1 1 20 1
2 2 16 3
3 3 17 2
4 4 18 2
5 5 7 3
Sample data:
df <- read.table(
header = T,
stringsAsFactors = F,
text="x y z
1 1 1
3 2 2
2 3 3
3 4 4
2 5 1
4 6 2
5 7 3
2 8 4
1 9 1
1 10 2
3 11 3
4 12 4")
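library(dplyr)
# summarise_each() and funs() are deprecated; summarise_at() with a plain
# function or a named list produces the same columns.
df1 %>%
  group_by(x) %>%
  summarise_at(vars(-y), list(avg = mean)) %>% # mean of every column except y
  bind_cols(., {df1 %>%
      group_by(x) %>%
      summarise_at(vars(y), sum) %>% # sum of y per group
      select(-x)
  })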
#> # A tibble: 5 x 4
#> x r_avg r.1_avg y
#> <int> <dbl> <dbl> <int>
#> 1 1 6.67 6.67 20
#> 2 2 5.33 5.33 16
#> 3 3 5.67 5.67 17
#> 4 4 9 9 18
#> 5 5 7 7 7
Created on 2019-06-20 by the reprex package (v0.3.0)
Data:
df1 <- read.table(text="
r x y
1 1 1
2 3 2
3 2 3
4 3 4
5 2 5
6 4 6
7 5 7
8 2 8
9 1 9
10 1 10
11 3 11
12 4 12", header=T)
df1 <- df1[,c(2,3,1,1)]
library(tidyverse)
df <- tribble(~x, ~y, # making a sample data frame
1, 1,
3, 2,
2, 3,
3, 4,
2, 5,
4, 6,
5, 7,
2, 8,
1, 9,
1, 10,
3, 11,
4, 12)
df <- df %>%
add_column(z = sample(1:nrow(df))) #add another column for the example
df
# If there is only one additional column and you need the first value
df %>%
group_by(x) %>%
summarise(sum_y = sum(y), z_1st = z[1])
# otherwise use summarise_at to address all the other columns
f <- function(x){x[1]} # function to extract the first value
df %>%
group_by(x) %>%
summarise_at(.vars = vars(-c('y')), .funs = f) # exclude column y from the calculations
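With dplyr 1.0.0 or later, both requirements can also be written in a single summarise() call using across(). This is a sketch under that version assumption; the data frame below reuses the x, y, z sample values from the first answer, and out is an illustrative name:

```r
library(dplyr)

df <- data.frame(
  x = c(1, 3, 2, 3, 2, 4, 5, 2, 1, 1, 3, 4),
  y = 1:12,
  z = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
)

out <- df %>%
  group_by(x) %>%
  summarise(
    across(-y, first), # first value of every non-grouping column except y
    sum_y = sum(y)     # y gets its own summary
  )

out
```

The across() call is placed before sum_y on purpose: columns created earlier in the same summarise() are visible to across(), so putting sum_y first would make across(-y, first) pick it up as well.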

r group lag sum

I have some data with groups for which I want to compute a summary (sum or mean) over a fixed number of periods. I'm trying to do this with a group_by followed by mutate and then operating with the variable and its dplyr::lag. Here is an example:
library(tidyverse)
df <- data.frame(group = rep(c("A", "B"), 5),
x = c(1, 3, 4, 7, 9, 10, 17, 29, 30, 55))
df %>%
group_by(group) %>%
mutate(cs = x + lag(x, 1, 0) + lag(x, 2, 0) + lag(x, 3, 0)) %>%
ungroup()
Which yields the desired result:
# A tibble: 10 x 3
group x cs
<fctr> <dbl> <dbl>
1 A 1 1
2 B 3 3
3 A 4 5
4 B 7 10
5 A 9 14
6 B 10 20
7 A 17 31
8 B 29 49
9 A 30 60
10 B 55 101
Is there a shorter way to accomplish this? (Here I calculated four values but I actually need twelve or more).
Perhaps you could use the purrr functions reduce and map included with the tidyverse:
library(tidyverse)
df <- data.frame(group = rep(c("A", "B"), 5),
x = c(1, 3, 4, 7, 9, 10, 17, 29, 30, 55))
df %>%
group_by(group) %>%
mutate(cs = reduce(map(0:3, ~ lag(x, ., 0)), `+`)) %>%
ungroup()
#> # A tibble: 10 x 3
#> group x cs
#> <fctr> <dbl> <dbl>
#> 1 A 1 1
#> 2 B 3 3
#> 3 A 4 5
#> 4 B 7 10
#> 5 A 9 14
#> 6 B 10 20
#> 7 A 17 31
#> 8 B 29 49
#> 9 A 30 60
#> 10 B 55 101
What happens here is easier to follow with a simpler example that doesn't require a group:
v <- 1:5
lagged_v <- map(0:3, ~ lag(v, ., 0))
lagged_v
#> [[1]]
#> [1] 1 2 3 4 5
#>
#> [[2]]
#> [1] 0 1 2 3 4
#>
#> [[3]]
#> [1] 0 0 1 2 3
#>
#> [[4]]
#> [1] 0 0 0 1 2
reduce(lagged_v, `+`)
#> [1] 1 3 6 10 14
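For longer windows (twelve or more periods), a different technique avoids building one lagged vector per period: a rolling sum is the cumulative sum minus the cumulative sum from `window` rows earlier. A sketch of that idea, where window and out are illustrative names:

```r
library(dplyr)

df <- data.frame(group = rep(c("A", "B"), 5),
                 x = c(1, 3, 4, 7, 9, 10, 17, 29, 30, 55))

window <- 4 # number of periods to sum, including the current one

out <- df %>%
  group_by(group) %>%
  # cumsum(x) up to the current row, minus cumsum(x) up to `window` rows back
  mutate(cs = cumsum(x) - lag(cumsum(x), window, default = 0)) %>%
  ungroup()

out
```

Changing window to 12 is then a one-character edit. With non-integer data the subtraction can introduce tiny floating-point differences compared with summing the lags directly.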
