Multi-conditional mutate - r

I have a data frame that requires conditional recoding of a column based on the date listed in certain rows for each subset of IDs. I am trying to figure out how to best achieve this using the mutate function in dplyr. Suggestions and alternate solutions are welcome, but I would like to avoid using for loops.
I know how to write a really verbose and inefficient for loop that would solve this problem, but would like to know how to do it more efficiently.
The sample data frame:
df<-data.frame(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
date = as.Date(c("2016-02-01","2016-02-01","2016-02-01","2016-03-21", "2016-03-21", "2016-03-21", "2016-10-05", "2016-10-05", "2016-10-05", "2016-10-05", "2016-03-01","2016-03-01","2016-03-01","2016-04-21", "2016-04-21", "2016-04-21", "2016-11-05", "2016-11-05", "2016-11-05", "2016-11-05")),
trial = c(NA, NA, NA, 1, 1, 1, NA, NA, NA, NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, NA)
)
My pseudo code - the second logical argument in the first two case_when statements is where I am stuck.
df%>%
group_by(ID)%>%
mutate(results = case_when(
is.na(trial) & date < date where trial = 1 ~ 0,
is.na(trial) & date > date where trial = 1 ~ 2,
trial == trial
))
The expected result being:
data.frame(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
date = as.Date(c("2016-02-01","2016-02-01","2016-02-01","2016-03-21", "2016-03-21", "2016-03-21", "2016-10-05", "2016-10-05", "2016-10-05", "2016-10-05", "2016-03-01","2016-03-01","2016-03-01","2016-04-21", "2016-04-21", "2016-04-21", "2016-11-05", "2016-11-05", "2016-11-05", "2016-11-05")),
trial = c(0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2)
)

One option is to group by 'ID' and recode 'trial' by applying the run-length ID (rleid, from data.table) to the 'trial' column and subtracting 1
library(dplyr)
library(data.table)
df %>%
group_by(ID) %>%
mutate(trial = rleid(trial)-1)
# A tibble: 20 x 3
# Groups: ID [2]
# ID date trial
# <dbl> <date> <dbl>
# 1 1 2016-02-01 0
# 2 1 2016-02-01 0
# 3 1 2016-02-01 0
# 4 1 2016-03-21 1
# 5 1 2016-03-21 1
# 6 1 2016-03-21 1
# 7 1 2016-10-05 2
# 8 1 2016-10-05 2
# 9 1 2016-10-05 2
#10 1 2016-10-05 2
#11 2 2016-03-01 0
#12 2 2016-03-01 0
#13 2 2016-03-01 0
#14 2 2016-04-21 1
#15 2 2016-04-21 1
#16 2 2016-04-21 1
#17 2 2016-11-05 2
#18 2 2016-11-05 2
#19 2 2016-11-05 2
#20 2 2016-11-05 2
Or using rle
df %>%
group_by(ID) %>%
mutate(trial = with(rle(is.na(trial)),
rep(seq_along(values), lengths))-1)
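To see why the rle() version works, here is a small base R sketch on the trial column of a single ID (the vector is copied from the sample data; the names runs, run_id and recoded are only for illustration). It assumes, as in the sample data, that each ID starts with a run of NA and has a single block of 1s.

```r
# trial values for ID 1 from the sample data
trial <- c(NA, NA, NA, 1, 1, 1, NA, NA, NA, NA)

# run-length encoding of the NA / non-NA pattern
runs <- rle(is.na(trial))
runs$values   # TRUE FALSE TRUE
runs$lengths  # 3 3 4

# number the runs 1, 2, 3 and repeat each number over its run
run_id <- rep(seq_along(runs$values), runs$lengths)

# shift so that the first run becomes 0
recoded <- run_id - 1
recoded       # 0 0 0 1 1 1 2 2 2 2
```

If an ID started with the block of 1s, or had more than one such block, the run numbers would no longer line up with 0/1/2, so the date-based case_when approach may be the safer choice for messier data.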

Converting your pseudo code to code, we can use which.max(trial == 1) to get the first occurrence where trial == 1 in each group. This assumes that every ID has at least one entry of 1 in trial.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(trial = case_when(is.na(trial) & date < date[which.max(trial == 1)] ~ 0,
is.na(trial) & date > date[which.max(trial == 1)] ~ 2,
TRUE ~ trial))
# ID date trial
# <dbl> <date> <dbl>
# 1 1 2016-02-01 0
# 2 1 2016-02-01 0
# 3 1 2016-02-01 0
# 4 1 2016-03-21 1
# 5 1 2016-03-21 1
# 6 1 2016-03-21 1
# 7 1 2016-10-05 2
# 8 1 2016-10-05 2
# 9 1 2016-10-05 2
#10 1 2016-10-05 2
#11 2 2016-03-01 0
#12 2 2016-03-01 0
#13 2 2016-03-01 0
#14 2 2016-04-21 1
#15 2 2016-04-21 1
#16 2 2016-04-21 1
#17 2 2016-11-05 2
#18 2 2016-11-05 2
#19 2 2016-11-05 2
#20 2 2016-11-05 2
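The assumption that every ID contains a 1 matters because which.max() does not signal "not found". A minimal base R sketch of the difference, using match() as one possible alternative (the vectors trial_ok and trial_none are made up for illustration):

```r
# Group with a 1: both calls give the first position of 1
trial_ok <- c(NA, NA, 1, 1, NA)
which.max(trial_ok == 1)    # 3
match(1, trial_ok)          # 3

# Group without any 1: which.max() still returns a position
# (the first FALSE), while match() returns NA
trial_none <- c(NA, 2, 2)
which.max(trial_none == 1)  # 2
match(1, trial_none)        # NA

# All-NA group: which.max() returns a zero-length result
length(which.max(c(NA, NA, NA) == 1))  # 0
```

With match(), date[match(1, trial)] is NA for a group that has no 1, so both date comparisons are NA and case_when() falls through to the TRUE ~ trial branch, leaving that group unchanged.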

Related

Reshape from wide to long with multiple columns that have different naming patterns

I have a longitudinal data set in wide format, with > 2500 columns. Almost all columns begin with 'W1_' or 'W2_' to indicate the wave (i.e., time point) of data collection. In the real data, there are > 2 waves. They look like this:
# Populate wide format data frame
person <- c(1, 2, 3, 4)
W1_resp_sex <- c(1, 2, 1, 2)
W2_resp_sex <- c(1, 2, 1, 2)
W1_edu <- c(1, 2, 3, 4)
W2_q_2_1 <- c(0, 1, 1, 0)
wide <- as.data.frame(cbind(person, W1_resp_sex, W2_resp_sex, W1_edu, W2_q_2_1))
wide
#> person W1_resp_sex W2_resp_sex W1_edu W2_q_2_1
#> 1 1 1 1 1 0
#> 2 2 2 2 2 1
#> 3 3 1 1 3 1
#> 4 4 2 2 4 0
I want to reshape from wide to long format so that the data look like this:
# Populate long data frame (this is how we want the wide data above to look after reshaping it)
person <- c(1, 1, 2, 2, 3, 3, 4, 4)
wave <- c(1, 2, 1, 2, 1, 2, 1, 2)
sex <- c(1, 1, 2, 2, 1, 1, 2, 2)
education <- c(1, NA, 2, NA, 3, NA, 4, NA)
q_2_1 <- c(NA, 0, NA, 1, NA, 1, NA, 0)
long_goal <- as.data.frame(cbind(person, wave, sex, education, q_2_1))
long_goal
#> person wave sex education q_2_1
#> 1 1 1 1 1 NA
#> 2 1 2 1 NA 0
#> 3 2 1 2 2 NA
#> 4 2 2 2 NA 1
#> 5 3 1 1 3 NA
#> 6 3 2 1 NA 1
#> 7 4 1 2 4 NA
#> 8 4 2 2 NA 0
To reshape the data, I tried pivot_longer(). How do I fix these issues?
(I prefer not to use data.table.)
The variables have different naming patterns (How can I correctly specify names_pattern() ?)
The multiple columns (see how all values are under the 'sex' column)
Creating a column with 'NA' when a variable was only collected in one wave (ie, if it was only collected in wave 2, I want a column with W1_varname in which all values are NA).
# Re-load wide format data
person <- c(1, 2, 3, 4)
W1_resp_sex <- c(1, 2, 1, 2)
W2_resp_sex <- c(1, 2, 1, 2)
W1_edu <- c(1, 2, 3, 4)
W2_q_2_1 <- c(0, 1, 1, 0)
wide <- as.data.frame(cbind(person, W1_resp_sex, W2_resp_sex, W1_edu, W2_q_2_1))
# Load package
pacman::p_load(tidyr)
# Reshape from wide to long
long <- wide %>%
pivot_longer(
cols = starts_with('W'),
names_to = 'Wave',
names_prefix = 'W',
names_pattern = '(.*)_',
values_to = 'sex',
values_drop_na = TRUE
)
long
#> # A tibble: 16 × 3
#> person Wave sex
#> <dbl> <chr> <dbl>
#> 1 1 1_resp 1
#> 2 1 2_resp 1
#> 3 1 1 1
#> 4 1 2_q_2 0
#> 5 2 1_resp 2
#> 6 2 2_resp 2
#> 7 2 1 2
#> 8 2 2_q_2 1
#> 9 3 1_resp 1
#> 10 3 2_resp 1
#> 11 3 1 3
#> 12 3 2_q_2 1
#> 13 4 1_resp 2
#> 14 4 2_resp 2
#> 15 4 1 4
#> 16 4 2_q_2 0
Created on 2022-09-19 by the reprex package (v2.0.1)
We could reshape to 'long' with pivot_longer, specifying names_pattern to capture substrings of the column names (the parenthesised groups, (...)) in the same order as names_to, i.e. the wave column gets the digits (\\d+) after the 'W', whereas .value (the values of the columns) corresponds to the substring after the first _ in the column names. Then we could rename the resp_sex and edu columns
library(dplyr)
library(tidyr)
pivot_longer(wide, cols = -person, names_to = c("wave", ".value"),
names_pattern = "^W(\\d+)_(.*)$") %>%
rename_with(~ c("sex", "education"), c("resp_sex", "edu"))
-output
# A tibble: 8 × 5
person wave sex education q_2_1
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 1 1 1 NA
2 1 2 1 NA 0
3 2 1 2 2 NA
4 2 2 2 NA 1
5 3 1 1 3 NA
6 3 2 1 NA 1
7 4 1 2 4 NA
8 4 2 2 NA 0
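To check what the two capture groups in names_pattern pick up, the same regular expression can be tried on the column names with base R's sub(). This is only an illustrative sketch; pivot_longer does the matching internally, and the variable names here are made up.

```r
cols <- c("W1_resp_sex", "W2_resp_sex", "W1_edu", "W2_q_2_1")
pattern <- "^W(\\d+)_(.*)$"

# first group: the wave number after "W"
wave <- sub(pattern, "\\1", cols)
wave      # "1" "2" "1" "2"

# second group: everything after the first underscore,
# which becomes the output column name via ".value"
variable <- sub(pattern, "\\2", cols)
variable  # "resp_sex" "resp_sex" "edu" "q_2_1"
```

Because \\d+ cannot match an underscore, the first group stops at the wave number, so names such as q_2_1 keep their internal underscores.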
You want to reshape only the variables that are measured in both waves. You can find them by tabulating the column names with the wave prefix stripped.
v <- grep(names(which(table(substring(names(wide)[-1], 4)) == 2)), names(wide))
reshape2::melt(data=wide, id.vars=1, measure.vars=v)
# person variable value
# 1 1 W1_resp_sex 1
# 2 2 W1_resp_sex 2
# 3 3 W1_resp_sex 1
# 4 4 W1_resp_sex 2
# 5 1 W2_resp_sex 1
# 6 2 W2_resp_sex 2
# 7 3 W2_resp_sex 1
# 8 4 W2_resp_sex 2
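The tabulating step can be looked at on its own. The sketch below is a variation rather than the code above: it strips the prefix with sub() so that multi-digit waves would also work, and selects columns with %in% so that several shared variables can be handled at once (grep only uses the first element of a longer pattern vector). All names are illustrative.

```r
cols <- c("W1_resp_sex", "W2_resp_sex", "W1_edu", "W2_q_2_1")

# drop the wave prefix ("W1_", "W2_", ...)
stem <- sub("^W\\d+_", "", cols)

# count how many waves each variable appears in
counts <- table(stem)

# variables present in both waves
in_both <- names(counts)[counts == 2]
in_both       # "resp_sex"

# the matching wide columns
measure_cols <- cols[stem %in% in_both]
measure_cols  # "W1_resp_sex" "W2_resp_sex"
```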

How to track changes in rows/lines of data frame?

Suppose I have the following dataframe
df:
df=data.frame(ID=c(123100,123200,123300,123400,123500),"2014"=c(1,1,1,2,3),"2015"=c(1,1,1,2,3),"2016"=c(2,1,1,2,1), "2017"=c(2,1,1,2,1), "2018"=c(2,3,1,2,1) )
Now, I want to find out, for which ID the data has changed in which year. So for example, in 2016 ID 123100 has changed from 1 to 2. I would like to add new columns for change (1 = change, 0 = no change), year of change, old value (1,2 or 3) and new value (1,2,3).
In the end it should look like this:
df_final=data.frame(ID=c(123100,123200,123300,123400,123500),"2014"=c(1,1,1,2,3),"2015"=c(1,1,1,2,3),"2016"=c(2,1,1,2,1), "2017"=c(2,1,1,2,1), "2018"=c(2,3,1,2,1), "change"=c(1,1,0,0,1),
"year"=c(2016, 2018, 0, 0, 2016), "before"=c(1,1,0,0,3), "after"=c(2, 3, 0, 0, 1))
I couldn't find any satisfying solution on here, so I hope you can help me.
Here's a base R method.
It may be best to have the IDs with no change as NA. If you really want zeroes, just change c(NA, NA, NA) in the following code to c(0, 0, 0)
Note that in your example data frames, if you run the code as-is, the column names for each year all start with an "X" - you can prevent this by adding the check.names = FALSE argument to the data.frame function.
cbind(df, setNames(as.data.frame(t(apply(df[-1], 1, function(x) {
y <- which(diff(x) != 0)
if(length(y)) c(as.numeric(names(y)), x[y], x[y+1])
else c(NA, NA, NA)
}))), c("Year", "Before", "After")))
#> ID 2014 2015 2016 2017 2018 Year Before After
#> 1 123100 1 1 2 2 2 2016 1 2
#> 2 123200 1 1 1 1 3 2018 1 3
#> 3 123300 1 1 1 1 1 NA NA NA
#> 4 123400 2 2 2 2 2 NA NA NA
#> 5 123500 3 3 1 1 1 2016 3 1
Data used
df <- structure(list(ID = c(123100, 123200, 123300, 123400, 123500),
`2014` = c(1, 1, 1, 2, 3), `2015` = c(1, 1, 1, 2, 3), `2016` = c(2,
1, 1, 2, 1), `2017` = c(2, 1, 1, 2, 1), `2018` = c(2, 3,
1, 2, 1)), class = "data.frame", row.names = c(NA, -5L))
Created on 2022-06-18 by the reprex package (v2.0.1)
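The function inside apply() is easier to follow on a single row. The sketch below walks through the first row of the data, then shows a hypothetical helper, first_change() (not part of the answer above), that keeps only the first change so that rows with several changes still return three values.

```r
# first row of the data as a named vector
x <- c(`2014` = 1, `2015` = 1, `2016` = 2, `2017` = 2, `2018` = 2)

# diff() keeps the names of the later element of each pair,
# so the name of the match is the year in which the value changed
change <- which(diff(x) != 0)
names(change)          # "2016"
unname(x[change])      # 1  (value before the change)
unname(x[change + 1])  # 2  (value after the change)

# hypothetical helper: report only the first change in a row
first_change <- function(x) {
  y <- which(diff(x) != 0)[1]   # NA if nothing changed
  if (is.na(y)) return(c(Year = NA, Before = NA, After = NA))
  c(Year = as.numeric(names(y)),
    Before = unname(x[y]),
    After = unname(x[y + 1]))
}

first_change(c(`2014` = 1, `2015` = 2, `2016` = 3))
# Year 2015, Before 1, After 2
```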
Here is an alternative tidyverse approach:
library(tidyverse)
# join resume df to current df
dplyr::left_join(df,
# make df long to build groupings by ID
tidyr::pivot_longer(df, -ID) %>%
dplyr::group_by(ID) %>%
# order just to be sure
dplyr::arrange(ID, name) %>%
# generate year number, before and after values
dplyr::mutate(year = readr::parse_number(name),
before = lag(value),
# if there is no after value use current value
after = ifelse(is.na(lead(value)), value, lead(value))) %>%
# filter where preceding uneven current
dplyr::filter(before != value) %>%
# unselect obsolete columns
dplyr::select(-name, -value),
by = "ID") %>%
# fill up empty fields with zeros
dplyr::mutate(dplyr::across(year:after, ~ifelse(is.na(.x), 0, .x)))
ID X2014 X2015 X2016 X2017 X2018 year before after
1 123100 1 1 2 2 2 2016 1 2
2 123200 1 1 1 1 3 2018 1 3
3 123300 1 1 1 1 1 0 0 0
4 123400 2 2 2 2 2 0 0 0
5 123500 3 3 1 1 1 2016 3 1
Another possibility to solve the task within the tidyverse is to work row-wise:
dplyr::rowwise(df) %>%
dplyr::mutate(year = readr::parse_number(names(.)[stringr::str_detect(names(.), pattern = "^X")][c(0, diff(dplyr::c_across(dplyr::starts_with("x")))) != 0][1]),
before = dplyr::c_across(dplyr::starts_with("x"))[dplyr::lead(c(0, diff(dplyr::c_across(dplyr::starts_with("x")))) != 0, default = FALSE)][1],
after = dplyr::coalesce(dplyr::c_across(dplyr::starts_with("x"))[dplyr::lag(c(0, diff(dplyr::c_across(dplyr::starts_with("x")))) != 0, default = FALSE)][1],
dplyr::c_across(dplyr::starts_with("x"))[c(0, diff(dplyr::c_across(dplyr::starts_with("x")))) != 0][1])) %>%
dplyr::ungroup() %>%
dplyr::mutate(dplyr::across(year:after, ~ ifelse(is.na(.x), 0, .x)))
ID X2014 X2015 X2016 X2017 X2018 year before after
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 123100 1 1 2 2 2 2016 1 2
2 123200 1 1 1 1 3 2018 1 3
3 123300 1 1 1 1 1 0 0 0
4 123400 2 2 2 2 2 0 0 0
5 123500 3 3 1 1 1 2016 3 1
matrixStats::rowDiffs might be helpful and faster here.
z <- apply(matrixStats::rowDiffs(as.matrix(df[-1])) != 0, 1, which.max) + 1; d <- dim(df)
m <- matrix(t(df[-1])[c(z + 0:(d[2] - 2)*d[1] - 1, z + 0:(d[2] - 2)*d[1])],,2, di=list(c(), c('before', 'after')))
cbind(df, `[<-`(cbind(change=1, year=substring(names(df[-1])[z], 2), m), z == 2, 1:4, 0))
# ID X2014 X2015 X2016 X2017 X2018 change year before after
# 1 123100 1 1 2 2 2 1 2016 1 2
# 2 123200 1 1 1 1 3 1 2018 1 3
# 3 123300 1 1 1 1 1 0 0 0 0
# 4 123400 2 2 2 2 2 0 0 0 0
# 5 123500 3 3 1 1 1 1 2016 3 1
Data:
df <- structure(list(ID = c(123100, 123200, 123300, 123400, 123500),
X2014 = c(1, 1, 1, 2, 3), X2015 = c(1, 1, 1, 2, 3), X2016 = c(2,
1, 1, 2, 1), X2017 = c(2, 1, 1, 2, 1), X2018 = c(2, 3, 1,
2, 1)), class = "data.frame", row.names = c(NA, -5L))

Can't add rows to grouped data frames

This is a follow-up to the question "How to add a row to a dataframe modifying only some columns".
After solving this question I wanted to apply the solution provided by stefan to a larger dataframe with group_by:
My dataframe:
df <- structure(list(test_id = c(1, 1, 1, 1, 1, 1, 1, 1), test_nr = c(1,
1, 1, 1, 2, 2, 2, 2), region = c("A", "B", "C", "D", "A", "B",
"C", "D"), test_value = c(3, 1, 1, 2, 4, 2, 4, 1)), class = "data.frame", row.names = c(NA,
-8L))
test_id test_nr region test_value
1 1 1 A 3
2 1 1 B 1
3 1 1 C 1
4 1 1 D 2
5 1 2 A 4
6 1 2 B 2
7 1 2 C 4
8 1 2 D 1
I now want to add a new row to each group with this code, which gives an error:
df %>%
group_by(test_nr) %>%
add_row(test_id = .$test_id[1], test_nr = .$test_nr[1], region = "mean", test_value = mean(.$test_value))
Error: Can't add rows to grouped data frames.
Run `rlang::last_error()` to see where the error occurred.
My expected output would be:
test_id test_nr region test_value
1 1 1 A 3.00
2 1 1 B 1.00
3 1 1 C 1.00
4 1 1 D 2.00
5 1 1 MEAN 1.75
6 1 2 A 4.00
7 1 2 B 2.00
8 1 2 C 4.00
9 1 2 D 1.00
10 1 2 MEAN 2.75
I have tried so far:
library(tidyverse)
df %>%
group_by(test_nr) %>%
group_split() %>%
map_dfr(~ .x %>%
add_row(!!! map(.[4], mean)))
test_id test_nr region test_value
<dbl> <dbl> <chr> <dbl>
1 1 1 A 3
2 1 1 B 1
3 1 1 C 1
4 1 1 D 2
5 NA NA NA 1.75
6 1 2 A 4
7 1 2 B 2
8 1 2 C 4
9 1 2 D 1
10 NA NA NA 2.75
How could I modify column 1 to 3 to place my values there?
I actually recently made a little helper function for exactly this. The idea
is to use group_modify() to take the group data, and
bind_rows() the summary statistics calculated with summarise().
This is what it looks like in code:
add_summary_rows <- function(.data, ...) {
group_modify(.data, function(x, y) bind_rows(x, summarise(x, ...)))
}
And here’s how that would work with your data:
library(dplyr, warn.conflicts = FALSE)
df <- data.frame(
test_id = c(1, 1, 1, 1, 1, 1, 1, 1),
test_nr = c(1, 1, 1, 1, 2, 2, 2, 2),
region = c("A", "B", "C", "D", "A", "B", "C", "D"),
test_value = c(3, 1, 1, 2, 4, 2, 4, 1)
)
df %>%
group_by(test_id, test_nr) %>%
add_summary_rows(
region = "MEAN",
test_value = mean(test_value)
)
#> # A tibble: 10 x 4
#> # Groups: test_id, test_nr [2]
#> test_id test_nr region test_value
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 1 A 3
#> 2 1 1 B 1
#> 3 1 1 C 1
#> 4 1 1 D 2
#> 5 1 1 MEAN 1.75
#> 6 1 2 A 4
#> 7 1 2 B 2
#> 8 1 2 C 4
#> 9 1 2 D 1
#> 10 1 2 MEAN 2.75
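For comparison, the same split-apply-combine idea can be sketched in base R without dplyr: split the data by group, append one summary row to each piece, and bind the pieces back together. The helper name add_mean_row is made up for this sketch.

```r
df <- data.frame(
  test_id = c(1, 1, 1, 1, 1, 1, 1, 1),
  test_nr = c(1, 1, 1, 1, 2, 2, 2, 2),
  region = c("A", "B", "C", "D", "A", "B", "C", "D"),
  test_value = c(3, 1, 1, 2, 4, 2, 4, 1),
  stringsAsFactors = FALSE
)

# append a "MEAN" row to the rows of one group
add_mean_row <- function(g) {
  rbind(g, data.frame(
    test_id = g$test_id[1],
    test_nr = g$test_nr[1],
    region = "MEAN",
    test_value = mean(g$test_value),
    stringsAsFactors = FALSE
  ))
}

# split by group, add the row to each piece, recombine
out <- do.call(rbind, lapply(split(df, df$test_nr), add_mean_row))
rownames(out) <- NULL
out
```

Here the grouping columns have to be filled in by hand inside the helper, whereas group_modify() puts them back automatically.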
You can combine your two approaches:
df %>%
split(~test_nr) %>%
map_dfr(~ .x %>%
add_row(test_id = .$test_id[1],
test_nr = .$test_nr[1],
region = "MEAN",
test_value = mean(.$test_value)))
You could achieve your target with this Base R one-liner:
merge( df, aggregate( df, by = list( df$test_nr ), FUN = mean ), all = TRUE )[ , 1:4 ]
aggregate provides you with the lines you need, and merge inserts them into the right places of your dataframe. You don't need the last column of the combined dataframe, so use only the first four columns. mean() warns about the non-numeric region column and returns NA there; the warnings can be disregarded, but it means the summary rows show NA rather than a label such as MEAN in the region column.
Making it a little more generic:
f <- "mean"
df1 <- merge( df, aggregate( df, by = list( df$test_id, df$test_nr ),
FUN = f ), all = TRUE )[ , 1:4 ]
df1$region[ is.na( df1$region ) ] <- toupper( f )
Here, you aggregate also by test_id, you can change the function you are using in one place, and you have it printed in the region column:
> df1
test_id test_nr region test_value
1 1 1 A 3.00
2 1 1 B 1.00
3 1 1 C 1.00
4 1 1 D 2.00
5 1 1 MEAN 1.75
6 1 2 A 4.00
7 1 2 B 2.00
8 1 2 C 4.00
9 1 2 D 1.00
10 1 2 MEAN 2.75

Calculate sum of n previous rows

I have a quite big dataframe and I'm trying to add a new variable which is the sum of the three previous rows on a running basis, also it should be grouped by ID. The first three rows per ID should be 0. Here's what it should look like.
ID Var1 VarNew
1 2 0
1 2 0
1 3 0
1 0 7
1 4 5
1 1 7
Here's an example dataframe
ID <- c(1, 1, 1, 1, 1, 1)
Var1 <- c(2, 2, 3, 0, 4, 1)
df <- data.frame(ID, Var1)
You can use any package that has a rolling-calculation function: compute the sum with a window size of 3 and lag the result. For example with zoo::rollsumr.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(VarNew = lag(zoo::rollsumr(Var1, 3, fill = 0), default = 0)) %>%
ungroup
# ID Var1 VarNew
# <dbl> <dbl> <dbl>
#1 1 2 0
#2 1 2 0
#3 1 3 0
#4 1 0 7
#5 1 4 5
#6 1 1 7
You can use stats::filter in ave (written with the stats:: prefix so that it is not masked by dplyr::filter).
df$VarNew <- ave(df$Var1, df$ID, FUN=function(x) c(0, 0, 0,
stats::filter(head(x, -1), c(1,1,1), sides=1)[-1:-2]))
df
# ID Var1 VarNew
#1 1 2 0
#2 1 2 0
#3 1 3 0
#4 1 0 7
#5 1 4 5
#6 1 1 7
or using cumsum in combination with head and tail.
# y[i] is the sum of all rows before row i within the ID
df$VarNew <- ave(df$Var1, df$ID, FUN=function(x) {y <- c(0, cumsum(head(x, -1)))
c(0, 0, 0, tail(y, -3) - head(y, -3))})
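The cumulative-sum idea can also be written as a small reusable helper. This is a sketch under the assumptions of the question (the first n rows per ID are set to 0); the function name prev_sum and the second example ID are made up for illustration.

```r
# sum of the n rows before each row; 0 for the first n rows
prev_sum <- function(x, n = 3) {
  cs <- c(0, cumsum(x))               # cs[i] is the sum of x[1..(i-1)]
  i <- seq_along(x)
  out <- cs[i] - cs[pmax(i - n, 1)]   # sum of x[(i-n)..(i-1)]
  out[i <= n] <- 0                    # not enough history yet
  out
}

df <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
                 Var1 = c(2, 2, 3, 0, 4, 1, 5, 5, 5, 5))

# apply the helper separately within each ID
df$VarNew <- ave(df$Var1, df$ID, FUN = prev_sum)
df$VarNew  # 0 0 0 7 5 7 0 0 0 15
```

Because the helper only looks at the vector it is given, the window restarts for each ID and never runs across two IDs.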
The runner package also helps
library(dplyr)
library(runner)
df %>% mutate(var_new = sum_run(Var1, k = 3, na_pad = TRUE, lag = 1))
ID Var1 var_new
1 1 2 NA
2 1 2 NA
3 1 3 NA
4 1 0 7
5 1 4 5
6 1 1 7
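The NAs can easily be replaced with 0 if desired. With more than one ID, add group_by(ID) before mutate() so that the window does not run across IDs.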

create a cumulative count, until 2 of previous 6 rows meet a condition

If column a is equal to 1, I would like to start a cumulative sum. I would like to stop when 2 of the previous 6 rows are equal to 0.
dplyr::tibble(a = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1),
sum = c(1, 2, 3, 4, 5, 6, 7, 8, 8, 9, 0, 1, 2, 3))
sum is my desired output
Ideally using tidyverse
One approach is to flag the rows where at least two 0s occur in the window made up of the current row and the six rows before it, then use cumsum to turn changes of that flag into groups, and finally take the cumulative sum of a within each group.
library(dplyr)
library(purrr)
# the input column from the question
df <- tibble(a = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1))
df %>%
mutate(sum1 = map_dbl(seq_along(a), ~sum(a[. : max(.-6, 1)] == 0) >= 2)) %>%
group_by(group = cumsum(sum1 != lag(sum1, default = first(sum1)))) %>%
mutate(ans = cumsum(a)) %>%
ungroup %>%
select(-sum1, -group)
# A tibble: 14 x 2
# a ans
# <dbl> <dbl>
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 4
# 5 1 5
# 6 1 6
# 7 1 7
# 8 1 8
# 9 0 8
#10 1 9
#11 0 0
#12 1 1
#13 1 2
#14 1 3
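The same three steps (count zeros in a window, turn flag changes into groups, cumulate within groups) can be sketched in base R, which makes the intermediate vectors easy to inspect. The variable names are illustrative, and the window follows the code above: the current row plus the six rows before it.

```r
a <- c(1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1)

# number of zeros among the current row and the six rows before it
zeros_in_window <- vapply(
  seq_along(a),
  function(i) sum(a[max(i - 6, 1):i] == 0),
  numeric(1)
)

# TRUE once at least two zeros fall inside the window
flag <- zeros_in_window >= 2

# start a new group every time the flag switches
group <- cumsum(c(FALSE, diff(flag) != 0))

# cumulative sum of a, restarted in each group
ans <- ave(a, group, FUN = cumsum)
ans  # 1 2 3 4 5 6 7 8 8 9 0 1 2 3
```

If "previous 6 rows" should exclude the current row, the window index would need to change to the rows i - 6 to i - 1, which gives a different result on this input.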
