dplyr - mutate with variable column names - r

I have a tibble containing time series of various blood parameters like CRP over the course of several days. The tibble is tidy, with each time series in one column, as well as a column for the day of measurement. The tibble contains another column with a day of infection. I want to replace each blood parameter with NA if the Day variable is greater than or equal to InfectionDay. Since I have a lot of variables, I'd like to have a function that accepts the column name dynamically and creates a new column name by appending "_censored" to the old one. I've tried the following:
library(dplyr)
censor.infection <- function(df, colname){
newcolname <- paste0(colname, "_censored")
return(df %>% mutate(!!newcolname := ifelse( Day < InfectionDay, !!colname, NA)))
}
data = tibble(Day=1:5, InfectionDay=3, CRP=c(3,2,5,4,1))
data = censor.infection(data, "CRP")
Running this, I expected
# A tibble: 5 x 4
Day InfectionDay CRP CRP_censored
<int> <dbl> <dbl> <chr>
1 1 3 3 3
2 2 3 2 2
3 3 3 5 NA
4 4 3 4 NA
5 5 3 1 NA
but I get
# A tibble: 5 x 4
Day InfectionDay CRP CRP_censored
<int> <dbl> <dbl> <chr>
1 1 3 3 CRP
2 2 3 2 CRP
3 3 3 5 NA
4 4 3 4 NA
5 5 3 1 NA

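In your version, !!colname unquotes the character string itself, so ifelse() returns the literal "CRP" instead of the column's values. You can wrap the column name in sym() inside mutate to convert it to a symbol before it is unquoted and evaluated: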
censor.infection <- function(df, colname){
newcolname <- paste0(colname, "_censored")
return(df %>% mutate(!!newcolname := ifelse( Day < InfectionDay, !! sym(colname), NA)))
}
data = tibble(Day=1:5, InfectionDay=3, CRP=c(3,2,5,4,1))
data = censor.infection(data, "CRP")
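A variant of the same idea, as a minimal sketch: assuming a reasonably recent dplyr/rlang, the .data pronoun can look the column up by its string name, and a glue-style string on the left of := can build the new name. The function name censor_infection is only illustrative, and replace() is swapped in for ifelse() here.

```r
library(dplyr)

# Illustrative helper (name is hypothetical): censor one column by its string name.
censor_infection <- function(df, colname) {
  df %>%
    mutate(
      # "{colname}_censored" is interpolated into the new column name;
      # .data[[colname]] fetches the existing column from the data frame.
      "{colname}_censored" := replace(.data[[colname]], Day >= InfectionDay, NA)
    )
}

data <- tibble(Day = 1:5, InfectionDay = 3, CRP = c(3, 2, 5, 4, 1))
result <- censor_infection(data, "CRP")
result
```

Because replace() keeps the type of the original column, CRP_censored should stay numeric here.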

We can select columns on which we want to apply the function (cols) and use mutate_at which will also automatically rename the columns. Added an extra column in the data to show renaming.
library(dplyr)
cols <- c("CRP", "CRP1")
data %>%
mutate_at(cols, list(censored = ~replace(., Day >= InfectionDay, NA)))
# A tibble: 5 x 6
# Day InfectionDay CRP CRP1 CRP_censored CRP1_censored
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 3 3 3 3 3
#2 2 3 2 2 2 2
#3 3 3 5 5 NA NA
#4 4 3 4 4 NA NA
#5 5 3 1 1 NA NA
data
data <- tibble(Day=1:5, InfectionDay=3, CRP=c(3,2,5,4,1), CRP1 = c(3,2,5,4,1))
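For what it's worth, in dplyr 1.0.0 and later the scoped verbs such as mutate_at are superseded by across(). A sketch of the same censoring with across(), assuming the same data and cols as above; the .names pattern is what produces the "_censored" suffix:

```r
library(dplyr)

data <- tibble(Day = 1:5, InfectionDay = 3,
               CRP = c(3, 2, 5, 4, 1), CRP1 = c(3, 2, 5, 4, 1))
cols <- c("CRP", "CRP1")

result <- data %>%
  mutate(across(
    all_of(cols),                                 # columns given as strings
    ~ replace(.x, Day >= InfectionDay, NA),       # censor from InfectionDay onwards
    .names = "{.col}_censored"                    # keep originals, add suffixed copies
  ))
result
```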

Related

how to compute row means iff the number of NA's is smaller than a given value

I have questionnaire data (rows = individuals, cols = scores on questions) and would like to compute a sum score for each individual if they answered a given number of questions; otherwise the sum score variable should be NA. The code below computes row sums, counts the number of NAs, assigns an otherwise non-occurring value to the row sum variable when the number of NAs is large, and then replaces that value with NA. The code works, but I bet there is a more elegant way. Suggestions much appreciated.
dum<-tibble(x=c(1,NA,2,3,4),y=c(1,2,3,NA,5),z=c(1,NA,2,3,4))
dum<-dum %>%
mutate(sumsum = rowSums(select(., x:z), na.rm = TRUE))
dum<-dum %>%
mutate(countna=rowSums(is.na(select(.,x:z))))
dum<-dum %>%
mutate(sumsum=case_when(countna>=2 ~ 100,TRUE~sumsum))
dum<-dum %>%
mutate(sumsum = na_if(sumsum, 100))
You may combine your code in one statement -
library(dplyr)
dum <- tibble(x=c(1,NA,2,3,4),y=c(1,2,3,NA,5),z=c(1,NA,2,3,4))
dum <- dum %>%
mutate(sumsum = replace(rowSums(select(., x:z), na.rm = TRUE),
rowSums(is.na(select(., x:z))) >= 2, NA))
dum
# A tibble: 5 × 4
# x y z sumsum
# <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 3
#2 NA 2 NA NA
#3 2 3 2 7
#4 3 NA 3 6
#5 4 5 4 13
You can also try this:
dum<-tibble(x=c(1,NA,2,3,4),y=c(1,2,3,NA,5),z=c(1,NA,2,3,4))
dum2 <- dum %>% mutate(sumsum = ifelse(rowSums(is.na(select(.,x:z)))>=2, NA,rowSums(select(., x:z), na.rm = TRUE)))
dum2
# A tibble: 5 × 4
x y z sumsum
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 3
2 NA 2 NA NA
3 2 3 2 7
4 3 NA 3 6
5 4 5 4 13
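If the same rule is needed for several questionnaires, the logic could be wrapped in a small helper. This is only a sketch: the function name sum_if_answered and its max_na argument are made up for illustration, and it uses base R indexing instead of select().

```r
library(tibble)

# Hypothetical helper: row sums over `cols`, set to NA when a row has
# more than `max_na` missing values among those columns.
sum_if_answered <- function(df, cols, max_na) {
  vals <- as.matrix(df[cols])
  n_missing <- rowSums(is.na(vals))
  sums <- rowSums(vals, na.rm = TRUE)
  sums[n_missing > max_na] <- NA
  unname(sums)
}

dum <- tibble(x = c(1, NA, 2, 3, 4), y = c(1, 2, 3, NA, 5), z = c(1, NA, 2, 3, 4))
dum$sumsum <- sum_if_answered(dum, c("x", "y", "z"), max_na = 1)
dum
```

With max_na = 1 this corresponds to the ">= 2 missing values" condition used above.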

how to set names in a dynamically long list

Given the following data:
test = data.frame(x = c(NA,1,1,2,3,4),
y = c(NA,1,2,3,4,4))
I want to perform some calculations and store these as new columns. The calculations, however, might result in a variable amount of columns. E.g. suppose I want store for each row the column index of the column(s) that contain the minimum per row. E.g. in row 1, both columns contain the minimum, hence I need to create two columns.
Using the tidyverse approach, I know I can use the set_names argument when passing my function as a list. But this doesn't work when I don't know the number of columns my calculation will create. See also here: https://community.rstudio.com/t/how-to-handle-lack-of-names-with-unnest-wider/40496
My approach for the calculations:
library(tidyverse)
test %>%
rowwise() %>%
mutate(dist = min(c_across(everything())),
code = list(which(c_across(cols = c(everything(), -dist)) == dist))) %>%
ungroup() %>%
unnest_wider(code)
which automatically names the unnested columns with "...1" and "...2":
# A tibble: 6 x 5
x y dist ...1 ...2
<dbl> <dbl> <dbl> <int> <int>
1 NA NA NA NA NA
2 1 1 1 1 2
3 1 2 1 1 NA
4 2 3 2 1 NA
5 3 4 3 1 NA
6 4 4 4 1 2
But that's not what I want. I also tried to use the names_repair argument within unnest_wider, i.e. unnest_wider(code, names_repair = ~paste0("code", .x)), but this renames all columns.
Any ideas (preferably in the tidyverse approach)? Expected outcome:
# A tibble: 6 x 5
x y dist code_1 code_2
<dbl> <dbl> <dbl> <int> <int>
1 NA NA NA NA NA
2 1 1 1 1 2
3 1 2 1 1 NA
4 2 3 2 1 NA
5 3 4 3 1 NA
6 4 4 4 1 2
EDITED to add an example where one row contains only missings.
Edit 2: this is my current solution. But it is really ugly and requires stopping halfway through. The problem is that the rename_with function doesn't recognize the on-the-fly generated "length_code" column when I put everything into one pipe.
test2 <- test %>%
rowwise() %>%
mutate(dist = min(c_across(everything())),
code = list(which(c_across(cols = c(everything(), -dist)) == dist)),
length_code = length(code)) %>%
ungroup() %>%
unnest_wider(code)
test3 <- test2 %>%
rename_with(.cols = starts_with("..."), .fn = ~paste0("code_", 1:max(test2$length_code)))
which gives:
# A tibble: 6 x 6
x y dist code_1 code_2 length_code
<dbl> <dbl> <dbl> <int> <int> <int>
1 NA NA NA NA NA 0
2 1 1 1 1 2 2
3 1 2 1 1 NA 1
4 2 3 2 1 NA 1
5 3 4 3 1 NA 1
6 4 4 4 1 2 2
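One option that may avoid the manual renaming is the names_sep argument of unnest_wider, which pastes the list-column name and the element position together. The sketch below uses a small hand-built tibble (not the pipeline above) just to show the naming; how an empty element, such as the one from the all-NA row, is handled may depend on the tidyr version, so that case is left out here.

```r
library(tibble)
library(tidyr)

# Hand-built stand-in for the result of the rowwise step above:
# `code` is an unnamed list-column with elements of varying length.
df <- tibble(
  dist = c(1, 1, 4),
  code = list(1:2, 1L, 1:2)
)

# names_sep = "_" combines the outer name ("code") with the position (1, 2, ...)
out <- unnest_wider(df, code, names_sep = "_")
out
```

Only the unnested columns get the code_ prefix; the other columns keep their names.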

Insert missing rows in time series data

I have an incomplete time series dataframe and I need to insert rows of NAs for missing time stamps. There should always be 6 time stamps per day, which is indicated by the variable "Signal" (1-6) in the dataframe. I am trying to merge the incomplete dataframe A with a vector B containing all Signals. Simplified example data below:
B <- rep(1:6,2)
A <- data.frame(Signal = c(1,2,3,5,1,2,4,5,6), var1 = c(1,1,1,1,1,1,1,1,1))
Expected <- data.frame(Signal = c(1,2,3,NA, 5, NA, 1,2,NA,4,5,6), var1 = c(1,1,1,NA,1,NA,1,1,NA,1,1,1))
Note that B represents a dataframe with multiple variables and the NAs in Expected are rows of NAs in the dataframe. Also the actual dataframe has more observations (84 in total).
Would be awesome if you guys could help me out!
If you already know there are 6 timestamps in a day you can do this without B. We can create groups for each day and use complete to add the missing observations with NA.
library(dplyr)
library(tidyr)
A %>%
group_by(gr = cumsum(c(TRUE, diff(Signal) < 0))) %>%
complete(Signal = 1:6) %>%
ungroup() %>%
select(-gr)
# Signal var1
# <dbl> <dbl>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 NA
# 5 5 1
# 6 6 NA
# 7 1 1
# 8 2 1
# 9 3 NA
#10 4 1
#11 5 1
#12 6 1
If in the output you need Signal as NA for missing combination you can use
A %>%
group_by(gr = cumsum(c(TRUE, diff(Signal) < 0))) %>%
complete(Signal = 1:6) %>%
mutate(Signal = replace(Signal, is.na(var1), NA)) %>%
ungroup() %>%
select(-gr)
# Signal var1
# <dbl> <dbl>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 NA NA
# 5 5 1
# 6 NA NA
# 7 1 1
# 8 2 1
# 9 NA NA
#10 4 1
#11 5 1
#12 6 1
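To see what the grouping expression in both snippets does, it can be run on its own. A new group starts whenever Signal drops compared with the previous row, which is taken as the start of a new day:

```r
Signal <- c(1, 2, 3, 5, 1, 2, 4, 5, 6)

# diff(Signal) < 0 is TRUE where the signal number goes down (a new day begins);
# the leading TRUE starts the first group, and cumsum turns the flags into ids.
new_day <- c(TRUE, diff(Signal) < 0)
gr <- cumsum(new_day)
gr
# 1 1 1 1 2 2 2 2 2
```

One assumption to keep in mind: this only detects a day boundary when the first recorded signal of a day is smaller than the last recorded signal of the previous day. If, say, one day ends at signal 3 and the next starts at signal 4, the two days would be merged into one group.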

Create multiple new dataframes based on rows in another dataframe with a for loop in r

I have a dataframe that looks like this:
df <- data.frame(ID = c(1,2,3,4,5,6), Type = c("A","A","B","B","C","C"), `2019` = c(1,2,3,4,5,6),`2020` = c(2,3,4,5,6,7), `2021` = c(3,4,5,6,7,8))
ID Type X2019 X2020 X2021
1 1 A 1 2 3
2 2 A 2 3 4
3 3 B 3 4 5
4 4 B 4 5 6
5 5 C 5 6 7
6 6 C 6 7 8
Now, I'm looking for some code that does the following:
1. Create a new data.frame for every row in df
2. Names the new dataframe with a combination of "ID" and "Type" (A_1, A_2, ... , C_6)
The resulting new dataframes should look like this (example for A_1, A_2 and C_6):
Year Values
1 2019 1
2 2020 2
3 2021 3
Year Values
1 2019 2
2 2020 3
3 2021 4
Year Values
1 2019 6
2 2020 7
3 2021 8
I have some things that somehow complicate the code:
1. The code should work in the next few years without any changes, meaning next year the data.frame df will no longer contain the years 2019-2021, but rather 2020-2022.
2. As the data.frame df is only a minimal reproducible example, I need some kind of loop. In the "real" data, I have a lot more rows and therefore a lot more dataframes to be created.
Unfortunately, I can't give you any code, as I have absolutely no idea how I could manage that.
While researching, I found the following code that may help address the first problem with the changing years:
year <- as.numeric(format(Sys.Date(), "%Y"))
Further, I read about list, and that it may help to work with a list in a for loop and then transform the list back into a dataframe. Sorry for my limited approach, I hope anyone can give me a hint or even the solution to my problem. If you need any further information, please let me know. Thanks in advance!
A kind of similar question to mine:
Populating a data frame in R in a loop
Try this:
library(stringr)
library(dplyr)
library(tidyr)
library(magrittr)
df %>%
gather(Year, Values, 3:5) %>%
mutate(Year = str_sub(Year, 2)) %>%
select(ID, Year, Values) %>%
group_split(ID) # split(.$ID)
# [[1]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 1 2019 1
# 2 1 2020 2
# 3 1 2021 3
#
# [[2]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 2 2019 2
# 2 2 2020 3
# 3 2 2021 4
#
# [[3]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 3 2019 3
# 2 3 2020 4
# 3 3 2021 5
#
# [[4]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 4 2019 4
# 2 4 2020 5
# 3 4 2021 6
#
# [[5]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 5 2019 5
# 2 5 2020 6
# 3 5 2021 7
#
# [[6]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 6 2019 6
# 2 6 2020 7
# 3 6 2021 8
Data
df <- data.frame(ID = c(1,2,3,4,5,6), Type = c("A","A","B","B","C","C"), `2019` = c(1,2,3,4,5,6),`2020` = c(2,3,4,5,6,7), `2021` = c(3,4,5,6,7,8))
library(magrittr)
library(tidyr)
library(dplyr)
library(stringr)
names(df) <- str_replace_all(names(df), "X", "") #remove X's from year names
df %>%
gather(Year, Values, 3:5) %>%
select(ID, Year, Values) %>%
group_split(ID)
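Neither snippet names the list elements with the Type and ID (A_1, A_2, ...). A sketch of one way to get that, assuming the df from the question: pivot_longer (which supersedes gather) selects every column except ID and Type, so it should keep working when the year columns change, and base split() names the pieces by a combined key. The column name `name` is just a helper for this example.

```r
library(dplyr)
library(tidyr)

df <- data.frame(ID = c(1,2,3,4,5,6), Type = c("A","A","B","B","C","C"),
                 `2019` = c(1,2,3,4,5,6), `2020` = c(2,3,4,5,6,7),
                 `2021` = c(3,4,5,6,7,8))

long <- df %>%
  pivot_longer(cols = -c(ID, Type),        # all year columns, whatever they are
               names_to = "Year",
               values_to = "Values",
               names_prefix = "X") %>%     # drop the X that data.frame() added
  mutate(name = paste(Type, ID, sep = "_"))

# A named list of data frames, one per original row
frames <- split(long[c("Year", "Values")], long$name)
names(frames)
frames$A_1
```

Keeping the pieces in a named list is usually easier to work with than creating separate objects; if separate objects are really needed, list2env(frames, envir = globalenv()) would create them.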

R add rows to grouped df using dplyr

I have a grouped df and I would like to add additional rows to the top of the groups that match with a variable (item_code) from the df.
The additional rows do not have an id column. The additional rows should not be duplicated within the groups of df.
Example data:
df <- as_tibble(data.frame(id=rep(1:3,each=2),
item_code=c("A","A","B","B","B","Z"),
score=rep(1,6)))
additional_rows <- as_tibble(data.frame(item_code=c("A","Z"),
score=c(6,6)))
What I tried
I found this post and tried to apply it:
Add row in each group using dplyr and add_row()
df %>% group_by(id) %>% do(add_row(additional_rows %>%
filter(item_code %in% .$item_code)))
What I get:
# A tibble: 9 x 3
# Groups: id [3]
id item_code score
<int> <fct> <dbl>
1 1 A 6
2 1 Z 6
3 1 NA NA
4 2 A 6
5 2 Z 6
6 2 NA NA
7 3 A 6
8 3 Z 6
9 3 NA NA
What I am looking for:
# A tibble: 8 x 3
id item_code score
<int> <fct> <dbl>
1 1 A 6
2 1 A 1
3 1 A 1
4 2 B 1
5 2 B 1
6 3 B 1
7 3 Z 6
8 3 Z 1
This should do the trick:
library(plyr)
library(magrittr) # provides %>%
df %>%
join(subset(df, item_code %in% additional_rows$item_code, select = c(id, item_code)) %>%
join(additional_rows) %>%
subset(!duplicated(.)), type = "full") %>%
arrange(id, item_code, -score)
Not sure if it's the best way, but it works.
Edit: to get the score in the same order added the other arrange terms
Edit 2: alright, there should now be no duplicated rows added from the additional rows as per your comment
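If you would rather stay within dplyr, here is a sketch of the same idea, assuming the example data from the question (built with tibble() here, so item_code is character rather than factor): find the distinct id/item_code pairs, attach the matching additional rows, and stack them on top of the original rows.

```r
library(dplyr)

df <- tibble(id = rep(1:3, each = 2),
             item_code = c("A", "A", "B", "B", "B", "Z"),
             score = rep(1, 6))
additional_rows <- tibble(item_code = c("A", "Z"), score = c(6, 6))

# One extra row per (id, item_code) pair that has a match in additional_rows
extra <- df %>%
  distinct(id, item_code) %>%
  inner_join(additional_rows, by = "item_code")

# Stack and sort so the extra row (higher score) comes first within each item
result <- bind_rows(extra, df) %>%
  arrange(id, item_code, desc(score))
result
```

Note that the sort puts the additional row first only because its score is higher than the existing ones in this example, and it also reorders item codes alphabetically within each id.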
