dplyr Modify multiple columns based on a single column in a dataframe - r

I have a large dataframe with responses to a questionnaire. My minimal working example (below) has the responses to 3 questions, as well as the delay in responding to the questionnaire from which the answers are drawn.
df <- data.frame(ID = LETTERS[1:10],
Q1 = sample(0:10, 10, replace=T),
Q2 = sample(0:10, 10, replace=T),
Q3 = sample(0:10, 10, replace=T),
Delay = 1:10
)
I'd like to change the responses with a delay > 3 to NA's. I can accomplish this easily enough for a single question:
df %>%
mutate(Q1 = ifelse(Delay >3, NA, Q1))
which gives me
ID Q1 Q2 Q3 Delay
1 A 5 6 9 1
2 B 8 1 5 2
3 C 8 4 6 3
4 D NA 7 1 4
5 E NA 8 10 5
6 F NA 9 4 6
7 G NA 1 6 7
8 H NA 8 9 8
9 I NA 9 1 9
10 J NA 5 7 10
I'd like instead to do this for all three questions with one statement (in my real life problem, I have over 20 questions, so it's tedious to do each question separately). I therefore create a vector of questions:
q_vec <- c("Q1", "Q2", "Q3")
and then tried variants of my earlier code such as
df %>%
mutate(all_of(q_vec) = ifelse(Delay >3, NA, ~))
but nothing worked.
What is the correct syntax for this?
Many thanks in advance
Thomas Philips

We can use across():
library(dplyr)
q_vec <- c("Q1", "Q2", "Q3")
df %>% mutate(across(all_of(q_vec), ~ifelse(Delay >3, NA, .)))
# ID Q1 Q2 Q3 Delay
#1 A 1 5 0 1
#2 B 9 9 6 2
#3 C 5 7 1 3
#4 D NA NA NA 4
#5 E NA NA NA 5
#6 F NA NA NA 6
#7 G NA NA NA 7
#8 H NA NA NA 8
#9 I NA NA NA 9
#10 J NA NA NA 10
Or in base R :
df[q_vec][df$Delay > 3, ] <- NA
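The base R idea can also be written with a single row-and-column index, which some readers may find easier to follow. The sketch below is only an illustration: the seed value and the names `late` and `res` are assumptions made for reproducibility and are not part of the original question.

```r
# Reproducible version of the example data (the seed is an arbitrary choice)
set.seed(42)
df <- data.frame(ID = LETTERS[1:10],
                 Q1 = sample(0:10, 10, replace = TRUE),
                 Q2 = sample(0:10, 10, replace = TRUE),
                 Q3 = sample(0:10, 10, replace = TRUE),
                 Delay = 1:10)
q_vec <- c("Q1", "Q2", "Q3")

# Logical row index: TRUE where the response came too late
late <- df$Delay > 3

# Work on a copy and blank out every question column in the late rows
res <- df
res[late, q_vec] <- NA
```

Rows with Delay of 3 or less keep their answers, and the ID and Delay columns are left untouched.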

Related

trying to calculate sum of row with dataframe having NA values

I am trying to sum the values in each row if any column has a value, but my attempt below is not working:
df=data.frame(
x3=c(2,NA,3,5,4,6,NA,NA,3,3),
x4=c(0,NA,NA,6,5,6,NA,0,4,2))
df$summ <- ifelse(is.na(c(df[,"x3"] & df[,"x4"])),NA,rowSums(df[,c("x3","x4")], na.rm=TRUE))
the output should be like
An alternative solution:
library(data.table)
setDT(df)[!( is.na(x3) & is.na(x4)),summ:=rowSums(.SD, na.rm = T)]
You can do :
df <- transform(df, summ = ifelse(is.na(x3) & is.na(x4), NA,
rowSums(df, na.rm = TRUE)))
df
# x3 x4 summ
#1 2 0 2
#2 NA NA NA
#3 3 NA 3
#4 5 6 11
#5 4 5 9
#6 6 6 12
#7 NA NA NA
#8 NA 0 0
#9 3 4 7
#10 3 2 5
In general for any number of columns :
cols <- c('x3', 'x4')
df <- transform(df, summ = ifelse(rowSums(is.na(df[cols])) == length(cols),
NA, rowSums(df[cols], na.rm = TRUE)))
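The general case can be wrapped in a small function so that only the chosen columns enter both the NA check and the sum. This is a minimal sketch; the helper name `row_sum_or_na` is hypothetical and not taken from any package.

```r
# Hypothetical helper: row sums over `cols`, NA where every value is missing
row_sum_or_na <- function(data, cols) {
  vals <- data[cols]
  # TRUE for rows in which all selected columns are NA
  all_missing <- rowSums(is.na(vals)) == length(cols)
  out <- rowSums(vals, na.rm = TRUE)
  out[all_missing] <- NA
  out
}

df <- data.frame(x3 = c(2, NA, 3, 5, 4, 6, NA, NA, 3, 3),
                 x4 = c(0, NA, NA, 6, 5, 6, NA, 0, 4, 2))
df$summ <- row_sum_or_na(df, c("x3", "x4"))
```

Because the columns are passed explicitly, the call can be repeated safely even after `summ` has been added to the data frame.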
Try the code below with rowSums + replace
df$summ <- replace(rowSums(df, na.rm = TRUE), rowSums(is.na(df)) == 2, NA)
which gives
> df
x3 x4 summ
1 2 0 2
2 NA NA NA
3 3 NA 3
4 5 6 11
5 4 5 9
6 6 6 12
7 NA NA NA
8 NA 0 0
9 3 4 7
10 3 2 5
This is not much different from already posted answers, however, it contains some useful functions:
library(dplyr)
df %>%
rowwise() %>%
mutate(Count = ifelse(all(is.na(cur_data())), NA,
sum(c_across(everything()), na.rm = TRUE)))
# A tibble: 10 x 3
# Rowwise:
x3 x4 Count
<dbl> <dbl> <dbl>
1 2 0 2
2 NA NA NA
3 3 NA 3
4 5 6 11
5 4 5 9
6 6 6 12
7 NA NA NA
8 NA 0 0
9 3 4 7
10 3 2 5

dplyr tidyr based subsetting not working in R

I have the following data frame created in R
df<-data.frame("X_F"=c(5,10,20,200, 5,10,15,25,30,60,200, NA),
"X_A"=c(1,2,3,4,1,2,3,4,5,6,7,NA),"Y_F"=c(5,20,200, NA, 5,12,16,25,100, NA,
NA, NA), "Y_A"=c(1,2,3,NA, 1,2,3,4,5,NA, NA, NA), "Z_F"=c(5,10,20,100,
4,12,1,7,30,100,200, 250), 'Z_A'=c(1,2,3,4,1,3,4,5,6,7,9,10), "ID"=c("A",
"A", "A", "A", "B", "B", "B", "B","B","B", "B", "B"))
The data frame has differing entries across rows and looks as follows
X_F X_A Y_F Y_A Z_F Z_A ID
1 5 1 5 1 5 1 A
2 10 2 20 2 10 2 A
3 20 3 200 3 20 3 A
4 200 4 NA NA 100 4 A
5 5 1 5 1 4 1 B
6 10 2 12 2 12 3 B
7 15 3 16 3 1 4 B
8 25 4 25 4 7 5 B
9 30 5 100 5 30 6 B
10 60 6 NA NA 100 7 B
11 200 7 NA NA 200 9 B
12 NA NA NA NA 250 10 B
Next, I created a new column called SF that covers the values in X_F, Y_F and Z_F as a sequence in steps of one.
library(dplyr)
library(tidyr)
df=df %>% group_by(ID) %>%
mutate(SF=pmax(X_F,Y_F,Z_F,na.rm = TRUE)) %>%
complete(SF=full_seq(SF,1))
Next I have created the following columns
df[c("X_F2", "Y_F2", "Z_F2") ]<-df$SF
df[c("X_A2", "Y_A2", "Z_A2")]<-NA
The following code should transfer values in X_A to X_A2 based on the values in X_F being equal to X_F2.
df<-df%>%group_by(ID)%>%
mutate(X_A2, case_when(X_F2==X_F~X_A))%>%
mutate(Y_A2, case_when(Y_F2==Y_F~Y_A))%>%
mutate(Z_A2, case_when(Z_F2==Z_F~Z_A))
I am not getting the expected result
The expected result should be as follows
head(data.frame(df$`case_when(X_F2 == X_F ~ X_A)`, df$X_F2),10)
df..case_when.X_F2....X_F...X_A.. df.X_F2
1 5
NA 6
NA 7
NA 8
NA 9
2 10
NA 11
NA 12
NA 13
NA 14
However I am getting the following output
df..case_when.X_F2....X_F...X_A.. df.X_F2
1 5
NA 6
NA 7
NA 8
NA 9
NA 10
NA 11
NA 12
NA 13
NA 14
Could someone take a look? I have also tried else if, but that clearly doesn't work.
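Two things seem to be going on here, and the following is only one possible reading. First, `mutate(X_A2, case_when(...))` uses a comma where `=` was probably intended, so the result lands in a column named after the expression instead of in X_A2. Second, `X_F2 == X_F` compares values within the same row, and the rows added by complete() have NA in X_F, so the comparison cannot succeed there. A lookup by value, rather than by row, can be sketched in base R with match(); the toy vectors below are taken from the first rows of group "A" and stand in for one group of the real data.

```r
# Toy data for one ID (values from the first rows of group "A")
X_F  <- c(5, 10, 20, 200)  # original positions
X_A  <- c(1, 2, 3, 4)      # values attached to those positions
X_F2 <- 5:14               # completed sequence, as in the expected output

# match() gives the position of each X_F2 value in X_F, or NA if absent;
# indexing X_A with that position carries the NA through
X_A2 <- X_A[match(X_F2, X_F)]
X_A2
```

On these vectors the result has 1 at position 5, 2 at position 10, and NA everywhere else, which agrees with the expected output shown in the question. Applying the same expression per ID inside a grouped mutate is an assumption that has not been checked against the full data.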

Filtering data relative to first and last occurrence of an event

I have a dataframe of an experiment, where stimulus is shown to participants, and time is measured continuously.
# reprex
df <-
tibble(stim = c(NA, NA, NA, NA, "a", "b", NA, "c", NA, "d", NA, NA, NA),
time = 0:12)
# A tibble: 13 x 2
stim time
<chr> <int>
1 NA 0
2 NA 1
3 NA 2
4 NA 3
5 a 4
6 b 5
7 NA 6
8 c 7
9 NA 8
10 d 9
11 NA 10
12 NA 11
13 NA 12
I want to create a generalized solution, using tidyverse functions, that keeps only the data from 1 second before the first marker to 2 seconds after the last marker. I thought the following would work, but it throws an uninformative error.
df %>%
# store times for first and last stim
mutate(first_stim = drop_na(stim) %>% pull(time) %>% first(),
last_stim = drop_na(stim) %>% pull(time) %>% last()) %>%
# filter df based on new variables
filter(time >= first(first_stim) - 1 &
time <= first(last_stim) + 2)
Error in mutate_impl(.data, dots) : bad value
So I wrote some pretty ugly base R code to work around this issue by changing the mutate:
df2 <- df %>%
mutate(first_stim = .[!is.na(.$stim), "time"][1,1],
last_stim = .[!is.na(.$stim), "time"][nrow(.[!is.na(.$stim), "time"]), 1])
# A tibble: 13 x 4
stim time first_stim last_stim
<chr> <int> <tibble> <tibble>
1 NA 0 4 9
2 NA 1 4 9
3 NA 2 4 9
4 NA 3 4 9
5 a 4 4 9
6 b 5 4 9
7 NA 6 4 9
8 c 7 4 9
9 NA 8 4 9
10 d 9 4 9
11 NA 10 4 9
12 NA 11 4 9
13 NA 12 4 9
Now I would only need to filter based on the new variables first_stim - 1 and last_stim + 2. But filter fails too:
df2 %>%
filter(time >= first(first_stim) - 1 &
time <= first(last_stim) + 2)
Error in filter_impl(.data, quo) :
Not compatible with STRSXP: [type=NULL].
I was able to do it in base R, but it is really ugly:
df2[(df2$time >= (df2[[1, "first_stim"]] - 1)) &
(df2$time <= (df2[[1, "last_stim"]] + 2))
,]
The desired output should look like this:
# A tibble: 9 x 2
stim time
<chr> <int>
4 NA 3
5 a 4
6 b 5
7 NA 6
8 c 7
9 NA 8
10 d 9
11 NA 10
12 NA 11
I believe the errors are related to dplyr::nth() and similar functions. I've found some old issues describing this behavior, but they should no longer apply: https://github.com/tidyverse/dplyr/issues/1980
I would really appreciate if someone could highlight what is the problem, and how to do this in a tidy way.
You could use a combination of is.na and which...
library(dplyr)
df <-
tibble(stim = c(NA, NA, NA, NA, "a", "b", NA, "c", NA, "d", NA, NA, NA),
time = 0:12)
df %>%
filter(row_number() >= first(which(!is.na(stim))) - 1 &
row_number() <= last(which(!is.na(stim))) + 2)
# # A tibble: 9 x 2
# stim time
# <chr> <int>
# 1 NA 3
# 2 a 4
# 3 b 5
# 4 NA 6
# 5 c 7
# 6 NA 8
# 7 d 9
# 8 NA 10
# 9 NA 11
you could also make your first attempt work with a little modification...
df %>%
mutate(first_stim = first(drop_na(., stim) %>% pull(time)),
last_stim = last(drop_na(., stim) %>% pull(time))) %>%
filter(time >= first(first_stim) - 1 &
time <= first(last_stim) + 2)
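Since the question is phrased in seconds rather than rows, the window can also be computed from the time column itself. The base R sketch below assumes that time is recorded in seconds and uses min() and max() instead of first() and last(), so it does not rely on the rows being sorted; the names `stim_times`, `keep` and `res` are illustrative only.

```r
df <- data.frame(stim = c(NA, NA, NA, NA, "a", "b", NA, "c", NA, "d", NA, NA, NA),
                 time = 0:12)

# Times at which any stimulus marker is present
stim_times <- df$time[!is.na(df$stim)]

# Keep rows from 1 second before the first marker to 2 seconds after the last
keep <- df$time >= min(stim_times) - 1 & df$time <= max(stim_times) + 2
res <- df[keep, ]
```

The same condition should carry over to dplyr as `df %>% filter(time >= min(time[!is.na(stim)]) - 1, time <= max(time[!is.na(stim)]) + 2)`.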
We can create a cumulative sum of non-NA values and then find the row indices where we encounter the first non-NA value and the last one. We then select rows based on the requirement (-1 from the start and +2 from the end).
library(tidyverse)
df %>%
mutate(count_cumsum = cumsum(!is.na(stim))) %>%
slice((which.max(count_cumsum == 1) -1):(which.max(count_cumsum) + 2)) %>%
select(-count_cumsum)
# stim time
# <chr> <int>
#1 NA 3
#2 a 4
#3 b 5
#4 NA 6
#5 c 7
#6 NA 8
#7 d 9
#8 NA 10
#9 NA 11
Just to give an idea how count_cumsum looks:
df %>%
mutate(count_cumsum = cumsum(!is.na(stim)))
# A tibble: 13 x 3
# stim time count_cumsum
# <chr> <int> <int>
#1 NA 0 0
#2 NA 1 0
#3 NA 2 0
#4 NA 3 0
#5 a 4 1
#6 b 5 2
#7 NA 6 2
#8 c 7 3
#9 NA 8 3
#10 d 9 4
#11 NA 10 4
#12 NA 11 4
#13 NA 12 4

Insert NA-rows in data frame according to rownames of other data frame

I have 2 data frames with different rownames, e.g.:
df1 <- data.frame(A = c(1,3,7,1,5), B = c(5,2,9,5,5), C = c(1,1,3,4,5))
df2 <- data.frame(A = c(4,3,2), B = c(4,4,9), C = c(3,9,3))
rownames(df2) <- c(1, 3, 6)
> df1
A B C
1 1 5 1
2 3 2 1
3 7 9 3
4 1 5 4
5 5 5 5
> df2
A B C
1 4 4 3
3 3 4 9
6 2 9 3
I need to insert NA-rows in both data frames for each row that does exist in only one of the data frames. In the given example:
> df1
A B C
1 1 5 1
2 3 2 1
3 7 9 3
4 1 5 4
5 5 5 5
6 NA NA NA
> df2
A B C
1 4 4 3
2 NA NA NA
3 3 4 9
4 NA NA NA
5 NA NA NA
6 2 9 3
I will have to perform this operation many times with different data frames, so I need an automated way to do this. I tried to solve the issue with various if/else loops, but I am sure there must be a more direct way.
We can use union to collect the row names of both data frames, build an all-NA data frame with those row names, and then use %in% to fill in the rows that exist in each original data frame
un1 <- union(rownames(df1), rownames(df2))
d1 <- as.data.frame(matrix(NA, ncol = ncol(df1),
nrow = length(un1), dimnames = list(un1, names(df1))))
d2 <- d1
d1[rownames(d1) %in% rownames(df1),] <- df1
d2[rownames(d2) %in% rownames(df2),] <- df2
d2
# A B C
#1 4 4 3
#2 NA NA NA
#3 3 4 9
#4 NA NA NA
#5 NA NA NA
#6 2 9 3
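Because the operation has to be repeated for many data frames, it may help to wrap it in a function. The sketch below takes a slightly different route from the answer above: it reindexes each data frame with match(), which returns NA for row names that are absent, and indexing a data frame with an NA row position yields an all-NA row. The helper name `pad_rows` is hypothetical.

```r
# Hypothetical helper: reindex `data` to the row names in `all_rows`,
# leaving an all-NA row wherever a name is absent from `data`
pad_rows <- function(data, all_rows) {
  out <- data[match(all_rows, rownames(data)), , drop = FALSE]
  rownames(out) <- all_rows
  out
}

df1 <- data.frame(A = c(1, 3, 7, 1, 5), B = c(5, 2, 9, 5, 5), C = c(1, 1, 3, 4, 5))
df2 <- data.frame(A = c(4, 3, 2), B = c(4, 4, 9), C = c(3, 9, 3))
rownames(df2) <- c(1, 3, 6)

# union() keeps order of first appearance; it is not sorted in general
all_rows <- union(rownames(df1), rownames(df2))
df1_padded <- pad_rows(df1, all_rows)
df2_padded <- pad_rows(df2, all_rows)
```

In this example the union happens to come out in ascending order; for other inputs the row names may need to be sorted first if a particular order matters.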

Using conditions in dplyr::mutate

I am working with a large data frame. I'm trying to create a new vector based on the conditions that exist in two current vectors.
Given the size of the dataset (and its general awesomeness) I'm trying to find a solution using dplyr, which has led me to mutate. I feel like I'm not far off, but I'm just not able to get a solution to stick.
My data frame resembles:
ID X Y
1 1 10 12
2 2 10 NA
3 3 11 NA
4 4 10 12
5 5 11 NA
6 6 NA NA
7 7 NA NA
8 8 11 NA
9 9 10 12
10 10 11 NA
To recreate it:
ID <- c(1:10)
X <- c(10, 10, 11, 10, 11, NA, NA, 11, 10, 11)
Y <- c(12, NA, NA, 12, NA, NA, NA, NA, 12, NA)
I'm looking to create a new vector 'Z' from the existing data. If Y > X, then I want it to return the value from Y. If Y is NA, then I'd like it to return the X value. If both are NA, then it should return NA.
My attempt thus far, using the code below, lets me create a new vector meeting the first condition, but not the second.
newData <- data %>%
mutate(Z =
ifelse(Y > X, Y,
ifelse(is.na(Y), X, NA)))
> newData
ID X Y Z
1 1 10 12 12
2 2 10 NA NA
3 3 11 NA NA
4 4 10 12 12
5 5 11 NA NA
6 6 NA NA NA
7 7 NA NA NA
8 8 11 NA NA
9 9 10 12 12
10 10 11 NA NA
I feel like I'm missing something mindblowingly simple. Can someone point me in the right direction?
pmax(..., na.rm = TRUE) is what you are looking for
library(dplyr)
data <- tibble(ID = 1:10,
X = c(10, 10, 11, 10, 11, NA, NA, 11, 10, 11),
Y = c(12, NA, NA, 12, NA, NA, NA, NA, 12, NA))
data %>% mutate(Z = pmax(X, Y, na.rm=TRUE))
# ID X Y Z
#1 1 10 12 12
#2 2 10 NA 10
#3 3 11 NA 11
#4 4 10 12 12
#5 5 11 NA 11
#6 6 NA NA NA
#7 7 NA NA NA
#8 8 11 NA 11
#9 9 10 12 12
#10 10 11 NA 11
The ifelse code can be
data %>%
mutate(Z= ifelse(Y>X & !is.na(Y), Y, X))
# ID X Y Z
#1 1 10 12 12
#2 2 10 NA 10
#3 3 11 NA 11
#4 4 10 12 12
#5 5 11 NA 11
#6 6 NA NA NA
#7 7 NA NA NA
#8 8 11 NA 11
#9 9 10 12 12
#10 10 11 NA 11
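The reason the nested ifelse in the question never reaches its second condition is that `Y > X` evaluates to NA whenever Y is NA, and ifelse() returns NA wherever its test is NA, regardless of the yes and no arguments. The sketch below shows this on the question's vectors and an alternative that tests for NA first; falling back to X when Y is not greater than X is an assumption, since the question does not specify that case.

```r
X <- c(10, 10, 11, 10, 11, NA, NA, 11, 10, 11)
Y <- c(12, NA, NA, 12, NA, NA, NA, NA, 12, NA)

# Original order: the outer test is NA wherever Y is NA, so the result
# is NA there and the inner is.na(Y) branch has no effect
z_bad <- ifelse(Y > X, Y, ifelse(is.na(Y), X, NA))

# Testing for NA first means the comparison only decides non-missing Y
z_ok <- ifelse(is.na(Y), X, ifelse(Y > X, Y, X))
```

On this data `z_ok` gives the same values as `pmax(X, Y, na.rm = TRUE)`.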
