I am working with a large data frame. I'm trying to create a new vector based on the conditions that exist in two current vectors.
Given the size of the dataset (and dplyr's general awesomeness), I'm trying to find a solution using dplyr, which has led me to mutate. I feel like I'm not far off, but I can't get a solution to stick.
My data frame resembles:
ID X Y
1 1 10 12
2 2 10 NA
3 3 11 NA
4 4 10 12
5 5 11 NA
6 6 NA NA
7 7 NA NA
8 8 11 NA
9 9 10 12
10 10 11 NA
To recreate it:
ID <- c(1:10)
X <- c(10, 10, 11, 10, 11, NA, NA, 11, 10, 11)
Y <- c(12, NA, NA, 12, NA, NA, NA, NA, 12, NA)
data <- data.frame(ID, X, Y)  # combine the vectors into the data frame used below
I'm looking to create a new vector 'Z' from the existing data. If Y > X, then I want it to return the value from Y. If Y is NA, then I'd like it to return the X value. If both are NA, then it should return NA.
My attempt so far, using the code below, creates a new vector that meets the first condition but not the second.
newData <- data %>%
mutate(Z =
ifelse(Y > X, Y,
ifelse(is.na(Y), X, NA)))
> newData
ID X Y Z
1 1 10 12 12
2 2 10 NA NA
3 3 11 NA NA
4 4 10 12 12
5 5 11 NA NA
6 6 NA NA NA
7 7 NA NA NA
8 8 11 NA NA
9 9 10 12 12
10 10 11 NA NA
I feel like I'm missing something mindblowingly simple. Can someone point me in the right direction?
pmax(X, Y, na.rm = TRUE) is what you are looking for:
library(dplyr)
# tibble() replaces the deprecated data_frame()
data <- tibble(ID = c(1:10),
               X = c(10, 10, 11, 10, 11, NA, NA, 11, 10, 11),
               Y = c(12, NA, NA, 12, NA, NA, NA, NA, 12, NA))
data %>% mutate(Z = pmax(X, Y, na.rm=TRUE))
# ID X Y Z
#1 1 10 12 12
#2 2 10 NA 10
#3 3 11 NA 11
#4 4 10 12 12
#5 5 11 NA 11
#6 6 NA NA NA
#7 7 NA NA NA
#8 8 11 NA 11
#9 9 10 12 12
#10 10 11 NA 11
The ifelse code can be written as:
data %>%
mutate(Z= ifelse(Y>X & !is.na(Y), Y, X))
# ID X Y Z
#1 1 10 12 12
#2 2 10 NA 10
#3 3 11 NA 11
#4 4 10 12 12
#5 5 11 NA 11
#6 6 NA NA NA
#7 7 NA NA NA
#8 8 11 NA 11
#9 9 10 12 12
#10 10 11 NA 11
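The nested ifelse in the question returns NA for rows where Y is missing because the comparison Y > X is itself NA there, and ifelse() returns NA for any element whose test is NA without using either branch. A minimal base R sketch with the X and Y vectors from the question (the names z_nested, z_fixed and z_pmax are chosen here only for illustration):

```r
X <- c(10, 10, 11, 10, 11, NA, NA, 11, 10, 11)
Y <- c(12, NA, NA, 12, NA, NA, NA, NA, 12, NA)

# Where Y is NA, the test Y > X is NA, so ifelse() yields NA and the
# inner ifelse() result is never used for that element.
z_nested <- ifelse(Y > X, Y, ifelse(is.na(Y), X, NA))

# Guarding the test with !is.na(Y) turns those NA tests into FALSE,
# so the "no" branch (X) is used instead.
z_fixed <- ifelse(Y > X & !is.na(Y), Y, X)

# pmax() with na.rm = TRUE gives the same result for this data.
z_pmax <- pmax(X, Y, na.rm = TRUE)
```

For this data z_fixed and z_pmax agree; they can differ on other data, for example when X is NA but Y is not.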
I have created the following dataframe in R
library(tidyr)
library(dplyr)
DF11<- data.frame("ID"= c("A", "A", "A", "B", "B", "B", "B", "B"))
DF11$X_F<-c(5, 7,9,6,7,8,9,10)
DF11$X_A<-c(7, 8,9,3,6,7,9,10)
The dataframe looks as follows
ID X_F X_A
A 5 7
A 7 8
A 9 9
B 6 3
B 7 6
B 8 7
B 9 9
B 10 10
ID is the grouping variable. I would like to use dplyr to create the following dataframe.
ID X_F X_A
A 0 NA
A 1 NA
A 2 NA
A 3 NA
A 4 NA
A 5 7
A 7 8
A 9 9
A 10 NA
A 11 NA
A 12 NA
B 0 NA
B 1 NA
B 2 NA
B 3 NA
B 4 NA
B 5 NA
B 6 3
B 7 6
B 8 7
B 9 9
B 10 10
B 11 NA
B 12 NA
B 13 NA
The resulting dataframe should take DF11, grouped by ID, and complete X_F within each group: from 0 up to the group's minimum X_F, and from the group's maximum X_F up to that maximum plus 3. Values between the minimum and maximum that are not already present (for example 6 and 8 in group A) should not be added.
I tried the following code and was able to solve it partially.
DF112 <- DF11 %>%
  group_by(ID) %>%
  complete(X_F = seq(0, max(X_F) + 3, by = 1))
ID X_F X_A
A 0 NA
A 1 NA
A 2 NA
A 3 NA
A 4 NA
A 5 7
A 6 NA
A 7 8
A 8 NA
A 9 9
A 10 NA
A 11 NA
A 12 NA
B 0 NA
B 1 NA
B 2 NA
B 3 NA
B 4 NA
B 5 NA
B 6 3
B 7 6
B 8 7
B 9 9
B 10 10
B 11 NA
B 12 NA
B 13 NA
How do I get the desired output shown above? Any guidance would be appreciated.
It would work to pass two vectors into your complete function call, one to do the lower values and one to do the upper:
library(tidyr)
library(dplyr)
DF11 <- data.frame("ID" = c("A", "A", "A", "B", "B", "B", "B", "B"))
DF11$X_F <- c(5, 7, 9, 6, 7, 8, 9, 10)
DF11$X_A <- c(7, 8, 9, 3, 6, 7, 9, 10)
DF11 %>%
  group_by(ID) %>%
  complete(X_F = c(seq(0, min(X_F) - 1, by = 1),
                   seq(max(X_F) + 1, max(X_F) + 3, by = 1))) %>%
  arrange(ID, X_F)
# A tibble: 25 × 3
# Groups: ID [2]
ID X_F X_A
<chr> <dbl> <dbl>
1 A 0 NA
2 A 1 NA
3 A 2 NA
4 A 3 NA
5 A 4 NA
6 A 5 7
7 A 7 8
8 A 9 9
9 A 10 NA
10 A 11 NA
11 A 12 NA
12 B 0 NA
13 B 1 NA
14 B 2 NA
15 B 3 NA
16 B 4 NA
17 B 5 NA
18 B 6 3
19 B 7 6
20 B 8 7
21 B 9 9
22 B 10 10
23 B 11 NA
24 B 12 NA
25 B 13 NA
Created on 2022-11-01 with reprex v2.0.2
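The vector handed to complete() in the answer above can be inspected on its own. The following is a minimal base R sketch, assuming integer-valued, non-negative X_F; pad_values is a hypothetical helper introduced here, not part of tidyr. It uses seq_len() so that a group whose minimum is already 0 gets no lower padding, a case where seq(0, min(X_F) - 1, by = 1) would raise an error.

```r
# Hypothetical helper: values to add below the group minimum and
# up to `extra` steps above the group maximum.
pad_values <- function(x, extra = 3) {
  lower <- seq_len(min(x)) - 1       # 0, 1, ..., min(x) - 1 (empty if min(x) is 0)
  upper <- max(x) + seq_len(extra)   # max(x) + 1, ..., max(x) + extra
  c(lower, upper)
}

pad_a <- pad_values(c(5, 7, 9))          # X_F values of group A
pad_b <- pad_values(c(6, 7, 8, 9, 10))   # X_F values of group B
```

Here pad_a holds 0 to 4 followed by 10 to 12, and pad_b holds 0 to 5 followed by 11 to 13, which are the rows with NA in X_A in the output above.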
I have a large dataframe with responses to a questionnaire. My minimal working example (below) has the responses to 3 questions as well as the delay in responding to the questionnaire from which the answers are drawn:
df <- data.frame(ID = LETTERS[1:10],
                 Q1 = sample(0:10, 10, replace = TRUE),
                 Q2 = sample(0:10, 10, replace = TRUE),
                 Q3 = sample(0:10, 10, replace = TRUE),
                 Delay = 1:10
)
I'd like to change the responses with a delay > 3 to NA's. I can accomplish this easily enough for a single question:
df %>%
mutate(Q1 = ifelse(Delay >3, NA, Q1))
which gives me
ID Q1 Q2 Q3 Delay
1 A 5 6 9 1
2 B 8 1 5 2
3 C 8 4 6 3
4 D NA 7 1 4
5 E NA 8 10 5
6 F NA 9 4 6
7 G NA 1 6 7
8 H NA 8 9 8
9 I NA 9 1 9
10 J NA 5 7 10
I'd like instead to do this for all three questions with one statement (in my real life problem, I have over 20 questions, so it's tedious to do each question separately). I therefore create a vector of questions:
q_vec <- c("Q1", "Q2", "Q3")
and then tried variants of my earlier code such as
df %>%
mutate(all_of(q_vec) = ifelse(Delay >3, NA, ~))
but nothing worked.
What is the correct syntax for this?
Many thanks in advance
Thomas Philips
We can use across:
library(dplyr)
q_vec <- c("Q1", "Q2", "Q3")
df %>% mutate(across(all_of(q_vec), ~ifelse(Delay >3, NA, .)))
# ID Q1 Q2 Q3 Delay
#1 A 1 5 0 1
#2 B 9 9 6 2
#3 C 5 7 1 3
#4 D NA NA NA 4
#5 E NA NA NA 5
#6 F NA NA NA 6
#7 G NA NA NA 7
#8 H NA NA NA 8
#9 I NA NA NA 9
#10 J NA NA NA 10
Or in base R:
df[q_vec][df$Delay > 3, ] <- NA
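As a small check of the row and column assignment idea, here is a base R sketch with a fixed seed so the data are reproducible. The seed value and the names df2 and late are arbitrary choices made here; the sketch uses the single-index form df2[rows, cols] <- NA, which has the same effect.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
df2 <- data.frame(ID = LETTERS[1:10],
                  Q1 = sample(0:10, 10, replace = TRUE),
                  Q2 = sample(0:10, 10, replace = TRUE),
                  Q3 = sample(0:10, 10, replace = TRUE),
                  Delay = 1:10)
q_vec <- c("Q1", "Q2", "Q3")

late <- df2$Delay > 3     # rows answered after the cutoff
df2[late, q_vec] <- NA    # blank out every question column for those rows
```

Only the columns named in q_vec are touched; ID and Delay keep their values.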
I have the following data frame created in R
df<-data.frame("X_F"=c(5,10,20,200, 5,10,15,25,30,60,200, NA),
"X_A"=c(1,2,3,4,1,2,3,4,5,6,7,NA),"Y_F"=c(5,20,200, NA, 5,12,16,25,100, NA,
NA, NA), "Y_A"=c(1,2,3,NA, 1,2,3,4,5,NA, NA, NA), "Z_F"=c(5,10,20,100,
4,12,1,7,30,100,200, 250), 'Z_A'=c(1,2,3,4,1,3,4,5,6,7,9,10), "ID"=c("A",
"A", "A", "A", "B", "B", "B", "B","B","B", "B", "B"))
The data frame has differing entries across rows and looks as follows
X_F X_A Y_F Y_A Z_F Z_A ID
1 5 1 5 1 5 1 A
2 10 2 20 2 10 2 A
3 20 3 200 3 20 3 A
4 200 4 NA NA 100 4 A
5 5 1 5 1 4 1 B
6 10 2 12 2 12 3 B
7 15 3 16 3 1 4 B
8 25 4 25 4 7 5 B
9 30 5 100 5 30 6 B
10 60 6 NA NA 100 7 B
11 200 7 NA NA 200 9 B
12 NA NA NA NA 250 10 B
Next, I created a new column called SF that includes all values in X_F, Y_F and Z_F as a sequence with a step of one.
library(dplyr)
library(tidyr)
df=df %>% group_by(ID) %>%
mutate(SF=pmax(X_F,Y_F,Z_F,na.rm = TRUE)) %>%
complete(SF=full_seq(SF,1))
Next I have created the following columns
df[c("X_F2", "Y_F2", "Z_F2") ]<-df$SF
df[c("X_A2", "Y_A2", "Z_A2")]<-NA
The following code should transfer values in X_A to X_A2 based on the values in X_F being equal to X_F2.
df<-df%>%group_by(ID)%>%
mutate(X_A2, case_when(X_F2==X_F~X_A))%>%
mutate(Y_A2, case_when(Y_F2==Y_F~Y_A))%>%
mutate(Z_A2, case_when(Z_F2==Z_F~Z_A))
I am not getting the expected result
The expected result should be as follows
head(data.frame(df$`case_when(X_F2 == X_F ~ X_A)`, df$X_F2),10)
df..case_when.X_F2....X_F...X_A.. df.X_F2
1 5
NA 6
NA 7
NA 8
NA 9
2 10
NA 11
NA 12
NA 13
NA 14
However I am getting the following output
df..case_when.X_F2....X_F...X_A.. df.X_F2
1 5
NA 6
NA 7
NA 8
NA 9
NA 10
NA 11
NA 12
NA 13
NA 14
Could someone take a look? I have also tried else if, but that clearly doesn't work.
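Two observations, offered as a possible reading rather than a confirmed diagnosis. First, mutate(X_A2, case_when(...)) does not assign to X_A2; the result lands in a column named after the expression, which is why the output above refers to `case_when(X_F2 == X_F ~ X_A)`. Second, after complete() the original X_F and X_A values stay on the row of their original SF value, so a row-wise test X_F2 == X_F only matches where X_F happened to equal SF. A lookup with match() avoids the row alignment issue. A minimal base R sketch using the group A values from the question:

```r
# Group A values from the question
X_F <- c(5, 10, 20, 200)
X_A <- c(1, 2, 3, 4)

X_F2 <- 5:14                     # first ten values of the completed sequence

# For each X_F2, find its position in X_F (NA if absent) and pull X_A from there.
X_A2 <- X_A[match(X_F2, X_F)]
```

This gives 1 at X_F2 = 5, 2 at X_F2 = 10 and NA elsewhere, as in the expected result. Inside a grouped mutate the same expression could presumably be written as X_A2 = X_A[match(X_F2, X_F)], though that has not been run against the full data here.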
My zoo data looks like below. This data is part of a larger zoo (time series) data set.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
NA NA NA NA NA 1 NA NA NA NA NA 3 NA NA NA
library(zoo)
x <- zoo(c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 3, NA, NA, NA, NA))
I want to replace NAs in a window around each non-NA value with the non-NA value. For example, a window of 5 around the non-NA would look like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
NA NA NA 1 1 1 1 1 NA 3 3 3 3 3 NA
I can do what I want with a long and messy set of ifelse statements.
Is there a better way? I looked at zoo's NA fill set of functions but did not see anything for a window.
I guess rolling apply will do the job?
> rollapply(x, 5, function(x){mean(x[!is.na(x)])}, fill=NA)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
NA NA NaN 1 1 1 1 1 NaN 3 3 3 3 3 NA NA
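We could also use filter, assuming v1 holds the underlying values as a plain numeric vector (for example coredata(x)):
v1 <- coredata(x)  # assumption: v1 was not defined in the original answer
v2 <- stats::filter(replace(v1, is.na(v1), 0), rep(1, 5))
is.na(v2) <- !v2   # windows that contained no value sum to 0 and become NA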
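The windowed replacement can also be written directly, without averaging or summing. The following is a minimal base R sketch; fill_window is a hypothetical helper introduced here, not a zoo function, and it assumes that when two windows overlap the earlier value should win.

```r
# Hypothetical helper: copy each non-NA value into the NA positions
# within `half` steps on either side (window width = 2 * half + 1).
fill_window <- function(v, half = 2) {
  out <- v
  for (i in which(!is.na(v))) {
    idx <- max(1, i - half):min(length(v), i + half)
    gaps <- idx[is.na(out[idx])]   # only overwrite positions that are still NA
    out[gaps] <- v[i]
  }
  out
}

v <- c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 3, NA, NA, NA, NA)
filled <- fill_window(v, half = 2)
```

With the vector from the question, positions 4 to 8 become 1 and positions 10 to 14 become 3. For a zoo object, the function could be applied to coredata(x).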
I want to filter my dataset to keep cases with observations in a specific column. To illustrate:
help <- data.frame(deid = c(5, 5, 5, 5, 5, 12, 12, 12, 12, 17, 17, 17),
score.a = c(NA, 1, 1, 1, NA, NA, NA, NA, NA, NA, 1, NA))
Creates
deid score.a
1 5 NA
2 5 1
3 5 1
4 5 1
5 5 NA
6 12 NA
7 12 NA
8 12 NA
9 12 NA
10 17 NA
11 17 1
12 17 NA
And I want to tell dplyr to keep cases that have any observations in score.a, including the NA values. Thus, I want it to return:
deid score.a
1 5 NA
2 5 1
3 5 1
4 5 1
5 5 NA
6 17 NA
7 17 1
8 17 NA
I ran the code help %>% group_by(deid) %>% filter(score.a > 0); however, it pulls out the NAs as well. Thank you for any assistance.
Edit: A similar question was asked here How to remove groups of observation with dplyr::filter()
However, in the answer they use the 'all' condition and this requires use of the 'any' condition.
Try
library(dplyr)
help %>%
group_by(deid) %>%
filter(any(score.a >0 & !is.na(score.a)))
# deid score.a
#1 5 NA
#2 5 1
#3 5 1
#4 5 1
#5 5 NA
#6 17 NA
#7 17 1
#8 17 NA
Or a similar approach with data.table
library(data.table)
setDT(help)[, if(any(score.a>0 & !is.na(score.a))) .SD , deid]
# deid score.a
#1: 5 NA
#2: 5 1
#3: 5 1
#4: 5 1
#5: 5 NA
#6: 17 NA
#7: 17 1
#8: 17 NA
If the condition is to subset 'deid's with all the values in 'score.a' > 0, then the above code can be modified to,
setDT(help)[, if(!all(is.na(score.a)) &
all(score.a[!is.na(score.a)]>0)) .SD , deid]
# deid score.a
#1: 5 NA
#2: 5 1
#3: 5 1
#4: 5 1
#5: 5 NA
#6: 17 NA
#7: 17 1
#8: 17 NA
Suppose one of the 'score.a' in 'deid' group is less than 0,
help$score.a[3] <- -1
the above code would return
setDT(help)[, if(!all(is.na(score.a)) &
all(score.a[!is.na(score.a)]>0)) .SD , deid]
# deid score.a
#1: 17 NA
#2: 17 1
#3: 17 NA
library(dplyr)
help %>% group_by(deid) %>% filter(sum(score.a, na.rm = TRUE) > 0)
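The "keep the whole group if any score is observed and positive" idea can also be sketched in base R with ave(). The names help_df, positive, keep and kept are chosen here for illustration, and the data are the ones from the question.

```r
help_df <- data.frame(deid = c(5, 5, 5, 5, 5, 12, 12, 12, 12, 17, 17, 17),
                      score.a = c(NA, 1, 1, 1, NA, NA, NA, NA, NA, NA, 1, NA))

# TRUE only for observed, positive scores; NA scores count as FALSE.
positive <- !is.na(help_df$score.a) & help_df$score.a > 0

# ave() applies any() within each deid and recycles the result to every row.
keep <- ave(positive, help_df$deid, FUN = any)

kept <- help_df[keep, ]
```

Groups 5 and 17 are kept in full, including their NA rows, while group 12 is dropped because it has no observed score.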