My zoo data looks like below. This data is part of a larger zoo (time series) data set.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
NA NA NA NA NA 1 NA NA NA NA NA 3 NA NA NA
library(zoo)
x <- zoo(c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 3, NA, NA, NA, NA))
I want to replace the NAs in a window around each non-NA value with that non-NA value. For example, a window of 5 around each non-NA value would look like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
NA NA NA 1 1 1 1 1 NA 3 3 3 3 3 NA
I can do what I want with a long and messy set of ifelse statements.
Is there a better way? I looked at zoo's NA fill set of functions but did not see anything for a window.
I guess rolling apply will do the job?
> rollapply(x, 5, function(x){mean(x[!is.na(x)])}, fill=NA)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
NA NA NaN 1 1 1 1 1 NaN 3 3 3 3 3 NA NA
We could also use filter, assuming v1 holds the values as a plain numeric vector (e.g. v1 <- coredata(x)):
v2 <- stats::filter(replace(v1,is.na(v1),0), rep(1,5))
is.na(v2) <- !v2
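For comparison, here is a minimal base R sketch of the same window fill that works on a plain numeric vector and never produces NaN. The helper name fill_window and its width argument are made up for illustration; it assumes an odd window width, and where two windows overlap the later value overwrites the earlier one.

```r
# Hypothetical helper: spread each non-NA value over a centred window.
fill_window <- function(v, width = 5) {
  half <- (width - 1) %/% 2                 # cells on each side of the centre
  out <- rep(NA_real_, length(v))
  for (i in which(!is.na(v))) {
    # Clip the window at both ends of the vector
    idx <- max(1, i - half):min(length(v), i + half)
    out[idx] <- v[i]
  }
  out
}

v <- c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 3, NA, NA, NA)
filled <- fill_window(v, width = 5)
print(filled)
```

For a zoo series, the same helper could be applied to coredata(x) and the result wrapped back with zoo(filled, index(x)).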
Being still rather new to R, I struggle with the following problem:
I am looking at several particles moving along the x axis (in reality this is in 3D, but this simplifies matters for our purpose here). I have a data frame with each particle's ID and its position at each time point.
Here's an example:
x.position1 <- c(5, NA, 4, 7, 1, NA, 2, NA, NA, 3)
x.position2 <- c(6, NA, 8, 7, 2, 1, 2, NA, NA, 1)
x.position3 <- c(6, 2, 7, 7, 4, 3, 1, NA, NA, 6)
x.position4 <- c(7, 4, 9, 7, 5, 5, 0, 0, 5, 7)
x.position5 <- c(9, 5, NA, 7, 6, NA, 2, 3, 8, 11)
particule.ID <- c(1:10)
df <- data.frame(particule.ID, x.position1, x.position2, x.position3, x.position4, x.position5)
df
particule.ID x.position1 x.position2 x.position3 x.position4 x.position5
1 1 5 6 6 7 9
2 2 NA NA 2 4 5
3 3 4 8 7 9 NA
4 4 7 7 7 7 7
5 5 1 2 4 5 6
6 6 NA 1 3 5 NA
7 7 2 2 1 0 2
8 8 NA NA NA 0 3
9 9 NA NA NA 5 8
10 10 3 1 6 7 11
My aim is to calculate the displacement of each particle at each time point i. This displacement is therefore xi - x1. This newly calculated displacement is to be placed in a newly created column.
Here's the script I originally wrote for this:
for (i in 1:5){ # iterate for each time point i
df$Disp <- df[,2+i-1]-df[,2] # create a new column with the calculated displacement for time point i
nam.Disp <- paste("Disp", i, sep = "") #rename new column Disp+time point number
names(df)[names(df) == 'Disp'] <- nam.Disp
}
df
particule.ID x.position1 x.position2 x.position3 x.position4 x.position5 Disp1 Disp2 Disp3 Disp4 Disp5
1 1 5 6 6 7 9 0 1 1 2 4
2 2 NA NA 2 4 5 NA NA NA NA NA
3 3 4 8 7 9 NA 0 4 3 5 NA
4 4 7 7 7 7 7 0 0 0 0 0
5 5 1 2 4 5 6 0 1 3 4 5
6 6 NA 1 3 5 NA NA NA NA NA NA
7 7 2 2 1 0 2 0 0 -1 -2 0
8 8 NA NA NA 0 3 NA NA NA NA NA
9 9 NA NA NA 5 8 NA NA NA NA NA
10 10 3 1 6 7 11 0 -2 3 4 8
However, as you may notice in the data frame, sometimes a particle is not detected at i=1, or later, which means that I get an NA value. I therefore tried adding another loop with IF so that, if the first time point is NA, R would move on to the next time point until it found a non-NA value to subtract.
I therefore came up with this, using ifelse instead of IF, since the latter can only deal with one value while my input is actually a column:
for (i in 1:5){ # iterate for each time point i
for (j in 1:5){ # if first time point has no value (NA) scan the row for next time point until an object is detected
ifelse(!is.na(df[,2+j-1]),
df$Disp <- (df[,2+i-1]-df[,2+j-1]), # create a new column with the calculated displacement for i time point
next) # if time point is NA go to next j (next fixed initial time point to test)
}
nam.Disp <- paste("Disp", i, sep = "") #rename new column Disp+time point number
names(df)[names(df) == 'Disp'] <- nam.Disp
}
df
particule.ID x.position1 x.position2 x.position3 x.position4 x.position5 Disp1 Disp2 Disp3 Disp4 Disp5
1 1 5 6 6 7 9 -4 -3 -3 -2 0
2 2 NA NA 2 4 5 NA NA -3 -1 0
3 3 4 8 7 9 NA NA NA NA NA NA
4 4 7 7 7 7 7 0 0 0 0 0
5 5 1 2 4 5 6 -5 -4 -2 -1 0
6 6 NA 1 3 5 NA NA NA NA NA NA
7 7 2 2 1 0 2 0 0 -1 -2 0
8 8 NA NA NA 0 3 NA NA NA -3 0
9 9 NA NA NA 5 8 NA NA NA -3 0
10 10 3 1 6 7 11 -8 -10 -5 -4 0
Somehow this returns wrong values. It looks like the computation was done against the last time point instead of the first: Disp1 = x1-x5, Disp2 = x2-x5, Disp3 = x3-x5 etc., while what I expected was: Disp1 = x1-x1, Disp2 = x2-x1, Disp3 = x3-x1 etc.
How can this inclusion of the new for loop and the ifelse function have caused this? What am I doing wrong?
There may be a way to do this manually, but since in reality I have at least 60 time points if not more, the task would be daunting.
Also, if you think there's a much smarter way to do this, please do share! And if I forgot to include important details that would help you understand better, just let me know.
Many thanks!
Flo
An example of how to deal with your problem in long format:
library(data.table)
library(zoo)
library(stringr)
df <- setDT(df)
dflong <- melt(df,patterns = "position",id.vars = "particule.ID",value.name = "position")
dflong[,time := as.integer(str_extract(variable,"[0-9]+$"))] # integer, so that time 10 sorts after time 2
setkey(dflong,time)
dflong[,displacement := c(NA,diff(na.locf(position,na.rm = F))),by = particule.ID]
particule.ID variable position time displacement
1: 1 x.position1 5 1 NA
2: 1 x.position2 6 2 1
3: 1 x.position3 6 3 0
4: 1 x.position4 7 4 1
5: 1 x.position5 9 5 2
6: 2 x.position1 NA 1 NA
7: 2 x.position2 NA 2 NA
8: 2 x.position3 2 3 NA
9: 2 x.position4 4 4 2
10: 2 x.position5 5 5 1
11: 3 x.position1 4 1 NA
12: 3 x.position2 8 2 4
13: 3 x.position3 7 3 -1
14: 3 x.position4 9 4 2
15: 3 x.position5 NA 5 0
.....
I use data.table here, but the same can be done with dplyr and tidyr, using pivot_longer.
melt(df,patterns = "position",id.vars = "particule.ID",value.name = "position")
transforms your data into long format, with 3 variables: the original column name (which contains the time), the particule.ID, and the position.
I then extract the time with str_extract from stringr, converted to integer so that it sorts numerically:
dflong[,time := as.integer(str_extract(variable,"[0-9]+$"))]
I then order the table by time with setkey(dflong,time).
Finally, I use diff(position) for each particule.ID to calculate the displacement. Because the NAs cause problems, I use na.locf from zoo, which replaces each NA with the last available value.
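To make the last step concrete, here is a small base R sketch of what "last observation carried forward" followed by diff does on two of the particles. The helper locf is a hypothetical stand-in for zoo::na.locf(v, na.rm = FALSE), written only for illustration.

```r
# Hypothetical base R stand-in for zoo::na.locf(v, na.rm = FALSE)
locf <- function(v) {
  # Index of the most recent non-NA element seen so far (0 if none yet)
  idx <- cummax(seq_along(v) * (!is.na(v)))
  idx[idx == 0] <- NA            # nothing observed yet: keep NA
  v[idx]
}

p2 <- c(NA, NA, 2, 4, 5)         # particle 2 from the example
p3 <- c(4, 8, 7, 9, NA)          # particle 3 from the example

d2 <- c(NA, diff(locf(p2)))      # step-to-step displacement
d3 <- c(NA, diff(locf(p3)))
print(d2)
print(d3)
```

Note that a trailing NA becomes a displacement of 0 under this approach, because the last known position is carried forward.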
Since you want to calculate the difference between time i and time 1 for each particle, you will get NA if either of the two values is NA.
In your case, we can create a second data frame df2 which captures the changes from time 1 to time i for each of the 10 particles.
df2 <- data.frame(matrix(NA, nrow = 10, ncol = 4))
colnames(df2) <- c("chp2", "chp3", "chp4", "chp5")
We can capture the differences in a loop:
for (i in 1:4) { # time points 2 to 5
for (j in 1:10) { # particles (rows)
df2[j, i] <- df[j, i + 2] - df[j, 2]
}
}
Now we can put them all in one data frame if we want: df1 <- cbind(df, df2)
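Neither loop handles rows whose first time point is NA. In the nested loop from the question, ifelse() appears to evaluate the assignment to df$Disp for the whole column on every pass of j, so the last reference column (time point 5) wins. A minimal base R sketch of the displacement relative to each particle's first detected position, assuming that is the intended reference (the names pos, first_obs and disp are illustrative):

```r
df <- data.frame(
  particule.ID = 1:10,
  x.position1 = c(5, NA, 4, 7, 1, NA, 2, NA, NA, 3),
  x.position2 = c(6, NA, 8, 7, 2, 1, 2, NA, NA, 1),
  x.position3 = c(6, 2, 7, 7, 4, 3, 1, NA, NA, 6),
  x.position4 = c(7, 4, 9, 7, 5, 5, 0, 0, 5, 7),
  x.position5 = c(9, 5, NA, 7, 6, NA, 2, 3, 8, 11)
)

pos <- as.matrix(df[, -1])       # positions only, one row per particle

# First detected (non-NA) position in each row; NA if never detected
first_obs <- apply(pos, 1, function(r) r[!is.na(r)][1])

# A vector of length nrow(pos) is recycled down the columns,
# so every row is shifted by its own reference value
disp <- pos - first_obs
colnames(disp) <- paste0("Disp", seq_len(ncol(disp)))

res <- cbind(df, disp)
print(res)
```

Because the subtraction is vectorised over the whole matrix, the same code should work unchanged with 60 or more time points.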
Suppose I have this dataframe
df <- data.frame(
x=c(1, NA, NA, 4, 5, NA),
y=c(NA, 2, 3, NA, NA, 6)
)
which looks like this
x y
1 1 NA
2 NA 2
3 NA 3
4 4 NA
5 5 NA
6 NA 6
How can I merge the two columns into one? Basically the NA values are in complementary rows. It would be nice to also obtain (in the process) a flag column containing 0 if the entry comes from x and 1 if the entry comes from y.
We can try using the coalesce function from the dplyr package:
library(dplyr)
df$merged <- coalesce(df$x, df$y)
df$flag <- ifelse(is.na(df$y), 0, 1)
df
x y merged flag
1 1 NA 1 0
2 NA 2 2 1
3 NA 3 3 1
4 4 NA 4 0
5 5 NA 5 0
6 NA 6 6 1
We can also use base R methods with max.col on the logical matrix to get the column index, cbind with row index and extract the values that are not NA
df$merged <- df[cbind(seq_len(nrow(df)), max.col(!is.na(df)))]
df$flag <- +(!is.na(df$y))
df
# x y merged flag
#1 1 NA 1 0
#2 NA 2 2 1
#3 NA 3 3 1
#4 4 NA 4 0
#5 5 NA 5 0
#6 NA 6 6 1
Or we can use fcoalesce from data.table which is written in C and is multithreaded for numeric and factor types.
library(data.table)
setDT(df)[, c('merged', 'flag' ) := .(fcoalesce(x, y), +(!is.na(y)))]
df
# x y merged flag
#1: 1 NA 1 0
#2: NA 2 2 1
#3: NA 3 3 1
#4: 4 NA 4 0
#5: 5 NA 5 0
#6: NA 6 6 1
You can do that using dplyr as follows:
library(dplyr)
# Creating dataframe
df <-
data.frame(
x = c(1, NA, NA, 4, 5, NA),
y = c(NA, 2, 3, NA, NA, 6))
df %>%
# Take x, falling back to y where x is NA
mutate(merged = coalesce(x, y),
# If x is NA then put 1 else put 0
flag = if_else(is.na(x), 1, 0))
# x y merged flag
# 1 NA 1 0
# NA 2 2 1
# NA 3 3 1
# 4 NA 4 0
# 5 NA 5 0
# NA 6 6 1
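All of the answers rely on the NAs being complementary, that is, exactly one of x and y being present in each row. A minimal base R sketch that checks this assumption before merging (the stopifnot guard is an addition for illustration, not part of the answers):

```r
df <- data.frame(
  x = c(1, NA, NA, 4, 5, NA),
  y = c(NA, 2, 3, NA, NA, 6)
)

# Guard for the stated assumption: exactly one of x, y is non-NA per row
stopifnot(xor(is.na(df$x), is.na(df$y)))

df$merged <- ifelse(is.na(df$x), df$y, df$x)  # take x, else y
df$flag <- as.integer(is.na(df$x))            # 0 = from x, 1 = from y
print(df)
```

If both columns could be non-NA in the same row, the guard fails and the flag would need a rule for which column takes priority.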
I have the following data frame created in R
df<-data.frame("X_F"=c(5,10,20,200, 5,10,15,25,30,60,200, NA),
"X_A"=c(1,2,3,4,1,2,3,4,5,6,7,NA),"Y_F"=c(5,20,200, NA, 5,12,16,25,100, NA,
NA, NA), "Y_A"=c(1,2,3,NA, 1,2,3,4,5,NA, NA, NA), "Z_F"=c(5,10,20,100,
4,12,1,7,30,100,200, 250), 'Z_A'=c(1,2,3,4,1,3,4,5,6,7,9,10), "ID"=c("A",
"A", "A", "A", "B", "B", "B", "B","B","B", "B", "B"))
The data frame has differing entries across rows and looks as follows
X_F X_A Y_F Y_A Z_F Z_A ID
1 5 1 5 1 5 1 A
2 10 2 20 2 10 2 A
3 20 3 200 3 20 3 A
4 200 4 NA NA 100 4 A
5 5 1 5 1 4 1 B
6 10 2 12 2 12 3 B
7 15 3 16 3 1 4 B
8 25 4 25 4 7 5 B
9 30 5 100 5 30 6 B
10 60 6 NA NA 100 7 B
11 200 7 NA NA 200 9 B
12 NA NA NA NA 250 10 B
Next, I created a new column called SF that is meant to contain all values in X_F, Y_F and Z_F as a sequence in steps of one.
library(dplyr)
library(tidyr)
df=df %>% group_by(ID) %>%
mutate(SF=pmax(X_F,Y_F,Z_F,na.rm = TRUE)) %>%
complete(SF=full_seq(SF,1))
Next I have created the following columns
df[c("X_F2", "Y_F2", "Z_F2") ]<-df$SF
df[c("X_A2", "Y_A2", "Z_A2")]<-NA
The following code should transfer values in X_A to X_A2 based on the values in X_F being equal to X_F2.
df<-df%>%group_by(ID)%>%
mutate(X_A2, case_when(X_F2==X_F~X_A))%>%
mutate(Y_A2, case_when(Y_F2==Y_F~Y_A))%>%
mutate(Z_A2, case_when(Z_F2==Z_F~Z_A))
I am not getting the expected result
The expected result should be as follows
head(data.frame(df$`case_when(X_F2 == X_F ~ X_A)`, df$X_F2),10)
df..case_when.X_F2....X_F...X_A.. df.X_F2
1 5
NA 6
NA 7
NA 8
NA 9
2 10
NA 11
NA 12
NA 13
NA 14
However I am getting the following output
df..case_when.X_F2....X_F...X_A.. df.X_F2
1 5
NA 6
NA 7
NA 8
NA 9
NA 10
NA 11
NA 12
NA 13
NA 14
Could someone take a look? I have also tried else if, but that clearly doesn't work.
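One likely reason for the mismatch: SF is the row-wise maximum, so the original row with X_F = 10 gets SF = 20, and the SF = 10 row that complete() adds has NA in X_F; the test X_F2 == X_F is therefore never TRUE on that row. A base R sketch that looks the values up with match() instead of comparing within a row; the helper name expand_group is hypothetical, and the SF range is built the same way as in the question (from the row-wise maxima):

```r
df <- data.frame(
  X_F = c(5, 10, 20, 200, 5, 10, 15, 25, 30, 60, 200, NA),
  X_A = c(1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, NA),
  Y_F = c(5, 20, 200, NA, 5, 12, 16, 25, 100, NA, NA, NA),
  Y_A = c(1, 2, 3, NA, 1, 2, 3, 4, 5, NA, NA, NA),
  Z_F = c(5, 10, 20, 100, 4, 12, 1, 7, 30, 100, 200, 250),
  Z_A = c(1, 2, 3, 4, 1, 3, 4, 5, 6, 7, 9, 10),
  ID = rep(c("A", "B"), times = c(4, 8))
)

# Hypothetical helper: build the SF sequence for one ID and look up each
# *_A value at the position where the matching *_F equals SF.
expand_group <- function(g) {
  row_max <- pmax(g$X_F, g$Y_F, g$Z_F, na.rm = TRUE)
  SF <- seq(min(row_max), max(row_max))
  data.frame(
    ID = g$ID[1],
    SF = SF,
    X_A2 = g$X_A[match(SF, g$X_F)],  # NA where SF has no match in X_F
    Y_A2 = g$Y_A[match(SF, g$Y_F)],
    Z_A2 = g$Z_A[match(SF, g$Z_F)]
  )
}

out <- do.call(rbind, lapply(split(df, df$ID), expand_group))
print(head(out[out$ID == "A", c("X_A2", "SF")], 10))
```

match() returns only the first matching position, so this sketch assumes each *_F value occurs at most once per ID.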
I want to filter my dataset to keep cases with observations in a specific column. To illustrate:
help <- data.frame(deid = c(5, 5, 5, 5, 5, 12, 12, 12, 12, 17, 17, 17),
score.a = c(NA, 1, 1, 1, NA, NA, NA, NA, NA, NA, 1, NA))
Creates
deid score.a
1 5 NA
2 5 1
3 5 1
4 5 1
5 5 NA
6 12 NA
7 12 NA
8 12 NA
9 12 NA
10 17 NA
11 17 1
12 17 NA
And I want to tell dplyr to keep cases that have any observations in score.a, including the NA values. Thus, I want it to return:
deid score.a
1 5 NA
2 5 1
3 5 1
4 5 1
5 5 NA
6 17 NA
7 17 1
8 17 NA
I ran the code help %>% group_by(deid) %>% filter(score.a > 0) however it pulls out the NAs as well. Thank you for any assistance.
Edit: A similar question was asked here How to remove groups of observation with dplyr::filter()
However, in the answer they use the 'all' condition and this requires use of the 'any' condition.
Try
library(dplyr)
help %>%
group_by(deid) %>%
filter(any(score.a >0 & !is.na(score.a)))
# deid score.a
#1 5 NA
#2 5 1
#3 5 1
#4 5 1
#5 5 NA
#6 17 NA
#7 17 1
#8 17 NA
Or a similar approach with data.table
library(data.table)
setDT(help)[, if(any(score.a>0 & !is.na(score.a))) .SD , deid]
# deid score.a
#1: 5 NA
#2: 5 1
#3: 5 1
#4: 5 1
#5: 5 NA
#6: 17 NA
#7: 17 1
#8: 17 NA
If the condition is to subset 'deid's with all the values in 'score.a' > 0, then the above code can be modified to,
setDT(help)[, if(!all(is.na(score.a)) &
all(score.a[!is.na(score.a)]>0)) .SD , deid]
# deid score.a
#1: 5 NA
#2: 5 1
#3: 5 1
#4: 5 1
#5: 5 NA
#6: 17 NA
#7: 17 1
#8: 17 NA
Suppose one of the 'score.a' in 'deid' group is less than 0,
help$score.a[3] <- -1
the above code would return
setDT(help)[, if(!all(is.na(score.a)) &
all(score.a[!is.na(score.a)]>0)) .SD , deid]
# deid score.a
#1: 17 NA
#2: 17 1
#3: 17 NA
Another option is to sum the scores within each group:
library(dplyr)
help %>% group_by(deid) %>% filter(sum(score.a, na.rm = TRUE) > 0)
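The original filter(score.a > 0) drops the NA rows because NA > 0 evaluates to NA, and filter() keeps only rows where the condition is TRUE. The group-level any() avoids this by producing one TRUE/FALSE per group. A minimal base R sketch of the same idea using ave(); the name help_df is hypothetical, chosen to avoid masking utils::help:

```r
help_df <- data.frame(
  deid = c(5, 5, 5, 5, 5, 12, 12, 12, 12, 17, 17, 17),
  score.a = c(NA, 1, 1, 1, NA, NA, NA, NA, NA, NA, 1, NA)
)

# One value per group, recycled to every row of that group:
# 1 if the group has at least one non-NA score above 0, else 0
has_obs <- ave(help_df$score.a, help_df$deid,
               FUN = function(s) any(s > 0, na.rm = TRUE))

kept <- help_df[has_obs == 1, ]
print(kept)
```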
I am working with a large data frame. I'm trying to create a new vector based on the conditions that exist in two current vectors.
Given the size of the dataset (and its general awesomeness) I'm trying to find a solution using dplyr, which has led me to mutate. I feel like I'm not far off, but I'm just not able to get a solution to stick.
My data frame resembles:
ID X Y
1 1 10 12
2 2 10 NA
3 3 11 NA
4 4 10 12
5 5 11 NA
6 6 NA NA
7 7 NA NA
8 8 11 NA
9 9 10 12
10 10 11 NA
To recreate it:
ID <- c(1:10)
X <- c(10, 10, 11, 10, 11, NA, NA, 11, 10, 11)
Y <- c(12, NA, NA, 12, NA, NA, NA, NA, 12, NA)
data <- data.frame(ID, X, Y)
I'm looking to create a new vector 'Z' from the existing data. If Y > X, then I want it to return the value from Y. If Y is NA, then I'd like it to return the X value. If both are NA, then it should return NA.
My attempt thus far, using the code below, has let me create a new vector meeting the first condition, but not the second.
newData <- data %>%
mutate(Z =
ifelse(Y > X, Y,
ifelse(is.na(Y), X, NA)))
> newData
ID X Y Z
1 1 10 12 12
2 2 10 NA NA
3 3 11 NA NA
4 4 10 12 12
5 5 11 NA NA
6 6 NA NA NA
7 7 NA NA NA
8 8 11 NA NA
9 9 10 12 12
10 10 11 NA NA
I feel like I'm missing something mindblowingly simple. Can someone point me in the right direction?
pmax(..., na.rm=TRUE) is what you are looking for:
library(dplyr)
data <- data.frame(ID = c(1:10),
X = c(10, 10, 11, 10, 11, NA, NA, 11, 10, 11),
Y = c(12, NA, NA, 12, NA, NA, NA, NA, 12, NA))
data %>% mutate(Z = pmax(X, Y, na.rm=TRUE))
# ID X Y Z
#1 1 10 12 12
#2 2 10 NA 10
#3 3 11 NA 11
#4 4 10 12 12
#5 5 11 NA 11
#6 6 NA NA NA
#7 7 NA NA NA
#8 8 11 NA 11
#9 9 10 12 12
#10 10 11 NA 11
The ifelse code can be
data %>%
mutate(Z= ifelse(Y>X & !is.na(Y), Y, X))
# ID X Y Z
#1 1 10 12 12
#2 2 10 NA 10
#3 3 11 NA 11
#4 4 10 12 12
#5 5 11 NA 11
#6 6 NA NA NA
#7 7 NA NA NA
#8 8 11 NA 11
#9 9 10 12 12
#10 10 11 NA 11
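To see why the nested ifelse in the question never reaches its second branch, this base R sketch compares the original ordering with one that tests for NA first. Returning X when Y is not greater than X is an assumption, since the question does not cover that case.

```r
X <- c(10, 10, 11, 10, 11, NA, NA, 11, 10, 11)
Y <- c(12, NA, NA, 12, NA, NA, NA, NA, 12, NA)

# Original order: Y > X is NA wherever Y is NA, and ifelse() returns NA
# for an NA test, so the inner is.na(Y) branch is never used there.
z_original <- ifelse(Y > X, Y, ifelse(is.na(Y), X, NA))

# Reordered: handle the NA case first, then compare.
# Falling back to X when Y <= X is an assumption.
z_fixed <- ifelse(is.na(Y), X, ifelse(Y > X, Y, X))

print(z_original)
print(z_fixed)
```

On this data z_fixed agrees with pmax(X, Y, na.rm = TRUE), which is the shorter way to express the same rule.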