I want to calculate a rolling average. Specifically, I want to fill each row of columns 5 and 6 of mat1 with a rolling average of the prior 3 columns. For column 5 this means an average over columns 2, 3, 4, and for column 6 an average over columns 3, 4, 5. I only want to calculate the average when there are no NAs in the columns over which the average is calculated.
mat1 <- data.frame(matrix(nrow =6, ncol =6))
mat1[1:4,1:4] = rnorm(16,0,1)
mat1[5:6,1:3] = rnorm(6,0,1)
mat1
X1 X2 X3 X4 X5 X6
1 0.40023542 2.05111693 0.695422777 0.9938004 NA NA
2 0.22673283 -0.86433614 0.002620227 0.8464388 NA NA
3 0.88522293 -0.72385091 0.751663489 1.3240476 NA NA
4 0.65373734 1.68385938 0.759718967 -0.4577604 NA NA
5 -0.09442161 0.72186678 0.180312264 NA NA NA
6 0.39930843 0.04311092 2.141065229 NA NA NA
For example, entry [1,5] = mean(c(2.051, 0.69, 0.99)) and entry [1,6] = mean(c(0.69, 0.99, mean(c(2.051, 0.69, 0.99)))).
We can use a for loop to calculate the rolling mean of the previous three columns:
cols <- 5:6
for(i in cols) {
mat1[i] <- rowMeans(mat1[(i-3):(i-1)])
}
mat1
# X1 X2 X3 X4 X5 X6
#1 0.40023542 2.05111693 0.695422777 0.9938004 1.246780036 0.9786677
#2 0.22673283 -0.86433614 0.002620227 0.8464388 -0.005092371 0.2813222
#3 0.88522293 -0.72385091 0.751663489 1.3240476 0.450620060 0.8421104
#4 0.65373734 1.68385938 0.759718967 -0.4577604 0.661939316 0.3212993
#5 -0.09442161 0.72186678 0.180312264 NA NA NA
#6 0.39930843 0.04311092 2.141065229 NA NA NA
This returns NA if any NA value is present in the calculation as mentioned in the comments. If we need to ignore NA values, we can set na.rm = TRUE in rowMeans.
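To make the difference between the two behaviours concrete, here is a minimal sketch on a small made-up data frame (the values and the names strict and lenient are illustrative assumptions, not the data from the question). It runs the same loop once with the default rowMeans and once with na.rm = TRUE.

```r
# Toy data (hypothetical values): row 2 has an NA in X2
toy <- data.frame(X1 = c(1, 2), X2 = c(2, NA), X3 = c(3, 4),
                  X4 = c(4, 6), X5 = NA_real_, X6 = NA_real_)

strict  <- toy  # NA propagates, as asked for in the question
lenient <- toy  # NAs are dropped before averaging

for (i in 5:6) {
  strict[i]  <- rowMeans(strict[(i - 3):(i - 1)])
  lenient[i] <- rowMeans(lenient[(i - 3):(i - 1)], na.rm = TRUE)
}

strict   # row 2 stays NA in X5 and X6
lenient  # row 2 is averaged over the available values
```

Because column 6 depends on the freshly computed column 5, the columns have to be filled in order, which is why a loop is used here rather than a single vectorised call.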
data
mat1 <- structure(list(X1 = c(0.40023542, 0.22673283, 0.88522293, 0.65373734,
-0.09442161, 0.39930843), X2 = c(2.05111693, -0.86433614, -0.72385091,
1.68385938, 0.72186678, 0.04311092), X3 = c(0.695422777, 0.002620227,
0.751663489, 0.759718967, 0.180312264, 2.141065229), X4 = c(0.9938004,
0.8464388, 1.3240476, -0.4577604, NA, NA), X5 = c(NA, NA, NA,
NA, NA, NA), X6 = c(NA, NA, NA, NA, NA, NA)), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6"))
I have a simplified dataframe:
test <- data.frame(
x = c(1,2,3,NA,NA,NA),
y = c(NA, NA, NA, 3, 2, NA),
a = c(NA, NA, NA, NA, NA, TRUE)
)
I want to create a new column rating that holds the number from either column x or column y. The dataset is structured so that whenever there is a numeric value in x, there is an NA in y. If both columns are NA, the value in rating should be NA.
In this case, the expected output is: 1,2,3,3,2,NA
With coalesce:
library(dplyr)
test %>%
mutate(rating = coalesce(x, y))
x y a rating
1 1 NA NA 1
2 2 NA NA 2
3 3 NA NA 3
4 NA 3 NA 3
5 NA 2 NA 2
6 NA NA TRUE NA
With if_else:
library(dplyr)
test %>%
mutate(rating = if_else(is.na(x),
y, x))
x y a rating
1 1 NA NA 1
2 2 NA NA 2
3 3 NA NA 3
4 NA 3 NA 3
5 NA 2 NA 2
6 NA NA TRUE NA
Here are several solutions.
# Input
test <- data.frame(
x = c(1,2,3,NA,NA,NA),
y = c(NA, NA, NA, 3, 2, NA),
a = c(NA, NA, NA, NA, NA, TRUE)
)
# Base R solution
test$rating <- ifelse(!is.na(test$x), test$x,
ifelse(!is.na(test$y), test$y, NA))
# dplyr solution
library(dplyr)
test <- test %>%
mutate(rating = case_when(!is.na(x) ~ x,
!is.na(y) ~ y,
TRUE ~ NA_real_))
# data.table solution
library(data.table)
setDT(test)
test[, rating := ifelse(!is.na(x), x, ifelse(!is.na(y), y, NA))]
Created on 2022-12-23 with reprex v2.0.2
test <- data.frame(
x = c(1,2,3,NA,NA,NA),
y = c(NA, NA, NA, 3, 2, NA),
a = c(NA, NA, NA, NA, NA, TRUE)
)
test$rating <- dplyr::coalesce(test$x, test$y)
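If dplyr is not available, the same "first non-NA value across vectors" idea can be sketched in base R. The helper name coalesce_base below is hypothetical (it is not part of any package), and the sketch assumes all inputs have the same length and type.

```r
# Hypothetical base R stand-in for coalesce():
# walk through the vectors left to right, filling NAs from the next one
coalesce_base <- function(...) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), list(...))
}

test <- data.frame(
  x = c(1, 2, 3, NA, NA, NA),
  y = c(NA, NA, NA, 3, 2, NA),
  a = c(NA, NA, NA, NA, NA, TRUE)
)

test$rating <- coalesce_base(test$x, test$y)
test$rating
```

Since Reduce folds over any number of arguments, the helper also accepts more than two vectors, at the cost of ifelse dropping attributes such as factor levels or date classes.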
I have a data frame df of the following structure:
observation x1 x2 x3 x4
"obs1" NA NA NA 51
"obs1" NA NA NA NA
"obs1" NA 25 NA NA
"obs2" NA NA NA NA
"obs2" NA NA NA NA
"obs2" NA NA NA 56
"obs3" 26 NA NA NA
"obs3" NA 82 NA NA
"obs3" NA NA "x" NA
I want a data frame df2 that, for each observation and each column, keeps the single value that is not NA. The resulting data frame should look like this:
observation x1 x2 x3 x4
"obs1" NA 25 NA 51
"obs2" NA NA NA 56
"obs3" 26 82 "x" NA
I tried to do:
only_value = function(x){
x[which(!is.na(x))]
}
df2 = df %>% lapply(only_value) %>% as.data.frame()
However, this only works if there is the same number of non-NA values for each observation, which is not the case in my example.
A data.table option using fcoalesce may help
library(data.table)
type.convert(setDT(df)[,data.table(t(fcoalesce(asplit(.SD,1)))),observation],as.is = TRUE)
which gives
observation x1 x2 x3 x4
1: obs1 NA 25 <NA> 51
2: obs2 NA NA <NA> 56
3: obs3 26 82 x NA
Data
> dput(df)
structure(list(observation = c("obs1", "obs1", "obs1", "obs2",
"obs2", "obs2", "obs3", "obs3", "obs3"), x1 = c(NA, NA, NA, NA,
NA, NA, 26L, NA, NA), x2 = c(NA, NA, 25L, NA, NA, NA, NA, 82L,
NA), x3 = c(NA, NA, NA, NA, NA, NA, NA, NA, "x"), x4 = c(51L,
NA, NA, NA, NA, 56L, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-9L))
Similarly, you can use coalesce with dplyr
library(dplyr)
df %>%
  group_by(observation) %>%
  summarise(across(x1:x4, ~ do.call(coalesce, as.list(.x))))
which gives
observation x1 x2 x3 x4
* <chr> <int> <int> <chr> <int>
1 obs1 NA 25 <NA> 51
2 obs2 NA NA <NA> 56
3 obs3 26 82 x NA
Change the only_value function to return only the first non-NA value.
only_value = function(x){
x[!is.na(x)][1]
}
Now apply this function by group to columns x1 to x4 :
library(dplyr)
df %>%
group_by(observation) %>%
summarise(across(x1:x4, only_value))
# observation x1 x2 x3 x4
#* <chr> <int> <int> <chr> <int>
#1 obs1 NA 25 NA 51
#2 obs2 NA NA NA 56
#3 obs3 26 82 x NA
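The same "first non-NA value per group" idea can also be sketched without dplyr, using aggregate from base R. The data below is a reduced, made-up version of the example (two observations, two columns), and first_non_na is simply the corrected only_value under a different name.

```r
# Reduced toy version of the data (hypothetical values)
df_small <- data.frame(
  observation = c("obs1", "obs1", "obs2", "obs2"),
  x1 = c(NA, 25L, NA, NA),
  x2 = c(51L, NA, NA, 56L),
  stringsAsFactors = FALSE
)

# Return the first non-NA value, or NA if the group has none
first_non_na <- function(x) x[!is.na(x)][1]

# Apply the helper to every value column within each observation
df2 <- aggregate(df_small[c("x1", "x2")],
                 by = df_small["observation"],
                 FUN = first_non_na)
df2
```

Indexing an empty vector with [1] yields NA, which is what lets groups without any value (x1 for obs2 here) come back as NA instead of causing a length mismatch.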
Consider the following example data:
nad=structure(list(x1 = 1:5, x2 = c(NA, 2L, 2L, NA, 34L), x3 = c(NA,
1L, NA, NA, NA), x4 = c(NA, 2L, 5L, NA, NA), x5 = c(NA, 3L, NA,
NA, NA), x6 = c(NA, 4L, NA, NA, NA)), .Names = c("x1", "x2",
"x3", "x4", "x5", "x6"), class = "data.frame", row.names = c(NA,
-5L))
x1 x2 x3 x4 x5 x6
1 1 NA NA NA NA NA
2 2 2 1 2 3 4
3 3 2 NA 5 NA NA
4 4 NA NA NA NA NA
5 5 34 NA NA NA NA
Usually, to get complete data without NA, I can use this function:
na.omit(nad)
But my problem is a little more complex.
Even though x2 has NAs, I do not want to delete every row where x2 is NA.
A row is valuable when there is a value for x1 and none for x2,
whereas if a row has observations for x1 and x2 but is missing some of the other variables, the row should be deleted.
Therefore, the first and fourth rows should not be deleted.
Rows 3 and 5 should be deleted because they have observations for x1 and x2, but other variables are blank.
The second row is fully complete, so I do not need to delete it.
How can I delete rows with NA using such a condition?
Desired output
x1 x2 x3 x4 x5 x6
1 1 NA NA NA NA NA
2 2 2 1 2 3 4
3 4 NA NA NA NA NA
As a separate but related question, which I may need for analytics:
suppose there is a situation like this
x1 x2 x3 x4 x5 x6
1 1 NA NA NA NA NA
2 2 NA 1 1 1 1
Here the first row has NA for x2 and NA for the other variables,
and the second row has NA for x2, but the other variables are not NA.
In such a case, how can I keep only the rows where x1 has a value, x2 does not, but the other variables do have values?
So maybe you are looking for
nad[!is.na(nad$x1) & is.na(nad$x2) | rowSums(!is.na(nad)) == ncol(nad), ]
# x1 x2 x3 x4 x5 x6
#1 1 NA NA NA NA NA
#2 2 2 1 2 3 4
#4 4 NA NA NA NA NA
This selects rows where x1 has non-NA values and x2 has NA OR all the values in the row are non-NA.
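For the follow-up case raised in the question (x1 present, x2 missing, other variables present), a similar logical index could be used. This is only a sketch: it assumes "other variables have values" means at least one of x3 to x6 is non-NA, and the names nad2 and others are made up for the illustration.

```r
# The two-row example from the follow-up question
nad2 <- data.frame(x1 = 1:2, x2 = c(NA, NA),
                   x3 = c(NA, 1L), x4 = c(NA, 1L),
                   x5 = c(NA, 1L), x6 = c(NA, 1L))

# Every column except x1 and x2
others <- setdiff(names(nad2), c("x1", "x2"))

# x1 present, x2 missing, and at least one other variable present (assumption)
keep <- !is.na(nad2$x1) & is.na(nad2$x2) &
  rowSums(!is.na(nad2[others])) > 0

nad2[keep, ]
```

If instead every other variable must be present, the last condition would become rowSums(!is.na(nad2[others])) == length(others).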
I think you would probably be best off checking whether each row satisfies your conditions. If I understood correctly, something like the following could work:
keep <- apply(nad, 1, function(row) {
# Don't keep data if first column is NA
if (!is.na(row[[1]])) {
sumna <- sum(is.na(row[-1]))
# Only keep if rest is all NA or none is NA
if (sumna == 0 | sumna == length(row) - 1) {
return(TRUE)
} else {
return(FALSE)
}
} else {
return(FALSE)
}
})
nad[keep,]
x1 x2 x3 x4 x5 x6
1 1 NA NA NA NA NA
2 2 2 1 2 3 4
4 4 NA NA NA NA NA
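The row-by-row check above can also be written without apply, by counting NAs per row with rowSums. This is a sketch of the same rule (first column present, and the remaining columns either all NA or all present); the variable names n_na and keep_vec are illustrative.

```r
nad <- data.frame(x1 = 1:5,
                  x2 = c(NA, 2L, 2L, NA, 34L),
                  x3 = c(NA, 1L, NA, NA, NA),
                  x4 = c(NA, 2L, 5L, NA, NA),
                  x5 = c(NA, 3L, NA, NA, NA),
                  x6 = c(NA, 4L, NA, NA, NA))

# Number of NAs in every column except the first, per row
n_na <- rowSums(is.na(nad[-1]))

# Keep if x1 is present and the rest is all NA or not NA at all
keep_vec <- !is.na(nad[[1]]) & (n_na == 0 | n_na == ncol(nad) - 1)

nad[keep_vec, ]
```

On larger data frames this tends to be faster than apply, because apply first converts the data frame to a matrix and then calls the function once per row.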
I am trying to do rowSums but I got zero for the last row and I need it to be "NA".
My df is
a b c sum
1 1 4 7 12
2 2 NA 8 10
3 3 5 NA 8
4 NA NA NA NA
I used this code, based on this link: Sum of two Columns of Data Frame with NA Values
df$sum<-rowSums(df[,c("a", "b", "c")], na.rm=T)
Any advice will be greatly appreciated
For each row check if it is all NA and if so return NA; otherwise, apply sum. We have selected columns a, b and c even though that is all the columns because the poster indicated that there might be additional ones.
sum_or_na <- function(x) if (all(is.na(x))) NA else sum(x, na.rm = TRUE)
transform(df, sum = apply(df[c("a", "b", "c")], 1, sum_or_na))
giving:
a b c sum
1 1 4 7 12
2 2 NA 8 10
3 3 5 NA 8
4 NA NA NA NA
Note
df in reproducible form is assumed to be:
df <- structure(list(a = c(1L, 2L, 3L, NA), b = c(4L, NA, 5L, NA),
c = c(7L, 8L, NA, NA)),
row.names = c("1", "2", "3", "4"), class = "data.frame")
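As an alternative to calling a function on every row, the same result can be sketched with two vectorised rowSums calls: one for the sums and one to find the rows that are entirely NA. The names cols, s and all_na are illustrative.

```r
df <- data.frame(a = c(1L, 2L, 3L, NA),
                 b = c(4L, NA, 5L, NA),
                 c = c(7L, 8L, NA, NA))

cols <- c("a", "b", "c")

# Sum ignoring NAs; rows that are all NA come out as 0 here
s <- rowSums(df[cols], na.rm = TRUE)

# Rows with no non-NA value at all
all_na <- rowSums(!is.na(df[cols])) == 0

# Replace the misleading 0 with NA for those rows
s[all_na] <- NA
df$sum <- s
df
```

This keeps a genuine sum of 0 (for example a row containing 0, 0, NA) distinct from a row with no data.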
I have a data frame with NA values. I want to replace these NAs with a sequence between the values before and after the NAs.
Consider the following example:
# Example data
df <- data.frame(x1 = c(5, NA, NA, 10, NA),
x2 = c(NA, 2, NA, - 10, NA),
x3 = c(10, NA, 15, NA, 20))
df
# x1 x2 x3
# 5 NA 10
# NA 2 NA
# NA NA 15
# 10 -10 NA
# NA NA 20
The NAs between two values should be replaced with a sequence. NAs at the beginning or the end should remain NA:
# Expected output
# x1 x2 x3
# 5 NA 10
# 6.666667 2 12.5
# 8.333333 -4 15
# 10 -10 17.5
# NA NA 20
How could I replace NAs between two values in an automated way?
The na.approx function in zoo does this interpolation very easily.
df <- data.frame(x1 = c(5, NA, NA, 10, NA),
x2 = c(NA, 2, NA, - 10, NA),
x3 = c(10, NA, 15, NA, 20))
df
#> x1 x2 x3
#> 1 5 NA 10
#> 2 NA 2 NA
#> 3 NA NA 15
#> 4 10 -10 NA
#> 5 NA NA 20
zoo::na.approx(df)
#> x1 x2 x3
#> [1,] 5.000000 NA 10.0
#> [2,] 6.666667 2 12.5
#> [3,] 8.333333 -4 15.0
#> [4,] 10.000000 -10 17.5
#> [5,] NA NA 20.0
Created on 2019-02-10 by the reprex package (v0.2.0).
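If installing zoo is not an option, a comparable linear interpolation can be sketched with approx from the stats package that ships with R. The helper name interp_col is hypothetical, and the sketch assumes the rows are equally spaced, so the row index serves as the x axis. Unlike the zoo::na.approx output above, which is a matrix, this version returns a data frame.

```r
df <- data.frame(x1 = c(5, NA, NA, 10, NA),
                 x2 = c(NA, 2, NA, -10, NA),
                 x3 = c(10, NA, 15, NA, 20))

# Linearly interpolate interior NAs of one column;
# leading and trailing NAs stay NA because rule = 1
interp_col <- function(v) {
  ok <- !is.na(v)
  if (sum(ok) < 2) return(v)  # approx() needs at least two known points
  approx(x = which(ok), y = v[ok], xout = seq_along(v), rule = 1)$y
}

out <- df
out[] <- lapply(df, interp_col)
out
```

With rule = 1, approx returns NA for positions outside the range of known points, which matches the requirement that NAs at the beginning or the end remain NA.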
Here is a solution with imputeTS package:
# Example data
df <- data.frame(x1 = c(5, NA, NA, 10, NA),
x2 = c(NA, 2, NA, - 10, NA),
x3 = c(10, NA, 15, NA, 20))
library("imputeTS")
# In imputeTS 3.0 and later this function is named na_interpolation()
na.interpolation(df, option = "linear")
For imputeTS::na.interpolation you can choose a different interpolation method via the parameter option (option = "spline" or option = "stine").