Replace NAs between Values with Sequence - r

I have a data frame with NA values. I want to replace these NAs with a sequence between the values before and after the NAs.
Consider the following example:
# Example data
df <- data.frame(x1 = c(5, NA, NA, 10, NA),
x2 = c(NA, 2, NA, - 10, NA),
x3 = c(10, NA, 15, NA, 20))
df
# x1 x2 x3
# 5 NA 10
# NA 2 NA
# NA NA 15
# 10 -10 NA
# NA NA 20
The NAs between two values should be replaced with a sequence. NAs at the beginning or the end should remain NA:
# Expected output
# x1 x2 x3
# 5 NA 10
# 6.666667 2 12.5
# 8.333333 -4 15
# 10 -10 17.5
# NA NA 20
How can I replace NAs between two values in an automated way?

The na.approx function in zoo does this interpolation very easily.
df <- data.frame(x1 = c(5, NA, NA, 10, NA),
x2 = c(NA, 2, NA, - 10, NA),
x3 = c(10, NA, 15, NA, 20))
df
#> x1 x2 x3
#> 1 5 NA 10
#> 2 NA 2 NA
#> 3 NA NA 15
#> 4 10 -10 NA
#> 5 NA NA 20
zoo::na.approx(df)
#> x1 x2 x3
#> [1,] 5.000000 NA 10.0
#> [2,] 6.666667 2 12.5
#> [3,] 8.333333 -4 15.0
#> [4,] 10.000000 -10 17.5
#> [5,] NA NA 20.0
Created on 2019-02-10 by the reprex package (v0.2.0).
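If you would rather not depend on zoo, the same linear interpolation can be sketched in base R with approx. This is only a sketch under stated assumptions: the helper name interp_col is made up for illustration, rows are treated as equally spaced, and columns with fewer than two non-NA values are returned unchanged. With rule = 1, approx returns NA outside the range of the known points, so leading and trailing NAs stay NA.

```r
# Example data from the question
df <- data.frame(x1 = c(5, NA, NA, 10, NA),
                 x2 = c(NA, 2, NA, -10, NA),
                 x3 = c(10, NA, 15, NA, 20))

# Hypothetical helper: linearly interpolate interior NAs of one numeric vector
interp_col <- function(v) {
  ok <- !is.na(v)
  # approx needs at least two known points; otherwise leave the column alone
  if (sum(ok) < 2) return(v)
  # rule = 1: positions outside the known range are returned as NA
  approx(x = which(ok), y = v[ok], xout = seq_along(v), rule = 1)$y
}

# Apply the helper column by column
res <- as.data.frame(lapply(df, interp_col))
res
```

The result is a data frame rather than the matrix that zoo::na.approx returns for this input.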

Here is a solution with the imputeTS package:
# Example data
df <- data.frame(x1 = c(5, NA, NA, 10, NA),
x2 = c(NA, 2, NA, - 10, NA),
x3 = c(10, NA, 15, NA, 20))
library("imputeTS")
na.interpolation(df, option = "linear")
For imputeTS::na.interpolation you can choose a different interpolation method via the option parameter (option = "spline" or option = "stine"). Note that in imputeTS 3.0 and later the function is named na_interpolation.

Related

How to change values across 1 row based on values in a column in R?

I have a lot of columns in 1 dataframe that identify different timepoints of the same variable. Basically, within my data, if there's no response at timepoint X-1, there will be no response at time point X or beyond (after an NA appears in a row, it will continue). I currently have a column that shows which row the last response came from and what that response is. The dataframe currently looks like this:
id X1 X2 X3 X4 X_final X_final_location
1 1 5 5 6 NA 6 X3
2 2 4 NA NA NA 4 X1
3 3 7 1 3 5 5 X4
4 4 8 2 4 2 2 X4
5 5 1 5 NA NA 5 X2
6 6 5 7 7 7 7 X4
My goal is to be able to conduct a regression using the last response of each row as the outcome variable. However, I don't want it to repeat twice in the "X_final" column and also in the column that the response actually comes from. Therefore, I am hoping to find a way to put a "." in for the cell where that value originally came from so it looks like this:
id X1 X2 X3 X4 X_final X_final_location
1 1 5 5 6 NA 6 X3
2 2 . <NA> NA NA 4 X1
3 3 7 1 3 5 5 X4
4 4 8 2 4 2 2 X4
5 5 1 . NA NA 5 X2
6 6 5 7 7 7 7 X4
Any suggestions would be appreciated - thank you!
Another method uses the locations you already have in $X_final_location. As mentioned in the comments on the question, NA is preferable to "." if the goal is regression analysis, because it keeps the columns numeric.
data_orig <- data.frame(
id = c(1, 2, 3, 4, 5, 6),
X1 = c(5, 4, 7, 8, 1, 5),
X2 = c(5, NA, 1, 2, 5, 7),
X3 = c(6, NA, 3, 4, NA, 7),
X4 = c(NA, NA, 5, 2, NA, 7),
X_final = c(6, 4, 5, 2, 5, 7),
X_final_location = c("X3", "X1", "X4", "X4", "X2", "X4")
)
data_new <- data_orig
for (i in seq_len(nrow(data_new))) {
data_new[i, data_new$X_final_location[i]] <- NA
}
data_new
# id X1 X2 X3 X4 X_final X_final_location
# 1 1 5 5 NA NA 6 X3
# 2 2 NA NA NA NA 4 X1
# 3 3 7 1 3 NA 5 X4
# 4 4 8 2 4 NA 2 X4
# 5 5 1 NA NA NA 5 X2
# 6 6 5 7 7 NA 7 X4
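The loop above can also be written without explicit iteration by using a two-column (row, column) index matrix. The following is a minimal sketch, not the original answer's code: the names cols, m and idx are introduced here for illustration, and it assumes every entry of X_final_location names one of the columns X1 to X4.

```r
data_orig <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  X1 = c(5, 4, 7, 8, 1, 5),
  X2 = c(5, NA, 1, 2, 5, 7),
  X3 = c(6, NA, 3, 4, NA, 7),
  X4 = c(NA, NA, 5, 2, NA, 7),
  X_final = c(6, 4, 5, 2, 5, 7),
  X_final_location = c("X3", "X1", "X4", "X4", "X2", "X4")
)

cols <- c("X1", "X2", "X3", "X4")   # assumed set of timepoint columns
m <- as.matrix(data_orig[, cols])   # numeric matrix of the timepoints

# One (row, column) pair per row: the column is looked up by name
idx <- cbind(seq_len(nrow(m)), match(data_orig$X_final_location, cols))
m[idx] <- NA                        # blank out the cell each final value came from

data_new <- data_orig
data_new[, cols] <- m
data_new
```

Working on a numeric matrix keeps the timepoint columns numeric, which matters if they are used in a regression later.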
One way to do this (using NA instead of "." to preserve the data type):
match finds the position of the first NA, and replace sets the value at that position - 1 (the previous one) to NA.
apply(data, 1, \(x) ...) applies that function to each row. Finally, t transposes the result (since apply returns the per-row results as columns). Rows without any NA (here rows 3, 4 and 6) are left unchanged, because match returns NA for them.
data = data.frame(id = 1:6, X1 = c(5L, 4L, 7L, 8L, 1L, 5L), X2 = c(5L,
NA, 1L, 2L, 5L, 7L), X3 = c(6L, NA, 3L, 4L, NA, 7L), X4 = c(NA,
NA, 5L, 2L, NA, 7L), X_final = c(6L, 4L, 5L, 2L, 5L, 7L), X_final_location = c("X3",
"X1", "X4", "X4", "X2", "X4"))
data[,2:5] <- t(apply(data[,2:5], 1 , function(x) replace(x, match(NA, x) - 1, NA)))
data
#> id X1 X2 X3 X4 X_final X_final_location
#> 1 1 5 5 NA NA 6 X3
#> 2 2 NA NA NA NA 4 X1
#> 3 3 7 1 3 5 5 X4
#> 4 4 8 2 4 2 2 X4
#> 5 5 1 NA NA NA 5 X2
#> 6 6 5 7 7 7 7 X4
Another way using split (grouping by row):
split(data, row.names(data)) <-
lapply(split(data, row.names(data)), \(x) replace(x, x$X_final_location, "."))

Forming a new column from whichever of two columns isn’t NA [duplicate]

I have a simplified dataframe:
test <- data.frame(
x = c(1,2,3,NA,NA,NA),
y = c(NA, NA, NA, 3, 2, NA),
a = c(NA, NA, NA, NA, NA, TRUE)
)
I want to create a new column rating that takes its value from either column x or column y. The dataset is structured such that whenever there is a numeric value in x, y is NA. If both columns are NA, the value in rating should be NA.
In this case, the expected output is: 1,2,3,3,2,NA
With coalesce:
library(dplyr)
test %>%
mutate(rating = coalesce(x, y))
x y a rating
1 1 NA NA 1
2 2 NA NA 2
3 3 NA NA 3
4 NA 3 NA 3
5 NA 2 NA 2
6 NA NA TRUE NA
With if_else:
library(dplyr)
test %>%
mutate(rating = if_else(is.na(x),
y, x))
x y a rating
1 1 NA NA 1
2 2 NA NA 2
3 3 NA NA 3
4 NA 3 NA 3
5 NA 2 NA 2
6 NA NA TRUE NA
Here are several solutions.
# Input
test <- data.frame(
x = c(1,2,3,NA,NA,NA),
y = c(NA, NA, NA, 3, 2, NA),
a = c(NA, NA, NA, NA, NA, TRUE)
)
# Base R solution
test$rating <- ifelse(!is.na(test$x), test$x,
ifelse(!is.na(test$y), test$y, NA))
# dplyr solution
library(dplyr)
test <- test %>%
mutate(rating = case_when(!is.na(x) ~ x,
!is.na(y) ~ y,
TRUE ~ NA_real_))
# data.table solution
library(data.table)
setDT(test)
test[, rating := ifelse(!is.na(x), x, ifelse(!is.na(y), y, NA))]
Created on 2022-12-23 with reprex v2.0.2
test <- data.frame(
x = c(1,2,3,NA,NA,NA),
y = c(NA, NA, NA, 3, 2, NA),
a = c(NA, NA, NA, NA, NA, TRUE)
)
test$rating <- dplyr::coalesce(test$x, test$y)
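For comparison, the same "take x, fall back to y" logic can be sketched in base R with plain subassignment, without ifelse or any package. The variable names rating and miss are introduced here for illustration only.

```r
test <- data.frame(
  x = c(1, 2, 3, NA, NA, NA),
  y = c(NA, NA, NA, 3, 2, NA),
  a = c(NA, NA, NA, NA, NA, TRUE)
)

rating <- test$x            # start from x
miss <- is.na(rating)       # positions where x has no value
rating[miss] <- test$y[miss]  # fill those positions from y (may still be NA)
test$rating <- rating
test$rating
```

Where both x and y are NA, the filled-in value is itself NA, so the last row stays NA as required.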

In R data frame for each set of rows and column use value that is not na

I have a data frame df of the following structure:
observation x1 x2 x3 x4
"obs1" NA NA NA 51
"obs1" NA NA NA NA
"obs1" NA 25 NA NA
"obs2" NA NA NA NA
"obs2" NA NA NA NA
"obs2" NA NA NA 56
"obs3" 26 NA NA NA
"obs3" NA 82 NA NA
"obs3" NA NA "x" NA
I want a data frame df2 that, for each observation and for each column, takes the one value that is not NA. The resulting data frame should look like this:
observation x1 x2 x3 x4
"obs1" NA 25 NA 51
"obs2" NA NA NA 56
"obs3" 26 82 "x" NA
I tried to do:
only_value = function(x){
x[which(!is.na(x))]
}
df2 = df %>% lapply(only_value) %>% as.data.frame()
However, this only works if there is the same number of values for each observation. This is not the case in my example.
A data.table option using fcoalesce may help
type.convert(setDT(df)[,data.table(t(fcoalesce(asplit(.SD,1)))),observation],as.is = TRUE)
which gives
observation x1 x2 x3 x4
1: obs1 NA 25 <NA> 51
2: obs2 NA NA <NA> 56
3: obs3 26 82 x NA
Data
> dput(df)
structure(list(observation = c("obs1", "obs1", "obs1", "obs2",
"obs2", "obs2", "obs3", "obs3", "obs3"), x1 = c(NA, NA, NA, NA,
NA, NA, 26L, NA, NA), x2 = c(NA, NA, 25L, NA, NA, NA, NA, 82L,
NA), x3 = c(NA, NA, NA, NA, NA, NA, NA, NA, "x"), x4 = c(51L,
NA, NA, NA, NA, 56L, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-9L))
Similarly, you can use coalesce with dplyr
df %>%
group_by(observation) %>%
summarise(across(x1:x4,~do.call(coalesce,as.list(.x))))
which gives
observation x1 x2 x3 x4
* <chr> <int> <int> <chr> <int>
1 obs1 NA 25 <NA> 51
2 obs2 NA NA <NA> 56
3 obs3 26 82 x NA
Change the only_value function to return only 1st non-NA value.
only_value = function(x){
x[!is.na(x)][1]
}
Now apply this function by group to columns x1 to x4 :
library(dplyr)
df %>%
group_by(observation) %>%
summarise(across(x1:x4, only_value))
# observation x1 x2 x3 x4
#* <chr> <int> <int> <chr> <int>
#1 obs1 NA 25 NA 51
#2 obs2 NA NA NA 56
#3 obs3 26 82 x NA
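The same group-wise "first non-NA value" idea can be sketched in base R with aggregate, reusing the only_value helper defined above. This is an illustrative alternative rather than part of the original answers; the result name df2 follows the question, and the data frame is rebuilt here from the dput output so the example is self-contained.

```r
df <- data.frame(
  observation = rep(c("obs1", "obs2", "obs3"), each = 3),
  x1 = c(NA, NA, NA, NA, NA, NA, 26L, NA, NA),
  x2 = c(NA, NA, 25L, NA, NA, NA, NA, 82L, NA),
  x3 = c(NA, NA, NA, NA, NA, NA, NA, NA, "x"),
  x4 = c(51L, NA, NA, NA, NA, 56L, NA, NA, NA)
)

# First non-NA value of a vector, or NA if there is none
only_value <- function(x) {
  x[!is.na(x)][1]
}

# Apply only_value to every column within each observation group
df2 <- aggregate(df[c("x1", "x2", "x3", "x4")],
                 by = list(observation = df$observation),
                 FUN = only_value)
df2
```

Because only_value always returns exactly one element, each group collapses to a single row even when a column has no non-NA value in that group.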

Changing NA value if matching a defined length

I have this kind of data:
daynight
[1] NA NA NA NA 2 1 NA NA
I want R to detect any run of at least x consecutive NAs and replace it with another value.
For example, if x = 3 and the replacement value is 3, the output should be:
daynight
[1] 3 3 3 3 2 1 NA NA
Would you have any ideas?
We can use rle
daynight <- c(NA, NA, NA, NA ,2 ,1, NA, NA)
x <- 3
r <- 3
daynight[with(rle(is.na(daynight)), rep(lengths >= x & values, lengths))] <- r
daynight
#[1] 3 3 3 3 2 1 NA NA
Taking another example:
daynight <- c(NA, NA, NA, 3,2,1, NA, NA, 1, NA, NA, NA, 1, NA, NA)
daynight[with(rle(is.na(daynight)), rep(lengths >= x & values, lengths))] <- r
daynight
#[1] 3 3 3 3 2 1 NA NA 1 3 3 3 1 NA NA
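The rle one-liner can be wrapped in a small reusable function. This is a minimal sketch; the function name fill_na_runs and its argument names are hypothetical and chosen here for readability.

```r
# Replace every run of at least min_len consecutive NAs in v with value
fill_na_runs <- function(v, min_len, value) {
  r <- rle(is.na(v))                       # runs of NA / non-NA
  long_na <- r$values & r$lengths >= min_len  # NA runs that are long enough
  v[rep(long_na, r$lengths)] <- value      # expand back to element positions
  v
}

out1 <- fill_na_runs(c(NA, NA, NA, NA, 2, 1, NA, NA), min_len = 3, value = 3)
out2 <- fill_na_runs(c(NA, NA, NA, 1), min_len = 3, value = 3)
out1
out2
```

Passing the threshold and replacement as arguments avoids relying on global variables named x and r.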
And here is another solution using the zoo package
library(zoo)
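replace_consecutive_NAs <- function(x, nrNAs = 3, replaceBy = nrNAs){
  # keep x intact; work on a separate 0/1 indicator of NA positions
  isNA <- as.numeric(is.na(x))
  # 1 wherever a window of nrNAs consecutive NAs starts
  starts <- rollapply(isNA, nrNAs, prod, fill = 0, align = "left")
  # a position is inside a long-enough run if such a window starts at it
  # or at one of the nrNAs - 1 positions before it
  indexes <- rollapply(c(rep(0, nrNAs - 1), starts), nrNAs, max) > 0
  x[indexes] <- replaceBy
  x
}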
x <- c(NA, NA, NA, NA ,2 ,1, NA, NA)
replace_consecutive_NAs(x, 3, 999)
[1] 999 999 999 999 2 1 NA NA

Calculate rolling average in matrix

I want to calculate a rolling average. Specifically, I want to fill each row of columns 5 and 6 of mat1 with a rolling average of the prior 3 columns. For column 5 this implies an average over columns 2, 3, 4 and for column 6, the average over columns 3, 4, 5. I only want to calculate the average when there are no NAs in the columns over which the average is calculated.
mat1 <- data.frame(matrix(nrow =6, ncol =6))
mat1[1:4,1:4] = rnorm(16,0,1)
mat1[5:6,1:3] = rnorm(6,0,1)
mat1
X1 X2 X3 X4 X5 X6
1 0.40023542 2.05111693 0.695422777 0.9938004 NA NA
2 0.22673283 -0.86433614 0.002620227 0.8464388 NA NA
3 0.88522293 -0.72385091 0.751663489 1.3240476 NA NA
4 0.65373734 1.68385938 0.759718967 -0.4577604 NA NA
5 -0.09442161 0.72186678 0.180312264 NA NA NA
6 0.39930843 0.04311092 2.141065229 NA NA NA
For example, entry [1,5] = mean(c(2.051, 0.69, 0.99)) and entry [1,6] = mean(c(0.69, 0.99, mean(c(2.051, 0.69, 0.99)))).
We can use for loop to calculate rolling mean of last three columns
cols <- 5:6
for(i in cols) {
mat1[i] <- rowMeans(mat1[(i-3):(i-1)])
}
mat1
# X1 X2 X3 X4 X5 X6
#1 0.40023542 2.05111693 0.695422777 0.9938004 1.246780036 0.9786677
#2 0.22673283 -0.86433614 0.002620227 0.8464388 -0.005092371 0.2813222
#3 0.88522293 -0.72385091 0.751663489 1.3240476 0.450620060 0.8421104
#4 0.65373734 1.68385938 0.759718967 -0.4577604 0.661939316 0.3212993
#5 -0.09442161 0.72186678 0.180312264 NA NA NA
#6 0.39930843 0.04311092 2.141065229 NA NA NA
This returns NA if any NA value is present in the calculation as mentioned in the comments. If we need to ignore NA values, we can set na.rm = TRUE in rowMeans.
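To make the difference concrete, here is a small sketch on made-up data (the data frame m and the names strict and lenient are illustrative only) comparing the default behaviour of rowMeans with na.rm = TRUE.

```r
# Toy data: row 1 complete, row 2 has one NA, row 3 is entirely NA
m <- data.frame(X1 = c(1, 2, NA),
                X2 = c(2, 4, NA),
                X3 = c(3, NA, NA))

strict  <- rowMeans(m)                # NA as soon as any value in the row is NA
lenient <- rowMeans(m, na.rm = TRUE)  # averages the values that are present

strict
lenient
```

Note that with na.rm = TRUE a row consisting only of NAs yields NaN rather than NA, so the default is the safer choice when the average should only be computed on complete windows.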
data
mat1 <- structure(list(X1 = c(0.40023542, 0.22673283, 0.88522293, 0.65373734,
-0.09442161, 0.39930843), X2 = c(2.05111693, -0.86433614, -0.72385091,
1.68385938, 0.72186678, 0.04311092), X3 = c(0.695422777, 0.002620227,
0.751663489, 0.759718967, 0.180312264, 2.141065229), X4 = c(0.9938004,
0.8464388, 1.3240476, -0.4577604, NA, NA), X5 = c(NA, NA, NA,
NA, NA, NA), X6 = c(NA, NA, NA, NA, NA, NA)), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6"))

Resources