Replacing NA values with a specific average - R

I have a data.frame with columns and rows. How could I replace each NA value so that it becomes the average of the first non-NA value before and after that cell in the same column?
For example:
1. 1 2 3
2. 4 NA 7
3. 9 NA 8
4. 1 5 6
I need the first NA to be (5+2)/2 = 3.5
and the second to be (3.5+5)/2 = 4.25

Let's create some sample data and convert it to a data.table:
require(data.table)
require(zoo)
dat <- data.frame(a = c(1, 2, NA, 4))
setDT(dat)
Now, using the zoo::na.approx function, we can impute the missing values by linear interpolation:
dat[, newA := na.approx(a, rule = 2)] # rule = 2 extends the nearest observed value into leading/trailing NAs
Output:
a newA
1: 1 1
2: 2 2
3: NA 3
4: 4 4
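For the original multi-column example, a minimal sketch (assuming the posted table is stored in a data.table named dt, a name not used in the question) could interpolate every column at once. Note that na.approx does straight linear interpolation, so the two consecutive NAs between 2 and 5 become 3 and 4 rather than the recursively averaged 3.5 and 4.25:
dt <- data.table(x = c(1, 4, 9, 1), y = c(2, NA, NA, 5), z = c(3, 7, 8, 6))
dt[, names(dt) := lapply(.SD, na.approx, rule = 2)] # interpolate each column
dt
#    x y z
# 1: 1 2 3
# 2: 4 3 7
# 3: 9 4 8
# 4: 1 5 6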

Related

Skip NAs when using Reduce() in data.table

I'm trying to get the cumulative sum of data.table rows and was able to find this code in another stackoverflow post:
devDF1[,names(devDF1):=Reduce(`+`,devDF1,accumulate=TRUE)]
It does what I need it to do. However, when it comes across a row that starts with an NA, it replaces every element in that row with NA (instead of the cumulative sum of the other elements in the row). I don't want to replace the NAs with 0s, because I need this output for further processing and don't want the same final cumulative sum duplicated across rows. Is there any way I can adjust that piece of code to ignore the NAs? Or is there alternative code that gets the cumulative sum of the rows in a data.table while ignoring NAs?
Consider this example:
library(data.table)
dt <- data.table(a = 1:5, b = c(3, NA, 1, 2, 4), c = c(NA, 1, NA, 3, 4))
dt
# a b c
#1: 1 3 NA
#2: 2 NA 1
#3: 3 1 NA
#4: 4 2 3
#5: 5 4 4
If you want NA positions to simply carry the previous cumulative value (i.e. treat NA as 0 in the sum), you can use:
dt[, names(dt) := lapply(.SD, function(x) cumsum(replace(x, is.na(x), 0))),
   .SDcols = names(dt)]
dt
# a b c
#1: 1 3 0
#2: 3 3 1
#3: 6 4 1
#4: 10 6 4
#5: 15 10 8
If you want to keep NA as NA:
dt[, names(dt) := lapply(.SD, function(x) {
  x1 <- cumsum(replace(x, is.na(x), 0))
  x1[is.na(x)] <- NA
  x1
}), .SDcols = names(dt)]
dt
# a b c
#1: 1 3 NA
#2: 3 NA 1
#3: 6 4 NA
#4: 10 6 4
#5: 15 10 8
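If you specifically want to keep the row-wise accumulation from the original Reduce() call, a minimal sketch of the same replace-then-restore idea (using a small hypothetical devDF1, since the real data isn't shown) might look like this:
library(data.table)
devDF1 <- data.table(x = c(1, NA, 3), y = c(2, 5, NA), z = c(4, 1, 2)) # hypothetical data
zeroed <- lapply(devDF1, function(x) replace(x, is.na(x), 0)) # treat NA as 0 for the sums
devDF1[, names(devDF1) := Reduce(`+`, zeroed, accumulate = TRUE)]
devDF1
#    x y z
# 1: 1 3 7
# 2: 0 5 6
# 3: 3 3 5
Positions that were originally NA can then be set back to NA afterwards, as in the column-wise version above.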

Iterate over a column ignoring but retaining NA values in R

I have a time-series data frame in R that has a column, V1, which consists of integers with a few NAs interspersed throughout. I want to iterate over this column and subtract from each value of V1 the value one time step earlier. However, I want to ignore the NA values in V1 and use the last non-NA value in the subtraction. If the current value of V1 is NA, then the difference should be NA. See below for an example:
V1 <- c(1, 3, 4, NA, NA, 6, 9, NA, 10)
time <- 1:length(V1)
dat <- data.frame(time = time, V1 = V1)
lag_diff <- c(NA, 2, 1, NA, NA, 2, 3, NA, 1) # The result I want
diff(dat$V1) # Not the result I want
I'd prefer not to do this with loops because I have hundreds of data frames, each with >10,000 rows.
My first thought to solve this was to filter out the NA rows, perform the iterative difference calculation and then reinsert the rows that were filtered out but I can't think of a way to do that. It doesn't seem very "tidy" to do it that way either and I'm not sure it would be faster than looping. Any help is appreciated, bonus points if the solution uses tidyverse functions.
dat[!is.na(dat$V1), 'lag_diff'] <- c(NA, diff(dat[!is.na(dat$V1), 'V1']))
# time V1 lag_diff
# 1 1 1 NA
# 2 2 3 2
# 3 3 4 1
# 4 4 NA NA
# 5 5 NA NA
# 6 6 6 2
# 7 7 9 3
# 8 8 NA NA
# 9 9 10 1
Or with data.table (same result)
library(data.table)
setDT(dat)
dat[!is.na(V1), lag_diff := V1 - shift(V1)]
# time V1 lag_diff
# 1: 1 1 NA
# 2: 2 3 2
# 3: 3 4 1
# 4: 4 NA NA
# 5: 5 NA NA
# 6: 6 6 2
# 7: 7 9 3
# 8: 8 NA NA
# 9: 9 10 1
A tidyverse version, just in case. It does need a filter, though:
library(dplyr)
dat %>%
  filter(!is.na(V1)) %>%
  mutate(diff = V1 - lag(V1)) %>%
  right_join(dat, by = c("time", "V1"))
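A compact alternative (a sketch, not from the original answers, assuming dat as created in the question) is to carry the last non-NA value forward with zoo::na.locf before taking the lagged difference; rows where V1 itself is NA stay NA automatically:
library(dplyr)
library(zoo)
dat %>%
  mutate(lag_diff = V1 - lag(na.locf(V1, na.rm = FALSE)))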

R previous index per group

I am trying to set the previous observation per group to NA, if a certain condition applies.
Assume I have the following data.table:
DT = data.table(group=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2), y=c(1,3,6,6,3,1,1,3,6), a=1:9, b=9:1)
and I am using the simple condition:
DT[y == 6]
How can I set the previous rows of DT[y == 6] within DT to NA, namely rows 2 and 8 of DT? That is, how do I set the respective previous row, per group, to NA?
Please note: from DT we can see that there are 3 rows where y is equal to 6, but for group a (row 4) I do not want to set the previous row to NA, as the previous row belongs to a different group.
In other words, what I want is the previous index of certain elements in a data.table. Is that possible? It would also be interesting if one could go back further than one period. Thanks for any hints.
You can find the row indices where current y is not 6 and next row is 6, then set the whole row to NA:
DT[shift(y, type = "lead") == 6 & y != 6,
   (names(DT)) := lapply(.SD, function(x) NA)]
DT
output:
group v y a b
1: b 1 1 1 9
2: <NA> NA NA NA NA
3: b 1 6 3 7
4: a 2 6 4 6
5: a 2 3 5 5
6: a 1 1 6 4
7: c 1 1 7 3
8: <NA> NA NA NA NA
9: c 2 6 9 1
As usual, Frank comments with a more succinct version:
DT[shift(y, type="lead")==6 & y!=6, names(DT) := NA]
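To make the lookup explicitly group-aware, and to go back more than one row, a sketch (starting from a freshly created DT, with a hypothetical step size k) could compute the target row indices per group via .I and then blank out those rows:
k <- 1 # how many rows before each y == 6 (increase to go further back)
prev_idx <- DT[, {
  pos <- which(y == 6) - k # within-group positions k rows before a 6
  .I[pos[pos >= 1]]        # keep only positions that stay inside the group
}, by = group]$V1
DT[prev_idx, names(DT) := NA]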

Replacing NAs between two rows with identical values in a specific column

I have a dataframe with multiple columns and I want to replace NAs in one column if they are between two rows with an identical number. Here is my data:
v1 v2
1 2
NA 3
NA 2
1 1
NA 7
NA 2
3 1
I basically want to start from the beginning of the data frame and replace the NAs in column v1 with the previous non-NA value if the next non-NA value matches that previous one. That being said, I want the result to be like this:
v1 v2
1 2
1 3
1 2
1 1
NA 7
NA 2
3 1
As you can see, rows 2 and 3 are replaced with the number 1 because rows 1 and 4 had an identical number, but rows 5 and 6 stay the same because the non-NA values in rows 4 and 7 are not identical. I have been tweaking this a lot but so far no luck. Thanks.
Here is an idea using the zoo package. We basically fill the NAs in both directions and set to NA the values that are not equal between those two directions.
library(zoo)
df <- data.frame(v1 = c(1, NA, NA, 1, NA, NA, 3), v2 = c(2, 3, 2, 1, 7, 2, 1)) # the data from the question
ind1 <- na.locf(df$v1, fromLast = TRUE)
df$v1 <- na.locf(df$v1)
df$v1[df$v1 != ind1] <- NA
which gives,
v1 v2
1 1 2
2 1 3
3 1 2
4 1 1
5 NA 7
6 NA 2
7 3 1
Here is a similar approach in the tidyverse, using fill:
library(tidyverse)
df %>%
  mutate(vNew = v1) %>%
  fill(vNew, .direction = 'up') %>%
  fill(v1) %>%
  mutate(v1 = replace(v1, v1 != vNew, NA)) %>%
  select(-vNew)
# v1 v2
#1 1 2
#2 1 3
#3 1 2
#4 1 1
#5 NA 7
#6 NA 2
#7 3 1
Here is a base R solution; the logic is almost the same as Sotos's:
replace_na <- function(x){
  f <- function(x) ave(x, cumsum(!is.na(x)), FUN = function(x) x[1])
  y <- f(x)
  yp <- rev(f(rev(x)))
  ifelse(!is.na(y) & y == yp, y, x)
}
df$v1 <- replace_na(df$v1)
test:
> replace_na(c(1, NA, NA, 1, NA, NA, 3))
[1] 1 1 1 1 NA NA 3
I could use the na.locf function to do this. Basically, I use the normal na.locf function from the zoo package to replace each NA with the latest previous non-NA value and store the result in one column. Using the same function but setting fromLast = TRUE, the NAs are replaced with the next non-NA value and stored in another column. I then check these two columns, and wherever the results in a row do not match, I set the value back to NA.
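A minimal sketch of that description (assuming the question's data is in df, as constructed above, and keeping the two helper columns only temporarily):
library(zoo)
df$fwd <- na.locf(df$v1, na.rm = FALSE)                  # carry the last non-NA value forward
df$bwd <- na.locf(df$v1, fromLast = TRUE, na.rm = FALSE) # carry the next non-NA value backward
df$v1 <- ifelse(!is.na(df$fwd) & df$fwd == df$bwd, df$fwd, df$v1)
df$fwd <- df$bwd <- NULL                                 # drop the helper columns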

Drop data frame rows if NA for certain variables referred to by name in dplyr

I would like to drop entire rows from a data frame if they are all NA, but only across a certain subset of columns (which are named in a sequence and all start with "X").
This is different from other SO answers I found because I cannot refer to each column manually by name (there are too many variables), and I do not want to drop rows only when they are completely NA (rather, when a certain set of variables is completely NA).
So turn sample data:
data1 <- as.data.frame(rbind(c(1,2,3), c(1, NA, 4), c(4,6,7), c(1, NA, NA), c(4, 8, NA)))
colnames(data1) <- c("Z","X1","X2")
data1
Z X1 X2
1 1 2 3
2 1 NA 4
3 4 6 7
4 1 NA NA
5 4 8 NA
into:
Z X1 X2
1 1 2 3
2 1 NA 4
3 4 6 7
4 4 8 NA
I.e. drop the row if both X1 and X2 (all of the X sequence) are NA.
In this example there are only two variables(X1:X2)for ease but in reality I have closer to 100 of this sequence and many other important variables that may or may not be NA. I would prefer to do so in dplyr with filter but other solutions would be appreciated as well.
I feel like:
data1 %>% filter(!is.na(all(X1:X2)))
or something similar is close, but R does not accept the X1:X2 sequence reference inside filter.
You can use rowSums + select + starts_with + filter:
library(dplyr)
data1 %>%
  filter(rowSums(!is.na(select(., starts_with("X")))) != 0)
# Z X1 X2
#1 1 2 3
#2 1 NA 4
#3 4 6 7
#4 4 8 NA
A base R solution using apply would be:
drop <- which(apply(data1[,startsWith(colnames(data1), "X")], 1, function(x) all(is.na(x))))
data1[-drop,]
# Z X1 X2
#1 1 2 3
#2 1 NA 4
#3 4 6 7
#5 4 8 NA
Another option using rowSums:
drop <- which(rowSums(is.na(data1[, c("X1", "X2")])) >= 2)
data1[-drop, ]
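With a newer dplyr (1.0.4 or later), a sketch of the same filter can use if_all / if_any over the X columns; this is an alternative to the answers above rather than what they used:
library(dplyr)
data1 %>%
  filter(!if_all(starts_with("X"), is.na))
# equivalently: filter(if_any(starts_with("X"), ~ !is.na(.x)))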
