Iterate over a column ignoring but retaining NA values in R - r

I have a time series data frame in R with a column, V1, of integers with a few NAs interspersed throughout. I want to iterate over this column and subtract the previous time step's value of V1 from the current one. However, I want to skip the NA values in V1 and use the last non-NA value in the subtraction. If the current value of V1 is NA, the difference should be NA. See below for an example:
V1 <- c(1, 3, 4, NA, NA, 6, 9, NA, 10)
time <- 1:length(V1)
dat <- data.frame(time = time,
V1 = V1)
lag_diff <- c(NA, 2, 1, NA, NA, 2, 3, NA, 1) # The result I want
diff(dat$V1) # Not the result I want
I'd prefer not to do this with loops because I have hundreds of data frames, each with >10,000 rows.
My first thought to solve this was to filter out the NA rows, perform the iterative difference calculation and then reinsert the rows that were filtered out but I can't think of a way to do that. It doesn't seem very "tidy" to do it that way either and I'm not sure it would be faster than looping. Any help is appreciated, bonus points if the solution uses tidyverse functions.

dat[!is.na(dat$V1), 'lag_diff'] <- c(NA, diff(dat[!is.na(dat$V1), 'V1']))
# time V1 lag_diff
# 1 1 1 NA
# 2 2 3 2
# 3 3 4 1
# 4 4 NA NA
# 5 5 NA NA
# 6 6 6 2
# 7 7 9 3
# 8 8 NA NA
# 9 9 10 1
Or with data.table (same result)
library(data.table)
setDT(dat)
dat[!is.na(V1), lag_diff := V1 - shift(V1)]
# time V1 lag_diff
# 1: 1 1 NA
# 2: 2 3 2
# 3: 3 4 1
# 4: 4 NA NA
# 5: 5 NA NA
# 6: 6 6 2
# 7: 7 9 3
# 8: 8 NA NA
# 9: 9 10 1

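A tidyverse version, just in case. It does need a filter and a join back onto the original rows, though:
library(dplyr)
dat %>%
  filter(!is.na(V1)) %>%
  mutate(diff = V1 - lag(V1)) %>%
  right_join(dat, by = c("time", "V1")) %>%
  arrange(time) # restore time order, which the join may not preserve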
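Since the question mentions hundreds of data frames, the filter-and-reinsert idea from the answers above can be wrapped in a small helper and applied to each one. This is only a minimal base R sketch: the function name `lag_diff_skip_na` and the list `frames` are hypothetical names chosen for illustration, not part of the original answers.

```r
# Difference between each non-NA value and the previous non-NA value;
# positions where x is NA stay NA.
lag_diff_skip_na <- function(x) {
  out <- rep(NA_real_, length(x))
  keep <- !is.na(x)
  out[keep] <- c(NA, diff(x[keep]))
  out
}

V1 <- c(1, 3, 4, NA, NA, 6, 9, NA, 10)
dat <- data.frame(time = seq_along(V1), V1 = V1)

# Hypothetical list standing in for the many data frames
frames <- list(a = dat, b = dat)
frames <- lapply(frames, function(d) {
  d$lag_diff <- lag_diff_skip_na(d$V1)
  d
})
frames$a$lag_diff
```

The helper is fully vectorized, so the only loop is the `lapply` over data frames.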

Related

R Completing NAs with average of previous values

I have looked at several similar questions on SO but can't seem to find a solution that works for me (though zoo and tidyr have gotten me the closest). I have a df with a column containing a series of NA values and need to fill those values with the average of the previous 2 lags. That new value needs to be included as one of the lags in the next record and so on. So something like this:
1
2
3
4
5
NA
NA
NA
needs to become
1
2
3
4
5
4.5
4.75
4.625
Thanks in advance for any suggestions, here is some sample data to play with.
df <- tibble::tribble(
~x,
1,
2,
3,
4,
5,
NA,
NA,
NA
)
I'd use a for loop:
for (i in seq_len(nrow(df))) {
  # Only fill when two earlier values exist to average
  if (i > 2 && is.na(df$x[i])) {
    df$x[i] <- mean(c(df$x[i - 1], df$x[i - 2]))
  }
}
# x
# <dbl>
# 1 1
# 2 2
# 3 3
# 4 4
# 5 5
# 6 4.5
# 7 4.75
# 8 4.62
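The same loop can be written as a function on a plain vector, which makes it easier to reuse across columns. The sketch below is an assumption-laden variant, not the answerer's code: the name `fill_with_lag_mean` and the window argument `k` are hypothetical, and NAs in the first `k` positions are left untouched.

```r
# Fill each NA with the mean of the k values just before it.
# Values filled earlier in the loop feed into later fills.
fill_with_lag_mean <- function(x, k = 2) {
  for (i in seq_along(x)) {
    if (i > k && is.na(x[i])) {
      x[i] <- mean(x[(i - k):(i - 1)])
    }
  }
  x
}

filled <- fill_with_lag_mean(c(1, 2, 3, 4, 5, NA, NA, NA))
filled
# 1 2 3 4 5 4.5 4.75 4.625
```

A loop is hard to avoid here because each filled value depends on the one filled before it.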

Replacing NA values with a specific average

I have a data.frame with columns and rows. How can I replace each NA value with the average of the nearest values before and after that cell in the same column?
for example:
1. 1 2 3
2. 4 NA 7
3. 9 NA 8
4. 1 5 6
I need the first NA to be - (5+2)/2=3.5
and the second to be (3.5+5)/2=4.25
Let's create some sample data and convert it to a data.table:
require(data.table)
require(zoo)
dat <- data.frame(a = c(1, 2, NA, 4))
setDT(dat)
Now, using the zoo::na.approx function we can impute the missing values.
dat[, newA:= na.approx(a, rule = 2)]
Output:
a newA
1: 1 1
2: 2 2
3: NA 3
4: 4 4
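Note that na.approx interpolates linearly, so for the question's column c(2, NA, NA, 5) it would give 3 and 4 rather than the 3.5 and 4.25 asked for. If the asker's rule is taken literally (average of the value just above, already filled if necessary, and the next non-NA value below), a base R sketch could look like the following. The function name `fill_neighbour_mean` is hypothetical, and leaving leading or trailing NAs as NA is an assumption.

```r
# Replace each NA with the mean of the value just before it (possibly
# filled on an earlier step) and the next non-NA value after it.
fill_neighbour_mean <- function(x) {
  # nxt[i]: nearest non-NA value at or after position i (NA if none)
  nxt <- x
  for (i in rev(seq_along(x))[-1]) {
    if (is.na(nxt[i])) nxt[i] <- nxt[i + 1]
  }
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- (x[i - 1] + nxt[i]) / 2
  }
  x
}

res <- fill_neighbour_mean(c(2, NA, NA, 5))
res
# 2 3.5 4.25 5
```

To apply it to every column of a data frame, something like `dat[] <- lapply(dat, fill_neighbour_mean)` should work for numeric columns.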

Replacing NAs between two rows with identical values in a specific column

I have a dataframe with multiple columns and I want to replace NAs in one column if they are between two rows with an identical number. Here is my data:
v1 v2
1 2
NA 3
NA 2
1 1
NA 7
NA 2
3 1
I basically want to start from the beginning of the data frame and replace NAs in column v1 with the previous non-NA value if the next non-NA value matches the previous one. That being said, I want the result to look like this:
v1 v2
1 2
1 3
1 2
1 1
NA 7
NA 2
3 1
As you can see, rows 2 and 3 are replaced with the number 1 because rows 1 and 4 have an identical number, but rows 5 and 6 stay the same because the non-NA values in rows 4 and 7 are not identical. I have been tweaking a lot but so far no luck. Thanks
Here is an idea using zoo package. We basically fill NAs in both directions and set NA the values that are not equal between those directions.
library(zoo)
ind1 <- na.locf(df$v1, fromLast = TRUE)
df$v1 <- na.locf(df$v1)
df$v1[df$v1 != ind1] <- NA
which gives,
v1 v2
1 1 2
2 1 3
3 1 2
4 1 1
5 NA 7
6 NA 2
7 3 1
Here is a similar approach in tidyverse using fill
library(tidyverse)
df %>%
mutate(vNew = v1) %>%
fill(vNew, .direction = 'up') %>%
fill(v1) %>%
mutate(v1 = replace(v1, v1 != vNew, NA)) %>%
select(-vNew)
# v1 v2
#1 1 2
#2 1 3
#3 1 2
#4 1 1
#5 NA 7
#6 NA 2
#7 3 1
Here is a base R solution, the logic is almost the same as Sotos's one:
replace_na <- function(x){
f <- function(x) ave(x, cumsum(!is.na(x)), FUN = function(x) x[1])
y <- f(x)
yp <- rev(f(rev(x)))
ifelse(!is.na(y) & y == yp, y, x)
}
df$v1 <- replace_na(df$v1)
test:
> replace_na(c(1, NA, NA, 1, NA, NA, 3))
[1] 1 1 1 1 NA NA 3
I was able to do this with the na.locf function from the zoo package. First, I use na.locf with its defaults to replace each NA with the most recent previous non-NA value and store the result in one column. Then I call the same function with fromLast = TRUE, which replaces each NA with the next non-NA value, and store that in a second column. Finally, I compare the two columns and, wherever they do not match in a row, set the value to NA.

Drop data frame rows if NA for certain variables referred to by name in dplyr

I would like to drop entire rows from a data frame if they are NA in every one of a certain subset of columns (which are named in a sequence and all start with "X").
This differs from the other SO answers I found, as far as I can tell, because I cannot refer to each column manually by name (there are too many variables), and I do not want to drop rows only when they are completely NA (rather, when that subset of variables is completely NA).
So turn this sample data:
data1 <- as.data.frame(rbind(c(1,2,3), c(1, NA, 4), c(4,6,7), c(1, NA, NA), c(4, 8, NA)))
colnames(data1) <- c("Z","X1","X2")
data1
Z X1 X2
1 1 2 3
2 1 NA 4
3 4 6 7
4 1 NA NA
5 4 8 NA
into:
Z X1 X2
1 1 2 3
2 1 NA 4
3 4 6 7
4 4 8 NA
I.e. drop the row if both X1 and X2 (all of the X sequence) are NA.
In this example there are only two variables(X1:X2)for ease but in reality I have closer to 100 of this sequence and many other important variables that may or may not be NA. I would prefer to do so in dplyr with filter but other solutions would be appreciated as well.
I feel like:
data2 %>% filter(!is.na(all(X1:X2)))
or something similar is close but R does not like the sequence reference to X1:X2 within filter.
You can use rowSums + select + starts_with + filter:
data1 %>%
filter(rowSums(!is.na(select(., starts_with("X")))) != 0)
# Z X1 X2
#1 1 2 3
#2 1 NA 4
#3 4 6 7
#4 4 8 NA
A base R solution using apply would be:
drop <- which(apply(data1[,startsWith(colnames(data1), "X")], 1, function(x) all(is.na(x))))
data1[-drop,]
# Z X1 X2
#1 1 2 3
#2 1 NA 4
#3 4 6 7
#5 4 8 NA
Another option using rowSums:
drop <- which(rowSums(is.na(data1[, c("X1", "X2")])) >= 2)
data1[-drop, ] # the comma matters: without it, columns are indexed instead of rows
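One caveat with the which() plus negative-index pattern used in the base R answers: if no row qualifies for dropping, the index is empty and `data1[-drop, ]` returns zero rows instead of all of them. A logical mask avoids that. The sketch below is one way to write it in base R; the names `x_cols`, `keep`, `result` and `no_na` are illustrative only.

```r
data1 <- as.data.frame(rbind(c(1, 2, 3), c(1, NA, 4), c(4, 6, 7),
                             c(1, NA, NA), c(4, 8, NA)))
colnames(data1) <- c("Z", "X1", "X2")

# Keep a row if at least one of the "X" columns is not NA
x_cols <- startsWith(colnames(data1), "X")
keep <- rowSums(!is.na(data1[, x_cols, drop = FALSE])) > 0
result <- data1[keep, ]
result

# The pitfall: with nothing to drop, the negative index is empty
no_na <- data.frame(Z = 1:2, X1 = 3:4)
drop <- which(rowSums(is.na(no_na)) == 2)  # integer(0)
nrow(no_na[-drop, ])                       # 0 rows, not 2
```

In dplyr 1.0.4 or later, `filter(if_any(starts_with("X"), ~ !is.na(.x)))` should express the same condition, though that is untested here.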

Removing rows with NA in R [duplicate]

I have a dataframe with 2500 rows. A few of the rows have NAs (an excessive number of NAs), and I want to remove those rows.
I've searched the SO archives, and come up with this as the most likely solution:
df2 <- df[df[, 12] != NA,]
But when I run it and look at df2, all I see is a screen full of NAs (and <NA>s).
Any suggestions?
Depending on what you're looking for, one of the following should help you on your way:
Some sample data to start with:
mydf <- data.frame(A = c(1, 2, NA, 4), B = c(1, NA, 3, 4),
C = c(1, NA, 3, 4), D = c(NA, 2, 3, 4),
E = c(NA, 2, 3, 4))
mydf
# A B C D E
# 1 1 1 1 NA NA
# 2 2 NA NA 2 2
# 3 NA 3 3 3 3
# 4 4 4 4 4 4
If you wanted to remove rows just according to a few specific columns, you can use complete.cases or the solution suggested by @SimonO101 in the comments. Here, I'm removing rows which have an NA in the first column.
mydf[complete.cases(mydf$A), ]
# A B C D E
# 1 1 1 1 NA NA
# 2 2 NA NA 2 2
# 4 4 4 4 4 4
mydf[!is.na(mydf[, 1]), ]
# A B C D E
# 1 1 1 1 NA NA
# 2 2 NA NA 2 2
# 4 4 4 4 4 4
If, instead, you wanted to set a threshold, as in "keep only the rows that have fewer than 2 NA values" (but you don't care which columns the NA values are in), you can try something like this:
mydf[rowSums(is.na(mydf)) < 2, ]
# A B C D E
# 3 NA 3 3 3 3
# 4 4 4 4 4 4
On the other extreme, if you want to delete all rows that have any NA values, just use complete.cases:
mydf[complete.cases(mydf), ]
# A B C D E
# 4 4 4 4 4 4
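As for why the attempt in the question produced a screen full of NAs: any comparison with NA, including `!= NA`, evaluates to NA rather than TRUE or FALSE, and indexing rows with NA returns NA rows. A short illustration, using a cut-down two-column version of the sample data (the names `bad` and `good` are just for this sketch):

```r
x <- c(1, NA, 3)
x != NA    # NA NA NA: a comparison with NA is itself NA
is.na(x)   # FALSE TRUE FALSE

mydf <- data.frame(A = c(1, 2, NA, 4), B = c(1, NA, 3, 4))

bad <- mydf[mydf[, 1] != NA, ]     # every row comes back filled with NA
good <- mydf[!is.na(mydf[, 1]), ]  # keeps rows 1, 2 and 4
good
```

This is why is.na() and complete.cases() are the usual tools for testing missingness.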
