I am trying to impute values in a vector in R. The conditions for the imputation are:
- Find all NA values.
- Check whether each NA has an existing value before and after it.
- Check whether the value that follows the NA is larger than the value before the NA.
- If the conditions are met, calculate the mean of the values before and after.
- Replace the NA with the imputed value.
# example one
input_one = c(1,NA,3,4,NA,6,NA,NA)
# example two
input_two = c(NA,NA,3,4,5,6,NA,NA)
# example three
input_three = c(NA,NA,3,4,NA,6,NA,NA)
I started writing code to detect the values that can
be imputed, but I got stuck with the following.
# incomplete function to detect the values
sapply(split(!is.na(input[c(rbind(which(is.na(c(input)))-1, which(is.na(c(input)))+1))]),
rep(1:(length(!is.na(input[c(which(is.na(c(input)))-1, which(is.na(c(input)))+1)]))/2), each = 2)), all)
This, however, only detects the NAs that might be
imputable, and it only works with example one. It is incomplete and
unfortunately very hard to read and understand.
Any help with this would be highly appreciated.
We can use dplyr's lag and lead functions for that:
input_three = c(NA,NA,3,4,NA,6,NA,NA)
library(dplyr)
ifelse(is.na(input_three) & lead(input_three) > lag(input_three),
(lag(input_three) + lead(input_three))/ 2,
input_three)
Returns:
[1] NA NA 3 4 5 6 NA NA
Edit
Explanation:
We use ifelse, which is the vectorized version of if, i.e. everything within ifelse is applied to each element of the vectors.
First we test whether an element is NA and whether the following element is greater than the previous one. To get the previous and following elements we can use dplyr's lead and lag functions:
lag offsets a vector to the right (default is 1 step):
lag(1:5)
Returns:
[1] NA 1 2 3 4
lead offsets a vector to the left:
lead(1:5)
Returns:
[1] 2 3 4 5 NA
Now to the 'test' clause of ifelse:
is.na(input_three) & lead(input_three) > lag(input_three)
Which returns:
[1] NA NA FALSE FALSE TRUE FALSE NA NA
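Then, if the test evaluates to TRUE, we return the sum of the previous and following elements divided by 2; otherwise we return the original element. Where the test is NA, ifelse returns NA, which is harmless here because those positions are NA in the input anyway.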
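For readers without dplyr, the same idea can be sketched in base R. The helper names shift_right, shift_left and impute_between are made up for this illustration; they only mimic what lag and lead do with their default offset of 1.

```r
# Hypothetical base R stand-ins for dplyr::lag() and dplyr::lead() (offset 1)
shift_right <- function(x) c(NA, head(x, -1))  # previous element
shift_left  <- function(x) c(tail(x, -1), NA)  # following element

# Replace an NA by the mean of its two neighbours,
# but only if the following value is larger than the previous one
impute_between <- function(x) {
  before <- shift_right(x)
  after  <- shift_left(x)
  ifelse(is.na(x) & after > before, (before + after) / 2, x)
}

input_one   <- c(1, NA, 3, 4, NA, 6, NA, NA)
input_two   <- c(NA, NA, 3, 4, 5, 6, NA, NA)
input_three <- c(NA, NA, 3, 4, NA, 6, NA, NA)

impute_between(input_one)
# [1]  1  2  3  4  5  6 NA NA
impute_between(input_two)    # nothing can be imputed, returned unchanged
impute_between(input_three)
# [1] NA NA  3  4  5  6 NA NA
```

Note that this, like the dplyr version, only looks one step to each side, so an NA that sits next to another NA is left alone.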
Here's an example using the imputeTS package. It handles more than one NA in a row, calculates the mean only if the next valid observation is greater than the last valid observation, and ignores NAs at the beginning and end.
library(imputeTS)
library(dplyr)  # lag() and lead() below are the dplyr versions
myimpute <- function(series) {
  # Find where each NA is
  nalocations <- is.na(series)
  # Find the last and the next observation for each row
  last1 <- lag(series)
  next1 <- lead(series)
  # Carry forward the last and next observations over sequences of NA
  # Each row will then get a last and next that can be averaged
  cflast <- na_locf(last1, na_remaining = 'keep')
  cfnext <- na_locf(next1, option = 'nocb', na_remaining = 'keep')
  # Make a data frame
  df <- data.frame(series, nalocations, last1, cflast, next1, cfnext)
  # Calculate the mean where there is currently a NA
  # making sure that the next is greater than the last
  df$mean <- ifelse(df$nalocations, ifelse(df$cflast < df$cfnext, (df$cflast+df$cfnext)/2, NA), NA)
  imputedseries <- ifelse(df$nalocations, ifelse(!is.na(df$mean), df$mean, NA), series)
  #list(df, imputedseries) # comment this in and return it to see the intermediate data frame for debugging
  imputedseries
}
myimpute(c(NA,NA,3,4,NA,NA,6,NA,NA,8,NA,7,NA,NA,9,NA,11,NA,NA))
# [1] NA NA 3 4 5 5 6 7 7 8 NA 7 8 8 9 10 11 NA NA
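If imputeTS is not available, the carry-forward and carry-backward steps can be approximated in base R. This is only a sketch of the same idea; locf, nocb and impute_gaps are hypothetical helper names, not functions from any package.

```r
# Hypothetical helper: last non-NA value at or before each position
locf <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))  # index of last non-NA, 0 if none yet
  c(NA, x)[idx + 1]
}
# Hypothetical helper: next non-NA value at or after each position
nocb <- function(x) rev(locf(rev(x)))

impute_gaps <- function(x) {
  before <- locf(x)
  after  <- nocb(x)
  # Impute only NAs that have a valid value on both sides, with after > before
  ok <- is.na(x) & !is.na(before) & !is.na(after) & after > before
  x[ok] <- (before[ok] + after[ok]) / 2
  x
}

impute_gaps(c(NA,NA,3,4,NA,NA,6,NA,NA,8,NA,7,NA,NA,9,NA,11,NA,NA))
# [1] NA NA  3  4  5  5  6  7  7  8 NA  7  8  8  9 10 11 NA NA
```

The NA between 8 and 7 stays NA because the following value is not larger, and the leading and trailing NAs stay NA because one side has no valid value.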
There is also the na_ma function in the imputeTS package for imputing moving averages.
In your case this would be with the following settings:
na_ma(x, k = 1, weighting = "simple")
k = 1 (meaning 1 value before and 1 after the NA are taken into account)
weighting = "simple" (the mean of these two values is calculated)
This can be applied quite easily, with basically one line of code:
library(imputeTS)
na_ma(yourData, k = 1, weighting = "simple")
You could also choose to take more values before and after the NA into account, e.g. k = 3. An interesting feature when you take more than one value on each side into account is the possibility to choose a different weighting: e.g. with weighting = "linear" the weights decrease in arithmetical progression (a Linear Weighted Moving Average), meaning that the further the values are from the NA, the less impact they have.
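To make the weighting idea concrete, here is a small base R sketch of a window mean around a single position. It is not the imputeTS implementation: the helper window_mean is hypothetical, and the weight 1/(d + 1) at distance d is my assumption for "linear", chosen to match how I read the package documentation (1/2, 1/3, 1/4, ...).

```r
# Hypothetical helper: (weighted) mean of up to k neighbours on each side of position i
window_mean <- function(x, i, k = 3, weighting = c("simple", "linear")) {
  weighting <- match.arg(weighting)
  pos <- setdiff(max(1, i - k):min(length(x), i + k), i)  # neighbours, without i itself
  d <- abs(pos - i)                                       # distance from position i
  # Assumption: "linear" means weight 1/(d + 1); "simple" weighs all neighbours equally
  w <- if (weighting == "simple") rep(1, length(pos)) else 1 / (d + 1)
  keep <- !is.na(x[pos])                                  # ignore other NAs in the window
  sum(w[keep] * x[pos][keep]) / sum(w[keep])
}

x <- c(1, 2, 4, NA, 8, 16, 32)
window_mean(x, 4, k = 3, weighting = "simple")
# [1] 10.5
window_mean(x, 4, k = 3, weighting = "linear")
# [1] 9.346154
```

With the linear weights the distant values 1 and 32 count less than the direct neighbours 4 and 8, which pulls the result below the simple mean in this example.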
I have a complex calculation that needs to be done. It is basically at row level, and I am not sure how to tackle it.
If you can help me with the approach or any functions, that would be really great.
I will break my problem into two sub-problems for simplicity.
Below is what my data looks like:
Group,Date,Month,Sales,lag7,lag6,lag5,lag4,lag3,lag2,lag1,lag0(reference),lead1,lead2,lead3,lead4,lead5,lead6,lead7
Group1,42005,1,2503,1,1,0,0,0,0,0,0,0,0,0,0,1,0,1
Group1,42036,2,3734,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0
Group1,42064,3,6631,1,0,0,1,0,0,0,0,0,0,1,1,1,1,0
Group1,42095,4,8606,0,1,0,1,1,0,1,0,1,1,1,0,0,0,0
Group1,42125,5,1889,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0
Group1,42156,6,4819,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0
Group1,42186,7,5120,0,0,1,1,1,1,1,0,0,1,1,0,1,1,0
I have data for each Group at monthly level.
I would like to capture the two things below.
1. The count of consecutive zeros in each row, going outwards in both directions from lag0(reference)
The cases of interest (highlighted in yellow in the original table) are the zeros that are consecutive with lag0(reference), up to the point where the first 1 is reached. I want to capture the count of zeros at row level, along with the corresponding Sales value.
Below is the output I am looking for in part 1.
Output:
Month,Sales,Count
1,2503,9
2,3734,3
3,6631,5
4,8606,0
5,1889,6
6,4819,1
7,5120,1
2. Identify the consecutive rows (rows 1, 2 and 3, and similarly rows 5 and 6) where any lag or lead overlaps for any 0 within the lag0(reference) range, and capture their Sales and Month values.
For example, for rows 1, 2 and 3, the overlap happens at least at lag 3, 2, 1 and lead 1, 2; this needs to be captured and tagged as case 1 (or 1). Similarly, for rows 5 and 6 at least lag1 overlaps, hence this needs to be captured and tagged as case 2 (or 2), along with the Sales and Month values.
Row 7 does not overlap with the previous or following consecutive row, hence it will not be captured.
Below is the result I am looking for in part 2.
Month,Sales,Case
1,2503,1
2,3734,1
3,6631,1
5,1889,2
6,4819,2
I want to run this for multiple groups, so I will use either dplyr or a loop to get the result. Currently, I am simply looking for the approach.
I am not sure how to solve this problem; this is the first time I am trying to capture things at row level in R. I am not looking for a complete solution, simply a first step to tackle the problem. I would appreciate any leads.
One option for the first part of the calculation, using rle:
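df$count <- apply(df[,-c(1:4)],1,function(x){
  # x holds lag7..lag1 (1:7), lag0 (8) and lead1..lead7 (9:15)
  first <- rle(x[1:7])
  second <- rle(x[9:15])
  count <- 0
  # zeros directly before lag0: the last run of the lag columns
  if(first$values[length(first$values)] == 0){
    count = first$lengths[length(first$values)]
  }
  # zeros directly after lag0: the first run of the lead columns
  if(second$values[1] == 0){
    count = count+second$lengths[1]
  }
  count
})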
df[,c("Month", "Sales", "count")]
# Month Sales count
# 1 1 2503 9
# 2 2 3734 3
# 3 3 6631 5
# 4 4 8606 0
# 5 5 1889 6
# 6 6 4819 1
# 7 7 5120 1
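To see why the code looks at the last run of the lag columns and the first run of the lead columns, it helps to print what rle returns for a single row. The two vectors below are copied from row 1 of the data; the variable names are only for illustration.

```r
lag_part  <- c(1, 1, 0, 0, 0, 0, 0)  # lag7 ... lag1 of row 1
lead_part <- c(0, 0, 0, 0, 1, 0, 1)  # lead1 ... lead7 of row 1

r_lag  <- rle(lag_part)
r_lead <- rle(lead_part)

r_lag$values    # 1 0      -> the last run is a run of zeros ...
r_lag$lengths   # 2 5      -> ... of length 5
r_lead$values   # 0 1 0 1  -> the first run is a run of zeros ...
r_lead$lengths  # 4 1 1 1  -> ... of length 4

# Zeros adjacent to lag0 on either side: 5 + 4 = 9, the count for row 1
tail(r_lag$lengths, 1) + r_lead$lengths[1]
```

The lag0 column itself is not counted, which is why the result for row 1 is 9 and the result for row 4 is 0.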
Data:
df <- read.table(text =
"Group,Date,Month,Sales,lag7,lag6,lag5,lag4,lag3,lag2,lag1,lag0(reference),lead1,lead2,lead3,lead4,lead5,lead6,lead7
Group1,42005,1,2503,1,1,0,0,0,0,0,0,0,0,0,0,1,0,1
Group1,42036,2,3734,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0
Group1,42064,3,6631,1,0,0,1,0,0,0,0,0,0,1,1,1,1,0
Group1,42095,4,8606,0,1,0,1,1,0,1,0,1,1,1,0,0,0,0
Group1,42125,5,1889,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0
Group1,42156,6,4819,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0
Group1,42186,7,5120,0,0,1,1,1,1,1,0,0,1,1,0,1,1,0",
header = TRUE, stringsAsFactors = FALSE, sep = ",")
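The second part is not covered above. As a possible first step, here is a sketch under one reading of the rule: two consecutive rows "overlap" when both have at least one zero directly next to lag0 on the same side (both on the lag side, or both on the lead side). That reading is my assumption; it reproduces the expected output for this sample, but it may not be what is intended in general. The indicator matrix is typed in from the data so that the block runs on its own, and leading_zeros is a hypothetical helper.

```r
# Indicator columns lag7..lag1, lag0, lead1..lead7 for the seven sample rows
m <- matrix(c(
  1,1,0,0,0,0,0,0,0,0,0,0,1,0,1,
  1,1,1,1,1,0,0,0,0,1,1,0,0,0,0,
  1,0,0,1,0,0,0,0,0,0,1,1,1,1,0,
  0,1,0,1,1,0,1,0,1,1,1,0,0,0,0,
  0,1,1,0,1,0,0,0,0,0,0,0,1,1,0,
  0,1,0,0,0,1,0,0,1,0,1,1,1,1,0,
  0,0,1,1,1,1,1,0,0,1,1,0,1,1,0),
  nrow = 7, byrow = TRUE)
month <- 1:7
sales <- c(2503, 3734, 6631, 8606, 1889, 4819, 5120)

# Hypothetical helper: length of the run of zeros at the start of a vector
leading_zeros <- function(v) {
  r <- rle(v)
  if (r$values[1] == 0) r$lengths[1] else 0
}

n_lag  <- apply(m[, 7:1], 1, leading_zeros)   # walk left from lag1 towards lag7
n_lead <- apply(m[, 9:15], 1, leading_zeros)  # walk right from lead1 towards lead7
n_lag + n_lead                                # the part 1 counts: 9 3 5 0 6 1 1

# Assumption: rows i and i + 1 overlap if both have zeros on the same side of lag0
n <- nrow(m)
overlap <- pmin(n_lag[-n], n_lag[-1]) > 0 | pmin(n_lead[-n], n_lead[-1]) > 0

# Keep a row if it overlaps with the row before or the row after it
keep <- c(overlap, FALSE) | c(FALSE, overlap)
# A new case starts at each kept row that does not overlap with its predecessor
starts <- keep & !c(FALSE, overlap)
result <- data.frame(Month = month[keep], Sales = sales[keep],
                     Case = cumsum(starts)[keep])
result
#   Month Sales Case
# 1     1  2503    1
# 2     2  3734    1
# 3     3  6631    1
# 4     5  1889    2
# 5     6  4819    2
```

For several groups, the same steps could be wrapped in a function and applied per Group, for example with split and lapply or a grouped dplyr pipeline.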