The average of nearest neighbour values in a column in R

I have a dataframe here:
df <- data.frame("Time" = 1:10, "Value" = c(1.7,NA,-999,-999,1.5,1.6,NA,4,-999,8))
"NA" means there is no observation, just leave them there. "-999" means the observation is identified as an outlier.
Now I am trying to replace the "-999" with the average of the nearest values. For example:
The first "-999" should be replaced with (1.7+1.5)/2 = 1.6
The second "-999" should be replaced with (1.7+1.5)/2 = 1.6
The last "-999" should be replaced with (4.0+8.0)/2 = 6
I tried to use a next statement to move to the next iteration, and an if statement to decide where to stop. But how can I go back and check the previous iterations? Or is there another kind of solution to this?
Many thanks.

One approach utilizing dplyr and purrr could be:
library(dplyr)
library(purrr)

df %>%
  mutate(New_Value = if_else(Value == -999,
                             map_dbl(.x = seq_along(Value),
                                     ~ mean(c(tail(na.omit(na_if(Value[1:(.x - 1)], -999)), 1),
                                              head(na.omit(na_if(Value[(.x + 1):n()], -999)), 1)))),
                             Value))
Time Value New_Value
1 1 1.7 1.7
2 2 NA NA
3 3 -999.0 1.6
4 4 -999.0 1.6
5 5 1.5 1.5
6 6 1.6 1.6
7 7 NA NA
8 8 4.0 4.0
9 9 -999.0 6.0
10 10 8.0 8.0

Using a few while loops, which bump up how far we lag/lead, we can accomplish this. I am not sure how performant this operation will be on large data sets. But it seems to get the job done for your sample data.
# find where replacements are needed and initialize
where_to_replace <- which(df$Value == -999)
len_replace <- length(where_to_replace)
lag_value <- rep(NA, len_replace)
lead_value <- rep(NA, len_replace)

# more initializing
i <- 1
lag_n <- 1
lead_n <- 1

while (i <= len_replace) {
  # find the appropriate lagged value
  # can't use NA or a lag value of -999
  while (is.na(lag_value[i]) | lag_value[i] == -999) {
    lag_value[i] <- dplyr::lag(df$Value, lag_n)[where_to_replace[i]]
    lag_n <- lag_n + 1
  }
  # find the appropriate lead value
  # can't use NA or -999 as the lead value
  while (is.na(lead_value[i]) | lead_value[i] == -999) {
    lead_value[i] <- dplyr::lead(df$Value, lead_n)[where_to_replace[i]]
    lead_n <- lead_n + 1
  }
  # reset iterators
  i <- i + 1
  lag_n <- 1
  lead_n <- 1
}

# replacement values
df$Value[where_to_replace] <- (lead_value + lag_value) / 2
# Time Value
# 1 1 1.7
# 2 2 NA
# 3 3 1.6
# 4 4 1.6
# 5 5 1.5
# 6 6 1.6
# 7 7 NA
# 8 8 4.0
# 9 9 6.0
# 10 10 8.0

I created two new helper columns: before and after.
before fills every NA and -999 with the nearest value above, and after fills them with the nearest value below. In the next step I overwrote each -999 with the mean of the two values.
library(dplyr)
library(tidyr)

df <- data.frame(Time = 1:10,
                 Value = c(1.7, NA, -999, -999, 1.5,
                           1.6, NA,
                           4, -999, 8))

df <- df %>%
  mutate(before = recode(Value, `-999` = NA_real_),
         after = recode(Value, `-999` = NA_real_)) %>%
  fill(before, .direction = "down") %>%
  fill(after, .direction = "up") %>%
  mutate(Value = case_when(Value == -999 ~ (before + after) / 2,
                           TRUE ~ Value)) %>%
  select(Time, Value)
The Output
Time Value
1 1 1.7
2 2 NA
3 3 1.6
4 4 1.6
5 5 1.5
6 6 1.6
7 7 NA
8 8 4.0
9 9 6.0
10 10 8.0

Here is a base R option using findInterval
x <- which(df$Value == -999)
y <- setdiff(which(!is.na(df$Value)), x)
ind <- findInterval(x, y)
dfout <- within(df, Value <- replace(Value, x, rowMeans(cbind(Value[y[ind]], Value[y[ind + 1]]))))
such that
> dfout
Time Value
1 1 1.7
2 2 NA
3 3 1.6
4 4 1.6
5 5 1.5
6 6 1.6
7 7 NA
8 8 4.0
9 9 6.0
10 10 8.0
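For reference, findInterval(x, y) returns, for each outlier position in x, the index in y of the nearest usable row at or before it, so y[ind] and y[ind + 1] are the rows that bracket each outlier. A quick look at the intermediate objects for this data:
x                                        # 3 4 9      (rows containing -999)
y                                        # 1 5 6 8 10 (rows holding usable values)
ind                                      # 1 1 4
cbind(left = y[ind], right = y[ind + 1]) # rows (1,5), (1,5) and (8,10) bracket the outliers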

Just sticking with base R data.frames we can make a function and use sapply over indices of interest.
outliers <- df$Value == -999  # keep as logical for now
fillers <- which(!is.na(df$Value) & !outliers)
outliers <- which(outliers)   # now convert to indices; FALSE and NA do not appear

filled_outliers <- sapply(outliers, function(x) {
  before_ind <- max(fillers[fillers < x])  # largest filler index before the outlier
  after_ind  <- min(fillers[fillers > x])  # smallest filler index after the outlier
  0.5 * (df$Value[before_ind] + df$Value[after_ind])
})

df$Value[outliers] <- filled_outliers
df
Gives:
   Time Value
1     1   1.7
2     2    NA
3     3   1.6
4     4   1.6
5     5   1.5
6     6   1.6
7     7    NA
8     8   4.0
9     9   6.0
10   10   8.0

Related

Using previous value in a row to fill in subsequent values in same column in R (no loops)

I have a partially filled-in table; there are NAs at the top and bottom of the table (column X below). I want to fill in the table using a rate (0.3) to get the results shown in the Goal column. This is similar to the fill up/down function in Excel that copies a formula into adjacent cells.
df <- data.frame(X = matrix(nrow = 10, ncol = 1, NA))
df[3:5, 1] <- 2:4
X Goal
1 NA 1.4
2 NA 1.7
3 2 2
4 3 3
5 4 4
6 NA 4.3
7 NA 4.6
8 NA 4.9
9 NA 5.2
10 NA 5.5
Essentially what I want the code to do is this:
1.4 (X2 answer - 0.3)
1.7 (2 - 0.3)
2
3
4
4.3 (4 + 0.3)
4.6 (X6 answer + 0.3)
4.9 (X7 answer + 0.3)
5.2 (X8 answer + 0.3)
5.5 (X9 answer + 0.3)
I know this can probably be done using loops, but I find them intimidating given my skill level, so I'm looking for a solution that avoids them (if that's even possible).
Avoiding loops with nafill() and fcoalesce() from data.table.
library(data.table)
loc <- range(which(!is.na(df$X)))
df$Goal <-
  fcoalesce(nafill(df$X, "locf"), nafill(df$X, "nocb")) +
  c(-((loc[1] - 1):1) * 0.3, rep(0, diff(loc) + 1), (1:(nrow(df) - loc[2])) * 0.3)
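To see what each piece contributes, the helpers can be evaluated on df$X directly: nafill() fills runs of NA with the last ("locf") or next ("nocb") observation, fcoalesce() takes the first non-NA of the two, and the final term adds the 0.3 offsets on either side of the observed block.
nafill(df$X, "locf")                                   # NA NA 2 3 4 4 4 4 4 4
nafill(df$X, "nocb")                                   #  2  2 2 3 4 NA NA NA NA NA
fcoalesce(nafill(df$X, "locf"), nafill(df$X, "nocb"))  #  2  2 2 3 4 4 4 4 4 4
loc                                                    # 3 5 (first and last non-NA rows)
c(-((loc[1] - 1):1) * 0.3, rep(0, diff(loc) + 1), (1:(nrow(df) - loc[2])) * 0.3)
# -0.6 -0.3  0.0  0.0  0.0  0.3  0.6  0.9  1.2  1.5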
Still, it is (arguably) much easier to keep track of what is happening in each case with a loop:
# Preallocate
df$Goal <- 0
for (i in 1:nrow(df)) {
  if (i < loc[1]) df$Goal[i] <- df$X[loc[1]] - (loc[1] - i) * 0.3
  else if (i > loc[2]) df$Goal[i] <- df$X[loc[2]] + (i - loc[2]) * 0.3
  else df$Goal[i] <- df$X[i]
}
# X Goal
# 1 NA 1.4
# 2 NA 1.7
# 3 2 2.0
# 4 3 3.0
# 5 4 4.0
# 6 NA 4.3
# 7 NA 4.6
# 8 NA 4.9
# 9 NA 5.2
# 10 NA 5.5

Updating the dataframe according to the specified conditions in R

library(data.table)
a <- data.table(p = seq(1, 5, 0.5), s = 1:9)
b <- data.table(p = c(2.0, 6, 3.5), s = c(4, 7, 0))
I have 2 dataframes, dataframe "a" is the basis, dataframe "b" contains updates. I need to make changes to the dataframe "a" using the following conditions:
if in dataframe "b" the column "s" contains a value of 0, then all rows with the corresponding value of "p" must be removed from dataframe "a"
if in dataframe "b" the value in column "s" differs from the corresponding "s" in dataframe "a", then it must be replaced with the value from dataframe "b"
if dataframe "b" has a value of "p" that does not occur in "a", then that row must be added to dataframe "a"
result:
p s
1: 1.0 1
2: 1.5 2
3: 2.0 4
4: 2.5 4
5: 3.0 5
6: 4.0 7
7: 4.5 8
8: 5.0 9
9: 6.0 7
Using dplyr (the question does not specify that data.table syntax is required):
1. Get the values of p that need to be deleted.
2. Stack b on top of a so that b's updated rows take precedence, drop the p values found in 1., keep only the first row per p with distinct(), and sort by p.
library(dplyr)

to_delete <-
  filter(b, s == 0) %>%
  pull(p)

bind_rows(b, a) %>%
  filter(!p %in% to_delete) %>%
  distinct(p, .keep_all = TRUE) %>%
  arrange(p)
#> p s
#> 1: 1.0 1
#> 2: 1.5 2
#> 3: 2.0 4
#> 4: 2.5 4
#> 5: 3.0 5
#> 6: 4.0 7
#> 7: 4.5 8
#> 8: 5.0 9
#> 9: 6.0 7
Perhaps a join on 'p' (assuming the values have the same precision) to update 's' with the corresponding 's' from 'b' (i.s), then remove the rows where 's' is 0, and rbind the rows of 'b' whose 'p' values are not in a's 'p':
rbind(a[b, s := i.s, on = .(p)][s != 0], b[!p %in% a$p])
-output
# p s
#1: 1.0 1
#2: 1.5 2
#3: 2.0 4
#4: 2.5 4
#5: 3.0 5
#6: 4.0 7
#7: 4.5 8
#8: 5.0 9
#9: 6.0 7
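Unpacked into separate steps, the same logic reads as follows (a sketch of the one-liner above; res keeps the filtered copy so that a$p still holds the original keys when b's genuinely new rows are selected):
library(data.table)
a <- data.table(p = seq(1, 5, 0.5), s = 1:9)
b <- data.table(p = c(2.0, 6, 3.5), s = c(4, 7, 0))

# 1. update join: where p matches, overwrite a's s by reference with b's s (i.s)
a[b, s := i.s, on = .(p)]
# 2. drop the rows whose (updated) s is 0 (the deletion flag coming from b)
res <- a[s != 0]
# 3. append the rows of b whose p never occurred in a (here only p = 6)
rbind(res, b[!p %in% a$p])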
I first used full_join() so that we have all rows of both data frames. Where s.x and s.y are not equal and s.y is not NA, s.y replaces s.x, even for the zeros, which are deleted later. Where s.x is NA and s.y is not, meaning that the p value for s.y coming from data frame b does not exist in data frame a, s.y again replaces it.
library(dplyr)
library(tidyr)
library(stringr)

a %>%
  full_join(b, by = "p") %>%
  mutate(s.x = ifelse(s.x != s.y & !is.na(s.y), s.y, s.x),
         s.x = ifelse(is.na(s.x) & !is.na(s.y), s.y, s.x)) %>%
  select(-s.y) %>%
  filter(s.x != 0) %>%
  rename_with(~ str_remove(., ".x"), ends_with(".x"))
p s
1: 1.0 1
2: 1.5 2
3: 2.0 4
4: 2.5 4
5: 3.0 5
6: 4.0 7
7: 4.5 8
8: 5.0 9
9: 6.0 7
Using coalesce(), it is even easier:
a %>%
  full_join(b, by = "p") %>%
  transmute(p, s = coalesce(s.y, s.x)) %>%
  filter(s != 0)
p s
1: 1.0 1
2: 1.5 2
3: 2.0 4
4: 2.5 4
5: 3.0 5
6: 4.0 7
7: 4.5 8
8: 5.0 9
9: 6.0 7

Cumulatively sum a portion of a previous value with its next value

I would like to create a column that sums the adjacent value and 80% of the previous value from another column. So, if column x is 1, 2, 3...10, I want column z to be 1, 2.8, 5.24, 8.192, etc.
Yet, here is my failed attempt:
x <- c(1:10)
y <- c("")
df <- data.frame(x,y)
df1 <- df %>%
mutate(y = cumsum(x*0.8))
Result:
x y
1 1 0.8
2 2 2.4
3 3 4.8
4 4 8.0
5 5 12.0
6 6 16.8
7 7 22.4
8 8 28.8
9 9 36.0
10 10 44.0
I would use a for loop to do this. It's important to initialize a vector first, especially if you're working with a large data set.
# initialize
newx <- vector("numeric", length(df$x))
newx[1] <- df$x[1]

for (i in 2:length(df$x)) {
  newx[i] <- df$x[i] + (0.8 * newx[i - 1])
}
newx
# [1] 1.00000 2.80000 5.24000 8.19200 11.55360 15.24288 19.19430 23.35544 27.68435 32.14748
With the addition of purrr, you can do:
df %>%
  mutate(y = accumulate(x, ~ .x * 0.8 + .y))
x y
1 1 1.00000
2 2 2.80000
3 3 5.24000
4 4 8.19200
5 5 11.55360
6 6 15.24288
7 7 19.19430
8 8 23.35544
9 9 27.68435
10 10 32.14748
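Note that inside accumulate() the .x placeholder is the value accumulated so far and .y is the incoming element, which is why the formula reads .x * 0.8 + .y. A tiny illustration of the call order (not part of the original answer):
library(purrr)
accumulate(c("a", "b", "c"), ~ paste0("f(", .x, ", ", .y, ")"))
# [1] "a"             "f(a, b)"       "f(f(a, b), c)"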
Try using the Reduce function:
Reduce(function(last, current) current + last * .8, x = x, accumulate = T)
# [1] 1.00000 2.80000 5.24000 8.19200 11.55360 15.24288 19.19430 23.35544 27.68435 32.14748
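Another base R route to the same recurrence y[i] = x[i] + 0.8 * y[i - 1] is a recursive linear filter; stats::filter() with method = "recursive" should give the same numbers (a sketch, not from the original answers):
x <- 1:10
as.numeric(stats::filter(x, filter = 0.8, method = "recursive"))
# [1]  1.00000  2.80000  5.24000  8.19200 11.55360 15.24288 19.19430 23.35544 27.68435 32.14748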

Conditional slicing|filtering top and bottom n rows from grouped data

I have run into an issue with filtering or slicing the top and bottom n rows at the same time from grouped data.
So it is different from this question: Select first and last row from grouped data.
What I need to do: if sub_gr == "a", then filter|slice the top three rows;
if sub_gr == "b", then filter|slice the bottom two rows. That's it!
My data looks something like this:
df <- data.frame(gr = rep(seq(1, 2), each = 10),
                 sub_gr = rep(rep(c("a", "b"), each = 5), 2),
                 y = rep(c(sort(runif(5, 0, 0.5), decreasing = TRUE),
                           sort(runif(5, 0, 0.5), decreasing = TRUE)), 2),
                 x = rep(c(seq(0.1, 0.5, 0.1), rev(seq(-0.5, -0.1, 0.1))), 2))
gr sub_gr y x
1 1 a 0.37851909 0.1
2 1 a 0.33305165 0.2
3 1 a 0.22478005 0.3
4 1 a 0.09677654 0.4
5 1 a 0.07060651 0.5
6 1 b 0.41999445 -0.1
7 1 b 0.35356301 -0.2
8 1 b 0.33274398 -0.3
9 1 b 0.20451400 -0.4
10 1 b 0.03714828 -0.5
11 2 a 0.37851909 0.1
12 2 a 0.33305165 0.2
13 2 a 0.22478005 0.3
14 2 a 0.09677654 0.4
15 2 a 0.07060651 0.5
16 2 b 0.41999445 -0.1
17 2 b 0.35356301 -0.2
18 2 b 0.33274398 -0.3
19 2 b 0.20451400 -0.4
20 2 b 0.03714828 -0.5
library(dplyr)
Here is what I tried:
df %>%
  group_by(gr, sub_gr) %>%
  slice(if (any(sub_gr == "a")) {row_number() == 1:3} else {row_number() == 4:n()})
Warning messages:
1: In 1:5 == 1:3 :
longer object length is not a multiple of shorter object length
2: In 1:5 == 4:5L :
longer object length is not a multiple of shorter object length
3: In 1:5 == 1:3 :
longer object length is not a multiple of shorter object length
4: In 1:5 == 4:5L :
longer object length is not a multiple of shorter object length
thanks for your help in advance!
There are probably more elegant solutions, but I think the following works. I set seed for reproducibility.
set.seed(123)
df <- data.frame(gr = rep(seq(1, 2), each = 10),
                 sub_gr = rep(rep(c("a", "b"), each = 5), 2),
                 y = rep(c(sort(runif(5, 0, 0.5), decreasing = TRUE),
                           sort(runif(5, 0, 0.5), decreasing = TRUE)), 2),
                 x = rep(c(seq(0.1, 0.5, 0.1), rev(seq(-0.5, -0.1, 0.1))), 2))
df %>%
  group_by(gr, sub_gr) %>%
  filter((sub_gr %in% "a" & row_number() %in% 1:3) |
           (sub_gr %in% "b" & row_number() %in% (n() - 1):n())) %>%
  ungroup()
# # A tibble: 10 x 4
# gr sub_gr y x
# <int> <fctr> <dbl> <dbl>
# 1 1 a 0.47023364 0.1
# 2 1 a 0.44150870 0.2
# 3 1 a 0.39415257 0.3
# 4 1 b 0.22830737 -0.4
# 5 1 b 0.02277825 -0.5
# 6 2 a 0.47023364 0.1
# 7 2 a 0.44150870 0.2
# 8 2 a 0.39415257 0.3
# 9 2 b 0.22830737 -0.4
# 10 2 b 0.02277825 -0.5
library(tidyverse)

# create a custom function to take the head or tail based on the rule
cond_slice <- function(x) {
  if (unique(x$sub_gr) == "a") {
    head(x, 3)
  } else {
    tail(x, 2)
  }
}

# create a column to split by and then map across the subsets
result <- df %>%
  unite(split_by, gr, sub_gr, remove = FALSE) %>%
  split(.$split_by) %>%
  map(cond_slice) %>%
  bind_rows() %>%
  select(-split_by)
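On more recent versions of dplyr, a similar conditional head/tail can also stay inside a grouped pipeline via group_modify(), which avoids the unite/split/bind round trip (a sketch, not part of the original answers):
library(dplyr)

df %>%
  group_by(gr, sub_gr) %>%
  group_modify(~ if (.y$sub_gr == "a") head(.x, 3) else tail(.x, 2)) %>%
  ungroup()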

R - Taking column maxes over specific subsets of a dataframe

I have a large set of data with various indices and such. I would like to change my data from something like this:
id time var1_t1 var1_t2 var1_t3 var2_t1 var2_t2 var2_t3
1 1 1.5 NA NA 3.3 NA NA
1 2 NA 2.5 NA NA 1.2 NA
1 3 NA NA 3.5 NA NA .7
to something like this:
id time var1_t1 var1_t2 var1_t3 var2_t1 var2_t2 var2_t3
1 1 1.5 2.5 3.5 3.3 1.2 .7
1 2 NA 2.5 NA NA 1.2 NA
1 3 NA NA 3.5 NA NA .7
or this:
id time var1_t1 var1_t2 var1_t3 var2_t1 var2_t2 var2_t3
1 1 1.5 2.5 3.5 3.3 1.2 .7
1 2 1.5 2.5 3.5 3.3 1.2 .7
1 3 1.5 2.5 3.5 3.3 1.2 .7
Except that there are rather more than six columns I need to do this for, and "id" has values other than 1.
I can do this for a single column as follows:
for (i in 1:max(df$id)) {
  df[df$id == i & df$time == 1, ]$var1_t1 <- max(df[df$id == i, ]$var1_t1,
                                                 na.rm = TRUE)
}
But that uses a for loop, so it is a terrible idea. And I would have to repeat that line for each column. Is there a way I can do this more elegantly?
If you want to replace all NA's with the column-wise max value by group of id, you could define a little custom function:
f <- function(x) {
  x[is.na(x)] <- max(x, na.rm = TRUE)
  x
}
And then use your favorite data manipulation functions/package, for example dplyr:
library(dplyr)
df %>% group_by(id) %>% mutate(across(everything(), f))
Or data.table:
library(data.table)
setDT(df)[, lapply(.SD, f), by = id]
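As a quick check, the toy data from the question can be rebuilt (values taken from the first table above, which only shows id = 1) and pushed through the dplyr version; every NA becomes its column's within-id maximum, i.e. the third desired output:
library(dplyr)

df <- data.frame(
  id      = c(1, 1, 1),
  time    = 1:3,
  var1_t1 = c(1.5, NA, NA),
  var1_t2 = c(NA, 2.5, NA),
  var1_t3 = c(NA, NA, 3.5),
  var2_t1 = c(3.3, NA, NA),
  var2_t2 = c(NA, 1.2, NA),
  var2_t3 = c(NA, NA, 0.7)
)

f <- function(x) {
  x[is.na(x)] <- max(x, na.rm = TRUE)
  x
}

df %>% group_by(id) %>% mutate(across(everything(), f))
# each var column now reads 1.5 / 2.5 / 3.5 / 3.3 / 1.2 / 0.7 in all three rows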
