Fill NAs by smoothing the peak values - r

df <- data.frame(
  date = seq(from = as.POSIXct(as.Date("2020-10-01")),
             to   = as.POSIXct(as.Date("2020-10-02")), by = 'hour'),
  val  = c(15, 20, 18, 22, 17, NA, NA, NA, 80, 14, 23, 16, 19, 21,
           NA, NA, 60, 18, 15, 20, 22, 19, NA, 35, 18)
)
There are uneven sequences of NAs followed by peak values, e.g. val = 80, 60 and 35.
I would like to fill the NAs by smoothing the peak values. For example, in the first NA sequence, three NAs are followed by 80, which makes four data points in total; 80 divided by 4 gives 20 for each of them.
Note: the peak values are NOT outliers, so the total sum of the data points should not change.
If possible, I would like to fill the NAs under the above conditions while preserving the signal behavior (trend and seasonality).
Many thanks.

The following function replaces each run of NAs, together with the peak value that follows it, by that peak value divided by (run length + 1), so each group total, and hence the overall sum, is preserved.
fill_na <- function(x){
  na <- is.na(x)
  r <- rle(na)
  div <- r$lengths[r$values] + 1L       # NA run length plus the peak itself
  cs <- cumsum(r$lengths)[r$values]     # index of the last NA in each run
  for(i in seq_along(div)){
    if(cs[i] < length(x)){              # skip trailing NAs with no peak after them
      x[(cs[i] - div[i] + 2L):(cs[i] + 1L)] <- x[cs[i] + 1L] / div[i]
    }
  }
  x
}
fill_na(df$val)
# [1] 15.0 20.0 18.0 22.0 17.0 20.0 20.0 20.0 20.0 14.0 23.0
#[12] 16.0 19.0 21.0 20.0 20.0 20.0 18.0 15.0 20.0 22.0 19.0
#[23] 17.5 17.5 18.0
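Since the question asks that the total not change, a quick check (just a sketch, using df and fill_na as defined above):
# the peak of each group is redistributed over the group, so the filled series
# should sum to the same value as the original non-NA values
sum(fill_na(df$val)) == sum(df$val, na.rm = TRUE)
# [1] TRUE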

Related

Replacing multiple observations from one column with values from another column in R

I am trying to replace values from two columns with values from another two columns. This is a rather basic question and has been asked by Python users; however, I am using R.
I have a df that looks like this (only on a much larger scale, >20,000 rows):
squirrel_id locx locy dist
6391 17.5 10.0 50.0
6391 17.5 10.0 20.0
6391 17.5 10.0 15.5
8443 20.5 1.0 800
6025 -5.0 -0.5 0.0
For 63 squirrels, I need to replace their locx and locy values.
I normally replace values with the following code:
library(dplyr)
df <- df %>%
  mutate(locx = ifelse(squirrel_id == "6391", "12.5", locx),
         locy = ifelse(squirrel_id == "6391", "15.5", locy),
         locx = ifelse(squirrel_id == "8443", "2.5", locx),
         locy = ifelse(squirrel_id == "8443", "80", locy))  # etc. for 63 squirrels
Which would give me:
squirrel_id locx locy dist
6391 12.5 10.0 50.0
6391 12.5 10.0 20.0
6391 12.5 10.0 15.5
8443 2.5 80.0 800
6025 -5.0 -0.5 0.0
But this is creating an extra 126 lines of code and I suspect there is a simpler way to do this.
I do have all the new locx and locy values in a separate df, but I do not know how to join the two dataframes by squirrel_id without it messing up the data.
df with the values that need to replace the ones in the old df:
squirrel_id new_locx new_locy
6391 12.5 15.5
8443 2.5 80
6025 -55.0 0.0
How can I do this more efficiently?
You can left_join the two data frames and then use an if_else statement to pick the right locx and locy. Try:
library(dplyr)
df %>%
  left_join(df2, by = "squirrel_id") %>%
  mutate(locx = if_else(is.na(new_locx), locx, new_locx),  # as suggested by #echasnovski, we can also use locx = coalesce(new_locx, locx)
         locy = if_else(is.na(new_locy), locy, new_locy)) %>%  # or locy = coalesce(new_locy, locy)
  select(-new_locx, -new_locy)
# output
  squirrel_id  locx locy  dist
1        6391  12.5 15.5  50.0
2        6391  12.5 15.5  20.0
3        6391  12.5 15.5  15.5
4        8443   2.5 80.0 800.0
5        6025 -55.0  0.0   0.0
6        5000  18.5 12.5  10.0   # squirrel_id 5000 was created as an example of an id
                                 # present in df but not in df2
Data
df <- structure(list(squirrel_id = c(6391L, 6391L, 6391L, 8443L, 6025L,
5000L), locx = c(17.5, 17.5, 17.5, 20.5, -5, 18.5), locy = c(10,
10, 10, 1, -0.5, 12.5), dist = c(50, 20, 15.5, 800, 0, 10)), class = "data.frame", row.names = c(NA,
-6L))
df2 <- structure(list(squirrel_id = c(6391L, 8443L, 6025L), new_locx = c(12.5,
2.5, -55), new_locy = c(15.5, 80, 0)), class = "data.frame", row.names = c(NA,
-3L))
Using #ANG's data, here's a data.table solution. It joins and updates the original df by reference.
library(data.table)
setDT(df)
setDT(df2)
df[df2, on = c('squirrel_id'), `:=` (locx = new_locx, locy = new_locy) ]
df
squirrel_id locx locy dist
1: 6391 12.5 15.5 50.0
2: 6391 12.5 15.5 20.0
3: 6391 12.5 15.5 15.5
4: 8443 2.5 80.0 800.0
5: 6025 -55.0 0.0 0.0
6: 5000 18.5 12.5 10.0
See also:
how to use merge() to update a table in R
Replace a subset of a data frame with dplyr join operations
R: Updating a data frame with another data frame
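For completeness, a base-R sketch of the same idea using match() instead of a join (this assumes the original df and df2 data frames from the Data block above, before any setDT() conversion):
# for each row of df, find the matching row of df2 (NA where there is none)
idx <- match(df$squirrel_id, df2$squirrel_id)
hit <- !is.na(idx)
# overwrite locx/locy only for rows that have a replacement in df2
df$locx[hit] <- df2$new_locx[idx[hit]]
df$locy[hit] <- df2$new_locy[idx[hit]]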

Replacement of a negative value in a df with the same value divided by two

I am trying to write a function that would replace every negative value in a vector with the same value divided by two.
# replacement function for a vector
rep <- function(x)
x.half <- {abs(replace(x, which(x<0),x/2))}
But I know that this function does not work properly, because if I apply it to a vector with negative and positive numbers I get the wrong result:
a <- c(1,-1,2,-2,3,-3,4,-4,5,-5,11,-11,12,-12,13,-13,21,-21,25,-25)
a.rep <- rep(a)
# matrix to compare the original and replaced values
test <- cbind(a, a.rep)
Also, when I apply this function to vector a, I get a warning message like this:
Warning message:
In replace(x, which(x < 0), x/2) :
number of items to replace is not a multiple of replacement length
Obviously, there is something wrong with my function.
The third argument to replace needs to be the same length as the set of values selected by the second.
## determine which values are below zero
a0 <- a < 0
## replace them with their halved values
replace(a, a0, a[a0] / 2)
[1] 1.0 -0.5 2.0 -1.0 3.0 -1.5 4.0 -2.0 5.0 -2.5 11.0
[12] -5.5 12.0 -6.0 13.0 -6.5 21.0 -10.5 25.0 -12.5
Although the question has been answered already, I felt challenged to add an arithmetical solution.
The expression
(sign(a) + 3) / 4 * a
will return
# [1] 1.0 -0.5 2.0 -1.0 3.0 -1.5 4.0 -2.0 5.0 -2.5 11.0 -5.5 12.0 -6.0 13.0 -6.5
#[17] 21.0 -10.5 25.0 -12.5
as requested.
How it works
The requirement is that
negative numbers should be multiplied by 1/2,
positive numbers should not be changed.
This can be translated to
if sign(a) == -1 then multiply a by 1/2
if sign(a) == +1 then multiply a by 1
Now, we need to find a linear function y = f(x) = p * x + q which satisfies the equations
f(-1) = -p + q = 1/2
f(1) = p + q = 1.
After solving for p and q we get f(x) = 1/4 * x + 3/4. With x = sign(a), the factor to multiply each element of a with is:
1/4 * sign(a) + 3/4
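As a quick sanity check (just a sketch, reusing the vector a from above), the arithmetic form agrees with a plain ifelse():
# the factor is 1/2 where sign(a) is -1 and 1 where sign(a) is +1
all.equal((sign(a) + 3) / 4 * a, ifelse(a < 0, a / 2, a))
# [1] TRUE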
You could use ifelse():
half_if_neg <- function(x) {ifelse(x < 0, x / 2, x)}
#sapply(a, half_if_neg)
half_if_neg(a)
We can also do the assignment directly:
i1 <- a < 0
a[i1] <- a[i1]/2
Or using pmax(): NA^(a >= 0) is 1 for negative elements and NA otherwise, so the second argument is a/2 where a is negative and NA elsewhere, and pmax() with na.rm = TRUE keeps the original value wherever the second argument is NA.
pmax(a, (a*NA^(a >=0))/2, na.rm = TRUE)
#[1] 1.0 -0.5 2.0 -1.0 3.0 -1.5 4.0 -2.0 5.0 -2.5 11.0
#[12] -5.5 12.0 -6.0 13.0 -6.5 21.0 -10.5 25.0 -12.5

Estimating missing values in time-series data frame based on a "rate of change"

I am trying to use a loop in R to estimate values that will replace the NAs in my data frame, based on a rate of change ("rate") that multiplies my last known value (OK, this is confusing, but please refer to the example below). This is something similar to my data:
l1 <- c(NA,NA,NA,27,31,0.5)
l2 <- c(NA,8,12,28,39,0.5)
l3 <- c(NA,NA,NA,NA,39,0.3)
l4 <- c(NA,NA,11,15,31,0.2)
l5 <- c(NA,NA,NA,NA,51,0.9)
data <- as.data.frame(rbind(l1,l2,l3,l4,l5))
colnames(data) <- c("dbh1","dbh2","dbh3","dbh4","dbh5","rate")
So I created a loop to identify the first non-NA value in each row, then use that value to estimate the previous values based on the "rate". So, for instance, in row 1 the first NA value would be replaced by 27 - (0.5*3), the second one by 27 - (0.5*2) and the third one by 27 - (0.5*1). This is the loop I came up with. I know the first part (the outer loop) works, but the inner one doesn't:
for (i in 1:nrow(data)) {
  dbh.cols <- data3[i, c("dbh1","dbh2","dbh3","dbh4","dbh5")]
  sample.year <- which(dbh.cols != "NA")
  data$first.dbh[i] <- min(dbh.cols, na.rm = T)
  data$first.index[i] <- min(sample.year)
  for (j on 1:(min(sample.year)-1)) {
    ifelse(is.na(data[i,j]), min(dbh.cols, na.rm = T) - (min(sample.year)-j)*rate[i,j], data[i,j])
  }
}
I am not good at programming, so my inner-loop strategy with ifelse() is probably too weird (and wrong), but I just couldn't think of anything else that would work here... Any suggestions?
1) This uses no explicit loops, just an apply. It assumes that the NAs are all leading as in the example given.
fillIn <- function(x) {
  rate <- tail(x, 1)
  n <- sum(is.na(x))   # number of (leading) NAs
  c(x[n + 1] - rate * seq(n, 1), na.omit(x))
}
replace(data, TRUE, t(apply(data, 1, fillIn)))
giving:
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9
2) Here is a second approach that uses na.approx from the zoo package. It does not require apply. Here data1 has the same content as data except that the first column is filled in. The other NAs remain. The last line uses na.approx to fill in the remaining NAs linearly.
library(zoo)
NAs <- rowSums(is.na(data))
data1 <- cbind( data[cbind(1:nrow(data), NAs + 1)] - data$rate * NAs, data[-1] )
replace(data, TRUE, t(na.approx(t(data1))))
giving:
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9
2a) A variation on (2) uses na.locf in the middle line to bring forward the first non-NA in each row. The first and last lines are the same.
library(zoo)
NAs <- rowSums(is.na(data))
data1 <- cbind(na.locf(t(data), fromLast = TRUE)[1, ] - data$rate * NAs, data[-1])
replace(data, TRUE, t(na.approx(t(data1))))
You do not need nested for loops for this. Here is some simplified code that does what you want with plain loops. Working explicitly with your data, we first need to get the first non-NA value from each row.
for_estimate <- apply(data, 1, function(x) x[min(which(is.na(x) == FALSE))])
Secondly, we need to determine what integer to multiply the rate by for each row depending on how many NA values there are.
# total number of NA values per row
n_na <- apply(data, 1, function(x) sum(is.na(x)))
# make it a matrix with 0's appended on
n_na <- matrix(c(n_na, rep(0, nrow(data) * (ncol(data) - 1))),
               nrow = nrow(data), ncol = ncol(data) - 1)
# fill in the rest of the matrix
for (i in 2:ncol(n_na)) {
  n_na[, i] <- n_na[, i - 1] - 1
}
Once we have that, we can use this code to back-fill the NA values in the way you are interested in.
for (i in (ncol(data) - 1):1) {
  if (sum(is.na(data[, i])) > 0) {
    to_fill <- which(is.na(data[, i]) == TRUE)
    data[to_fill, i] <- for_estimate[to_fill] - (data$rate[to_fill] * n_na[to_fill, i])
  }
}
output
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9
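As an optional sanity check (just a sketch, assuming data now holds the filled values and n_na is the matrix built above): adding rate times the number of leading NAs back onto the first column should recover each row's first observed value.
# e.g. 25.5 + 0.5*3 = 27, 7.5 + 0.5*1 = 8, 37.8 + 0.3*4 = 39, and so on
data$dbh1 + data$rate * n_na[, 1]
# [1] 27  8 39 11 51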

Compute mean across intervals of rows

I have a long matrix, for which I need to compute the mean over certain intervals of rows. At the moment I am doing this manually, like this:
values <- cbind(1:50,1)
meanqual10 <- mean(values[1:10,1])
meanqual10
[1] 5.5
meanqual15 <- mean(values[11:20,1])
meanqual15
[1] 15.5
meanqual20 <- mean(values[21:30,1])
meanqual20
[1] 25.5
meanqual25 <- mean(values[31:40,1])
meanqual25
[1] 35.5
meanqual30 <- mean(values[41:50,1])
meanqual30
[1] 45.5
There must be a nicer way of doing this. Can anybody help please?
Señor O's answer is nice if you have regular intervals. Another approach, if you want to select arbitrary rows, could be something like:
l <- list(1:10,11:20,21:30,31:40,41:50) # vectors of any length or ordering
sapply(l, function(x) mean(values[x,1]))
Which gives:
[1] 5.5 15.5 25.5 35.5 45.5
And of course by() can do this for arbitrary rows, too. This is just a slightly different approach.
by(values[,1], ceiling(1:50 / 10), mean)
ceiling(1:50/10) creates a vector of length 50 with a new integer every 10 numbers.
by() will then take the mean for each group of the same integer.
Result:
ceiling(1:50/10): 1
[1] 5.5
------------------------------------------------------------------------------------------------------
ceiling(1:50/10): 2
[1] 15.5
------------------------------------------------------------------------------------------------------
ceiling(1:50/10): 3
[1] 25.5
------------------------------------------------------------------------------------------------------
ceiling(1:50/10): 4
[1] 35.5
------------------------------------------------------------------------------------------------------
ceiling(1:50/10): 5
[1] 45.5
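If the intervals all have the same length, another common idiom (just a sketch, using the values matrix from above) is to reshape the column into a matrix and take column means:
# each column of the reshaped matrix holds one block of 10 consecutive rows
colMeans(matrix(values[, 1], nrow = 10))
# [1]  5.5 15.5 25.5 35.5 45.5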

Apply formula for between-species comparison

I have a data frame laid out in the following manner:
Species Trait.p Trait.y Trait.z
a 20.1 7.2 14.1
b 20.4 8.3 15.2
b 19.2 6.8 13.9
I would like to apply, for each species combination, (Xa) - (Xb), where X is the trait value, the letter is the species, and Xa > Xb. I.e. the larger value of each respective species combination has to come first, calculated for every trait.
Would this be a multi-step process?
An example output could be
Combination Trait.p Trait.y Trait.z
a/b 0.3 1.1 1.1
I assumed you choose the largest value, but David brings up a good point. I doubt this is the best approach, but I think it gives you what you're after. Note I added a c, as I'm sure your problem is a bit more complex than just a and b:
dat <- read.table(text="Species Trait.p Trait.y Trait.z
a 20.1 7.2 14.1
b 20.4 8.3 15.2
b 19.2 6.8 13.9
c 14.2 3.8 11.9", header=T)
li <- lapply(split(dat, dat$Species), function(x) apply(x[, -1], 2, max))
com <- expand.grid(names(li), names(li))
inds <- com[com[, 1] != com[, 2], ]
inds <- t(apply(inds, 1, sort))
inds <- inds[!duplicated(inds), ]
ans <- lapply(1:nrow(inds), function(i) {
  abs(li[[inds[i, 1]]] - li[[inds[i, 2]]])
})
cbind(Combination = paste(inds[, 1], inds[, 2], sep = "/"),
      as.data.frame(do.call(rbind, ans)))
This gives us:
Combination Trait.p Trait.y Trait.z
1 a/b 0.3 1.1 1.1
2 a/c 5.9 3.4 2.2
3 b/c 6.2 4.5 3.3
Sorry for the lack of annotation but I'm heading to class.
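A slightly more compact variant of the same idea (just a sketch, assuming the dat object from above) builds the species pairs with combn() instead of expand.grid():
# per-species maxima as a traits-by-species matrix
li <- sapply(split(dat[-1], dat$Species), function(x) apply(x, 2, max))
# all unordered species pairs, and the absolute trait differences for each pair
combs <- combn(colnames(li), 2)
res <- t(apply(combs, 2, function(s) abs(li[, s[1]] - li[, s[2]])))
data.frame(Combination = paste(combs[1, ], combs[2, ], sep = "/"), res)
# gives the same Combination / Trait.p / Trait.y / Trait.z table as above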
