na.approx Interpolation in R

I'm using zoo's na.approx to fill NA values.
library(zoo)
Bus_data<-data.frame(Action = c("Boarding", "Alighting",NA, NA,"Boarding", "Alighting",NA, NA,"Boarding", "Alighting"),
Distance=c(1,1,2,2,3,3,4,4,5,5),
Time = c(1,2,NA,NA,5,6,NA,NA,9,10))
I'd like the resulting data.frame to look like the following:
Action Distance Time
1 Boarding 1 1
2 Alighting 1 2
3 NA 2 3.5
4 NA 2 3.5
5 Boarding 3 5
6 Alighting 3 6
7 NA 4 7.5
8 NA 4 7.5
9 Boarding 5 9
10 Alighting 5 10
However, when I use
na.approx(Bus_data$Time, Bus_data$Distance, ties = "ordered")
some of the observed Time values change:
1 Boarding 1 2 <-Value Changes
2 Alighting 1 2
3 NA 2 3.5
4 NA 2 3.5
5 Boarding 3 6 <-Value Changes
6 Alighting 3 6
7 NA 4 7.5
8 NA 4 7.5
9 Boarding 5 10 <-Value Changes
10 Alighting 5 10
Any idea how I could get the desired outcome through na.approx? Note that "Distance" is evenly spaced in the example for simplicity; the real dataset has varying distances.

You can use approx from base R:
Time = c(1,2,NA,NA,5,6,NA,NA,9,10)
approx(Time, method = "constant", n = length(Time), f = .5)$y
Result
# [1] 1.0 2.0 3.5 3.5 5.0 6.0 7.5 7.5 9.0 10.0
From ?approx
f :
for method = "constant" a number between 0 and 1 inclusive, indicating a compromise between left- and right-continuous step functions. If y0 and y1 are the values to the left and right of the point then the value is y0 if f == 0, y1 if f == 1, and y0*(1-f)+y1*f for intermediate values. In this way the result is right-continuous for f == 0 and left-continuous for f == 1, even for non-finite y values.
With na.approx it would be similar
library(zoo)
na.approx(Time, method = "constant", f = .5)
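To see what f does in isolation, here is a small sketch on a shortened copy of the Time vector. The names time_short and fill_constant are made up for this illustration. Only the gap between 2 and 5 is affected; observed values come back unchanged.

```r
# Shortened copy of the Time vector; `time_short` is a hypothetical name
time_short <- c(1, 2, NA, NA, 5)

# Hypothetical helper: fill the gap with a step function for a given f
fill_constant <- function(f) {
  approx(time_short, method = "constant", n = length(time_short), f = f)$y
}

left  <- fill_constant(0)    # gap takes the value on its left:  1 2 2.0 2.0 5
mid   <- fill_constant(0.5)  # gap takes the average of both:    1 2 3.5 3.5 5
right <- fill_constant(1)    # gap takes the value on its right: 1 2 5.0 5.0 5
```

Note that f = .5 gives the plain average of the two neighbouring values wherever the missing point sits in the gap, so it does not take Distance into account.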

We could set the positions that were non-NA in the original column to NA in the na.approx result and then coalesce with the original column, so that observed values are kept and only the gaps are filled:
library(dplyr)
library(zoo)
coalesce(Bus_data$Time, replace(na.approx(Bus_data$Time,Bus_data$Distance,
ties = "ordered" ),
!is.na(Bus_data$Time), NA))
#[1] 1.0 2.0 3.5 3.5 5.0 6.0 7.5 7.5 9.0 10.0
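Since the real data has uneven distances, another possibility is to interpolate Time against Distance using only the observed rows and write the result back into the NA rows only. This is a base R sketch, not the method of either answer above. Averaging the tied Boarding/Alighting times with ties = mean is an assumption about which value the interpolation should pass through at each stop, and the names known, fit and filled are made up here.

```r
Bus_data <- data.frame(
  Distance = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  Time     = c(1, 2, NA, NA, 5, 6, NA, NA, 9, 10)
)

known <- !is.na(Bus_data$Time)

# Linear interpolation of Time over Distance, evaluated only at the NA rows.
# ties = mean collapses the two observations per stop to their average
# (an assumption, see the text above).
fit <- approx(
  x    = Bus_data$Distance[known],
  y    = Bus_data$Time[known],
  xout = Bus_data$Distance[!known],
  ties = mean
)

# Observed values stay untouched; only the gaps are filled
filled <- Bus_data$Time
filled[!known] <- fit$y
filled
```

Because the interpolation runs over Distance, a missing point that lies closer to one stop than the other would be pulled towards that stop's time, which a plain midpoint would not do.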

Related

Impute missing values for missing dates

Imagine I have the following two data frames:
> sp
date value
1 2004-08-20 1
2 2004-08-23 2
3 2004-08-24 4
4 2004-08-25 5
5 2004-08-26 10
6 2004-08-27 11
> other
date value
1 2004-08-20 2
2 2004-08-23 4
3 2004-08-24 5
4 2004-08-25 10
5 2004-08-27 11
where the first column holds the dates and the second the values for each day. The reference data frame is sp, and I want to add to other the dates that are missing relative to sp, together with imputed values. For instance, in this case the date "2004-08-26" is missing from other. I should add a new row to other with the date "2004-08-26" and a value given by the mean of the values at "2004-08-25" and "2004-08-27".
Could anyone suggest how I can do this?
Data
sp <- data.frame(date=c("2004-08-20", "2004-08-23", "2004-08-24", "2004-08-25",
"2004-08-26", "2004-08-27"), value=c(1, 2, 4, 5, 10, 11))
other <- data.frame(date=c("2004-08-20", "2004-08-23", "2004-08-24", "2004-08-25",
"2004-08-27"), value=c(2, 4, 5, 10, 11))
An option using zoo::na.approx :
library(dplyr)
sp %>%
select(date) %>%
left_join(other, by = 'date') %>%
mutate(value = zoo::na.approx(value))
# date value
#1 2004-08-20 2.0
#2 2004-08-23 4.0
#3 2004-08-24 5.0
#4 2004-08-25 10.0
#5 2004-08-26 10.5
#6 2004-08-27 11.0
If I understand correctly, you want to add dates from sp that are missing in other.
You can merge other with just the date column of sp. Note that selecting a single column of a data frame (or matrix) drops the dimensions by default, so we need drop=FALSE.
The resulting NA can then be linearly interpolated, e.g. using approx, which gives the desired mean.
other2 <- merge(other, sp[, 'date', drop=FALSE], all=TRUE) |>
transform(value=approx(value, xout=seq_along(value))$y)
other2
# date value
# 1 2004-08-20 2.0
# 2 2004-08-23 4.0
# 3 2004-08-24 5.0
# 4 2004-08-25 10.0
# 5 2004-08-26 10.5 ## interpolated
# 6 2004-08-27 11.0
Note: For R < 4.1, do:
transform(merge(other, sp[, "date", drop = FALSE], all = TRUE),
value = approx(value, xout = seq_along(value))$y)
# date value
# 1 2004-08-20 2.0
# 2 2004-08-23 4.0
# 3 2004-08-24 5.0
# 4 2004-08-25 10.0
# 5 2004-08-26 10.5 ## interpolated
# 6 2004-08-27 11.0
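Both answers interpolate over the row index, which matches the mean of the neighbours here because a single date is missing between two observed ones. If gaps can span several or unevenly spaced dates, interpolating against the actual dates may be closer to what is wanted. A sketch under that assumption, using only base R (x_known, x_all and other_full are made-up names):

```r
sp <- data.frame(
  date  = c("2004-08-20", "2004-08-23", "2004-08-24", "2004-08-25",
            "2004-08-26", "2004-08-27"),
  value = c(1, 2, 4, 5, 10, 11)
)
other <- data.frame(
  date  = c("2004-08-20", "2004-08-23", "2004-08-24", "2004-08-25",
            "2004-08-27"),
  value = c(2, 4, 5, 10, 11)
)

# Convert dates to day counts so approx can use them as the x axis
x_known <- as.numeric(as.Date(other$date))
x_all   <- as.numeric(as.Date(sp$date))

# Evaluate the linear interpolant of `other` at every date present in `sp`
other_full <- data.frame(
  date  = sp$date,
  value = approx(x_known, other$value, xout = x_all)$y
)
other_full
```

For this data the result is the same as above (10.5 for 2004-08-26), since that date lies exactly halfway between its neighbours.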

How to compare values with more or less

I have a data frame and I want to compare the values in row 3 using an if statement, treating two values as equal when they are more or less the same.
Let's say I want to treat values in the third row as equal when they differ by no more than 0.2.
> data
NAME A B C D
first 3 2 4 5
second 1 2 3 4
third 7 7.1 7.5 6.9
four 2 1 0 5
Here is a program that compares the exact values:
for (i in 1:3) {
d <- i+1
for (j in d:4) {
if(data [3,i] == data [3,j] ){
print(paste("The columns" , colnames(data[,i]) ,"and " , colnames(data[,i]) , "are equal"))
}
}
}
This returns nothing because the program compares exact values, whereas I want to treat values as equal when they are within plus or minus 0.2 of each other.
The result I want is
the column A and B are equal
the column A and D are equal
because A (= 7) is within 0.2 of B (= 7.1),
and the same holds for D:
A (= 7) is within 0.2 of D (= 6.9).
Thank you
Take the combination of columns then compare with tolerance:
df1 <- read.table(text ="
NAME A B C D
first 3 2 4 5
second 1 2 3 4
third 7 7.1 7.5 6.9
four 2 1 0 5", header = TRUE)
tolerance = 0.2
cbind(df1,
combn(colnames(df1[, 2:5]), 2, FUN = function(x){
paste0(x[ 1 ],
ifelse(abs(df1[, x[ 1 ] ] - df1[, x[ 2 ] ]) <= tolerance, "=", "!="),
x[ 2 ])
}))
# NAME A B C D 1 2 3 4 5 6
# 1 first 3 2.0 4.0 5.0 A!=B A!=C A!=D B!=C B!=D C!=D
# 2 second 1 2.0 3.0 4.0 A!=B A!=C A!=D B!=C B!=D C!=D
# 3 third 7 7.1 7.5 6.9 A=B A!=C A=D B!=C B=D C!=D
# 4 four 2 1.0 0.0 5.0 A!=B A!=C A!=D B!=C B!=D C!=D
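If only the messages for one row are needed, as in the question, the same tolerance check can be applied to that row alone. This is a sketch along the lines of the answer above; row_id, col_pairs, is_close and msgs are names made up for it.

```r
df1 <- data.frame(
  NAME = c("first", "second", "third", "four"),
  A = c(3, 1, 7, 2),
  B = c(2, 2, 7.1, 1),
  C = c(4, 3, 7.5, 0),
  D = c(5, 4, 6.9, 5)
)

tolerance <- 0.2
row_id    <- 3                                   # the row to inspect
col_pairs <- combn(c("A", "B", "C", "D"), 2)     # one column pair per matrix column

# Absolute difference for every column pair in the chosen row
left     <- unlist(df1[row_id, col_pairs[1, ]], use.names = FALSE)
right    <- unlist(df1[row_id, col_pairs[2, ]], use.names = FALSE)
is_close <- abs(left - right) <= tolerance

msgs <- paste("The columns", col_pairs[1, is_close], "and",
              col_pairs[2, is_close], "are equal")
print(msgs)
```

For row 3 this reports A/B and A/D, and also B/D, in line with the B=D entry in the table above. Differences that land right on the tolerance are sensitive to floating-point rounding, so a slightly larger tolerance may be safer if boundary cases matter.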

Resizing and interpolating middle values in column in R

I have a dataframe.
df <- data.frame(level = c(1:10), values = c(3,4,5,6,8,9,4,2,1,6))
I would like to resize it to fewer levels, let's say 6 levels,
where level 1 and level 10 correspond to level 1 and level 6 in the new data frame. (I just guessed some floats in between; I'm not sure what the result would actually be.)
level value
1 3
2 3.4
3 4.6
4 6.2
5 2.2
6 6
How would I go about doing this?
Maybe you want to use approxfun for interpolation like below?
data.frame(
level = 1:6,
values = approxfun(df$level, df$values)(seq(1, nrow(df), length.out = 6))
)
which gives
level values
1 1 3.0
2 2 4.8
3 3 7.2
4 4 7.0
5 5 1.8
6 6 6.0
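An equivalent sketch uses approx directly: its n argument requests n equally spaced points between the smallest and largest x, which is the same grid as seq(1, nrow(df), length.out = 6). The names n_out and resized are made up for this example.

```r
df <- data.frame(level = 1:10, values = c(3, 4, 5, 6, 8, 9, 4, 2, 1, 6))

n_out <- 6  # number of levels in the resized data frame

# approx() evaluates the linear interpolant at n_out equally spaced levels
resized <- data.frame(
  level  = seq_len(n_out),
  values = approx(df$level, df$values, n = n_out)$y
)
resized
```

The first and last values are preserved, while interior values are read off the straight lines joining neighbouring original points, so peaks that fall between the new grid points can be flattened.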

Row Means based on Column Substring

I have a dataframe that looks like this:
df <- data.frame("CB_1.1"=c(0,5,6,2), "CB_1.16"=c(1,5,3,6), "HC_2.11"=c(3,3,4,5), "HC_1.12"=c(2,3,4,5), "HC_1.13"=c(1,0,0,5))
> df
CB_1.1 CB_1.16 HC_2.11 HC_1.12 HC_1.13
1 0 1 3 2 1
2 5 5 3 3 0
3 6 3 4 4 0
4 2 6 5 5 5
I would like to take the row-wise mean of the columns that share the part of the column name before the ".", resulting in a data frame like this:
CB_1 HC_2 HC_1
1 0.5 3 1.5
2 5.0 3 1.5
3 4.5 4 2.0
4 4.0 5 5.0
You'll notice that the column HC_2.11 values remain the same, because no other column has HC_2 in this dataframe.
Any help would be appreciated!
1) apply/tapply For each row use tapply on it using an INDEX of the name prefixes and a function mean. Transpose the result. No packages are used.
prefix <- sub("\\..*", "", names(df))
t(apply(df, 1, tapply, prefix, mean))
giving this matrix (wrap it in data.frame(...) if you need a data frame result):
CB_1 HC_1 HC_2
[1,] 0.5 1.5 3
[2,] 5.0 1.5 3
[3,] 4.5 2.0 4
[4,] 4.0 5.0 5
2) lm Run the indicated regression. The +0 in the formula means don't add on an intercept. The transpose of the coefficients will be the required matrix, m. In the next line make the names nicer. prefix is from (1). No packages are used.
m <- t(coef(lm(t(df) ~ prefix + 0)))
colnames(m) <- sub("prefix", "", colnames(m))
m
giving this matrix
CB_1 HC_1 HC_2
[1,] 0.5 1.5 3
[2,] 5.0 1.5 3
[3,] 4.5 2.0 4
[4,] 4.0 5.0 5
This follows from the facts that (1) the model matrix X contains only ones and zeros and (2) distinct columns of it are orthogonal. The model matrix is shown here:
X <- model.matrix(~ prefix + 0) # model matrix
X
giving:
prefixCB_1 prefixHC_1 prefixHC_2
1 1 0 0
2 1 0 0
3 0 0 1
4 0 1 0
5 0 1 0
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$prefix
[1] "contr.treatment"
Because the columns of the model matrix X are orthogonal the coefficient corresponding to any column for a particular row, y, of df (column of t(df)) is just sum(x * y) / sum(x * x) and since x is a 0/1 vector that equals the mean of the values of y corresponding to the 1's in x.
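The argument above can be checked numerically for one row and one group. The sketch below rebuilds df and prefix from the question and (1); y and x are names chosen to mirror the text.

```r
df <- data.frame("CB_1.1" = c(0, 5, 6, 2), "CB_1.16" = c(1, 5, 3, 6),
                 "HC_2.11" = c(3, 3, 4, 5), "HC_1.12" = c(2, 3, 4, 5),
                 "HC_1.13" = c(1, 0, 0, 5))
prefix <- sub("\\..*", "", names(df))
X <- model.matrix(~ prefix + 0)

# Orthogonality: the off-diagonal entries of X'X are all zero
XtX <- crossprod(X)

y <- unlist(df[1, ])       # first row of df, i.e. first column of t(df)
x <- X[, "prefixHC_1"]     # 0/1 indicator of the HC_1 columns

coef_by_formula <- sum(x * y) / sum(x * x)   # projection onto x
group_mean      <- mean(y[prefix == "HC_1"]) # plain mean of the HC_1 values
c(coef_by_formula, group_mean)               # both are 1.5
```

sum(x * y) adds up the y values where x is 1, and sum(x * x) counts how many there are, which is exactly the definition of a mean.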
3) stack/tapply Convert to long form inserting an id column at the same time. Then use tapply to convert back to wide form tapply-ing mean. No packages are used.
long <- transform(stack(df), ind = sub("\\..*", "", ind), id = c(row(df)))
with(long, tapply(values, long[c("id", "ind")], mean))
giving this table. Wrap it in as.data.frame.matrix if you want a data.frame.
ind
id CB_1 HC_1 HC_2
1 0.5 1.5 3
2 5.0 1.5 3
3 4.5 2.0 4
4 4.0 5.0 5
Here is a base R solution using rowMeans + split.default, i.e.,
dfout <- as.data.frame(Map(rowMeans, split.default(df,factor(s <- gsub("\\..*$","",names(df)), levels = unique(s)))))
such that
> dfout
CB_1 HC_2 HC_1
1 0.5 3 1.5
2 5.0 3 1.5
3 4.5 4 2.0
4 4.0 5 5.0
If you do not mind the order of column names, you can use the shorter code below
dfout <- as.data.frame(Map(rowMeans,split.default(df,gsub("\\..*$","",names(df)))))
such that
> dfout
CB_1 HC_1 HC_2
1 0.5 1.5 3
2 5.0 1.5 3
3 4.5 2.0 4
4 4.0 5.0 5
One option involving dplyr and purrr could be:
library(dplyr)
library(purrr)
map_dfc(.x = unique(sub("\\..*$", "", names(df))),
~ df %>%
transmute(!!.x := rowMeans(select(., starts_with(.x)))))
CB_1 HC_2 HC_1
1 0.5 3 1.5
2 5.0 3 1.5
3 4.5 4 2.0
4 4.0 5 5.0
A base option could be:
#find column names splitting on "."
cols <- unique(sapply(strsplit(names(df), ".", fixed = TRUE), `[`, 1))
#loop through each column name and find the rowMeans
as.data.frame(sapply(cols, function (x) rowMeans(df[grep(x, names(df))])))
CB_1 HC_2 HC_1
1 0.5 3 1.5
2 5.0 3 1.5
3 4.5 4 2.0
4 4.0 5 5.0
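One caveat with grep(x, names(df)): the prefix is treated as a regular expression and matched anywhere in the name, so a prefix such as CB_1 would also pick up a column named, say, CB_10.1. No such column exists in the example, so the result above is unaffected. A sketch that compares the prefixes exactly instead (prefix and out are made-up names):

```r
df <- data.frame("CB_1.1" = c(0, 5, 6, 2), "CB_1.16" = c(1, 5, 3, 6),
                 "HC_2.11" = c(3, 3, 4, 5), "HC_1.12" = c(2, 3, 4, 5),
                 "HC_1.13" = c(1, 0, 0, 5))

# Part of each column name before the first "."
prefix <- sub("\\..*$", "", names(df))
cols   <- unique(prefix)

# For each prefix, average the columns whose prefix is exactly equal to it
out <- as.data.frame(sapply(cols, function(p) rowMeans(df[prefix == p])))
out
```

Using unique(prefix) keeps the columns in their order of first appearance (CB_1, HC_2, HC_1), as in the desired output.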

Vector: How to filter out segments of decreasing values

Let's say there is a vector sim that contains the following sequence of numbers:
1
2
4
7
5
3
2.5
4
6
How can I filter out all the segments of decreasing values in order to achieve sim with only increasing values? The expected result:
1
2
4
7
2.5
4
6
Based on @akrun's suggestion:
dif <- diff(sim) > 0
sim[ c(dif[1], dif) | c(dif, dif[length(dif)]) ]
[1] 1.0 2.0 4.0 7.0 2.5 4.0 6.0
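The indexing above keeps an element if the step into it or the step out of it is an increase. A self-contained sketch that spells this out, with sim defined from the question and the intermediate names (ends_rise, starts_rise, kept) made up for readability:

```r
sim <- c(1, 2, 4, 7, 5, 3, 2.5, 4, 6)

# dif[i] is TRUE when sim[i + 1] > sim[i]
dif <- diff(sim) > 0

# Element i ends an increasing step (the first element reuses the first step)
ends_rise <- c(dif[1], dif)

# Element i starts an increasing step (the last element reuses the last step)
starts_rise <- c(dif, dif[length(dif)])

# Keep elements that take part in at least one increasing step
kept <- sim[ends_rise | starts_rise]
kept
```

Here 5 and 3 are dropped because both of their neighbouring steps are decreases, while 2.5 is kept because it starts the next increasing run.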
