I have two NAs in my data and I simply need to do a linear interpolation to find their values, but I don't understand why my attempt does not work.
Here is the data.
It is quite big.
Here is what I've tried:
id1 <- as.numeric(ID1)
anyNA(id1)
# there are 2 of them
sum(is.na(id1))
is.na(id1)
na46 <- approx(x=c(95.4968:101.491), y=c(103.856 : 44.7562), method = "linear")
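A note on why this last line fails: the colon operator does not mark two endpoints, it generates a sequence in steps of 1, so c(95.4968:101.491) has 6 elements while c(103.856:44.7562) has 60, and approx() will refuse the call because x and y differ in length. A minimal sketch of what approx() expects instead, treating the numbers above as the two known neighbouring points (whether these really flank the NA is an assumption) and xout as the position to interpolate at:
known_x <- c(95.4968, 101.491)  # x at the two known neighbours (illustrative)
known_y <- c(103.856, 44.7562)  # y observed at those x positions
approx(x = known_x, y = known_y, xout = 98.5, method = "linear")
# returns the linearly interpolated y at x = 98.5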
Use na.approx. There are arguments that can be used to specify how to deal with NAs on the ends. It assumes that the data are equally spaced unless you use the x= argument to specify other x values. See ?na.approx for more details.
library(zoo)
y <- c(1, 2, NA, 4, 5, NA, 7)
na.approx(y)
## [1] 1 2 3 4 5 6 7
y2 <- c(NA, 1, NA, 2, NA)
na.approx(y2)
## [1] 1.0 1.5 2.0
na.approx(y2, na.rm = FALSE)
## [1] NA 1.0 1.5 2.0 NA
na.approx(y2, rule = 2)
## [1] 1.0 1.0 1.5 2.0 2.0
# BOD comes with R. Create version where demand[5] is NA.
BOD2 <- transform(BOD, demand = replace(demand, 5, NA))
transform(BOD2, demand = na.approx(demand)) # without x=
## Time demand
## 1 1 8.3
## 2 2 10.3
## 3 3 19.0
## 4 4 16.0
## 5 5 17.9
## 6 7 19.8
transform(BOD2, demand = na.approx(demand, x = Time)) # with x=
## Time demand
## 1 1 8.30000
## 2 2 10.30000
## 3 3 19.00000
## 4 4 16.00000
## 5 5 17.26667
## 6 7 19.80000
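Applied to the question (assuming id1 is the numeric vector holding the two NAs and that its observations are equally spaced), the fix could be as short as:
library(zoo)
id1_filled <- na.approx(id1)  # assumes equal spacing; use x= otherwise
anyNA(id1_filled)             # should now be FALSE
# note: leading/trailing NAs are dropped by default; use rule = 2 or
# na.rm = FALSE (as shown above) to handle NAs at the ends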
I have a partially filled-in table; there are NAs at the top and bottom (column X below). I want to fill them in using a rate (0.3) to get the results in the Goal column. This is similar to the fill up/down feature in Excel used to copy a formula into adjacent cells.
df <- data.frame(X = matrix(nrow = 10, ncol = 1, NA))
df[3:5, 1] <- 2:4
X Goal
1 NA 1.4
2 NA 1.7
3 2 2
4 3 3
5 4 4
6 NA 4.3
7 NA 4.6
8 NA 4.9
9 NA 5.2
10 NA 5.5
Essentially what I want the code to do is this:
1.4 (X2 answer - 0.3)
1.7 (2 - 0.3)
2
3
4
4.3 (4 + 0.3)
4.6 (X6 answer + 0.3)
4.9 (X7 answer + 0.3)
5.2 (X8 answer + 0.3)
5.5 (X9 answer + 0.3)
I know this can probably be done using loops, but I find them intimidating given my skill level, so I'm looking for a solution that avoids them (if that's even possible).
Avoiding loops with nafill() and fcoalesce() from data.table.
library(data.table)
# positions of the first and last non-NA values
loc = range(which(!is.na(df$X)))
# nearest known value (carried forward, then backward), plus the rate
# times the distance from that value
df$Goal =
  fcoalesce(nafill(df$X, "locf"), nafill(df$X, "nocb")) +
  c(-((loc[1] - 1):1) * 0.3, rep(0, diff(loc) + 1), (1:(nrow(df) - loc[2])) * 0.3)
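To see what the one-liner is doing, here is a breakdown of its two intermediate pieces, evaluated on the question's df:
fcoalesce(nafill(df$X, "locf"), nafill(df$X, "nocb"))
# [1] 2 2 2 3 4 4 4 4 4 4   (nearest known value, carried outward)
c(-((loc[1] - 1):1) * 0.3, rep(0, diff(loc) + 1), (1:(nrow(df) - loc[2])) * 0.3)
# [1] -0.6 -0.3 0.0 0.0 0.0 0.3 0.6 0.9 1.2 1.5   (rate times distance)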
Still, it is (arguably) much easier to keep track of what is happening in each case with a loop:
# Preallocate
df$Goal = 0
for (i in 1:nrow(df)) {
if (i < loc[1]) df$Goal[i] = df$X[loc[1]] - (loc[1] - i) * 0.3
else if (i > loc[2]) df$Goal[i] = df$X[loc[2]] + (i - loc[2]) * 0.3
else df$Goal[i] = df$X[i]
}
# X Goal
# 1 NA 1.4
# 2 NA 1.7
# 3 2 2.0
# 4 3 3.0
# 5 4 4.0
# 6 NA 4.3
# 7 NA 4.6
# 8 NA 4.9
# 9 NA 5.2
# 10 NA 5.5
I'm back to using R after using SAS for a few years, and I'm relearning everything again.
I have a dataset with a variable Lot_Size, which contains continuous data ranging from 0.1980028 to 1.2000000 acres. I'd like to categorize this variable into a new variable LS_cat based on these demarcations:
0 - 1/3 acre = 0
1/3 - 2/3 acre = 1
2/3 - 1 acre = 2
1+ acre = 3
I've explored the mutate command, but I keep getting errors. Anyone have any ideas?
UPDATE
Thanks for responding - both solutions worked perfectly. Since this was a learning experience for me, I'll add to the question.
I actually misunderstood the question posed to me. If I were to make dummy variables for each category noted previously, how would I do that? For example, if Lot_Size is 0 - 1/3 of an acre, I want the variable ls_1_3 to be 1, and 0 otherwise. Would I use the ifelse command?
Use case_when().
library(tidyverse)
set.seed(123)
my_df <- tibble(
lot_size = runif(n = 10, min = 0.1980028, max = 1.2)
)
my_df |> mutate(
ls_cat = case_when(lot_size < 1 / 3 ~ 0,
lot_size < 2 / 3 ~ 1,
lot_size < 1 ~ 2,
TRUE ~ 3)
)
#> A tibble: 10 x 2
#> lot_size ls_cat
#> <dbl> <dbl>
#> 1 0.486 1
#> 2 0.988 2
#> 3 0.608 1
#> 4 1.08 3
#> 5 1.14 3
#> 6 0.244 0
#> 7 0.727 2
#> 8 1.09 3
#> 9 0.751 2
#> 10 0.656 1
case_when() is usually a sound solution when there are more than two options (if_else() if there are just two), but in this case there is also a simpler math solution.
my_df <- tibble(lot_size = seq(0, 1.2, by = 0.1))
# scale to thirds, then shift down by just under 1 so each boundary rounds up correctly
my_df$ls_cat <- ceiling((my_df$lot_size * 3) - 0.99)
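For the seq() input above, this should assign the same bins that case_when() would:
my_df$ls_cat
# [1] 0 0 0 0 1 1 1 2 2 2 3 3 3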
Though, this may be less instructive on R programming.
For your follow-on question, ifelse() works well, e.g.
Base:
my_df$ls_1_3 <- ifelse(my_df$lot_size < 1/3, 1, 0)
Or Tidyverse:
my_df <- my_df %>%
mutate(ls_1_3 = if_else(lot_size < 1/3, 1, 0))
NB: if_else() is a more pedantic version of ifelse(). Both should work equally well here, but if_else() is better for catching possible errors.
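And if you need one dummy per category rather than just the first, the same pattern extends; a sketch (the column names other than ls_1_3 are invented here):
my_df <- my_df %>%
  mutate(ls_1_3   = if_else(lot_size < 1/3, 1, 0),
         ls_2_3   = if_else(lot_size >= 1/3 & lot_size < 2/3, 1, 0),
         ls_1     = if_else(lot_size >= 2/3 & lot_size < 1, 1, 0),
         ls_1plus = if_else(lot_size >= 1, 1, 0))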
We can use findInterval:
Lot_Size <- seq(0.2, 1.2, len=10)
Lot_Size
# [1] 0.2000000 0.3111111 0.4222222 0.5333333 0.6444444 0.7555556 0.8666667 0.9777778 1.0888889 1.2000000
findInterval(Lot_Size, c(0, 1/3, 2/3, 1, Inf), rightmost.closed = TRUE) - 1L
# [1] 0 0 1 1 1 2 2 2 3 3
In this case it returns the index within the breaks vector, which we then convert to your 0-based categories with the trailing - 1L (integer 1).
cut it.
dat <- transform(dat, Lot_Size_cat=
cut(Lot_Size, breaks=c(0, 1/3, 2/3, 1, Inf), labels=0:3,
include.lowest=TRUE))
dat
# X1 Lot_Size Lot_Size_cat
# 1 0.77436849 1.0509024 3
# 2 0.19722419 0.2819626 0
# 3 0.97801384 0.8002238 2
# 4 0.20132735 0.9272001 2
# 5 0.36124443 0.6396998 1
# 6 0.74261194 1.0990851 3
# 7 0.97872844 1.1648617 3
# 8 0.49811371 0.7221819 2
# 9 0.01331584 1.1915689 3
# 10 0.25994613 0.4076475 1
Data:
set.seed(666)
n <- 10
dat <- data.frame(X1=runif(n),
Lot_Size=sample(seq(0.1980028, 1.2, 1e-7), n, replace=TRUE))
I would like to create a column where each entry is the corresponding value from another column plus 80% of the previous entry of the new column. So, if column x is 1, 2, 3, ..., 10, I want column z to be 1, 2.8, 5.24, 8.192, etc.
Here is my failed attempt:
x <- c(1:10)
y <- c("")
df <- data.frame(x,y)
df1 <- df %>%
mutate(y = cumsum(x*0.8))
Result:
x y
1 1 0.8
2 2 2.4
3 3 4.8
4 4 8.0
5 5 12.0
6 6 16.8
7 7 22.4
8 8 28.8
9 9 36.0
10 10 44.0
I would use a for loop to do this. It's important to initialize a vector first, especially if you're working with a large data set.
# initialize
newx <- vector("numeric", length(df$x))
newx[1] <- df$x[1]
for(i in 2:length(df$x)){
newx[i] <- df$x[i] + (0.8 * newx[i-1])
}
newx
# [1] 1.00000 2.80000 5.24000 8.19200 11.55360 15.24288 19.19430 23.35544 27.68435 32.14748
With the addition of purrr, you can do:
df %>%
mutate(y = accumulate(x, ~ .x * 0.8 + .y))
x y
1 1 1.00000
2 2 2.80000
3 3 5.24000
4 4 8.19200
5 5 11.55360
6 6 15.24288
7 7 19.19430
8 8 23.35544
9 9 27.68435
10 10 32.14748
Try using the Reduce function:
Reduce(function(last, current) current + last * 0.8, x = x, accumulate = TRUE)
# [1] 1.00000 2.80000 5.24000 8.19200 11.55360 15.24288 19.19430 23.35544 27.68435 32.14748
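As an aside, this recursion (y[i] = x[i] + 0.8 * y[i-1]) is exactly what base R's stats::filter() computes with method = "recursive", so another loop-free base alternative is:
c(stats::filter(df$x, filter = 0.8, method = "recursive"))  # c() drops the ts class
# [1] 1.00000 2.80000 5.24000 8.19200 11.55360 15.24288 19.19430 23.35544 27.68435 32.14748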
I have a dataframe here:
df <- data.frame("Time" = 1:10, "Value" = c(1.7,NA,-999,-999,1.5,1.6,NA,4,-999,8))
"NA" means there is no observation, just leave them there. "-999" means the observation is identified as an outlier.
Now I am trying to replace the "-999" with the average of the nearest values. For example:
The first "-999" should be replaced with (1.7+1.5)/2 = 1.6
The second "-999" should be replaced with (1.7+1.5)/2 = 1.6
The last "-999" should be replaced with (4.0+8.0)/2 = 6
I tried to use the next statement to find the next iteration and an if statement to decide where to stop. But how can I go back and check previous iterations? Or is there another kind of solution to this?
Many thanks.
One approach utilizing dplyr, purrr and tidyr: for each element, take the closest valid value before it and the closest one after it (skipping NAs and -999s) and average the two:
df %>%
mutate(New_Value = if_else(Value == -999,
map_dbl(.x = seq_along(Value),
~ mean(c(tail(na.omit(na_if(Value[1:(.x - 1)], -999)), 1),
head(na.omit(na_if(Value[(.x + 1):n()], -999)), 1)))),
Value))
Time Value New_Value
1 1 1.7 1.7
2 2 NA NA
3 3 -999.0 1.6
4 4 -999.0 1.6
5 5 1.5 1.5
6 6 1.6 1.6
7 7 NA NA
8 8 4.0 4.0
9 9 -999.0 6.0
10 10 8.0 8.0
Using a few while loops, which bump up how far we lag/lead, we can accomplish this. I am not sure how performant this operation will be on large data sets. But it seems to get the job done for your sample data.
# find where replacements and initialize
where_to_replace <- which(df$Value == -999)
len_replace <- length(where_to_replace)
lag_value <- rep(NA, len_replace)
lead_value <- rep(NA, len_replace)
# more initializing
i <- 1
lag_n <- 1
lead_n <- 1
while(i <= len_replace){
# find appropriate lagged value
# can't use NA or lag value == -999
while(is.na(lag_value[i]) | lag_value[i] == -999){
lag_value[i] <- dplyr::lag(df$Value, lag_n)[where_to_replace[i]]
lag_n <- lag_n + 1
}
# find appropriate lead value
# can't use NA or -999 as lead value
while(is.na(lead_value[i]) | lead_value[i] == -999){
lead_value[i] <- dplyr::lead(df$Value, lead_n)[where_to_replace[i]]
lead_n <- lead_n + 1
}
# reset iterators
i <- i + 1
lag_n <- 1
lead_n <- 1
}
# replacement value
df$Value[where_to_replace] <- (lead_value + lag_value) / 2
# Time Value
# 1 1 1.7
# 2 2 NA
# 3 3 1.6
# 4 4 1.6
# 5 5 1.5
# 6 6 1.6
# 7 7 NA
# 8 8 4.0
# 9 9 6.0
# 10 10 8.0
I created two new helper columns, before and after. before fills every NA and -999 with the last value above it; after fills them with the next value below. In the next step I overwrote each -999 with the mean of the two values.
df <- data.frame(Time = 1:10,
Value = c(1.7, NA, -999, -999, 1.5,
1.6, NA,
4, -999, 8))
library(dplyr)
library(tidyr)
df <- df %>%
mutate(before = recode(Value, `-999` = NA_real_),
after = recode(Value, `-999` = NA_real_)) %>%
fill(before, .direction = "down") %>%
fill(after, .direction = "up") %>%
mutate(Value = case_when(Value == -999 ~ (before + after)/2,
TRUE ~ Value)) %>%
select(Time, Value)
The output:
Time Value
1 1 1.7
2 2 NA
3 3 1.6
4 4 1.6
5 5 1.5
6 6 1.6
7 7 NA
8 8 4.0
9 9 6.0
10 10 8.0
Here is a base R option using findInterval
x <- which(df$Value == -999)              # rows holding outliers
y <- setdiff(which(!is.na(df$Value)), x)  # rows holding usable values
ind <- findInterval(x, y)                 # last usable row before each outlier
dfout <- within(df, Value <- replace(Value, x,
                rowMeans(cbind(Value[y[ind]], Value[y[ind + 1]]))))
such that
> dfout
Time Value
1 1 1.7
2 2 NA
3 3 1.6
4 4 1.6
5 5 1.5
6 6 1.6
7 7 NA
8 8 4.0
9 9 6.0
10 10 8.0
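To unpack the indexing, the intermediate objects evaluate as follows for the question's df:
x    # rows holding -999
# [1] 3 4 9
y    # rows holding usable values
# [1] 1 5 6 8 10
ind  # for each outlier, the usable row immediately before it
# [1] 1 1 4
so Value[y[ind]] and Value[y[ind + 1]] are the flanking values whose rowMeans() produce 1.6, 1.6 and 6.0.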
Sticking with base R data.frames, we can make a function and use sapply over the indices of interest.
outliers <- df$Value == -999 # Keep as logical for now
fillers <- which(!is.na(df$Value) & !outliers)
outliers <- which(outliers) # Now convert to indices; FALSE and NA do not appear
filled_outliers <- sapply(outliers, function(x) {
before_ind = max(fillers[fillers < x]) # maximum INDEX before an outlier
after_ind = min(fillers[fillers > x])
0.5*(df$Value[before_ind] + df$Value[after_ind])
})
df$Value[outliers] <- filled_outliers  # assign into the Value column only
df
Gives:
Time Value
1 1 1.7
2 2 NA
3 3 1.6
4 4 1.6
5 5 1.5
6 6 1.6
7 7 NA
8 8 4.0
9 9 6.0
10 10 8.0
I'm looking to create a hybrid of cumsum() and TTR::runSum(), where cumsum() runs up until a pre-specified number of data points, at which point it acts more like runSum().
For example:
library(TTR)
data <- rep(1:3,2)
cumsum <- cumsum(data)
runSum <- runSum(data, n = 3)
DesiredResult <- ifelse(is.na(runSum), cumsum, runSum)
Is there a way to get to DesiredResult that doesn't require fiddling with NAs?
That is what the partial = TRUE argument to rollapplyr does. Here we show this with sum and also with sd and IQR. (Note that the sd of one value is NA, and we chose IQR since it is a measure of spread that can be calculated for scalars, although it is always 0 in that case.)
library(zoo)
rollapplyr(data, 3, sum, partial = TRUE)
## [1] 1 3 6 6 6 6
rollapplyr(data, 3, sd, partial = TRUE)
## [1] NA 0.7071068 1.0000000 1.0000000 1.0000000 1.0000000
rollapplyr(data, 3, IQR, partial = TRUE)
## [1] 0.0 0.5 1.0 1.0 1.0 1.0
Here are three alternatives.
n <- 3
rowSums(embed(c(rep(0, n - 1), data), n)) # base R
# [1] 1 3 6 6 6 6
library(TTR)
runSum(c(rep(0, n - 1), data), n = n)
# [1] NA NA 1 3 6 6 6 6 # na.omit fixes the beginning
library(zoo)
rollsum(c(rep(0, n - 1), data), k = 3, align = "right")
# [1] 1 3 6 6 6 6