Find indices of slope changes in a vector in R

I have a data frame with two columns: (1) datetimes and (2) streamflow values. I would like to create a 3rd column with indicator values to find sudden increases (usually a 0 but it is a 1 when the streamflow shows a big increase).
datetime <- as.POSIXct(c(1557439200, 1557440100, 1557441000, 1557441900,1557442800,
1557443700, 1557444600, 1557445500, 1557446400, 1557447300, 1557448200, 1557449100, 1557450000, 1557450900,
1557451800, 1557452700, 1557453600, 1557454500, 1557455400, 1557456300, 1557457200, 1557458100, 1557459000), origin = "1970-01-01")
streamflow <- c(0.35, 0.35, 0.36, 0.54, 1.0, 2.7, 8.4, 9.3, 6.2, 3.8, 4.7,
2.91, 2.01, 1.65, 1.41, 1.12, 0.95, 0.62, 0.52, 0.53, 0.53, 0.44, 0.35)
library(data.table)
data <- data.table(as.POSIXct(datetime), as.numeric(streamflow))
I am trying to write a function that identifies the datetime at which the streamflow jumps from about 0.5 to 1, because that is when the event starts. It should then stop flagging an event once the streamflow drops below a certain threshold.
My current idea is a function that compares the local slope between two consecutive streamflow points to the slope over all streamflow values within some window, but I don't really know how to write that. Or maybe there is a better way to do what I am trying to do?

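# shift() is data.table's lag; a plain lag() only shifts the values if dplyr is attached
data[, delta := (V2 - shift(V2)) / shift(V2)][
  , ind_jump := delta > 0.5
]
# datetimes at which the relative increase exceeds 50%
indices <- data[ind_jump == TRUE, V1]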
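The question also asks for the flag to stay on until the flow drops back below a threshold. A minimal sketch of that start/stop logic in base R could look like the following. The names `start_thresh`, `end_thresh`, `tol` and `in_event`, and the end threshold of 0.6, are illustrative assumptions and not part of the original posts.

```r
streamflow <- c(0.35, 0.35, 0.36, 0.54, 1.0, 2.7, 8.4, 9.3, 6.2, 3.8, 4.7,
                2.91, 2.01, 1.65, 1.41, 1.12, 0.95, 0.62, 0.52, 0.53, 0.53,
                0.44, 0.35)

# Relative change against the previous observation (NA for the first row)
rel_change <- c(NA, diff(streamflow) / head(streamflow, -1))

start_thresh <- 0.5   # assumed: an event starts on a relative rise above 50%
end_thresh   <- 0.6   # assumed: an event ends once the flow drops below this
tol          <- 1e-8  # guards the comparison against floating-point noise

in_event <- logical(length(streamflow))
active <- FALSE
for (i in seq_along(streamflow)) {
  if (!active && !is.na(rel_change[i]) && rel_change[i] > start_thresh + tol) {
    active <- TRUE    # sudden increase: the event starts here
  } else if (active && streamflow[i] < end_thresh) {
    active <- FALSE   # flow has receded below the threshold: the event ends
  }
  in_event[i] <- active
}

which(in_event)  # rows flagged as part of an event
```

With these assumed thresholds, rows 5 to 18 are flagged, and `as.integer(in_event)` would give the 0/1 indicator column. Both thresholds would need tuning for real data.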
Unrelated to the question itself: because of floating-point rounding, R gives
(0.54 - 0.36)/0.36 > 0.5
[1] TRUE
while
0.18/0.36 > 0.5
[1] FALSE
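The difference comes from binary floating-point arithmetic: 0.54 - 0.36 is not stored as exactly 0.18, so the ratio can land a hair above 0.5. A minimal sketch of a tolerance-based comparison follows; the tolerance value `tol` is an assumption chosen for illustration.

```r
a <- (0.54 - 0.36) / 0.36
b <- 0.18 / 0.36

# all.equal() compares with a small numeric tolerance instead of exactly
isTRUE(all.equal(a, b))    # TRUE
isTRUE(all.equal(a, 0.5))  # TRUE

# Require the ratio to exceed the cutoff by more than rounding noise
tol <- 1e-8
a > 0.5 + tol              # FALSE
b > 0.5 + tol              # FALSE
```

Applying the same `+ tol` to `delta > 0.5` keeps borderline rows from being flagged purely because of rounding.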

Related

Recursive regression using R

I would like to run a recursive regression using my variables residential_ddiff and interest_diff, in order to test the stability of the coefficient on interest_diff.
The problem is that I want the regression to be run on the first lag of interest_diff over the window [i:(i+10)] of observations, but I keep getting the same error:
Error in merge.zoo(residential_ddiff[i], L(interest_diff, 1)[i:(i + 10)], :
all(sapply(args, function(x) is.zoo(x) || !is.plain(x) || (is.plain(x) && .... is not TRUE
Both time series have n = 91 observations and are stored as ts objects. I've tried using both ts and numeric objects in my loop.
I've attached a screenshot of my code.
Grateful for any help, thank you.
I've tried lots of different options, both coercing with the as.zoo() function and explicitly passing the current observation as residential_ddiff[i]; the same error keeps occurring.
Reproducible example:
library(dynlm)
library(dplyr)  # provides mutate_all()
# Creating two datasets
data_1 <- c(3.2705, -1.9710, 1.3821, 1.3194, -0.8008, 0.2832, 3.2705, -1.9710, 1.3821, 1.3194, -0.8008, 0.2832, 3.2705, -1.9710, 1.3821, 1.3194, -0.8008, 0.2832, 3.2705, -1.9710, 1.3821, 1.3194, -0.8008, 0.2832)
data_2 <- c(1.41, 0.33, 0.32, 1.53, -1.55, 0.73, 1.41, 0.33, 0.32, 1.53, -1.55, 0.73, 1.41, 0.33, 0.32, 1.53, -1.55, 0.73, 1.41, 0.33, 0.32, 1.53, -1.55, 0.73)
# Storing data in a dataframe
df <- data.frame(data_1, data_2)
# Making sure the data frame columns are numeric
df1 <- mutate_all(df, function(x) as.numeric(as.character(x)))
# Creating variable to store coefficients
estimate.store <- matrix(ncol = 6, nrow = nrow(df1)-3)
# Loop begins
for (i in 1:(nrow(df1) - 3)) {
  # Creating a dynamic linear regression of data_1 on the recursive values i:(i+3) of the first lag of data_2. Here the main issue arises, I think.
  estimation.store <- dynlm(df1$data_1[i] ~ L(df1$data_2,1)[i:(i+3)], as.zoo(df1))
  estimate.store[i,] <- c(estimation.store$coef[1], confint(estimation.store)[1,], estimation.store$coef[2], confint(estimation.store)[2,])
}
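One way to sidestep the merge.zoo error is to build the lag by hand and fit each window with base lm(). The following is only a sketch under assumptions: it is a rolling-window fit rather than the dynlm() call above, it regresses data_1 on the first lag of data_2 within each window, and `x_lag`, `window`, `starts` and `fits` are names introduced here for illustration.

```r
# Same example data as above (each series is a block of 6 values repeated 4 times)
data_1 <- rep(c(3.2705, -1.9710, 1.3821, 1.3194, -0.8008, 0.2832), times = 4)
data_2 <- rep(c(1.41, 0.33, 0.32, 1.53, -1.55, 0.73), times = 4)

n      <- length(data_1)
x_lag  <- c(NA, data_2[-n])     # first lag of data_2, NA in the first position
window <- 4                     # assumed window length (rows i to i + 3)
starts <- 2:(n - window + 1)    # skip row 1, where the lag is undefined

# One regression per window; keep the slope and its 95% confidence interval
fits <- t(sapply(starts, function(i) {
  idx <- i:(i + window - 1)
  fit <- lm(y ~ x, data = data.frame(y = data_1[idx], x = x_lag[idx]))
  c(slope = unname(coef(fit)["x"]), confint(fit)["x", ])
}))

head(fits)
```

Each row of `fits` corresponds to one window, so plotting the slope column against `starts` gives a rough picture of how stable the coefficient is.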

Is there a way in Base R to replicate what VLOOKUP TRUE in Excel does?

I have a consumption pattern that looks like this:
x <- 0:10
y <- c(0, 0.05, 0.28, 0.45, 0.78, 0.86, 0.90, 0.92, 0.95, 0.98, 1.00)
X is in years, and Y is not always monotonically-increasing, although it should be most of the time.
If I needed to estimate how many years would elapse before 80% is consumed, in Excel, I would use the VLOOKUP TRUE function which would return 78%, then I would lookup the next value in the series (86%) and then linearly interpolate to get 4.25 years. It's laborious but it gets the job done.
Is there an easy way to compute this in R, in a user-defined function that I can apply to many cases?
Thanks!
x <- 0:10
y <- c(0, 0.05, 0.28, 0.45, 0.78, 0.86, 0.90, 0.92, 0.95, 0.98, 1.00)
estimate_years <- function(x, y, percent) {
  idx <- max(which(y < percent))
  (percent - y[idx]) / (y[idx+1] - y[idx]) * (x[idx+1] - x[idx]) + x[idx]
}
estimate_years(x, y, 0.80) ## 4.25
Although the approx calculation is cool, exact linear interpolation is easy here.
idx is the position of the last y value below percent, so idx+1 is the position of the next value, which is equal to or bigger than percent (assuming y is increasing around that point).
By similar triangles, k = part / total, which is
(percent - y[idx]) / (y[idx+1] - y[idx])
Multiplying k by the width of the x interval, x[idx+1] - x[idx], and adding the lower year x[idx] gives the result of the linear interpolation.
You could try with approx
resolution <- 1000
fn <- approx(x, y, n=resolution)
min(fn$x[fn$y > 0.8])
[1] 4.254254
The more precise you need the estimate to be, the higher the value you should use for resolution.
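As a variation on the approx idea, the roles of x and y can be swapped so that approx() interpolates the year directly at the target percentage, with no grid resolution to choose. This is a sketch that assumes y is strictly increasing; with repeated y values approx() averages the ties by default, so the result may not be what is wanted. The function name `years_at` is made up for illustration.

```r
x <- 0:10
y <- c(0, 0.05, 0.28, 0.45, 0.78, 0.86, 0.90, 0.92, 0.95, 0.98, 1.00)

# Interpolate x as a function of y, evaluated at the requested percentage
years_at <- function(x, y, percent) {
  approx(x = y, y = x, xout = percent)$y
}

years_at(x, y, 0.80)  # approximately 4.25
```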

Continuously calculate the value of an initial investment from its return vector

I need some help creating a vector that contains the value of an investment at every point in time.
Imagine I have the returns (in %) of a single stock for 10 consecutive months. Starting from an initial value of $100, I multiply the value of the investment in period t-1 by one plus the return of period t. The output must be a vector because I want to plot the results.
Unfortunately, I have no idea how to write the code. Is it probably a for loop?
The monthly returns:
c(-0.09, -0.11, -0.2, -0.45, -0.11, 0.2, -0.27, -0.15, -0.24,
0.16)
Value of the investment in each period:
100*(1+(-0.09)) = 91
91*(1+(-0.11)) = 80.99
...
Desired Output Vector:
c(91, 80.99, 64.792, …)
I'm not quite sure how to compute this vector with a loop, a function or some other method.
I'd be very glad of any help! Cheers!
r <- c(-0.09, -0.11, -0.2, -0.45, -0.11, 0.2, -0.27, -0.15, -0.24,
0.16)
100*cumprod(1 + r)
# [1] 91.00000 80.99000 64.79200 35.63560 31.71568 38.05882 27.78294 23.61550 17.94778 20.81942
or
Reduce(function(x, r) x*(1 + r), r, init = 100, accumulate = TRUE)
# [1] 100.00000 91.00000 80.99000 64.79200 35.63560 31.71568 38.05882 27.78294 23.61550
# [10] 17.94778 20.81942

R: The confusing details of findCorrelation() (caret package) when setting exact = TRUE

Following the findCorrelation() documentation, I ran the official example as shown below:
Code:
library(caret)
R1 <- structure(c(1, 0.86, 0.56, 0.32, 0.85, 0.86, 1, 0.01, 0.74, 0.32,
0.56, 0.01, 1, 0.65, 0.91, 0.32, 0.74, 0.65, 1, 0.36,
0.85, 0.32, 0.91, 0.36, 1),
.Dim = c(5L, 5L))
colnames(R1) <- rownames(R1) <- paste0("x", 1:ncol(R1))
findCorrelation(R1, cutoff = .6, exact = TRUE, names = TRUE,
                verbose = TRUE)
Result:
> findCorrelation(R1, cutoff = .6, exact = TRUE, names = TRUE, verbose = TRUE)
## Compare row 1 and column 5 with corr 0.85
## Means: 0.648 vs 0.545 so flagging column 1
## Compare row 5 and column 3 with corr 0.91
## Means: 0.53 vs 0.49 so flagging column 5
## Compare row 3 and column 4 with corr 0.65
## Means: 0.33 vs 0.352 so flagging column 4
## All correlations <= 0.6
## [1] "x1" "x5" "x4"
I have no idea how the computation works, i.e. why row 1 and column 5 are compared first and how the means are calculated, even after reading the source file.
I hope that someone can explain the algorithm with the help of my example.
First, it determines the average absolute correlation of each variable with all the others. Columns x1 and x5 have the highest averages (mean(c(0.86, 0.56, 0.32, 0.85)) = 0.6475 and mean(c(0.85, 0.32, 0.91, 0.36)) = 0.61 respectively), so they are compared first and one of them is removed in the first step. It finds x1 to be the most globally offensive, so it removes it.
After that, it recomputes the means without x1 and compares x5 and x3 using the same process.
It stops after removing three columns, since all remaining pairwise correlations are then at or below your threshold.
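The means printed by verbose = TRUE can be reproduced by hand in base R. The sketch below is a reconstruction from the printed numbers rather than an authoritative description of caret's internals: in each comparison the first mean appears to be the average absolute correlation of the row variable, and the second the average over all remaining rows except the column variable's. The names `A`, `ord`, `step1` and `step2` are introduced here for illustration.

```r
R1 <- matrix(c(1, 0.86, 0.56, 0.32, 0.85,
               0.86, 1, 0.01, 0.74, 0.32,
               0.56, 0.01, 1, 0.65, 0.91,
               0.32, 0.74, 0.65, 1, 0.36,
               0.85, 0.32, 0.91, 0.36, 1),
             nrow = 5,
             dimnames = list(paste0("x", 1:5), paste0("x", 1:5)))

A <- abs(R1)
diag(A) <- NA   # ignore each variable's correlation with itself

# Variables ordered by average absolute correlation, highest first
ord <- names(sort(colMeans(A, na.rm = TRUE), decreasing = TRUE))
ord  # "x1" "x5" "x3" "x4" "x2", which is why x1 and x5 are compared first

# Step 1: x1 against x5 (printed as "Means: 0.648 vs 0.545")
step1 <- c(mean(A["x1", ], na.rm = TRUE),
           mean(A[rownames(A) != "x5", ], na.rm = TRUE))

# x1 is flagged, so blank it out before the next comparison
A["x1", ] <- NA
A[, "x1"] <- NA

# Step 2: x5 against x3 (printed as "Means: 0.53 vs 0.49")
step2 <- c(mean(A["x5", ], na.rm = TRUE),
           mean(A[rownames(A) != "x3", ], na.rm = TRUE))
```

In both steps the first mean is the larger one, which matches the output flagging the row variable (x1, then x5).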

Sequentially re-ordering sections of a vector around NA values

I have a large set of data that I want to reorder in groups of twelve using the sample() function in R to generate randomised data sets with which I can carry out a permutation test. However, this data has NA values where data could not be collected and I would like them to stay in their respective original positions when the data is shuffled.
With help on a previous question I have managed to shuffle the data around the NA values for a single vector of 12 values with the code:
example.data <- c(0.33, 0.12, NA, 0.25, 0.47, 0.83, 0.90, 0.64, NA, NA, 1.00, 0.42)
example.data[!is.na(example.data)] <- sample(example.data[!is.na(example.data)], replace = FALSE, prob = NULL)
[1] 0.64 0.83 NA 0.33 0.47 0.90 0.25 0.12 NA NA 0.42 1.00
Extending from this, if I have a set of data with a length of 24 how would I go about re-ordering the first and second set of 12 values as individual cases in a loop?
For example, a vector extending from the first example:
example.data <- c(0.33, 0.12, NA, 0.25, 0.47, 0.83, 0.90, 0.64, NA, NA, 1.00, 0.42, 0.73, NA, 0.56, 0.12, 1.0, 0.47, NA, 0.62, NA, 0.98, NA, 0.05)
Where example.data[1:12] and example.data[13:24] are shuffled separately within their own respective groups around their NA values.
The code I am trying to work this solution into is as follows:
shuffle.data = function(input.data, nr, ns){
  simdata <- input.data
  for(i in 1:nr){
    start.row <- (ns*(i-1))+1
    end.row <- start.row + actual.length[i] - 1
    newdata = sample(input.data[start.row:end.row], size=actual.length[i], replace=F)
    simdata[start.row:end.row] <- newdata
  }
  return(simdata)
}
Where input.data is the raw input data (example.data); nr is the number of groups (2); ns is the size of each sample (12); and actual.length is a vector holding the length of each group excluding NAs (actual.length <- c(9, 8) for the example above).
Would anyone know how to go about achieving this?
Thank you again for your help!
I agree with Gregor's comment that it may be a better approach to work with the data in another form. However, what you need to accomplish can still be done easily enough even if all the data is in one vector.
First make a function that shuffles only non-NA values of an entire vector:
shuffle_real <- function(data){
  # Sample from only the non-NA values,
  # and store the result only in indices of non-NA values
  data[!is.na(data)] <- sample(data[!is.na(data)])
  # Then return the shuffled data
  return(data)
}
Now write a function that takes in a larger vector, and applies this function to each group in the vector:
shuffle_groups <- function(data, groupsize){
  # It will be convenient to store the length of the data vector
  N <- length(data)
  # Do a sanity check to make sure there's a match between N and groupsize
  if ( N %% groupsize != 0 ) {
    stop('The length of the data is not a multiple of the group size.',
         call.=FALSE)
  }
  # Get the index of every first element of a new group
  starts <- seq(from=1, to=N, by=groupsize)
  # and for every segment of the data of group 'groupsize',
  # apply shuffle_real to it;
  # note the use of c() -- otherwise a matrix would be returned,
  # where each column is one group of length 'groupsize'
  # (which I note because that may be more convenient)
  return(c(sapply(starts, function(x) shuffle_real(data[x:(x+groupsize-1)]))))
}
For example,
example.data <- c(0.33, 0.12, NA, 0.25, 0.47, 0.83, 0.90, 0.64, NA, NA, 1.00,
0.42, 0.73, NA, 0.56, 0.12, 1.0, 0.47, NA, 0.62, NA, 0.98,
NA, 0.05)
set.seed(1234)
shuffle_groups(example.data, 12)
which results in
> shuffle_groups(example.data, 12)
[1] 0.12 0.83 NA 1.00 0.47 0.64 0.25 0.33 NA NA 0.90 0.42 0.47 NA
[15] 0.05 1.00 0.56 0.62 NA 0.73 NA 0.98 NA 0.12
or try shuffle_groups(example.data[1:23], 12), which results in Error: The length of the data is not a multiple of the group size.
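A variation on the same idea, sketched under assumptions: base R's ave() can apply a shuffle within each group, and permuting positions with sample.int() avoids the sample() pitfall where a single remaining number x >= 1 is treated as 1:x. The function name `shuffle_groups_ave` is made up for illustration.

```r
shuffle_groups_ave <- function(data, groupsize) {
  stopifnot(length(data) %% groupsize == 0)
  # Group label for every element: 1,1,...,1,2,2,...,2,...
  grp <- rep(seq_len(length(data) / groupsize), each = groupsize)
  ave(data, grp, FUN = function(v) {
    ok <- which(!is.na(v))                  # positions holding real values
    v[ok] <- v[ok[sample.int(length(ok))]]  # permute those positions only
    v
  })
}

example.data <- c(0.33, 0.12, NA, 0.25, 0.47, 0.83, 0.90, 0.64, NA, NA, 1.00,
                  0.42, 0.73, NA, 0.56, 0.12, 1.0, 0.47, NA, 0.62, NA, 0.98,
                  NA, 0.05)
set.seed(1234)
shuffled <- shuffle_groups_ave(example.data, 12)
```

The NA positions in `shuffled` match those in `example.data`, and each block of 12 contains the same values as before in a new order.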
