I'm new to R and I am attempting to take a set of time series and run them through a Conditional Inference Tree to help classify the shape of each time series. The problem is that not all of the time series have the same number of periods. I am trying to expand each time series to be 30 periods long while maintaining the same "shape". This is as far as I have got:
library(zoo)
test<-c(606,518,519,541,624,728,560,512,777,728,1014,1100,930,798,648,589,680,635,607,544,566)
accordion<-function(A,N){
x<-ts(scale(A), start=c(1,1), frequency=1)
X1 <- zoo(x,seq(from = 1, to = N, by =(N-1)/(length(x)-1) ))
X2<-merge(X1, zoo(order.by=seq(start(X1), end(X1)-1, by=((N-1)/length(x))/(N/length(x)))))
X3<-na.approx(X2)
return(X3)}
expand.test<-accordion(test,30)
plot(expand.test); lines(scale(test))
length(expand.test)
The above code scales the time series, spaces it out evenly over 30 periods, and interpolates the missing values. The returned series retains the same "shape" as the original time series, but its length is 42 units and not 30. Does anyone know how to modify this so that the results produced by the function accordion are 30 periods long and the time series shape remains relatively unchanged?
I think there's a base R solution here. Check out approx(), which does linear (or constant) interpolation with as many points n as you specify. Here I think you want n = 30.
test2 <- approx(test, n=30)
plot(test2)
points(test, pch="*")
This returns a list test2 where the second element y is your interpolated values. I haven't yet used your time series object, but it seems that was entirely interior to your function, correct?
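Putting that suggestion into the shape of the original function, a minimal sketch might look like the following. It assumes the zoo machinery is not needed downstream and that a plain numeric vector is an acceptable return value; the name accordion is simply reused from the question.

```r
test <- c(606, 518, 519, 541, 624, 728, 560, 512, 777, 728, 1014, 1100,
          930, 798, 648, 589, 680, 635, 607, 544, 566)

# Stretch (or shrink) a series to exactly N points by linear interpolation.
accordion <- function(A, N) {
  z <- as.numeric(scale(A))            # scale() returns a 1-column matrix
  approx(x = seq_along(z), y = z, n = N)$y
}

expand.test <- accordion(test, 30)
length(expand.test)
```

Because approx() places the n output points evenly between the first and last x value, the endpoints of the original series are preserved and only the spacing in between changes.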
Related
I am trying to find a function that matches two time series such that the datetime corresponds to reality.
So I need a function that minimizes the distance between the two curves shown above and outputs a new dataframe that has TAIR time-shifted towards the values of tre200h0.
To the naked eye, this shift looks like about 22 hours.
Best,
Fabio
I don't know a function that does this job for me.
Solved by Ric Villalba in the comments to the original question.
Two base R functions for analyzing time series lags are acf and pacf. For example, given x and y, you can use acf(y - x) and look for the zeros in the plot (if your series have adequate seasonal behaviour) or, if you prefer, acf(y - x, plot = FALSE) to get the data. Try which.min(acf(x - y)$acf^2).
Of course, this is a simplification of an otherwise complex matter.
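If the goal is a single best shift between two series, the cross-correlation function ccf() (also in the stats package) may be a more direct tool than acf() on the difference. Note that this is a swapped-in technique, not what the comment proposed. The sketch below uses simulated data with a known 22-step delay, so the series, the seed, and lag.max = 48 are all assumptions made for illustration.

```r
set.seed(42)
n <- 500
s <- 22                                   # true delay, in time steps

# An autocorrelated signal (AR(1) with coefficient 0.7).
base <- as.numeric(stats::filter(rnorm(n + s), 0.7, method = "recursive"))

x <- base[(s + 1):(n + s)]                # reference series
y <- base[seq_len(n)] + rnorm(n, sd = 0.1)  # same signal, delayed by s steps, plus noise

# ccf(y, x) at lag k estimates cor(y[t + k], x[t]).
cc <- ccf(y, x, lag.max = 48, plot = FALSE)
best_lag <- cc$lag[which.max(cc$acf)]
best_lag
```

With strongly periodic data (for example a daily temperature cycle) the cross-correlation has several near-equal peaks one period apart, so it probably helps to restrict lag.max to less than one period or to inspect the plot before trusting which.max().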
So, I apologise in advance for my poor attempt at explaining myself. I am rather lost.
Summary:
I am working with the eyelinker package in R to analyse pupil size data in a time-series fashion.
I have managed to create a set of intervals where blinks start and end (extendedBlinks); they extend 150 milliseconds in each direction (the data are sampled at 1000 Hz).
# Define set of intervals for blinks
Blk <- cbind(df$blinks$stime, df$blinks$etime)
# Extend blinks (150 milliseconds each way)
extendedBlinks <- Intervals(Blk) %>% expand(150, "absolute")
head(extendedBlinks)
output:
Object of class Intervals
6 intervals over R:
[4485724, 4486141]
[4485984, 4486657]
[4486549, 4486853]
[4486595, 4487040]
[4486800, 4489142]
[4498990, 4499339]
In my dataframe, I have PSL (Pupil Size Left), PSR (Pupil Size Right), and time (relative to the eyetracker, with the same values as the intervals shown above).
So, I want to get the PSL/PSR (for the sake of the example, let's just stick to getting the PSL).
I've tried many things, but nothing seems to work for me. I want to replace the given values in y1 with the PSL values at extendedBlinks[1,1] and extendedBlinks[1,2] respectively (and then iterate over the intervals to interpolate the blinks).
# Interpolation
x1 <- c(extendedBlinks[1,1],extendedBlinks[1,2])
y1 <- c(500, 550)
interp <- approx(x1,y1, n = extendedBlinks[1,2]-extendedBlinks[1,1])
plot(interp)
Again, sorry for the poorly worded question. I'll edit as I receive feedback to try and make it clearer.
Any ideas?
Cheers!
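One way to approach this, sketched under assumptions: treat every sample inside a blink interval as missing, then interpolate across all gaps in a single approx() call instead of looping interval by interval. The toy time, PSL and blinks values below are invented for illustration, and the sketch assumes the intervals can be read as a two-column matrix of start and end times, as the indexing in the question suggests.

```r
# Toy data: 20 samples, with two blinks where the pupil size drops to 0.
time <- 1:20
PSL  <- c(500, 502, 505, 0, 0, 0, 510, 512, 515, 517,
          0, 0, 520, 522, 525, 527, 530, 532, 535, 537)

# Hypothetical blink intervals: one row per blink, columns = start, end.
blinks <- rbind(c(4, 6), c(11, 12))

# Flag every sample that falls inside any interval (overlaps are harmless).
in_blink <- rep(FALSE, length(time))
for (k in seq_len(nrow(blinks))) {
  in_blink <- in_blink | (time >= blinks[k, 1] & time <= blinks[k, 2])
}

# Linear interpolation from the valid samples back onto the full time axis.
PSL_interp <- approx(time[!in_blink], PSL[!in_blink], xout = time)$y
PSL_interp
```

With the default rule = 1, a blink that touches the very start or end of the recording is left as NA, because there is no valid sample on one side to interpolate from.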
I have a series of daily values, y. For each day, di (i.e., each row), I would like to calculate the (graph) area, ai, of the region between the curve and the horizontal line y = yi between di and the most recent previous occurrence of the value yi. Sketch below. Because observations occur at regular, discrete timesteps (daily), the calculated area, ai, is equivalent to the sum of the daily differences between each daily y and yi (black bars in figure). I'm interested only in valleys, so the calculated area, ai, can be set to 0 when y is decreasing (yi - yi-1 <= 0).
Toy data below. Expected result shown in dat$a.
dat$a[6] was calculated from 55 - 50;
dat$a[7] was calculated from (60-55)+(60-50). And so on.
library(lubridate)  # provides as_date()
dat = data.frame(d = seq.Date(as_date("2021-01-01"),as_date("2021-01-10"),by = "1 day"),
y = c(100,95,90,70,50,55,60,75,85,90),
a = c(0,0,0,0,0,5,15,65,115,145))
My first thought was to calculate the area between the curve and the horizontal line y = yi between day di and the most recent previous occurrence of the value yi, using perhaps geiger::area.between.curves(), but I couldn't work out how to identify the most recent previous occurrence of the value yi.
[In case the context helps, the actual data are daily values of the area (m2) of a wetland not submerged by water. When the water rises, a portion of the wetland that had been dry for some time becomes wet. Here, I'm trying to calculate the extent of the reflooding in m2-days. A portion of the wetland that has been dry for a long time but becomes reflooded will contribute many m2-days to the sum.]
I'm most comfortable in the tidyverse, and such answers are greatly preferred. I am not familiar with data.table.
Thanks in advance
Update
I was able to achieve my desired calculation in Excel, though it's brutally inelegant. An example with a couple hundred rows is linked below. Given that my real data are 180k rows, my poor machine hated the 18 million calculated cells. Though I can move on with my analysis, I am still very interested in an R solution. My implemented approach differs subtly from my imagined R approach in that it's summing 'horizontal rectangles', so to speak, each of the same (small) y-unit height, rather than 'vertical rectangles', each of unit width.
Here's the file.
Since the question is missing complete information, we will compute the area under the curve assuming that a day is one unit. Modify as appropriate for your specific problem.
library(pracma)
nr <- nrow(dat)
dat0 <- dat[c(1, 1:nr, nr), ]
dat0[c(1, nr + 2), "y"] <- 0  # dat0 has nr + 2 rows; zero the two padding rows
with(dat0, abs(polyarea(as.numeric(d), y)))
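For the per-day quantity a described in the question (rather than the total area under the curve), a base R sketch could look like this. It reads "most recent previous occurrence of the value yi" as the most recent earlier day on which y was at or above yi, and it assumes that when no such day exists the sum runs back to the start of the series; both readings are assumptions, and valley_area is a hypothetical helper name.

```r
y <- c(100, 95, 90, 70, 50, 55, 60, 75, 85, 90)

valley_area <- function(y) {
  vapply(seq_along(y), function(i) {
    # Only rising days contribute; the first day has no predecessor.
    if (i == 1 || y[i] <= y[i - 1]) return(0)
    # Most recent earlier day at or above today's level
    # (assumption: if there is none, go back to the first day).
    prev <- which(y[seq_len(i - 1)] >= y[i])
    j <- if (length(prev)) max(prev) else 0
    # Sum of daily differences between today's level and the days in between.
    sum(y[i] - y[(j + 1):(i - 1)])
  }, numeric(1))
}

a <- valley_area(y)
a
```

Since the function takes and returns a plain vector, it should drop into a tidyverse pipeline as dplyr::mutate(a = valley_area(y)). The backward search makes it quadratic in the worst case, so on 180k rows it may be slow if the series contains very long valleys.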
I have about 50 datasets that include all trades within a timeframe of 30 days for about 10 pairs on 5 exchanges. All pairs are of the same asset class, meaning they are strongly correlated and expect to have similar properties, but are on different scales. An example of this data would be
set.seed(1)
n <- 1000
dates <- seq(as.POSIXct("2019-08-05 00:00:00", tz="UTC"), as.POSIXct("2019-08-05 23:59:00", tz="UTC"), by="1 min")
x <- data.frame("t" = sort(sample(dates, 1000)),"p" = cumsum(sample(c(-1, 1), n, TRUE)))
Roughly, I need to identify the relevant local minima and maxima, which happen daily. The yellow marks are my points of interest. Unlike this example, there is usually only one such point per day and I consider each day separately. However, it is hard to filter out noise from my actual points of interest.
My actual goal is to find the exact point, at which the pair started to make a jump and the exact point, at which the jump is over. This needs to be as accurate as possible, as I want to observe which asset moved first and which asset followed at which point in time (as said, they are highly correlated).
Between two extreme values, I want to minimize the distance and maximize the relative/absolute change, as my points of interest are usually close to each other and their difference is quite large.
I already looked at other questions like Finding local maxima and minima and Algorithm to locate local maxima, and also this algorithm that has the same goal. However, my dataset is extremely noisy. I already reduced the dataset to 5-minute intervals, but this caused the functions that identify local minima and maxima to miss the relevant points. Therefore, this was not a good solution given my goal.
How can I achieve my goal with a reasonably accurate algorithm? Manually skimming through all the time series is not an option, since this would require me to evaluate 50 * 30 time series by hand, which is too time-consuming. I'm really puzzled and have been trying to find a suitable solution for a week.
If more code snippets are demanded, I'm happy to share, however they didn't give me meaningful results, which would be opposed to the idea of providing a minimum working example, therefore I decided to leave them out for now.
EDIT:
First off, I updated the plot and added timestamps to the dataset to give you an idea of the actual resolution. Ideally, the algorithm would detect both jumps on the left: the inner two dots because they're closer together and jump without interruption, and the outer dots because they're more extreme in values. In fact, this maybe answers the question of whether the algorithm is allowed to look into the future. Yes: if there's another local extremum in the range of, say, 30 observations (or 30 minutes), then ignore the intermediate local extrema.
In my data, jumps have ranged from 2% to about 15%, so a jump needs to be at least 2% to be considered, and only if a threshold of 15 (this might be adaptable) consecutive steps in the same direction before / after the peaks and valleys is reached.
A very naive approach was to subset the data around the global minimum and maximum of a day. In most cases, this has denoised data and worked as an indicator. However, this is not robust when the global extrema are not in the range of the jump.
Hope this clarifies why this isn't a statistical question (there are some tests to determine whether a jump has happened, but not for jump arrival time afaik).
In case anyone wants a real example:
this is a corresponding graph, this is the raw data of the relevant period and this is the reduced dataset.
Perhaps as a starting point, look at function streaks in package PMwR (which I maintain). A streak is defined as a move of a specified size that is uninterrupted by a countermove of the same size. The function works with returns, not differences, so I add 100 to your data.
For instance:
library(PMwR)
set.seed(1)
n <- 1000
x <- 100 + cumsum(sample(c(-1, 1), n, TRUE))
plot(x, type = "l")
s <- streaks(x, up = 0.12, down = -0.12)
abline(v = s[, 1])
abline(v = s[, 2])
The vertical lines show the starts and ends of streaks.
Perhaps you can then filter the identified streaks by required criteria such as length. Or you may play around with different thresholds for up and down moves (though this is not really recommended in the current implementation, but perhaps the results are good enough). For instance, up streaks might look as follows. A green vertical line shows the start of a streak; a red line shows its end.
plot(x, type = "l")
s <- streaks(x, up = 0.12, down = -0.05)
s <- s[!is.na(s$state) & s$state == "up", ]
abline(v = s[, 1], col = "green")
abline(v = s[, 2], col = "red")
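The criteria stated in the question (at least 15 consecutive steps in one direction and a move of at least 2%) can also be checked directly in base R with run-length encoding. The following is only a rough sketch: find_runs is a hypothetical helper, the toy series is invented, and real tick data would probably need some tolerance for small countermoves, which this strict version does not have.

```r
# Find runs of consecutive same-direction steps that are long and large enough.
find_runs <- function(p, min_steps = 15, min_change = 0.02) {
  r     <- rle(sign(diff(p)))        # runs of up (+1), flat (0), down (-1) steps
  end   <- cumsum(r$lengths) + 1     # index in p where each run ends
  start <- end - r$lengths           # index in p where each run starts
  change <- p[end] / p[start] - 1    # relative move over the run
  keep <- r$values != 0 & r$lengths >= min_steps & abs(change) >= min_change
  data.frame(start = start[keep], end = end[keep],
             steps = r$lengths[keep], change = change[keep])
}

# Toy series: flat, then 20 consecutive up-steps, then one step down.
p <- c(100, 100, 100, 100 + 1:20, 119)
jumps <- find_runs(p)
jumps
```

The start and end columns are positions in the price vector, so they can be mapped back to timestamps by indexing the time column of the original data frame.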
I have two vectors, x and y.
x is a vector where each entry represents a month for a period of several years, so I have (let's say) 10 years of data, then length(x) = 120 and so on.
(I have used as.POSIXct so they really are "months" in that sense, but couldn't I just have x as a numerical vector like c(1:n) or something, since I already know which month and which year a certain element of c(1:n) corresponds to? I.e., if x = c(1:n), I know that x[13] is the first month of the second year, and so on.)
y is a vector where each element is an observation of a particular variable in a certain month.
So the observed data is grouped like this (january,0.123), (february,2.125) and so on.
I have two vectors for the months;
x1 = seq(as.POSIXct("YYYY-MM-DD", tz="GMT"),
as.POSIXct("YYYY-MM-DD", tz="GMT"),
by="month")
x2 = c(1:length(x1))
What I want to do is to run ksmooth:
plot(x1,y)
smooth = ksmooth(x2,y,"normal")
lines(smooth)
The reason that I use x1 in the plot() command is that I don't know how to otherwise get the x-axis in time.
R should automatically find a decent smoothing parameter when I haven't specified anything. The result is that smooth$y is equal to the input vector y! Also, a vertical bar is produced in the plot. If I replace x2 by x1 in the code above, smooth$y is NA for all values except for the first and last, which equal those of the input y.
So I tried some bandwidths:
h = 0.1: now smooth$y = y, as before. A vertical bar is produced (it is the same color as I specified in the lines() command, so it must have to do with the ksmooth command.)
h = 10: get some non-strange results for smooth$y, however, a vertical bar is produced as before.
Then, I tried the crazy idea of very large bandwidths;
h = 1e+06: This produced nothing when I used x1 and x2 as in the code above. When I changed x2 to x1 however, I get some good results. For h = 1e+09 (that's huge!!) I get a very nice result. (I get a curve that fits the data and looks nice)
But is h = 1e+09 reasonable? In all the examples I have looked at, h is somewhere between 0.1 and 10, give or take. I have also heard of a rule of thumb: h should equal n^(-1/5), where n is the number of data points.
I think the one thing that you are missing is that R doesn't find a decent smoothing parameter when you haven't specified anything, it just uses a bandwidth of 0.5, which is totally useless in your case.
The other thing you might be missing is that in ksmooth the bandwidth parameter is in the units of x. When ksmooth takes a POSIXct x value, it converts it to a numeric, which is the number of seconds. Therefore, your bandwidth will be measured in seconds, an undesirable result. When ksmooth takes an x value in months, it will default to a bandwidth of 0.5 months, also undesirable.
What you want to do is specify a reasonable bandwidth for the x that you are using. Here is an example:
x1 = seq(as.POSIXct("2000-01-01", tz="GMT"),
as.POSIXct("2010-12-31", tz="GMT"),
by="month")
x2 = c(1:length(x1))
set.seed(1)
y = runif(length(x1))
plot(x1,y,type='l')
smooth = ksmooth(x2,y,"normal")
lines(x1,smooth$y,col='blue',lwd=2)
lines(x1,ksmooth(x2,y,'normal',bandwidth=2)$y,col='red',lwd=2)
lines(x1,ksmooth(x2,y,'normal',bandwidth=10)$y,col='green',lwd=2)
lines(x1,ksmooth(x2,y,'normal',bandwidth=20)$y,col='orange',lwd=2)
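The same reasoning explains why h = 1e+09 looked sensible with the POSIXct x: in seconds, that is a few decades. As a sketch, the bandwidth can be written out in seconds explicitly; the six-month window and the 30-day month used below are arbitrary assumptions for illustration.

```r
x1 <- seq(as.POSIXct("2000-01-01", tz = "GMT"),
          as.POSIXct("2010-12-31", tz = "GMT"),
          by = "month")
set.seed(1)
y <- runif(length(x1))

# Bandwidth expressed in the units of x (seconds): roughly six months.
bw_seconds <- 6 * 30 * 24 * 3600

# Evaluate the smooth at the observed times so it lines up with y.
xs <- as.numeric(x1)
sm <- ksmooth(xs, y, kernel = "normal", bandwidth = bw_seconds, x.points = xs)
head(sm$y)
```

Passing x1 and sm$y to lines() then overlays the smooth on a plot drawn with plot(x1, y), which avoids the vertical bar that appears when the smoother's x values are on a different scale from the plot's axis.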